libai.engine

libai.engine.default module

libai.engine.default.default_setup(cfg, args)[source]

Perform some basic common setups at the beginning of a job, including:

  1. Set up the libai logger

  2. Log basic information about the environment, command-line arguments, and config

  3. Set up the distributed environment

  4. Set up the tokenizer if it is an NLP-related task

  5. Check batch_size

  6. Backup the config to the output directory

  7. Compile dependencies

Parameters

  • cfg (omegaconf.dictconfig.DictConfig) – the full config to be used

  • args (argparse.Namespace) – the command line arguments to be logged
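
A minimal sketch of how these setup steps are typically invoked from a launcher script. The argument parser and the config path are illustrative assumptions, and the config is assumed to be loaded with libai's LazyConfig-style API:

import argparse

from libai.config import LazyConfig
from libai.engine.default import default_setup

parser = argparse.ArgumentParser()
parser.add_argument("--config-file", default="configs/bert_pretrain.py")  # hypothetical path
args = parser.parse_args()

cfg = LazyConfig.load(args.config_file)  # load the config given on the command line
default_setup(cfg, args)                 # logger, distributed env, tokenizer, config backup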

class libai.engine.default.DefaultTrainer(cfg)[source]

A trainer with default training logic. Compared to TrainerBase, it also contains the following logic:

  1. Create model, optimizer, scheduler, dataloader from the given config.

  2. Load a checkpoint or cfg.train.load_weight, if it exists.

  3. Register a few common hooks defined by the config.

With standard features included, it is meant to simplify the standard model training workflow and reduce boilerplate code for users who only need that workflow.

This means the class makes many assumptions about your training logic that may easily become invalid in new research. In fact, any assumptions beyond those made in TrainerBase are too much for research.

The code of this class has been annotated with the restrictive assumptions it makes. When they do not work for you, you’re encouraged to:

  1. Overwrite methods of this class, OR:

  2. Use TrainerBase, which only does minimal SGD training and nothing else. You can then add your own hooks if needed. OR:

  3. Write your own training loop similar to tools/train_net.py.

Also note that the behavior of this class, like other functions/classes in this file, is not stable, since it is meant to represent the “common default behavior”. It is only guaranteed to work well with the standard models and training workflow in libai. To obtain more stable behavior, write your own training logic with other public APIs.

Examples:

trainer = DefaultTrainer(cfg)
trainer.resume_or_load()  # load last checkpoint or MODEL.WEIGHTS
trainer.train()
scheduler

checkpointer
Type

Checkpointer

cfg
Type

omegaconf.dictconfig.DictConfig

resume_or_load(resume=True)[source]

If resume==True and cfg.train.output_dir contains the last checkpoint (defined by a last_checkpoint file), resume from that file. Resuming means loading all available states (e.g. optimizer and scheduler) and updating the iteration counter from the checkpoint; cfg.train.load_weight will not be used. Otherwise, this is considered an independent training run: the method will load model weights from the file cfg.train.load_weight (but will not load other states) and start from iteration 0.

Parameters

resume (bool) – whether to do resume or not
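
For example, to ignore any existing checkpoint and start a fresh run that only loads cfg.train.load_weight:

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)  # load cfg.train.load_weight, start from iteration 0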

build_hooks()[source]

Build a list of default hooks, including timing, evaluation, checkpointing, lr scheduling, precise BN, writing events.

Returns

Return type

list[HookBase]

build_writers()[source]

Build a list of writers to be used. By default, it contains writers that write metrics to the screen, to a JSON file, and to a TensorBoard event file, respectively. If you’d like a different list of writers, you can overwrite it in your trainer.

Returns

a list of EventWriter objects.

Return type

list[EventWriter]

It is now implemented by:

return [
    CommonMetricPrinter(self.global_batch_size, self.max_iter),
    JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
    TensorboardXWriter(self.cfg.train.output_dir),
]
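
If, for instance, you want to drop the TensorBoard writer, a subclass could look like the following sketch; it assumes the writer classes live in libai.utils.events (check the actual import path):

import os

from libai.engine.default import DefaultTrainer
from libai.utils.events import CommonMetricPrinter, JSONWriter  # assumed location

class MyTrainer(DefaultTrainer):
    def build_writers(self):
        # keep the console and JSON writers, drop the tensorboard event writer
        return [
            CommonMetricPrinter(self.global_batch_size, self.max_iter),
            JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
        ]
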
train()[source]

Run training.

Returns

OrderedDict of results, if evaluation is enabled. Otherwise None.

classmethod get_batch(data: libai.data.structures.Instance, input_placement_device: str = 'cuda', mixup_func: Optional[Callable] = None)[source]

Convert the batched local tensors to distributed tensors for running the model step.

If you want to do something with the batched data before it is fed to the model (e.g. mixup), you can overwrite this function.
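
A sketch of such an override; my_batch_transform is a hypothetical per-batch preprocessing function, not part of libai:

from typing import Callable, Optional

from libai.engine.default import DefaultTrainer

def my_batch_transform(data):
    # hypothetical per-batch preprocessing (identity here)
    return data

class AugmentedTrainer(DefaultTrainer):
    @classmethod
    def get_batch(cls, data, input_placement_device: str = "cuda",
                  mixup_func: Optional[Callable] = None):
        data = my_batch_transform(data)  # modify the local batch before it is distributed
        return super().get_batch(data, input_placement_device, mixup_func)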

classmethod build_tokenizer(cfg)[source]
Returns

Return type

libai.tokenizer.PreTrainedTokenizer

It now calls libai.tokenizer.build_tokenizer().

classmethod build_model(cfg)[source]
Returns

Return type

flow.nn.Module

It now calls libai.models.build_model(). Overwrite it if you’d like a different model.

classmethod build_optimizer(cfg, model)[source]
Returns

Return type

flow.optim.Optimizer

It now calls libai.optim.build_optimizer(). Overwrite it if you’d like a different optimizer.

classmethod build_lr_scheduler(cfg, optimizer)[source]

It now calls libai.scheduler.build_lr_scheduler(). Overwrite it if you’d like a different scheduler.

classmethod build_train_loader(cfg, tokenizer=None)[source]
Returns

iterable

It now calls libai.data.build_train_valid_test_loader(). Overwrite it if you’d like a different data loader.

classmethod build_test_loader(cfg, tokenizer=None)[source]
Returns

iterable

It now calls libai.data.build_image_test_loader() for CV tasks or libai.data.build_nlp_test_loader() for NLP tasks. Overwrite it if you’d like a different data loader.

classmethod test(cfg, test_loaders, model, evaluator=None)[source]

Evaluate the given model. The given model is expected to already contain weights to evaluate.

Parameters
  • cfg (omegaconf.dictconfig.DictConfig) –

  • test_loaders (list) – a list of dataloaders, e.g. [dataloader1, dataloader2, …]

  • model (nn.Graph) –

  • evaluator (list[DatasetEvaluator] or None) – if None, will call build_evaluator(). Otherwise, it must have the same length as test_loaders.

Returns

a dict of result metrics

Return type

dict
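
A sketch of standalone evaluation built from the classmethods above. The config path is hypothetical, loading the trained weights into the model (e.g. with a Checkpointer) is assumed to happen before test() is called, and build_test_loader is assumed to return the list of loaders that test() expects:

from libai.config import LazyConfig
from libai.engine.default import DefaultTrainer

cfg = LazyConfig.load("configs/bert_pretrain.py")  # hypothetical config path
model = DefaultTrainer.build_model(cfg)            # weights assumed to be loaded elsewhere
test_loaders = DefaultTrainer.build_test_loader(cfg)
results = DefaultTrainer.test(cfg, test_loaders, model)  # dict of result metrics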

libai.engine.hooks module

class libai.engine.hooks.CallbackHook(*, before_train=None, after_train=None, before_step=None, after_step=None)[source]

Create a hook using callback functions provided by the user.

before_train()[source]

Called before the first iteration.

after_train()[source]

Called after the last iteration.

before_step()[source]

Called before each iteration.

after_step()[source]

Called after each iteration.
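
A minimal sketch of wiring user callbacks into a trainer; it assumes each callback receives the trainer instance, and that trainer is any TrainerBase built elsewhere (e.g. a DefaultTrainer):

from libai.engine.hooks import CallbackHook

hook = CallbackHook(
    before_train=lambda trainer: print("training starts"),
    after_step=lambda trainer: print("finished iteration", trainer.iter),
)
trainer.register_hooks([hook])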

class libai.engine.hooks.IterationTimer(warmup_iter=3)[source]

Track the time spent for each iteration (each run_step call in the trainer). Print a summary at the end of training. This hook uses the time between the calls to its before_step() and after_step() methods. Under the convention that before_step() of all hooks should only take a negligible amount of time, the IterationTimer hook should be placed at the beginning of the list of hooks to obtain accurate timing.

before_train()[source]

Called before the first iteration.

after_train()[source]

Called after the last iteration.

before_step()[source]

Called before each iteration.

after_step()[source]

Called after each iteration.

class libai.engine.hooks.PeriodicWriter(writers, period=20)[source]

Write events to EventStorage periodically. It is executed every period iterations and after the last iteration.

after_step()[source]

Called after each iteration.

after_train()[source]

Called after the last iteration.

class libai.engine.hooks.PeriodicCheckpointer(checkpointer: libai.utils.checkpoint.Checkpointer, period: int, max_iter: Optional[int] = None, max_to_keep: Optional[int] = None, file_prefix: str = 'model')[source]

Same as libai.utils.checkpoint.PeriodicCheckpointer, but as a hook. Note that when used as a hook, it is unable to save additional data other than what’s defined by the given checkpointer. It is executed every period iterations and after the last iteration.

before_train()[source]

Called before the first iteration.

after_step()[source]

Called after each iteration.
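
A sketch of registering it on a trainer, reusing the checkpointer attribute documented above:

from libai.engine.hooks import PeriodicCheckpointer

# save every 5000 iterations and keep only the 3 most recent checkpoints
trainer.register_hooks([
    PeriodicCheckpointer(trainer.checkpointer, period=5000, max_to_keep=3),
])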

class libai.engine.hooks.BestCheckpointer(eval_period: int, checkpointer: libai.utils.checkpoint.Checkpointer, val_metric: str, mode: str = 'max', file_prefix: str = 'model_best')[source]

Checkpoints the best model weights based on a given metric. This hook should be used in conjunction with, and executed after, the hook that produces the metric, e.g. EvalHook.

after_step()[source]

Called after each iteration.

after_train()[source]

Called after the last iteration.
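
A sketch of its registration; the metric key "Acc@1" is hypothetical and must match whatever the evaluation hook actually records:

from libai.engine.hooks import BestCheckpointer

trainer.register_hooks([
    BestCheckpointer(
        eval_period=1000,
        checkpointer=trainer.checkpointer,
        val_metric="Acc@1",   # hypothetical metric name
        mode="max",
    ),
])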

class libai.engine.hooks.EvalHook(eval_period, eval_function)[source]

Run an evaluation function periodically, and at the end of training. It is executed every eval_period iterations and after the last iteration.

after_step()[source]

Called after each iteration.

after_train()[source]

Called after the last iteration.
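
A sketch of periodic evaluation with a user-provided function; run_eval is hypothetical and only needs to return a dict of scalar metrics:

from libai.engine.hooks import EvalHook

def run_eval():
    # hypothetical evaluation function returning scalar metrics
    return {"Acc@1": 0.0}

trainer.register_hooks([EvalHook(eval_period=1000, eval_function=run_eval)])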

class libai.engine.hooks.LRScheduler(optimizer=None, scheduler=None)[source]

A hook which executes a OneFlow built-in LR scheduler and summarizes the LR. It is executed after every iteration.

before_train()[source]

Called before the first iteration.

after_step()[source]

Called after each iteration.

libai.engine.trainer module

class libai.engine.trainer.HookBase[source]

Base class for hooks that can be registered with TrainerBase.

Each hook can implement 4 methods. The way they are called is demonstrated in the following snippet:

hook.before_train()
for iter in range(start_iter, max_iter):
    hook.before_step()
    trainer.run_step()
    hook.after_step()
iter += 1
hook.after_train()

Notes

  1. In the hook methods, users can use self.trainer to access more properties about the context (e.g., model, current iteration, or config if using DefaultTrainer).

  2. A hook that does something in before_step() can often be implemented equivalently in after_step(). If the hook takes non-trivial time, it is strongly recommended to implement the hook in after_step() instead of before_step(). The convention is that before_step() should only take negligible time.

    Following this convention will allow hooks that do care about the difference between before_step() and after_step() (e.g., timer) to function properly.

trainer: libai.engine.trainer.TrainerBase = None

A weak reference to the trainer object. Set by the trainer when the hook is registered.

before_train()[source]

Called before the first iteration.

after_train()[source]

Called after the last iteration.

before_step()[source]

Called before each iteration.

after_step()[source]

Called after each iteration.
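
A sketch of a custom hook following the conventions above; it does its (cheap) work in after_step() and reads the trainer through the reference set at registration time:

from libai.engine.trainer import HookBase

class IterationPrinter(HookBase):
    def after_step(self):
        # self.trainer is set by the trainer when the hook is registered
        if self.trainer.iter % 100 == 0:
            print("reached iteration", self.trainer.iter)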

class libai.engine.trainer.TrainerBase[source]

Base class for an iterative trainer with hooks. The only assumption made here is that the training runs in a loop; a subclass can implement what the loop is. No assumptions are made about the existence of a dataloader, optimizer, model, etc.

iter

The current iteration.

Type

int

start_iter

The iteration to start with. By convention the minimum possible value is 0.

Type

int

max_iter

The iteration to end training.

Type

int

storage

An EventStorage that’s opened during the course of training.

Type

EventStorage

register_hooks(hooks)[source]

Register hooks to the trainer. The hooks are executed in the order they are registered.

Parameters

hooks (list[Optional[HookBase]]) – list of hooks

train(start_iter: int, max_iter: int)[source]
Parameters
  • start_iter (int) – See docs above

  • max_iter (int) – See docs above

static write_metrics(loss_dict: Mapping[str, oneflow.Tensor], data_time: float, prefix: str = '') → None[source]
Parameters
  • loss_dict (dict) – dict of scalar losses

  • data_time (float) – time taken by the dataloader iteration

  • prefix (str) – prefix for logging keys

class libai.engine.trainer.EagerTrainer(model, data_loader, optimizer, grad_acc_steps=1)[source]

A simple eager trainer for the most common type of task: single-cost single-optimizer single-data-source iterative optimization, optionally using data-parallelism. It assumes that in every step, you:

  1. Compute the loss with data from the data_loader.

  2. Compute the gradients with the above loss.

  3. Update the model with the optimizer.

All other tasks during training (checkpointing, logging, evaluation, LR schedule) are maintained by hooks, which can be registered by TrainerBase.register_hooks(). If you want to do anything fancier than this, either subclass TrainerBase and implement your own run_step, or write your own training loop.
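
The three numbered steps correspond to a plain eager-mode update. The toy snippet below only illustrates that pattern with OneFlow; it is not the EagerTrainer implementation itself:

import oneflow as flow
import oneflow.nn as nn

model = nn.Linear(8, 2)
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)

x = flow.randn(4, 8)        # stands in for a batch from the data_loader
target = flow.randn(4, 2)

loss = nn.functional.mse_loss(model(x), target)  # 1. compute the loss
loss.backward()                                  # 2. compute the gradients
optimizer.step()                                 # 3. update the model
optimizer.zero_grad()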

run_step(get_batch: Callable, input_placement_device: str = 'cuda')[source]

Implement the standard training logic described above.

class libai.engine.trainer.GraphTrainer(graph, data_loader, grad_acc_steps=1)[source]

A simple graph trainer for training and evaluating models in a static graph mode.

run_step(get_batch: Callable, input_placement_device: str = 'cuda')[source]

Implement the standard training logic described above.