libai.engine¶
libai.engine.default module¶
-
libai.engine.default.default_setup(cfg, args)[source]¶
Perform some basic common setups at the beginning of a job, including:
Set up the libai logger
Log basic information about the environment, command-line arguments, and config
Set up the distributed environment
Set up the tokenizer if it is an NLP-related task
Check batch_size
Back up the config to the output directory
Compile dependencies
- Parameters
args (argparse.Namespace) – the command-line arguments to be logged
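For context, a minimal sketch of how default_setup is typically used at the start of a job, in the spirit of tools/train_net.py. It assumes default_argument_parser and LazyConfig are importable from libai.config and that --resume is among the parsed flags; adjust to your installation.
from libai.config import LazyConfig, default_argument_parser
from libai.engine import DefaultTrainer, default_setup

def main(args):
    cfg = LazyConfig.load(args.config_file)  # build the config object
    default_setup(cfg, args)                 # logger, env info, distributed setup, ...
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    return trainer.train()

if __name__ == "__main__":
    main(default_argument_parser().parse_args())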
-
class libai.engine.default.DefaultTrainer(cfg)[source]¶
A trainer with default training logic. Compared to TrainerBase, it also contains the following logic:
Create the model, optimizer, scheduler, and dataloader from the given config.
Load a checkpoint or cfg.MODEL.WEIGHTS, if it exists.
Register a few common hooks defined by the config.
It is created with standard features to simplify the standard model training workflow and reduce code boilerplate for users who only need that workflow. This means the class makes many assumptions about your training logic that may easily become invalid in new research. In fact, any assumptions beyond those made in TrainerBase are too much for research. The code of this class has been annotated with the restrictive assumptions it makes. When they do not work for you, you are encouraged to:
Overwrite methods of this class, OR
Use TrainerBase, which only does minimal SGD training and nothing else; you can then add your own hooks if needed. OR
Write your own training loop similar to tools/train_net.py.
Also note that the behavior of this class, like other functions/classes in this file, is not stable, since it is meant to represent the “common default behavior”. It is only guaranteed to work well with the standard models and training workflow in libai. To obtain more stable behavior, write your own training logic with other public APIs.
Examples:
trainer = DefaultTrainer(cfg)
trainer.resume_or_load()  # load last checkpoint or MODEL.WEIGHTS
trainer.train()
-
scheduler¶
-
checkpointer¶
- Type
libai.utils.checkpoint.Checkpointer
-
cfg¶
- Type
omegaconf.dictconfig.DictConfig
-
resume_or_load(resume=True)[source]¶
If resume==True and cfg.train.output_dir contains the last checkpoint (defined by a last_checkpoint file), resume from that file. Resuming means loading all available states (e.g. optimizer and scheduler) and updating the iteration counter from the checkpoint; cfg.train.load_weight will not be used.
Otherwise, this is considered an independent training run. The method will load model weights from the file cfg.train.load_weight (but will not load other states) and start from iteration 0.
- Parameters
resume (bool) – whether to resume or not
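The two modes, as a short sketch (trainer is an existing DefaultTrainer instance):
# Resume from the last checkpoint in cfg.train.output_dir if one exists,
# restoring optimizer/scheduler state and the iteration counter:
trainer.resume_or_load(resume=True)

# Independent run: load model weights from cfg.train.load_weight only
# (no other states) and start from iteration 0:
trainer.resume_or_load(resume=False)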
-
build_hooks()[source]¶
Build a list of default hooks, including timing, evaluation, checkpointing, lr scheduling, precise BN, and event writing.
- Returns
a list of default hooks
- Return type
list[HookBase]
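A sketch of extending the defaults rather than replacing them; MyHook stands in for a hypothetical HookBase subclass of your own:
class MyTrainer(DefaultTrainer):
    def build_hooks(self):
        hooks = super().build_hooks()  # timing, checkpointing, lr scheduling, ...
        hooks.append(MyHook())         # hypothetical custom hook, runs last
        return hooks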
-
build_writers()[source]¶
Build a list of writers to be used. By default it contains writers that write metrics to the screen, a JSON file, and a TensorBoard event file, respectively. If you'd like a different list of writers, you can overwrite it in your trainer.
- Returns
a list of EventWriter objects.
- Return type
list[EventWriter]
It is now implemented by:
return [
    CommonMetricPrinter(self.global_batch_size, self.max_iter),
    JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
    TensorboardXWriter(self.cfg.train.output_dir),
]
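A sketch of a trainer with a different writer list, e.g. dropping the TensorBoard writer. The import path for the writer classes is an assumption (they are expected under libai.utils.events; adjust to your installation):
import os
from libai.utils.events import CommonMetricPrinter, JSONWriter

class NoTensorboardTrainer(DefaultTrainer):
    def build_writers(self):
        # keep console and JSON metrics only
        return [
            CommonMetricPrinter(self.global_batch_size, self.max_iter),
            JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
        ]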
-
train()[source]¶
Run training.
- Returns
OrderedDict of results, if evaluation is enabled. Otherwise None.
-
classmethod get_batch(data: libai.data.structures.Instance, input_placement_device: str = 'cuda', mixup_func: Optional[Callable] = None)[source]¶
Convert a batched local tensor to a distributed tensor for the model step.
If you want to do something with the batched data before the model (e.g. mixup), you can rewrite this function.
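A sketch of such a rewrite, assuming the default implementation returns the batch as a dict of distributed tensors:
class AugmentedTrainer(DefaultTrainer):
    @classmethod
    def get_batch(cls, data, input_placement_device="cuda", mixup_func=None):
        # let the parent handle local-to-distributed tensor conversion
        batch = super().get_batch(data, input_placement_device, mixup_func)
        # ... apply custom transforms (e.g. mixup) to `batch` here ...
        return batch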
-
classmethod build_tokenizer(cfg)[source]¶
It now calls libai.tokenizer.build_tokenizer().
-
classmethod build_model(cfg)[source]¶
- Return type
flow.nn.Module
It now calls libai.models.build_model(). Overwrite it if you'd like a different model.
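A sketch of overwriting it; the only contract assumed here is the one stated above, i.e. the method returns a flow.nn.Module built from cfg (MyModel is hypothetical):
import oneflow as flow

class MyModelTrainer(DefaultTrainer):
    @classmethod
    def build_model(cls, cfg):
        model = MyModel(cfg)  # hypothetical flow.nn.Module
        assert isinstance(model, flow.nn.Module)
        return model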
-
classmethod build_optimizer(cfg, model)[source]¶
- Return type
flow.optim.Optimizer
It now calls libai.optim.build_optimizer(). Overwrite it if you'd like a different optimizer.
-
classmethod build_lr_scheduler(cfg, optimizer)[source]¶
It now calls libai.scheduler.build_lr_scheduler(). Overwrite it if you'd like a different scheduler.
-
classmethod build_train_loader(cfg, tokenizer=None)[source]¶
- Returns
iterable
It now calls libai.data.build_train_valid_test_loader(). Overwrite it if you'd like a different data loader.
-
classmethod build_test_loader(cfg, tokenizer=None)[source]¶
- Returns
iterable
It now calls libai.data.build_image_test_loader() for CV tasks or libai.data.build_nlp_test_loader() for NLP tasks. Overwrite it if you'd like a different data loader.
-
classmethod test(cfg, test_loaders, model, evaluator=None)[source]¶
Evaluate the given model. The given model is expected to already contain the weights to evaluate.
- Parameters
cfg (CfgNode) – the config to use
test_loaders – a list of dataloaders, e.g. [dataloader1, dataloader2, ...]
model (nn.Graph) – the model to evaluate
evaluator (list[DatasetEvaluator] or None) – if None, will call build_evaluator(). Otherwise, must have the same length as cfg.DATASETS.TEST.
- Returns
a dict of result metrics
- Return type
dict
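A standalone evaluation sketch built from the classmethods above; the Checkpointer usage and the weight path are assumptions (libai.utils.checkpoint.Checkpointer is expected to follow the usual construct-then-load pattern):
from libai.utils.checkpoint import Checkpointer

model = DefaultTrainer.build_model(cfg)
Checkpointer(model).load("/path/to/weights")  # model must hold weights before test()
test_loaders = DefaultTrainer.build_test_loader(cfg)
results = DefaultTrainer.test(cfg, test_loaders, model)  # dict of result metrics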
libai.engine.hooks module¶
-
class libai.engine.hooks.CallbackHook(*, before_train=None, after_train=None, before_step=None, after_step=None)[source]¶
Create a hook using callback functions provided by the user.
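A sketch of wiring plain functions into the training loop; it assumes each callback is invoked with the trainer instance, and that trainer is an existing TrainerBase:
def announce(trainer):
    print(f"finished step {trainer.iter}")

trainer.register_hooks([CallbackHook(after_step=announce)])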
-
class libai.engine.hooks.IterationTimer(warmup_iter=3)[source]¶
Track the time spent on each iteration (each run_step call in the trainer). Print a summary at the end of training. This hook uses the time between the calls to its before_step() and after_step() methods. Under the convention that before_step() of all hooks should only take a negligible amount of time, the IterationTimer hook should be placed at the beginning of the list of hooks to obtain accurate timing.
-
class libai.engine.hooks.PeriodicWriter(writers, period=20)[source]¶
Write events to EventStorage periodically. It is executed every period iterations and after the last iteration.
-
class libai.engine.hooks.PeriodicCheckpointer(checkpointer: libai.utils.checkpoint.Checkpointer, period: int, max_iter: Optional[int] = None, max_to_keep: Optional[int] = None, file_prefix: str = 'model')[source]¶
Same as libai.utils.checkpoint.PeriodicCheckpointer, but as a hook. Note that when used as a hook, it is unable to save additional data other than what is defined by the given checkpointer. It is executed every period iterations and after the last iteration.
-
class libai.engine.hooks.BestCheckpointer(eval_period: int, checkpointer: libai.utils.checkpoint.Checkpointer, val_metric: str, mode: str = 'max', file_prefix: str = 'model_best')[source]¶
Checkpoints the best weights based on a given metric. This hook should be used in conjunction with, and executed after, the hook that produces the metric, e.g. EvalHook.
libai.engine.trainer module¶
-
class libai.engine.trainer.HookBase[source]¶
Base class for hooks that can be registered with TrainerBase.
Each hook can implement 4 methods. The way they are called is demonstrated in the following snippet:
hook.before_train()
for iter in range(start_iter, max_iter):
    hook.before_step()
    trainer.run_step()
    hook.after_step()
    iter += 1
hook.after_train()
Notes
In the hook method, users can access self.trainer to reach more properties of the context (e.g., model, current iteration, or config if using DefaultTrainer).
A hook that does something in before_step() can often be implemented equivalently in after_step(). If the hook takes non-trivial time, it is strongly recommended to implement the hook in after_step() instead of before_step(). The convention is that before_step() should only take negligible time.
Following this convention will allow hooks that do care about the difference between before_step() and after_step() (e.g., a timer) to function properly.
-
trainer: libai.engine.trainer.TrainerBase = None¶
A weak reference to the trainer object. Set by the trainer when the hook is registered.
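A sketch of a minimal custom hook; only the methods you need have to be overridden, and self.trainer gives access to the loop state described above:
from libai.engine.trainer import HookBase

class ProgressPrinter(HookBase):
    def after_step(self):
        # runs after every trainer.run_step(); keep it cheap, per the
        # before_step()/after_step() convention above
        if self.trainer.iter % 100 == 0:
            print("reached iteration", self.trainer.iter)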
-
class libai.engine.trainer.TrainerBase[source]¶
Base class for an iterative trainer with hooks. The only assumption we make here is that the training runs in a loop. A subclass can implement what the loop is. We make no assumptions about the existence of a dataloader, optimizer, model, etc.
-
iter¶
The current iteration.
- Type
int
-
start_iter¶
The iteration to start with. By convention the minimum possible value is 0.
- Type
int
-
max_iter¶
The iteration to end training.
- Type
int
-
storage¶
An EventStorage that's opened during the course of training.
- Type
EventStorage
-
register_hooks(hooks)[source]¶
Register hooks to the trainer. The hooks are executed in the order they are registered.
- Parameters
hooks (list[Optional[HookBase]]) – list of hooks
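A usage sketch, reusing the hypothetical ProgressPrinter hook from above together with IterationTimer from libai.engine.hooks:
from libai.engine.hooks import IterationTimer

# hooks run in registration order; IterationTimer goes first for accurate timing
trainer.register_hooks([IterationTimer(), ProgressPrinter()])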
-
class libai.engine.trainer.EagerTrainer(model, data_loader, optimizer, grad_acc_steps=1)[source]¶
A simple eager trainer for the most common type of task: single-cost, single-optimizer, single-data-source iterative optimization, optionally using data parallelism. It assumes that in every step you:
Compute the loss with data from the data_loader.
Compute the gradients with the above loss.
Update the model with the optimizer.
All other tasks during training (checkpointing, logging, evaluation, LR scheduling) are maintained by hooks, which can be registered by TrainerBase.register_hooks(). If you want to do anything fancier than this, either subclass TrainerBase and implement your own run_step, or write your own training loop.
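A sketch of driving it directly; model, data_loader, and optimizer are assumed to be built elsewhere (e.g. via the DefaultTrainer classmethods), and the train(start_iter, max_iter) signature is inferred from the hook snippet above:
from libai.engine.hooks import IterationTimer
from libai.engine.trainer import EagerTrainer

trainer = EagerTrainer(model, data_loader, optimizer, grad_acc_steps=1)
trainer.register_hooks([IterationTimer()])  # add checkpointing/logging hooks as needed
trainer.train(0, 10000)  # start_iter, max_iter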