libai.engine¶
libai.engine.default module¶
-
libai.engine.default.default_setup(cfg, args)[source]¶
Perform some basic common setups at the beginning of a job, including:
Set up the libai logger
Log basic information about the environment, command-line arguments, and config
Set up the distributed environment
Set up the tokenizer if it is an NLP-related task
Check batch_size
Back up the config to the output directory
Compile dependencies
- Parameters
args (argparse.Namespace) – the command-line arguments to be logged
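For context, a minimal sketch of how default_setup is typically used at the start of a job, in the spirit of tools/train_net.py. It assumes default_argument_parser and LazyConfig are importable from libai.config and that --resume is among the parsed flags; adjust to your installation.
from libai.config import LazyConfig, default_argument_parser
from libai.engine import DefaultTrainer, default_setup

def main(args):
    cfg = LazyConfig.load(args.config_file)  # build the config object
    default_setup(cfg, args)                 # logger, env info, distributed setup, ...
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    return trainer.train()

if __name__ == "__main__":
    main(default_argument_parser().parse_args())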
-
class libai.engine.default.DefaultTrainer(cfg)[source]¶
A trainer with default training logic. Compared to TrainerBase, it also contains the following logic:
Create the model, optimizer, scheduler, and dataloader from the given config.
Load a checkpoint or cfg.MODEL.WEIGHTS, if it exists.
Register a few common hooks defined by the config.
It is created with standard features to simplify the standard model training workflow and reduce code boilerplate for users who only need that workflow. This means the class makes many assumptions about your training logic that may easily become invalid in new research. In fact, any assumptions beyond those made in TrainerBase are too much for research. The code of this class has been annotated with the restrictive assumptions it makes. When they do not work for you, you are encouraged to:
Overwrite methods of this class, OR
Use TrainerBase, which only does minimal SGD training and nothing else; you can then add your own hooks if needed. OR
Write your own training loop similar to tools/train_net.py.
Also note that the behavior of this class, like other functions/classes in this file, is not stable, since it is meant to represent the “common default behavior”. It is only guaranteed to work well with the standard models and training workflow in libai. To obtain more stable behavior, write your own training logic with other public APIs.
Examples:
trainer = DefaultTrainer(cfg)
trainer.resume_or_load()  # load last checkpoint or MODEL.WEIGHTS
trainer.train()
-
scheduler¶
-
checkpointer¶
- Type
libai.utils.checkpoint.Checkpointer
-
cfg¶
- Type
omegaconf.dictconfig.DictConfig
-
resume_or_load(resume=True)[source]¶
If resume==True and cfg.train.output_dir contains the last checkpoint (defined by a last_checkpoint file), resume from that file. Resuming means loading all available states (e.g. optimizer and scheduler) and updating the iteration counter from the checkpoint; cfg.train.load_weight will not be used.
Otherwise, this is considered an independent training run. The method will load model weights from the file cfg.train.load_weight (but will not load other states) and start from iteration 0.
- Parameters
resume (bool) – whether to resume or not
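The two modes, as a short sketch (trainer is an existing DefaultTrainer instance):
# Resume from the last checkpoint in cfg.train.output_dir if one exists,
# restoring optimizer/scheduler state and the iteration counter:
trainer.resume_or_load(resume=True)

# Independent run: load model weights from cfg.train.load_weight only
# (no other states) and start from iteration 0:
trainer.resume_or_load(resume=False)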
-
build_hooks()[source]¶
Build a list of default hooks, including timing, evaluation, checkpointing, lr scheduling, precise BN, and event writing.
- Returns
a list of default hooks
- Return type
list[HookBase]
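A sketch of extending the defaults rather than replacing them; MyHook stands in for a hypothetical HookBase subclass of your own:
class MyTrainer(DefaultTrainer):
    def build_hooks(self):
        hooks = super().build_hooks()  # timing, checkpointing, lr scheduling, ...
        hooks.append(MyHook())         # hypothetical custom hook, runs last
        return hooks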
-
build_writers()[source]¶
Build a list of writers to be used. By default it contains writers that write metrics to the screen, a JSON file, and a TensorBoard event file, respectively. If you'd like a different list of writers, you can overwrite it in your trainer.
- Returns
a list of EventWriter objects.
- Return type
list[EventWriter]
It is now implemented by:
return [
    CommonMetricPrinter(self.global_batch_size, self.max_iter),
    JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
    TensorboardXWriter(self.cfg.train.output_dir),
]
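A sketch of a trainer with a different writer list, e.g. dropping the TensorBoard writer. The import path for the writer classes is an assumption (they are expected under libai.utils.events; adjust to your installation):
import os
from libai.utils.events import CommonMetricPrinter, JSONWriter

class NoTensorboardTrainer(DefaultTrainer):
    def build_writers(self):
        # keep console and JSON metrics only
        return [
            CommonMetricPrinter(self.global_batch_size, self.max_iter),
            JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
        ]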
-
train()[source]¶
Run training.
- Returns
OrderedDict of results, if evaluation is enabled. Otherwise None.
-
classmethod get_batch(data: libai.data.structures.Instance, input_placement_device: str = 'cuda', mixup_func: Optional[Callable] = None)[source]¶
Convert a batched local tensor to a distributed tensor for the model step.
If you want to do something with the batched data before the model (e.g. mixup), you can rewrite this function.
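A sketch of such a rewrite, assuming the default implementation returns the batch as a dict of distributed tensors:
class AugmentedTrainer(DefaultTrainer):
    @classmethod
    def get_batch(cls, data, input_placement_device="cuda", mixup_func=None):
        # let the parent handle local-to-distributed tensor conversion
        batch = super().get_batch(data, input_placement_device, mixup_func)
        # ... apply custom transforms (e.g. mixup) to `batch` here ...
        return batch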
-
classmethod build_tokenizer(cfg)[source]¶
It now calls libai.tokenizer.build_tokenizer().
-
classmethod build_model(cfg)[source]¶
- Return type
flow.nn.Module
It now calls libai.models.build_model(). Overwrite it if you'd like a different model.
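A sketch of overwriting it; the only contract assumed here is the one stated above, i.e. the method returns a flow.nn.Module built from cfg (MyModel is hypothetical):
import oneflow as flow

class MyModelTrainer(DefaultTrainer):
    @classmethod
    def build_model(cls, cfg):
        model = MyModel(cfg)  # hypothetical flow.nn.Module
        assert isinstance(model, flow.nn.Module)
        return model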
-
classmethod build_optimizer(cfg, model)[source]¶
- Return type
flow.optim.Optimizer
It now calls libai.optim.build_optimizer(). Overwrite it if you'd like a different optimizer.
-
classmethod build_lr_scheduler(cfg, optimizer)[source]¶
It now calls libai.scheduler.build_lr_scheduler(). Overwrite it if you'd like a different scheduler.
-
classmethod build_train_loader(cfg, tokenizer=None)[source]¶
- Returns
iterable
It now calls libai.data.build_train_valid_test_loader(). Overwrite it if you'd like a different data loader.
-
classmethod build_test_loader(cfg, tokenizer=None)[source]¶
- Returns
iterable
It now calls libai.data.build_image_test_loader() for CV tasks or libai.data.build_nlp_test_loader() for NLP tasks. Overwrite it if you'd like a different data loader.
-
classmethod test(cfg, test_loaders, model, evaluator=None)[source]¶
Evaluate the given model. The given model is expected to already contain the weights to evaluate.
- Parameters
cfg (CfgNode) – the config to use
test_loaders – a list of dataloaders, e.g. [dataloader1, dataloader2, ...]
model (nn.Graph) – the model to evaluate
evaluator (list[DatasetEvaluator] or None) – if None, will call build_evaluator(). Otherwise, must have the same length as cfg.DATASETS.TEST.
- Returns
a dict of result metrics
- Return type
dict
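A standalone evaluation sketch built from the classmethods above; the Checkpointer usage and the weight path are assumptions (libai.utils.checkpoint.Checkpointer is expected to follow the usual construct-then-load pattern):
from libai.utils.checkpoint import Checkpointer

model = DefaultTrainer.build_model(cfg)
Checkpointer(model).load("/path/to/weights")  # model must hold weights before test()
test_loaders = DefaultTrainer.build_test_loader(cfg)
results = DefaultTrainer.test(cfg, test_loaders, model)  # dict of result metrics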
libai.engine.hooks module¶
-
class libai.engine.hooks.CallbackHook(*, before_train=None, after_train=None, before_step=None, after_step=None)[source]¶
Create a hook using callback functions provided by the user.
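A sketch of wiring plain functions into the training loop; it assumes each callback is invoked with the trainer instance, and that trainer is an existing TrainerBase:
def announce(trainer):
    print(f"finished step {trainer.iter}")

trainer.register_hooks([CallbackHook(after_step=announce)])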
-
class libai.engine.hooks.IterationTimer(warmup_iter=3)[source]¶
Track the time spent on each iteration (each run_step call in the trainer). Print a summary at the end of training. This hook uses the time between the calls to its before_step() and after_step() methods. Under the convention that before_step() of all hooks should only take a negligible amount of time, the IterationTimer hook should be placed at the beginning of the list of hooks to obtain accurate timing.
-
class libai.engine.hooks.PeriodicWriter(writers, period=20)[source]¶
Write events to EventStorage periodically. It is executed every period iterations and after the last iteration.
-
class libai.engine.hooks.PeriodicCheckpointer(checkpointer: libai.utils.checkpoint.Checkpointer, period: int, max_iter: Optional[int] = None, max_to_keep: Optional[int] = None, file_prefix: str = 'model')[source]¶
Same as libai.utils.checkpoint.PeriodicCheckpointer, but as a hook. Note that when used as a hook, it is unable to save additional data other than what is defined by the given checkpointer. It is executed every period iterations and after the last iteration.
-
class libai.engine.hooks.BestCheckpointer(eval_period: int, checkpointer: libai.utils.checkpoint.Checkpointer, val_metric: str, mode: str = 'max', file_prefix: str = 'model_best')[source]¶
Checkpoints the best weights based on a given metric. This hook should be used in conjunction with, and executed after, the hook that produces the metric, e.g. EvalHook.
libai.engine.trainer module¶
-
class libai.engine.trainer.HookBase[source]¶
Base class for hooks that can be registered with TrainerBase.
Each hook can implement 4 methods. The way they are called is demonstrated in the following snippet:
hook.before_train()
for iter in range(start_iter, max_iter):
    hook.before_step()
    trainer.run_step()
    hook.after_step()
    iter += 1
hook.after_train()
Notes
In the hook method, users can access self.trainer to reach more properties of the context (e.g., model, current iteration, or config if using DefaultTrainer).
A hook that does something in before_step() can often be implemented equivalently in after_step(). If the hook takes non-trivial time, it is strongly recommended to implement the hook in after_step() instead of before_step(). The convention is that before_step() should only take negligible time.
Following this convention will allow hooks that do care about the difference between before_step() and after_step() (e.g., a timer) to function properly.
-
trainer: libai.engine.trainer.TrainerBase = None¶
A weak reference to the trainer object. Set by the trainer when the hook is registered.
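A sketch of a minimal custom hook; only the methods you need have to be overridden, and self.trainer gives access to the loop state described above:
from libai.engine.trainer import HookBase

class ProgressPrinter(HookBase):
    def after_step(self):
        # runs after every trainer.run_step(); keep it cheap, per the
        # before_step()/after_step() convention above
        if self.trainer.iter % 100 == 0:
            print("reached iteration", self.trainer.iter)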
-
class libai.engine.trainer.TrainerBase[source]¶
Base class for an iterative trainer with hooks. The only assumption we make here is that the training runs in a loop. A subclass can implement what the loop is. We make no assumptions about the existence of a dataloader, optimizer, model, etc.
-
iter¶
The current iteration.
- Type
int
-
start_iter¶
The iteration to start with. By convention the minimum possible value is 0.
- Type
int
-
max_iter¶
The iteration to end training.
- Type
int
-
storage¶
An EventStorage that's opened during the course of training.
- Type
EventStorage
-
register_hooks(hooks)[source]¶
Register hooks to the trainer. The hooks are executed in the order they are registered.
- Parameters
hooks (list[Optional[HookBase]]) – list of hooks
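A usage sketch, reusing the hypothetical ProgressPrinter hook from above together with IterationTimer from libai.engine.hooks:
from libai.engine.hooks import IterationTimer

# hooks run in registration order; IterationTimer goes first for accurate timing
trainer.register_hooks([IterationTimer(), ProgressPrinter()])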
-
class libai.engine.trainer.EagerTrainer(model, data_loader, optimizer, grad_acc_steps=1)[source]¶
A simple eager trainer for the most common type of task: single-cost, single-optimizer, single-data-source iterative optimization, optionally using data parallelism. It assumes that in every step you:
Compute the loss with data from the data_loader.
Compute the gradients with the above loss.
Update the model with the optimizer.
All other tasks during training (checkpointing, logging, evaluation, LR scheduling) are maintained by hooks, which can be registered by TrainerBase.register_hooks(). If you want to do anything fancier than this, either subclass TrainerBase and implement your own run_step, or write your own training loop.
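A sketch of driving it directly; model, data_loader, and optimizer are assumed to be built elsewhere (e.g. via the DefaultTrainer classmethods), and the train(start_iter, max_iter) signature is inferred from the hook snippet above:
from libai.engine.hooks import IterationTimer
from libai.engine.trainer import EagerTrainer

trainer = EagerTrainer(model, data_loader, optimizer, grad_acc_steps=1)
trainer.register_hooks([IterationTimer()])  # add checkpointing/logging hooks as needed
trainer.train(0, 10000)  # start_iter, max_iter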