Training¶
To run training, we highly recommend using the standardized trainer
in LiBai.
Trainer Abstraction¶
LiBai provides a standardized trainer abstraction with a hook system that simplifies standard training behavior.
DefaultTrainer is initialized from the lazy config system and is used by tools/train_net.py and many scripts. It includes a number of standard default behaviors that you may want to opt into, including default configurations for the optimizer, learning rate scheduler, logging, evaluation, model checkpointing, etc.
For simple customizations (e.g. changing the optimizer, evaluator, LR scheduler, data loader, etc.), you can simply modify the corresponding configuration in config.py according to your own needs (refer to Config_System).
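For example, here is a minimal sketch of overriding the default optimizer settings in config.py (the get_config helper and the "common/optim.py" path follow LiBai's config layout, but verify them against your own project):

# config.py
from libai.config import get_config

optim = get_config("common/optim.py").optim  # load the default optimizer config
optim.lr = 1e-4  # override the learning rate
optim.weight_decay = 0.01  # override the weight decay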
Customize a DefaultTrainer¶
For more complicated customizations, we recommend overriding functions in DefaultTrainer.
In DefaultTrainer, the training process consists of run_step in the inner trainer and a set of hooks, both of which can be modified according to your own needs.
The following code shows how run_step and hooks work during training:
class DefaultTrainer(TrainerBase):
    def train(self, start_iter: int, max_iter: int):
        ...
        with EventStorage(self.start_iter) as self.storage:
            try:
                self.before_train()     # in hooks
                for self.iter in range(start_iter, max_iter):
                    self.before_step()  # in hooks
                    self.run_step()     # in self._trainer
                    self.after_step()   # in hooks
                self.iter += 1
            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()      # in hooks
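In other words, before_train and after_train are called once per training run, while before_step, run_step, and after_step are called once per iteration, in that order.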
Refer to tools/train_net.py when writing your own tools/my_train_net.py with a modified _trainer and hooks. The next subsections introduce how to modify them.
# tools/my_train_net.py
import ...
from libai.engine import DefaultTrainer
from path_to_myhook import myhook
from path_to_mytrainer import _mytrainer


class MyTrainer(DefaultTrainer):
    def __init__(self, cfg):
        super().__init__(cfg)
        # add your _trainer according to your own needs
        # NOTE: run_step() is overridden in your _trainer
        self._trainer = _mytrainer(self.model, self.train_loader, self.optimizer)

    def build_hooks(self):
        ret = [
            hooks.IterationTimer(),
            hooks.LRScheduler(),
            hooks.PeriodicCheckpointer(self.checkpointer, self.cfg.train.checkpointer.period),
        ]
        # add your hook according to your own needs
        # NOTE: all hooks will be called sequentially
        ret.append(myhook())
        ...
        if dist.is_main_process():
            ret.append(hooks.PeriodicWriter(self.build_writers(), self.cfg.train.log_period))
        return ret


logger = logging.getLogger("libai." + __name__)


def main(args):
    ...
    trainer = MyTrainer(cfg)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    main(args)
No matter how the trainer & hook system is designed, there will always be some non-standard behaviors that are hard to support in LiBai, especially in research. Therefore, we intentionally keep the trainer & hook system minimal rather than powerful.
Customize Hooks in Trainer¶
You can customize your own hooks to perform extra tasks during training.
HookBase in libai/engine/trainer.py defines the standard hook interface. You can override its functions according to your own needs. Please refer to libai/engine/hooks.py for more details.
class HookBase:
    def before_train(self):
        """
        Called before the first iteration.
        """

    def after_train(self):
        """
        Called after the last iteration.
        """

    def before_step(self):
        """
        Called before each iteration.
        """

    def after_step(self):
        """
        Called after each iteration.
        """
Depending on the functionality of the hook, you can specify what it does at each stage of training in before_train, after_train, before_step, and after_step. For example, to log the trainer's iter during training:
class InfoHook(HookBase):
    def before_train(self):
        logger.info(f"start training at {self.trainer.iter}")

    def after_train(self):
        logger.info(f"end training at {self.trainer.iter}")

    def after_step(self):
        if self.trainer.iter % 100 == 0:
            logger.info(f"iteration {self.trainer.iter}!")
Then you can import your hook in tools/my_train_net.py and register it in build_hooks, as sketched below.
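Here is a minimal sketch of registering the InfoHook defined above (the path_to_infohook import is a placeholder for wherever you saved the hook):

# tools/my_train_net.py
from libai.engine import DefaultTrainer
from path_to_infohook import InfoHook


class MyTrainer(DefaultTrainer):
    def build_hooks(self):
        # reuse the default hooks and append the custom one;
        # hooks are called sequentially, so InfoHook runs after the defaults
        ret = super().build_hooks()
        ret.append(InfoHook())
        return ret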
Modify run_step in Trainer¶
LiBai provides EagerTrainer and GraphTrainer in libai/engine/trainer.py by default. EagerTrainer is used in eager mode, while GraphTrainer is used in graph mode; the mode is determined by the graph.enabled parameter in your config.py.
For more details about eager and graph mode, please refer to the OneFlow documentation.
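For instance, a config.py might toggle the mode like this (a minimal sketch; the "common/models/graph.py" path is an assumption based on LiBai's common configs, so adjust it to wherever your project defines the graph config):

# config.py
from libai.config import get_config

graph = get_config("common/models/graph.py").graph  # assumed path to the graph config
graph.enabled = True  # True -> GraphTrainer (graph mode); False -> EagerTrainer (eager mode)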
For example, you can use a temporary variable to keep the model's output in run_step:
from typing import Callable

from libai.engine.trainer import EagerTrainer


class MyEagerTrainer(EagerTrainer):
    def __init__(self, model, data_loader, optimizer, grad_acc_steps=1):
        super().__init__(model, data_loader, optimizer, grad_acc_steps)
        self.previous_output = None

    def run_step(self, get_batch: Callable):
        ...
        loss_dict = self.model(**data)
        # keep the model's output from this step for later use
        self.previous_output = loss_dict
        ...
Then you can set your MyEagerTrainer as self._trainer in tools/my_train_net.py, as sketched below.
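A minimal sketch (the self.model, self.train_loader, and self.optimizer attributes follow the MyTrainer example above; verify them against your DefaultTrainer):

# tools/my_train_net.py
from libai.engine import DefaultTrainer
from path_to_mytrainer import MyEagerTrainer


class MyTrainer(DefaultTrainer):
    def __init__(self, cfg):
        super().__init__(cfg)
        # replace the default trainer with the customized one
        self._trainer = MyEagerTrainer(self.model, self.train_loader, self.optimizer)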
Logging of Metrics¶
During training, the trainer puts metrics into a centralized EventStorage. The following code can be used to access it and log metrics to it:
from libai.utils.events import get_event_storage

# inside the model:
if self.training:
    value = ...  # compute the value from inputs
    storage = get_event_storage()
    storage.put_scalar("some_accuracy", value)
See EventStorage for more details.
Metrics are then written to various destinations with EventWriter. Metric information will be written to {cfg.train.output_dir}/metrics.json. DefaultTrainer enables a few EventWriters with default configurations. See above for how to customize them.
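For example, here is a minimal sketch of overriding build_writers in your trainer (CommonMetricPrinter and JSONWriter mirror detectron2-style writers and are assumed to live in libai.utils.events, and self.max_iter is assumed to be set by the trainer; verify both against your LiBai version):

# tools/my_train_net.py
import os

from libai.engine import DefaultTrainer
from libai.utils.events import CommonMetricPrinter, JSONWriter


class MyTrainer(DefaultTrainer):
    def build_writers(self):
        # print common metrics to the terminal and dump all scalars to metrics.json
        return [
            CommonMetricPrinter(self.max_iter),
            JSONWriter(os.path.join(self.cfg.train.output_dir, "metrics.json")),
        ]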