Auto Parallel Training

LiBai supports auto-parallel training: it automatically searches for an efficient parallel training strategy for a given model during training. To try it out, follow the steps below.

Installation

Install OneFlow nightly

python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/[PLATFORM]
  • Replace [PLATFORM] with one of the following values:

Platform   CUDA Driver Version   Supported GPUs
cu112      >= 450.80.02          GTX 10xx, RTX 20xx, A100, RTX 30xx
cu102      >= 440.33             GTX 10xx, RTX 20xx
cpu        N/A                   N/A
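
As a rough illustration of how the table maps to an install command, the sketch below picks a platform from an NVIDIA driver version string and builds the matching pip command. The helper names (`parse_version`, `pick_platform`, `install_command`) are hypothetical and not part of LiBai or OneFlow. The sketch checks only the minimum driver version, not the supported-GPU column, and falling back to `cpu` when no CUDA platform qualifies is an assumption.

```python
# Minimum driver versions copied from the platform table; the rest is illustrative.
MIN_DRIVER = {
    "cu112": (450, 80, 2),
    "cu102": (440, 33),
}
BASE_URL = "https://staging.oneflow.info/branch/master/"


def parse_version(text):
    """Turn a dotted version such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in text.split("."))


def pick_platform(driver_version=None):
    """Return the newest CUDA platform the driver satisfies, else 'cpu'."""
    if driver_version is None:
        return "cpu"
    installed = parse_version(driver_version)
    # cu112 is checked first because it has the higher driver requirement.
    for platform in ("cu112", "cu102"):
        if installed >= MIN_DRIVER[platform]:
            return platform
    return "cpu"  # assumption: use the CPU build when the driver is too old


def install_command(platform):
    """Build the pip command from the Installation section for one platform."""
    return f"python3 -m pip install --pre oneflow -f {BASE_URL}{platform}"


print(install_command(pick_platform("470.57.02")))
```

Check the GPU column by hand before installing: a driver that is new enough for cu102 does not imply that cu102 supports the GPU (A100 and RTX 30xx are listed only under cu112).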

Train/Evaluate model in auto-parallel mode

You can train or evaluate your own model in auto-parallel mode by enabling it either in the config file or directly on the training command line:

Modify config file

# your config
from .common.models.graph import graph

graph.auto_parallel.enabled = True

Train the model in auto-parallel mode on 4 GPUs:

bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4

Directly modify the training command line

  • auto-parallel training:

bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 graph.auto_parallel.enabled=True
  • auto-parallel evaluation:

bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 --eval graph.auto_parallel.enabled=True
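
The trailing `graph.auto_parallel.enabled=True` argument plays the same role as the assignment in the config file: a dotted key names a nested config field and the text after `=` is its new value. The sketch below mimics that idea on a plain nested dict. It is not LiBai's actual override code; `apply_override` is a hypothetical helper, and parsing the value with `ast.literal_eval` is an assumption made for illustration.

```python
import ast


def apply_override(config, override):
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted_key, raw_value = override.split("=", 1)
    try:
        # Interpret Python literals: "True" becomes True, "4" becomes 4.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value  # anything else is kept as a plain string
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the nesting, creating missing levels on the way.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Stand-in for the `graph` section of a config, with auto-parallel off.
config = {"graph": {"auto_parallel": {"enabled": False}}}
apply_override(config, "graph.auto_parallel.enabled=True")
print(config["graph"]["auto_parallel"]["enabled"])
```

Enabling the option on the command line leaves the config file unchanged, which is convenient for comparing runs with and without auto-parallel.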

Further instructions and interface details

See OneFlow Auto-Parallelism.