Auto Parallel Training

LiBai supports auto-parallel training: during training, LiBai automatically searches for an efficient parallel strategy for the given model. To try it out, follow the steps below.

Installation

Install the OneFlow nightly build, replacing [PLATFORM] with one of the values listed below:

python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/[PLATFORM]

Available values for [PLATFORM]:

Platform	CUDA Driver Version	Supported GPUs
cu112	>= 450.80.02	GTX 10xx, RTX 20xx, A100, RTX 30xx
cu102	>= 440.33	GTX 10xx, RTX 20xx
cpu	N/A	N/A
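
As a convenience, the install command can be assembled from the platform tag. The following is a minimal sketch, not part of LiBai or OneFlow: the names `BASE_URL`, `PLATFORMS`, and `install_command` are hypothetical, and the driver versions are copied from the table above.

```python
# Hypothetical helper for illustration only; not shipped with LiBai or OneFlow.
BASE_URL = "https://staging.oneflow.info/branch/master"

# Platform tag -> minimum CUDA driver version, copied from the table above.
PLATFORMS = {
    "cu112": "450.80.02",
    "cu102": "440.33",
    "cpu": None,  # the CPU build has no driver requirement
}


def install_command(platform: str) -> str:
    """Return the pip command that installs the OneFlow nightly for `platform`."""
    if platform not in PLATFORMS:
        raise ValueError(
            f"unknown platform {platform!r}; expected one of {sorted(PLATFORMS)}"
        )
    return f"python3 -m pip install --pre oneflow -f {BASE_URL}/{platform}"


if __name__ == "__main__":
    print(install_command("cu112"))
```

Rejecting unknown tags up front avoids a pip run against a URL that does not exist.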

Train/Evaluate model in auto-parallel mode

You can train your own model in auto-parallel mode either by updating the config file or by overriding the setting on the training command line.

Modify config file

# your config
from .common.models.graph import graph

graph.auto_parallel.enabled = True

Train the model with auto-parallel on 4 GPUs:

bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4

Directly modify the training command line

Auto-parallel training on 4 GPUs:

bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 graph.auto_parallel.enabled=True

Auto-parallel evaluation on 4 GPUs:

bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 --eval graph.auto_parallel.enabled=True
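
The trailing `graph.auto_parallel.enabled=True` argument has the same effect as setting the attribute in the config file. The sketch below illustrates the general idea of a dotted `KEY=VALUE` override using a plain nested dict. It is not LiBai's actual implementation: `apply_override` is a hypothetical name, and the starting value of `False` is an assumption made for the example.

```python
import ast
from typing import Any, Dict


def apply_override(cfg: Dict[str, Any], override: str) -> None:
    """Apply one "a.b.c=value" override to a nested dict, in place."""
    dotted_key, _, raw_value = override.partition("=")
    if not dotted_key or not raw_value:
        raise ValueError(f"expected KEY=VALUE, got {override!r}")

    # Parse Python literals such as True, 4, or 0.1; anything else stays a string.
    try:
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value

    # Walk (or create) the parent nodes, then set the leaf.
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Assumed starting state for this example: auto-parallel switched off.
cfg = {"graph": {"auto_parallel": {"enabled": False}}}
apply_override(cfg, "graph.auto_parallel.enabled=True")
print(cfg["graph"]["auto_parallel"]["enabled"])  # True
```

An override passed this way applies only to that run and leaves the config file unchanged, which makes it convenient for comparing runs with and without auto-parallel.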

More details on instructions and interfaces

See OneFlow Auto-Parallelism.