Auto Parallel Training
LiBai supports auto-parallel training: during training, LiBai automatically searches for an efficient parallel strategy for the given model. You can try auto-parallel training with the following steps.
Installation
Install the OneFlow nightly build:
python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/[PLATFORM]
All available [PLATFORM] values:
Platform | CUDA Driver Version | Supported GPUs |
---|---|---|
cu112 | >= 450.80.02 | GTX 10xx, RTX 20xx, A100, RTX 30xx |
cu102 | >= 440.33 | GTX 10xx, RTX 20xx |
cpu | N/A | N/A |
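As a small convenience, the install command above can be assembled for a chosen platform and the result checked afterwards. The following is a minimal sketch, not part of LiBai or OneFlow: the helper names `build_install_command` and `installed_version` are hypothetical, and the platform list is simply copied from the table above.

```python
import importlib.metadata
from typing import List, Optional

# Platform names and index URL copied from this page.
PLATFORMS = ("cu112", "cu102", "cpu")
BASE_URL = "https://staging.oneflow.info/branch/master/"


def build_install_command(platform: str) -> List[str]:
    """Return the pip command from this page with [PLATFORM] filled in."""
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform {platform!r}; expected one of {PLATFORMS}")
    return ["python3", "-m", "pip", "install", "--pre", "oneflow", "-f", BASE_URL + platform]


def installed_version(package: str = "oneflow") -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print the command for a CUDA 11.2 machine, then report what is installed.
    print(" ".join(build_install_command("cu112")))
    print("installed oneflow:", installed_version())
```

The command is only printed here, not executed, so you can review it before running it in your shell.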
Train/Evaluate model in auto-parallel mode
To train your own model in auto-parallel mode, enable it either in the config file or directly on the training command line.
Modify config file
# your config
from .common.models.graph import graph
graph.auto_parallel.enabled = True
Train the model in auto-parallel mode on 4 GPUs:
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4
Directly modify the training command line
Auto-parallel training:
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 graph.auto_parallel.enabled=True
Auto-parallel evaluation:
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 --eval graph.auto_parallel.enabled=True
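To make the relationship between the two approaches concrete, the sketch below shows how a command-line override such as graph.auto_parallel.enabled=True can map onto the same nested attribute that the config file sets. This is an illustration only and is not LiBai's actual override parser: `apply_override` is a hypothetical helper, the config object is a plain `SimpleNamespace` stand-in, and the starting value of `False` is an assumption made for the example.

```python
import ast
from types import SimpleNamespace


def apply_override(cfg, override: str) -> None:
    """Apply one 'dotted.key=value' override to a nested namespace in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = getattr(node, name)  # raises AttributeError if the path does not exist
    try:
        value = ast.literal_eval(raw_value)  # "True" -> True, "4" -> 4
    except (ValueError, SyntaxError):
        value = raw_value  # keep anything else as a plain string
    setattr(node, leaf, value)


# Hypothetical stand-in for the config; it holds only the field used on this page.
cfg = SimpleNamespace(graph=SimpleNamespace(auto_parallel=SimpleNamespace(enabled=False)))

# Same effect as writing `graph.auto_parallel.enabled = True` in the config file.
apply_override(cfg, "graph.auto_parallel.enabled=True")
print(cfg.graph.auto_parallel.enabled)  # prints True
```

Either route ends with the same setting, so the command-line form is convenient for a one-off run while the config edit keeps the choice under version control.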