Quick Run¶

This is a step-by-step tutorial on how to get started with LiBai:

Quick Run
- Train Bert-large Model Parallelly
  - Prepare the Data and the Vocab
  - How to Train Bert_large Model with Parallelism
- Train VisionTransformer on ImageNet Dataset
  - Prepare the Data
  - Train vit Model from Scratch

Train Bert-large Model Parallelly¶

Prepare the Data and the Vocab¶

We have prepared relevant datasets, which can be downloaded from the following links:

Download the dataset and move the data file to the folder. The file structure should be like:

$ tree data
path/to/bert_data
├── bert-base-chinese-vocab.txt
├── loss_compara_content_sentence.bin
└── loss_compara_content_sentence.idx

How to Train Bert_large Model with Parallelism¶

We provide train.sh for execute training. Before invoking the script, perform the following steps.

Step 1. Set data path and vocab path

Update the data path and vocab path in bert_large_pretrain config file:

# Refine data path and vocab path to data folder
vocab_file = "/path/to/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "/path/to/bert_data/loss_compara_content_sentence"

Step 2. Configure your parameters

In the configs/bert_large_pretrain.py provided, a set of parameters are defined including training scheme, model, etc.
You can also modify the parameters setting. For example, if you want to use 8 GPUs for training, you can refer to the file configs/common/train.py. If you want to train model with 2D mesh hybrid parallelism (4 groups for data parallel and 2 groups for tensor parallel), you can set the the parameters as follows:

train.dist.data_parallel_size=4
train.dist.tensor_parallel_size=2

Step 3. Invoke parallel training

To train BertForPreTraining model on a single node with 8 GPUs, run:

bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8

To train BertForPreTraining model on 2 nodes with 16 GPUs,

in node0, run:
```
NODE=2 NODE_RANK=0 ADDR=192.168.0.0 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```
NODE=2 means total number of nodes

NODE_RANK=0 means current node is node0

ADDR=192.168.0.0 means the ip address of node0

PORT=12345 means the port of node0

in node1, run:
```
NODE=2 NODE_RANK=1 ADDR=192.168.0.0 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```
NODE=2 means total number of nodes

NODE_RANK=1 means current node is node1

ADDR=192.168.0.0 means the ip address of node0

PORT=12345 means the port of node0

Train VisionTransformer on ImageNet Dataset¶

Prepare the Data¶

For ImageNet, we use standard ImageNet dataset, which can be downloaded from http://image-net.org/.

For the standard folder dataset, move validation images to labeled sub-folders. The file structure should be like:

$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...

Train vit Model from Scratch¶

Update the data path in vit_imagenet config file:

# Refine data path to imagenet data folder
dataloader.train.dataset[0].root = "/path/to/imagenet"
dataloader.test[0].dataset.root = "/path/to/imagenet"

To train vit_tiny_patch16_224 model on ImageNet on a single node with 8 GPUs for 300 epochs, run:

bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8

The default vit model in LiBai is set to vit_tiny_patch16_224. To train other vit models, update the vit_imagenet config file by importing other vit models in the config file as follows:

# from .common.models.vit.vit_tiny_patch16_224 import model
from .common.models.vit.vit_base_patch16_224 import model