Benchmarks

Here we provide the benchmark speed test results of LiBai's models compared with the Megatron-LM implementations. In LiBai v0.2.0, we only benchmark speed on up to 32 GPUs across 4 nodes, and all experiments were conducted under the same settings for a fair comparison.

Settings

Environments

  • The commit of LiBai for comparison: commit

  • The commit of OneFlow for comparison: commit

  • The commit of Megatron-LM for comparison: commit

Model Hyper-parameters

  • BERT Model

    • num_layers = 24/48
    • num_attention_heads = 16
    • hidden_size = 1024
    • seq_length = 512

  • GPT-2 Model

    • num_layers = 24/48
    • num_attention_heads = 16
    • hidden_size = 1024
    • seq_length = 1024
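
For convenience, the two benchmarked configurations can be summarized in plain Python. This is an illustrative summary only; the variable and key names below are ours and are not LiBai config fields.

```python
# Illustrative summary of the benchmarked model settings (not LiBai config keys).
BERT_CFG = dict(
    num_layers=24,            # raised to 48 when pipeline-parallel size = 8
    num_attention_heads=16,
    hidden_size=1024,
    seq_length=512,
)

GPT2_CFG = dict(
    num_layers=24,            # raised to 48 when pipeline-parallel size = 8
    num_attention_heads=16,
    hidden_size=1024,
    seq_length=1024,
)
```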

Main Results

Here we explain the evaluation indicators in the following tables:

  • fp16: mixed precision training

  • nl: num layers (when pipeline-parallel size = 8, we increase num layers from 24 to 48 so that each pipeline stage has a reasonable number of layers to compute)

  • ac: enable activation checkpointing

  • mb: micro-batch size per GPU

  • gb: global batch size

  • dxmxp: the parallel layout in a run label (e.g. 2x2x4), where

    • d: data-parallel-size

    • m: tensor-model-parallel-size

    • p: pipeline-model-parallel-size

  • 1n1g: 1 node, 1 gpu

  • 2n8g: 2 nodes, 8 gpus per node, 16 gpus in total

  • 4n8g: 4 nodes, 8 gpus per node, 32 gpus in total

  • grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size) (see the parsing sketch after this list)

  • samples/s: throughput
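
Putting these conventions together, the short sketch below decodes a run label from the tables and applies the grad_acc_num_step formula above. The regular expression, function name, and dictionary keys are our own illustration and are not part of LiBai or Megatron-LM.

```python
import re

# Pattern for labels such as 'nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g' (illustrative only).
LABEL = re.compile(
    r"nl(?P<nl>\d+)_fp16_(?P<d>\d+)x(?P<m>\d+)x(?P<p>\d+)"
    r"(?P<ac>_ac)?_mb(?P<mb>\d+)_gb(?P<gb>\d+)_(?P<n>\d+)n(?P<g>\d+)g"
)

def parse_run_label(label: str) -> dict:
    """Decode a benchmark run label into its components."""
    m = LABEL.fullmatch(label)
    cfg = {
        "num_layers": int(m["nl"]),
        "data_parallel_size": int(m["d"]),
        "tensor_model_parallel_size": int(m["m"]),
        "pipeline_model_parallel_size": int(m["p"]),
        "activation_checkpointing": m["ac"] is not None,
        "micro_batch_size": int(m["mb"]),
        "global_batch_size": int(m["gb"]),
        "num_gpus": int(m["n"]) * int(m["g"]),  # nodes * GPUs per node
    }
    # grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)
    cfg["grad_acc_num_step"] = cfg["global_batch_size"] // (
        cfg["micro_batch_size"] * cfg["data_parallel_size"]
    )
    return cfg

# Example: the 3-D parallel BERT entry 'nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g'
# gives d=2, m=2, p=4 (2*2*4 = 16 GPUs on 2 nodes) and
# grad_acc_num_step = 2048 / (128 * 2) = 8.
print(parse_run_label("nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g"))
```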

Data Parallel

| BERT | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_1x1x1_mb24_gb24_1n1g | 46.91 samples/s | 42.6 samples/s |
| nl24_fp16_4x1x1_mb16_gb64_1n4g | 176.88 samples/s | 154.7 samples/s |
| nl24_fp16_8x1x1_mb16_gb128_1n8g | 351.57 samples/s | 309.2 samples/s |
| nl24_fp16_16x1x1_mb16_gb256_2n8g | 675.87 samples/s | 534.7 samples/s |
| nl24_fp16_32x1x1_mb16_gb512_4n8g | 1207.65 samples/s | 950.3 samples/s |

| GPT-2 | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_1x1x1_mb6_gb6_1n1g | 17.52 samples/s | 15.5 samples/s |
| nl24_fp16_4x1x1_mb4_gb16_1n4g | 63.45 samples/s | 53.3 samples/s |
| nl24_fp16_8x1x1_mb4_gb32_1n8g | 125.64 samples/s | 107.9 samples/s |
| nl24_fp16_16x1x1_mb4_gb64_2n8g | 215.35 samples/s | 176.0 samples/s |
| nl24_fp16_32x1x1_mb4_gb128_4n8g | 329.58 samples/s | 296.6 samples/s |

Tensor Model Parallel

| BERT | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g | 35.74 samples/s | 33.6 samples/s |
| nl24_fp16_1x4x1_ac_mb128_gb1024_1n4g | 87.12 samples/s | 86.6 samples/s |
| nl24_fp16_1x8x1_ac_mb128_gb1024_1n8g | 131.94 samples/s | 128.7 samples/s |

| GPT-2 | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_1x1x1_mb6_gb6_1n1g | 17.52 samples/s | 15.5 samples/s |
| nl24_fp16_1x4x1_mb6_gb6_1n4g | 40.38 samples/s | 38.0 samples/s |
| nl24_fp16_1x8x1_mb8_gb8_1n8g | 60.53 samples/s | 55.7 samples/s |

Pipeline Model Parallel

| BERT | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g | 35.74 samples/s | 33.6 samples/s |
| nl24_fp16_1x1x4_ac_mb128_gb1024_1n4g | 103.6 samples/s | 88.7 samples/s |
| nl48_fp16_1x1x8_ac_mb64_gb1024_1n8g | 94.4 samples/s | 85.5 samples/s |

| GPT-2 | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_1x1x1_ac_mb32_gb256_1n1g | 14.43 samples/s | 13.3 samples/s |
| nl24_fp16_1x1x4_ac_mb32_gb256_1n4g | 41.9 samples/s | 33.2 samples/s |
| nl48_fp16_1x1x8_ac_mb24_gb384_1n8g | 37.4 samples/s | 31.8 samples/s |

2-D Parallel

Data Parallel + Tensor Model Parallel

| BERT | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_2x2x1_ac_mb128_gb2048_1n4g | 88.47 samples/s | 86.6 samples/s |
| nl24_fp16_4x2x1_ac_mb128_gb4096_1n8g | 175.94 samples/s | 172.0 samples/s |
| nl24_fp16_8x2x1_ac_mb128_gb8192_2n8g | 348.58 samples/s | 343.8 samples/s |
| nl24_fp16_2x8x1_ac_mb128_gb2048_2n8g | 261.78 samples/s | 255.8 samples/s |
| nl24_fp16_4x4x1_ac_mb128_gb2048_2n8g | 338.97 samples/s | 337.3 samples/s |

| GPT-2 | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_2x2x1_ac_mb32_gb512_1n4g | 37.63 samples/s | 36.9 samples/s |
| nl24_fp16_4x2x1_ac_mb32_gb1024_1n8g | 74.35 samples/s | 73.2 samples/s |
| nl24_fp16_8x2x1_ac_mb32_gb2048_2n8g | 148.94 samples/s | 146.5 samples/s |
| nl24_fp16_2x8x1_ac_mb32_gb512_2n8g | 116.04 samples/s | 109.1 samples/s |
| nl24_fp16_4x4x1_ac_mb32_gb512_2n8g | 141.25 samples/s | 138.1 samples/s |

Data Parallel + Pipeline Model Parallel

| BERT | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g | 207.31 samples/s | 175.0 samples/s |
| nl24_fp16_4x1x4_ac_mb128_gb4096_2n8g | 406.24 samples/s | 342.9 samples/s |
| nl24_fp16_8x1x4_ac_mb128_gb8192_4n8g | 805.04 samples/s | 650.7 samples/s |

| GPT-2 | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_2x1x4_ac_mb32_gb512_1n8g | 83.12 samples/s | 65.3 samples/s |
| nl24_fp16_4x1x4_ac_mb32_gb1024_2n8g | 164.23 samples/s | 128.4 samples/s |
| nl24_fp16_8x1x4_ac_mb32_gb2048_4n8g | 322.42 samples/s | 247.3 samples/s |

3-D Parallel

| BERT | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g | 267.39 samples/s | 233.7 samples/s |
| nl24_fp16_4x2x4_ac_mb192_gb6144_4n8g | 503.51 samples/s | 439.4 samples/s |
| nl24_fp16_2x4x4_ac_mb256_gb4096_4n8g | 405.75 samples/s | 338.7 samples/s |

| GPT-2 | LiBai | Megatron |
| --- | --- | --- |
| nl24_fp16_2x2x4_ac_mb32_gb1024_2n8g | 128.77 samples/s | 106.3 samples/s |
| nl24_fp16_4x2x4_ac_mb48_gb1536_4n8g | 209.32 samples/s | 179.5 samples/s |
| nl24_fp16_2x4x4_ac_mb64_gb1024_4n8g | 186.67 samples/s | 178.2 samples/s |