Benchmarks

Here we provide the benchmark speed test results of LiBai’s models compared with the Megatron-LM implementations. In LiBai V0.2.0, we benchmarked speed on up to 32 GPUs across 4 nodes, and all experiments were conducted under the same settings for a fair comparison.

Settings

Environments

  • The LiBai commit used for comparison: commit

  • The OneFlow commit used for comparison: commit

  • The Megatron-LM commit used for comparison: commit

Model Hyper-parameters

  • BERT Model

num_layers = 24/48
num_attention_heads = 16
hidden_size = 1024
seq_length = 512
  • GPT-2 Model

num_layers = 24/48
num_attention_heads = 16
hidden_size = 1024
seq_length = 1024
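
As a rough sanity check on model scale, the usual transformer parameter-count approximation (about 12 * num_layers * hidden_size^2 for the transformer blocks, ignoring embeddings; this estimate is ours, not from the benchmark itself) puts both 24-layer models at roughly 300M parameters:

```python
def approx_transformer_params(num_layers: int, hidden_size: int) -> int:
    """Rough parameter count of the transformer blocks only.

    Uses the common approximation of ~12 * h^2 parameters per layer
    (4h^2 for attention, 8h^2 for the MLP); embeddings are ignored.
    """
    return 12 * num_layers * hidden_size ** 2

print(approx_transformer_params(24, 1024) / 1e6)  # ~302.0M (both models, nl = 24)
print(approx_transformer_params(48, 1024) / 1e6)  # ~604.0M (nl = 48 pipeline runs)
```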

Main Results

Here we explain the evaluation indicators used in the following tables:

  • fp16: mixed precision training

  • nl: num layers (when the pipeline-parallel size is 8, we increase num layers from 24 to 48 so that each pipeline stage holds a reasonable number of layers for computation)

  • ac: activation checkpointing enabled

  • mb: micro-batch size per GPU

  • gb: total global batch size

  • d x m x p:

    • d: data-parallel-size

    • m: tensor-model-parallel-size

    • p: pipeline-model-parallel-size

  • 1n1g: 1 node, 1 GPU

  • 2n8g: 2 nodes, 8 GPUs per node (16 GPUs in total)

  • 4n8g: 4 nodes, 8 GPUs per node (32 GPUs in total)

  • grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size) (see the sketch after this list)

  • samples/s: throughput
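
To make the run names concrete, here is a small Python sketch (our own illustration, not a script from LiBai or Megatron-LM) that parses a name such as nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g and recomputes the derived quantities from the definitions above:

```python
import re

# Parse a run name such as "nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g" into the
# fields defined in the list above. The parser and its field names are our
# own illustration; they are not part of LiBai or Megatron-LM.
RUN_NAME = re.compile(
    r"nl(?P<nl>\d+)_fp16_"
    r"(?P<d>\d+)x(?P<m>\d+)x(?P<p>\d+)_"
    r"(?:ac_)?"                                   # activation checkpointing is optional
    r"mb(?P<mb>\d+)_gb(?P<gb>\d+)_"
    r"(?P<nodes>\d+)n(?P<gpn>\d+)g"
)

def describe(run_name: str) -> dict:
    fields = {k: int(v) for k, v in RUN_NAME.match(run_name).groupdict().items()}
    total_gpus = fields["nodes"] * fields["gpn"]  # e.g. 2n8g -> 16 GPUs
    assert fields["d"] * fields["m"] * fields["p"] == total_gpus, \
        "d * m * p must cover all GPUs"
    # grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)
    fields["grad_acc_num_step"] = fields["gb"] // (fields["mb"] * fields["d"])
    fields["total_gpus"] = total_gpus
    return fields

print(describe("nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g"))
# {'nl': 24, 'd': 2, 'm': 2, 'p': 4, 'mb': 128, 'gb': 2048,
#  'nodes': 2, 'gpn': 8, 'grad_acc_num_step': 8, 'total_gpus': 16}
```

For every row in the tables below, d x m x p equals the total number of GPUs, and grad_acc_num_step follows from the formula above.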

Data Parallel

| BERT | LiBai | Megatron |
|------|-------|----------|
| nl24_fp16_1x1x1_mb24_gb24_1n1g | 46.91 samples/s | 42.6 samples/s |
| nl24_fp16_4x1x1_mb16_gb64_1n4g | 176.88 samples/s | 154.7 samples/s |
| nl24_fp16_8x1x1_mb16_gb128_1n8g | 351.57 samples/s | 309.2 samples/s |
| nl24_fp16_16x1x1_mb16_gb256_2n8g | 675.87 samples/s | 534.7 samples/s |
| nl24_fp16_32x1x1_mb16_gb512_4n8g | 1207.65 samples/s | 950.3 samples/s |

| GPT-2 | LiBai | Megatron |
|-------|-------|----------|
| nl24_fp16_1x1x1_mb6_gb6_1n1g | 17.52 samples/s | 15.5 samples/s |
| nl24_fp16_4x1x1_mb4_gb16_1n4g | 63.45 samples/s | 53.3 samples/s |
| nl24_fp16_8x1x1_mb4_gb32_1n8g | 125.64 samples/s | 107.9 samples/s |
| nl24_fp16_16x1x1_mb4_gb64_2n8g | 215.35 samples/s | 176.0 samples/s |
| nl24_fp16_32x1x1_mb4_gb128_4n8g | 329.58 samples/s | 296.6 samples/s |
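
The gap between the two implementations is easier to read as a ratio. Below is a minimal sketch of our own post-processing of the GPT-2 rows above (not part of either codebase):

```python
# LiBai vs. Megatron-LM throughput (samples/s) for the data-parallel GPT-2
# rows above; the speedup computation itself is our own post-processing.
gpt2_dp = {
    "nl24_fp16_1x1x1_mb6_gb6_1n1g":    (17.52, 15.5),
    "nl24_fp16_4x1x1_mb4_gb16_1n4g":   (63.45, 53.3),
    "nl24_fp16_8x1x1_mb4_gb32_1n8g":   (125.64, 107.9),
    "nl24_fp16_16x1x1_mb4_gb64_2n8g":  (215.35, 176.0),
    "nl24_fp16_32x1x1_mb4_gb128_4n8g": (329.58, 296.6),
}

for run, (libai, megatron) in gpt2_dp.items():
    print(f"{run}: {libai / megatron:.2f}x")
# 1 GPU: 1.13x, 4 GPUs: 1.19x, 8 GPUs: 1.16x, 16 GPUs: 1.22x, 32 GPUs: 1.11x
```

On these data-parallel runs, LiBai is roughly 1.1x to 1.2x faster than Megatron-LM.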

Tensor Model Parallel

| BERT | LiBai | Megatron |
|------|-------|----------|
| nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g | 35.74 samples/s | 33.6 samples/s |
| nl24_fp16_1x4x1_ac_mb128_gb1024_1n4g | 87.12 samples/s | 86.6 samples/s |
| nl24_fp16_1x8x1_ac_mb128_gb1024_1n8g | 131.94 samples/s | 128.7 samples/s |

| GPT-2 | LiBai | Megatron |
|-------|-------|----------|
| nl24_fp16_1x1x1_mb6_gb6_1n1g | 17.52 samples/s | 15.5 samples/s |
| nl24_fp16_1x4x1_mb6_gb6_1n4g | 40.38 samples/s | 38.0 samples/s |
| nl24_fp16_1x8x1_mb8_gb8_1n8g | 60.53 samples/s | 55.7 samples/s |

Pipeline Model Parallel

| BERT | LiBai | Megatron |
|------|-------|----------|
| nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g | 35.74 samples/s | 33.6 samples/s |
| nl24_fp16_1x1x4_ac_mb128_gb1024_1n4g | 103.6 samples/s | 88.7 samples/s |
| nl48_fp16_1x1x8_ac_mb64_gb1024_1n8g | 94.4 samples/s | 85.5 samples/s |

| GPT-2 | LiBai | Megatron |
|-------|-------|----------|
| nl24_fp16_1x1x1_ac_mb32_gb256_1n1g | 14.43 samples/s | 13.3 samples/s |
| nl24_fp16_1x1x4_ac_mb32_gb256_1n4g | 41.9 samples/s | 33.2 samples/s |
| nl48_fp16_1x1x8_ac_mb24_gb384_1n8g | 37.4 samples/s | 31.8 samples/s |

2-D Parallel

Data Parallel + Tensor Model Parallel

| BERT | LiBai | Megatron |
|------|-------|----------|
| nl24_fp16_2x2x1_ac_mb128_gb2048_1n4g | 88.47 samples/s | 86.6 samples/s |
| nl24_fp16_4x2x1_ac_mb128_gb4096_1n8g | 175.94 samples/s | 172.0 samples/s |
| nl24_fp16_8x2x1_ac_mb128_gb8192_2n8g | 348.58 samples/s | 343.8 samples/s |
| nl24_fp16_2x8x1_ac_mb128_gb2048_2n8g | 261.78 samples/s | 255.8 samples/s |
| nl24_fp16_4x4x1_ac_mb128_gb2048_2n8g | 338.97 samples/s | 337.3 samples/s |

| GPT-2 | LiBai | Megatron |
|-------|-------|----------|
| nl24_fp16_2x2x1_ac_mb32_gb512_1n4g | 37.63 samples/s | 36.9 samples/s |
| nl24_fp16_4x2x1_ac_mb32_gb1024_1n8g | 74.35 samples/s | 73.2 samples/s |
| nl24_fp16_8x2x1_ac_mb32_gb2048_2n8g | 148.94 samples/s | 146.5 samples/s |
| nl24_fp16_2x8x1_ac_mb32_gb512_2n8g | 116.04 samples/s | 109.1 samples/s |
| nl24_fp16_4x4x1_ac_mb32_gb512_2n8g | 141.25 samples/s | 138.1 samples/s |

Data Parallel + Pipeline Model Parallel

| BERT | LiBai | Megatron |
|------|-------|----------|
| nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g | 207.31 samples/s | 175.0 samples/s |
| nl24_fp16_4x1x4_ac_mb128_gb4096_2n8g | 406.24 samples/s | 342.9 samples/s |
| nl24_fp16_8x1x4_ac_mb128_gb8192_4n8g | 805.04 samples/s | 650.7 samples/s |

| GPT-2 | LiBai | Megatron |
|-------|-------|----------|
| nl24_fp16_2x1x4_ac_mb32_gb512_1n8g | 83.12 samples/s | 65.3 samples/s |
| nl24_fp16_4x1x4_ac_mb32_gb1024_2n8g | 164.23 samples/s | 128.4 samples/s |
| nl24_fp16_8x1x4_ac_mb32_gb2048_4n8g | 322.42 samples/s | 247.3 samples/s |

3-D Parallel

| BERT | LiBai | Megatron |
|------|-------|----------|
| nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g | 267.39 samples/s | 233.7 samples/s |
| nl24_fp16_4x2x4_ac_mb192_gb6144_4n8g | 503.51 samples/s | 439.4 samples/s |
| nl24_fp16_2x4x4_ac_mb256_gb4096_4n8g | 405.75 samples/s | 338.7 samples/s |

| GPT-2 | LiBai | Megatron |
|-------|-------|----------|
| nl24_fp16_2x2x4_ac_mb32_gb1024_2n8g | 128.77 samples/s | 106.3 samples/s |
| nl24_fp16_4x2x4_ac_mb48_gb1536_4n8g | 209.32 samples/s | 179.5 samples/s |
| nl24_fp16_2x4x4_ac_mb64_gb1024_4n8g | 186.67 samples/s | 178.2 samples/s |