Frequently Asked Questions
We list some common problems encountered by users and the corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others solve them.
Training
“Loss goes NaN or very large”
Check if the dataset annotations are valid. Mask values must be in {0, 1}, where 1 marks tokens that are not masked and 0 marks tokens that are masked.
Check initializer_range in the config file. It can be safely set to 0.02 in most cases. If the model is very large, decreasing initializer_range is a good choice. For example, initializer_range can be set to 0.006 when training the 175-billion-parameter GPT-3 configuration.
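The mask check described above can be automated before training starts. The following is a minimal sketch, not part of LiBai: the helper name find_invalid_mask_values is hypothetical, and it assumes the mask has already been converted to nested Python lists (for example with .tolist() on a tensor).

```python
def find_invalid_mask_values(mask):
    """Return the sorted distinct values in a nested mask that are not 0 or 1."""
    bad = set()
    stack = [mask]
    while stack:
        item = stack.pop()
        if isinstance(item, (list, tuple)):
            # Descend into nested rows (batch, sequence, ...).
            stack.extend(item)
        elif item not in (0, 1):
            bad.add(item)
    return sorted(bad)


# A valid padding mask: 1 = token is kept, 0 = token is masked.
good_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
# A mask that accidentally carries label ids or a -1 sentinel.
broken_mask = [[1, 2, 0], [0, -1, 1]]

print(find_invalid_mask_values(good_mask))
print(find_invalid_mask_values(broken_mask))
```

An empty result means every entry is 0 or 1; any other result lists the offending values, which usually points to a preprocessing step that wrote something other than a binary flag into the mask.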
“AMP enabled goes NaN”
Set ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS=1 to check what triggers an overflow of the value range in fp16.
“GPU out of memory during validation”
Decrease test_micro_batch_size and use --fast-dev-run to quickly run through training and evaluation and check whether memory is sufficient.
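Regarding the AMP question earlier in this section, it can help to see what "overflow of the value range in fp16" means in practice. The sketch below uses only the Python standard library and does not involve OneFlow; the helper name fits_in_fp16 is hypothetical. It relies on the fact that struct refuses to pack a finite value that is too large for the IEEE half-precision format, whose largest finite value is 65504.

```python
import struct

FP16_MAX = 65504.0  # largest finite value representable in IEEE half precision


def fits_in_fp16(value):
    """Return True if a float can be stored in fp16 without overflowing."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except OverflowError:
        return False
    return True


# An unscaled dot product of two moderately large activations already overflows.
score = 300.0 * 300.0
print(fits_in_fp16(FP16_MAX))
print(fits_in_fp16(score))
```

Intermediate values of this size are typical of what a numerics check flags, for example unscaled attention scores or a loss scale that is too large.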
Model
“apply_query_key_layer_scaling in MultiheadAttention”
As the number of attention heads increases, some of the GEMMs inside the self-attention layer become smaller and the number of elements in the self-attention softmax also increases.
“QKV implementation is not consistent with Hugging Face in self attention”
    # query_key_value: [batch_size, seq_len, 3*hidden_size]

    # QKV in LiBai
    query_key_value = query_key_value.view(bsz, -1, self.num_heads, 3 * self.head_size)
    query_key_value = query_key_value.permute(0, 2, 1, 3)
    query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)

    # QKV in Huggingface
    query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
    query = query.view(query.size(0), query.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
    key = key.view(key.size(0), key.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
    value = value.view(value.size(0), value.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
In tensor parallelism, the chunk dimension and the flow.sbp.split dimension are the same in Huggingface’s implementation, which can cause unexpected behavior (e.g., changing the tensor’s SBP unexpectedly).
We also provide a tutorial on how to load Huggingface weights correctly. Please refer to How to use Huggingface’s pretrained weights in LiBai for more details.
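The difference between the two layouts can be made concrete with plain index arithmetic. The sketch below is not LiBai code: all helper names are hypothetical. It assumes the fused projection's last dimension is either grouped per head as [q, k, v] (the view-then-chunk order shown for LiBai above) or laid out as all of q, then all of k, then all of v (the chunk-then-view order). It lists which fused columns each of q, k, and v reads, and builds the permutation that reorders a Huggingface-style fused dimension into the LiBai-style one.

```python
NAMES = ("q", "k", "v")


def libai_columns(num_heads, head_size):
    """Fused-dimension columns read by q, k, v when each head stores [q, k, v] together."""
    cols = {name: [] for name in NAMES}
    for head in range(num_heads):
        base = head * 3 * head_size
        for part, name in enumerate(NAMES):
            start = base + part * head_size
            cols[name].extend(range(start, start + head_size))
    return cols


def hf_columns(num_heads, head_size):
    """Fused-dimension columns read by q, k, v when the dimension is [all q | all k | all v]."""
    hidden = num_heads * head_size
    return {
        name: list(range(part * hidden, (part + 1) * hidden))
        for part, name in enumerate(NAMES)
    }


def hf_to_libai_permutation(num_heads, head_size):
    """perm[i] is the Huggingface-order index whose entry belongs at LiBai-order position i."""
    hidden = num_heads * head_size
    perm = []
    for head in range(num_heads):
        for part in range(3):
            for offset in range(head_size):
                perm.append(part * hidden + head * head_size + offset)
    return perm


print(libai_columns(2, 2))
print(hf_columns(2, 2))
print(hf_to_libai_permutation(2, 2))
```

With 2 heads of size 2, the first half of the per-head layout (columns 0 to 5) holds q, k, and v for head 0, so splitting that dimension across two ranks gives each rank whole heads. The same half of the other layout holds all of q plus half of k, which is why the chunk and the split interfere when they act on the same dimension. The permutation is the reordering that has to be applied along the fused output dimension of the weight and bias when loading weights stored in the second layout.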
“The order of layer normalization and the residual connection”
This is critical to enable the scaling of BERT-style models beyond BERT-Large. The architecture with apply_residual_post_layernorm=False eliminates instabilities observed with the original BERT architecture, which uses apply_residual_post_layernorm=True, and also has a lower training loss according to Megatron-LM.
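The effect of the flag can be illustrated on a single sublayer. The following is a minimal sketch in plain Python, following the structure used in Megatron-LM-style transformer layers, and is not the LiBai implementation: layer_norm here has no learnable parameters, and residual_block and the toy sublayer are hypothetical names. The only thing the flag changes is which tensor feeds the residual connection: the input itself (False) or its layer-normalized version (True).

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(hidden_states, sublayer, apply_residual_post_layernorm):
    """Apply layer norm, then a sublayer, then add the residual selected by the flag."""
    layernorm_output = layer_norm(hidden_states)
    sublayer_output = sublayer(layernorm_output)
    if apply_residual_post_layernorm:
        residual = layernorm_output  # residual taken after layer norm
    else:
        residual = hidden_states  # residual bypasses layer norm entirely
    return [r + s for r, s in zip(residual, sublayer_output)]


def zero_sublayer(x):
    """Toy stand-in for attention or the MLP that contributes nothing."""
    return [0.0 for _ in x]


x = [1.0, 2.0, 3.0, 10.0]
print(residual_block(x, zero_sublayer, apply_residual_post_layernorm=False))
print(residual_block(x, zero_sublayer, apply_residual_post_layernorm=True))
```

With the flag set to False, the input passes through the block unchanged when the sublayer contributes nothing, so there is an unnormalized identity path through the layer. With the flag set to True, the residual itself is renormalized at every layer.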
If you run into problems that are hard to understand, feel free to open an issue in OneFlow to collect feedback.