Frequently Asked Questions¶
We list some common problems encountered by users and the corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others solve them.
Training¶
“Loss goes NaN or very large”
Check if the dataset annotations are valid. The mask must take values in {0, 1}, where 1 marks tokens that are not masked and 0 marks tokens that are masked.
Check initializer_range in the config file. It can safely be set to 0.02 in most cases. If the model is very large, decreasing initializer_range is a good choice; for example, initializer_range can be set to 0.006 when training the 175-billion-parameter GPT-3 configuration.
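As a quick illustration, here is a minimal sketch of both checks. The mask convention follows the answer above (1 = not masked, 0 = masked); the config override lines are an assumption about where initializer_range lives in your own config file.

# Hedged sketch of the two checks above; field paths in the comments are illustrative.
mask_row = [1, 1, 1, 0, 0]  # 1 = not masked (attend), 0 = masked
assert set(mask_row) <= {0, 1}, "mask values must be 0 or 1"

# In the model config (illustrative field path, verify against your own config file):
# model.cfg.initializer_range = 0.02   # safe default for most model sizes
# model.cfg.initializer_range = 0.006  # e.g. for a 175B-parameter GPT-3 configuration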
“AMP enabled goes NaN”
Set ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS=1 to check what triggers an overflow of the fp16 value range.
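A minimal sketch of turning the check on from Python is shown below; exporting the variable in the shell before launching the training script works just as well, and setting it before importing oneflow is an assumption about when the flag is read.

# Hedged sketch: enable OneFlow's numerics check before oneflow is imported.
import os

os.environ["ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS"] = "1"

import oneflow as flow  # subsequent kernels are checked, so the op that overflows fp16 can be located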
“GPU out of memory during validation”
Decrease test_micro_batch_size, and use --fast-dev-run to quickly run through training and evaluation and check whether memory is sufficient.
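A config-override sketch is shown below, assuming your training config imports LiBai's shared train config and that it exposes test_micro_batch_size; verify both the relative import and the field name against your own config files.

# Hedged sketch inside a training config file (e.g. an illustrative configs/xxx_pretrain.py).
from .common.train import train  # assumption: the usual shared train config

train.test_micro_batch_size = 1  # shrink eval batches for the memory check
# then launch the run with --fast-dev-run for one abbreviated train/eval pass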
Model¶
“apply_query_key_layer_scaling in MultiheadAttention”
As the number of attention heads increases, some of the GEMMs inside the self-attention layer become smaller, and the number of elements in the self-attention softmax also increases. Scaling the query-key scores by the layer number, as in Megatron-LM, helps keep these computations numerically stable in fp16.
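To make the flag's effect concrete, below is a hedged sketch of query-key layer scaling in the Megatron-LM style: the query-key matmul is computed with an extra 1/layer_number factor so its fp16 output stays in range, and the factor is undone in fp32 before the softmax. The function and argument names are illustrative, not LiBai's actual API.

# Hedged sketch of query-key layer scaling (illustrative, not LiBai's API).
import math
import oneflow as flow

def scaled_attention_probs(query, key, layer_number):
    # query, key: [batch_size, num_heads, seq_len, head_size]
    head_size = query.shape[-1]
    coeff = float(max(1, layer_number))
    norm = 1.0 / (math.sqrt(head_size) * coeff)

    scores = flow.matmul(query, key.transpose(-2, -1)) * norm  # smaller values under fp16
    scores = scores.float() * coeff                            # undo the extra factor in fp32
    return flow.softmax(scores, dim=-1)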
“QKV implementation is not consistent with Hugging Face in self attention”

# query_key_value: [batch_size, seq_len, 3 * hidden_size]

# QKV in LiBai
query_key_value = query_key_value.view(bsz, -1, self.num_heads, 3 * self.head_size)
query_key_value = query_key_value.permute(0, 2, 1, 3)
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)

# QKV in Huggingface
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
query = query.view(query.size(0), query.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
key = key.view(key.size(0), key.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
value = value.view(value.size(0), value.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
In tensor parallelism, the chunk dimension and the flow.sbp.split dimension would be the same in Hugging Face's implementation, which can cause unexpected behaviors (e.g., changing the tensor's SBP unexpectedly). We also provide a tutorial on how to load Hugging Face weights correctly; please refer to How to use Huggingface's pretrained weights in LiBai for more details.
“the order of layer normalization and the residual connection”
This is critical to enable the scaling of BERT-style models beyond BERT-Large. The architecture with apply_residual_post_layernorm=False eliminates the instabilities observed with the original BERT architecture (apply_residual_post_layernorm=True) and also reaches a lower training loss, according to Megatron-LM.
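For reference, here is a minimal sketch of the two residual orderings the flag selects between; the class and attribute names are illustrative, not LiBai's actual layer implementation.

# Hedged sketch of the two residual orderings (illustrative names).
import oneflow.nn as nn

class Block(nn.Module):
    def __init__(self, hidden_size, apply_residual_post_layernorm=False):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in for attention/MLP
        self.apply_residual_post_layernorm = apply_residual_post_layernorm

    def forward(self, x):
        normed = self.norm(x)
        # True: residual taken after layernorm (the original BERT ordering referred to above);
        # False: residual taken from the raw layer input, which scales better per Megatron-LM.
        residual = normed if self.apply_residual_post_layernorm else x
        return residual + self.sublayer(normed)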
If you run into problems that are hard to understand, feel free to open an issue in OneFlow to collect feedback.