libai.models¶
Supported models in LiBai (李白).
libai.models.build.build_graph(cfg, model, optimizer=None, lr_scheduler=None, is_train=False)[source]¶
Build the nn.Graph, defined by cfg.graph.
libai.models.build.build_model(cfg)[source]¶
Build the whole model architecture, defined by cfg.model. Note that it does not load any weights from cfg.
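For orientation, a minimal sketch of how an architecture is usually described and then materialized with build_model. The LazyCall helper and the practice of passing the model node (what the trainer takes from cfg.model) are assumptions based on LiBai's standard config pattern rather than something stated on this page; the hyperparameters are illustrative.

from libai.config import LazyCall
from libai.models.build import build_model
from libai.models.vision_transformer import VisionTransformer

# Describe the architecture lazily, as LiBai config files do.
model_cfg = LazyCall(VisionTransformer)(
    img_size=224,
    patch_size=16,
    embed_dim=192,
    depth=12,
    num_heads=3,
    num_classes=1000,
)

# Materialize the architecture; no weights are loaded from the config.
model = build_model(model_cfg)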
VisionTransformer¶
class libai.models.vision_transformer.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4.0, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, num_classes=1000, loss_func=None)[source]¶
Vision Transformer in LiBai.
LiBai's implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Parameters
img_size (int, tuple(int)) – input image size
patch_size (int, tuple(int)) – patch size
in_chans (int) – number of input channels
embed_dim (int) – embedding dimension
depth (int) – depth of transformer
num_heads (int) – number of attention heads
mlp_ratio (int) – ratio of mlp hidden dim to embedding dim
drop_rate (float) – dropout rate
attn_drop_rate (float) – attention dropout rate
drop_path_rate (float) – stochastic depth rate
num_classes (int) – number of classes for classification head
loss_func (callable, optional) – loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
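A forward-pass sketch, not taken from the entry above: it assumes LiBai's default single-process environment with a GPU available, and that the input batch is converted to a global tensor placed like the model's layers (via libai.utils.distributed.get_layer_placement); shapes and sizes are illustrative.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.vision_transformer import VisionTransformer

model = VisionTransformer(img_size=224, patch_size=16, embed_dim=192, depth=12, num_heads=3)
model.eval()

# LiBai layers operate on global tensors, so place the batch accordingly.
images = flow.randn(2, 3, 224, 224).to_global(
    sbp=flow.sbp.split(0), placement=dist.get_layer_placement(0)
)

outputs = model(images)                   # no labels: evaluation branch
logits = outputs["prediction_scores"]     # shape (2, num_classes)

Calling the model with labels while it is in training mode switches the returned dict to {"losses": loss_value}, as described above.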
SwinTransformer¶
class libai.models.swin_transformer.SwinTransformer(img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_size=7, mlp_ratio=4.0, qkv_bias=True, qk_scale=None, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, norm_layer=<class 'libai.layers.layer_norm.LayerNorm'>, ape=False, patch_norm=True, loss_func=None, **kwargs)[source]¶
Swin Transformer in LiBai.
LiBai's implementation of: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Parameters
img_size (int, tuple(int)) – Input image size. Default: 224
patch_size (int, tuple(int)) – Patch size. Default: 4
in_chans (int) – Number of input image channels. Default: 3
num_classes (int) – Number of classes for classification head. Default: 1000
embed_dim (int) – Patch embedding dimension. Default: 96
depths (tuple(int)) – Depth of each Swin Transformer layer.
num_heads (tuple(int)) – Number of attention heads in different layers.
window_size (int) – Window size. Default: 7
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True
qk_scale (float) – Override default qk scale of head_dim ** -0.5 if set. Default: None
drop_rate (float) – Dropout rate. Default: 0
attn_drop_rate (float) – Attention dropout rate. Default: 0
drop_path_rate (float) – Stochastic depth rate. Default: 0.1
norm_layer (nn.Module) – Normalization layer. Default: libai.layers.LayerNorm.
ape (bool) – If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool) – If True, add normalization after patch embedding. Default: True
loss_func (callable, optional) – Loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
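The training branch can be sketched the same way: supplying labels to a module in training mode yields the loss dict. This assumes the same single-process global-tensor setup as the VisionTransformer sketch above, and that a default loss function is used when loss_func is None (an implementation detail not documented here); hyperparameters are illustrative.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.swin_transformer import SwinTransformer

model = SwinTransformer(img_size=224, patch_size=4, embed_dim=96,
                        depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24])
model.train()

placement = dist.get_layer_placement(0)
images = flow.randn(2, 3, 224, 224).to_global(sbp=flow.sbp.split(0), placement=placement)
labels = flow.randint(0, 1000, (2,)).to_global(sbp=flow.sbp.split(0), placement=placement)

outputs = model(images, labels)           # training mode: {"losses": loss_value}
loss = outputs["losses"]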
SwinTransformerV2¶
class libai.models.swin_transformer_v2.SwinTransformerV2(img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_size=7, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, norm_layer=<class 'libai.layers.layer_norm.LayerNorm'>, ape=False, patch_norm=True, pretrained_window_sizes=[0, 0, 0, 0], loss_func=None)[source]¶
Swin Transformer V2 in LiBai.
- Parameters
img_size (int, tuple(int)) – Input image size. Default: 224
patch_size (int, tuple(int)) – Patch size. Default: 4
in_chans (int) – Number of input image channels. Default: 3
num_classes (int) – Number of classes for classification head. Default: 1000
embed_dim (int) – Patch embedding dimension. Default: 96
depths (tuple(int)) – Depth of each Swin Transformer layer.
num_heads (tuple(int)) – Number of attention heads in different layers.
window_size (int) – Window size. Default: 7
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True
drop_rate (float) – Dropout rate. Default: 0
attn_drop_rate (float) – Attention dropout rate. Default: 0
drop_path_rate (float) – Stochastic depth rate. Default: 0.1
norm_layer (nn.Module) – Normalization layer. Default: libai.layers.LayerNorm.
ape (bool) – If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool) – If True, add normalization after patch embedding. Default: True
pretrained_window_sizes (tuple(int)) – Pretrained window sizes of each layer. Default: [0, 0, 0, 0]
loss_func (callable, optional) – Loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
ResMLP¶
class libai.models.resmlp.ResMLP(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, drop_rate=0.0, drop_path_rate=0.0, init_scale=0.0001, num_classes=1000, loss_func=None)[source]¶
ResMLP in LiBai.
LiBai's implementation of: ResMLP: Feedforward networks for image classification with data-efficient training
- Parameters
img_size (int, tuple(int)) – input image size
patch_size (int, tuple(int)) – patch size
in_chans (int) – number of input channels
embed_dim (int) – embedding dimension
depth (int) – depth of transformer
drop_rate (float) – dropout rate
drop_path_rate (float) – stochastic depth rate
init_scale (float) – the layer scale ratio
num_classes (int) – number of classes for classification head
loss_func (callable, optional) – loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
BERT¶
class libai.models.bert_model.BertForPreTraining(cfg)[source]¶
Bert Model with two heads on top as done during pretraining: a masked language modeling head and a next sentence prediction (classification) head.
forward(input_ids, attention_mask, tokentype_ids=None, ns_labels=None, lm_labels=None, loss_mask=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
ns_labels (flow.LongTensor, optional) – Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring). Indices should be in [0, 1]: 0 indicates sequence B is a continuation of sequence A, 1 indicates sequence B is a random sequence.
lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size].
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]
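To make the documented input conventions concrete, a small sketch of how the masks and labels are typically shaped with plain OneFlow tensors; the batch size, sequence length, pad id, vocabulary size, and masked position are illustrative assumptions, not values taken from this page.

import oneflow as flow

batch_size, seq_len, pad_id, vocab_size = 2, 128, 0, 30522   # illustrative values

input_ids = flow.randint(0, vocab_size, (batch_size, seq_len))
input_ids[:, 100:] = pad_id                                   # pretend the tail is padding

# attention_mask: 1 for real tokens, 0 for padding tokens.
attention_mask = (input_ids != pad_id).to(flow.bool)

# lm_labels: -1 everywhere except positions that were masked out for MLM.
lm_labels = flow.full((batch_size, seq_len), -1)
lm_labels[:, 5] = 42                                          # pretend position 5 held token id 42

# loss_mask: compute the MLM loss only where a real label exists.
loss_mask = (lm_labels != -1).to(flow.bool)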
class libai.models.bert_model.BertModel(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, num_tokentypes=2, add_pooling_layer=True, initializer_range=0.02, layernorm_eps=1e-12, bias_gelu_fusion=True, bias_dropout_fusion=True, scale_mask_softmax_fusion=True, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
The bare Bert Model transformer outputting raw hidden-states without any specific head on top.
- Parameters
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
hidden_layers (int) – The number of TransformerLayer in the encoder.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
intermediate_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
hidden_dropout_prob (float, optional) – The dropout ratio for the output of each TransformerLayer. Defaults to 0.0.
attention_probs_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.
max_position_embeddings (int) – Max sequence length of input, which defines the shape of the position embeddings in BertEmbedding.
num_tokentypes (int, optional) – Number of segment token indices. Defaults to 2.
add_pooling_layer (bool, optional) – Whether or not to average or pool the sequence of hidden states for the whole input sequence. Defaults to True.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
layernorm_eps (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-12.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to True.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to True.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to True.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.
forward(input_ids, attention_mask, tokentype_ids=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
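A bare-model sketch: constructing a deliberately tiny BertModel and running one forward pass. It assumes LiBai's default single-process setup with a GPU, global-tensor inputs placed via libai.utils.distributed.get_layer_placement, and that the model returns the raw hidden states (plus a pooled output when add_pooling_layer=True, which is not spelled out on this page); the hyperparameters are illustrative only.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.bert_model import BertModel

model = BertModel(
    vocab_size=30522,
    hidden_size=256,
    hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
)
model.eval()

placement = dist.get_layer_placement(0)
input_ids = flow.randint(0, 30522, (2, 128)).to_global(
    sbp=flow.sbp.split(0), placement=placement
)
attention_mask = flow.ones(2, 128, dtype=flow.bool).to_global(
    sbp=flow.sbp.split(0), placement=placement
)

outputs = model(input_ids, attention_mask)   # raw hidden states (and pooled output, if enabled)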
RoBERTa¶
class libai.models.roberta_model.RobertaForCausalLM(cfg)[source]¶
forward(input_ids, attention_mask, tokentype_ids=None, position_ids=None, labels=None, loss_mask=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
position_ids (flow.LongTensor, optional) – Indices of positions of each input sequence tokens in the position embeddings. Defaults to None.
labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. Defaults to None.
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]. Defaults to None.
class libai.models.roberta_model.RobertaForPreTraining(cfg)[source]¶
forward(input_ids, attention_mask, tokentype_ids=None, lm_labels=None, loss_mask=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. Defaults to None.
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]. Defaults to None.
class libai.models.roberta_model.RobertaModel(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, num_tokentypes=2, add_pooling_layer=True, initializer_range=0.02, layernorm_eps=1e-12, pad_token_id=1, bias_gelu_fusion=True, bias_dropout_fusion=True, scale_mask_softmax_fusion=True, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
The bare Roberta Model transformer outputting raw hidden-states without any specific head on top.
- Parameters
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
hidden_layers (int) – The number of TransformerLayer in the encoder.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
intermediate_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
hidden_dropout_prob (float, optional) – The dropout ratio for the output of each TransformerLayer. Defaults to 0.0.
attention_probs_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.
max_position_embeddings (int) – Max sequence length of input, which defines the shape of the position embeddings in RobertaEmbeddings.
num_tokentypes (int, optional) – Number of segment token indices. Defaults to 2.
add_pooling_layer (bool, optional) – Whether or not to average or pool the sequence of hidden states for the whole input sequence. Defaults to True.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
layernorm_eps (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-12.
pad_token_id (int, optional) – The token id used for padding. Defaults to 1.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to True.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to True.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to True.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT (RoBERTa) residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.
T5¶
class libai.models.t5_model.T5ForPreTraining(cfg)[source]¶
T5 Model with a classification head on top.
forward(encoder_input_ids, decoder_input_ids, encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask, lm_labels=None, loss_mask=None, use_cache=False)[source]¶
- Parameters
encoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for encoder.
decoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for decoder.
encoder_attn_mask (flow.BoolTensor) – Mask for encoder to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on subsequent token indices. Mask values have the same meaning as encoder_attn_mask.
encoder_decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on encoder padded token indices. Mask values have the same meaning as encoder_attn_mask.
lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. None for evaluating.
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]. None for evaluating.
use_cache (bool, optional) – Set to True when the model is in the inference phase and used for incremental decoding. Defaults to False.
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"masked_lm_loss": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
class libai.models.t5_model.T5Model(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, embedding_dropout_prob, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, initializer_range=0.02, layernorm_eps=1e-12, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
T5 Model that outputs logits.
- Parameters
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
hidden_layers (int) – The number of TransformerLayer in the encoder and decoder.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
intermediate_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
embedding_dropout_prob (float) – The dropout ratio for the output of the T5Embedding layer.
hidden_dropout_prob (float) – The dropout ratio for the output of each TransformerLayer.
attention_probs_dropout_prob (float) – The dropout ratio for the output of each attention layer in TransformerLayer.
max_position_embeddings (int) – Max sequence length of input, which defines the shape of the position embeddings in T5Embedding.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
layernorm_eps (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-12.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to False.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to False.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to False.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight in the T5 model. Defaults to False.
forward(encoder_input_ids, decoder_input_ids, encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask, use_cache=False)[source]¶
- Parameters
encoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for encoder.
decoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for decoder.
encoder_attn_mask (flow.BoolTensor) – Mask for encoder to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on subsequent token indices. Mask values have the same meaning as encoder_attn_mask.
encoder_decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on encoder padded token indices. Mask values have the same meaning as encoder_attn_mask.
use_cache (bool, optional) – Set to True when the model is in the inference phase and used for incremental decoding. Defaults to False.
- Returns
logits
- Return type
flow.Tensor
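An encoder-decoder forward sketch for T5Model, again assuming a single-process global-tensor setup with a GPU; the tiny hyperparameters and, in particular, the [batch, query_len, key_len] mask shapes are assumptions about how the data pipeline prepares inputs, not facts stated on this page.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.t5_model import T5Model

model = T5Model(
    vocab_size=32128,
    hidden_size=256,
    hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    embedding_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
)
model.eval()

placement = dist.get_layer_placement(0)

def to_global(t):
    # LiBai models consume global tensors split along the batch dimension.
    return t.to_global(sbp=flow.sbp.split(0), placement=placement)

encoder_input_ids = to_global(flow.randint(0, 32128, (2, 64)))
decoder_input_ids = to_global(flow.randint(0, 32128, (2, 16)))
encoder_attn_mask = to_global(flow.ones(2, 64, 64, dtype=flow.bool))
decoder_attn_mask = to_global(flow.tril(flow.ones(2, 16, 16)).to(flow.bool))   # causal mask
encoder_decoder_attn_mask = to_global(flow.ones(2, 16, 64, dtype=flow.bool))

logits = model(encoder_input_ids, decoder_input_ids, encoder_attn_mask,
               decoder_attn_mask, encoder_decoder_attn_mask)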
GPT-2¶
class libai.models.gpt_model.GPTForPreTraining(cfg)[source]¶
GPT Model with a classification head on top.
forward(input_ids, labels=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
labels (flow.LongTensor, optional) – Labels for computing language modeling loss. None for evaluating. Defaults to None.
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"masked_lm_loss": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
class libai.models.gpt_model.GPTModel(hidden_layers, vocab_size, hidden_size, ffn_hidden_size, num_attention_heads, max_seq_length=1024, embedding_dropout_prob=0.0, attention_dropout_prob=0.0, output_dropout_prob=0.0, layernorm_epsilon=1e-05, initializer_range=0.02, use_scaled_init_for_output_weights=True, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
GPT-2 language model. The output of the forward method is logits.
- Parameters
hidden_layers (int) – The number of TransformerLayer in the GPT model.
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
ffn_hidden_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
max_seq_length (int, optional) – Max sequence length of input, which defines the shape of the position embeddings in GPTEmbedding. Defaults to 1024.
embedding_dropout_prob (float, optional) – The dropout ratio for the output of the GPTEmbedding layer. Defaults to 0.0.
attention_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.
output_dropout_prob (float, optional) – The dropout ratio for the output of each TransformerLayer. Defaults to 0.0.
layernorm_epsilon (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-5.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
use_scaled_init_for_output_weights (bool, optional) – Defaults to True.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to False.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to False.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to False.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to False.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.