libai.models¶
Supported models in LiBai (李白).
libai.models.build.build_graph(cfg, model, optimizer=None, lr_scheduler=None, is_train=False)[source]¶
Build the nn.Graph, defined by cfg.graph.
libai.models.build.build_model(cfg)[source]¶
Build the whole model architecture, defined by cfg.model. Note that it does not load any weights from cfg.
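For orientation, a minimal sketch of how an architecture is usually described and then materialized with build_model. The LazyCall helper and the practice of passing the model node (what the trainer takes from cfg.model) are assumptions based on LiBai's standard config pattern rather than something stated on this page; the hyperparameters are illustrative.

from libai.config import LazyCall
from libai.models.build import build_model
from libai.models.vision_transformer import VisionTransformer

# Describe the architecture lazily, as LiBai config files do.
model_cfg = LazyCall(VisionTransformer)(
    img_size=224,
    patch_size=16,
    embed_dim=192,
    depth=12,
    num_heads=3,
    num_classes=1000,
)

# Materialize the architecture; no weights are loaded from the config.
model = build_model(model_cfg)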
VisionTransformer¶
class libai.models.vision_transformer.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4.0, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, num_classes=1000, loss_func=None)[source]¶
Vision Transformer in LiBai.
LiBai's implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Parameters
img_size (int, tuple(int)) – input image size
patch_size (int, tuple(int)) – patch size
in_chans (int) – number of input channels
embed_dim (int) – embedding dimension
depth (int) – depth of transformer
num_heads (int) – number of attention heads
mlp_ratio (int) – ratio of mlp hidden dim to embedding dim
drop_rate (float) – dropout rate
attn_drop_rate (float) – attention dropout rate
drop_path_rate (float) – stochastic depth rate
num_classes (int) – number of classes for classification head
loss_func (callable, optional) – loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
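A forward-pass sketch, not taken from the entry above: it assumes LiBai's default single-process environment with a GPU available, and that the input batch is converted to a global tensor placed like the model's layers (via libai.utils.distributed.get_layer_placement); shapes and sizes are illustrative.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.vision_transformer import VisionTransformer

model = VisionTransformer(img_size=224, patch_size=16, embed_dim=192, depth=12, num_heads=3)
model.eval()

# LiBai layers operate on global tensors, so place the batch accordingly.
images = flow.randn(2, 3, 224, 224).to_global(
    sbp=flow.sbp.split(0), placement=dist.get_layer_placement(0)
)

outputs = model(images)                   # no labels: evaluation branch
logits = outputs["prediction_scores"]     # shape (2, num_classes)

Calling the model with labels while it is in training mode switches the returned dict to {"losses": loss_value}, as described above.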
SwinTransformer¶
class libai.models.swin_transformer.SwinTransformer(img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_size=7, mlp_ratio=4.0, qkv_bias=True, qk_scale=None, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, norm_layer=<class 'libai.layers.layer_norm.LayerNorm'>, ape=False, patch_norm=True, loss_func=None, **kwargs)[source]¶
Swin Transformer in LiBai.
LiBai's implementation of: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Parameters
img_size (int, tuple(int)) – Input image size. Default: 224
patch_size (int, tuple(int)) – Patch size. Default: 4
in_chans (int) – Number of input image channels. Default: 3
num_classes (int) – Number of classes for classification head. Default: 1000
embed_dim (int) – Patch embedding dimension. Default: 96
depths (tuple(int)) – Depth of each Swin Transformer layer.
num_heads (tuple(int)) – Number of attention heads in different layers.
window_size (int) – Window size. Default: 7
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True
qk_scale (float) – Override default qk scale of head_dim ** -0.5 if set. Default: None
drop_rate (float) – Dropout rate. Default: 0
attn_drop_rate (float) – Attention dropout rate. Default: 0
drop_path_rate (float) – Stochastic depth rate. Default: 0.1
norm_layer (nn.Module) – Normalization layer. Default: libai.layers.LayerNorm.
ape (bool) – If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool) – If True, add normalization after patch embedding. Default: True
loss_func (callable, optional) – Loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
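The training branch can be sketched the same way: supplying labels to a module in training mode yields the loss dict. This assumes the same single-process global-tensor setup as the VisionTransformer sketch above, and that a default loss function is used when loss_func is None (an implementation detail not documented here); hyperparameters are illustrative.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.swin_transformer import SwinTransformer

model = SwinTransformer(img_size=224, patch_size=4, embed_dim=96,
                        depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24])
model.train()

placement = dist.get_layer_placement(0)
images = flow.randn(2, 3, 224, 224).to_global(sbp=flow.sbp.split(0), placement=placement)
labels = flow.randint(0, 1000, (2,)).to_global(sbp=flow.sbp.split(0), placement=placement)

outputs = model(images, labels)           # training mode: {"losses": loss_value}
loss = outputs["losses"]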
SwinTransformerV2¶
class libai.models.swin_transformer_v2.SwinTransformerV2(img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_size=7, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, norm_layer=<class 'libai.layers.layer_norm.LayerNorm'>, ape=False, patch_norm=True, pretrained_window_sizes=[0, 0, 0, 0], loss_func=None)[source]¶
Swin Transformer V2 in LiBai.
- Parameters
img_size (int, tuple(int)) – Input image size. Default: 224
patch_size (int, tuple(int)) – Patch size. Default: 4
in_chans (int) – Number of input image channels. Default: 3
num_classes (int) – Number of classes for classification head. Default: 1000
embed_dim (int) – Patch embedding dimension. Default: 96
depths (tuple(int)) – Depth of each Swin Transformer layer.
num_heads (tuple(int)) – Number of attention heads in different layers.
window_size (int) – Window size. Default: 7
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True
drop_rate (float) – Dropout rate. Default: 0
attn_drop_rate (float) – Attention dropout rate. Default: 0
drop_path_rate (float) – Stochastic depth rate. Default: 0.1
norm_layer (nn.Module) – Normalization layer. Default: libai.layers.LayerNorm.
ape (bool) – If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool) – If True, add normalization after patch embedding. Default: True
pretrained_window_sizes (tuple(int)) – Pretrained window sizes of each layer. Default: [0, 0, 0, 0]
loss_func (callable, optional) – Loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
ResMLP¶
class libai.models.resmlp.ResMLP(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, drop_rate=0.0, drop_path_rate=0.0, init_scale=0.0001, num_classes=1000, loss_func=None)[source]¶
ResMLP in LiBai.
LiBai's implementation of: ResMLP: Feedforward networks for image classification with data-efficient training
- Parameters
img_size (int, tuple(int)) – input image size
patch_size (int, tuple(int)) – patch size
in_chans (int) – number of input channels
embed_dim (int) – embedding dimension
depth (int) – depth of transformer
drop_rate (float) – dropout rate
drop_path_rate (float) – stochastic depth rate
init_scale (float) – the layer scale ratio
num_classes (int) – number of classes for classification head
loss_func (callable, optional) – loss function for computing the total loss between logits and labels
forward(images, labels=None)[source]¶
- Parameters
images (flow.Tensor) – training samples.
labels (flow.LongTensor, optional) – training targets
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
BERT¶
class libai.models.bert_model.BertForPreTraining(cfg)[source]¶
Bert Model with two heads on top as done during pretraining: a masked language modeling head and a next sentence prediction (classification) head.
forward(input_ids, attention_mask, tokentype_ids=None, ns_labels=None, lm_labels=None, loss_mask=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
ns_labels (flow.LongTensor, optional) – Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring). Indices should be in [0, 1]: 0 indicates sequence B is a continuation of sequence A, 1 indicates sequence B is a random sequence.
lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size].
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]
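To make the documented input conventions concrete, a small sketch of how the masks and labels are typically shaped with plain OneFlow tensors; the batch size, sequence length, pad id, vocabulary size, and masked position are illustrative assumptions, not values taken from this page.

import oneflow as flow

batch_size, seq_len, pad_id, vocab_size = 2, 128, 0, 30522   # illustrative values

input_ids = flow.randint(0, vocab_size, (batch_size, seq_len))
input_ids[:, 100:] = pad_id                                   # pretend the tail is padding

# attention_mask: 1 for real tokens, 0 for padding tokens.
attention_mask = (input_ids != pad_id).to(flow.bool)

# lm_labels: -1 everywhere except positions that were masked out for MLM.
lm_labels = flow.full((batch_size, seq_len), -1)
lm_labels[:, 5] = 42                                          # pretend position 5 held token id 42

# loss_mask: compute the MLM loss only where a real label exists.
loss_mask = (lm_labels != -1).to(flow.bool)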
class libai.models.bert_model.BertModel(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, num_tokentypes=2, add_pooling_layer=True, initializer_range=0.02, layernorm_eps=1e-12, bias_gelu_fusion=True, bias_dropout_fusion=True, scale_mask_softmax_fusion=True, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
The bare Bert Model transformer outputting raw hidden-states without any specific head on top.
- Parameters
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
hidden_layers (int) – The number of TransformerLayer in the encoder.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
intermediate_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
hidden_dropout_prob (float, optional) – The dropout ratio for the output of each TransformerLayer. Defaults to 0.0.
attention_probs_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.
max_position_embeddings (int) – Max sequence length of input, which defines the shape of the position embeddings in BertEmbedding.
num_tokentypes (int, optional) – Number of segment token indices. Defaults to 2.
add_pooling_layer (bool, optional) – Whether or not to average or pool the sequence of hidden states for the whole input sequence. Defaults to True.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
layernorm_eps (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-12.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to True.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to True.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to True.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.
forward(input_ids, attention_mask, tokentype_ids=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
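A bare-model sketch: constructing a deliberately tiny BertModel and running one forward pass. It assumes LiBai's default single-process setup with a GPU, global-tensor inputs placed via libai.utils.distributed.get_layer_placement, and that the model returns the raw hidden states (plus a pooled output when add_pooling_layer=True, which is not spelled out on this page); the hyperparameters are illustrative only.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.bert_model import BertModel

model = BertModel(
    vocab_size=30522,
    hidden_size=256,
    hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
)
model.eval()

placement = dist.get_layer_placement(0)
input_ids = flow.randint(0, 30522, (2, 128)).to_global(
    sbp=flow.sbp.split(0), placement=placement
)
attention_mask = flow.ones(2, 128, dtype=flow.bool).to_global(
    sbp=flow.sbp.split(0), placement=placement
)

outputs = model(input_ids, attention_mask)   # raw hidden states (and pooled output, if enabled)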
RoBERTa¶
class libai.models.roberta_model.RobertaForCausalLM(cfg)[source]¶
forward(input_ids, attention_mask, tokentype_ids=None, position_ids=None, labels=None, loss_mask=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
position_ids (flow.LongTensor, optional) – Indices of positions of each input sequence tokens in the position embeddings. Defaults to None.
labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. Defaults to None.
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]. Defaults to None.
class libai.models.roberta_model.RobertaForPreTraining(cfg)[source]¶
forward(input_ids, attention_mask, tokentype_ids=None, lm_labels=None, loss_mask=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
attention_mask (flow.BoolTensor) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. Defaults to None.
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]. Defaults to None.
class libai.models.roberta_model.RobertaModel(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, num_tokentypes=2, add_pooling_layer=True, initializer_range=0.02, layernorm_eps=1e-12, pad_token_id=1, bias_gelu_fusion=True, bias_dropout_fusion=True, scale_mask_softmax_fusion=True, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
The bare Roberta Model transformer outputting raw hidden-states without any specific head on top.
- Parameters
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
hidden_layers (int) – The number of TransformerLayer in the encoder.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
intermediate_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
hidden_dropout_prob (float, optional) – The dropout ratio for the output of each TransformerLayer. Defaults to 0.0.
attention_probs_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.
max_position_embeddings (int) – Max sequence length of input, which defines the shape of the position embeddings in RobertaEmbeddings.
num_tokentypes (int, optional) – Number of segment token indices. Defaults to 2.
add_pooling_layer (bool, optional) – Whether or not to average or pool the sequence of hidden states for the whole input sequence. Defaults to True.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
layernorm_eps (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-12.
pad_token_id (int, optional) – The token id used for padding. Defaults to 1.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to True.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to True.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to True.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT (RoBERTa) residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.
T5¶
class libai.models.t5_model.T5ForPreTraining(cfg)[source]¶
T5 Model with a classification head on top.
forward(encoder_input_ids, decoder_input_ids, encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask, lm_labels=None, loss_mask=None, use_cache=False)[source]¶
- Parameters
encoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for encoder.
decoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for decoder.
encoder_attn_mask (flow.BoolTensor) – Mask for encoder to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on subsequent token indices. Mask values have the same meaning as encoder_attn_mask.
encoder_decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on encoder padded token indices. Mask values have the same meaning as encoder_attn_mask.
lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. None for evaluating.
loss_mask (flow.BoolTensor, optional) – Mask to avoid performing loss computing on ignored tokens. Tokens with indices set to -1 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size]. None for evaluating.
use_cache (bool, optional) – Set to True when the model is in the inference phase and used for incremental decoding. Defaults to False.
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"masked_lm_loss": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
class libai.models.t5_model.T5Model(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, embedding_dropout_prob, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, initializer_range=0.02, layernorm_eps=1e-12, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
T5 Model that outputs logits.
- Parameters
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
hidden_layers (int) – The number of TransformerLayer in the encoder and decoder.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
intermediate_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
embedding_dropout_prob (float) – The dropout ratio for the output of the T5Embedding layer.
hidden_dropout_prob (float) – The dropout ratio for the output of each TransformerLayer.
attention_probs_dropout_prob (float) – The dropout ratio for the output of each attention layer in TransformerLayer.
max_position_embeddings (int) – Max sequence length of input, which defines the shape of the position embeddings in T5Embedding.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
layernorm_eps (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-12.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to False.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to False.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to False.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight in the T5 model. Defaults to False.
forward(encoder_input_ids, decoder_input_ids, encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask, use_cache=False)[source]¶
- Parameters
encoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for encoder.
decoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for decoder.
encoder_attn_mask (flow.BoolTensor) – Mask for encoder to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on subsequent token indices. Mask values have the same meaning as encoder_attn_mask.
encoder_decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on encoder padded token indices. Mask values have the same meaning as encoder_attn_mask.
use_cache (bool, optional) – Set to True when the model is in the inference phase and used for incremental decoding. Defaults to False.
- Returns
logits
- Return type
flow.Tensor
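An encoder-decoder forward sketch for T5Model, again assuming a single-process global-tensor setup with a GPU; the tiny hyperparameters and, in particular, the [batch, query_len, key_len] mask shapes are assumptions about how the data pipeline prepares inputs, not facts stated on this page.

import oneflow as flow
from libai.utils import distributed as dist
from libai.models.t5_model import T5Model

model = T5Model(
    vocab_size=32128,
    hidden_size=256,
    hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    embedding_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
)
model.eval()

placement = dist.get_layer_placement(0)

def to_global(t):
    # LiBai models consume global tensors split along the batch dimension.
    return t.to_global(sbp=flow.sbp.split(0), placement=placement)

encoder_input_ids = to_global(flow.randint(0, 32128, (2, 64)))
decoder_input_ids = to_global(flow.randint(0, 32128, (2, 16)))
encoder_attn_mask = to_global(flow.ones(2, 64, 64, dtype=flow.bool))
decoder_attn_mask = to_global(flow.tril(flow.ones(2, 16, 16)).to(flow.bool))   # causal mask
encoder_decoder_attn_mask = to_global(flow.ones(2, 16, 64, dtype=flow.bool))

logits = model(encoder_input_ids, decoder_input_ids, encoder_attn_mask,
               decoder_attn_mask, encoder_decoder_attn_mask)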
GPT-2¶
class libai.models.gpt_model.GPTForPreTraining(cfg)[source]¶
GPT Model with a classification head on top.
forward(input_ids, labels=None)[source]¶
- Parameters
input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.
labels (flow.LongTensor, optional) – Labels for computing language modeling loss. None for evaluating. Defaults to None.
- Returns
A dict containing loss_value or logits depending on training or evaluation mode: {"masked_lm_loss": loss_value} when training, {"prediction_scores": logits} when evaluating.
- Return type
dict
class libai.models.gpt_model.GPTModel(hidden_layers, vocab_size, hidden_size, ffn_hidden_size, num_attention_heads, max_seq_length=1024, embedding_dropout_prob=0.0, attention_dropout_prob=0.0, output_dropout_prob=0.0, layernorm_epsilon=1e-05, initializer_range=0.02, use_scaled_init_for_output_weights=True, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, apply_residual_post_layernorm=False, amp_enabled=False)[source]¶
GPT-2 language model. The output of the forward method is logits.
- Parameters
hidden_layers (int) – The number of TransformerLayer in the GPT model.
vocab_size (int) – The size of vocabulary file.
hidden_size (int) – The size of hidden states.
ffn_hidden_size (int) – The size of the intermediate layer in the feed-forward network for each TransformerLayer.
num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.
max_seq_length (int, optional) – Max sequence length of input, which defines the shape of the position embeddings in GPTEmbedding. Defaults to 1024.
embedding_dropout_prob (float, optional) – The dropout ratio for the output of the GPTEmbedding layer. Defaults to 0.0.
attention_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.
output_dropout_prob (float, optional) – The dropout ratio for the output of each TransformerLayer. Defaults to 0.0.
layernorm_epsilon (float, optional) – The epsilon of the LayerNorm layer. Defaults to 1e-5.
initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.
use_scaled_init_for_output_weights (bool, optional) – Defaults to True.
bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to False.
bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to False.
scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to False.
apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer-index-related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to False.
apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.
amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.