libai.models

Supported models in LiBai (李白)

libai.models.build.build_graph(cfg, model, optimizer=None, lr_scheduler=None, is_train=False)[source]

Build the nn.Graph, defined by cfg.graph.

libai.models.build.build_model(cfg)[source]

Build the whole model architecture, defined by cfg.model. Note that it does not load any weights from cfg.

VisionTransformer

class libai.models.vision_transformer.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4.0, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, num_classes=1000, loss_func=None)[source]

Vision Transformer in LiBai.

LiBai’s implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Parameters
  • img_size (int, tuple(int)) – input image size

  • patch_size (int, tuple(int)) – patch size

  • in_chans (int) – number of input channels

  • embed_dim (int) – embedding dimension

  • depth (int) – depth of transformer

  • num_heads (int) – number of attention heads

  • mlp_ratio (float) – ratio of mlp hidden dim to embedding dim

  • drop_rate (float) – dropout rate

  • attn_drop_rate (float) – attention dropout rate

  • drop_path_rate (float) – stochastic depth rate

  • num_classes (int) – number of classes for classification head

  • loss_func (callable, optional) – loss function for computing the total loss between logits and labels
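As a rough illustration of how these constructor arguments interact, the sketch below derives the token count and layer widths implied by the defaults. The helper `vit_shapes` is hypothetical and not part of LiBai; it assumes non-overlapping square patches and an embedding dimension split evenly across attention heads, as in the original ViT paper.

```python
def vit_shapes(img_size=224, patch_size=16, embed_dim=192, num_heads=3, mlp_ratio=4.0):
    """Derive sizes implied by the constructor arguments (illustrative only)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    grid = img_size // patch_size  # patches along one side of the image
    return {
        "num_patches": grid * grid,
        "head_dim": embed_dim // num_heads,
        "mlp_hidden_dim": int(embed_dim * mlp_ratio),
    }


shapes = vit_shapes()
print(shapes)  # {'num_patches': 196, 'head_dim': 64, 'mlp_hidden_dim': 768}
```

With the defaults, a 224x224 image becomes a 14x14 grid of patches, and each MLP block expands the 192-dimensional embedding to 768 hidden units.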

forward(images, labels=None)[source]
Parameters
  • images (flow.Tensor) – training samples.

  • labels (flow.LongTensor, optional) – training targets

Returns

A dict containing loss_value or logits depending on training or evaluation mode. {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.

Return type

dict
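The return convention above can be mimicked with plain Python values. This is only a sketch: `pack_outputs` and `toy_loss` are hypothetical stand-ins, and it assumes training mode is signalled by an explicit flag together with the presence of labels.

```python
def pack_outputs(logits, labels=None, training=False, loss_func=None):
    """Return a dict shaped like the documented forward() output."""
    if training and labels is not None:
        # Training: only the loss is reported.
        return {"losses": loss_func(logits, labels)}
    # Evaluation: only the raw logits are reported.
    return {"prediction_scores": logits}


def toy_loss(logits, labels):
    # Mean absolute error, used here purely as a placeholder loss.
    return sum(abs(p - t) for p, t in zip(logits, labels)) / len(labels)


train_out = pack_outputs([0.5, 1.0], labels=[1.0, 1.0], training=True, loss_func=toy_loss)
eval_out = pack_outputs([0.5, 1.0])
print(train_out)  # {'losses': 0.25}
print(eval_out)   # {'prediction_scores': [0.5, 1.0]}
```

Callers can branch on which key is present instead of tracking the mode separately.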

SwinTransformer

class libai.models.swin_transformer.SwinTransformer(img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_size=7, mlp_ratio=4.0, qkv_bias=True, qk_scale=None, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, norm_layer=<class 'libai.layers.layer_norm.LayerNorm'>, ape=False, patch_norm=True, loss_func=None, **kwargs)[source]

Swin Transformer in LiBai.

LiBai’s implementation of: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Parameters
  • img_size (int, tuple(int)) – Input image size. Default: 224

  • patch_size (int, tuple(int)) – Patch size. Default: 4

  • in_chans (int) – Number of input image channels. Default: 3

  • num_classes (int) – Number of classes for classification head. Default: 1000

  • embed_dim (int) – Patch embedding dimension. Default: 96

  • depths (tuple(int)) – Depth of each Swin Transformer layer.

  • num_heads (tuple(int)) – Number of attention heads in different layers.

  • window_size (int) – Window size. Default: 7

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True

  • qk_scale (float) – Override default qk scale of head_dim ** -0.5 if set. Default: None

  • drop_rate (float) – Dropout rate. Default: 0

  • attn_drop_rate (float) – Attention dropout rate. Default: 0

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.1

  • norm_layer (nn.Module) – Normalization layer. Default: libai.layers.LayerNorm.

  • ape (bool) – If True, add absolute position embedding to the patch embedding. Default: False

  • patch_norm (bool) – If True, add normalization after patch embedding. Default: True

  • loss_func (callable, optional) – Loss function for computing the total loss between logits and labels
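To see how `embed_dim`, `depths`, `num_heads` and `qk_scale` relate, the following sketch tabulates the per-stage layout for the default arguments. The helper `swin_stage_layout` is hypothetical; it assumes the layout from the Swin Transformer paper, where the feature-map resolution halves and the channel width doubles between consecutive stages.

```python
def swin_stage_layout(img_size=224, patch_size=4, embed_dim=96,
                      depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24)):
    """Per-stage resolution, width and head size (illustrative only)."""
    grid = img_size // patch_size  # patch grid side length after patch embedding
    stages = []
    for i, (depth, heads) in enumerate(zip(depths, num_heads)):
        dim = embed_dim * 2 ** i
        head_dim = dim // heads
        stages.append({
            "stage": i,
            "depth": depth,
            "resolution": grid // 2 ** i,
            "dim": dim,
            "head_dim": head_dim,
            # Value used when qk_scale is None, per the parameter description.
            "default_qk_scale": head_dim ** -0.5,
        })
    return stages


for stage in swin_stage_layout():
    print(stage)
```

With the defaults, every stage ends up with a head dimension of 32, and each stage resolution (56, 28, 14, 7) is a multiple of the default `window_size` of 7.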

forward(images, labels=None)[source]
Parameters
  • images (flow.Tensor) – training samples.

  • labels (flow.LongTensor, optional) – training targets

Returns

A dict containing loss_value or logits depending on training or evaluation mode. {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.

Return type

dict

SwinTransformerV2

class libai.models.swin_transformer_v2.SwinTransformerV2(img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_size=7, mlp_ratio=4.0, qkv_bias=True, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, norm_layer=<class 'libai.layers.layer_norm.LayerNorm'>, ape=False, patch_norm=True, pretrained_window_sizes=[0, 0, 0, 0], loss_func=None)[source]

Swin Transformer V2 in LiBai.

Parameters
  • img_size (int, tuple(int)) – Input image size. Default: 224

  • patch_size (int, tuple(int)) – Patch size. Default: 4

  • in_chans (int) – Number of input image channels. Default: 3

  • num_classes (int) – Number of classes for classification head. Default: 1000

  • embed_dim (int) – Patch embedding dimension. Default: 96

  • depths (tuple(int)) – Depth of each Swin Transformer layer.

  • num_heads (tuple(int)) – Number of attention heads in different layers.

  • window_size (int) – Window size. Default: 7

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True

  • drop_rate (float) – Dropout rate. Default: 0

  • attn_drop_rate (float) – Attention dropout rate. Default: 0

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.1

  • norm_layer (nn.Module) – Normalization layer. Default: libai.layers.LayerNorm.

  • ape (bool) – If True, add absolute position embedding to the patch embedding. Default: False

  • patch_norm (bool) – If True, add normalization after patch embedding. Default: True

  • pretrained_window_sizes (tuple(int)) – Pretrained window sizes of each layer.

  • loss_func (callable, optional) – Loss function for computing the total loss between logits and labels

forward(images, labels=None)[source]
Parameters
  • images (flow.Tensor) – training samples.

  • labels (flow.LongTensor, optional) – training targets

Returns

A dict containing loss_value or logits depending on training or evaluation mode. {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.

Return type

dict

ResMLP

class libai.models.resmlp.ResMLP(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, drop_rate=0.0, drop_path_rate=0.0, init_scale=0.0001, num_classes=1000, loss_func=None)[source]

ResMLP in LiBai.

LiBai’s implementation of: ResMLP: Feedforward networks for image classification with data-efficient training

Parameters
  • img_size (int, tuple(int)) – input image size

  • patch_size (int, tuple(int)) – patch size

  • in_chans (int) – number of input channels

  • embed_dim (int) – embedding dimension

  • depth (int) – depth of transformer

  • drop_rate (float) – dropout rate

  • drop_path_rate (float) – stochastic depth rate

  • init_scale (float) – the layer scale ratio

  • num_classes (int) – number of classes for classification head

  • loss_func (callable, optional) – loss function for computing the total loss between logits and labels

forward(images, labels=None)[source]
Parameters
  • images (flow.Tensor) – training samples.

  • labels (flow.LongTensor, optional) – training targets

Returns

A dict containing loss_value or logits depending on training or evaluation mode. {"losses": loss_value} when training, {"prediction_scores": logits} when evaluating.

Return type

dict

BERT

class libai.models.bert_model.BertForPreTraining(cfg)[source]

Bert Model with two heads on top as done during the pretraining: a masked language modeling head and a next sentence prediction (classification) head.

forward(input_ids, attention_mask, tokentype_ids=None, ns_labels=None, lm_labels=None, loss_mask=None)[source]
Parameters
  • input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.

  • attention_mask (flow.BoolTensor) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.

  • ns_labels (flow.LongTensor, optional) –

    Labels for computing the next sentence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring). Indices should be in [0, 1]:

    • 0 indicates sequence B is a continuation of sequence A,

    • 1 indicates sequence B is a random sequence.

  • lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size].

  • loss_mask (flow.BoolTensor, optional) – Mask to avoid computing the loss on ignored tokens. Tokens whose label index is -1 are ignored (masked); the loss is only computed for tokens with labels in [0, …, config.vocab_size].
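A minimal sketch of how `attention_mask` and `loss_mask` could be derived from padded inputs, following the conventions described above. The helper `build_bert_masks`, the padding id `PAD_ID = 0` and the token ids are all assumptions made for illustration; real values depend on the tokenizer and data pipeline.

```python
PAD_ID = 0  # hypothetical padding id; the real value depends on the tokenizer


def build_bert_masks(input_ids, lm_labels, pad_id=PAD_ID):
    """Build boolean masks for a batch given as nested Python lists."""
    # True (1) for real tokens, False (0) for padding.
    attention_mask = [[tok != pad_id for tok in seq] for seq in input_ids]
    # True only where a label is present, i.e. where the label is not -1.
    loss_mask = [[lab != -1 for lab in seq] for seq in lm_labels]
    return attention_mask, loss_mask


input_ids = [[101, 7592, 103, 102, 0, 0]]   # last two positions are padding
lm_labels = [[-1, -1, 2088, -1, -1, -1]]    # only the third token is predicted

attention_mask, loss_mask = build_bert_masks(input_ids, lm_labels)
print(attention_mask)  # [[True, True, True, True, False, False]]
print(loss_mask)       # [[False, False, True, False, False, False]]
```

In practice these nested lists would be converted to tensors of the documented types before being passed to `forward`.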

class libai.models.bert_model.BertModel(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, num_tokentypes=2, add_pooling_layer=True, initializer_range=0.02, layernorm_eps=1e-12, bias_gelu_fusion=True, bias_dropout_fusion=True, scale_mask_softmax_fusion=True, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]

The bare Bert Model transformer outputting raw hidden-states without any specific head on top.

Parameters
  • vocab_size (int) – The size of vocabulary file.

  • hidden_size (int) – The size of hidden states.

  • hidden_layers (int) – The number of TransformerLayer in encoder.

  • num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.

  • intermediate_size (int) – The size of intermediate layer in feed-forward network for each TransformerLayer.

  • hidden_dropout_prob (float) – The dropout ratio for the output for each TransformerLayer.

  • attention_probs_dropout_prob (float) – The dropout ratio for the output of each attention layer in TransformerLayer.

  • max_position_embeddings (int) – Max sequence length of input, defines the shape of Position Embeddings in BertEmbedding.

  • num_tokentypes (int, optional) – Number of segment token indices. Defaults to 2.

  • add_pooling_layer (bool, optional) – Whether or not to average or pool the sequence of hidden-states for the whole input sequence. Defaults to True.

  • initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.

  • layernorm_eps (float, optional) – The epsilon of LayerNorm layer. Defaults to 1e-12.

  • bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to True.

  • bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to True.

  • scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to True.

  • apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer index related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.

  • apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron BERT residual connection, which is more stable when scaling model size (introduced in https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.

  • amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.

forward(input_ids, attention_mask, tokentype_ids=None)[source]
Parameters
  • input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.

  • attention_mask (flow.BoolTensor) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.
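For a sentence pair, `tokentype_ids` marks which segment each position belongs to. The sketch below uses a hypothetical helper, `build_pair_inputs`, with made-up token ids; it assumes any special tokens are already included in the two segments by the caller.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Concatenate two tokenized segments and tag each position with its segment."""
    input_ids = list(tokens_a) + list(tokens_b)
    # 0 for the first portion of the input, 1 for the second.
    tokentype_ids = [0] * len(tokens_a) + [1] * len(tokens_b)
    return input_ids, tokentype_ids


input_ids, tokentype_ids = build_pair_inputs([101, 7592, 102], [2088, 102])
print(input_ids)      # [101, 7592, 102, 2088, 102]
print(tokentype_ids)  # [0, 0, 0, 1, 1]
```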

RoBERTa

class libai.models.roberta_model.RobertaForCausalLM(cfg)[source]
forward(input_ids, attention_mask, tokentype_ids=None, position_ids=None, labels=None, loss_mask=None)[source]
Parameters
  • input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.

  • attention_mask (flow.BoolTensor) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.

  • position_ids (flow.LongTensor, optional) – Indices of positions of each input sequence tokens in the position embeddings. Defaults to None.

  • labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. Defaults to None.

  • loss_mask (flow.BoolTensor, optional) – Mask to avoid computing the loss on ignored tokens. Tokens whose label index is -1 are ignored (masked); the loss is only computed for tokens with labels in [0, …, config.vocab_size]. Defaults to None.

class libai.models.roberta_model.RobertaForPreTraining(cfg)[source]
forward(input_ids, attention_mask, tokentype_ids=None, lm_labels=None, loss_mask=None)[source]
Parameters
  • input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.

  • attention_mask (flow.BoolTensor) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • tokentype_ids (flow.LongTensor, optional) – Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]. Defaults to None.

  • lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. Defaults to None.

  • loss_mask (flow.BoolTensor, optional) – Mask to avoid computing the loss on ignored tokens. Tokens whose label index is -1 are ignored (masked); the loss is only computed for tokens with labels in [0, …, config.vocab_size]. Defaults to None.

class libai.models.roberta_model.RobertaModel(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, num_tokentypes=2, add_pooling_layer=True, initializer_range=0.02, layernorm_eps=1e-12, pad_token_id=1, bias_gelu_fusion=True, bias_dropout_fusion=True, scale_mask_softmax_fusion=True, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]

The bare Roberta Model transformer outputting raw hidden-states without any specific head on top.

Parameters
  • vocab_size (int) – The size of vocabulary file.

  • hidden_size (int) – The size of hidden states.

  • hidden_layers (int) – The number of TransformerLayer in encoder.

  • num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.

  • intermediate_size (int) – The size of intermediate layer in feed-forward network for each TransformerLayer.

  • hidden_dropout_prob (float) – The dropout ratio for the output for each TransformerLayer.

  • attention_probs_dropout_prob (float) – The dropout ratio for the output of each attention layer in TransformerLayer.

  • max_position_embeddings (int) – Max sequence length of input, defines the shape of Position Embeddings in RobertaEmbeddings.

  • num_tokentypes (int, optional) – Number of segment token indices. Defaults to 2.

  • add_pooling_layer (bool, optional) – Whether or not to average or pool the sequence of hidden-states for the whole input sequence. Defaults to True.

  • initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.

  • layernorm_eps (float, optional) – The epsilon of LayerNorm layer. Defaults to 1e-12.

  • pad_token_id (int, optional) – The token id used for padding. Defaults to 1.

  • bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to True.

  • bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to True.

  • scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to True.

  • apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer index related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.

  • apply_residual_post_layernorm (bool, optional) – If True, use the original BERT (RoBERTa) residual connection ordering; otherwise use the Megatron BERT residual connection, which is more stable when scaling model size (introduced in https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.

  • amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.

T5

class libai.models.t5_model.T5ForPreTraining(cfg)[source]

T5 Model with classification head on top.

forward(encoder_input_ids, decoder_input_ids, encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask, lm_labels=None, loss_mask=None, use_cache=False)[source]
Parameters
  • encoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for encoder.

  • decoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for decoder.

  • encoder_attn_mask (flow.BoolTensor) –

    Mask for encoder to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on subsequent token indices. Mask values have the same meaning as encoder_attn_mask.

  • encoder_decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on encoder padded token indices. Mask values have the same meaning as encoder_attn_mask.

  • lm_labels (flow.LongTensor, optional) – Labels for computing the masked language modeling loss. Indices should be in [-1, 0, …, config.vocab_size]. None for evaluating.

  • loss_mask (flow.BoolTensor, optional) – Mask to avoid computing the loss on ignored tokens. Tokens whose label index is -1 are ignored (masked); the loss is only computed for tokens with labels in [0, …, config.vocab_size]. None for evaluating.

  • use_cache (bool, optional) – Set to True when the model is in the inference phase and used for incremental decoding. Defaults to False.

Returns

A dict containing loss_value or logits depending on training or evaluation mode. {"masked_lm_loss": loss_value} when training, {"prediction_scores": logits} when evaluating.

Return type

dict
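The three attention masks can be sketched for a single unbatched example as follows. The helper `build_t5_masks`, the padding id `0`, the token ids and the `[query_length, key_length]` mask layout are assumptions for illustration, not LiBai's data pipeline. True means the query position may attend to the key position.

```python
def build_t5_masks(encoder_ids, decoder_ids, pad_id=0):
    """Return (encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask)."""
    enc_keep = [tok != pad_id for tok in encoder_ids]
    dec_keep = [tok != pad_id for tok in decoder_ids]

    # Encoder self-attention: both query and key must be real tokens.
    encoder_attn_mask = [[q and k for k in enc_keep] for q in enc_keep]

    # Decoder self-attention: additionally hide subsequent positions (j > i).
    n = len(dec_keep)
    decoder_attn_mask = [
        [dec_keep[i] and dec_keep[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

    # Cross-attention: decoder queries over encoder keys, hiding encoder padding.
    encoder_decoder_attn_mask = [[q and k for k in enc_keep] for q in dec_keep]
    return encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask


enc_mask, dec_mask, cross_mask = build_t5_masks([5, 6, 0], [7, 8])
print(enc_mask)    # [[True, True, False], [True, True, False], [False, False, False]]
print(dec_mask)    # [[True, False], [True, True]]
print(cross_mask)  # [[True, True, False], [True, True, False]]
```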

class libai.models.t5_model.T5Model(vocab_size, hidden_size, hidden_layers, num_attention_heads, intermediate_size, embedding_dropout_prob, hidden_dropout_prob, attention_probs_dropout_prob, max_position_embeddings, initializer_range=0.02, layernorm_eps=1e-12, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=True, apply_residual_post_layernorm=False, amp_enabled=False)[source]

T5 Model that outputs logits.

Parameters
  • vocab_size (int) – The size of vocabulary file.

  • hidden_size (int) – The size of hidden states.

  • hidden_layers (int) – The number of TransformerLayer in the encoder and decoder.

  • num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.

  • intermediate_size (int) – The size of intermediate layer in feed-forward network for each TransformerLayer.

  • embedding_dropout_prob (float) – The dropout ratio for the output of T5Embedding Layer.

  • hidden_dropout_prob (float) – The dropout ratio for the output for each TransformerLayer.

  • attention_probs_dropout_prob (float) – The dropout ratio for the output of each attention layer in TransformerLayer.

  • max_position_embeddings (int) – Max sequence length of input, defines the shape of Position Embeddings in T5Embedding.

  • initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.

  • layernorm_eps (float, optional) – The epsilon of LayerNorm layer. Defaults to 1e-12.

  • bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to False.

  • bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to False.

  • scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to False.

  • apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer index related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to True.

  • apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron BERT residual connection, which is more stable when scaling model size (introduced in https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.

  • amp_enabled (bool, optional) – Whether or not to set fp16 for embedding weight in T5 model. Defaults to False.
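The formula quoted for `apply_query_key_layer_scaling` can be evaluated directly. This sketch assumes that `d` is the per-head dimension, that `layer_index` counts from zero, and that the factor is used as a divisor on raw query-key scores, as in Megatron-style attention; `attention_norm_factor` is a hypothetical helper, not a LiBai function.

```python
import math


def attention_norm_factor(head_dim, layer_index, apply_query_key_layer_scaling=True):
    """Scaling factor sqrt(d) * (layer_index + 1), or plain sqrt(d) when disabled."""
    factor = math.sqrt(head_dim)
    if apply_query_key_layer_scaling:
        factor *= layer_index + 1
    return factor


print(attention_norm_factor(64, layer_index=0))         # 8.0
print(attention_norm_factor(64, layer_index=3))         # 32.0
print(attention_norm_factor(64, layer_index=3,
                            apply_query_key_layer_scaling=False))  # 8.0
```

Deeper layers therefore get a larger factor, which keeps raw attention scores smaller in magnitude before the softmax.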

forward(encoder_input_ids, decoder_input_ids, encoder_attn_mask, decoder_attn_mask, encoder_decoder_attn_mask, use_cache=False)[source]
Parameters
  • encoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for encoder.

  • decoder_input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary for decoder.

  • encoder_attn_mask (flow.BoolTensor) –

    Mask for encoder to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on subsequent token indices. Mask values have the same meaning as encoder_attn_mask.

  • encoder_decoder_attn_mask (flow.BoolTensor) – Mask for decoder to avoid performing attention on encoder padded token indices. Mask values have the same meaning as encoder_attn_mask.

  • use_cache (bool, optional) – Set to True when the model is in the inference phase and used for incremental decoding. Defaults to False.

Returns

logits

Return type

flow.Tensor

GPT-2

class libai.models.gpt_model.GPTForPreTraining(cfg)[source]

GPT Model with classification head on top.

forward(input_ids, labels=None)[source]
Parameters
  • input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.

  • labels (flow.LongTensor, optional) – Labels for computing language modeling loss. None for evaluating. Defaults to None.

Returns

A dict containing loss_value or logits depending on training or evaluation. {"masked_lm_loss": loss_value} when training, {"prediction_scores": logits} when evaluating.

Return type

dict

class libai.models.gpt_model.GPTModel(num_layers, vocab_size, hidden_size, ffn_hidden_size, num_attention_heads, max_seq_length=1024, embedding_dropout_prob=0.0, attention_dropout_prob=0.0, output_dropout_prob=0.0, layernorm_epsilon=1e-05, initializer_range=0.02, use_scaled_init_for_output_weights=True, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, apply_residual_post_layernorm=False, amp_enabled=False)[source]

GPT-2 language model. The output of the forward method is logits.

Parameters
  • num_layers (int) – The number of TransformerLayer in the gpt model.

  • vocab_size (int) – The size of vocabulary file.

  • hidden_size (int) – The size of hidden states.

  • ffn_hidden_size (int) – The size of intermediate layer in feed-forward network for each TransformerLayer.

  • num_attention_heads (int) – The number of attention heads for each attention layer of TransformerLayer.

  • max_seq_length (int, optional) – Max sequence length of input, defines the shape of Position Embeddings in GPTEmbedding. Defaults to 1024.

  • embedding_dropout_prob (float, optional) – The dropout ratio for the output of GPTEmbedding Layer. Defaults to 0.0.

  • attention_dropout_prob (float, optional) – The dropout ratio for the output of each attention layer in TransformerLayer. Defaults to 0.0.

  • output_dropout_prob (float, optional) – The dropout ratio for the output for each TransformerLayer. Defaults to 0.0.

  • layernorm_epsilon (float, optional) – The epsilon of LayerNorm layer. Defaults to 1e-5.

  • initializer_range (float, optional) – Sigma of the normal distribution in the initialization method. Defaults to 0.02.

  • use_scaled_init_for_output_weights (bool, optional) – Defaults to True.

  • bias_gelu_fusion (bool, optional) – Whether or not to fuse the computing of bias and gelu. Defaults to False.

  • bias_dropout_fusion (bool, optional) – Whether or not to fuse the computing of dropout and bias. Defaults to False.

  • scale_mask_softmax_fusion (bool, optional) – Whether to fuse the computing of mask and softmax in attention layers. Defaults to False.

  • apply_query_key_layer_scaling (bool, optional) – Whether or not to use layer index related scaling in computing attention scores. If True, the scaling factor equals sqrt(d) * (layer_index + 1). Defaults to False.

  • apply_residual_post_layernorm (bool, optional) – If True, use the original BERT residual connection ordering; otherwise use the Megatron BERT residual connection, which is more stable when scaling model size (introduced in https://arxiv.org/pdf/1909.08053.pdf). Defaults to False.

  • amp_enabled (bool, optional) – Whether or not to set fp16 for the embedding weight. Defaults to False.

forward(input_ids)[source]
Parameters
  • input_ids (flow.LongTensor) – Indices of input sequence tokens in vocabulary.

Returns

logits

Return type

flow.Tensor