libai.layers

class libai.layers.DropPath(drop_prob: float = 0.0, scale_by_keep: bool = True)[source]

Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
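
A minimal usage sketch (not from the library's docs): applying stochastic depth to the residual branch of a block. The shapes and the assumption that plain local tensors are accepted here are illustrative only.

    import oneflow as flow
    from libai.layers import DropPath

    drop_path = DropPath(drop_prob=0.1)  # skip the residual branch for ~10% of samples
    drop_path.train()                    # stochastic depth is only active in training mode

    x = flow.randn(4, 197, 768)           # assumed shape: [batch, tokens, hidden]
    branch_out = flow.randn(4, 197, 768)  # stands in for an attention/MLP output
    y = x + drop_path(branch_out)         # typical use in the main path of a residual block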

class libai.layers.Embedding(num_embeddings, embedding_dim, padding_idx=None, init_method=<function xavier_normal_>, amp_enabled=False, dtype=oneflow.float32, layer_idx=0)[source]

Construct the trainable embedding module, which does not support parallelization. This can be used for positional embedding and token type embedding.

Parameters
  • num_embeddings – size of vocabulary.

  • embedding_dim – dimension of embeddings.

  • padding_idx – pad index. Defaults to None.

  • init_method – method to initialize weights. Defaults to flow.nn.init.xavier_normal_.

  • amp_enabled – fp16 option for embedding weight. Defaults to False.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
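
A construction sketch for a learned positional embedding (an assumed use case consistent with the description above). The single-device setup, the use of libai.utils.distributed.get_layer_placement to pick a placement, and the shapes are illustrative assumptions.

    import oneflow as flow
    from libai.layers import Embedding
    from libai.utils import distributed as dist

    pos_embed = Embedding(num_embeddings=512, embedding_dim=768)

    # position ids for a batch of 2 sequences of length 16, as a global tensor
    position_ids = flow.arange(16).unsqueeze(0).expand(2, 16).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.broadcast
    )
    pos_embeddings = pos_embed(position_ids)  # expected shape: [2, 16, 768]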

class libai.layers.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, *, layer_idx=0)[source]

Applies Layer Normalization over a mini-batch of inputs in 1D parallelism.

Parameters
  • normalized_shape – input shape over which normalization is applied, given as the trailing dimensions of the expected input.

  • eps – a value added to the denominator for numerical stability. Defaults to 1e-5.

  • elementwise_affine – a boolean value that when set to True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases). Default: True.

  • bias – If set to False, the layer will not learn an additive bias. Defaults to True.

  • layer_idx – the layer index, which determines this layer's placement; used in pipeline parallelism. Defaults to 0.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
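
A minimal sketch of constructing and applying the layer norm. The global input tensor, its sbp signature, and the placement helper used below are illustrative assumptions.

    import oneflow as flow
    from libai.layers import LayerNorm
    from libai.utils import distributed as dist

    layer_norm = LayerNorm(normalized_shape=768, eps=1e-5, layer_idx=0)

    hidden_states = flow.randn(2, 16, 768).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.split(0)
    )
    out = layer_norm(hidden_states)  # same shape as the input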

libai.layers.Linear

alias of libai.layers.linear.Linear1D

class libai.layers.Linear1D(in_features, out_features, bias=True, parallel='data', init_method=<function xavier_normal_>, skip_bias_add=False, dtype=oneflow.float32, *, layer_idx=0)[source]

Linear layer with 1D parallelism which includes column parallelism and row parallelism. The linear layer is defined as \(y = xA^T + b\).

In column parallelism, A^T is parallelized along the second dimension as \(A^T = [A_1, ..., A_p]\).

In row parallelism, A^T is parallelized along the first dimension and X along its second dimension as:

\[A^T = \begin{bmatrix} A_1 \\ \vdots \\ A_p \end{bmatrix}, \quad x = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix}\]

Parameters
  • in_features – size of each input sample.

  • out_features – size of each output sample.

  • bias – If set to False, the layer will not learn an additive bias. Defaults to True.

  • parallel – Parallel mode. Defaults to “data”.

  • init_method – method to initialize weight. Defaults to nn.init.xavier_normal_().

  • skip_bias_add – skip adding bias but instead return it, so that adding bias can be fused with other elementwise operations. Defaults to False.

  • layer_idx – the layer index, which determines this layer's placement; used in pipeline parallelism. Defaults to 0.

  • dtype – the dtype of the weight. Defaults to flow.float32.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
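
A sketch of the usual pairing of a column-parallel and a row-parallel linear layer (for example, inside an MLP). The "col"/"row" values for the parallel argument follow the column/row parallelism described above but are an assumption here, as are the shapes and the placement helper.

    import oneflow as flow
    from libai.layers import Linear1D
    from libai.utils import distributed as dist

    hidden, ffn_hidden = 768, 3072
    linear_in = Linear1D(hidden, ffn_hidden, parallel="col", layer_idx=0)   # project up
    linear_out = Linear1D(ffn_hidden, hidden, parallel="row", layer_idx=0)  # project back down

    x = flow.randn(2, 16, hidden).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.split(0)
    )
    y = linear_out(linear_in(x))  # expected shape: [2, 16, hidden]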

class libai.layers.MLP(hidden_size, ffn_hidden_size, output_dropout_prob=0.0, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_gelu_fusion=False, bias_dropout_fusion=False, *, layer_idx=0)[source]

MLP takes an input with hidden size h, projects it to the intermediate hidden dimension (ffn_hidden_size), applies a GeLU activation, and projects the result back to hidden size h.

Parameters
  • hidden_size – size of each input and output sample.

  • ffn_hidden_size – size of each intermediate sample.

  • output_dropout_prob – Output dropout probability. Defaults to 0.0.

  • init_method – method to initialize the first linear weight. Defaults to nn.init.xavier_normal_().

  • output_layer_init_method – method to initialize the second linear weight. If set to None, it will use init_method instead. Defaults to None.

  • bias_gelu_fusion – If set to True, it will fuse bias adding and elementwise gelu activation. Defaults to False.

  • bias_dropout_fusion – If set to True, it will fuse bias adding and dropout. Defaults to False.

  • layer_idx – the layer index, which determines this layer's placement; used in pipeline parallelism. Defaults to 0.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
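
A minimal construction and forward-pass sketch; the shapes, sbp signature, and placement helper are illustrative assumptions.

    import oneflow as flow
    from libai.layers import MLP
    from libai.utils import distributed as dist

    mlp = MLP(hidden_size=768, ffn_hidden_size=3072, output_dropout_prob=0.1, layer_idx=0)

    hidden_states = flow.randn(2, 16, 768).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.split(0)
    )
    out = mlp(hidden_states)  # project to 3072, GeLU, project back to 768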

class libai.layers.MultiheadAttention(hidden_size, num_attention_heads, is_cross_attention=False, attention_dropout_prob=0.0, output_dropout_prob=0.0, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, attn_mask_type=<AttnMaskType.padding: 1>, *, layer_idx=0)[source]

Multi-head attention layer, supporting both self-attention and cross-attention.

Parameters
  • hidden_size – size of hidden state.

  • num_attention_heads – number of attention heads.

  • is_cross_attention – used to specify whether it is self attention or cross attention. Defaults to False.

  • attention_dropout_prob – dropout probability of attention weights. Defaults to 0.0.

  • output_dropout_prob – dropout probability of output. Defaults to 0.0.

  • init_method – method to initialize the input layer weights. Defaults to init.xavier_normal_.

  • output_layer_init_method – method to initialize the output layer weights. If None, use init_method.

  • bias_dropout_fusion – whether to fuse add bias and dropout. Defaults to False.

  • scale_mask_softmax_fusion – whether to fuse scale, mask and softmax. Defaults to False.

  • apply_query_key_layer_scaling – if True, scale the attention scores by the layer index. Defaults to False.

  • layer_idx – the layer index, which determines this layer's placement; used in pipeline parallelism. Defaults to 0.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(hidden_states: oneflow.Tensor, encoder_states: Optional[oneflow.Tensor] = None, attention_mask: Optional[oneflow.Tensor] = None, past_key_value: Optional[Tuple[oneflow.Tensor, oneflow.Tensor]] = None, use_cache: bool = False)[source]
Parameters
  • hidden_states (flow.Tensor) – shape is [bsz, tgt_len, hidden_size].

  • encoder_states (flow.Tensor, optional) – shape is [bsz, src_len, hidden_size]. Defaults to None.

  • attention_mask (flow.Tensor, optional) – shape is [bsz, 1, tgt_len, src_len]. It should be the combination of the padding mask and the causal mask: for self-attention in the encoder, it is the padding mask of the source input; for self-attention in the decoder, it is the combination of the padding mask of the target input and the causal mask; for cross-attention in the decoder, it is the padding mask of the source input. Defaults to None.

  • past_key_value (Tuple[flow.Tensor, flow.Tensor], optional) – tuple of key and value, each shape is [bsz, num_heads, src_len, head_size]. Defaults to None.

  • use_cache (bool, optional) – set to True when the model is in the inference phase and uses incremental decoding. Defaults to False.
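
A self-attention forward-pass sketch using the arguments documented above. The all-ones mask (i.e., nothing masked), its dtype, the sbp signatures, and the placement helper are illustrative assumptions; check the mask convention used by your model.

    import oneflow as flow
    from libai.layers import MultiheadAttention
    from libai.utils import distributed as dist

    attn = MultiheadAttention(hidden_size=768, num_attention_heads=12, layer_idx=0)
    placement = dist.get_layer_placement(0)

    hidden_states = flow.randn(2, 16, 768).to_global(placement=placement, sbp=flow.sbp.split(0))
    attention_mask = flow.ones(2, 1, 16, 16).to_global(placement=placement, sbp=flow.sbp.split(0))

    out = attn(hidden_states, attention_mask=attention_mask)  # self-attention (no encoder_states)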

class libai.layers.ParallelCrossEntropyLoss[source]

This criterion acts like CrossEntropyLoss, except that it executes the cross entropy loss computation distributed across different GPUs.

forward(logits: oneflow.Tensor, target: oneflow.Tensor)[source]

Function for the distributed cross entropy.

Parameters
  • logits (flow.Tensor) – vocab-parallel logits with shape (batch_size, seq_length, vocab_size) and sbp signature [S(0), S(2)].

  • target (flow.Tensor) – target with shape (batch_size, seq_length) and sbp signature [S(0), B].
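
A sketch of computing the loss on dummy logits and targets. On a single device the sbp choices below are degenerate; with real tensor parallelism the logits would carry the documented vocab-parallel split S(2). Shapes and the placement helper are illustrative assumptions.

    import oneflow as flow
    from libai.layers import ParallelCrossEntropyLoss
    from libai.utils import distributed as dist

    loss_fn = ParallelCrossEntropyLoss()
    placement = dist.get_layer_placement(0)
    batch_size, seq_length, vocab_size = 2, 16, 1000

    logits = flow.randn(batch_size, seq_length, vocab_size).to_global(
        placement=placement, sbp=flow.sbp.split(0)
    )
    target = flow.randint(0, vocab_size, (batch_size, seq_length)).to_global(
        placement=placement, sbp=flow.sbp.split(0)
    )
    loss = loss_fn(logits, target)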

class libai.layers.PatchEmbedding(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, *, layer_idx=0)[source]

2D Image to Patch Embedding

Parameters
  • img_size – size of the input image. Defaults to 224.

  • patch_size – size of each embedded patch. Defaults to 16.

  • in_chans – number of input channels. Defaults to 3.

  • embed_dim – dimension of the embedded patch. Defaults to 768.

  • norm_layer – normalization layer applied to the patch embedding; if None, no normalization is applied. Defaults to None.

  • flatten – whether to flatten the patch embedding or keep the 2-D spatial shape. Defaults to True.

  • layer_idx – the layer index, which determines this layer's placement; used in pipeline parallelism. Defaults to 0.
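
A sketch of turning a batch of images into patch tokens; the shapes, sbp signature, and placement helper are illustrative assumptions.

    import oneflow as flow
    from libai.layers import PatchEmbedding
    from libai.utils import distributed as dist

    patch_embed = PatchEmbedding(img_size=224, patch_size=16, in_chans=3, embed_dim=768)

    images = flow.randn(2, 3, 224, 224).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.split(0)
    )
    patches = patch_embed(images)  # expected shape: [2, (224 // 16) ** 2, 768] = [2, 196, 768]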

class libai.layers.RMSLayerNorm(normalized_shape, eps=1e-06, layer_idx=0)[source]

T5 uses a layer norm that only scales and does not shift, also known as Root Mean Square Layer Normalization: the variance is computed without subtracting the mean, and there is no bias. For more details, see https://arxiv.org/abs/1910.07467.

Parameters
  • normalized_shape – input shape over which normalization is applied, given as the trailing dimensions of the expected input.

  • eps – a value added to the denominator for numerical stability. Defaults to 1e-6.

  • layer_idx – the layer index, which determines this layer's placement; used in pipeline parallelism. Defaults to 0.
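
A minimal sketch, analogous to LayerNorm above; the input construction is an illustrative assumption.

    import oneflow as flow
    from libai.layers import RMSLayerNorm
    from libai.utils import distributed as dist

    rms_norm = RMSLayerNorm(normalized_shape=768, eps=1e-6, layer_idx=0)

    x = flow.randn(2, 16, 768).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.split(0)
    )
    out = rms_norm(x)  # scaling only: no mean subtraction and no bias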

class libai.layers.SinePositionalEmbedding(num_embeddings, embedding_dim)[source]

Construct the sinusoidal positional embeddings.

Parameters
  • num_embeddings – size of vocabulary.

  • embedding_dim – dimension of embeddings.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
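
A sketch under a stated assumption: that, like Embedding above, the module's forward maps position ids to embedding vectors (this is not confirmed by the signature shown here). The shapes, sbp, and placement helper are also illustrative.

    import oneflow as flow
    from libai.layers import SinePositionalEmbedding
    from libai.utils import distributed as dist

    sine_pos = SinePositionalEmbedding(num_embeddings=512, embedding_dim=768)

    position_ids = flow.arange(16).unsqueeze(0).expand(2, 16).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.broadcast
    )
    pos = sine_pos(position_ids)  # fixed (non-trainable) sinusoidal embeddings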

class libai.layers.TransformerLayer(hidden_size, ffn_hidden_size, num_attention_heads, is_decoder=False, attention_dropout_prob=0.0, output_dropout_prob=0.0, drop_path_prob=0.0, layernorm_epsilon=1e-05, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, apply_residual_post_layernorm=False, attn_mask_type=<AttnMaskType.padding: 1>, *, layer_idx=0)[source]

A single transformer layer.

A transformer layer takes an input of size [bsz, seq_length, hidden_size] and returns an output of the same size. The input and output have the same sbp signature, (S(0), B).

Parameters
  • hidden_size – size of hidden state.

  • ffn_hidden_size – size of the feed-forward neural network.

  • num_attention_heads – number of attention heads.

  • is_decoder – used to specify whether this is a transformer encoder layer or a transformer decoder layer. Default: False.

  • attention_dropout_prob – dropout probability of attention weights.

  • output_dropout_prob – dropout probability of output.

  • layernorm_epsilon – epsilon used in layernorm layer. Default: 1e-5.

  • init_method – method to initialize the input layer weights.

  • output_layer_init_method – method to initialize the output layer weights. If None, use init_method.

  • bias_gelu_fusion – whether to fuse bias adding and GeLU activation. Default: False.

  • bias_dropout_fusion – whether to fuse bias adding and dropout. Default: False.

  • scale_mask_softmax_fusion – whether to fuse scale, mask and softmax. Default: False.

  • apply_query_key_layer_scaling – if True, scale the attention scores by the layer index. Default: False.

  • apply_residual_post_layernorm – if True, use the original BERT residual connection ordering; otherwise, use the Megatron-BERT residual connection, which is more stable when scaling model size (https://arxiv.org/pdf/1909.08053.pdf). Default: False.

  • layer_idx – the layer index, which determines the placement.

forward(hidden_states, attention_mask=None, encoder_states=None, encoder_attention_mask=None, past_key_value=None, use_cache=False)[source]
Parameters
  • hidden_states – shape is (batch_size, seq_length, hidden_size), sbp signature is (S(0), B).

  • attention_mask – the combination of the key padding mask and the causal mask of the hidden states, with shape (batch_size, 1, seq_length, seq_length) and sbp signature (S(0), B).

  • encoder_states – encoder output with shape (batch_size, seq_length, hidden_size) and the sbp signature is (S(0), B), which will be used in cross attention.

  • encoder_attention_mask – key padding mask of encoder states with shape (batch_size, 1, seq_length, seq_length) and the sbp signature is (S(0), B).

  • past_key_value – tuple of key and value, each with shape (seq_length, bsz, num_heads, head_size). For a decoder layer, past_key_value contains the states from both self-attention and cross-attention.

  • use_cache – set to True when the model is in the inference phase and uses incremental decoding.
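
An encoder-style forward-pass sketch using the documented arguments. The all-ones mask (nothing masked), shapes, sbp signatures, and the placement helper are illustrative assumptions.

    import oneflow as flow
    from libai.layers import TransformerLayer
    from libai.utils import distributed as dist

    layer = TransformerLayer(
        hidden_size=768, ffn_hidden_size=3072, num_attention_heads=12, layer_idx=0
    )
    placement = dist.get_layer_placement(0)

    hidden_states = flow.randn(2, 16, 768).to_global(placement=placement, sbp=flow.sbp.split(0))
    attention_mask = flow.ones(2, 1, 16, 16).to_global(placement=placement, sbp=flow.sbp.split(0))

    out = layer(hidden_states, attention_mask=attention_mask)  # same shape and sbp as the input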

class libai.layers.VocabEmbedding(num_embeddings, embedding_dim, padding_idx=None, init_method=<function xavier_normal_>, amp_enabled=False)[source]

Construct the word embeddings, which may be split along the vocabulary dimension.

Parameters
  • num_embeddings – size of vocabulary.

  • embedding_dim – dimension of embeddings.

  • padding_idx – pad index. Defaults to None.

  • init_method – method to initialize weights. Defaults to flow.nn.init.xavier_normal_.

  • amp_enabled – fp16 option for embedding weight. Defaults to False.

extra_repr() → str[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
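
A sketch of embedding a batch of token ids; the vocabulary size, shapes, sbp signature, and placement helper are illustrative assumptions.

    import oneflow as flow
    from libai.layers import VocabEmbedding
    from libai.utils import distributed as dist

    vocab_embed = VocabEmbedding(num_embeddings=30522, embedding_dim=768)

    input_ids = flow.randint(0, 30522, (2, 16)).to_global(
        placement=dist.get_layer_placement(0), sbp=flow.sbp.split(0)
    )
    word_embeddings = vocab_embed(input_ids)  # expected shape: [2, 16, 768]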

libai.layers.build_activation(activation: Optional[libai.layers.activation.Activation])[source]

Fetch an activation layer by name; e.g., build_activation("gelu") returns an nn.GELU() module.
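
A minimal sketch based on the example in the description above; the input shape is illustrative.

    import oneflow as flow
    from libai.layers import build_activation

    gelu = build_activation("gelu")  # returns an nn.GELU() module, as documented
    x = flow.randn(2, 16, 768)
    y = gelu(x)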

libai.layers.drop_path(x, drop_prob: float = 0.5, training: bool = False, scale_by_keep: bool = True)[source]

Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
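
A functional-form sketch; the shapes and the assumption that plain local tensors are accepted are illustrative.

    import oneflow as flow
    from libai.layers import drop_path

    x = flow.randn(4, 197, 768)                      # assumed shape: [batch, tokens, hidden]
    y = drop_path(x, drop_prob=0.1, training=True)   # per-sample stochastic depth
    z = drop_path(x, drop_prob=0.1, training=False)  # typically a no-op at inference time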