libai.layers¶

class libai.layers.DropPath(drop_prob: float = 0.0, scale_by_keep: bool = True)[source]¶: Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

class libai.layers.Embedding(num_embeddings, embedding_dim, padding_idx=None, init_method=<function xavier_normal_>, amp_enabled=False, dtype=oneflow.float32, layer_idx=0)[source]¶

Construct the trainable embedding module, which does not support parallelization. This can be used for positional embedding and token type embedding.

Parameters

num_embeddings – size of vocabulary.
embedding_dim – dimension of embeddings.
padding_idx – pad index. Defaults to None.
init_method – method to initialize weights. Defaults to flow.nn.init.xavier_normal_.
amp_enabled – fp16 option for embedding weight. Defaults to False.

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

class libai.layers.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, *, layer_idx=0)[source]¶

Applies Layer Normalization over a mini-batch of inputs in 1D parallelism.

Parameters

normalized_shape – input shape from an expected input of size.
eps – a value added to the denominator for numerical stability. Defaults to 1e-5.
elementwise_affine – a boolean value that when set to True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases). Default: True.
bias – If set to False, the layer will not learn an additive bias. Defaults to True.
layer_idx – a layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
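
As a rough illustration of the normalization itself (not LiBai's implementation, and ignoring parallelism and placement), the sketch below normalizes a single vector over its last dimension in plain Python. The function name `layer_norm_1d` and its arguments are hypothetical and chosen only for this example.

```python
import math

def layer_norm_1d(x, weight=None, bias=None, eps=1e-5):
    """Normalize one vector: (x - mean) / sqrt(var + eps) * weight + bias."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance over the vector
    inv_std = 1.0 / math.sqrt(var + eps)
    # Affine parameters start at ones (weight) and zeros (bias), as described above.
    weight = weight if weight is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]

normalized = layer_norm_1d([1.0, 2.0, 3.0, 4.0])
```

With the default affine parameters, the result has approximately zero mean and unit variance.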

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

libai.layers.Linear¶: alias of libai.layers.linear.Linear1D

class libai.layers.Linear1D(in_features, out_features, bias=True, parallel='data', init_method=<function xavier_normal_>, skip_bias_add=False, dtype=oneflow.float32, *, layer_idx=0)[source]¶

Linear layer with 1D parallelism which includes column parallelism and row parallelism. The linear layer is defined as \(y = xA^T + b\).

In column parallelism, \(A^T\) is parallelized along the second dimension as \(A^T = [A_1, ..., A_p]\).

In row parallelism, \(A^T\) is parallelized along the first dimension and \(x\) along its second dimension as:

\[A^T = \begin{bmatrix} A_1 \\ \vdots \\ A_p \end{bmatrix}, \quad x = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix}\]
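
The two partitioning schemes can be checked numerically on one process. The following is a minimal sketch in plain Python, assuming two hypothetical devices and omitting the bias. It is not how LiBai distributes tensors, only a demonstration that both splits reproduce \(y = xA^T\). The helper `matmul` and all variable names are made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]          # input, shape (2, 4)
a_t = [[1.0, 0.0, 2.0, 1.0],
       [0.0, 1.0, 1.0, 3.0],
       [2.0, 1.0, 0.0, 1.0],
       [1.0, 2.0, 1.0, 0.0]]        # A^T, shape (4, 4)

reference = matmul(x, a_t)          # unpartitioned result
p = 2                               # number of hypothetical devices

# Column parallelism: split A^T along its second dimension, then
# concatenate the per-device outputs along the feature dimension.
col_shards = [[row[r * 2:(r + 1) * 2] for row in a_t] for r in range(p)]
col_outputs = [matmul(x, shard) for shard in col_shards]
col_result = [col_outputs[0][i] + col_outputs[1][i] for i in range(len(x))]

# Row parallelism: split A^T along its first dimension and x along its
# second dimension, then sum the partial products (an all-reduce in practice).
row_shards = [a_t[r * 2:(r + 1) * 2] for r in range(p)]
x_shards = [[row[r * 2:(r + 1) * 2] for row in x] for r in range(p)]
partials = [matmul(xs, ws) for xs, ws in zip(x_shards, row_shards)]
row_result = [[partials[0][i][j] + partials[1][i][j] for j in range(4)]
              for i in range(2)]
```

Column parallelism needs only a concatenation of outputs, whereas row parallelism needs a sum across devices.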

Parameters

in_features – size of each input sample.
out_features – size of each output sample.
bias – If set to False, the layer will not learn an additive bias. Defaults to True.
parallel – Parallel mode. Defaults to “data”.
init_method – method to initialize weight. Defaults to nn.init.xavier_normal_().
skip_bias_add – skip adding bias but instead return it, so that adding bias can be fused with other elementwise operations. Defaults to False.
layer_idx – A layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
dtype – the dtype of weight. Defaults to flow.float32

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

class libai.layers.MLP(hidden_size, ffn_hidden_size, output_dropout_prob=0.0, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_gelu_fusion=False, bias_dropout_fusion=False, *, layer_idx=0)[source]¶

MLP takes an input with hidden size h, projects it to the intermediate hidden dimension, applies a GELU transformation, and projects the result back to hidden size h.

Parameters

hidden_size – size of each input and output sample.
ffn_hidden_size – size of each intermediate sample.
output_dropout_prob – Output dropout probability. Defaults to 0.0.
init_method – method to initialize the first linear weight. Defaults to nn.init.xavier_normal_().
output_layer_init_method – method to initialize the second linear weight. If set to None, it will use init_method instead. Defaults to None.
bias_gelu_fusion – If set to True, it will fuse bias adding and elementwise gelu activation. Defaults to False.
bias_dropout_fusion – If set to True, it will fuse bias adding and dropout. Defaults to False.
layer_idx – A layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
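
The data flow described for MLP can be sketched for a single input vector as follows. This is an illustration under assumptions: it uses the exact (erf-based) GELU, skips dropout and operator fusion, and runs on one process. The helpers `gelu`, `linear`, and `mlp_forward` and the weight values are hypothetical.

```python
import math

def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y = x W^T + b for one vector x; weight has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def mlp_forward(x, w1, b1, w2, b2):
    hidden = linear(x, w1, b1)             # hidden_size -> ffn_hidden_size
    hidden = [gelu(v) for v in hidden]     # elementwise activation
    return linear(hidden, w2, b2)          # ffn_hidden_size -> hidden_size

hidden_size, ffn_hidden_size = 2, 4
w1 = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.5]]
b1 = [0.0] * ffn_hidden_size
w2 = [[0.1, -0.1, 0.2, 0.3], [0.0, 0.2, -0.3, 0.1]]
b2 = [0.0] * hidden_size
out = mlp_forward([1.0, -1.0], w1, b1, w2, b2)
```

The output has the same length as the input, matching the "project back into h" step.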

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

class libai.layers.MultiheadAttention(hidden_size, num_attention_heads, is_cross_attention=False, attention_dropout_prob=0.0, output_dropout_prob=0.0, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, attn_mask_type=<AttnMaskType.padding: 1>, *, layer_idx=0)[source]¶

Multi-head attention layer that supports both self-attention and cross-attention.

Parameters

hidden_size – size of hidden state.
num_attention_heads – number of attention heads.
is_cross_attention – used to specify whether it is self attention or cross attention. Defaults to False.
attention_dropout_prob – dropout probability of attention weights. Defaults to 0.0.
output_dropout_prob – dropout probability of output. Defaults to 0.0.
init_method – method to initialize the input layer weights. Defaults to init.xavier_normal_.
output_layer_init_method – method to initialize the output layer weights. If None, use init_method.
bias_dropout_fusion – whether to fuse add bias and dropout. Defaults to False.
scale_mask_softmax_fusion – whether to fuse scale, mask and softmax. Defaults to False.
apply_query_key_layer_scaling – if True, scale the attention score by layer index. Defaults to False.
layer_idx – a layer_idx sign which determines the placements. It will be used in pipeline parallelism. Defaults to 0.

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(hidden_states: oneflow.Tensor, encoder_states: Optional[oneflow.Tensor] = None, attention_mask: Optional[oneflow.Tensor] = None, past_key_value: Optional[Tuple[oneflow.Tensor, oneflow.Tensor]] = None, use_cache: bool = False)[source]¶

Parameters

hidden_states (flow.Tensor) – shape is [bsz, tgt_len, hidden_size].
encoder_states (flow.Tensor, optional) – shape is [bsz, src_len, hidden_size]. Defaults to None.
attention_mask (flow.Tensor, optional) – shape is [bsz, 1, tgt_len, src_len]. It should be the combination of the padding mask and the causal mask, depending on where the layer is used: for self-attention in the encoder, it is the padding mask of the source input; for self-attention in the decoder, it is the combination of the padding mask of the target input and the causal mask; for cross-attention in the decoder, it is the padding mask of the source input. Defaults to None.
past_key_value (Tuple[flow.Tensor, flow.Tensor], optional) – tuple of key and value, each shape is [bsz, num_heads, src_len, head_size]. Defaults to None.
use_cache (bool, optional) – set to True when the model is in the inference phase and uses incremental decoding. Defaults to False.
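
To make the decoder self-attention case concrete, the sketch below builds the [tgt_len, src_len] slice of such a mask for one sample. It assumes that 1 means "may attend" and 0 means "masked", and that padding is applied to key positions only. Both are assumptions for illustration, as are the helper names `causal_mask` and `combine_masks`.

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def combine_masks(padding_mask, causal):
    """Element-wise AND of a per-key padding mask with a causal mask."""
    return [[causal[i][j] & padding_mask[j] for j in range(len(padding_mask))]
            for i in range(len(causal))]

padding_mask = [1, 1, 1, 0]   # the last target token is padding
decoder_mask = combine_masks(padding_mask, causal_mask(4))
```

For the encoder and for cross-attention, only the padding part would be used, broadcast over the query dimension.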

class libai.layers.ParallelCrossEntropyLoss[source]¶

This criterion acts like CrossEntropyLoss, except that it executes the cross entropy loss computation in a distributed manner across different GPUs.

forward(logits: oneflow.Tensor, target: oneflow.Tensor)[source]¶

Function for the distributed cross entropy.

Parameters

logits (flow.Tensor) – vocab_parallel_logits with shape (batch_size, seq_length, vocab_size) and sbp signature is [S(0), S(2)].
target (flow.Tensor) – target with shape (batch_size, seq_length) and sbp signature is [S(0), B].
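
The idea behind computing cross entropy over vocabulary-sharded logits can be sketched without any distributed runtime. In the hedged example below, each list in `shards` stands for the logits held by one device, and only scalar reductions (a max and a sum) are shared between them. This is an assumed formulation for one token, not LiBai's implementation, and the function names are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Reference: -log softmax(logits)[target], with the usual max shift."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def vocab_parallel_cross_entropy(shards, target):
    """Same loss computed from vocabulary shards, combining only scalars."""
    global_max = max(max(s) for s in shards)             # all-reduce (max)
    sum_exp = sum(sum(math.exp(v - global_max) for v in s)
                  for s in shards)                       # all-reduce (sum)
    offset = 0
    target_logit = 0.0
    for s in shards:
        if offset <= target < offset + len(s):           # one shard owns the target
            target_logit = s[target - offset]
        offset += len(s)
    return global_max + math.log(sum_exp) - target_logit

logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5]
shards = [logits[:3], logits[3:]]                        # vocabulary split in two
loss_full = cross_entropy(logits, 4)
loss_sharded = vocab_parallel_cross_entropy(shards, 4)
```

Both values agree up to floating-point rounding, so the full logits never need to be gathered on one device.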

class libai.layers.PatchEmbedding(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, *, layer_idx=0)[source]¶

2D Image to Patch Embedding

Parameters

img_size – size of input image. Defaults to 224.
patch_size – embedded patch size. Defaults to 16.
in_chans – number of input channels. Defaults to 3.
embed_dim – dimension of embedded patch. Defaults to 768.
norm_layer – optional normalization layer applied to the patch embedding. Defaults to None.
flatten – whether to flatten the patch embedding or keep the 2-D shape. Defaults to True.
layer_idx – A layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
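
For a sense of the resulting sequence length, the sketch below computes the patch grid for a square image, assuming non-overlapping patches (stride equal to patch_size), as in the usual ViT layout. The helper `patch_grid` is hypothetical.

```python
def patch_grid(img_size=224, patch_size=16):
    """Return (patches per side, total patches) for a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size in this sketch")
    per_side = img_size // patch_size
    return per_side, per_side * per_side

per_side, num_patches = patch_grid()
```

Under that assumption, the default arguments give a 14 x 14 grid, that is, 196 patches of dimension embed_dim when flatten is True.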

class libai.layers.RMSLayerNorm(normalized_shape, eps=1e-06, layer_idx=0)[source]¶

T5 uses a layer_norm which only scales and doesn’t shift, also known as Root Mean Square Layer Normalization; the variance is therefore calculated without the mean and there is no bias. For more details, see https://arxiv.org/abs/1910.07467.

Parameters

normalized_shape – input shape from an expected input of size.
eps – a value added to the denominator for numerical stability. Defaults to 1e-6.
layer_idx – a layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
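
The scale-only normalization can be written in a few lines. The sketch below follows the formula from the referenced paper for a single vector and is an illustration only. The function name `rms_layer_norm` and the explicit `weight` argument are assumptions for this example.

```python
import math

def rms_layer_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps); no mean subtraction, no bias."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

rms_out = rms_layer_norm([3.0, 4.0], weight=[1.0, 1.0])
```

With unit weights, the output has a mean square of approximately 1, while its mean is generally not zero.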

class libai.layers.SinePositionalEmbedding(num_embeddings, embedding_dim)[source]¶

Construct the sinusoidal positional embeddings.

Parameters

num_embeddings – size of vocabulary.
embedding_dim – dimension of embeddings.
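
A common formulation of sinusoidal positional embeddings is sketched below. The interleaved layout (sine in even columns, cosine in odd columns) and the base of 10000 follow the original Transformer paper and are assumptions here; the layout actually used by this class may differ.

```python
import math

def sine_positional_embedding(num_embeddings, embedding_dim):
    """Table of shape (num_embeddings, embedding_dim): even columns sin, odd columns cos."""
    table = []
    for pos in range(num_embeddings):
        row = []
        for i in range(embedding_dim):
            # Columns 2k and 2k + 1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / embedding_dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sine_positional_embedding(num_embeddings=8, embedding_dim=4)
```

The table is fixed rather than learned, so it has no trainable parameters.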

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

class libai.layers.TransformerLayer(hidden_size, ffn_hidden_size, num_attention_heads, is_decoder=False, attention_dropout_prob=0.0, output_dropout_prob=0.0, drop_path_prob=0.0, layernorm_epsilon=1e-05, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, apply_residual_post_layernorm=False, attn_mask_type=<AttnMaskType.padding: 1>, *, layer_idx=0)[source]¶

A single transformer layer.

Transformer layer takes an input with size [bsz, seq_length, hidden size] and returns an output of the same size. The input and output have the same sbp signature, (S(0), B).

Parameters

hidden_size – size of hidden state.
ffn_hidden_size – size of the feed-forward neural network.
num_attention_heads – number of attention heads.
is_decoder – used to specify whether this is transformer encoder layer or transformer decoder layer. Default: False.
attention_dropout_prob – dropout probability of attention weights.
output_dropout_prob – dropout probability of output.
layernorm_epsilon – epsilon used in layernorm layer. Default: 1e-5.
init_method – method to initialize the input layer weights.
output_layer_init_method – method to initialize the output layer weights. If None, use init_method.
bias_gelu_fusion – whether to fuse add bias and gelu. Default: False.
bias_dropout_fusion – whether to fuse add bias and dropout. Default: False.
scale_mask_softmax_fusion – whether to fuse scale, mask and softmax. Default: False.
apply_query_key_layer_scaling – if true, scale the attention score by layer index. Default: False.
apply_residual_post_layernorm – if true, use the original BERT residual connection ordering. Otherwise, use the Megatron BERT residual connection introduced in https://arxiv.org/pdf/1909.08053.pdf, which is more stable when scaling model size. Default: False.
layer_idx – the layer index, which determines the placement.

forward(hidden_states, attention_mask=None, encoder_states=None, encoder_attention_mask=None, past_key_value=None, use_cache=False)[source]¶

Parameters

hidden_states – shape is (batch_size, seq_length, hidden_size), sbp signature is (S(0), B).
attention_mask – the combination of key padding mask and causal mask of hidden states with shape (batch_size, 1, seq_length, seq_length) and the sbp signature is (S(0), B).
encoder_states – encoder output with shape (batch_size, seq_length, hidden_size) and the sbp signature is (S(0), B), which will be used in cross attention.
encoder_attention_mask – key padding mask of encoder states with shape (batch_size, 1, seq_length, seq_length) and the sbp signature is (S(0), B).
past_key_value – tuple of key and value, each shape is (seq_length, bsz, num_heads, head_size). For a decoder layer, past_key_value contains the states from both self attention and cross attention.
use_cache – it will be set to True when the model is in the inference phase and used for incremental decoding.

class libai.layers.VocabEmbedding(num_embeddings, embedding_dim, padding_idx=None, init_method=<function xavier_normal_>, amp_enabled=False)[source]¶

Construct the word embeddings, which may be split along vocabulary dimension.

Parameters

num_embeddings – size of vocabulary.
embedding_dim – dimension of embeddings.
padding_idx – pad index. Defaults to None.
init_method – method to initialize weights. Defaults to flow.nn.init.xavier_normal_.
amp_enabled – fp16 option for embedding weight. Defaults to False.

extra_repr() → str[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

libai.layers.build_activation(activation: Optional[libai.layers.activation.Activation])[source]¶: Fetches an activation layer by name, e.g., build_activation("gelu") returns an nn.GELU() module.

libai.layers.drop_path(x, drop_prob: float = 0.5, training: bool = False, scale_by_keep: bool = True)[source]¶: Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
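
Stochastic depth can be illustrated on a batch represented as a list of per-sample lists. The sketch below mirrors the arguments of drop_path but is a plain-Python approximation, not the library code. The extra `rng` parameter is hypothetical and exists only to make the example reproducible.

```python
import random

def drop_path(batch, drop_prob=0.5, training=False, scale_by_keep=True, rng=random):
    """Zero whole samples with probability drop_prob; rescale survivors by 1 / keep_prob."""
    if drop_prob == 0.0 or not training:
        return batch                          # identity at inference time
    keep_prob = 1.0 - drop_prob
    scale = 1.0 / keep_prob if (scale_by_keep and keep_prob > 0.0) else 1.0
    out = []
    for sample in batch:
        keep = rng.random() < keep_prob       # one Bernoulli draw per sample
        out.append([v * scale if keep else 0.0 for v in sample])
    return out

batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
eval_out = drop_path(batch, drop_prob=0.5, training=False)
train_out = drop_path(batch, drop_prob=0.5, training=True, rng=random.Random(0))
```

Each sample is either dropped entirely or kept and scaled, so the expected value of the output matches the input when scale_by_keep is True.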