libai.layers
-
class libai.layers.DropPath(drop_prob: float = 0.0, scale_by_keep: bool = True)
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
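As a rough illustration of per-sample stochastic depth, here is a minimal pure-Python sketch. It is not LiBai's implementation: the function name `drop_path`, the `training` flag, and the `rng` argument are assumptions made for this example, and nested lists stand in for tensors.

```python
import random

def drop_path(batch, drop_prob=0.0, scale_by_keep=True, training=True, rng=random):
    """Zero out whole samples (rows) with probability drop_prob.

    batch is a list of rows, one row per sample (hypothetical stand-in for a tensor).
    """
    if drop_prob == 0.0 or not training:
        return [list(row) for row in batch]
    keep_prob = 1.0 - drop_prob
    out = []
    for row in batch:
        # One random decision per sample, shared by every element of that sample.
        keep = 1.0 if rng.random() < keep_prob else 0.0
        if scale_by_keep and keep_prob > 0.0:
            # Rescale kept samples so the expected value is unchanged.
            keep /= keep_prob
        out.append([v * keep for v in row])
    return out
```

With `drop_prob=0.5` and `scale_by_keep=True`, every row in the output is either all zeros or the input row multiplied by 2.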
-
class libai.layers.Embedding(num_embeddings, embedding_dim, padding_idx=None, init_method=<function xavier_normal_>, amp_enabled=False, dtype=oneflow.float32, layer_idx=0)
Construct the trainable embedding module, which does not support parallelization. This can be used for positional embedding and token type embedding.
- Parameters
num_embeddings – size of vocabulary.
embedding_dim – dimension of embeddings.
padding_idx – pad index. Defaults to None.
init_method – method to initialize weights. Defaults to flow.nn.init.xavier_normal_.
amp_enabled – fp16 option for embedding weight. Defaults to False.
-
class libai.layers.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, *, layer_idx=0)
Applies Layer Normalization over a mini-batch of inputs in 1D parallelism.
- Parameters
normalized_shape – input shape from an expected input of size.
eps – a value added to the denominator for numerical stability. Defaults to 1e-5.
elementwise_affine – a boolean value that, when set to True, gives this module learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases). Defaults to True.
bias – If set to False, the layer will not learn an additive bias. Defaults to True.
layer_idx – a layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
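The role of eps and of the affine parameters can be sketched for a single feature vector in plain Python. This is only an illustration of the standard layer normalization formula, not LiBai's code; the function name `layer_norm` and the list-based inputs are assumptions for the example.

```python
import math

def layer_norm(x, weight=None, bias=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance.

    weight and bias play the role of the learnable per-element affine
    parameters; passing None skips the corresponding step.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)            # eps guards against division by zero
    out = [(v - mean) * inv_std for v in x]
    if weight is not None:
        out = [o * w for o, w in zip(out, weight)]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out
```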
-
libai.layers.Linear
alias of libai.layers.linear.Linear1D
-
class libai.layers.Linear1D(in_features, out_features, bias=True, parallel='data', init_method=<function xavier_normal_>, skip_bias_add=False, dtype=oneflow.float32, *, layer_idx=0)
Linear layer with 1D parallelism which includes column parallelism and row parallelism. The linear layer is defined as \(y = xA^T + b\).
In column parallelism, \(A^T\) is parallelized along the second dimension as \(A^T = [A_1, ..., A_p]\).
In row parallelism, \(A^T\) is parallelized along the first dimension and \(x\) along its second dimension as:
\[A^T = \begin{bmatrix} A_1 \\ \vdots \\ A_p \end{bmatrix}, \quad x = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix}\]
- Parameters
in_features – size of each input sample.
out_features – size of each output sample.
bias – If set to False, the layer will not learn an additive bias. Defaults to True.
parallel – Parallel mode. Defaults to “data”.
init_method – method to initialize weight. Defaults to nn.init.xavier_normal_().
skip_bias_add – skip adding bias but instead return it, so that adding bias can be fused with other elementwise operations. Defaults to False.
layer_idx – A layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
dtype – the dtype of weight. Defaults to flow.float32.
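The two splitting schemes can be checked on a toy example. The sketch below uses nested lists on a single process to simulate p shards; the helper names `matmul`, `column_parallel`, and `row_parallel` are made up for this illustration and are not part of LiBai. It assumes the split dimension is divisible by p and omits the bias.

```python
def matmul(x, w):
    """x: [n, k], w: [k, m] -> [n, m], all as nested lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def column_parallel(x, w, p):
    """Split w (= A^T) along its second dimension; concatenate the shard outputs."""
    step = len(w[0]) // p
    outs = [matmul(x, [row[s * step:(s + 1) * step] for row in w]) for s in range(p)]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

def row_parallel(x, w, p):
    """Split w along its first dimension and x along its second; sum the partial outputs."""
    step = len(w) // p
    partials = [matmul([row[s * step:(s + 1) * step] for row in x],
                       w[s * step:(s + 1) * step]) for s in range(p)]
    return [[sum(part[i][j] for part in partials) for j in range(len(w[0]))]
            for i in range(len(x))]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [1, 1], [2, -1]]  # plays the role of A^T, shape [4, 2]
full = matmul(x, w)
```

Both `column_parallel(x, w, 2)` and `row_parallel(x, w, 2)` reproduce `full`. In a real distributed setting the concatenation corresponds to a gather and the summation to an all-reduce across devices.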
-
class libai.layers.MLP(hidden_size, ffn_hidden_size, output_dropout_prob=0.0, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_gelu_fusion=False, bias_dropout_fusion=False, *, layer_idx=0)
MLP takes an input with hidden size h, projects it to the intermediate hidden dimension, applies a GELU activation, and projects it back to hidden size h.
- Parameters
hidden_size – size of each input and output sample.
ffn_hidden_size – size of each intermediate sample.
output_dropout_prob – Output dropout probability. Defaults to 0.0.
init_method – method to initialize the first linear weight. Defaults to nn.init.xavier_normal_().
output_layer_init_method – method to initialize the second linear weight. If set to None, it will use init_method instead. Defaults to None.
bias_gelu_fusion – If set to True, it will fuse bias adding and elementwise gelu activation. Defaults to False.
bias_dropout_fusion – If set to True, it will fuse bias adding and dropout. Defaults to False.
layer_idx – A layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
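The project, activate, project-back data flow can be sketched for a single token in plain Python. This is an illustration only: the names `gelu`, `linear`, and `mlp` are chosen for the example, weights are stored as [out_features, in_features] lists, the exact (erf-based) GELU is assumed, and dropout and the fusion options are left out.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y = x A^T + b for one input vector; weight has shape [out, in]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]

def mlp(x, w1, b1, w2, b2):
    """hidden_size -> ffn_hidden_size -> GELU -> hidden_size."""
    intermediate = [gelu(v) for v in linear(x, w1, b1)]
    return linear(intermediate, w2, b2)
```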
-
class libai.layers.MultiheadAttention(hidden_size, num_attention_heads, is_cross_attention=False, attention_dropout_prob=0.0, output_dropout_prob=0.0, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, attn_mask_type=<AttnMaskType.padding: 1>, *, layer_idx=0)
Multi-head attention layer that supports both self attention and cross attention.
- Parameters
hidden_size – size of hidden state.
num_attention_heads – number of attention heads.
is_cross_attention – used to specify whether it is self attention or cross attention. Defaults to False.
attention_dropout_prob – dropout probability of attention weights. Defaults to 0.0.
output_dropout_prob – dropout probability of output. Defaults to 0.0.
init_method – method to initialize the input layer weights. Defaults to init.xavier_normal_.
output_layer_init_method – method to initialize the output layer weights. If None, use init_method.
bias_dropout_fusion – whether to fuse add bias and dropout. Defaults to False.
scale_mask_softmax_fusion – whether to fuse scale, mask and softmax. Defaults to False.
apply_query_key_layer_scaling – if True, scale the attention score by layer index. Defaults to False.
layer_idx – a layer_idx sign which determines the placements. It will be used in pipeline parallelism. Defaults to 0.
-
extra_repr() → str
Set the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward(hidden_states: oneflow.Tensor, encoder_states: Optional[oneflow.Tensor] = None, attention_mask: Optional[oneflow.Tensor] = None, past_key_value: Optional[Tuple[oneflow.Tensor, oneflow.Tensor]] = None, use_cache: bool = False)
- Parameters
hidden_states (flow.Tensor) – shape is [bsz, tgt_len, hidden_size].
encoder_states (flow.Tensor, optional) – shape is [bsz, src_len, hidden_size]. Defaults to None.
attention_mask (flow.Tensor, optional) – shape is [bsz, 1, tgt_len, src_len]. It should be the combination of the padding mask and the causal mask. For self-attention in the encoder, it is the padding mask of the source input. For self-attention in the decoder, it is the combination of the padding mask of the target input and the causal mask. For cross-attention in the decoder, it is the padding mask of the source input. Defaults to None.
past_key_value (Tuple[flow.Tensor, flow.Tensor], optional) – tuple of key and value, each shape is [bsz, num_heads, src_len, head_size]. Defaults to None.
use_cache (bool, optional) – set to True when the model is in the inference phase and is used for incremental decoding. Defaults to False.
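To make the attention_mask description concrete, the sketch below builds a decoder self-attention mask of shape [bsz, 1, tgt_len, src_len] from a padding mask and a causal (lower-triangular) mask. The helper names and the convention that 1 means "may attend" and 0 means "masked out" are assumptions for this example; check the convention LiBai expects before reusing it.

```python
def causal_mask(tgt_len, src_len):
    """Position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(src_len)] for i in range(tgt_len)]

def combine_masks(padding_mask, tgt_len):
    """padding_mask: [bsz, src_len], 1 for real tokens and 0 for padding.

    Returns a nested list of shape [bsz, 1, tgt_len, src_len] in which an
    entry is 1 only if both the causal rule and the padding mask allow it.
    """
    out = []
    for pad in padding_mask:
        causal = causal_mask(tgt_len, len(pad))
        out.append([[[c * p for c, p in zip(row, pad)] for row in causal]])
    return out
```

For an encoder, or for decoder cross-attention, only the padding part applies, so the causal factor would be dropped.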
-
class libai.layers.ParallelCrossEntropyLoss
This criterion acts like CrossEntropyLoss, except that it computes the cross entropy loss in a distributed manner across different GPUs.
-
forward(logits: oneflow.Tensor, target: oneflow.Tensor)
Function for the distributed cross entropy.
- Parameters
logits (flow.Tensor) – vocab_parallel_logits with shape (batch_size, seq_length, vocab_size) and sbp signature is [S(0), S(2)].
target (flow.Tensor) – target with shape (batch_size, seq_length) and sbp signature is [S(0), B].
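One common way to compute cross entropy when the vocabulary dimension is sharded is to combine per-shard maxima and exponential sums instead of gathering the full logits. The single-process sketch below illustrates that idea for one token; it is an assumption about the general technique, not a description of how LiBai or OneFlow implements this criterion, and the function name is hypothetical.

```python
import math

def vocab_parallel_cross_entropy(logit_shards, target):
    """Cross entropy for one token whose logits are split along the vocab axis.

    logit_shards: list of lists, one contiguous vocab slice per (simulated) device.
    target: index of the correct class in the full vocabulary.
    """
    # Step 1: global max for numerical stability (an all-reduce(max) in practice).
    global_max = max(max(shard) for shard in logit_shards)
    # Step 2: global sum of exponentials (an all-reduce(sum) in practice).
    sum_exp = sum(sum(math.exp(v - global_max) for v in shard) for shard in logit_shards)
    # Step 3: only the shard that owns the target index contributes its logit.
    offset, target_logit = 0, 0.0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)
    return math.log(sum_exp) + global_max - target_logit
```

The result matches the usual log-sum-exp formulation on the unsharded logits, while each device only ever touches its own slice.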
-
-
class libai.layers.PatchEmbedding(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, *, layer_idx=0)
2D Image to Patch Embedding
- Parameters
img_size – size of input image. Defaults to 224.
patch_size – embedded patch size. Defaults to 16.
in_chans – number of input channels. Defaults to 3.
embed_dim – dimension of embedded patch. Defaults to 768.
norm_layer – normalization layer to apply to the patch embedding, or None for no normalization. Defaults to None.
flatten – whether to flatten the patch embedding or keep the 2-D shape. Defaults to True.
layer_idx – A layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
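The relation between img_size, patch_size, and the output shape can be worked out with a small helper. This is a sketch under assumptions: square images and patches, img_size divisible by patch_size, and a channels-first layout for the unflattened case, none of which is stated above. The function name is hypothetical.

```python
def patch_embedding_shape(batch_size, img_size=224, patch_size=16, embed_dim=768, flatten=True):
    """Expected output shape of a patch embedding for square inputs."""
    grid = img_size // patch_size  # patches per side
    if flatten:
        # One embedding vector per patch.
        return (batch_size, grid * grid, embed_dim)
    # Assumed channels-first 2-D layout when the patches are not flattened.
    return (batch_size, embed_dim, grid, grid)
```

With the default arguments, a 224 x 224 image is cut into 14 x 14 = 196 patches, each embedded into 768 dimensions.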
-
class libai.layers.RMSLayerNorm(normalized_shape, eps=1e-06, layer_idx=0)
T5 uses a layer_norm which only scales and doesn’t shift, also known as Root Mean Square Layer Normalization. The variance is therefore calculated without subtracting the mean, and there is no bias. For more details, see https://arxiv.org/abs/1910.07467.
- Parameters
normalized_shape – input shape from an expected input of size.
eps – a value added to the denominator for numerical stability. Defaults to 1e-6.
layer_idx – a layer_idx sign which determines the placement. It will be used in pipeline parallelism. Defaults to 0.
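The scale-only normalization described above can be sketched for one feature vector in plain Python. It follows the formula from the RMSNorm paper linked above rather than LiBai's source; the function name `rms_layer_norm` and the explicit `weight` list are assumptions for the example.

```python
import math

def rms_layer_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square; no mean subtraction, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)   # "variance" computed without the mean
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]
```

Unlike standard layer normalization, the output is not re-centered: an input whose elements all share the same sign keeps that sign in every position.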
-
class libai.layers.SinePositionalEmbedding(num_embeddings, embedding_dim)
Construct the sinusoidal positional embeddings.
- Parameters
num_embeddings – size of vocabulary.
embedding_dim – dimension of embeddings.
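For reference, the classic sinusoidal table from "Attention Is All You Need" can be built as follows. This is a sketch of the textbook formula with sine and cosine interleaved; LiBai's layer may arrange the sine and cosine components differently (for example, as two concatenated halves), so treat the layout here as an assumption.

```python
import math

def sine_positional_embedding(num_embeddings, embedding_dim):
    """Return a [num_embeddings, embedding_dim] table of fixed positional encodings."""
    table = []
    for pos in range(num_embeddings):
        row = [0.0] * embedding_dim
        for i in range(0, embedding_dim, 2):
            # Wavelengths grow geometrically with the feature index.
            angle = pos / (10000 ** (i / embedding_dim))
            row[i] = math.sin(angle)
            if i + 1 < embedding_dim:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table
```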
-
class libai.layers.TransformerLayer(hidden_size, ffn_hidden_size, num_attention_heads, is_decoder=False, attention_dropout_prob=0.0, output_dropout_prob=0.0, drop_path_prob=0.0, layernorm_epsilon=1e-05, init_method=<function xavier_normal_>, output_layer_init_method=None, bias_gelu_fusion=False, bias_dropout_fusion=False, scale_mask_softmax_fusion=False, apply_query_key_layer_scaling=False, apply_residual_post_layernorm=False, attn_mask_type=<AttnMaskType.padding: 1>, *, layer_idx=0)
A single transformer layer.
Transformer layer takes input with size [bsz, seq_length, hidden size] and returns an output of the same size. The input and output have the same sbp signature, (S(0), B).
- Parameters
hidden_size – size of hidden state.
ffn_hidden_size – size of the feed-forward neural network.
num_attention_heads – number of attention heads.
is_decoder – used to specify whether this is a transformer encoder layer or a transformer decoder layer. Default: False.
attention_dropout_prob – dropout probability of attention weights.
output_dropout_prob – dropout probability of output.
layernorm_epsilon – epsilon used in layernorm layer. Default: 1e-5.
init_method – method to initialize the input layer weights.
output_layer_init_method – method to initialize the output layer weights. If None, use init_method.
bias_gelu_fusion – whether to fuse add bias and gelu. Default: False.
bias_dropout_fusion – whether to fuse add bias and dropout. Default: False.
scale_mask_softmax_fusion – whether to fuse scale, mask and softmax. Default: False.
apply_query_key_layer_scaling – if True, scale the attention score by layer index. Default: False.
apply_residual_post_layernorm – if True, use the original BERT residual connection ordering. Otherwise, use the Megatron BERT residual connection ordering, introduced in https://arxiv.org/pdf/1909.08053.pdf, which is more stable when scaling model size. Default: False.
layer_idx – the layer index, which determines the placement.
-
forward(hidden_states, attention_mask=None, encoder_states=None, encoder_attention_mask=None, past_key_value=None, use_cache=False)
- Parameters
hidden_states – shape is (batch_size, seq_length, hidden_size), sbp signature is (S(0), B).
attention_mask – the combination of key padding mask and causal mask of hidden states with shape (batch_size, 1, seq_length, seq_length) and the sbp signature is (S(0), B).
encoder_states – encoder output with shape (batch_size, seq_length, hidden_size) and the sbp signature is (S(0), B), which will be used in cross attention.
encoder_attention_mask – key padding mask of encoder states with shape (batch_size, 1, seq_length, seq_length) and the sbp signature is (S(0), B).
past_key_value – tuple of key and value, each shape is (seq_length, bsz, num_heads, head_size). For a decoder layer, past_key_value contains the states from both self attention and cross attention.
use_cache – it will be set to True when the model is in the inference phase and used for incremental decoding.
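The effect of apply_residual_post_layernorm can be pictured as a choice of which tensor feeds the residual branch of each sub-layer. The sketch below follows the convention used in Megatron-LM, where the flag selects the layer-normalized input as the residual; whether LiBai wires it exactly this way is an assumption, and `transformer_sublayer` with its callable arguments is a hypothetical helper for illustration.

```python
def transformer_sublayer(hidden, norm, sublayer, apply_residual_post_layernorm=False):
    """One pre-normalized sub-layer (attention or MLP) with a residual connection.

    hidden: list of floats standing in for the hidden states.
    norm, sublayer: callables mapping a list of floats to a list of floats.
    """
    normed = norm(hidden)
    out = sublayer(normed)
    # The flag decides whether the residual is taken after or before normalization.
    residual = normed if apply_residual_post_layernorm else hidden
    return [r + o for r, o in zip(residual, out)]
```

A full layer would apply this pattern twice (attention, then MLP), and three times in a decoder layer with cross attention.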
-
class libai.layers.VocabEmbedding(num_embeddings, embedding_dim, padding_idx=None, init_method=<function xavier_normal_>, amp_enabled=False)
Construct the word embeddings, which may be split along vocabulary dimension.
- Parameters
num_embeddings – size of vocabulary.
embedding_dim – dimension of embeddings.
padding_idx – pad index. Defaults to None.
init_method – method to initialize weights. Defaults to flow.nn.init.xavier_normal_.
amp_enabled – fp16 option for embedding weight. Defaults to False.
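Splitting an embedding table along the vocabulary dimension means each device owns a contiguous range of token ids. A common scheme, sketched below on a single process, has every shard return zeros for ids it does not own and then sums the per-shard results. This illustrates the general idea only and is an assumption about the technique, not LiBai's code; the function name is hypothetical.

```python
def sharded_embedding_lookup(shards, token_ids):
    """Look up token_ids in an embedding table split row-wise into shards.

    shards: list of tables; shard s holds the rows for a contiguous id range.
    Returns one embedding vector (list of floats) per token id.
    """
    dim = len(shards[0][0])
    out = [[0.0] * dim for _ in token_ids]
    start = 0
    for table in shards:
        end = start + len(table)
        for n, tok in enumerate(token_ids):
            if start <= tok < end:
                # Only the owning shard contributes; the sum stands in for an all-reduce.
                out[n] = [a + b for a, b in zip(out[n], table[tok - start])]
        start = end
    return out
```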