Write Dataloaders

This tutorial explains how the dataset APIs work, and how to customize your own datasets with them.

Build Common Dataloaders

For most common cases, we recommend building dataloaders in LiBai with the default builders defined in libai/data/build.py: build_nlp_train_val_test_loader, build_nlp_train_loader, build_nlp_test_loader, build_image_train_loader and build_image_test_loader.

All you need to do is write a PyTorch-style Dataset whose __getitem__ returns an Instance structure. The Instance structure stores the attributes of an instance (e.g., image, tokens) as “fields”, and the DistTensorData structure provides a standard to_global() function (called in get_batch()) to convert local tensors to global tensors.

The Instance returned by __getitem__ must contain the same keys as the argument names of the model's forward function. The following shows an example:

NOTE: Set placement_idx=-1 in DistTensorData when the tensor is used only in the loss function. This setting is used for pipeline parallel training.

# my_dataset.py
import numpy as np
import oneflow as flow

from libai.data.structures import DistTensorData, Instance

class MyDataset(flow.utils.data.Dataset):

    ...

    def __getitem__(self, idx):
        # np.int64 is used because the np.long alias is not available in every NumPy release
        text = np.array(self.dataset[idx], dtype=np.int64)
        # convert to flow.Tensor
        input_ids = flow.tensor(text[:-1], dtype=flow.long)
        # labels are the input tokens shifted by one position
        lm_labels = flow.tensor(text[1:], dtype=flow.long)
        # attention_mask may only contain the values 0 and 1
        attention_mask = flow.tensor(text[2:3], dtype=flow.long)
        loss_mask = flow.tensor(text[3:], dtype=flow.long)
        # the keys (`input_ids` ... `labels`) must match the parameter names of model.forward()
        sample = Instance(
            input_ids=DistTensorData(input_ids),
            # attention_mask may only contain the values 0 and 1
            attention_mask=DistTensorData(attention_mask),
            loss_mask=DistTensorData(loss_mask, placement_idx=-1),
            labels=DistTensorData(lm_labels, placement_idx=-1),
        )
        return sample

# my_model.py
import oneflow.nn as nn

class MyModel(nn.Module):
    ...
    
    # the parameter names match the keys returned by __getitem__
    def forward(self, input_ids, attention_mask, loss_mask, labels):
        ...
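
Because a mismatch between the keys returned by __getitem__ and the parameters of forward() only surfaces at training time, it can help to compare the two up front. The following is a minimal sketch in plain Python, not part of LiBai: compare_keys_to_forward and ToyModel are hypothetical names used for illustration, and the list of keys stands in for the fields of a real Instance.

```python
import inspect


def compare_keys_to_forward(sample_keys, forward_fn):
    """Return (missing, unexpected) names when comparing sample keys
    against the parameters of forward_fn (ignoring `self`)."""
    params = [
        name
        for name in inspect.signature(forward_fn).parameters
        if name != "self"
    ]
    missing = sorted(set(params) - set(sample_keys))
    unexpected = sorted(set(sample_keys) - set(params))
    return missing, unexpected


class ToyModel:
    # Hypothetical stand-in for MyModel; only the forward() signature matters here.
    def forward(self, input_ids, attention_mask, loss_mask, labels):
        return None


# Keys that the dataset's __getitem__ puts into the Instance.
sample_keys = ["input_ids", "attention_mask", "loss_mask", "labels"]
missing, unexpected = compare_keys_to_forward(sample_keys, ToyModel.forward)
print("missing:", missing, "unexpected:", unexpected)
```

Note that this sketch treats every parameter of forward() as required, so parameters with default values would also be reported as missing.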

In particular, if you generate your own attention_mask, its values can only be 0 or 1, because LiBai already processes attention_mask in libai/layers/attention.py as follows:

attention_scores = flow.mul(attention_scores, attention_mask)
attention_scores = attention_scores - 10000.0 * (1 - attention_mask)
attention_weights = flow.softmax(attention_scores, dim=-1)
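
To see why the mask must be binary, the same arithmetic can be reproduced on a single row of scores. This is a small sketch in pure Python rather than OneFlow; masked_softmax is a hypothetical helper that mirrors the three lines above for one row.

```python
import math


def masked_softmax(scores, mask):
    """Apply the mask as above (multiply, subtract 10000 * (1 - mask)),
    then take a softmax over the row."""
    masked = [s * m - 10000.0 * (1 - m) for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]

# Binary mask: the third position is masked out and receives zero weight.
weights = masked_softmax(scores, [1, 1, 0])

# Non-binary mask: a value of 2 turns the -10000 penalty into a +10000 bonus,
# so that position absorbs all of the attention weight.
bad_weights = masked_softmax(scores, [1, 2, 0])

print(weights)
print(bad_weights)
```

With a mask value of 1 the score passes through unchanged, and with 0 it is replaced by -10000, which the softmax maps to a weight of effectively zero. Any other value both rescales the score and shifts it by a large constant, which distorts the attention weights.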

After finishing MyDataset, set the dataloader in your config.py according to your needs. If you have only one training dataset for an NLP task and want to split it into train, valid and test datasets automatically, choose build_nlp_train_val_test_loader; evaluation is then run on the valid and test datasets.

Otherwise, choose build_nlp_train_loader and build_nlp_test_loader, or build_image_train_loader and build_image_test_loader, in config.py. See libai/data/build.py for more details.
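
As a conceptual illustration of splitting a single dataset into train, valid and test parts by ratio, the sketch below divides sample indices into three contiguous ranges. It is an assumption-laden example and not LiBai's implementation: split_indices and the 8:1:1 ratios are hypothetical, and the actual logic in libai/data/build.py may differ.

```python
def split_indices(num_samples, ratios):
    """Split range(num_samples) into contiguous chunks proportional to `ratios`."""
    total = float(sum(ratios))
    bounds = [0]
    cumulative = 0.0
    for ratio in ratios:
        cumulative += ratio
        # Round the cumulative boundary so that the chunks cover every sample.
        bounds.append(int(round(num_samples * cumulative / total)))
    return [range(bounds[i], bounds[i + 1]) for i in range(len(ratios))]


# Hypothetical 8:1:1 split of a 1000-sample dataset.
train_idx, valid_idx, test_idx = split_indices(1000, [8, 1, 1])
print(len(train_idx), len(valid_idx), len(test_idx))
```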