libai.data

libai.data.data_utils module

class libai.data.data_utils.IndexedCachedDataset(path)[source]
class libai.data.data_utils.IndexedDataset(path)[source]

Loader for IndexedDataset

class libai.data.data_utils.MMapIndexedDataset(path, skip_warmup=False)[source]
get(idx, offset=0, length=None)[source]

Retrieves a single item from the dataset, with the option to return only a portion of the item.

get(idx) is the same as [idx] but get() does not support slicing.
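
A minimal usage sketch (the .bin/.idx path prefix below is a placeholder for data produced by LiBai's preprocessing tools):

    from libai.data.data_utils import MMapIndexedDataset

    # Placeholder path prefix pointing at an existing .bin/.idx pair.
    dataset = MMapIndexedDataset("/data/corpus_text_sentence", skip_warmup=True)

    doc = dataset[0]                            # full token array for document 0
    head = dataset.get(0, offset=0, length=16)  # only the first 16 tokens of document 0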

libai.data.datasets module

class libai.data.datasets.BertDataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, mask_lm_prob, max_seq_length, short_seq_prob=0.0, seed=1234, binary_head=True, masking_style='bert')[source]

Dataset containing sentence pairs for BERT training. Each index corresponds to a randomly generated sentence pair.

Parameters
  • name – Name of the dataset, used for identification.

  • tokenizer – Tokenizer to use.

  • data_prefix – Path to the training dataset.

  • indexed_dataset – Indexed dataset to use.

  • max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.

  • mask_lm_prob – Probability to mask tokens. Defaults to 0.15.

  • short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.

  • max_predictions_per_seq – Maximum number of mask tokens in each sentence. Defaults to None.

  • seed – Seed for random number generator for reproducibility. Defaults to 1234.

  • binary_head – Whether the underlying dataset generates a pair of blocks together with a sentence_target label. When True, the dataset is assumed to produce a label for each sentence pair, surfaced as sentence_target. Defaults to True.
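
A hedged construction sketch; the vocab file and data prefix are placeholders, and BertTokenizer is assumed to be available from libai.tokenizer:

    from libai.data.data_utils import MMapIndexedDataset
    from libai.data.datasets import BertDataset
    from libai.tokenizer import BertTokenizer   # assumed LiBai tokenizer

    tokenizer = BertTokenizer(vocab_file="/path/to/bert-vocab.txt")    # placeholder path
    indexed_dataset = MMapIndexedDataset("/data/wiki_text_sentence")   # placeholder prefix

    train_set = BertDataset(
        name="bert_wiki",
        tokenizer=tokenizer,
        indexed_dataset=indexed_dataset,
        data_prefix="/data/wiki_text_sentence",
        max_num_samples=10000,
        mask_lm_prob=0.15,
        max_seq_length=512,
        short_seq_prob=0.1,
        seed=1234,
        binary_head=True,
    )
    sample = train_set[0]   # one randomly generated, masked sentence pair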

class libai.data.datasets.CIFAR100Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]

CIFAR100 Dataset in LiBai.

Parameters
  • root (string) – Root directory of dataset where directory cifar-100-python exists or will be saved to if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If the dataset is already downloaded, it will not be downloaded again.

  • dataset_name (str, optional) – Name for the dataset as an identifier. E.g., cifar100.

class libai.data.datasets.CIFAR10Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]

CIFAR10 Dataset in LiBai.

Parameters
  • root (string) – Root directory of dataset where directory cifar-10-batches-py exists or will be saved to if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If the dataset is already downloaded, it will not be downloaded again.
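
A minimal usage sketch; the root path is a placeholder and download=True fetches the archive on first use:

    from libai.data.datasets import CIFAR10Dataset

    train_set = CIFAR10Dataset(root="./data", train=True, download=True)   # training split
    test_set = CIFAR10Dataset(root="./data", train=False, download=True)   # test split

    print(len(train_set), len(test_set))   # 50000 training and 10000 test images

CIFAR100Dataset above follows the same pattern, with dataset_name available as an identifier.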

class libai.data.datasets.GPT2Dataset(name, tokenizer, data_prefix, indexed_dataset, max_num_samples, max_seq_length, seed=1234)[source]

Dataset containing sentences for GPT-2 training.

class libai.data.datasets.ImageNetDataset(root: str, train: bool = True, transform: Optional[Callable] = None, **kwargs)[source]

ImageNet 2012 Classification Dataset in LiBai.

Parameters
  • root (string) – Root directory of the ImageNet Dataset.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

class libai.data.datasets.MNISTDataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]

MNIST Dataset in LiBai.

Parameters
  • root (string) – Root directory of dataset where MNIST/processed/training.pt and MNIST/processed/test.pt exist.

  • train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If the dataset is already downloaded, it will not be downloaded again.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

  • dataset_name (str, optional) – Name for the dataset as an identifier. E.g., mnist.

class libai.data.datasets.RobertaDataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, mask_lm_prob, max_seq_length, short_seq_prob=0.0, seed=1234, masking_style='bert')[source]

Dataset containing sentences for RoBERTa training. Each index corresponds to a randomly selected sentence.

Parameters
  • name – Name of the dataset, used for identification.

  • tokenizer – Tokenizer to use.

  • data_prefix – Path to the training dataset.

  • indexed_dataset – Indexed dataset to use.

  • max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.

  • mask_lm_prob – Probability to mask tokens. Defaults to 0.15.

  • short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.

  • max_predictions_per_seq – Maximum number of mask tokens in each sentence. Defaults to None.

  • seed – Seed for random number generator for reproducibility. Defaults to 1234.

class libai.data.datasets.T5Dataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, masked_lm_prob, max_seq_length, max_seq_length_dec, short_seq_prob, seed)[source]

Dataset containing sentences for T5 training.

Parameters
  • name – Name of dataset.

  • tokenizer – Tokenizer to use.

  • data_prefix (str) – Path to the training dataset.

  • indexed_dataset – Indexed dataset to use.

  • max_seq_length (int, optional) – Maximum length of the sequence passed to the encoder. All values are padded to this length. Defaults to 512.

  • max_seq_length_dec (int, optional) – Maximum length of the sequence passed to the decoder. All values are padded to this length. Defaults to 128.

  • masked_lm_prob (float, optional) – Probability to mask tokens. Defaults to 0.15.

  • max_preds_per_seq (int, optional) – Maximum number of masked tokens in each sentence. Defaults to None.

  • short_seq_prob (float, optional) – Probability of producing a short sequence. Defaults to 0.0.

  • seed (int, optional) – Seed for random number generator for reproducibility. Defaults to 1234.

libai.data.samplers module

class libai.data.samplers.CyclicSampler(dataset, micro_batch_size, shuffle=False, consumed_samples=0, data_parallel_rank=0, data_parallel_size=1, seed=0)[source]

This sampler supports cyclic sampling and is compatible with both data-parallel and non-data-parallel training.

Parameters
  • dataset – dataset to be sampled.

  • micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.

  • shuffle – whether to shuffle the dataset.

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • data_parallel_rank – local rank for data parallelism.

  • data_parallel_size – the size of data parallelism.

  • seed – random seed, used for reproducing experiments (default: 0).

set_consumed_samples(consumed_samples)[source]

You can recover the training iteration by setting consumed_samples.

set_epoch(epoch)[source]

Used for restoring training status.
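
A construction sketch for a single-process data-parallel setting; the dataset is a placeholder and, in practice, samplers are usually created by the build_* helpers in libai.data.build:

    from libai.data.datasets import CIFAR10Dataset
    from libai.data.samplers import CyclicSampler

    dataset = CIFAR10Dataset(root="./data", train=True, download=True)   # placeholder dataset

    sampler = CyclicSampler(
        dataset,
        micro_batch_size=32,      # batch size per model instance
        shuffle=True,
        consumed_samples=0,
        data_parallel_rank=0,     # rank of this process in the data-parallel group
        data_parallel_size=1,
        seed=0,
    )

    # Resuming: skip the samples already seen in a previous run.
    sampler.set_consumed_samples(12800)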

class libai.data.samplers.SingleRoundSampler(dataset, micro_batch_size, shuffle=False, data_parallel_rank=0, data_parallel_size=1, seed=0, drop_last=False)[source]

This sampler supports single-round sampling and is compatible with both data-parallel and non-data-parallel training.

Parameters
  • dataset – dataset to be sampled.

  • micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.

  • shuffle – whether to shuffle the dataset.

  • data_parallel_rank – local rank for data parallelism.

  • data_parallel_size – the size of data parallelism.

  • seed – random seed, used for reproducing experiments (default: 0).

  • drop_last – whether to drop the remaining data (default: False).

libai.data.build module

libai.data.build.build_image_test_loader(dataset, test_batch_size, sampler={'shuffle': True, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, seed=0, collate_fn=None, **kwargs)[source]

Build the image test dataloader for a test dataset.

Returns

The test dataloader

  • test_loader: dataloader for testing

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
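
A direct-call sketch for the test loader; the dataset and batch size are placeholders (in LiBai configs this call is normally declared lazily, as sketched for the train loader below):

    from libai.data.build import build_image_test_loader
    from libai.data.datasets import CIFAR10Dataset

    test_set = CIFAR10Dataset(root="./data", train=False, download=True)   # placeholder dataset
    test_loader = build_image_test_loader(test_set, test_batch_size=64, num_workers=4)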

libai.data.build.build_image_train_loader(dataset, train_batch_size, test_batch_size=None, sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, mixup_func=None, **kwargs)[source]

Build the image train dataloader for a training dataset.

Returns

The train dataloader, plus None for the valid/test dataloaders

  • train_loader: dataloader for training

  • None: placeholder for the valid dataloader

  • None: placeholder for the test dataloader

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).

  • test_batch_size – not used; keep it as None.

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dataset_mixer – class or function used to concatenate a list of datasets (ConcatDataset by default).

  • mixup_func – function for data augmentation (e.g., mixup).
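
In LiBai config files this builder is usually declared lazily rather than called directly; a sketch assuming libai.config.LazyCall is available and using a placeholder ImageNet root:

    from omegaconf import OmegaConf
    from libai.config import LazyCall                      # assumed LiBai lazy-config helper
    from libai.data.build import build_image_train_loader
    from libai.data.datasets import ImageNetDataset

    dataloader = OmegaConf.create()
    dataloader.train = LazyCall(build_image_train_loader)(
        dataset=[
            LazyCall(ImageNetDataset)(root="/path/to/imagenet", train=True),   # placeholder root
        ],
        train_batch_size=32,   # micro batch size per GPU
        num_workers=4,
    )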

libai.data.build.build_nlp_test_loader(dataset, test_batch_size, sampler={'shuffle': False, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, seed=0, collate_fn=None)[source]

Build the NLP test dataloader for a test dataset.

Returns

The test dataloader

  • test_loader: dataloader for testing

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

libai.data.build.build_nlp_train_loader(dataset, train_batch_size, test_batch_size=None, sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, **kwargs)[source]

Build the NLP train dataloader for a training dataset.

Returns

The train dataloader, plus None for the valid/test dataloaders

  • train_loader: dataloader for training

  • None: placeholder for the valid dataloader

  • None: placeholder for the test dataloader

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).

  • test_batch_size – not used; keep it as None.

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dataset_mixer – class or function used to concatenate a list of datasets (ConcatDataset by default).

libai.data.build.build_nlp_train_val_test_loader(dataset, splits, weights, train_val_test_num_samples, train_batch_size, test_batch_size, train_sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, test_sampler={'shuffle': False, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>)[source]

Build the NLP train/valid/test dataloaders; used when the dataset lacks separate valid/test splits.

Returns

The train, valid, and test dataloaders

  • train_loader: dataloader for training

  • valid_loader: dataloader for validation

  • test_loader: dataloader for testing

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • splits – ratio config for splitting each dataset into train/valid/test. e.g.: [[7, 2, 1], …]

  • weights – ratio config for concatenating the dataset list (not supported yet). e.g.: [1.0, …]

  • train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).

  • test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).

  • train_sampler/test_sampler – define the strategy to draw samples from the dataset for training/testing. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dataset_mixer – class or function used to concatenate a list of datasets (ConcatDataset by default).
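
A lazy-config sketch mirroring the image example above: a single BertDataset is split into train/valid/test by ratio. The paths are placeholders, and LazyCall and BertTokenizer are assumed to be available from libai.config and libai.tokenizer:

    from omegaconf import OmegaConf
    from libai.config import LazyCall                       # assumed LiBai lazy-config helper
    from libai.data.build import build_nlp_train_val_test_loader
    from libai.data.data_utils import MMapIndexedDataset
    from libai.data.datasets import BertDataset
    from libai.tokenizer import BertTokenizer               # assumed LiBai tokenizer

    data_prefix = "/data/wiki_text_sentence"                # placeholder prefix

    dataloader = OmegaConf.create()
    dataloader.train = LazyCall(build_nlp_train_val_test_loader)(
        dataset=[
            LazyCall(BertDataset)(
                name="bert_wiki",
                tokenizer=LazyCall(BertTokenizer)(vocab_file="/path/to/bert-vocab.txt"),  # placeholder
                data_prefix=data_prefix,
                indexed_dataset=LazyCall(MMapIndexedDataset)(path=data_prefix),
                max_num_samples=10000,
                mask_lm_prob=0.15,
                max_seq_length=512,
            ),
        ],
        splits=[[949, 50, 1]],              # train/valid/test ratio for the dataset
        weights=[1.0],                      # dataset mixing weights (not supported yet)
        train_val_test_num_samples=None,    # placeholder; number of samples for each split
        train_batch_size=16,                # micro batch size per GPU
        test_batch_size=16,
        num_workers=4,
    )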