libai.data

libai.data.data_utils module

class libai.data.data_utils.IndexedCachedDataset(path)[source]
class libai.data.data_utils.IndexedDataset(path)[source]

Loader for IndexedDataset

class libai.data.data_utils.MMapIndexedDataset(path, skip_warmup=False)[source]
get(idx, offset=0, length=None)[source]

Retrieves a single item from the dataset, with the option to return only a portion of the item.

get(idx) is the same as [idx] but get() does not support slicing.
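
A minimal usage sketch (the .bin/.idx path prefix below is a placeholder for data produced by LiBai's preprocessing tools):

    from libai.data.data_utils import MMapIndexedDataset

    # Placeholder path prefix pointing at an existing .bin/.idx pair.
    dataset = MMapIndexedDataset("/data/corpus_text_sentence", skip_warmup=True)

    doc = dataset[0]                            # full token array for document 0
    head = dataset.get(0, offset=0, length=16)  # only the first 16 tokens of document 0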

libai.data.datasets module

class libai.data.datasets.BertDataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, mask_lm_prob, max_seq_length, short_seq_prob=0.0, seed=1234, binary_head=True, masking_style='bert')[source]

Dataset containing sentence pairs for BERT training. Each index corresponds to a randomly generated sentence pair.

Parameters
  • name – Name of the dataset, used for identification.

  • tokenizer – Tokenizer to use.

  • data_prefix – Path to the training dataset.

  • indexed_dataset – Indexed dataset to use.

  • max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.

  • mask_lm_prob – Probability to mask tokens. Defaults to 0.15.

  • short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.

  • max_predictions_per_seq – Maximum number of mask tokens in each sentence. Defaults to None.

  • seed – Seed for random number generator for reproducibility. Defaults to 1234.

  • binary_head – Whether the underlying dataset generates a pair of blocks together with a sentence_target label. When True, the dataset is assumed to produce a label for each sentence pair, surfaced as sentence_target. Defaults to True.
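
A hedged construction sketch; the vocab file and data prefix are placeholders, and BertTokenizer is assumed to be available from libai.tokenizer:

    from libai.data.data_utils import MMapIndexedDataset
    from libai.data.datasets import BertDataset
    from libai.tokenizer import BertTokenizer   # assumed LiBai tokenizer

    tokenizer = BertTokenizer(vocab_file="/path/to/bert-vocab.txt")    # placeholder path
    indexed_dataset = MMapIndexedDataset("/data/wiki_text_sentence")   # placeholder prefix

    train_set = BertDataset(
        name="bert_wiki",
        tokenizer=tokenizer,
        indexed_dataset=indexed_dataset,
        data_prefix="/data/wiki_text_sentence",
        max_num_samples=10000,
        mask_lm_prob=0.15,
        max_seq_length=512,
        short_seq_prob=0.1,
        seed=1234,
        binary_head=True,
    )
    sample = train_set[0]   # one randomly generated, masked sentence pair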

class libai.data.datasets.CIFAR100Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]

CIFAR100 Dataset in LiBai.

Parameters
  • root (string) – Root directory of dataset where directory cifar-100-python exists or will be saved to if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If the dataset is already downloaded, it will not be downloaded again.

  • dataset_name (str, optional) – Name for the dataset as an identifier. E.g., cifar100.

class libai.data.datasets.CIFAR10Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]

CIFAR10 Dataset in LiBai.

Parameters
  • root (string) – Root directory of dataset where directory cifar-10-batches-py exists or will be saved to if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If the dataset is already downloaded, it will not be downloaded again.
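
A minimal usage sketch; the root path is a placeholder and download=True fetches the archive on first use:

    from libai.data.datasets import CIFAR10Dataset

    train_set = CIFAR10Dataset(root="./data", train=True, download=True)   # training split
    test_set = CIFAR10Dataset(root="./data", train=False, download=True)   # test split

    print(len(train_set), len(test_set))   # 50000 training and 10000 test images

CIFAR100Dataset above follows the same pattern, with dataset_name available as an identifier.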

class libai.data.datasets.GPT2Dataset(name, tokenizer, data_prefix, indexed_dataset, max_num_samples, max_seq_length, seed=1234)[source]

Dataset containing sentences for GPT-2 training.

class libai.data.datasets.ImageNetDataset(root: str, train: bool = True, transform: Optional[Callable] = None, **kwargs)[source]

ImageNet 2012 Classification Dataset in LiBai.

Parameters
  • root (string) – Root directory of the ImageNet Dataset.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

class libai.data.datasets.MNISTDataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]

MNIST Dataset in LiBai.

Parameters
  • root (string) – Root directory of dataset where MNIST/processed/training.pt and MNIST/processed/test.pt exist.

  • train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If the dataset is already downloaded, it will not be downloaded again.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.

  • dataset_name (str, optional) – Name for the dataset as an identifier. E.g., mnist.

class libai.data.datasets.RobertaDataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, mask_lm_prob, max_seq_length, short_seq_prob=0.0, seed=1234, masking_style='bert')[source]

Dataset containing sentences for RoBERTa training. Each index corresponds to a randomly selected sentence.

Parameters
  • name – Name of the dataset, used for identification.

  • tokenizer – Tokenizer to use.

  • data_prefix – Path to the training dataset.

  • indexed_dataset – Indexed dataset to use.

  • max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.

  • mask_lm_prob – Probability to mask tokens. Defaults to 0.15.

  • short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.

  • max_predictions_per_seq – Maximum number of mask tokens in each sentence. Defaults to None.

  • seed – Seed for random number generator for reproducibility. Defaults to 1234.

class libai.data.datasets.T5Dataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, masked_lm_prob, max_seq_length, max_seq_length_dec, short_seq_prob, seed)[source]

Dataset containing sentences for T5 training.

Parameters
  • name – Name of dataset.

  • tokenizer – Tokenizer to use.

  • data_prefix (str) – Path to the training dataset.

  • indexed_dataset – Indexed dataset to use.

  • max_seq_length (int, optional) – Maximum length of the sequence passed to the encoder. All values are padded to this length. Defaults to 512.

  • max_seq_length_dec (int, optional) – Maximum length of the sequence passed to the decoder. All values are padded to this length. Defaults to 128.

  • masked_lm_prob (float, optional) – Probability to mask tokens. Defaults to 0.15.

  • max_preds_per_seq (int, optional) – Maximum number of masked tokens in each sentence. Defaults to None.

  • short_seq_prob (float, optional) – Probability of producing a short sequence. Defaults to 0.0.

  • seed (int, optional) – Seed for random number generator for reproducibility. Defaults to 1234.

libai.data.samplers module

class libai.data.samplers.CyclicSampler(dataset, micro_batch_size, shuffle=False, consumed_samples=0, data_parallel_rank=0, data_parallel_size=1, seed=0)[source]

This sampler supports cyclic sampling and is compatible with both data-parallel and non-data-parallel training.

Parameters
  • dataset – dataset to be sampled.

  • micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.

  • shuffle – whether to shuffle the dataset.

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • data_parallel_rank – local rank for data parallelism.

  • data_parallel_size – the size of data parallelism.

  • seed – random seed, used for reproducing experiments (default: 0).

set_consumed_samples(consumed_samples)[source]

You can recover the training iteration by setting consumed_samples.

set_epoch(epoch)[source]

Used for restoring training status.
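
A construction sketch for a single-process data-parallel setting; the dataset is a placeholder and, in practice, samplers are usually created by the build_* helpers in libai.data.build:

    from libai.data.datasets import CIFAR10Dataset
    from libai.data.samplers import CyclicSampler

    dataset = CIFAR10Dataset(root="./data", train=True, download=True)   # placeholder dataset

    sampler = CyclicSampler(
        dataset,
        micro_batch_size=32,      # batch size per model instance
        shuffle=True,
        consumed_samples=0,
        data_parallel_rank=0,     # rank of this process in the data-parallel group
        data_parallel_size=1,
        seed=0,
    )

    # Resuming: skip the samples already seen in a previous run.
    sampler.set_consumed_samples(12800)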

class libai.data.samplers.SingleRoundSampler(dataset, micro_batch_size, shuffle=False, data_parallel_rank=0, data_parallel_size=1, seed=0, drop_last=False)[source]

This sampler supports single-round sampling and is compatible with both data-parallel and non-data-parallel training.

Parameters
  • dataset – dataset to be sampled.

  • micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.

  • shuffle – whether to shuffle the dataset.

  • data_parallel_rank – local rank for data parallelism.

  • data_parallel_size – the size of data parallelism.

  • seed – random seed, used for reproducing experiments (default: 0).

  • drop_last – whether to drop the remaining data (default: False).

libai.data.build module

libai.data.build.build_image_test_loader(dataset, test_batch_size, sampler={'shuffle': True, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, seed=0, collate_fn=None, **kwargs)[source]

Build the image test dataloader for a test dataset.

Returns

The test dataloader

  • test_loader: dataloader for testing

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
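
A direct-call sketch for the test loader; the dataset and batch size are placeholders (in LiBai configs this call is normally declared lazily, as sketched for the train loader below):

    from libai.data.build import build_image_test_loader
    from libai.data.datasets import CIFAR10Dataset

    test_set = CIFAR10Dataset(root="./data", train=False, download=True)   # placeholder dataset
    test_loader = build_image_test_loader(test_set, test_batch_size=64, num_workers=4)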

libai.data.build.build_image_train_loader(dataset, train_batch_size, test_batch_size=None, sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, mixup_func=None, **kwargs)[source]

Build the image train dataloader for a training dataset.

Returns

The train dataloader, plus None for the valid/test dataloaders

  • train_loader: dataloader for training

  • None: placeholder for the valid dataloader

  • None: placeholder for the test dataloader

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).

  • test_batch_size – not used; keep it as None.

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dataset_mixer – class or function used to concatenate a list of datasets (ConcatDataset by default).

  • mixup_func – function for data augmentation (e.g., mixup).
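
In LiBai config files this builder is usually declared lazily rather than called directly; a sketch assuming libai.config.LazyCall is available and using a placeholder ImageNet root:

    from omegaconf import OmegaConf
    from libai.config import LazyCall                      # assumed LiBai lazy-config helper
    from libai.data.build import build_image_train_loader
    from libai.data.datasets import ImageNetDataset

    dataloader = OmegaConf.create()
    dataloader.train = LazyCall(build_image_train_loader)(
        dataset=[
            LazyCall(ImageNetDataset)(root="/path/to/imagenet", train=True),   # placeholder root
        ],
        train_batch_size=32,   # micro batch size per GPU
        num_workers=4,
    )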

libai.data.build.build_nlp_test_loader(dataset, test_batch_size, sampler={'shuffle': False, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, seed=0, collate_fn=None)[source]

Build the NLP test dataloader for a test dataset.

Returns

The test dataloader

  • test_loader: dataloader for testing

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

libai.data.build.build_nlp_train_loader(dataset, train_batch_size, test_batch_size=None, sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, **kwargs)[source]

Build the NLP train dataloader for a training dataset.

Returns

The train dataloader, plus None for the valid/test dataloaders

  • train_loader: dataloader for training

  • None: placeholder for the valid dataloader

  • None: placeholder for the test dataloader

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).

  • test_batch_size – not used; keep it as None.

  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dataset_mixer – class or function used to concatenate a list of datasets (ConcatDataset by default).

libai.data.build.build_nlp_train_val_test_loader(dataset, splits, weights, train_val_test_num_samples, train_batch_size, test_batch_size, train_sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, test_sampler={'shuffle': False, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>)[source]

Build the NLP train/valid/test dataloaders; used when the dataset lacks separate valid/test splits.

Returns

The train, valid, and test dataloaders

  • train_loader: dataloader for training

  • valid_loader: dataloader for validation

  • test_loader: dataloader for testing

Parameters
  • dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]

  • splits – ratio config for splitting each dataset into train/valid/test. e.g.: [[7, 2, 1], …]

  • weights – ratio config for concatenating the dataset list (not supported yet). e.g.: [1.0, …]

  • train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).

  • test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).

  • train_sampler/test_sampler – define the strategy to draw samples from the dataset for training/testing. Can be any Iterable with __len__ implemented.

  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).

  • consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).

  • seed – random seed, used for reproducing experiments (default: 0).

  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dataset_mixer – class or function used to concatenate a list of datasets (ConcatDataset by default).
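
A lazy-config sketch mirroring the image example above: a single BertDataset is split into train/valid/test by ratio. The paths are placeholders, and LazyCall and BertTokenizer are assumed to be available from libai.config and libai.tokenizer:

    from omegaconf import OmegaConf
    from libai.config import LazyCall                       # assumed LiBai lazy-config helper
    from libai.data.build import build_nlp_train_val_test_loader
    from libai.data.data_utils import MMapIndexedDataset
    from libai.data.datasets import BertDataset
    from libai.tokenizer import BertTokenizer               # assumed LiBai tokenizer

    data_prefix = "/data/wiki_text_sentence"                # placeholder prefix

    dataloader = OmegaConf.create()
    dataloader.train = LazyCall(build_nlp_train_val_test_loader)(
        dataset=[
            LazyCall(BertDataset)(
                name="bert_wiki",
                tokenizer=LazyCall(BertTokenizer)(vocab_file="/path/to/bert-vocab.txt"),  # placeholder
                data_prefix=data_prefix,
                indexed_dataset=LazyCall(MMapIndexedDataset)(path=data_prefix),
                max_num_samples=10000,
                mask_lm_prob=0.15,
                max_seq_length=512,
            ),
        ],
        splits=[[949, 50, 1]],              # train/valid/test ratio for the dataset
        weights=[1.0],                      # dataset mixing weights (not supported yet)
        train_val_test_num_samples=None,    # placeholder; number of samples for each split
        train_batch_size=16,                # micro batch size per GPU
        test_batch_size=16,
        num_workers=4,
    )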