libai.data¶
libai.data.data_utils module¶
libai.data.datasets module¶
- class libai.data.datasets.BertDataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, mask_lm_prob, max_seq_length, short_seq_prob=0.0, seed=1234, binary_head=True, masking_style='bert')[source]¶
Dataset containing sentence pairs for BERT training. Each index corresponds to a randomly generated sentence pair.
- Parameters
name – Name of the dataset, for identification.
tokenizer – Tokenizer to use.
data_prefix – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.
mask_lm_prob – Probability of masking a token. Defaults to 0.15.
short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.
max_predictions_per_seq – Maximum number of masked tokens per sentence. Defaults to None.
seed – Seed for the random number generator, for reproducibility. Defaults to 1234.
binary_head – Whether the underlying dataset generates a pair of blocks together with a sentence_target. Setting it to True assumes that the underlying dataset generates a label for the pair of sentences, surfaced as sentence_target. Defaults to True.
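A minimal config-style sketch of constructing a BertDataset in the LazyCall pattern used by LiBai's config files. The data_prefix path is a placeholder, the get_indexed_dataset helper and its arguments are assumptions based on LiBai's example configs, and tokenizer / max_num_samples are typically filled in by the training framework before instantiation:

    from libai.config import LazyCall
    from libai.data.data_utils import get_indexed_dataset  # assumed helper
    from libai.data.datasets import BertDataset

    data_prefix = "/path/to/bert_data/corpus_sentence"  # placeholder path

    bert_dataset = LazyCall(BertDataset)(
        name="bert",
        data_prefix=data_prefix,
        indexed_dataset=LazyCall(get_indexed_dataset)(
            data_prefix=data_prefix,
            data_impl="mmap",
            skip_warmup=False,
        ),
        max_seq_length=512,
        mask_lm_prob=0.15,
        short_seq_prob=0.1,
        binary_head=True,
        seed=1234,
    )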
- class libai.data.datasets.CIFAR100Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶
CIFAR100 Dataset in LiBai.
- Parameters
root (string) – Root directory of the dataset where the directory cifar-100-python exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates the dataset from the training set, otherwise from the test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
dataset_name (str, optional) – Name for the dataset as an identifier, e.g. cifar100.
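For example, building the CIFAR-100 training split and fetching one sample (a minimal sketch; ./data is a placeholder path):

    from libai.data.datasets import CIFAR100Dataset

    # Downloads CIFAR-100 into ./data on first use, then reuses the local copy.
    train_set = CIFAR100Dataset(root="./data", train=True, download=True)
    print(len(train_set))  # number of training images
    sample = train_set[0]  # the first training sample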
- class libai.data.datasets.CIFAR10Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶
CIFAR10 Dataset in LiBai.
- Parameters
root (string) – Root directory of the dataset where the directory cifar-10-batches-py exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates the dataset from the training set, otherwise from the test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
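Usage mirrors CIFAR100Dataset; the sketch below also plugs in a transform, assuming flowvision's transforms (which LiBai's vision pipeline builds on) are available:

    import flowvision.transforms as transforms  # assumed transform library
    from libai.data.datasets import CIFAR10Dataset

    # Random-crop augmentation applied to each PIL image.
    train_set = CIFAR10Dataset(
        root="./data",  # placeholder path
        train=True,
        transform=transforms.RandomCrop(32, padding=4),
        download=True,
    )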
- class libai.data.datasets.GPT2Dataset(name, tokenizer, data_prefix, indexed_dataset, max_num_samples, max_seq_length, seed=1234)[source]¶
Dataset containing sentences for GPT-2 training.
- class libai.data.datasets.ImageNetDataset(root: str, train: bool = True, transform: Optional[Callable] = None, **kwargs)[source]¶
ImageNet 2012 Classification Dataset in LiBai.
- Parameters
root (string) – Root directory of the ImageNet dataset.
train (bool, optional) – If True, creates the dataset from the training set, otherwise from the test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
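Unlike the CIFAR and MNIST classes, the signature has no download flag, so the ImageNet directory layout must already exist under root. A minimal sketch (the path is a placeholder):

    from libai.data.datasets import ImageNetDataset

    train_set = ImageNetDataset(root="/path/to/imagenet", train=True)
    val_set = ImageNetDataset(root="/path/to/imagenet", train=False)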
- class libai.data.datasets.MNISTDataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶
MNIST Dataset in LiBai.
- Parameters
root (string) – Root directory of the dataset where MNIST/processed/training.pt and MNIST/processed/test.pt exist.
train (bool, optional) – If True, creates the dataset from training.pt, otherwise from test.pt.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
dataset_name (str, optional) – Name for the dataset as an identifier, e.g. mnist.
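A minimal sketch that builds the MNIST test split, passing the documented dataset_name identifier through **kwargs:

    from libai.data.datasets import MNISTDataset

    test_set = MNISTDataset(
        root="./data",  # placeholder path
        train=False,
        download=True,
        dataset_name="mnist",
    )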
- class libai.data.datasets.RobertaDataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, mask_lm_prob, max_seq_length, short_seq_prob=0.0, seed=1234, masking_style='bert')[source]¶
Dataset containing sentences for RoBERTa training. Each index corresponds to a randomly selected sentence.
- Parameters
name – Name of the dataset, for identification.
tokenizer – Tokenizer to use.
data_prefix – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.
mask_lm_prob – Probability of masking a token. Defaults to 0.15.
short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.
max_predictions_per_seq – Maximum number of masked tokens per sentence. Defaults to None.
seed – Seed for the random number generator, for reproducibility. Defaults to 1234.
- class libai.data.datasets.T5Dataset(name, tokenizer, indexed_dataset, data_prefix, max_num_samples, masked_lm_prob, max_seq_length, max_seq_length_dec, short_seq_prob, seed)[source]¶
Dataset containing sentences for T5 training.
- Parameters
name – Name of the dataset.
tokenizer – Tokenizer to use.
data_prefix (str) – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length (int, optional) – Maximum length of the sequence passed into the encoder. All values are padded to this length. Defaults to 512.
max_seq_length_dec (int, optional) – Maximum length of the sequence passed into the decoder. All values are padded to this length. Defaults to 128.
masked_lm_prob (float, optional) – Probability of masking a token. Defaults to 0.15.
max_preds_per_seq (int, optional) – Maximum number of masked tokens per sentence. Defaults to None.
short_seq_prob (float, optional) – Probability of producing a short sequence. Defaults to 0.0.
seed (int, optional) – Seed for the random number generator, for reproducibility. Defaults to 1234.
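Construction follows the same config-style pattern as the BertDataset sketch above, with separate encoder/decoder lengths (the path and the get_indexed_dataset helper remain placeholders/assumptions):

    from libai.config import LazyCall
    from libai.data.data_utils import get_indexed_dataset  # assumed helper
    from libai.data.datasets import T5Dataset

    data_prefix = "/path/to/t5_data/corpus_sentence"  # placeholder path

    t5_dataset = LazyCall(T5Dataset)(
        name="t5",
        data_prefix=data_prefix,
        indexed_dataset=LazyCall(get_indexed_dataset)(
            data_prefix=data_prefix,
            data_impl="mmap",
            skip_warmup=False,
        ),
        max_seq_length=512,      # encoder sequence length
        max_seq_length_dec=128,  # decoder sequence length
        masked_lm_prob=0.15,
        short_seq_prob=0.0,
        seed=1234,
    )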
libai.data.samplers module¶
- class libai.data.samplers.CyclicSampler(dataset, micro_batch_size, shuffle=False, consumed_samples=0, data_parallel_rank=0, data_parallel_size=1, seed=0)[source]¶
This sampler supports cyclic sampling and is compatible with both non-data parallelism and data parallelism.
- Parameters
dataset – dataset to be sampled.
micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.
shuffle – whether to shuffle the dataset.
consumed_samples – the number of samples already consumed at the current point of training, used for resuming training (default: 0).
data_parallel_rank – local rank for data parallelism.
data_parallel_size – the size of data parallelism.
seed – random seed, used for reproducing experiments (default: 0).
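A standalone sketch: the sampler only needs an object with __len__ for dataset, and iterating it yields the indices of one micro-batch at a time for the local data-parallel rank. These usage details are assumptions; in practice the sampler is wired up by the build_* helpers documented below:

    from libai.data.samplers import CyclicSampler

    sampler = CyclicSampler(
        dataset=range(100),  # stand-in dataset; only len() is used here
        micro_batch_size=8,
        shuffle=True,
        data_parallel_rank=0,
        data_parallel_size=1,
        seed=0,
    )
    batch_indices = next(iter(sampler))  # indices of one micro-batch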
- class libai.data.samplers.SingleRoundSampler(dataset, micro_batch_size, shuffle=False, data_parallel_rank=0, data_parallel_size=1, seed=0, drop_last=False)[source]¶
This sampler supports single-round sampling and is compatible with both non-data parallelism and data parallelism.
- Parameters
dataset – dataset to be sampled.
micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.
shuffle – whether to shuffle the dataset.
data_parallel_rank – local rank for data parallelism.
data_parallel_size – the size of data parallelism.
seed – random seed, used for reproducing experiments (default: 0).
drop_last – whether to drop the remaining data (default: False).
libai.data.build module¶
- libai.data.build.build_image_test_loader(dataset, test_batch_size, sampler={'shuffle': True, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, seed=0, collate_fn=None, **kwargs)[source]¶
Build an image test dataloader, used for test datasets.
- Returns
Returns the test dataloader:
test_loader: dataloader for testing
- Parameters
dataset – dataset from which to load the data, e.g. dataset or [dataset1, dataset2, …]
test_batch_size – how many samples per batch to load in testing (micro-batch size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when batched loading from a map-style dataset.
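A direct-call sketch with one of the vision datasets above (the path is a placeholder; in LiBai configs this function is usually wrapped in LazyCall rather than called directly):

    from libai.data.build import build_image_test_loader
    from libai.data.datasets import CIFAR10Dataset

    test_loader = build_image_test_loader(
        dataset=CIFAR10Dataset(root="./data", train=False, download=True),
        test_batch_size=64,
    )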
- libai.data.build.build_image_train_loader(dataset, train_batch_size, test_batch_size=None, sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, mixup_func=None, **kwargs)[source]¶
Build an image train dataloader, used for training datasets.
- Returns
Returns the train dataloader, and None for the valid/test dataloaders:
train_loader: dataloader for training
None: placeholder for the validation dataloader
None: placeholder for the test dataloader
- Parameters
dataset – dataset from which to load the data, e.g. dataset or [dataset1, dataset2, …]
train_batch_size – how many samples per batch to load in training (micro-batch size per GPU).
test_batch_size – unused; set it to None.
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
consumed_samples – the number of samples already consumed at the current point of training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
mixup_func – function for data augmentation.
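A direct-call sketch; per the Returns section, the last two return values are None placeholders (the path is, again, a placeholder):

    from libai.data.build import build_image_train_loader
    from libai.data.datasets import CIFAR10Dataset

    train_loader, _, _ = build_image_train_loader(
        dataset=CIFAR10Dataset(root="./data", train=True, download=True),
        train_batch_size=32,
        num_workers=4,
    )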
- libai.data.build.build_nlp_test_loader(dataset, test_batch_size, sampler={'shuffle': False, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, seed=0, collate_fn=None)[source]¶
Build an NLP test dataloader, used for test datasets.
- Returns
Returns the test dataloader:
test_loader: dataloader for testing
- Parameters
dataset – dataset from which to load the data, e.g. dataset or [dataset1, dataset2, …]
test_batch_size – how many samples per batch to load in testing (micro-batch size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when batched loading from a map-style dataset.
- libai.data.build.build_nlp_train_loader(dataset, train_batch_size, test_batch_size=None, sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, **kwargs)[source]¶
Build an NLP train dataloader, used for training datasets.
- Returns
Returns the train dataloader, and None for the valid/test dataloaders:
train_loader: dataloader for training
None: placeholder for the validation dataloader
None: placeholder for the test dataloader
- Parameters
dataset – dataset from which to load the data, e.g. dataset or [dataset1, dataset2, …]
train_batch_size – how many samples per batch to load in training (micro-batch size per GPU).
test_batch_size – unused; set it to None.
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
consumed_samples – the number of samples already consumed at the current point of training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
- libai.data.build.build_nlp_train_val_test_loader(dataset, splits, weights, train_val_test_num_samples, train_batch_size, test_batch_size, train_sampler={'shuffle': True, '_target_': <class 'libai.data.samplers.samplers.CyclicSampler'>}, test_sampler={'shuffle': False, 'drop_last': False, '_target_': <class 'libai.data.samplers.samplers.SingleRoundSampler'>}, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>)[source]¶
Build NLP train/valid/test dataloaders, used for datasets that lack their own valid/test splits.
- Returns
Returns the train/valid/test dataloaders:
train_loader: dataloader for training
valid_loader: dataloader for validation
test_loader: dataloader for testing
- Parameters
dataset – dataset from which to load the data, e.g. dataset or [dataset1, dataset2, …]
splits – ratio config for splitting the dataset into train/valid/test, e.g. [[7, 2, 1], …]
weights – ratio config for concatenating the dataset list (not supported yet), e.g. [1.0, …]
train_batch_size – how many samples per batch to load in training (micro-batch size per GPU).
test_batch_size – how many samples per batch to load in testing (micro-batch size per GPU).
train_sampler/test_sampler – define the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
consumed_samples – the number of samples already consumed at the current point of training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
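A config-style sketch in the LazyCall pattern used by LiBai's config files, combining this builder with the BertDataset documented above. The data path is a placeholder, and the get_indexed_dataset helper and the train_val_test_num_samples=None hint (filled in later by the trainer) are assumptions based on LiBai's example configs:

    from libai.config import LazyCall
    from libai.data.build import build_nlp_train_val_test_loader
    from libai.data.data_utils import get_indexed_dataset  # assumed helper
    from libai.data.datasets import BertDataset

    data_prefix = "/path/to/bert_data/corpus_sentence"  # placeholder path

    dataloader_train = LazyCall(build_nlp_train_val_test_loader)(
        dataset=[
            LazyCall(BertDataset)(
                name="bert",
                data_prefix=data_prefix,
                indexed_dataset=LazyCall(get_indexed_dataset)(
                    data_prefix=data_prefix,
                    data_impl="mmap",
                    skip_warmup=False,
                ),
                max_seq_length=512,
                mask_lm_prob=0.15,
            ),
        ],
        splits=[[949.0, 50.0, 1.0]],  # train/valid/test split ratios
        weights=[1.0],
        train_val_test_num_samples=None,  # assumed: deferred to the trainer
        train_batch_size=32,
        test_batch_size=32,
    )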