Preprocessing Dataset¶
If you use LiBai’s Dataset
to training NLP model, you can preprocess the training data.
This tutorial introduces how to preprocess your own training data, let’s take training Bert
as an example.
First, You need to store the training data in loose JSON format file, which contains one text sample per line, For example:
{"chapter": "Chapter One", "text": "April Johnson had been crammed inside an apartment", "type": "April", "background": "novel"}
{"chapter": "Chapter Two", "text": "He couldn't remember their names", "type": "Dominic", "background": "novel"}
You can set the --json-keys
argument to select the specific data of per sample, and the other keys will not be used.
Then, Process the JSON file into a binary format for training. To conver the json into mmap, cached index file, or the lazy loader format use toos/preprocess_data.py
. Set the --dataset-impl
flag to mmap
, cached
, or lazy
respectively. You can run the following code to prepare you own dataset for training BERT:
#!/bin/bash
IMPL=mmap
KEYS=text
python tools/preprocess_data.py \
--input path/to/test_sample_cn.json \
--json-keys ${KEYS} \
--vocab-file path/to/bert-base-chinese-vocab.txt \
--dataset-impl ${IMPL} \
--tokenizer-name BertTokenizer \
--do-lower-case \
--do-chinese-wwm \
--split-sentences \
--output-prefix cn_samples_${IMPL} \
--workers 1 \
--log-interval 2
Further command line arguments are described in the source file preprocess_data.py
.