Preprocessing Dataset

If you use LiBai’s Dataset to train NLP models, you can preprocess your own training data.

This tutorial introduces how to preprocess your own training data, taking BERT training as an example.

First, store the training data in a loose JSON file that contains one text sample per line. For example:

{"chapter": "Chapter One", "text": "April Johnson had been crammed inside an apartment", "type": "April", "background": "novel"}
{"chapter": "Chapter Two", "text": "He couldn't remember their names", "type": "Dominic", "background": "novel"}

You can set the --json-keys argument to select specific fields from each sample; the other keys will not be used.
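
Conceptually, the key selection works like the following sketch (a simplified illustration, not the actual implementation in preprocess_data.py):

import json

json_keys = ["text"]  # mirrors --json-keys text

with open("test_sample_cn.json", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        # Only the selected fields are kept for tokenization;
        # "chapter", "type", "background", etc. are ignored.
        selected = {key: sample[key] for key in json_keys}
        print(selected)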

Then, process the JSON file into a binary format for training. To convert the JSON into the mmap, cached index file, or lazy loader format, use tools/preprocess_data.py and set the --dataset-impl flag to mmap, cached, or lazy respectively. You can run the following script to prepare your own dataset for training BERT:

#!/bin/bash

IMPL=mmap   # dataset implementation: mmap, cached, or lazy
KEYS=text   # JSON key(s) whose values will be tokenized

python tools/preprocess_data.py \
        --input path/to/test_sample_cn.json \
        --json-keys ${KEYS} \
        --vocab-file path/to/bert-base-chinese-vocab.txt \
        --dataset-impl ${IMPL} \
        --tokenizer-name BertTokenizer \
        --do-lower-case \
        --do-chinese-wwm \
        --split-sentences \
        --output-prefix cn_samples_${IMPL} \
        --workers 1 \
        --log-interval 2
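
After the script finishes, it writes a binary data file and an index file. Assuming the Megatron-style naming convention (output prefix, JSON key, and "sentence" level when --split-sentences is set), a quick Python check such as the one below can confirm the outputs; the file names here are an assumption, so check the script's log for the actual names produced by your run:

import os

# Hypothetical output prefix: <output-prefix>_<json-key>_<level>
prefix = "cn_samples_mmap_text_sentence"

for ext in (".bin", ".idx"):
    path = prefix + ext
    if os.path.exists(path):
        print(f"{path}: {os.path.getsize(path)} bytes")
    else:
        print(f"{path}: not found; check the preprocessing log for the actual name")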

Further command line arguments are described in the source file tools/preprocess_data.py.