libai.evaluation

class libai.evaluation.BLEUEvaluator[source]

Evaluate the BLEU (Bilingual Evaluation Understudy) score.

BLEU is a score for comparing a candidate translation of text to one or more reference translations.

evaluate()[source]

Evaluate/summarize the performance after processing all input/output pairs.

Returns

A new evaluator class can return a dict of arbitrary format as long as the user can process the results. In our train_net.py, we expect the following format:

  • key: the name of the task (e.g., Classification)

  • value: a dict of {metric name: score}, e.g.: {"Acc@1": 75.0}

Return type

dict

process(inputs, outputs)[source]

Process the pair of inputs and outputs.

pred_logits = outputs["prediction_scores"]
labels = inputs["labels"]
# do evaluation on pred_logits/labels pair
...
Parameters
  • inputs (dict) – the inputs that are used to call the model.

  • outputs (dict) – the dict returned by model(**inputs).

reset()[source]

Preparation for a new round of evaluation. Should be called before starting a round of evaluation.
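
All evaluators in this module follow the reset()/process()/evaluate() protocol shown above. Below is a minimal sketch of driving BLEUEvaluator by hand; eval_data_loader and model are placeholders for your own objects, and the batch/output keys are assumed to match the process() snippet:

from libai.evaluation import BLEUEvaluator

evaluator = BLEUEvaluator()
evaluator.reset()                          # start a fresh round of evaluation
for inputs in eval_data_loader:            # placeholder: an iterable of input dicts
    outputs = model(**inputs)              # placeholder: a model returning an output dict
    evaluator.process(inputs, outputs)     # accumulate this inputs/outputs pair
results = evaluator.evaluate()             # a dict of scores, as described under Returns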

class libai.evaluation.ClsEvaluator(topk=(1, 5))[source]

Evaluate accuracy for classification. The metrics range from 0 to 100 (instead of 0 to 1). Evaluating different top-k accuracies is supported; you can set cfg.train.topk=(1, 5, N) according to your needs.

evaluate()[source]

Evaluate/summarize the performance after processing all input/output pairs.

Returns

A new evaluator class can return a dict of arbitrary format as long as the user can process the results. In our train_net.py, we expect the following format:

  • key: the name of the task (e.g., Classification)

  • value: a dict of {metric name: score}, e.g.: {"Acc@1": 75.0}

Return type

dict

process(inputs, outputs)[source]

Process the pair of inputs and outputs.

pred_logits = outputs["prediction_scores"]
labels = inputs["labels"]
# do evaluation on pred_logits/labels pair
...
Parameters
  • inputs (dict) – the inputs that are used to call the model.

  • outputs (dict) – the dict returned by model(**inputs).

reset()[source]

Preparation for a new round of evaluation. Should be called before starting a round of evaluation.
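
Below is a sketch of requesting additional top-k accuracies; the Acc@k key names follow the example in the Returns section, and the scores shown are placeholders:

from libai.evaluation import ClsEvaluator

evaluator = ClsEvaluator(topk=(1, 5, 10))   # report top-1, top-5 and top-10 accuracy
# with train_net.py, the same choice is made via cfg.train.topk=(1, 5, 10)
# ... reset(), then process() every inputs/outputs pair ...
results = evaluator.evaluate()
# e.g. {"Acc@1": 75.0, "Acc@5": 92.0, "Acc@10": 96.0}  (placeholder values on a 0-100 scale)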

class libai.evaluation.DatasetEvaluator[source]

Base class for a dataset evaluator. The function inference_on_dataset() runs the model over all samples in the dataset and uses a DatasetEvaluator to process the inputs/outputs. This class accumulates information about the inputs/outputs (via process()) and produces evaluation results at the end (via evaluate()).

evaluate()[source]

Evaluate/summarize the performance after processing all input/output pairs.

Returns

A new evaluator class can return a dict of arbitrary format as long as the user can process the results. In our train_net.py, we expect the following format:

  • key: the name of the task (e.g., Classification)

  • value: a dict of {metric name: score}, e.g.: {"Acc@1": 75.0}

Return type

dict

process(inputs, outputs)[source]

Process the pair of inputs and outputs.

pred_logits = outputs["prediction_scores"]
labels = inputs["labels"]
# do evaluation on pred_logits/labels pair
...
Parameters
  • inputs (dict) – the inputs that are used to call the model.

  • outputs (dict) – the dict returned by model(**inputs).

reset()[source]

Preparation for a new round of evaluation. Should be called before starting a round of evaluation.
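
Below is a minimal sketch of a custom evaluator built on this base class. It assumes the "prediction_scores"/"labels" keys from the process() snippet and tensors with PyTorch-style argmax/sum/numel methods (OneFlow tensors expose the same interface); the class and metric names are illustrative, not part of LiBai:

from libai.evaluation import DatasetEvaluator

class Top1AccEvaluator(DatasetEvaluator):   # hypothetical evaluator, for illustration only
    def reset(self):
        self._correct = 0
        self._total = 0

    def process(self, inputs, outputs):
        pred_logits = outputs["prediction_scores"]
        labels = inputs["labels"]
        preds = pred_logits.argmax(dim=-1)
        self._correct += (preds == labels).sum().item()
        self._total += labels.numel()

    def evaluate(self):
        acc = 100.0 * self._correct / max(self._total, 1)
        # the {task name: {metric name: score}} shape expected by train_net.py
        return {"Classification": {"Acc@1": acc}}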

class libai.evaluation.PPLEvaluator[source]

Evaluate perplexity for language models.

Perplexity is a measure of how well a probability distribution or probability model predicts a sample.

evaluate()[source]

Evaluate/summarize the performance after processing all input/output pairs.

Returns

A new evaluator class can return a dict of arbitrary format as long as the user can process the results. In our train_net.py, we expect the following format:

  • key: the name of the task (e.g., Classification)

  • value: a dict of {metric name: score}, e.g.: {"Acc@1": 75.0}

Return type

dict

process(inputs, outputs)[source]

Process the pair of inputs and outputs.

pred_logits = outputs["prediction_scores"]
labels = inputs["labels"]
# do evaluation on pred_logits/labels pair
...
Parameters
  • inputs (dict) – the inputs that are used to call the model.

  • outputs (dict) – the dict returned by model(**inputs).

reset()[source]

Preparation for a new round of evaluation. Should be called before starting a round of evaluation.
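
For reference, perplexity is conventionally the exponential of the average per-token cross-entropy. The sketch below shows that standard definition on a prediction_scores/labels pair; it is not necessarily the exact reduction this evaluator performs internally:

import oneflow as flow
import oneflow.nn.functional as F

def perplexity(pred_logits, labels):
    # pred_logits: [batch, seq_len, vocab_size], labels: [batch, seq_len]
    # PPL = exp(mean negative log-likelihood per token); cross_entropy averages by default
    nll = F.cross_entropy(pred_logits.view(-1, pred_logits.size(-1)), labels.view(-1))
    return flow.exp(nll)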

class libai.evaluation.RegEvaluator[source]

evaluate()[source]

Evaluate/summarize the performance after processing all input/output pairs.

Returns

A new evaluator class can return a dict of arbitrary format as long as the user can process the results. In our train_net.py, we expect the following format:

  • key: the name of the task (e.g., Classification)

  • value: a dict of {metric name: score}, e.g.: {"Acc@1": 75.0}

Return type

dict

process(inputs, outputs)[source]

Process the pair of inputs and outputs.

pred_logits = outputs["prediction_scores"]
labels = inputs["labels"]
# do evaluation on pred_logits/labels pair
...
Parameters
  • inputs (dict) – the inputs that are used to call the model.

  • outputs (dict) – the dict returned by model(**inputs).

reset()[source]

Preparation for a new round of evaluation. Should be called before starting a round of evaluation.

libai.evaluation.flatten_results_dict(results)[source]

Expand a hierarchical dict of scalars into a flat dict of scalars. If results[k1][k2][k3] = v, the returned dict will have the entry {"k1/k2/k3": v}.

Parameters

results (dict) – a (possibly nested) dict of scalar values
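
A small example of the flattening behavior (scores are placeholders):

from libai.evaluation import flatten_results_dict

results = {"Classification": {"Acc@1": 75.0, "Acc@5": 92.0}}
flatten_results_dict(results)
# -> {"Classification/Acc@1": 75.0, "Classification/Acc@5": 92.0}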

libai.evaluation.inference_on_dataset(model, data_loader, batch_size, eval_iter, get_batch: Callable, input_placement_device: str, evaluator: Optional[Union[libai.evaluation.evaluator.DatasetEvaluator, List[libai.evaluation.evaluator.DatasetEvaluator]]])[source]

Run model on the data_loader and evaluate the metrics with evaluator. Also benchmark the inference speed of model.__call__ accurately. The model will be used in eval mode.

Parameters
  • model (callable) – a callable which takes an object from data_loader and returns some outputs. If it’s an nn.Module, it will be temporarily set to eval mode. If you wish to evaluate a model in training mode instead, you can wrap the given model and override its behavior of .eval() and .train().

  • batch_size – batch size for inference

  • data_loader – an iterable object with a length. The elements it generates will be the inputs to the model.

  • eval_iter – the number of evaluation iterations to run

  • get_batch – a callable that fetches a batch of data from the data_loader

  • input_placement_device – used by get_batch; set it to cuda or cpu. See input_placement_device in libai.configs.common.train.py for more details.

  • evaluator – the evaluator(s) to run. Use None if you only want to benchmark, but don’t want to do any evaluation.

Returns

The return value of evaluator.evaluate()
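
Below is a usage sketch; model, test_loader, and get_batch are placeholders for your own objects, and only the argument names follow the signature documented above:

from libai.evaluation import ClsEvaluator, inference_on_dataset

results = inference_on_dataset(
    model,                              # placeholder: a callable returning an output dict
    test_loader,                        # placeholder: an iterable data loader with a length
    batch_size=32,
    eval_iter=100,                      # number of evaluation iterations to run
    get_batch=get_batch,                # placeholder: a callable that fetches a batch
    input_placement_device="cuda",      # or "cpu"
    evaluator=ClsEvaluator(topk=(1, 5)),
)
# results is the return value of evaluator.evaluate()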

libai.evaluation.print_csv_format(results)[source]

Print main metrics in a particular format so that they are easy to copy and paste into a spreadsheet.

Parameters

results (OrderedDict[dict]) – a mapping of task_name -> {metric -> score}. An unordered dict can also be printed, but its entries will appear in arbitrary order.
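
For example, using the result format described above (scores are placeholders):

from collections import OrderedDict
from libai.evaluation import print_csv_format

results = OrderedDict({"Classification": {"Acc@1": 75.0, "Acc@5": 92.0}})
print_csv_format(results)   # prints each task's metrics in a layout that is easy to paste into a spreadsheet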