Evaluation

Evaluation is a process that takes a number of inputs/outputs pairs and aggregates them to compute metrics. You can always run the model directly and parse its inputs/outputs manually to perform evaluation. Alternatively, evaluation can be implemented in LiBai through the DatasetEvaluator interface.
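As a minimal sketch of the manual route (assuming, as in the examples below, that each batch is a dict carrying a "labels" key and that model.forward() returns a dict with a "prediction_scores" key), accuracy could be computed directly:

correct, total = 0, 0
for data in data_loader:
    outputs = model(data)  # or model(**data), depending on your model's forward signature
    preds = outputs["prediction_scores"].argmax(dim=-1)
    correct += (preds.cpu() == data["labels"].cpu()).sum().item()
    total += data["labels"].numel()
print("accuracy:", correct / total)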

LiBai includes a few DatasetEvaluator subclasses that compute metrics such as top-N accuracy and perplexity (PPL). You can also implement your own DatasetEvaluator that performs other jobs with the inputs/outputs pairs. For example, to count how many instances are detected on the validation set:

from libai.evaluation import DatasetEvaluator

class Counter(DatasetEvaluator):
  def reset(self):
    self.count = 0

  def process(self, inputs, outputs):
    for output in outputs:
      self.count += len(output["instances"])

  def evaluate(self):
    # save self.count somewhere, or print it, or return it.
    return {"count": self.count}

Customize Evaluator using DatasetEvaluator

DatasetEvaluator is the base class for dataset evaluators. It accumulates information from the inputs/outputs pairs after each batch of inference (via process) and produces the evaluation results at the end (via evaluate). The inputs come from trainer.get_batch(), which converts the output of dataset.__getitem__() into a dict; the outputs are the dict returned by model.forward().
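In outline, the interface looks roughly like the following (a simplified sketch, not the exact LiBai source):

class DatasetEvaluator:
    def reset(self):
        # clear any state accumulated during a previous evaluation run
        pass

    def process(self, inputs, outputs):
        # inputs: the dict produced by trainer.get_batch() for one batch
        # outputs: the dict returned by model.forward() for that batch
        pass

    def evaluate(self):
        # return a dict mapping metric names to values, e.g. {"acc": 0.98}
        pass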

First, declare a new evaluator class that inherits from DatasetEvaluator and overrides its reset, process, and evaluate methods as needed.

For example, declare a MyEvaluator class in libai/evaluation/myevaluator.py:

import copy
from collections import OrderedDict

from libai.evaluation import DatasetEvaluator


class MyEvaluator(DatasetEvaluator):
    def __init__(self):
        self._predictions = []

    def reset(self):
        self._predictions = []

    def process(self, inputs, outputs):
        # the keys of inputs/outputs can be customized
        pred_logits = outputs["prediction_scores"]
        labels = inputs["labels"]

        # keep the top-1 prediction for each sample
        preds = pred_logits.cpu().topk(1)[1].squeeze(1).numpy()
        labels = labels.cpu().numpy()

        self._predictions.append({"preds": preds, "labels": labels})

    def evaluate(self):
        correct = 0.0
        all_sample = 0.0
        for pred in self._predictions:
            preds = pred["preds"]
            labels = pred["labels"]
            correct += (preds == labels).sum()
            all_sample += len(preds)
        self._results = OrderedDict()
        self._results["acc"] = correct / all_sample
        return copy.deepcopy(self._results)

Second, import the customized class and set up evaluation in the config:

from libai.config import LazyCall
from libai.evaluation.myevaluator import MyEvaluator

evaluation=dict(
    enabled=True,
    # evaluator for calculating top-k acc
    evaluator=LazyCall(MyEvaluator)(),
    eval_period=5000,
    eval_iter=1e9,  # maximum number of iterations to run for validation/test
    # metric used for saving the best model checkpoint
    eval_metric="acc",  # the metric key returned by MyEvaluator.evaluate()
    eval_mode="max",  # set "max" or "min" to save the best model according to your metric
)
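The same pattern applies to loss-like metrics: if your evaluator returns, say, perplexity, point eval_metric at that key and use eval_mode="min" so that a lower value wins. A sketch, assuming a hypothetical MyPPLEvaluator whose evaluate() returns {"ppl": ...}:

from libai.evaluation.myppl_evaluator import MyPPLEvaluator  # hypothetical module

evaluation=dict(
    enabled=True,
    evaluator=LazyCall(MyPPLEvaluator)(),
    eval_period=5000,
    eval_iter=1e9,
    eval_metric="ppl",  # key returned by MyPPLEvaluator.evaluate()
    eval_mode="min",  # lower perplexity is better
)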

Run Evaluator Manually

To check your evaluator code outside LiBai, call the evaluator's methods manually:

def get_all_inputs_outputs():
  for data in data_loader:
    yield data, model(data)

evaluator.reset()
for inputs, outputs in get_all_inputs_outputs():
  evaluator.process(inputs, outputs)
eval_results = evaluator.evaluate()

Evaluators can also be used with inference_on_dataset. For example:

eval_results = inference_on_dataset(
    model,
    data_loader,
    evaluator,
    ...
)