Reference-based metrics
These compare the model’s output against a known-good ideal answer in the dataset.
ROUGE
Measures word overlap between the response and the reference. Best for summarisation
and open-ended generation tasks where phrasing can vary.
CLI:
--metric rouge
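To make "word overlap" concrete, here is a minimal sketch of a ROUGE-1 style score (unigram overlap, reported as F1). It is an illustration only: the function name `rouge1_f1` and the whitespace tokenisation are assumptions, and the tool's own ROUGE implementation may use a different variant, stemming, or tokeniser.

```python
from collections import Counter


def rouge1_f1(response: str, ideal: str) -> float:
    """Unigram-overlap F1 between a response and a reference (ROUGE-1 style)."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    resp_tokens = response.lower().split()
    ideal_tokens = ideal.lower().split()
    if not resp_tokens or not ideal_tokens:
        return 0.0

    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(resp_tokens) & Counter(ideal_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(resp_tokens)  # share of response words found in the reference
    recall = overlap / len(ideal_tokens)    # share of reference words covered by the response
    return 2 * precision * recall / (precision + recall)
```

A response that rephrases the reference still earns partial credit for the words it shares, which is why this family of metrics suits summarisation better than exact matching.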
BLEU
N-gram precision with brevity penalty. More common in machine translation evals.
CLI:
--metric bleu
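The two ingredients named above, n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified single-reference version with no smoothing; `simple_bleu` and its `max_n` parameter are hypothetical names for illustration, not the tool's API, and standard BLEU typically uses n-grams up to 4.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(response: str, ideal: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions, scaled by a brevity penalty."""
    resp = response.lower().split()
    ref = ideal.lower().split()
    if not resp or not ref:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        resp_ngrams = ngrams(resp, n)
        total = sum(resp_ngrams.values())
        # Clip each n-gram count by how often it appears in the reference.
        clipped = sum((resp_ngrams & ngrams(ref, n)).values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: responses shorter than the reference are scaled down.
    bp = 1.0 if len(resp) >= len(ref) else math.exp(1 - len(ref) / len(resp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Without the brevity penalty, a two-word response made of correct words would score perfectly on precision; the penalty is what keeps very short outputs from gaming the metric.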
Exact match
Binary score: 1.0 if the response matches the ideal exactly, 0.0 otherwise.
Useful for classification, short answers, and code generation with deterministic output.
CLI:
--metric exact
LLM-as-judge
Uses a separate model to evaluate response quality on configurable criteria. Works with or without a reference answer.
Custom criteria
Custom prompt template
For full control over the judge prompt, pass a .md file with these placeholders:
{criteria}, {conversation}, {response}, {ideal_section}.
examples/judge_prompt.md in the repo is a copy of the default template to start from.
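As a rough picture of how the four placeholders fit together, the sketch below fills a template with Python's `str.format`. Everything here is an assumption made for illustration: the template text is invented (the real default is in examples/judge_prompt.md), `render_judge_prompt` is a hypothetical helper, and the tool may substitute placeholders differently.

```python
# Hypothetical template text, NOT the default shipped in examples/judge_prompt.md.
TEMPLATE = """You are grading a model response.

Criteria:
{criteria}

Conversation:
{conversation}

Response:
{response}
{ideal_section}"""


def render_judge_prompt(template, criteria, conversation, response, ideal=None):
    """Fill the four placeholders; the reference block is empty when no ideal exists."""
    # Assumption: {ideal_section} collapses to nothing for reference-free judging.
    ideal_section = f"\nReference answer:\n{ideal}\n" if ideal is not None else ""
    return template.format(
        criteria=criteria,
        conversation=conversation,
        response=response,
        ideal_section=ideal_section,
    )


prompt = render_judge_prompt(
    TEMPLATE,
    criteria="Is the answer factually correct?",
    conversation="user: What is the capital of France?",
    response="Paris.",
    ideal="Paris",
)
```

If substitution does work like `str.format`, any literal braces in a custom template would need doubling (`{{` and `}}`), so it is worth checking the default template for how it handles them.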
CLI
Custom metrics
Pass any Python function that takes (response, ideal=None, messages=None, **kwargs)
and returns a float or a dict with a "score" key.
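Two small functions matching that signature are sketched below, one per return shape. The names `within_length`, `keyword_coverage`, and `MAX_WORDS` are made up for illustration and are not taken from examples/custom_metrics.py.

```python
MAX_WORDS = 50  # hypothetical threshold for this example


def within_length(response, ideal=None, messages=None, **kwargs):
    """Return a plain float: 1.0 if the response is short enough, else 0.0."""
    return 1.0 if len(response.split()) <= MAX_WORDS else 0.0


def keyword_coverage(response, ideal=None, messages=None, **kwargs):
    """Return a dict with a "score" key: share of the ideal's words found in the response."""
    wanted = set(ideal.lower().split()) if ideal else set()
    if not wanted:
        # No reference to compare against, so there is nothing to cover.
        return {"score": 0.0}
    found = wanted & set(response.lower().split())
    return {"score": len(found) / len(wanted)}
```

Accepting `**kwargs` even when unused keeps a function compatible if the caller passes extra arguments.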
CLI
Point at a Python file and name the function:
See examples/custom_metrics.py for three ready-to-use examples.