These metrics compare the model's output against a known-good ideal answer in the dataset.
ROUGE
Measures word overlap between the response and the reference. Best for summarisation
and open-ended generation tasks where phrasing can vary.
```python
from aevyra_verdict import RougeScore

RougeScore()                    # defaults to rougeL
RougeScore(variant="rouge1")    # unigram overlap
RougeScore(variant="rouge2")    # bigram overlap
RougeScore(variant="rougeL")    # longest common subsequence
```
CLI: --metric rouge
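The library's internal implementation isn't shown here, but the idea behind unigram overlap (ROUGE-1) can be sketched in a few lines. The function name `rouge1_f1` and the whitespace tokenisation are assumptions for illustration, not aevyra_verdict's API:

```python
from collections import Counter

def rouge1_f1(response: str, reference: str) -> float:
    """Illustrative ROUGE-1 F1: unigram overlap between response and reference."""
    resp = Counter(response.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((resp & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# A partially matching response still earns credit, which is why
# ROUGE suits tasks where phrasing can vary.
print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

ROUGE-2 swaps unigrams for bigrams, and ROUGE-L replaces counting with the longest common subsequence; the precision/recall/F1 shape stays the same.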
BLEU
N-gram precision with brevity penalty. More common in machine translation evals.
```python
from aevyra_verdict import BleuScore

BleuScore()             # 4-gram BLEU by default
BleuScore(max_ngram=2)  # bigram BLEU
```
CLI: --metric bleu
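As a rough sketch of what BLEU computes (geometric mean of clipped n-gram precisions, scaled by a brevity penalty that punishes short responses), assuming whitespace tokenisation and no smoothing; the function below is illustrative, not aevyra_verdict's implementation:

```python
import math
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(response: str, reference: str, max_ngram: int = 4) -> float:
    """Illustrative BLEU: clipped n-gram precision with brevity penalty."""
    resp, ref = response.split(), reference.split()
    precisions = []
    for n in range(1, max_ngram + 1):
        resp_ngrams = ngram_counts(resp, n)
        if not resp_ngrams:
            return 0.0  # response too short to form an n-gram
        clipped = sum((resp_ngrams & ngram_counts(ref, n)).values())
        precisions.append(clipped / sum(resp_ngrams.values()))
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU zeroes out on any missing n-gram order
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_ngram)
    # Brevity penalty: responses shorter than the reference are scaled down.
    bp = 1.0 if len(resp) > len(ref) else math.exp(1 - len(ref) / len(resp))
    return bp * geo_mean
```

Production BLEU implementations add smoothing so that a single missing n-gram order doesn't zero the whole score, which matters for short responses.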
Exact match
Binary score — 1.0 if the response matches the ideal exactly, 0.0 otherwise.
Useful for classification, short answers, and code generation with deterministic output.
```python
from aevyra_verdict import ExactMatch

ExactMatch()                    # case-insensitive, strips whitespace
ExactMatch(case_sensitive=True)
```
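The default normalisation (strip surrounding whitespace, ignore case) amounts to the following; this is a sketch of the comparison logic, not aevyra_verdict's source:

```python
def exact_match(response: str, ideal: str, case_sensitive: bool = False) -> float:
    """Illustrative exact match: 1.0 on an exact (normalised) match, else 0.0."""
    a, b = response.strip(), ideal.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0
```

Because the score is binary, it gives no partial credit: "Paris, France" against an ideal of "Paris" scores 0.0, which is why exact match fits deterministic outputs rather than free-form generation.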