Install
pip install aevyra-verdict
Set your API keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
Run aevyra-verdict providers to see which keys are configured.
Prepare a dataset
Create a JSONL file where each line is a conversation in OpenAI message format.
The ideal field is the reference answer used by scoring metrics.
{ "messages" : [{ "role" : "user" , "content" : "What is the capital of France?" }], "ideal" : "Paris" }
{ "messages" : [{ "role" : "user" , "content" : "Explain binary search in one sentence." }], "ideal" : "Binary search repeatedly halves a sorted array to find a target value in O(log n) time." }
Run your first eval
aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano -m qwen/qwen3.5-9b
You’ll see a progress bar and a comparison table when it finishes:
Eval: dataset | Metric: rouge_rougeL
------------------------------------------------------------------------
Model                  Mean     Stdev   Latency   Errors
------------------------------------------------------------------------
openai/gpt-5.4-nano    0.7823   N/A     312.4ms   0
qwen/qwen3.5-9b        0.7541   N/A     289.1ms   0
------------------------------------------------------------------------
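The Mean column reports the reference-based metric named in the header, rouge_rougeL, which scores each response by its longest-common-subsequence overlap with the ideal answer. As a rough sketch of that comparison, here is a single-example score computed with the rouge-score package; this is an illustration of the metric, not necessarily the implementation aevyra-verdict uses internally:

# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

ideal = "Paris"
response = "The capital of France is Paris."

# score(target, prediction) returns precision, recall, and F1 for each requested metric.
result = scorer.score(ideal, response)["rougeL"]
print(f"ROUGE-L F1: {result.fmeasure:.4f}")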
Run against a local model
If you have Ollama running locally, you can benchmark against it
without any API keys:
ollama pull llama3.1
ollama pull mistral
aevyra-verdict run examples/sample_data.jsonl \
-m local/llama3.1 \
-m local/mistral \
--base-url http://localhost:11434/v1
Or with a local vLLM instance:
aevyra-verdict run examples/sample_data.jsonl \
-m openai/gpt-5.4-nano \
-m local/meta-llama/Llama-3.1-8B-Instruct \
--base-url http://localhost:8000/v1
This is useful for benchmarking a fine-tuned model against a hosted baseline before
deciding whether to deploy it.
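Whichever local backend you use, it can help to confirm the endpoint actually speaks the OpenAI-compatible API before pointing the eval at it. A minimal check with the openai Python client (port 11434 for Ollama, 8000 for vLLM, matching the commands above):

from openai import OpenAI

# Local Ollama and vLLM servers accept any placeholder API key.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# Lists the models the server is currently serving.
for model in client.models.list():
    print(model.id)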
Save results
aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano -o results.json
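The saved file can then be inspected or fed into your own tooling. Its exact schema is specific to aevyra-verdict, so a schema-agnostic first look is the safest starting point:

import json
from pathlib import Path

# Pretty-print the saved results to see which fields the tool actually emits.
results = json.loads(Path("results.json").read_text())
print(json.dumps(results, indent=2))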
Next steps
Compare more models: use a config file to manage multiple models, including local vLLM instances.
Add an LLM judge: score responses with an LLM judge instead of reference-based metrics.