Install
pip install aevyra-reflex
That’s the entire setup: no YAML config, no framework to configure, no separate server. This also installs aevyra-verdict for evaluation.
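A quick way to confirm both packages installed is to import them; the module names below match the Python API used later on this page:

import aevyra_reflex   # the optimizer
import aevyra_verdict  # the evaluation library installed alongside it

print("aevyra-reflex and aevyra-verdict are ready")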
Set your API keys
Set keys for whichever model provider you’re using:
export OPENROUTER_API_KEY=sk-or-...   # for OpenRouter models (eval + reasoning)
export ANTHROPIC_API_KEY=sk-ant-...   # if using Claude as the reasoning model
export OPENAI_API_KEY=sk-...          # if optimizing an OpenAI model
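If you drive reflex from Python rather than the shell, the same variables can be set in-process before any provider is constructed. A minimal sketch, with a placeholder where your real key goes:

import os

# The same environment variable the CLI reads, set from Python instead.
os.environ["OPENROUTER_API_KEY"] = "sk-or-..."  # placeholder, not a real key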
For local models (Ollama), no key is needed — everything runs on your machine:
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--reasoning-model ollama/qwen3:8b
Run the example
The examples/ directory includes a ready-to-run dataset: 100 security
incident reports where the task is to produce a strict 3-sentence executive
brief. The starting prompt is four words. The model starts at 0.38 and finishes
at 0.89 — a 134% improvement, statistically significant on held-out data.
export OPENROUTER_API_KEY=sk-or-...
aevyra-reflex optimize examples/security_incidents.jsonl \
examples/security_incidents_prompt.md \
-m openrouter/meta-llama/llama-3.1-8b-instruct \
--reasoning-model openrouter/qwen/qwen3-8b \
--judge openrouter/qwen/qwen3-8b \
--judge-criteria examples/security_incidents_judge.md \
--max-workers 4 \
-o examples/security_incidents_best_prompt.md
You’ll see a baseline eval, 4 strategy phases, and a final test set verification:
====================================================
OPTIMIZATION RESULTS
====================================================
Train/val/test : 45 / 20 / 35 samples
Baseline score : 0.3786 (on 35-sample test set)
Final score : 0.8857 (on 35-sample test set)
Improvement : +0.5071 (+134.0%)
Significance : p=0.0000 ✓ significant (α=0.05, paired test)
Iterations : 10
Converged : True
====================================================
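The significance line comes from a paired test over per-sample test scores: each test sample is scored under both the baseline and the final prompt, and the paired differences are tested. The exact test reflex runs isn't shown here; a paired t-test is one standard reading of that number, sketched below with made-up scores:

from scipy import stats

# Hypothetical per-sample judge scores on a 35-sample test set (illustrative only).
baseline = [0.25, 0.40, 0.50, 0.30, 0.45] * 7
final = [0.85, 0.90, 0.95, 0.80, 0.90] * 7

# Each sample is scored under both prompts, so the scores are paired.
t, p = stats.ttest_rel(final, baseline)
print(f"p={p:.4f}  significant at alpha=0.05: {p < 0.05}")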
See the full walkthrough for a phase-by-phase breakdown of every decision reflex made.
Bring your own dataset
Use the same JSONL format as verdict; each line is a JSON object with a messages array and an ideal answer:
{"messages": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"messages": [{"role": "user", "content": "Explain binary search in one sentence."}], "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time."}
CSV is also supported:
aevyra-reflex optimize data.csv prompt.md -m openrouter/meta-llama/llama-3.1-8b-instruct
No ideal answers? Use an LLM judge instead of automated metrics; see Label-free evaluation.
Write a starting prompt
Create a plain text file with your system prompt. It doesn’t need to be good —
reflex will improve it:
You are a helpful assistant. Answer questions concisely.
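Seeding that file from Python works too; prompt.md is the filename used in the CLI examples on this page:

from pathlib import Path

# A deliberately plain starting prompt; improving it is reflex's job.
Path("prompt.md").write_text("You are a helpful assistant. Answer questions concisely.\n")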
Use the Python API
from pathlib import Path

from aevyra_verdict import Dataset, LLMJudge
from aevyra_verdict.providers import OpenRouterProvider
from aevyra_reflex import PromptOptimizer

result = (
    PromptOptimizer()
    .set_dataset(Dataset.from_jsonl("examples/security_incidents.jsonl"))
    .add_provider("openrouter", "meta-llama/llama-3.1-8b-instruct")
    .add_metric(LLMJudge(
        judge_provider=OpenRouterProvider(model="qwen/qwen3-8b"),
        criteria=Path("examples/security_incidents_judge.md").read_text(),
    ))
    .run(Path("examples/security_incidents_prompt.md").read_text())
)

print(result.summary())
result.save_best_prompt("best_prompt.md")
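Each builder call mirrors a CLI flag from the run above: set_dataset is the positional dataset argument, add_provider is -m, and the LLMJudge metric stands in for --judge and --judge-criteria.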
Explore the run in the dashboard
Once you have a run, open the dashboard to see the score trajectory, prompt diffs between iterations, and the reasoning model's analysis. It serves at http://localhost:8128 with no separate server and no build step. Click into any run to see what changed each iteration and why.
Set a real target (verdict → reflex)
Instead of an arbitrary threshold, set the target from a real benchmark. If
you already ran aevyra-verdict, pass the results file:
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--verdict-results results.json \
-o best_prompt.md
Or let reflex benchmark for you in one command:
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--target openai/gpt-4o-mini \
-o best_prompt.md
Next steps
Tutorial: Full walkthrough of the security incidents example
Dashboard: Score charts, prompt diffs, branch runs
Strategies: Auto, iterative, structural, PDO, fewshot
Configuration: Iterations, thresholds, parallelism, strategy params