aevyra-reflex diagnoses why scores are falling short and rewrites the prompt, iterating until it converges. It works in two modes. Standard mode: point it at an eval dataset, and Reflex scores each candidate prompt against the dataset and rewrites until the score meets the target.
Documentation Index
Fetch the complete documentation index at: https://docs.aevyra.ai/llms.txt
Use this file to discover all available pages before exploring further.
Dashboard
Every run is immediately explorable in the built-in dashboard at http://localhost:8128, with score
trajectory charts, prompt diffs, reasoning analysis, and token usage. Branch
from any iteration to continue with a different strategy.
Quick start
Optimize your first prompt in under 5 minutes
Tutorial: standard mode
Full walkthrough: 0.38 → 0.89 on a real format-compliance task
Tutorial: pipeline mode
Optimize a tool-calling agent with full execution trace evaluation
Open the dashboard
Score charts, prompt diffs, reasoning traces, and branch runs
Why reflex
No config files. No YAML. No framework to learn. Point it at a dataset and a prompt file and it runs.
Lightweight. No heavy framework dependencies: just Python, the standard library, and numpy for PDO math. Installs in seconds and has no opinion about the rest of your stack.
Fully local. Ollama and vLLM are supported, so you can run everything on your own
hardware and nothing leaves your machine.
Resumable. --resume picks up exactly where it left off. Val history, best-prompt
selection, and token totals are all restored correctly across as many
interruptions as you need.
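A resumable run boils down to persisting the loop state between sessions. The sketch below is a minimal illustration of that idea, not reflex's actual checkpoint format; the file name and state keys are assumptions.

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical checkpoint file name

def save_state(state: dict) -> None:
    # Persist everything needed to resume: iteration count, val history,
    # best prompt so far, and running token totals.
    CHECKPOINT.write_text(json.dumps(state))

def load_state() -> dict:
    # On --resume, restore the saved state; otherwise start fresh.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"iteration": 0, "val_history": [], "best_prompt": None, "total_tokens": 0}
```

Because the val history and token totals are part of the saved state, selection and accounting stay correct no matter how many times the run is interrupted.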
Full token accounting. Eval tokens and reasoning tokens are tracked per
iteration and accumulated across sessions, including resumed runs. The final
results show the true total cost of the optimization, not just the last session.
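Per-iteration accounting is just two running series whose sums give the true cost. A minimal sketch of the idea (the class and field names are illustrative, not reflex's API):

```python
from dataclasses import dataclass, field

@dataclass
class TokenLedger:
    # Tracks eval and reasoning tokens per iteration; because the lists are
    # part of the persisted state, totals accumulate across resumed sessions.
    eval_tokens: list = field(default_factory=list)
    reasoning_tokens: list = field(default_factory=list)

    def record(self, eval_t: int, reasoning_t: int) -> None:
        self.eval_tokens.append(eval_t)
        self.reasoning_tokens.append(reasoning_t)

    @property
    def total(self) -> int:
        # The true cost of the whole optimization, not just the last session.
        return sum(self.eval_tokens) + sum(self.reasoning_tokens)

ledger = TokenLedger()
ledger.record(1200, 800)   # session 1, iteration 1
ledger.record(1100, 950)   # session 2 (resumed), iteration 2
```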
Overfitting protection. An optional validation split monitors generalization
throughout training. The best prompt is selected against the val set — so the
final test eval reflects real-world performance rather than a prompt tuned to
the specific examples it was optimized on.
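Selecting against the val set rather than the train set is the whole trick. A toy sketch under assumed data shapes (not reflex's internals):

```python
def select_best_prompt(candidates):
    # candidates: list of (prompt, train_score, val_score) tuples.
    # Choosing by val score, not train score, means the winner is the
    # prompt that generalizes, not the one tuned to its training examples.
    return max(candidates, key=lambda c: c[2])[0]

candidates = [
    ("prompt_a", 0.95, 0.70),  # overfit: great on train, weak on val
    ("prompt_b", 0.88, 0.84),  # slightly worse on train, generalizes better
]
```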
What it does
In both modes, aevyra-reflex runs the same three-step loop:
- Runs a baseline eval on a held-out test set to measure the starting score
- Optimizes the prompt on the training set, iterating until the score meets the target
- Re-evaluates on the held-out test set so reported improvement is honest
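The three steps above can be sketched as a single loop. This is an illustrative skeleton, assuming hypothetical `evaluate` and `rewrite` callables, not reflex's actual implementation:

```python
def optimize(prompt, train_set, test_set, evaluate, rewrite,
             target=0.9, max_iters=10):
    # 1. Baseline eval on the held-out test set.
    baseline = evaluate(prompt, test_set)
    # 2. Optimize on the training set until the score meets the target.
    score = evaluate(prompt, train_set)
    for _ in range(max_iters):
        if score >= target:
            break
        prompt = rewrite(prompt, score)
        score = evaluate(prompt, train_set)
    # 3. Re-evaluate on the held-out test set so reported gains are honest.
    final = evaluate(prompt, test_set)
    return prompt, baseline, final
```

Because the baseline and final scores come from the same held-out set, the reported improvement cannot be an artifact of tuning to the training examples.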
When to use it
Standard mode — when the prompt directly produces the output you want to score:
- A model scores poorly on your eval — you want a better prompt, not a bigger model
- You ran verdict and model A beats model B — you want to close the gap through prompt engineering
- You’re iterating on a system prompt and want to automate the feedback loop
- You’re migrating a prompt from one model family to another (e.g. Claude → Llama)
Pipeline mode — when scoring needs the agent's full execution trace:
- Your agent calls tools and you need the judge to verify the right ones were called
- A static (input, ideal) dataset can’t tell whether the model used its tools or answered from memory
- You want the optimizer to diagnose grounding failures, not just surface-level output differences
Optimization strategies
The auto strategy (default) picks the right technique for each phase. You can also run any strategy directly.
Auto
Multi-phase pipeline — structural → iterative → fewshot, chosen adaptively
Iterative
Diagnose failures, revise wording, repeat. Label-free aware.
Structural
Reorganize formatting, sections, and hierarchy
PDO
Tournament-style search with dueling bandits and adaptive ranking
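To make "tournament-style search with dueling bandits" concrete, here is a minimal illustration of the general idea: repeated noisy pairwise duels produce win counts that rank candidates. This is a generic sketch, not reflex's PDO algorithm; the strength model and round count are assumptions.

```python
import random

def duel(a_strength, b_strength, rng):
    # One noisy pairwise comparison: the stronger candidate tends to win,
    # but any single duel can go either way.
    return rng.random() < a_strength / (a_strength + b_strength)

def tournament(strengths, rounds=2000, seed=0):
    # Accumulate wins over many random pairings; the win counts form an
    # adaptive ranking, and the top candidate is the tournament winner.
    rng = random.Random(seed)
    wins = [0] * len(strengths)
    for _ in range(rounds):
        a, b = rng.sample(range(len(strengths)), 2)
        if duel(strengths[a], strengths[b], rng):
            wins[a] += 1
        else:
            wins[b] += 1
    return max(range(len(strengths)), key=wins.__getitem__)
```

In a prompt-optimization setting, each "duel" would be a judged head-to-head comparison of two candidate prompts on the same input.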
How it fits together
Standard mode
Pipeline mode
CLI reference
All commands and flags
Strategies
How each optimization axis works
Configuration
Tuning iterations, thresholds, and parallelism
Providers
OpenAI, Anthropic, Gemini, Ollama, and more