# Introduction

> Agentic prompt optimization built for production workloads: crash-safe, resumable, and fully auditable.

aevyra-reflex diagnoses why scores are falling short and rewrites the prompt, iterating until it converges. It works in two modes:

**Standard mode**: point it at an eval dataset. Reflex scores each candidate prompt against the dataset and rewrites until the score meets the target.

```bash
pip install aevyra-reflex
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -o best_prompt.md
```

**Pipeline mode**: point it at your agent. Reflex re-runs the full pipeline on every candidate prompt so the judge sees tool calls, intermediate outputs, and the final answer, not just the output string. Your existing agent code doesn't change; you add one wrapper function.

```bash
aevyra-reflex optimize prompt.md \
  --pipeline-file pipeline.py \
  --inputs-file inputs.json \
  --judge openrouter/qwen/qwen3-30b-a3b \
  --judge-criteria criteria.md
```

Works with any model: local Ollama or vLLM, OpenAI, Anthropic, Gemini, or any OpenAI-compatible endpoint. All evaluation runs through [aevyra-verdict](https://github.com/aevyraai/verdict).

## Dashboard

Every run is immediately explorable in the built-in dashboard:

```bash
aevyra-reflex dashboard
```

No separate server, no build step: the command opens `http://localhost:8128` with score trajectory charts, prompt diffs, reasoning analysis, and token usage. Branch from any iteration to continue with a different strategy.

* Optimize your first prompt in under 5 minutes
* Full walkthrough: 0.38 → 0.89 on a real format-compliance task
* Optimize a tool-calling agent with full execution trace evaluation
* Score charts, prompt diffs, reasoning traces, and branch runs

## Why reflex

**No config files.** No YAML.
No framework to learn. Point it at a dataset and a prompt file and it runs.

**Lightweight.** No heavy framework dependencies. Just Python, the standard library, and `numpy` for PDO math. Installs in seconds and has no opinion about the rest of your stack.

**Fully local.** Ollama and vLLM are supported, so you can run everything on your own hardware and nothing leaves your machine:

```bash
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model ollama/qwen3:8b
```

**Agentic, not scripted.** On each iteration, the reasoning model explains *why* it made a change, so you learn from the run rather than just getting an output. The causal rewrite log tracks what helped, what had no effect, and what hurt, so the model avoids repeating dead ends.

**Crash-safe resumption.** Every iteration is checkpointed to disk as it completes. Kill the process, restart the machine, lose your connection: `--resume` picks up exactly where it left off. Validation history, best-prompt selection, and token totals are all restored correctly across as many interruptions as you need.

**Full token accounting.** Eval tokens and reasoning tokens are tracked per iteration and accumulated across sessions, including resumed runs. The final results show the true total cost of the optimization, not just the last session.

**Overfitting protection.** An optional validation split monitors generalization throughout training. The best prompt is selected against the validation set, so the final test eval reflects real-world performance rather than a prompt tuned to the specific examples it was optimized on.

## What it does

In both modes, aevyra-reflex runs the same three-step loop:

1. Runs a baseline eval on a held-out test set to measure the starting score
2. Optimizes the prompt on the training set, iterating until the score meets the target
3. Re-evaluates on the held-out test set so reported improvement is honest

The difference is what "eval" means.
In standard mode it scores a model response against an ideal. In pipeline mode it re-runs your full agent and scores the resulting execution trace.

## When to use it

**Standard mode**: when the prompt directly produces the output you want to score:

* A model scores poorly on your eval, and you want a better prompt, not a bigger model
* You ran verdict and model A beats model B, and you want to close the gap through prompt engineering
* You're iterating on a system prompt and want to automate the feedback loop
* You're migrating a prompt from one model family to another (e.g. Claude → Llama)

**Pipeline mode**: when the prompt lives inside an agent and correctness depends on behaviour you can't see from the output string alone:

* Your agent calls tools and you need the judge to verify the right ones were called
* A static (input, ideal) dataset can't tell whether the model used its tools or answered from memory
* You want the optimizer to diagnose grounding failures, not just surface-level output differences

## Optimization strategies

The **auto** strategy (default) picks the right technique for each phase. You can also run any strategy directly.

* Multi-phase pipeline: structural → iterative → fewshot, chosen adaptively
* Diagnose failures, revise wording, repeat. Label-free aware.
* Reorganize formatting, sections, and hierarchy
* Tournament-style search with dueling bandits and adaptive ranking

A typical **auto** run (from the [security incidents tutorial](/reflex/tutorial-security-incidents)):

```mermaid
flowchart LR
    B([baseline\n0.39]):::neutral
    S[structural\n0.86]:::big
    F[fewshot\n0.83]:::phase
    I[iterative\n0.83]:::phase
    P[PDO\n1.00]:::big
    T([test set\n0.89]):::neutral

    B --> S --> F --> I --> P --> T

    classDef neutral fill:#444,color:#fff,stroke:none
    classDef phase fill:#9B6BFF,color:#fff,stroke:none
    classDef big fill:#2ECC71,color:#fff,stroke:none
```

Each phase hands its best prompt to the next.
Structural made the biggest jump (formatting was the main gap); PDO polished it to convergence.

## How it fits together

**Standard mode**

```mermaid
flowchart LR
    DS[Dataset\nJSONL / CSV]:::data
    ST[Strategy\nauto]:::reflex
    RM[Reasoning\nmodel]:::reflex
    EV[verdict\nevals]:::eval
    OUT[Optimized\nprompt]:::output

    DS --> ST
    ST -->|rewrites| RM
    RM -->|revised prompt| EV
    EV -->|score feedback| ST
    ST --> OUT

    classDef data fill:#6E3FF3,color:#fff,stroke:none
    classDef reflex fill:#9B6BFF,color:#fff,stroke:none
    classDef eval fill:#3FBFFF,color:#fff,stroke:none
    classDef output fill:#2ECC71,color:#fff,stroke:none
```

**Pipeline mode**

```mermaid
flowchart LR
    IN[Raw inputs\nJSON array]:::data
    ST[Strategy\nauto]:::reflex
    RM[Reasoning\nmodel]:::reflex
    PL[Your pipeline\npipeline_fn]:::pipeline
    JD[Judge\nfull trace]:::eval
    OUT[Optimized\nprompt]:::output

    IN --> ST
    ST -->|rewrites| RM
    RM -->|candidate prompt| PL
    PL -->|AgentTrace| JD
    JD -->|score feedback| ST
    ST --> OUT

    classDef data fill:#6E3FF3,color:#fff,stroke:none
    classDef reflex fill:#9B6BFF,color:#fff,stroke:none
    classDef pipeline fill:#A78BFA,color:#fff,stroke:none
    classDef eval fill:#3FBFFF,color:#fff,stroke:none
    classDef output fill:#2ECC71,color:#fff,stroke:none
```

* All commands and flags
* How each optimization axis works
* Tuning iterations, thresholds, and parallelism
* OpenAI, Anthropic, Gemini, Ollama, and more
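
The three-step loop from "What it does" (baseline on the test set, optimize on the training set, re-evaluate on the test set) can be sketched in a few lines of Python. This is an illustration only, not aevyra-reflex's API: `optimize`, `RunResult`, `score_fn`, `rewrite_fn`, and the toy scorer and rewriter are all hypothetical stand-ins for the verdict evals and the reasoning model, and the real tool's stopping and selection logic may differ.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration; not part of aevyra-reflex.
Example = Tuple[str, str]                   # (input, ideal)
ScoreFn = Callable[[str, str, str], float]  # (prompt, input, ideal) -> score in [0, 1]
RewriteFn = Callable[[str, float], str]     # (prompt, train score) -> revised prompt


@dataclass
class RunResult:
    baseline_test: float
    final_test: float
    best_prompt: str
    iterations: int


def mean_score(prompt: str, examples: Sequence[Example], score_fn: ScoreFn) -> float:
    """Average score of one prompt over a set of examples."""
    return sum(score_fn(prompt, x, y) for x, y in examples) / len(examples)


def optimize(prompt: str, train: Sequence[Example], val: Sequence[Example],
             test: Sequence[Example], score_fn: ScoreFn, rewrite_fn: RewriteFn,
             target: float = 0.9, max_iters: int = 10) -> RunResult:
    # Step 1: baseline eval on the held-out test set.
    baseline = mean_score(prompt, test, score_fn)

    # Step 2: iterate on the training set until the score meets the target.
    current = prompt
    best_prompt, best_val = prompt, mean_score(prompt, val, score_fn)
    iterations = 0
    for _ in range(max_iters):
        train_score = mean_score(current, train, score_fn)
        if train_score >= target:
            break
        current = rewrite_fn(current, train_score)
        iterations += 1
        # The best prompt is chosen on the validation split, not the training set.
        val_score = mean_score(current, val, score_fn)
        if val_score > best_val:
            best_prompt, best_val = current, val_score

    # Step 3: re-evaluate the selected prompt on the same held-out test set.
    final = mean_score(best_prompt, test, score_fn)
    return RunResult(baseline, final, best_prompt, iterations)


# Toy stand-ins: an example scores 1.0 when the prompt mentions the rule it needs.
RULES = ["json", "concise", "cite"]


def toy_score(prompt: str, _input: str, ideal: str) -> float:
    return 1.0 if ideal in prompt else 0.0


def toy_rewrite(prompt: str, _train_score: float) -> str:
    # Append the first rule the prompt does not mention yet.
    for rule in RULES:
        if rule not in prompt:
            return f"{prompt} {rule}"
    return prompt


def run_toy_example() -> RunResult:
    data = [("q1", "json"), ("q2", "concise"), ("q3", "cite")]
    return optimize("Answer the question.", data, data, data, toy_score, toy_rewrite)
```

Calling `run_toy_example()` walks the toy prompt from a baseline test score of 0.0 to 1.0 in three rewrites. In a real run the three splits would be disjoint; the toy reuses one list only to keep the example short.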