# Introduction

> Agentic prompt optimization built for production workloads — crash-safe, resumable, and fully auditable.

aevyra-reflex diagnoses why scores are falling short and rewrites the prompt —
iterating until it converges. It works in two modes:

**Standard mode** — point it at an eval dataset. Reflex scores each candidate
prompt against the dataset and rewrites until the score meets the target.

```bash theme={null}
pip install aevyra-reflex
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -o best_prompt.md
```

**Pipeline mode** — point it at your agent. Reflex re-runs the full pipeline on
every candidate prompt so the judge sees tool calls, intermediate outputs, and
the final answer — not just the output string. Your existing agent code doesn't
change; you add one wrapper function.

```bash theme={null}
aevyra-reflex optimize prompt.md \
  --pipeline-file pipeline.py \
  --inputs-file   inputs.json \
  --judge openrouter/qwen/qwen3-30b-a3b \
  --judge-criteria criteria.md
```
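The wrapper idea can be sketched roughly as follows. Everything in this block is hypothetical: `Trace`, `ToolCall`, `run_agent`, and the `pipeline_fn(prompt, user_input)` signature are stand-ins for illustration, not the interface reflex expects from `--pipeline-file`. The pipeline mode tutorial is the reference for the real contract.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str


@dataclass
class Trace:
    """Hypothetical stand-in for the execution trace a judge would see."""
    tool_calls: list = field(default_factory=list)
    final_answer: str = ""


def run_agent(system_prompt: str, user_input: str, trace: Trace) -> str:
    """Toy agent standing in for your existing agent code."""
    if "look up" in system_prompt.lower():
        result = f"docs for {user_input}"
        trace.tool_calls.append(ToolCall("search_docs", {"query": user_input}, result))
        return f"According to the docs: {result}"
    return "I think the answer is 42."  # answered from memory, no tool call


def pipeline_fn(prompt: str, user_input: str) -> Trace:
    """Hypothetical wrapper: run the agent with a candidate prompt, return its trace."""
    trace = Trace()
    trace.final_answer = run_agent(prompt, user_input, trace)
    return trace


grounded = pipeline_fn("Always look up the docs before answering.", "retry policy")
ungrounded = pipeline_fn("Answer briefly.", "retry policy")
print(len(grounded.tool_calls), len(ungrounded.tool_calls))
```

The two traces differ in their tool calls even though both produce a plausible answer string, which is the kind of difference a judge can only score when it sees the full trace.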

Works with any model — local Ollama or vLLM, OpenAI, Anthropic, Gemini, or
any OpenAI-compatible endpoint. All evaluation runs through
[aevyra-verdict](https://github.com/aevyraai/verdict).

## Dashboard

Every run is immediately explorable in the built-in dashboard:

```bash theme={null}
aevyra-reflex dashboard
```

No separate server, no build step — opens `http://localhost:8128` with score
trajectory charts, prompt diffs, reasoning analysis, and token usage. Branch
from any iteration to continue with a different strategy.

<CardGroup cols={2}>
  <Card title="Quick start" icon="bolt" href="/reflex/quickstart">
    Optimize your first prompt in under 5 minutes
  </Card>

  <Card title="Tutorial: standard mode" icon="book-open" href="/reflex/tutorial-security-incidents">
    Full walkthrough: 0.38 → 0.89 on a real format-compliance task
  </Card>

  <Card title="Tutorial: pipeline mode" icon="wand-magic-sparkles" href="/reflex/tutorial-dev-assistant">
    Optimize a tool-calling agent with full execution trace evaluation
  </Card>

  <Card title="Open the dashboard" icon="chart-line" href="/reflex/dashboard">
    Score charts, prompt diffs, reasoning traces, and branch runs
  </Card>
</CardGroup>

## Why reflex

**No config files.** No YAML. No framework to learn. Point it at a dataset and
a prompt file and it runs.

**Lightweight.** No heavy framework dependencies. Just Python, the standard
library, and `numpy` for PDO math. Installs in seconds and has no opinion
about the rest of your stack.

**Fully local.** Ollama and vLLM are supported — run everything on your own
hardware so nothing leaves your machine:

```bash theme={null}
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model ollama/qwen3:8b
```

**Agentic, not scripted.** In each iteration, the reasoning model explains *why*
it made a change, so you learn from the run instead of only receiving an output.
The causal rewrite log tracks what helped, what had no effect, and what hurt, so
the model avoids repeating dead ends.
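One way to picture such a log is a list of entries labeled by their score delta. This is a minimal sketch, not reflex's internal format: `RewriteEntry`, `classify`, `dead_ends`, and the `tolerance` noise band are all assumed names and values.

```python
from dataclasses import dataclass


@dataclass
class RewriteEntry:
    change: str          # what the reasoning model changed
    rationale: str       # why it made the change
    score_before: float
    score_after: float


def classify(entry: RewriteEntry, tolerance: float = 0.01) -> str:
    """Label a rewrite by its score delta; `tolerance` is an assumed noise band."""
    delta = entry.score_after - entry.score_before
    if delta > tolerance:
        return "helped"
    if delta < -tolerance:
        return "hurt"
    return "no_effect"


def dead_ends(log: list) -> list:
    """Changes that did not help, to be fed back as things not to retry."""
    return [e.change for e in log if classify(e) != "helped"]


log = [
    RewriteEntry("add output schema", "responses were free-form", 0.40, 0.70),
    RewriteEntry("add politeness note", "tone seemed terse", 0.70, 0.70),
    RewriteEntry("remove examples", "prompt was long", 0.70, 0.55),
]
print(dead_ends(log))
```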

**Crash-safe resumption.** Every iteration is checkpointed to disk as it
completes. Kill the process, restart the machine, lose your connection —
`--resume` picks up exactly where it left off. Validation history, best-prompt
selection, and token totals are all restored correctly across as many
interruptions as you need.

**Full token accounting.** Eval tokens and reasoning tokens are tracked per
iteration and accumulated across sessions, including resumed runs. The final
results show the true total cost of the optimization, not just the last session.
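Per-iteration checkpointing and cross-session token totals can be illustrated with a generic pattern: write the state atomically after each iteration and reload it on start. The sketch below assumes a single JSON checkpoint file and placeholder token counts. It is not reflex's on-disk format, and `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename over the target so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"iteration": 0, "best_score": 0.0, "eval_tokens": 0, "reasoning_tokens": 0}


def run(path: Path, max_iterations: int, stop_after=None) -> dict:
    """Toy loop; `stop_after` simulates a crash after that many iterations in this session."""
    state = load_checkpoint(path)
    done_this_session = 0
    while state["iteration"] < max_iterations:
        if stop_after is not None and done_this_session >= stop_after:
            break
        state["iteration"] += 1
        state["eval_tokens"] += 1000       # placeholder per-iteration usage
        state["reasoning_tokens"] += 200   # placeholder per-iteration usage
        state["best_score"] = max(state["best_score"], 0.1 * state["iteration"])
        save_checkpoint(path, state)       # checkpoint as each iteration completes
        done_this_session += 1
    return state


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "run.json"
    first = run(ckpt, max_iterations=5, stop_after=2)  # interrupted after 2 iterations
    resumed = run(ckpt, max_iterations=5)              # continues from iteration 3

print(resumed["iteration"], resumed["eval_tokens"], resumed["reasoning_tokens"])
```

Because the token counters live in the checkpoint, the totals after the resumed session cover all five iterations, not just the three run in the second session.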

**Overfitting protection.** An optional validation split monitors generalization
throughout training. The best prompt is selected against the validation set, so
the final test eval measures how well the prompt generalizes, not how well it
fits the specific examples it was optimized on.
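Selecting on the validation score rather than the training score can be shown in a few lines. This is a toy illustration with made-up scores; `Iteration` and `select_best` are hypothetical names, not part of the reflex API.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    prompt: str
    train_score: float
    val_score: float


def select_best(history: list) -> Iteration:
    """Pick the iteration with the highest validation score (on ties, the earliest wins)."""
    return max(history, key=lambda it: it.val_score)


history = [
    Iteration("v1", train_score=0.60, val_score=0.58),
    Iteration("v2", train_score=0.80, val_score=0.77),
    Iteration("v3", train_score=0.95, val_score=0.70),  # train up, val down: overfit
]
best = select_best(history)
print(best.prompt)  # v2
```

Here `v3` has the highest training score, but `v2` is chosen because it generalizes better to examples the optimizer did not see.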

## What it does

In both modes, aevyra-reflex runs the same three steps:

1. Runs a baseline eval on a held-out test set to measure the starting score
2. Optimizes the prompt on the training set, iterating until the score meets the target
3. Re-evaluates on the held-out test set so reported improvement is honest

The difference is what "eval" means. In standard mode it scores a model response against an ideal. In pipeline mode it re-runs your full agent and scores the resulting execution trace.
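The three steps can be sketched end to end with a toy scorer and a toy optimizer. All names here (`split`, `evaluate`, `optimize`) and the keyword-matching "eval" are assumptions made for illustration. In reflex the eval is a verdict score or a judged trace, and the rewriting is done by the reasoning model.

```python
import random


def split(examples: list, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle deterministically and hold out a test slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # train, test


def evaluate(prompt: str, examples: list) -> float:
    """Toy scorer: fraction of examples whose required keyword appears in the prompt."""
    return sum(ex["keyword"] in prompt for ex in examples) / len(examples)


def optimize(prompt: str, train: list, target: float, max_iterations: int = 10) -> str:
    """Toy optimizer: append the keyword from the first failing training example."""
    for _ in range(max_iterations):
        if evaluate(prompt, train) >= target:
            break
        failing = next(ex for ex in train if ex["keyword"] not in prompt)
        prompt = f"{prompt} {failing['keyword']}"
    return prompt


keywords = ["json", "concise", "cite", "json", "concise",
            "cite", "json", "concise", "cite", "json"]
examples = [{"keyword": k} for k in keywords]

train, test = split(examples)
start = "Answer the question."
baseline = evaluate(start, test)                   # step 1: baseline on held-out test set
best_prompt = optimize(start, train, target=1.0)   # step 2: optimize on training set only
final = evaluate(best_prompt, test)                # step 3: re-evaluate on the same test set
print(baseline, final)
```

The optimizer never sees `test`, so the gap between `baseline` and `final` is measured on examples that played no part in the rewriting.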

## When to use it

**Standard mode** — when the prompt directly produces the output you want to score:

* A model scores poorly on your eval — you want a better prompt, not a bigger model
* You ran verdict and model A beats model B — you want to close the gap through prompt engineering
* You're iterating on a system prompt and want to automate the feedback loop
* You're migrating a prompt from one model family to another (e.g. Claude → Llama)

**Pipeline mode** — when the prompt lives inside an agent and correctness depends on behaviour you can't see from the output string alone:

* Your agent calls tools and you need the judge to verify the right ones were called
* A static (input, ideal) dataset can't tell whether the model used its tools or answered from memory
* You want the optimizer to diagnose grounding failures, not just surface-level output differences

## Optimization strategies

The **auto** strategy (default) picks the right technique for each phase.
You can also run any strategy directly.

<CardGroup cols={2}>
  <Card title="Auto" icon="wand-magic-sparkles" href="/reflex/strategies">
    Multi-phase pipeline — structural → iterative → fewshot, chosen adaptively
  </Card>

  <Card title="Iterative" icon="arrows-spin" href="/reflex/strategies#iterative">
    Diagnose failures, revise wording, repeat. Label-free aware.
  </Card>

  <Card title="Structural" icon="table-columns" href="/reflex/strategies#structural">
    Reorganize formatting, sections, and hierarchy
  </Card>

  <Card title="PDO" icon="trophy" href="/reflex/strategies#pdo">
    Tournament-style search with dueling bandits and adaptive ranking
  </Card>
</CardGroup>

A typical **auto** run (from the [security incidents tutorial](/reflex/tutorial-security-incidents)):

```mermaid theme={null}
flowchart LR
    B([baseline\n0.39]):::neutral
    S[structural\n0.86]:::big
    F[fewshot\n0.83]:::phase
    I[iterative\n0.83]:::phase
    P[PDO\n1.00]:::big
    T([test set\n0.89]):::neutral

    B --> S --> F --> I --> P --> T

    classDef neutral fill:#444,color:#fff,stroke:none
    classDef phase   fill:#9B6BFF,color:#fff,stroke:none
    classDef big     fill:#2ECC71,color:#fff,stroke:none
```

Each phase hands its best prompt to the next. Structural made the biggest jump (formatting was the main gap); PDO polished it to convergence.

## How it fits together

**Standard mode**

```mermaid theme={null}
flowchart LR
    DS[Dataset\nJSONL / CSV]:::data
    ST[Strategy\nauto]:::reflex
    RM[Reasoning\nmodel]:::reflex
    EV[verdict\nevals]:::eval
    OUT[Optimized\nprompt]:::output

    DS --> ST
    ST -->|rewrites| RM
    RM -->|revised prompt| EV
    EV -->|score feedback| ST
    ST --> OUT

    classDef data    fill:#6E3FF3,color:#fff,stroke:none
    classDef reflex  fill:#9B6BFF,color:#fff,stroke:none
    classDef eval    fill:#3FBFFF,color:#fff,stroke:none
    classDef output  fill:#2ECC71,color:#fff,stroke:none
```

**Pipeline mode**

```mermaid theme={null}
flowchart LR
    IN[Raw inputs\nJSON array]:::data
    ST[Strategy\nauto]:::reflex
    RM[Reasoning\nmodel]:::reflex
    PL[Your pipeline\npipeline_fn]:::pipeline
    JD[Judge\nfull trace]:::eval
    OUT[Optimized\nprompt]:::output

    IN --> ST
    ST -->|rewrites| RM
    RM -->|candidate prompt| PL
    PL -->|AgentTrace| JD
    JD -->|score feedback| ST
    ST --> OUT

    classDef data     fill:#6E3FF3,color:#fff,stroke:none
    classDef reflex   fill:#9B6BFF,color:#fff,stroke:none
    classDef pipeline fill:#A78BFA,color:#fff,stroke:none
    classDef eval     fill:#3FBFFF,color:#fff,stroke:none
    classDef output   fill:#2ECC71,color:#fff,stroke:none
```

<CardGroup cols={2}>
  <Card title="CLI reference" icon="terminal" href="/reflex/cli">
    All commands and flags
  </Card>

  <Card title="Strategies" icon="chess" href="/reflex/strategies">
    How each optimization axis works
  </Card>

  <Card title="Configuration" icon="gear" href="/reflex/configuration">
    Tuning iterations, thresholds, and parallelism
  </Card>

  <Card title="Providers" icon="server" href="/reflex/providers">
    OpenAI, Anthropic, Gemini, Ollama, and more
  </Card>
</CardGroup>
