Reflex ships with a callback system that lets you stream run data to experiment
tracking platforms. Callbacks are optional — they never affect core optimization
behavior, and a broken callback will never crash a run.
## Usage

Pass a `callbacks` list to `.run()`:
```python
from aevyra_reflex import PromptOptimizer, MLflowCallback

result = (
    PromptOptimizer()
    .set_dataset(dataset)
    .add_provider("openai", "gpt-4o-mini")
    .add_metric(RougeScore())
    .run("You are a helpful assistant.", callbacks=[MLflowCallback()])
)
```
Multiple callbacks can be composed freely:
```python
result = optimizer.run(prompt, callbacks=[MLflowCallback(), MyCustomCallback()])
```
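The isolation guarantee above means a failing callback does not take the run down with it. A minimal sketch; `FlakyCallback` is purely illustrative and not part of the library:

```python
# Purely illustrative: FlakyCallback is not part of aevyra-reflex. Its hook
# raises on every iteration, but per the guarantee above the optimization
# run itself still completes.
class FlakyCallback:
    def on_iteration(self, record) -> None:
        raise RuntimeError("telemetry backend unreachable")

result = optimizer.run(prompt, callbacks=[FlakyCallback(), MLflowCallback()])
```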
## MLflow

`MLflowCallback` logs the full run to an MLflow experiment using MLflow's
standard tracking API. No server required — by default it writes to a local
`./mlruns` directory that you can open with `mlflow ui`.
Install:

```bash
pip install aevyra-reflex[mlflow]
```
CLI:

```bash
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 --mlflow

# Custom experiment name and remote tracking server
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 \
  --mlflow \
  --mlflow-experiment security-incidents \
  --mlflow-tracking-uri http://localhost:5000
```
Python API:

```python
from aevyra_reflex import MLflowCallback

result = optimizer.run(prompt, callbacks=[MLflowCallback()])
```
With options:

```python
cb = MLflowCallback(
    run_name="summarization-v2",
    tracking_uri="http://localhost:5000",  # remote MLflow server
    experiment_name="prompt-experiments",
    tags={"team": "nlp", "dataset": "cnn-dm"},
    log_prompt_each_iter=True,  # save prompt artifact every iteration
)
result = optimizer.run(prompt, callbacks=[cb])
```
### What gets logged
| When | What |
|---|---|
| Run start | Params: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source |
| Baseline eval | Metric: score_test at step=0 — held-out test score before any optimization |
| Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge) |
| Each iteration | Artifact: iterations.json — table of iteration, score, prompt, reasoning (updated live) |
| Final eval | Metric: score_test at step=N+1 — held-out test score after optimization |
| Run end | Metrics: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations |
| Run end | Artifact: prompts/best_prompt_*.txt — the winning prompt |
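Because everything above goes through MLflow's standard tracking API, the same artifacts can be pulled back with the regular MLflow client. A hedged sketch, assuming the default local `./mlruns` store; the `run_id` value is a placeholder you copy from the UI:

```python
import mlflow

run_id = "..."  # placeholder: copy a run ID from the MLflow UI or mlflow.search_runs()

# Download the live-updated iteration table and the winning-prompt artifact
# (artifact paths match the names in the table above).
iterations_path = mlflow.artifacts.download_artifacts(
    run_id=run_id, artifact_path="iterations.json"
)
prompts_dir = mlflow.artifacts.download_artifacts(run_id=run_id, artifact_path="prompts")

with open(iterations_path) as f:
    print(f.read()[:500])  # peek at the per-iteration prompt/score/reasoning records
```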
### Viewing results

```bash
mlflow ui
# opens at http://localhost:5000
```
Navigate to your results in three steps:

- Experiments (left sidebar) → click the experiment name (e.g. `security-incidents`)
- Click Evaluation Runs — the table lists every reflex run with its baseline and best score
- Click a run name to open it, then:
  - Overview — run params (`strategy`, `reasoning_model`, etc.) and summary metrics (`best_score_train`, `baseline_score`, `final_score_test`, `improvement`)
  - Model metrics — score trajectory chart, one point per iteration
  - Artifacts → `iterations.json` — interactive table showing the prompt and reasoning used at each iteration alongside its score
  - Artifacts → `prompts/` — `best_prompt_*.txt` with the final winning prompt
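If you prefer a programmatic view to the UI, MLflow's standard search API reads the same store. A hedged sketch, assuming the `security-incidents` experiment name used in the CLI example above:

```python
import mlflow

# Returns a pandas DataFrame with one row per run; metrics are exposed as
# "metrics.<name>" columns and params as "params.<name>" columns.
runs = mlflow.search_runs(experiment_names=["security-incidents"])

# Column names follow the run-end metrics listed in the table above.
print(runs[["run_id", "metrics.baseline_score", "metrics.final_score_test",
            "metrics.improvement_pct"]])
```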
## Writing a custom callback

Implement any subset of the five lifecycle methods:
```python
class MyCallback:
    def on_run_start(self, config, initial_prompt: str) -> None:
        """Called once before the baseline eval."""
        print(f"Starting {config.strategy} run")

    def on_baseline(self, snapshot) -> None:
        """Called once after the baseline (test-set) eval completes.

        snapshot fields:
            mean_score (float)      — baseline score on the held-out test set
            scores_by_metric (dict) — per-metric breakdown
            system_prompt (str)     — the initial prompt that was evaluated
        """
        print(f"Baseline: {snapshot.mean_score:.4f}")

    def on_iteration(self, record) -> None:
        """Called after each iteration completes (after val eval if val-ratio is set).

        record fields:
            iteration (int)         — 1-based iteration number
            score (float)           — mean train score this iteration
            val_score (float|None)  — mean val score, or None if no val split
            scores_by_metric (dict) — per-metric breakdown
            system_prompt (str)     — prompt used this iteration
            reasoning (str)         — reasoning model's explanation
        """
        print(f"  #{record.iteration} train={record.score:.4f}")

    def on_final(self, snapshot) -> None:
        """Called once after the final verification (test-set) eval completes.

        snapshot fields:
            mean_score (float)      — final score on the held-out test set
            scores_by_metric (dict) — per-metric breakdown
            system_prompt (str)     — the best prompt that was evaluated
        """
        print(f"Final test: {snapshot.mean_score:.4f}")

    def on_run_end(self, result) -> None:
        """Called once after on_final, with the full result object.

        result fields:
            best_prompt (str)       — the best system prompt found
            best_score (float)      — best train score seen during optimization
            baseline.mean_score     — test-set score before optimization
            final.mean_score        — test-set score after optimization
            improvement (float)     — final - baseline
            improvement_pct (float) — improvement as a percentage
            score_trajectory (list) — train score after each iteration
            val_trajectory (list)   — val score after each iteration (empty if no val split)
            phase_history (list)    — auto strategy phase breakdown
        """
        print(f"Done. {result.baseline.mean_score:.4f} → {result.final.mean_score:.4f}")


result = optimizer.run(prompt, callbacks=[MyCallback()])
```
You only need to implement the methods you care about. If a callback does not define one
of these hooks, that hook is silently skipped — there is no base class to inherit from.
The lifecycle order is: `on_run_start` → `on_baseline` → `on_iteration` (×N) → `on_final` → `on_run_end`.
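For example, a minimal progress logger can implement `on_iteration` alone; the other hooks are simply never invoked on it:

```python
class ProgressLogger:
    """Implements only on_iteration; the other lifecycle hooks are skipped."""

    def on_iteration(self, record) -> None:
        print(f"iter {record.iteration}: train={record.score:.4f}")

result = optimizer.run(prompt, callbacks=[ProgressLogger()])
```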
Weights & Biases
WandbCallback logs the full run to a W&B project using the standard
wandb Python SDK.
Install:

```bash
pip install aevyra-reflex[wandb]
wandb login  # one-time setup
```
CLI:

```bash
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 --wandb

# Custom project name
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 \
  --wandb \
  --wandb-project security-incidents
```
Python API:

```python
from aevyra_reflex import WandbCallback

result = optimizer.run(prompt, callbacks=[WandbCallback()])
```
With options:

```python
cb = WandbCallback(
    project="prompt-experiments",
    run_name="summarization-v2",
    entity="my-team",
    tags=["auto", "cnn-dm"],
    log_prompt_each_iter=True,  # log prompt text as a W&B Table each iteration
)
result = optimizer.run(prompt, callbacks=[cb])
```
Testing without a W&B account:
Use mode="offline" to write runs locally without any network access.
Run wandb sync ./wandb/offline-run-* later to push them.
cb = WandbCallback(mode="offline")
### What gets logged
| When | What |
|---|---|
| Run start | Config: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source |
| Baseline eval | Metric: score_test — held-out test score before any optimization |
| Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge) |
| Final eval | Metric: score_test — held-out test score after optimization |
| Run end | Summary: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations |
| Run end | Artifact: best-prompt (type: prompt) containing best_prompt.txt |
The Charts tab shows score_train, score_val, and score_test (baseline + final) as separate series,
giving you a clear view of train vs. validation vs. held-out test performance across the run.
Summary fields appear in the run Overview table, making it easy to compare runs side by side.
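To compare runs outside the web UI, the public `wandb.Api` exposes the same summary fields. A hedged sketch; `my-team/prompt-experiments` is a placeholder for your own entity and project:

```python
import wandb

api = wandb.Api()
for run in api.runs("my-team/prompt-experiments"):  # placeholder entity/project
    summary = run.summary
    # Summary keys follow the run-end summary fields in the table above.
    print(run.name,
          summary.get("baseline_score"),
          summary.get("final_score_test"),
          summary.get("improvement_pct"))
```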
## Using both together
MLflow and W&B can run simultaneously — pass both in the callbacks list:
```python
result = optimizer.run(prompt, callbacks=[
    MLflowCallback(experiment_name="reflex"),
    WandbCallback(project="reflex"),
])
```