What you’ll do
Run Forge’s autonomous config optimizer on Qwen/Qwen3-8B-AWQ against a
shared-prefix workload. By the end you’ll have a working recipe that Forge
found by measuring real throughput, not by guessing.
The tutorial has two paths:
- Dry-run (this page) — no GPU or API key needed; synthetic bench results show the full keep/revert loop in minutes. Good for understanding the system before spending GPU time.
- Real run — live vLLM on a free Colab T4; needs an OpenRouter key. See the notebook for section 2.
Why vLLM config matters
vLLM exposes roughly 40 serving args. The defaults are conservative: they are designed to work safely on any GPU, not to max out any specific one. On a T4 with a chat workload, the defaults leave significant throughput on the table:

| Setting | Default | Effect of raising or enabling |
|---|---|---|
| max_num_seqs | 32 | More requests batched together → higher throughput |
| enable_prefix_caching | False | Shared system prompts computed once → lower TTFT |
| gpu_memory_utilization | 0.90 | More VRAM for KV cache → more concurrent sequences |
| enable_chunked_prefill | False | Hides prefill spikes → lower P99 on mixed batches |

These settings interact: raising max_num_seqs helps throughput until it
stresses VRAM, at which point latency spikes. Forge searches the joint space
automatically and keeps only the changes that improve the score on your actual workload.
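A back-of-envelope KV-cache budget shows why gpu_memory_utilization and max_num_seqs pull against each other. The sketch below is illustrative only. The model shape (36 layers, 8 KV heads, head dimension 128, fp16 cache), the 16 GiB card, the 6 GiB weight footprint, and the 2,048 tokens per sequence are all assumptions, not values measured by Forge or reported by vLLM. It also ignores activation memory and other runtime overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_full_context_seqs(total_gib: float, utilization: float,
                          weights_gib: float, tokens_per_seq: int,
                          per_token_bytes: int) -> int:
    """How many full-length sequences fit in the VRAM left after weights."""
    budget_gib = total_gib * utilization - weights_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // (tokens_per_seq * per_token_bytes))


# Assumed shape and sizes (see the caveats above).
PER_TOKEN = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)

for util in (0.90, 0.95):
    seqs = max_full_context_seqs(16, util, 6, 2048, PER_TOKEN)
    print(f"gpu_memory_utilization={util}: about {seqs} full-length sequences")
```

Under these assumptions, the number of sequences that fit is in the same range as the max_num_seqs values in play, so a small change to either knob can move the other one's ceiling.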
The loop
Each iteration: the agent reads the playbook, the full experiment history, and the current recipe, then proposes one targeted change. Forge boots vLLM with the new config, replays the workload, scores the result, and either keeps the change (new best) or reverts to the prior best. The audit trail captures every decision.
Setup
Dry-run demo
The dry-run uses a synthetic bench that returns plausible throughput numbers without starting vLLM. It’s identical to the real loop in every other way: the agent makes real LLM calls, the playbook is consulted, and the keep/revert logic is the same.
max_num_seqs: 64:
max_num_seqs at default).
The loop continues through max_num_batched_tokens, gpu_memory_utilization,
and enable_chunked_prefill. After 8 experiments:
Reading the results
The best recipe
The four kept changes, applied jointly:
The best recipe is written to .forge/runs/001_.../best_recipe.yaml and updated
after every kept experiment. If the run is interrupted, aevyra-forge tune resume picks up from the last checkpoint, with no args needed.
Running for real
Switch from --dry-run --device cpu to a GPU host:
Forge detects the GPU with nvidia-smi, boots vLLM with the baseline
config, and runs the same loop. Each experiment takes 5–15 minutes (first
start includes weight download). Results land in .forge/.
For an overnight run against your real production traffic:
Key takeaways
One change per experiment keeps the audit trail readable. Every row in experiments.tsv has a single rationale and a single config delta, so you can
see exactly what moved the score.
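The "single config delta" idea can be made concrete with a small diff helper. This is a minimal sketch, not Forge's code: config_delta and is_single_change are hypothetical names, and the column layout of experiments.tsv is not assumed here.

```python
def config_delta(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key whose value differs."""
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(keys)
        if before.get(key) != after.get(key)
    }


def is_single_change(before: dict, after: dict) -> bool:
    """True when exactly one setting differs between the two recipes."""
    return len(config_delta(before, after)) == 1


baseline = {"max_num_seqs": 32, "enable_prefix_caching": False}
candidate = {**baseline, "max_num_seqs": 64}
print(config_delta(baseline, candidate))
```

With a one-key delta per row, any score movement can be attributed to that key alone, which is what makes the history usable as evidence for the next proposal.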
Revert is cheap because Forge re-uses the running vLLM process when args
are unchanged. Only experiments that change serving args (like max_num_seqs)
restart vLLM and incur the full boot cost.
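The reuse rule amounts to comparing the requested args with those of the process that is already running. The sketch below simulates it with a boot counter. ServerHandle is a hypothetical stand-in, not a Forge or vLLM class, and no process is actually started.

```python
class ServerHandle:
    """Hypothetical stand-in for a running vLLM process; boots are simulated."""

    def __init__(self) -> None:
        self.args: dict | None = None
        self.boots = 0

    def ensure(self, args: dict) -> bool:
        """Make the server match `args`; return True if a boot was needed."""
        if self.args == args:
            return False  # same args: reuse the running process
        self.args = dict(args)  # changed args: pay the full boot cost
        self.boots += 1
        return True


server = ServerHandle()
baseline = {"max_num_seqs": 32}
booted_first = server.ensure(baseline)                # first start
booted_again = server.ensure(baseline)                # unchanged args
booted_changed = server.ensure({"max_num_seqs": 64})  # changed serving arg
print(server.boots)
```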
The playbook encodes expertise — ranges that are safe on a T4 vs an A100,
combinations that help vs hurt, escalation rules. Pass --playbook with a
custom playbook to encode your own hardware-specific knowledge.
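One way to picture hardware-specific ranges is as a per-device table that proposals are checked against. The structure and every number below are placeholders invented for illustration. They are not Forge's playbook format and not tuning recommendations.

```python
# Hypothetical per-device ranges (placeholder values, not recommendations).
SAFE_RANGES = {
    "T4": {
        "max_num_seqs": (8, 128),
        "gpu_memory_utilization": (0.80, 0.92),
    },
    "A100": {
        "max_num_seqs": (32, 512),
        "gpu_memory_utilization": (0.85, 0.95),
    },
}


def out_of_range(device: str, recipe: dict) -> list[str]:
    """List the recipe keys whose values fall outside the device's range."""
    ranges = SAFE_RANGES[device]
    return [
        key
        for key, (low, high) in ranges.items()
        if key in recipe and not (low <= recipe[key] <= high)
    ]


recipe = {"max_num_seqs": 256, "gpu_memory_utilization": 0.90}
print(out_of_range("T4", recipe), out_of_range("A100", recipe))
```

The same recipe can be acceptable on one device and rejected on another, which is the kind of knowledge a custom playbook is meant to carry.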
Dry-run first whenever you’re testing a new workload or playbook. The
synthetic bench won’t give you real numbers, but it will show you whether the
agent is proposing sensible mutations before you spend GPU time.
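The keep/revert logic that a dry run exercises can be sketched in a few lines. Everything here is a toy: synthetic_score is a made-up scoring model, the fixed proposal list stands in for the agent, and none of the names come from Forge.

```python
def synthetic_score(recipe: dict) -> float:
    """Made-up throughput model: batching helps until a VRAM ceiling is hit."""
    seqs = recipe["max_num_seqs"]
    capacity = round(recipe["gpu_memory_utilization"] * 100)  # pretend ceiling
    score = float(min(seqs, capacity))
    if seqs > capacity:
        score -= 0.5 * (seqs - capacity)  # pretend latency-spike penalty
    if recipe["enable_prefix_caching"]:
        score *= 1.2
    return score


def run_loop(baseline: dict, proposals: list) -> tuple:
    """Apply one change at a time; keep it only if the score improves."""
    best = dict(baseline)
    best_score = synthetic_score(best)
    history = []
    for key, value in proposals:
        candidate = {**best, key: value}
        score = synthetic_score(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        history.append((key, value, score, kept))  # revert is implicit
    return best, best_score, history


baseline = {
    "max_num_seqs": 32,
    "gpu_memory_utilization": 0.90,
    "enable_prefix_caching": False,
}
proposals = [
    ("max_num_seqs", 64),
    ("max_num_seqs", 256),
    ("enable_prefix_caching", True),
    ("gpu_memory_utilization", 0.95),
]
best, best_score, history = run_loop(baseline, proposals)
for key, value, score, kept in history:
    print(f"{key}={value}: score={score:.1f} {'keep' if kept else 'revert'}")
```

Because each candidate is built from the current best rather than from the previous candidate, a rejected change never leaks into later experiments.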