What you’ll do
Run Forge across both tuning layers onQwen/Qwen2.5-3B against a real workload.
By the end you’ll have a quantized deployment recipe that beats the hand-tuned
BF16 baseline by 40–60% in throughput — with a full audit trail of every
experiment.
This tutorial has two paths:
- CLI (this page) —
aevyra-forge tuneruns Layer 1 and Layer 2 automatically. Good for overnight production runs. - Notebook (interactive) — step through calibration, bench, and results cell by cell. Good for understanding what each layer does before committing GPU hours.
Why quantization matters
Layer 1 (config tuning) finds the best vLLM serving args for your BF16 model. It typically yields 20–40% throughput gains at zero accuracy cost. But BF16 is memory-heavy: a 3B model consumes ~6 GB VRAM, leaving less room for the KV cache. That’s the ceiling Layer 1 hits. Layer 2 cuts the model’s VRAM footprint:| Method | Weight precision | Typical VRAM saving | When available |
|---|---|---|---|
| INT4 AWQ | 4-bit (activation-aware) | ~60% vs BF16 | All GPUs |
| INT8 | 8-bit | ~50% vs BF16 | All GPUs |
| FP8 E4M3 | 8-bit float | ~50% vs BF16 | H100 / H200 / MI300X only |
The two-layer loop
Layer 1 runs first and searches vLLM serving args (batching, caching, parallelism). When it converges — no experiment improves score by more than 1% in 5 consecutive tries — Forge automatically escalates to Layer 2. Layer 2 first checks HF Hub for a pre-quantized checkpoint. If one exists (e.g.Qwen/Qwen2.5-3B-Instruct-AWQ), Forge loads it directly at zero
calibration cost. If not, it runs INT4 AWQ calibration using prompts sampled
from your workload JSONL — not a generic corpus.
Setup
Colab terminal note — if you seeIntel MKL FATAL ERROR: Cannot load libtorch_cpu.so, your terminal’s working directory was deleted when the runtime reset. Runcd /tmpfirst.
Full L1 + L2 run (recommended)
Let Forge run both layers automatically. Layer 1 runs until convergence, then escalates:Capping Layer 1 to save time
On a T4 with a small model, Layer 1’s config search space is narrow — a few experiments exhaust the useful combinations. Use--max-config-experiments to
cap it and move to quant faster:
--max-config-experiments 3, Forge runs exactly 3 Layer 1 experiments then
escalates to Layer 2 regardless of convergence.
Skipping Layer 1 entirely
If you already have a tuned config recipe (from a previous run) and only want the quantization gain:Hardware gates
Forge enforces quantization availability at the hardware level. You’ll see this in the search space output at the start of the run:| GPU | INT4 AWQ | INT8 | FP8 E4M3 |
|---|---|---|---|
| T4 (SM 7.5) | ✅ | ✅ | ✗ |
| A100 (SM 8.0) | ✅ | ✅ | ✗ |
| H100 / H200 (SM 9.0) | ✅ | ✅ | ✅ |
| AMD MI300X (CDNA3) | ✅ | ✅ | ✅ |
Reading the results
layer alongside each experiment so you can see exactly where
the score moved. Layer 1 got a 4.7% gain from prefix caching. Layer 2 got a
further 115% from INT4 AWQ — the dominant lever on VRAM-constrained hardware.
The best recipe
model field points to the quantized checkpoint on disk. To use this
recipe in production, copy the checkpoint to a stable path and update model
accordingly.
Resuming an interrupted run
Layer 2 calibration takes 4–6 minutes. If the run is interrupted mid-quant,aevyra-forge tune resume picks up from the last completed experiment — it
does not re-run calibration: