
A recipe is the complete deployment specification Forge searches over. Every experiment produces exactly one recipe; the best one is written to best_recipe.yaml at the end of a run.

Structure

```yaml
model: meta-llama/Llama-3.2-1B-Instruct
hardware:
  vendor: nvidia
  gpu_type: Tesla T4
  count: 1
  memory_gb_per_gpu: 15
config:                          # Layer 1 — vLLM serving args
  max_num_seqs: 256
  max_num_batched_tokens: 8192
  block_size: 16
  gpu_memory_utilization: 0.9
  enable_prefix_caching: true
  enable_chunked_prefill: true
  swap_space: 4
  kv_cache_dtype: auto
  tensor_parallel_size: 1
  pipeline_parallel_size: 1
quant:                           # Layer 2 — quantization (v0: defaults only)
  method: bf16
  kv_cache_quant: none
kernels: []                      # Layer 3 — custom kernels (v0: empty)
generation: 3
parent_id: a1b2c3d4
id: e5f6a7b8
```
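A recipe is ordinary YAML, so it can be inspected with any YAML parser. The sketch below uses PyYAML (an assumption — Forge's supported path is its own `Recipe` loader) on a trimmed copy of the recipe above, just to show that the three layers and the lineage fields are plain mappings:

```python
# Sketch: reading a recipe with PyYAML. Forge's own loader
# (aevyra_forge.recipe.Recipe) is the supported path; this only
# demonstrates the document structure.
import yaml

RECIPE = """
model: meta-llama/Llama-3.2-1B-Instruct
config:
  max_num_seqs: 256
  gpu_memory_utilization: 0.9
quant:
  method: bf16
kernels: []
generation: 3
parent_id: a1b2c3d4
id: e5f6a7b8
"""

recipe = yaml.safe_load(RECIPE)
print(recipe["config"]["max_num_seqs"])   # Layer 1 — serving args
print(recipe["quant"]["method"])          # Layer 2 — quantization
print(recipe["kernels"])                  # Layer 3 — custom kernels
```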

Layers

Layer 1 — Config (config:) is the primary search space in v0. Forge tunes vLLM serving arguments that control batching, memory, and caching behaviour. These have the highest leverage per experiment because they require no recompilation.

Layer 2 — Quantization (quant:) is scaffolded but not yet implemented. In v0.2+ Forge will tune INT4/FP8/INT8 methods and KV cache precision jointly with Layer 1.

Layer 3 — Kernel synthesis (kernels:) hooks into AutoKernel for custom op synthesis. Planned for v0.3+.
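Because Layer 1 is a set of independent serving arguments, a search over it can be pictured as a cross product of candidate values per field. The field names below come from the recipe schema; the candidate values (and the exhaustive-grid strategy) are purely illustrative, not Forge's actual search procedure:

```python
# Illustrative only: a cross product over two Layer-1 fields.
# The candidate values are made up; Forge's real search strategy
# and ranges are not documented here.
from itertools import product

search_space = {
    "max_num_seqs": [128, 256, 512],
    "enable_prefix_caching": [False, True],
}

# Each candidate is one Layer-1 config assignment.
candidates = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]
print(len(candidates))  # 3 x 2 = 6 candidate configs
```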

Key VLLMConfig fields

| Field | vLLM default | What it does |
| --- | --- | --- |
| `max_num_seqs` | 256 | Max concurrent sequences in a batch |
| `max_num_batched_tokens` | 8192 | Max tokens processed per forward pass |
| `enable_prefix_caching` | false | Cache KV state for repeated prefixes |
| `enable_chunked_prefill` | true | Break long prefills into chunks |
| `gpu_memory_utilization` | 0.9 | Fraction of GPU VRAM vLLM may use (weights + KV cache) |
| `kv_cache_dtype` | auto | KV cache precision (auto/fp8/fp16/bf16) |
| `tensor_parallel_size` | 1 | Number of GPUs for tensor parallelism |
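These fields mirror vLLM's own argument names, where snake_case keys become kebab-case CLI flags on `vllm serve`. A hypothetical helper sketching that rendering (the helper and its omit-false-booleans policy are assumptions, not part of Forge):

```python
# Hypothetical helper: render a recipe's config block as vLLM-style
# CLI flags. Assumes vLLM's convention of kebab-case flags mirroring
# the Python argument names; boolean False values are simply omitted.
def config_to_flags(config: dict) -> list[str]:
    flags = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                flags.append(flag)          # e.g. --enable-prefix-caching
        else:
            flags.append(f"{flag}={value}")  # e.g. --max-num-seqs=256
    return flags

print(config_to_flags({
    "max_num_seqs": 256,
    "gpu_memory_utilization": 0.9,
    "enable_prefix_caching": True,
}))
# ['--max-num-seqs=256', '--gpu-memory-utilization=0.9', '--enable-prefix-caching']
```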

Lineage

Each recipe records its parent_id and generation. This lets Forge detect convergence, build a diff between any two recipes, and render a clean audit trail.
```python
from aevyra_forge.recipe import Recipe

r1 = Recipe.from_yaml(open("best_recipe.yaml").read())
diff = r1.diff(baseline)  # `baseline` is any earlier Recipe in the lineage
# {"enable_prefix_caching": {"from": False, "to": True},
#  "max_num_seqs": {"from": 256, "to": 128}}
```
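Since the Layer-1 config is a flat mapping, a diff like the one above only needs a key-by-key comparison. A minimal sketch of what such a diff might compute — `Recipe.diff`'s actual semantics (nested layers, added/removed keys) may differ:

```python
# Sketch of a flat key-by-key diff between two config mappings.
# Recipe.diff's real behaviour is not specified here; this is the
# simplest thing that produces the {"from": ..., "to": ...} shape.
def diff_configs(old: dict, new: dict) -> dict:
    changed = {}
    for key in old.keys() | new.keys():
        if old.get(key) != new.get(key):
            changed[key] = {"from": old.get(key), "to": new.get(key)}
    return changed

baseline = {"max_num_seqs": 256, "enable_prefix_caching": False}
best = {"max_num_seqs": 128, "enable_prefix_caching": True}
print(diff_configs(baseline, best))
```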