This tutorial diagnoses a real failure in a multi-step coding agent. The agent is asked to implement
min_coins(coins, amount) — a classic dynamic
programming problem where greedy fails. It plans the algorithm, synthesizes
a recipe, writes code, runs tests, sees failures, asks a debugger to diagnose
them, revises, and re-runs. Both drafts fail in the same way.
You’ll follow Origin through the full diagnosis: running the pipeline against
a real LLM (Llama 8B), scoring the run with a deterministic test-driven judge,
and running all three attribution methods including causal ablation. By the end
you’ll see how the planner — not the coder — caused every failure, and how a
correct first draft got thrown away because of it.
The full code is in
examples/coding_agent/.
Setup
Two providers are involved. The pipeline uses Llama 8B through any OpenAI-compatible endpoint; that’s what produces the failure. Attribution runs through a stronger reasoning model because reading a 10-span trace benefits from more headroom.
The pipeline
The agent is a plan-synthesize-code-test-revise loop. Each step is a span; tools nest under the spans that called them. The two write_code spans
share prompt_id="coder" on purpose — that’s how Reflex will optimize the
prompt once for every call site.
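A minimal, framework-free sketch of that flow; the llm and run_tests callables, the prompt wording, and the debug/respond span names are illustrative rather than the example’s exact code:

```python
def coding_agent_sketch(task, llm, run_tests):
    # span: plan -- algorithm plan plus test cases
    plan = llm("planner", f"Plan the algorithm and test cases for: {task}")
    # span: synthesize -- turn the plan into a coding recipe
    recipe = llm("synthesizer", f"Turn this plan into a step-by-step recipe:\n{plan}")
    # span: write_code -- first draft (prompt_id="coder")
    code = llm("coder", f"Implement the function from this recipe:\n{recipe}")
    # span: run_tests -- execute the planner's test cases against the draft
    results = run_tests(code, plan)
    if not results["all_passed"]:
        # span: debug -- diagnose the failures
        diagnosis = llm("debugger", f"These tests failed:\n{results}\nDiagnose the bug.")
        # span: write_code -- revision, same prompt_id="coder" as the first draft
        code = llm("coder", f"Revise this code:\n{code}\nDiagnosis:\n{diagnosis}")
        # span: run_tests -- second run
        results = run_tests(code, plan)
    # span: respond -- final summary
    return llm("responder", f"Summarize the outcome:\n{results}")
```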
Run the pipeline and all three tests fail the same way: got comes back equal to the input. That’s not the function
returning a wrong value — it’s the function never being called at all. The
planner generated test inputs as bare Python expressions like [1, 3, 4]
(just the coins list) instead of full function calls like
min_coins([1, 3, 4], 6). eval("[1, 3, 4]") returns the list itself.
The coder’s first draft was a correct dynamic-programming implementation.
The tests were broken before the code even ran.
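To see the symptom in isolation, here is a minimal reproduction; the min_coins implementation below is illustrative, not the coder’s actual draft:

```python
def min_coins(coins, amount):
    # Standard bottom-up DP; conceptually similar to the coder's correct first draft.
    INF = float("inf")
    dp = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and dp[a - c] + 1 < dp[a]:
                dp[a] = dp[a - c] + 1
    return dp[amount] if dp[amount] != INF else -1

broken = "[1, 3, 4]"               # what the planner generated: bare arguments
fixed = "min_coins([1, 3, 4], 6)"  # what the harness needed: a real call

print(eval(broken))  # [1, 3, 4]  -> "got" is just the input echoed back
print(eval(fixed))   # 2          -> the function actually runs (3 + 3 = 6)
```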
Run the diagnosis
With trace.json saved, pass it to the CLI. Supply the judge score manually
(read the last run_tests span: compiled, all tests failed → 0.4) and point
--runner at the example’s runner file to enable causal ablation:
- --score 0.4 — compiled, at least one test failed
- --rubric rubric.txt — evaluation criteria for a correct min_coins
- --model — the model doing attribution reasoning, in provider/model format
- --runner runner.py — enables causal ablation by re-executing the real pipeline
If a run is interrupted, --resume picks it up where it left off.
The most telling ablation is plan — it skips the first step entirely. When
the plan span is replaced with a null, the synthesizer runs against empty
input, the coder has nothing to go on, and the score drops to 0.0. That’s
the causal signal.
How Origin finds the issue
Analysis 1 — read the whole trace and ask “what went wrong?”
An LLM reads every span’s input and output alongside the rubric and score, and returns a ranked list of suspicious spans. Here it flags the plan span
directly — the test case inputs are visibly wrong in the trace data.
Analysis 2 — break the rubric into criteria and check each step
Origin resolves the rubric to specific pass/fail criteria (returns correct minimum coins, passes all test cases, etc.) and attributes each failure
to the spans that owned the relevant work. The planner gets cited in two of
four failed criteria; the debugger gets cited for misdiagnosing what it saw.
Analysis 3 — break things on purpose and re-score
Ablation in this example is real re-execution, not trace replay. The
runner.py file calls coding_agent(task, overrides=overrides) with the
pipeline model restored from trace metadata — so every span downstream of
the override runs fresh against the new context. When plan is blanked,
the pipeline runs without test cases and scores 0.0. When synthesize is
blanked, the coder falls back to an unguided first draft and the tests still
fail. The score delta of +0.4 on plan is a real measurement, not an
in-place substitution.
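Conceptually, each trial blanks one span, re-runs the pipeline through the runner, re-judges, and takes the difference. A sketch, assuming an overrides mapping of span name to replacement output (the real loop lives inside Origin’s CLI):

```python
def ablation_delta(original_trace, baseline_score, span_name, runner, judge):
    overrides = {span_name: None}                      # assumed format: span name -> replacement output
    ablated_trace = runner(original_trace, overrides)  # real re-execution downstream of the override
    return baseline_score - judge(ablated_trace)       # e.g. 0.4 - 0.0 = +0.4 when "plan" is blanked
```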
What Origin finds
Reading the result
The planner is the root cause, not the coder. The coder’s first draft was a correct DP implementation. It got thrown away because all three tests showed got == input — a symptom the debugger misread as the function
returning coin denominations instead of recognising a missing function call
in the test inputs. Fix the planner prompt and the first draft passes; the
debugger, revision, and responder never get a chance to go wrong.
The debugger contributed. Even without knowing the tests were broken, a careful
debugger should notice that input == got is a test harness problem, not a
code problem. [1, 3, 4] coming back as output is not what a Python integer
looks like. The debugger missed this and prescribed a destructive fix
(remove the -1 guard), turning a correct first draft into a broken
revision. Confidence 0.71 — the second-highest culprit.
write_code is minor. The coder appears because the revision applied
the bad diagnosis. That’s downstream of the real problem. Fixing the planner
and debugger makes this irrelevant.
Why ablation matters here. Both critic and decomposition flagged the
planner. Ablation confirmed it causally: blanking plan collapsed the score
to 0.0 because the synthesizer, coder, and tester all had nothing to work
from. Multi-method agreement on the same span with a real score delta is
what separates a confident diagnosis from a guess.
What the planner prompt needs
Origin identified planner as the prompt to fix. Here’s what that means
concretely.
The current prompt asks for each test case’s “input”, and the model fills that with just the coins argument
([1, 3, 4]) rather than the full function call (min_coins([1, 3, 4], 6)).
The model pattern-matches to “the input to the problem” rather than “a Python
expression that calls the function.”
The fixed prompt makes the requirement unambiguous: each test input must be a complete Python expression that calls the function. With that in place, eval(input)
calls the function, and the test runner gets real pass/fail results. The
coder’s first draft — which was already correct — would pass all three tests
on the first run.
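For illustration only, the difference in test-case shape might look like this; the field names are assumptions, not the example’s actual schema:

```python
broken_case = {"input": "[1, 3, 4]",               "expected": 2}  # bare arguments only
fixed_case  = {"input": "min_coins([1, 3, 4], 6)", "expected": 2}  # full call expression

# The harness does roughly: got = eval(case["input"]); passed = (got == case["expected"]).
# With broken_case, got is the coins list itself, so every case fails with got == input.
```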
Runner: real re-execution vs. trace replay
The ablation runner in runner.py re-executes the real pipeline for each
ablation trial rather than replaying the captured trace with an override
substituted in. The difference matters:
- Trace replay replaces one span’s output in the stored trace and re-judges. Fast and deterministic, but downstream spans see the overridden context only in the final snapshot — they don’t re-run against it.
- Real re-execution calls
coding_agent(task, overrides=overrides) with the span’s output replaced. Every downstream span runs fresh LLM calls against the new context. Slower and noisier, but the score delta is causal: if blanking plan drops the score, it’s because the whole downstream pipeline degraded — not because the judge read a stale run_tests output.
The judge here scores from the last run_tests span, which trace replay leaves unchanged for
most overrides. Real re-execution runs the tests again and gets an honest
score.
Pass any Python file exporting runner(original, overrides) -> AgentTrace
and judge(trace) -> float to the CLI via --runner:
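A minimal sketch of that interface, assuming the original task is recoverable from trace metadata and that test results live on the run_tests span’s output; the import path and field names are assumptions:

```python
from coding_agent import coding_agent, AgentTrace  # hypothetical import path; match the example's layout

def runner(original: AgentTrace, overrides: dict) -> AgentTrace:
    # Re-execute the real pipeline with one span's output overridden.
    task = original.metadata["task"]                 # assumed: the original task is kept in trace metadata
    return coding_agent(task, overrides=overrides)

def judge(trace: AgentTrace) -> float:
    # Deterministic, test-driven score read from the last run_tests span.
    last = [s for s in trace.spans if s.name == "run_tests"][-1]  # assumed trace/span shape
    results = last.output                                         # assumed: dict of test results
    if not results.get("compiled", False):
        return 0.0   # code didn't compile
    if results.get("passed", 0) == results.get("total", -1):
        return 1.0   # every test passed
    return 0.4       # compiled, at least one test failed (the score used above)
```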
Next steps
- Bring your own trace: Convert your existing logs (Langfuse, OTel, JSONL) and diagnose them
- Methods: When ablation beats critic, and when it doesn’t
- API reference: Full Attribution and NodeAttribution reference
- Reflex quickstart: Feed Origin’s output to Reflex to rewrite the planner prompt