This tutorial diagnoses a real failure in a customer-support triage agent. The agent is given a duplicate-charge complaint and refuses to issue a refund, despite having the correct policy document and clear evidence in front of it. You'll follow Origin through the full diagnosis step by step: running the pipeline, reading the trace, scoring the output, and interpreting the attribution. By the end you'll see not just which span caused the failure, but what kind of fix it needs. The full code is in `examples/support_triage/`.
Setup
The pipeline
The triage agent is a plan-act-respond loop built on three MCP tool stubs; a minimal sketch of its shape follows below. Run it and inspect the captured trace: `stripe_lookup` returned two identical $29 charges on the same date, and `kb_search` returned the policy that explicitly makes them refund-eligible. The round-2 planner had both in front of it and ignored them, inventing an "upgrade charge" that appears nowhere in the data.
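To make that shape concrete, here is a minimal sketch of such a loop. The stub return values mirror the narrative above; the `llm` helper, the ids, and the third stub (`issue_refund`) are hypothetical stand-ins, not the example's actual code, which lives in `examples/support_triage/`.

```python
# Sketch of a plan-act-respond loop over deterministic MCP-style tool stubs.
# `llm` and `issue_refund` are hypothetical stand-ins for illustration.

def stripe_lookup(customer_id: str) -> list[dict]:
    # Stub: two identical $29 charges on the same date.
    return [
        {"id": "ch_1", "amount": 29.00, "date": "2024-11-02"},
        {"id": "ch_2", "amount": 29.00, "date": "2024-11-02"},
    ]

def kb_search(query: str) -> str:
    # Stub: the policy that makes duplicate charges refund-eligible.
    return "Policy 4.2: duplicate charges on the same date are refund-eligible."

def issue_refund(charge_id: str) -> dict:
    # Stub: pretend the refund succeeded.
    return {"status": "refunded", "charge_id": charge_id}

def llm(prompt_id: str, *context) -> str:
    # Placeholder: the real example calls a model here.
    return f"<{prompt_id} output>"

def run_pipeline(complaint: str) -> str:
    plan_1 = llm("planner", complaint)            # round 1: pick tools to call
    evidence = {
        "charges": stripe_lookup("cus_123"),
        "policy": kb_search("duplicate charge refund"),
    }
    plan_2 = llm("planner", complaint, evidence)  # round 2: decide (fails here)
    return llm("responder", plan_2)               # final customer-facing reply
```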
Run the diagnosis
With `trace.json` saved, diagnose it with a single CLI command:
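A plausible invocation, assuming the CLI entry point is `origin` with a `diagnose` subcommand (both names are assumptions; only the flags come from this page):

```bash
origin diagnose trace.json \
  --score 0.2 \
  --rubric rubric.txt \
  --model openrouter/qwen/qwen3-235b-a22b-thinking-2507 \
  --runner runner.py
```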
- `--score 0.2` — the judge score for this trace (the agent refused a valid refund)
- `--rubric rubric.txt` — the evaluation criteria the judge used
- `--model openrouter/qwen/qwen3-235b-a22b-thinking-2507` — the model doing attribution reasoning, in `provider/model` format
- `--runner runner.py` — enables causal ablation; exports `runner` and `judge` for this pipeline
How Origin finds the issue
When you run the command above, Origin runs three analyses on your trace and combines them, because no single analysis is reliable on its own.

**Analysis 1 — read the whole trace and ask "what went wrong?"** An LLM reads every span's input and output alongside your rubric and score, and returns a ranked list of suspicious spans. This is fast but can be fooled: if the trace contains plausible-sounding reasoning, the LLM might miss the real failure or blame the wrong span.

**Analysis 2 — break the rubric into criteria and check each step.** Your rubric says the agent should acknowledge the duplicate charge, cite the refund policy, and confirm the refund. Origin checks each span against each criterion individually. This is more systematic but can be noisy: a span might fail a criterion as a side-effect of an earlier mistake, not because it is itself the root cause.

**Analysis 3 — break things on purpose and re-score.** Ablation overrides one span's output with a null value and re-judges. For this pipeline the stubs are deterministic, so the runner uses trace replay: blanking `plan` (round 2) collapses the responder's reply and drops the score to 0.0, a causal confirmation that critic and decomposition alone can't provide. A sketch of this replay-and-rescore step follows below.
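The sketch assumes a simple trace layout and `judge` signature; these are illustrative, not Origin's actual internals.

```python
import copy

def ablate_and_rescore(trace: dict, span_id: str, judge) -> float:
    """Blank one span's output in a cloned trace, then re-judge the reply."""
    ablated = copy.deepcopy(trace)
    for span in ablated["spans"]:
        if span["id"] == span_id:
            span["output"] = None  # override the span's output with a null value
    return judge(ablated)

# If judge(trace) == 0.2 and ablate_and_rescore(trace, "plan_r2", judge) == 0.0,
# that 0.2-point drop is causal evidence the round-2 plan span moves the score.
```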
Why all three together
Running multiple methods and combining them is the reliability mechanism. If
only one analysis flags a span, confidence stays low. If critic, decomposition,
and ablation all independently point at the same span, you’d need all three to
be wrong in the same direction for it to be a false positive. In this trace all
three agree on `plan` (round 2), giving `confidence=0.86`.
What Origin finds
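The command prints an attribution result. The sketch below shows the kind of shape to expect; the field names are assumptions loosely based on the `Attribution`/`NodeAttribution` reference linked under Next steps, and only the values quoted in this section (culprit, fix types, confidence) come from the actual run:

```json
{
  "culprit": "plan (round 2)",
  "fix": "prompt",
  "confidence": 0.86,
  "nodes": [
    { "span": "plan (round 2)", "fix": "prompt" },
    { "span": "respond", "fix": "unknown" }
  ]
}
```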
Reading the result
**`plan` is the root cause, not the tools.** Round 1 dispatched the tools correctly and they returned the right data. The failure happened when the round-2 planner read that data and invented a narrative instead of following the evidence.
**`respond` is contributing but downstream.** Ablation found that blanking the responder's output improved the score from 0.2 to 0.6, because the judge reads the final reply directly. The responder's specific wording ("prorated upgrade charge") is hurting. But it has no independent failure: it's faithfully repeating what the planner told it. `fix=unknown` is the right signal: there's no prompt to rewrite on the responder, only a bad upstream decision to fix. Fix the planner and the responder cleans up automatically.
**`fix=prompt` on the planner tells you where to look.** The retrieval worked: `kb_search` returned exactly the right policy. The problem is that the planner prompt doesn't anchor the model to its tool results. One rewrite covers both `plan` spans since they share `prompt_id="planner"`.
**`fix` tells you what not to do.** If Origin had returned `fix=retrieval`, you'd update the index and leave the prompt alone. Here's how the same pipeline maps to different fix types depending on what goes wrong:
| Scenario | Culprit | `fix` |
|---|---|---|
| Planner ignores tool evidence | `plan` | `prompt` |
| `kb_search` returns the wrong doc | `kb_search` | `retrieval` |
| `stripe_lookup` called with wrong args | `stripe_lookup` | `tool_schema` |
| `stripe_lookup` timed out | `stripe_lookup` | `infrastructure` |
| Pipeline calls wrong tool entirely | routing logic | `routing` |
Confirming the culprit causally
The LLM analyses tell you which spans look suspicious. Ablation tells you which spans actually moved the score. Origin blanks out each span's output one at a time, re-runs the pipeline, and measures the score delta.

For this pipeline, because all tool calls and LLM responses are deterministic stubs, the runner uses trace replay: it clones the captured trace and substitutes one span's output with a null value. This is faster than, and equivalent to, real re-execution for a fully stubbed pipeline.

The `runner.py` file in the example exports both functions the CLI needs:
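(Sketch only: the bodies and signatures below are assumptions consistent with the description above, not the repo's exact file.)

```python
# runner.py (sketch) — the two exports the CLI looks for: `runner` replays the
# trace with optional span overrides, `judge` re-scores the final reply.
import copy

def runner(trace: dict, override: dict | None = None) -> dict:
    """Trace replay: clone the captured trace and substitute span outputs.

    `override` maps span ids to replacement outputs (None blanks a span).
    Equivalent to re-execution here because every span is a deterministic stub.
    """
    replayed = copy.deepcopy(trace)
    for span in replayed["spans"]:
        if override and span["id"] in override:
            span["output"] = override[span["id"]]
    return replayed

def judge(trace: dict) -> float:
    """Placeholder judge: score the final reply against the rubric's criteria."""
    reply = (trace["spans"][-1]["output"] or "").lower()
    criteria = ["duplicate", "policy", "refund"]  # acknowledge, cite, confirm
    return sum(word in reply for word in criteria) / len(criteria)
```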
Pass `--runner runner.py` and ablation runs automatically alongside critic and decomposition. When Origin blanks the `plan` span's output, the responder has nothing to work from and the score drops to 0.0, confirming the planner is the root cause, not just a suspect.
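Concretely, the ablation the CLI performs with those exports looks roughly like this; the span id `plan_r2` is an assumed name:

```python
import json
from runner import runner, judge

with open("trace.json") as f:
    trace = json.load(f)

baseline = judge(runner(trace))                             # 0.2 on this trace
ablated = judge(runner(trace, override={"plan_r2": None}))  # drops to 0.0
print(f"score delta from blanking plan (round 2): {baseline - ablated:+.1f}")
```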
For pipelines with real LLM calls or side-effectful tools (like the coding
agent), use a runner that re-executes the full pipeline rather than replaying
the trace — downstream spans need to re-run against the new context to get
an honest score delta. See the coding-agent tutorial for that pattern.
What the planner prompt needs
Origin identified `plan` as the culprit and `fix=prompt`. Here's what that means concretely.
The current prompt (what the planner is working from):
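(A plausible reconstruction for illustration only; the repo's actual prompt will differ.)

```text
You are the planner for a support triage agent. Read the customer's message
and the conversation so far, decide which tools to call, then decide what
the support team should do next. Be concise and decisive.
```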
Nothing in it requires the plan to be grounded in what `stripe_lookup` actually returned.
The fixed prompt adds one grounding rule:
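(Again a reconstruction, not the repo's exact wording; the point is a single rule that forces the plan to quote the tool evidence.)

```text
Grounding rule: before deciding, quote the relevant tool results verbatim.
Base every claim on those results; never assert a charge type or refund
eligibility that does not appear in a tool output.
```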
Re-run the pipeline and the round-2 planner now follows the evidence, citing the policy and returning `eligible_for_refund: true`.
The responder then issues the refund — no other changes needed.
This targeted rewrite is what Reflex would generate and test automatically.
Next steps
- **Methods** — when ablation beats critic, and when it doesn't
- **API reference** — full `Attribution` and `NodeAttribution` reference
- **Reflex quickstart** — feed Origin's output to Reflex to rewrite the planner prompt
- **Witness** — how traces are captured and structured