This tutorial diagnoses an existing trace from a real observability stack — no
@span decorators, no live agent run. The trace is a Langfuse export of a RAG
query that returned the wrong document and gave a confidently wrong answer.
You’ll write a 30-line adapter, hand the converted trace to Origin, and read
the attribution.
By the end you’ll have a working pattern for any observability source —
Langfuse, OpenTelemetry, LangSmith, Phoenix, Braintrust, or a home-grown JSONL
trace store. The Witness AgentTrace schema is the contract; everything else
is translation.
The full code is in
examples/byo_trace/.
Setup
Grab the exported trace, sample_langfuse.json, from examples/byo_trace/.
The trace
The captured run is a 4-span RAG query:

- User question: “What’s our refund policy for digital subscriptions?”
- Final answer: “You can return any product within 30 days of purchase for a full refund, as long as it is unopened and in original packaging.”

That’s wrong. The user asked about digital subscriptions; the answer cites the
physical-products return policy. The retrieval index didn’t have the
digital-subscription doc — the top-ranked result is kb_returns_v3 (the
physical returns policy), and the synthesizer wrote a confident answer from
the wrong source.
Origin’s job is to tell us which span owns the failure and what kind of
fix it needs: updating the retrieval index, rewriting a prompt, or something
else entirely.
The AgentTrace contract
Origin’s diagnose takes an AgentTrace — a flat list of TraceNodes with
parent_id links, an optional ideal, and a metadata dict. To diagnose
your own logs you write a function that produces one of these from your source
format.
The minimum you need per span:
| Field | Notes |
|---|---|
| name | Human-readable label ("plan", "vector_search", "synthesize"). |
| id | Unique within the trace. Required when names repeat in the DAG. |
| parent_id | The id of the parent span, or None for a root. |
| kind | One of "reason", "tool", "retrieve", "agent", "other". |
| input / output | Any JSON-serializable value. |
| prompt_id | Identity of the prompt behind this span. Multiple spans sharing one prompt_id are rolled up together by Attribution.by_prompt(). |
| optimize | True when this span’s prompt is an optimization target. Reflex reads this. |
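To make the contract concrete, here is a minimal hand-built version of this tutorial's four-span trace. The import path, the constructor signatures, the span ids, and the prompt ids are assumptions for illustration; only the field names come from the table above.

```python
# Hand-built AgentTrace for the 4-span RAG run described above.
# ASSUMPTIONS: the witness import path and constructor signatures;
# check your installed package for the real ones.
from witness import AgentTrace, TraceNode

nodes = [
    TraceNode(id="s1", name="query_rewrite", parent_id=None, kind="reason",
              input="What's our refund policy for digital subscriptions?",
              output="refund policy digital subscription",
              prompt_id="query_rewriter"),
    TraceNode(id="s2", name="vector_search", parent_id="s1", kind="retrieve",
              input="refund policy digital subscription",
              output=["kb_returns_v3"]),   # the wrong, physical-returns doc
    TraceNode(id="s3", name="rerank", parent_id="s2", kind="tool",
              input=["kb_returns_v3"],
              output=["kb_returns_v3"]),
    TraceNode(id="s4", name="synthesize", parent_id="s3", kind="reason",
              input=["kb_returns_v3"],
              output="You can return any product within 30 days of purchase...",
              prompt_id="answer_writer",
              optimize=True),              # mark the prompt as a Reflex target
]

trace = AgentTrace(nodes=nodes, metadata={"source": "hand-built example"})
```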
Adapter: Langfuse → AgentTrace
Langfuse stores a trace as a top-level record with nested observations of types GENERATION, SPAN, and EVENT. The mapping to AgentTrace is almost
1:1:
from_langfuse.py
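A sketch of what from_langfuse.py can look like, assuming the export shape described above (a trace record with a flat observations list) and Langfuse's usual field names, which may differ across export versions. The kind mapping is a design choice, not a fixed rule.

```python
# from_langfuse.py -- sketch of a Langfuse export -> AgentTrace adapter.
# ASSUMPTIONS: witness import path; Langfuse observation field names
# (id, type, name, parentObservationId, input, output, promptName).
from witness import AgentTrace, TraceNode

def _kind(obs: dict) -> str:
    """Map Langfuse observation types onto AgentTrace kinds.

    GENERATION/SPAN/EVENT alone can't tell a retriever from any other
    tool, so refine by name where it matters (a design choice).
    """
    if "search" in (obs.get("name") or "").lower():
        return "retrieve"
    return {"GENERATION": "reason", "SPAN": "tool", "EVENT": "other"}.get(
        obs["type"], "other")

def langfuse_to_agent_trace(export: dict) -> AgentTrace:
    nodes = [
        TraceNode(
            id=obs["id"],
            name=obs.get("name") or obs["type"].lower(),
            parent_id=obs.get("parentObservationId"),  # None for roots
            kind=_kind(obs),
            input=obs.get("input"),
            output=obs.get("output"),
            prompt_id=obs.get("promptName"),  # carried through if recorded
        )
        for obs in export["observations"]
    ]
    return AgentTrace(
        nodes=nodes,
        metadata={"source": "langfuse", "trace_id": export.get("id")},
    )
```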
Run the diagnosis
First convert the Langfuse export to an AgentTrace and save it as trace.json:
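A minimal conversion script, reusing the langfuse_to_agent_trace sketch above; the to_dict() serialization call is an assumption about the AgentTrace API.

```python
import json

from from_langfuse import langfuse_to_agent_trace  # the sketch above

with open("sample_langfuse.json") as f:
    export = json.load(f)

trace = langfuse_to_agent_trace(export)

# ASSUMPTION: AgentTrace exposes a to_dict()-style serializer.
with open("trace.json", "w") as f:
    json.dump(trace.to_dict(), f, indent=2)
```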
Then run the diagnosis CLI with:

- --score 0.1 — the judge score for this trace (cited wrong policy)
- --rubric rubric.txt — the evaluation criteria the judge used
- --model — the model doing attribution reasoning, in provider/model format
- --runner runner.py — enables causal ablation; exports runner and judge for this pipeline
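Putting those flags together, a full invocation might look like the following; the entrypoint name, the positional trace argument, and the model value are placeholders, while the four flags are the ones documented above.

```bash
# NOTE: "origin diagnose" and the model value are placeholder assumptions.
origin diagnose trace.json \
  --score 0.1 \
  --rubric rubric.txt \
  --model openai/gpt-4o \
  --runner runner.py
```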
If your logs live in another format, swap in a different adapter (from_jsonl.py, covered below), then run the same CLI command.
How Origin finds the issue
Analysis 1 — read the whole trace and ask “what went wrong?” An LLM reads every span’s input and output alongside the rubric and score, and returns a ranked list of suspicious spans. Fast, but it can be fooled when the trace contains plausible-sounding reasoning over defective inputs — exactly the case here, where the synthesizer wrote a confident answer from a confidently wrong document.

Analysis 2 — break the rubric into criteria and check each step. The rubric resolves to two criteria — cite the digital-subscription policy and do not cite the physical-returns policy — and Origin checks each span against each criterion individually. The synthesizer fails both. The retrieval span fails the first (it didn’t return the right doc). The reranker partially fails the first (it ranked an off-topic doc first without flagging the mismatch).

Analysis 3 — break things on purpose and re-score. Ablation overrides one span’s output at a time and re-judges. For this trace the most informative ablation is on vector_search — replacing its output
with a null and re-scoring shows whether the downstream pipeline was actually
depending on it. Because all spans are stored values (no live LLM calls), the
runner uses trace replay: it clones the captured trace and substitutes one
span’s output in-place. Fast and deterministic for a fully static trace.
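A sketch of what runner.py can export for this static pipeline. The runner/judge signatures are assumptions about the contract, and the keyword judge is a toy stand-in for the real rubric judge (presumably an LLM scoring against rubric.txt); it is calibrated only to reproduce the two scores quoted in this tutorial.

```python
# runner.py -- sketch of trace-replay ablation for a fully static trace.
# ASSUMPTIONS: Origin calls runner(trace, overrides) and judge(trace);
# check the real contract. The judge is a toy keyword check standing in
# for the actual rubric judge.
import copy

def runner(trace, overrides=None):
    """Trace replay: clone the captured trace and substitute the ablated
    span outputs in-place. No live LLM calls -- every value is stored."""
    replayed = copy.deepcopy(trace)
    for node in replayed.nodes:
        if overrides and node.id in overrides:
            node.output = overrides[node.id]  # e.g. None to blank a span
    return replayed

def judge(trace):
    """Toy rubric check, tuned to match the scores in this tutorial:
    the captured answer scores 0.1; blanking the synthesizer gives 0.5.
    A fuller judge would also check that the answer is supported by the
    retrieved docs, which is what makes the vector_search ablation
    informative."""
    answer = next(n for n in trace.nodes if n.name == "synthesize").output
    if answer is None:
        return 0.5  # a blank answer beats a confidently wrong one
    score = 0.1
    if "digital subscription" in str(answer):
        score += 0.5  # criterion 1: cites the digital-subscription policy
    if "30 days" not in str(answer):
        score += 0.4  # criterion 2: avoids the physical-returns policy
    return min(score, 1.0)
```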
Why all three together
For RAG failures specifically, this combination is load-bearing. Critic alone
can be misled by the synthesizer’s confident prose. Decomposition catches the
topic mismatch but doesn’t tell you which fix has the most leverage. Ablation
puts a number on the leverage — which is the question that actually matters
when deciding between “fix the index” and “rewrite the prompt.”
What Origin finds
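An illustrative reconstruction of the attribution result, consistent with the reading below; the field names and shape are assumptions.

```python
# Illustrative shape only; real field names may differ.
result = {
    "culprits": [
        {"span": "vector_search", "role": "primary", "fix": "retrieval",
         "note": "index lacks the digital-subscription doc; top hit is kb_returns_v3"},
        {"span": "synthesize", "role": "primary (downstream)", "fix": "unknown",
         "note": "ablating its output moved the judge score 0.1 -> 0.5"},
    ],
    "clean": ["query_rewrite", "rerank"],
}
```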
Reading the result
vector_search is primary with fix=retrieval. That’s the critical
signal — prompt rewriting will not fix this. If you fed Origin’s
prompt-level rollup directly to Reflex, Reflex would rewrite answer_writer,
the answer would still cite the wrong policy (because retrieval still returns
the wrong doc), and you’d lose a debugging cycle. The fix_type is what
routes the work to the right team.
synthesize is also primary but downstream. Ablation confirms it:
blanking the synthesizer’s output improved the score from 0.1 to 0.5 — its
phrasing is actively making things worse. But its fix_type=unknown is the
right signal: there’s no independent prompt fix here. The synthesizer cited
what retrieval gave it. Fix the index and this span cleans up automatically.
query_rewrite and rerank are clean. Neither appears in the culprit
list. The rewrite “refund policy digital subscription” was exactly the right
query; Origin examined it and found nothing wrong. Confirming the
non-failures is as useful as finding the failures — it tells you where
not to spend engineering time.
fix tells you what not to do. Here’s how the same pipeline maps to
different fix types depending on what goes wrong:
| Scenario | Culprit | fix |
|---|---|---|
| Doc not in the index | vector_search | retrieval |
| Synthesizer cites without flagging gaps | synthesize | prompt |
| Reranker ranks off-topic docs first | rerank | prompt |
| Retriever called with malformed args | vector_search | tool_schema |
| Embedding model can’t distinguish topics | vector_search | retrieval |
Bringing your own format
Langfuse is one source; the same pattern works for any structured log. OpenTelemetry is the de facto standard for distributed tracing, and Witness's OTel adapter reads spans emitted by LangGraph, CrewAI, AutoGen, the Vercel AI SDK, and any framework using the GenAI semantic conventions: each OTel span maps straight onto a TraceNode. For a home-grown JSONL trace store, the adapter is ~30 lines:
from_jsonl.py
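A sketch, assuming each JSONL line is one span whose keys already roughly match the AgentTrace contract; rename the keys to fit your store.

```python
# from_jsonl.py -- sketch of a home-grown JSONL store -> AgentTrace adapter.
# ASSUMPTION: one span per line, with keys close to the AgentTrace contract.
import json

from witness import AgentTrace, TraceNode

def jsonl_to_agent_trace(path: str) -> AgentTrace:
    nodes = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            nodes.append(TraceNode(
                id=rec["id"],
                name=rec["name"],
                parent_id=rec.get("parent_id"),
                kind=rec.get("kind", "other"),
                input=rec.get("input"),
                output=rec.get("output"),
                prompt_id=rec.get("prompt_id"),
            ))
    return AgentTrace(nodes=nodes, metadata={"source": "jsonl", "path": path})
```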
Pass --source jsonl to use it.
Whatever the source, AgentTrace is the contract.
Next steps
Support triage tutorial
A live pipeline failure diagnosed end-to-end with ablation
Coding agent tutorial
A real LLM, a real failure, three attribution methods agreeing
Methods
When ablation beats critic, and when it doesn’t
Reflex quickstart
Feed Origin’s prompt-level rollup to Reflex and ship a fix