Attribution methods

Origin ships three attribution methods. You can run any one individually or combine them with method="all" (the default).

LLM-as-critic (`method="critic"`)

One LLM call. Origin sends the full execution trace, the rubric, and the judge score to an LLM and asks: “given that this pipeline scored poorly, which span is most responsible and why?” The LLM reads the trace holistically, the way a senior engineer would scan logs, and returns a ranked list of culprit spans, each with a severity, confidence score, explanation, and fix_type. Think of it as asking a colleague to eyeball the trace and point at what looks wrong.

Best for: fast diagnosis and single-cause failures, where one span clearly dominates the failure.

Limitations: the critic sees the trace only as text and can be misled by a span that looks suspicious but is not the root cause. It offers no causal guarantee.

result = origin.diagnose(trace=trace, score=0.2, rubric=rubric, method="critic")

Score decomposition (`method="decomposition"`)

One LLM call. Instead of reading the trace holistically, decomposition first breaks the rubric into its individual pass/fail criteria, then asks which span is responsible for each criterion that failed. For example, a rubric like “the agent should acknowledge the charge, cite the refund policy, and confirm the refund” is split into three separate criteria. Blame is then aggregated per span across all the failed criteria attributed to it. The result is a more structured view: you can see not just which span failed, but which requirement it failed and why, which is useful when a failure has multiple contributing factors.

Best for: rubrics that bundle multiple requirements, distributed failures where two or three spans each contributed, and cases where you want a richer breakdown by criterion.

Limitations: this is still an LLM judgment, and the decomposition of the rubric into criteria can be imperfect.

result = origin.diagnose(trace=trace, score=0.2, rubric=rubric, method="decomposition")

Ablation (`method="ablation"`)

Causal. For each candidate span, Origin replaces its output with a neutral placeholder ("null" by default, or the ideal output if ablation_placeholder="ideal"), replays the pipeline via your runner, and re-scores via your judge. A large score drop when span X is ablated means span X is genuinely causal: removing its real output materially changed the outcome.

Best for: confirming that a span is the root cause (not just suspicious), ruling out false positives, and pipelines where LLM confabulation is a risk.

Limitations: requires a deterministic runner and a judge callable. Each ablated span costs one runner invocation plus one judge call. Use ablation_budget=N to cap total invocations.

def my_runner(trace: AgentTrace, overrides: dict) -> AgentTrace:
    # Replay with overrides[span_id] forced as that span's output.
    ...

result = origin.diagnose(
    trace=trace, score=0.2, rubric=rubric,
    method="ablation",
    runner=my_runner,
)

How ablation confidence is calculated

When a span is ablated, Origin runs the pipeline and asks the judge to score the result. Confidence is the normalized score drop:

ablation_confidence = (score_original - score_ablated) / score_range

score_range is score_max - score_min (typically 1.0 - 0.0 = 1.0). A span that drops the score from 1.0 to 0.0 gets confidence 1.0; a span that changes nothing gets 0.0.

Neutral placeholders by output type. The ablated output must be structurally valid so the rest of the pipeline doesn’t crash:

Span output type	Placeholder
`dict`	`{}`
`list`	`[]`
`str`	`""`
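
The mapping in the table can be sketched as a small helper. This is an illustration only: the name `neutral_placeholder` and the `None` fallback for types outside the table are assumptions, not part of Origin's API.

```python
from typing import Any


def neutral_placeholder(original_output: Any) -> Any:
    """Return an empty value with the same basic type as the span's real output."""
    if isinstance(original_output, dict):
        return {}
    if isinstance(original_output, list):
        return []
    if isinstance(original_output, str):
        return ""
    # Assumption: the table does not cover other types, so fall back to None.
    return None
```

Keeping the placeholder the same type as the real output means a downstream span that iterates over a list or reads keys from a dict still runs, so the judge scores a degraded pipeline rather than a crashed one.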

Judge scoring tiers. Deterministic judges often return one of a small set of scores based on observable facts rather than LLM opinion. In the coding agent example the judge uses three tiers:

Result	Score	Meaning
All tests pass	`1.0`	Fully correct
Compiles but tests fail	`0.4`	Partial credit
Compile error	`0.0`	Broken output

With score_original = 0.4 and score_ablated = 0.0, ablation confidence is (0.4 - 0.0) / 1.0 = 0.40.
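
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated on this page; the function name and the `score_min`/`score_max` defaults are assumptions for illustration.

```python
def ablation_confidence(
    score_original: float,
    score_ablated: float,
    score_min: float = 0.0,
    score_max: float = 1.0,
) -> float:
    """Normalized score drop: (score_original - score_ablated) / score_range."""
    score_range = score_max - score_min
    if score_range <= 0:
        raise ValueError("score_max must be greater than score_min")
    # No clamping is applied here; the page does not say how a score that
    # improves under ablation (a negative drop) is handled.
    return (score_original - score_ablated) / score_range


# Coding agent example: compiles but tests fail (0.4) -> compile error (0.0).
confidence = ablation_confidence(0.4, 0.0)
```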

Ablation cost control

result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm, runner=my_runner,
    ablation_budget=5,          # cap at 5 runner+judge invocations
)

The raw on-ramp (Origin.diagnose) also exposes candidates=["span_a", "span_b"] to restrict the ablation sweep to specific span ids.
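
To make the cost model concrete, here is a toy sweep that ablates one span at a time and stops at a budget. Everything in it is hypothetical: `ablation_sweep`, `run_and_score`, and the choice to take the first N candidates in order are assumptions for illustration, since this page does not describe how Origin picks spans when the budget is smaller than the candidate list.

```python
from typing import Callable, Dict, List, Optional


def ablation_sweep(
    span_ids: List[str],
    run_and_score: Callable[[Dict[str, object]], float],
    score_original: float,
    ablation_budget: Optional[int] = None,
    candidates: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Ablate spans one at a time and return the score drop for each."""
    selected = candidates if candidates is not None else span_ids
    if ablation_budget is not None:
        # Each ablated span costs one replay plus one judge call.
        selected = selected[:ablation_budget]
    drops = {}
    for span_id in selected:
        overrides = {span_id: None}  # None stands in for the neutral placeholder
        drops[span_id] = score_original - run_and_score(overrides)
    return drops


# Toy pipeline: the score stays at 0.4 unless the planner's output is ablated.
def toy_run_and_score(overrides: Dict[str, object]) -> float:
    return 0.0 if "planner" in overrides else 0.4


drops = ablation_sweep(
    ["planner", "coder", "tester"],
    toy_run_and_score,
    score_original=0.4,
    ablation_budget=2,
)
```

With a budget of 2, only `planner` and `coder` are replayed; `tester` is never ablated and so never costs a runner or judge call.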

Combined (`method="all"`)

Always runs critic and decomposition (two LLM calls total). Ablation participates when a runner is supplied; it is silently skipped otherwise. Results are merged per span:

Confidence: spans named by multiple methods receive a corroboration bonus. Merged confidence lies between the arithmetic mean and the max of the individual confidences, weighted toward the max by the number of methods that agreed. A span that all three methods agree on gets the largest weight toward the max (0.67).
Severity: the max severity across methods wins.
fix_type: resolved to the most specific type across methods using a priority ordering: prompt > tool_schema > retrieval > routing > infrastructure > unknown. If critic says retrieval and decomposition says unknown, the merged fix_type is retrieval.
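
The fix_type rule amounts to picking the entry that appears earliest in the priority ordering. A minimal sketch, assuming the fix types arrive as plain strings; the names `FIX_TYPE_PRIORITY` and `merge_fix_type` are hypothetical and not part of Origin's API.

```python
from typing import List

# Priority ordering as listed above, most specific first.
FIX_TYPE_PRIORITY = [
    "prompt",
    "tool_schema",
    "retrieval",
    "routing",
    "infrastructure",
    "unknown",
]


def merge_fix_type(fix_types: List[str]) -> str:
    """Return the highest-priority fix_type among those the methods reported."""
    # min() raises ValueError on an empty list; list.index raises ValueError
    # for a fix_type that is not in the ordering.
    return min(fix_types, key=FIX_TYPE_PRIORITY.index)


merged = merge_fix_type(["retrieval", "unknown"])
```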

Corroboration formula

Let confidences be the list of per-method confidence scores for a span. The merged confidence is:

weight  = 1.0 - 1.0 / len(confidences)
avg     = mean(confidences)
peak    = max(confidences)
merged  = avg + (peak - avg) * weight

weight scales with the number of agreeing methods:

Methods that named the span	Weight	Effect
1	0.00	merged = avg (no bonus — only one signal)
2	0.50	merged halfway between avg and peak
3	0.67	merged pulled strongly toward the peak

The more methods agree, the closer the merged score is to the highest individual confidence. A span that only one method flags gets no bonus at all.

Worked example

In the coding agent tutorial, the planner span receives these per-method scores:

Method	Confidence	How it was derived
Critic	0.95	LLM judged the planner as the most suspicious span
Decomposition	0.28	Partial blame — planner contributed to one failed criterion
Ablation	0.40	Score dropped from 0.4 → 0.0 when planner output was ablated

Applying the formula with N=3 methods:

weight  = 1.0 - 1.0 / 3  = 0.667
avg     = (0.95 + 0.28 + 0.40) / 3  = 0.543
peak    = 0.95
merged  = 0.543 + (0.95 - 0.543) × 0.667  = 0.543 + 0.271  = 0.814

The planner ends up with confidence=0.81. Three independent methods all flagged it, so the corroboration bonus pulled the merged score well above the average of the individual confidences.
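
The same arithmetic can be checked in Python. This sketch implements the corroboration formula exactly as written on this page; the function name `merged_confidence` is an assumption for illustration.

```python
from statistics import mean
from typing import List


def merged_confidence(confidences: List[float]) -> float:
    """Blend the mean and the max, weighted toward the max as more methods agree."""
    if not confidences:
        raise ValueError("at least one per-method confidence is required")
    weight = 1.0 - 1.0 / len(confidences)  # 0.0, 0.5, 0.667 for 1, 2, 3 methods
    avg = mean(confidences)
    peak = max(confidences)
    return avg + (peak - avg) * weight


# Planner span: critic, decomposition, ablation.
planner = merged_confidence([0.95, 0.28, 0.40])
print(round(planner, 3))  # 0.814
```

With a single confidence the weight is 0, so the function returns that confidence unchanged, matching the "no bonus" row of the weight table.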

result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm,
    runner=my_runner,   # enables ablation
    method="all",
)

Choosing a method

	Critic	Decomposition	Ablation
LLM calls	1	1	0 (+ runner×N)
Runner required	No	No	Yes
Causal guarantee	No	No	Yes
Multi-criterion rubrics	Partial	Yes	Partial
Cost	Low	Low	Medium–High

Start with method="all" (without a runner) for most use cases — two LLM calls, no runner needed, corroboration bonus when both methods agree. Add a runner when you want ablation’s causal confirmation.
