Section 32.9: Evaluating Distributed Agentic Systems

"They asked me to grade the swarm. I asked which run. They said the run. I said there is no such thing as the run, only a distribution, and we have been arguing ever since."
An LLM-as-Judge Awaiting Calibration

Big Picture

A multi-agent system is the hardest kind of distributed system to evaluate, because its task has no single right answer, its success spans many steps where any one failure sinks the whole trajectory, and the same input produces a different run every time. The previous sections built orchestration: agents that plan, call tools, debate, share memory, and run on an orchestration engine. This section asks the question that decides whether any of that machinery earned its keep: does it actually work, and does it work better than the obvious cheaper alternative? A rigorous answer forces three measurements together (end-to-end success, trajectory quality, and cost) and one comparison that multi-agent research keeps avoiding: many agents against one strong agent at the same compute budget. The distributed-systems evaluation discipline of Chapter 5 applies in full, now with stochastic outputs that demand multiple runs and a judge that has biases of its own.

Every preceding section in this chapter added capability. A planner decomposes a goal, role-specialized agents divide the labor, tool calls reach outside the model, debate and reflection catch errors, shared memory keeps the agents consistent, and an orchestration engine routes the messages. None of that tells you whether the assembled system is any good. Evaluating a single classifier is a solved ritual: hold out a labeled test set, compute accuracy, report a confidence interval. Evaluating a multi-agent system breaks every assumption that ritual rests on. The tasks are open-ended, so there is rarely one correct output to compare against. The work is a long trajectory, so the final answer can be wrong for reasons that have nothing to do with the last step. And the system is non-deterministic, so a single run tells you almost nothing. This section is about doing the measurement anyway, and doing it without fooling yourself.

Figure 32.9.1: Evaluating an agent system. Left: a single traced run is a multi-step trajectory; here the executor step fails, sinking the whole task, and step-level credit assignment asks which step was at fault. Because each run is stochastic, one trajectory is never a verdict; you repeat across many runs to get a distribution. Right: the three numbers an honest evaluation reports together, end-to-end success rate (with variance), trajectory step quality, and cost or latency per task. The demo in this section computes all three.

1. Why Agent Evaluation Is Unusually Hard (Beginner)

Three properties of agentic tasks each break a different assumption that ordinary machine-learning evaluation depends on. The first is open-endedness. "Book me a flight under \$400 that connects through somewhere I can get lunch" has no canonical answer to compare against; many outputs satisfy it and many subtly fail it, so there is no held-out label to match. The second is the multi-step structure. An agent trajectory is a chain of decisions, tool calls, and intermediate results, and a failure anywhere can sink everything downstream. A correct plan with one bad retrieval produces a confident wrong answer, and the final output alone cannot tell you the plan was fine. The third is non-determinism. The same prompt, sampled twice, gives two different trajectories; temperature, tool latency, and model updates all inject randomness, so a single run measures luck as much as quality.

These three properties compound. Open-endedness means you cannot use exact-match scoring, so you reach for a model to judge the output, which introduces the judge's own biases. The multi-step structure means a single end-to-end number hides where the system actually broke, so you need per-step measurement, which requires tracing every agent's action. And non-determinism means every number you report is a sample from a distribution, so a single run is not evidence; you need many runs and an honest variance. The rest of this section takes these one at a time: what to measure, how to measure it, and how to compare without lying to yourself.

Key Insight: A Single Run Is Not a Result, It Is One Sample

Because an agentic system is stochastic, evaluating it on one run is like estimating a coin's bias from one flip. The unit of evaluation is not a trajectory but a distribution over trajectories, summarized by a mean success rate and its run-to-run variance. Any agent benchmark number reported without a variance, or from a single seed, is closer to an anecdote than a measurement. Treat "we ran it and it solved the task" the way Chapter 5 taught you to treat a single latency sample: interesting, not conclusive.

2. What to Measure: Success, Trajectory, Cost (Beginner)

Three families of metric matter, and reporting any one alone is a way to mislead. The first is end-to-end task success rate: across a benchmark of tasks, what fraction did the system actually complete. The benchmark has to be the right one. Agentic and tool-use benchmarks (function-calling suites, customer-service simulators), web and computer-use benchmarks (navigating real sites, driving a desktop), and software-engineering benchmarks (resolving real repository issues) each stress a different competence, and a system that excels at one can be helpless at another. The second family is trajectory or process quality: did each step make sense, not merely whether the final answer was right. A system that reaches the correct answer through three lucky guesses and a contradiction is more fragile than its success rate suggests, and only step-level inspection reveals it. The third family is cost and latency per task, the subject of Section 32.10, because a multi-agent system that doubles success while spending ten times the tokens has not obviously improved anything.

Let $T$ be the number of benchmark tasks and let $s_i \in \{0,1\}$ indicate whether task $i$ succeeded on a given run. The end-to-end success rate on that run is $\hat{p} = \frac{1}{T}\sum_{i=1}^{T} s_i$. Repeat the whole benchmark over $R$ independent runs to get rates $\hat{p}^{(1)}, \dots, \hat{p}^{(R)}$, and report the mean with its standard deviation,

$$\bar{p} = \frac{1}{R}\sum_{r=1}^{R}\hat{p}^{(r)}, \qquad \sigma_p = \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\big(\hat{p}^{(r)} - \bar{p}\big)^2}.$$

The trajectory view drills into the steps. If a pipeline has stages $1, \dots, M$ and step $m$ is attempted $a_m$ times across all tasks and runs, succeeding on $o_m$ of those attempts, then the step-level success $\hat{q}_m = o_m / a_m$ measures that stage in isolation, conditioned on the trajectory having reached it. This is the credit-assignment problem of Section 30.8 wearing an evaluation hat: when the task fails, which agent or step deserves the blame? The demo below computes $\bar p$, $\sigma_p$, the per-step $\hat q_m$, and the cost per task in one pass.
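As a quick check on these definitions, the sketch below computes $\bar p$ and $\sigma_p$ from a matrix of per-task outcomes and compares $\sigma_p$ with the binomial sampling noise $\sqrt{\bar p(1-\bar p)/T}$ you would expect if tasks succeeded independently. The synthetic data, the true rate of 0.6, and the helper name `summarize_runs` are assumptions made for illustration only.

```python
import math
import random
import statistics

def summarize_runs(outcomes):
    """outcomes[r][i] is 1 if task i succeeded on run r, else 0."""
    rates = [sum(run) / len(run) for run in outcomes]   # one p_hat per run
    p_bar = statistics.mean(rates)
    sigma_p = statistics.stdev(rates)                   # sample stdev, R-1 denominator
    return p_bar, sigma_p

# Synthetic data (assumption): every task succeeds independently with prob 0.6.
rng = random.Random(0)
T, R, TRUE_P = 500, 30, 0.6
outcomes = [[int(rng.random() < TRUE_P) for _ in range(T)] for _ in range(R)]

p_bar, sigma_p = summarize_runs(outcomes)
floor = math.sqrt(p_bar * (1 - p_bar) / T)              # binomial noise for T tasks
print(f"p_bar={p_bar:.3f}  sigma_p={sigma_p:.3f}  binomial floor={floor:.3f}")
```

If the observed $\sigma_p$ on a real system sits far above this floor, the runs differ by more than task-level sampling noise, which is a hint to look for correlated failures or drifting components before trusting the mean.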

Thesis Thread: The Chapter 5 Discipline, Now With a Stochastic System Under Test

The core thesis that scale-out must be justified, not assumed, returns here in its sharpest form. Chapter 5 taught the distributed-systems evaluation discipline: report quality and cost and latency together, control for randomness with multiple runs, and never compare numbers computed under different configurations. A multi-agent system is exactly the distributed system that rule was written for, except that now the system itself is non-deterministic, so the "multiple runs" clause is no longer optional hygiene but the only thing standing between you and a fictional result. Distributing intelligence across many agents is a scale-out move, and like every scale-out move in this book it has to beat the single-machine baseline on a fair, cost-matched comparison before it counts as progress.

3. Methods: Judges, Trajectories, and Traces (Intermediate)

Open-ended outputs have no exact-match scorer, so the dominant method is LLM-as-judge: a strong model reads the task and the agent's output and scores it, optionally against a rubric or a reference answer. It scales to open-ended grading where human review cannot keep up, but it has well-documented biases. It favors longer and more confident answers, prefers outputs in its own style, and exhibits position bias when comparing two candidates side by side, so an answer can win because of where it was shown (often first) rather than what it says. Mitigations are now standard practice: fix the rubric in advance, run each paired comparison in both orders and average, calibrate the judge against a small human-labeled set, and avoid using the same model family to both generate and grade whenever you can. Treat the judge as an instrument with a known error profile, not an oracle.
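The order-swap mitigation can be sketched in a few lines. The judge here is a stub with a hard-coded additive bonus for the first-shown answer, and the names `stub_judge`, `debiased_preference`, `QUALITY`, and `POSITION_BONUS` are hypothetical; a real judge would be a model call, and its position bias would cancel only approximately.

```python
# Hypothetical ground-truth qualities, known only because this is a stub.
QUALITY = {"answer A": 0.6, "answer B": 0.7}
POSITION_BONUS = 0.15            # assumed additive bias toward the first-shown answer

def stub_judge(task, first, second):
    """Return a preference in [0, 1] for the FIRST-shown answer (stand-in for an LLM call)."""
    raw = 0.5 + 0.5 * (QUALITY[first] - QUALITY[second]) + POSITION_BONUS
    return min(1.0, max(0.0, raw))

def debiased_preference(judge, task, a, b):
    """Judge the pair in both orders and average, expressed as a preference for `a`."""
    pref_a_shown_first = judge(task, a, b)      # a gets the position bonus
    pref_b_shown_first = judge(task, b, a)      # b gets the position bonus
    return 0.5 * (pref_a_shown_first + (1.0 - pref_b_shown_first))

naive = stub_judge("demo task", "answer A", "answer B")
fair = debiased_preference(stub_judge, "demo task", "answer A", "answer B")
print(f"single-order preference for A: {naive:.2f}")   # above 0.5: A wins on position alone
print(f"order-swapped preference for A: {fair:.2f}")   # below 0.5: B is the better answer
```

With a purely additive bias the bonus cancels exactly, which is why the stub recovers the true ordering. Real judges are messier, so the swap reduces position bias without guaranteeing its removal, and calibration against human labels is still needed.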

Trajectory evaluation goes beyond the final score to ask whether each step was justified, and step-level credit assignment localizes failure to a specific agent or action. Both require that every agent step be recorded: the prompt, the tool call, the result, the latency, the token count. That recording is distributed tracing, the same machinery Section 26.6 built for serving fleets, now applied to a chain of agent calls instead of a chain of microservices. A trace gives each step a span, links spans into the trajectory tree, and lets you replay exactly what the executor saw when it failed. Without tracing, a multi-agent failure is a black box; with it, you can point at the step that broke and the context it was handed.
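To make the span vocabulary concrete, here is a minimal single-threaded sketch of a trace recorder. The `Span` and `Tracer` classes and the example task are hypothetical and exist only to show the parent-child linkage; a production system would more likely rely on an existing tracing library and would need to handle concurrency, which this sketch ignores.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]          # None marks the root of the trajectory tree
    inputs: dict
    outputs: dict = field(default_factory=dict)
    tokens: int = 0
    latency_s: float = 0.0
    ok: bool = True

class Tracer:
    def __init__(self):
        self.spans = []               # every span, in start order
        self._stack = []              # currently open spans (single-threaded assumption)

    @contextmanager
    def span(self, name, **inputs):
        parent = self._stack[-1].span_id if self._stack else None
        s = Span(name, uuid.uuid4().hex[:8], parent, dict(inputs))
        self.spans.append(s)
        self._stack.append(s)
        start = time.perf_counter()
        try:
            yield s
        except Exception:
            s.ok = False              # record the failure, then let it propagate
            raise
        finally:
            s.latency_s = time.perf_counter() - start
            self._stack.pop()

tracer = Tracer()
with tracer.span("task", goal="refund order 17"):
    with tracer.span("planner", goal="refund order 17") as p:
        p.outputs["plan"] = ["look up order", "issue refund"]
        p.tokens = 120
    with tracer.span("executor", plan=p.outputs["plan"]) as e:
        e.outputs["result"] = "refund issued"
        e.tokens = 200

for s in tracer.spans:
    indent = "" if s.parent_id is None else "  "
    print(f"{indent}{s.name}: ok={s.ok} tokens={s.tokens} inputs={s.inputs}")
```

Because each span stores the inputs it was handed, replaying a failed executor step amounts to reading `inputs` off its span rather than reconstructing context from logs.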

Library Shortcut: LangSmith and Agent-Eval Tooling Do the Tracing and Judging

The demo in this section logs steps and aggregates them by hand, in roughly sixty lines. In production you do not build the trace store, the judge harness, or the run aggregation yourself. Tools such as LangSmith, the OpenAI Evals harness, Braintrust, and Ragas cover much of this ground: depending on the tool, they capture agent spans, attach an LLM-judge scorer with a versioned rubric, and aggregate success and cost across repeated runs. With LangSmith, for example, it comes down to a decorator and a dataset:

# pip install langsmith
from langsmith import traceable
from langsmith.evaluation import evaluate

# planner, retriever, executor, and llm_judge are placeholders for your own code;
# the "task" and "answer" keys assume that is how the dataset examples are laid out.

@traceable                      # every call is captured as a trace span
def agent_pipeline(inputs: dict) -> dict:   # evaluate() passes each example's inputs dict
    plan = planner(inputs["task"])          # each nested @traceable call nests as a child span
    docs = retriever(plan)
    return {"output": executor(plan, docs)}

def correctness(run, example):  # an LLM-judge scorer with a fixed rubric
    score = llm_judge(run.outputs["output"], example.outputs["answer"])
    return {"key": "correctness", "score": score}

evaluate(agent_pipeline,                 # runs the whole benchmark dataset,
         data="agent-benchmark-v1",      # logs cost/latency,
         evaluators=[correctness],       # and renders per-step traces in the UI
         num_repetitions=5)              # repeats each example so variance can be reported

Code 32.9.1: The manual trace-and-aggregate loop of this section collapses to a @traceable decorator and one evaluate call. The library captures the trajectory spans, runs the judge, and reports success with variance and cost, the same three numbers Figure 32.9.1 demands, handling the tracing transport that Section 26.6 unpacks.

4. The Honest Comparison: Multi-Agent Versus One Strong Agent (Intermediate)

Here is the comparison multi-agent research most often gets wrong. The interesting claim is not "the multi-agent system solves the task"; it is "the multi-agent system solves the task better than a single strong agent would." Those are different claims, and only the second justifies the coordination machinery of the previous eight sections. The trap is comparing a three-agent pipeline that spends three model calls against a single agent that spends one, then crediting the pipeline's higher success to its architecture when it really bought that success with three times the compute. The fair comparison is cost-matched: give the single agent the same total budget the pipeline consumes, and ask whether the pipeline still wins.

Formally, if the multi-agent system attains success rate $\bar{p}_{\text{multi}}$ at cost $c_{\text{multi}}$ per task and the single agent attains $\bar{p}_{\text{single}}$ at cost $c_{\text{single}}$, the architecture is justified only when $\bar{p}_{\text{multi}} > \bar{p}_{\text{single}}$ at $c_{\text{multi}} \approx c_{\text{single}}$. A higher success rate at higher cost is not evidence for the architecture; it is evidence that more compute helps, which we already knew. This is the construct-mismatch error of Chapter 5 in agent clothing: comparing two numbers produced under different budgets and attributing the difference to the wrong cause. The demo runs exactly this cost-matched experiment, and its result is the uncomfortable one that recent work keeps rediscovering.
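The decision rule above can be written as a small guard that refuses to compare success rates until the budgets match. This is a sketch under stated assumptions: the function name `matched_comparison`, the 10 percent cost tolerance, the normal-approximation threshold of 1.96, and the example numbers are all illustrative choices, not values from this section.

```python
import math
import statistics

def matched_comparison(multi_rates, multi_costs, single_rates, single_costs,
                       cost_tol=0.10, z_crit=1.96):
    """Compare run-level success rates only if mean costs agree within cost_tol."""
    c_multi = statistics.mean(multi_costs)
    c_single = statistics.mean(single_costs)
    cost_matched = abs(c_multi - c_single) <= cost_tol * max(c_multi, c_single)

    gap = statistics.mean(multi_rates) - statistics.mean(single_rates)
    # Standard error of the difference of two means (needs at least 2 runs each).
    se = math.sqrt(statistics.variance(multi_rates) / len(multi_rates)
                   + statistics.variance(single_rates) / len(single_rates))
    significant = abs(gap) > z_crit * se

    if not cost_matched:
        verdict = "not comparable: budgets differ"
    elif not significant:
        verdict = "no detectable difference"
    elif gap > 0:
        verdict = "multi-agent justified"
    else:
        verdict = "single agent wins"
    return {"gap": gap, "se": se, "cost_matched": cost_matched, "verdict": verdict}

# Illustrative run-level numbers (assumptions, not measurements).
multi = [0.62, 0.64, 0.61, 0.63]
single = [0.74, 0.75, 0.73, 0.74]
fair = matched_comparison(multi, [12.0] * 4, single, [12.0] * 4)
unfair = matched_comparison(multi, [12.0] * 4, single, [4.0] * 4)
print(fair["verdict"])      # budgets match, gap is negative and well outside the noise
print(unfair["verdict"])    # the guard refuses to rank systems on unequal budgets
```

The point of returning "not comparable" rather than a number is to make the flattering comparison impossible to produce by accident. With only a handful of runs the normal approximation is crude, so treat the significance flag as a screening step.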

The code defines stub agents with fixed per-step success probabilities so the trajectory math is fully reproducible: a planner that succeeds 92 percent of the time, a retriever at 85 percent, an executor at 80 percent, chained so a failure at any stage sinks the task. It evaluates this pipeline across 30 runs of 500 tasks each, reports the mean success rate with its run-to-run standard deviation, the per-step credit, and the cost per task, then pits it against a single agent given the pipeline's entire budget. The single agent's 74 percent success rate is likewise a stipulated stub value, not a measurement, so the demo illustrates the protocol rather than settling the question for any real system.

import random
import statistics

# Each stub agent is (name, per-step success prob, cost-per-call in tok/1k).
PIPELINE = [("planner", 0.92, 3.0), ("retriever", 0.85, 4.0), ("executor", 0.80, 5.0)]
# A single strong agent given the WHOLE pipeline budget (3.0 + 4.0 + 5.0 = 12.0),
# so cost-matched against the pipeline's worst case; 0.74 is a stipulated stub value.
SINGLE = ("solo", 0.74, 12.0)

def run_pipeline(rng):                       # one stochastic trajectory
    per_step_ok, cost, task_ok = [], 0.0, True
    for _name, p, c in PIPELINE:
        cost += c
        ok = rng.random() < p
        per_step_ok.append(ok)
        if not ok:                           # a failure anywhere sinks the task
            task_ok = False
            break                            # downstream agents never run
    while len(per_step_ok) < len(PIPELINE):  # pad un-run steps as not-attempted
        per_step_ok.append(None)
    return task_ok, per_step_ok, cost

def evaluate(run_fn, n_tasks, n_runs, seed0, credit=False):   # repeat for variance
    rates, costs = [], []
    attempted = [0] * len(PIPELINE); passed = [0] * len(PIPELINE)
    for r in range(n_runs):
        rng = random.Random(seed0 + r)
        succ, total = 0, 0.0
        for _ in range(n_tasks):
            out = run_fn(rng)
            ok, cost = (out[0], out[2]) if len(out) == 3 else out
            if credit:                                    # tally per-step credit
                for i, st in enumerate(out[1]):
                    if st is not None:
                        attempted[i] += 1; passed[i] += int(st)
            succ += int(ok); total += cost
        rates.append(succ / n_tasks); costs.append(total / n_tasks)
    step_credit = [(passed[i] / attempted[i] if attempted[i] else 0.0)
                   for i in range(len(PIPELINE))]          # conditioned on reaching it
    return rates, costs, step_credit

def run_single(rng):
    _name, p, c = SINGLE
    return rng.random() < p, c

N_TASKS, N_RUNS, SEED = 500, 30, 12345
p_rates, p_costs, credit = evaluate(run_pipeline, N_TASKS, N_RUNS, SEED, credit=True)
s_rates, s_costs, _ = evaluate(run_single, N_TASKS, N_RUNS, SEED)
delta = statistics.mean(p_rates) - statistics.mean(s_rates)

print("=== multi-agent pipeline (planner -> retriever -> executor) ===")
print(f"task success rate : {statistics.mean(p_rates):.3f} +/- "
      f"{statistics.stdev(p_rates):.3f}  (over {N_RUNS} runs)")
print(f"cost per task     : {statistics.mean(p_costs):.2f} tok/1k")
print("per-step credit (attempted-only):")
for (name, _p, _c), q in zip(PIPELINE, credit):
    print(f"    {name:<9}: {q:.3f}")
print("\n=== single strong agent, cost-matched (budget 12.0) ===")
print(f"task success rate : {statistics.mean(s_rates):.3f} +/- "
      f"{statistics.stdev(s_rates):.3f}")
print(f"cost per task     : {statistics.mean(s_costs):.2f} tok/1k")
print(f"\nquality gap (multi - single) : {delta:+.3f} at equal cost")
print("verdict: single agent wins (multi did NOT justify its coordination)"
      if delta < 0 else "verdict: multi-agent justified its coordination")

Code 32.9.2: A pure-Python evaluation of a stub multi-agent pipeline against a cost-matched single agent. Stub agents with fixed success probabilities make the trajectory arithmetic reproducible; the real machinery being demonstrated is the evaluation protocol, multiple runs for variance and a budget-matched baseline, not the agents themselves.

=== multi-agent pipeline (planner -> retriever -> executor) ===
task success rate : 0.623 +/- 0.023  (over 30 runs)
cost per task     : 10.60 tok/1k
per-step credit (attempted-only):
    planner  : 0.921
    retriever: 0.850
    executor : 0.796

=== single strong agent, cost-matched (budget 12.0) ===
task success rate : 0.743 +/- 0.020
cost per task     : 12.00 tok/1k

quality gap (multi - single) : -0.120 at equal cost
verdict: single agent wins (multi did NOT justify its coordination)

Output 32.9.2: The cost-matched verdict. The three-agent pipeline reaches only $0.623 \pm 0.023$ success because its chained-failure structure multiplies the per-step success probabilities ($0.92 \times 0.85 \times 0.80 \approx 0.626$), while a single agent given the pipeline's full budget reaches $0.743 \pm 0.020$. The per-step credit cleanly localizes the weakest link (the executor at $0.796$), and the negative gap is the result honest agent research must report rather than hide.

The numbers tell the story the section is built around. The pipeline's end-to-end rate of $0.623$ matches, to within sampling noise, the product of its per-step probabilities ($0.92 \times 0.85 \times 0.80 = 0.6256$), the mathematical signature of a chained trajectory where every step must hold. The per-step credit isolates the executor as the weakest stage, the place to invest if you wanted to improve the pipeline. And the cost-matched comparison delivers the uncomfortable verdict: the single agent, given the same budget, wins by twelve points. The match is at the level of the budget, not the realized spend: the pipeline's mean cost of $10.60$ sits below its $12.0$ budget because failed trajectories stop early, so the single agent is granted the pipeline's worst-case cost. This does not prove multi-agent systems are useless; it proves that decomposition into a fragile chain can cost more reliability than it buys, and that you only discover this if you run the matched comparison instead of the flattering one.
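The simulated figures in Output 32.9.2 can be cross-checked against closed-form expectations. The sketch below reuses the stub parameters from Code 32.9.2; the helper name `analytic` is an assumption introduced here. A stage contributes cost only when the trajectory reaches it, so the expected cost is a sum of per-call costs weighted by the probability of getting that far.

```python
import math

# Same stub parameters as Code 32.9.2: (name, success probability, cost per call).
PIPELINE = [("planner", 0.92, 3.0), ("retriever", 0.85, 4.0), ("executor", 0.80, 5.0)]

def analytic(pipeline):
    reach = 1.0                     # probability the trajectory reaches the current stage
    expected_cost = 0.0
    for _name, p, c in pipeline:
        expected_cost += reach * c  # a stage is paid for only if it is reached
        reach *= p                  # it must also succeed for the next stage to run
    return reach, expected_cost     # reach is now P(every stage succeeded)

p_success, cost = analytic(PIPELINE)
sigma = math.sqrt(p_success * (1 - p_success) / 500)   # binomial noise for 500 tasks
print(f"expected success {p_success:.4f}, expected cost {cost:.2f}, "
      f"expected run-to-run stdev {sigma:.3f}")
# prints: expected success 0.6256, expected cost 10.59, expected run-to-run stdev 0.022
```

The simulated $0.623 \pm 0.023$ and $10.60$ agree with these expectations to within sampling noise. The expected cost of $10.59$ also makes explicit that the single agent's $12.0$ is matched to the pipeline's full budget, which is slightly more than the pipeline spends on average.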

Research Frontier: Agent Benchmarks and Cost-Matched Multi-Agent Evaluation (2024 to 2026)

The benchmarks that anchor agent evaluation have matured fast. SWE-bench (Jimenez et al., 2024) scores agents on resolving real GitHub issues with the project's own test suite as ground truth, and SWE-bench Verified filters it to human-validated tasks. GAIA (Mialon et al., 2023) poses general-assistant questions that are easy for humans and hard for agents, requiring tool use and multi-hop reasoning. WebArena and VisualWebArena evaluate agents on realistic self-hosted websites, and $\tau$-bench (Yao et al., 2024) measures tool-agent-user interaction in customer-service settings with explicit policy compliance, reporting a pass$^k$ metric that exposes how reliability collapses when the same task is attempted several times. The sobering thread running through 2024 to 2026 evaluations is the cost-matched finding: several studies report that elaborate multi-agent pipelines fail to beat a single strong model given the same token budget, and that naive "more agents" scaling can degrade reliability through error propagation. The methodological response, reporting success with variance and against a budget-matched single-agent baseline, is exactly the protocol Code 32.9.2 demonstrates.
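The pass$^k$ idea mentioned above can be sketched with the combinatorial estimator commonly used for it: for a task with $c$ successes in $n$ trials, $\binom{c}{k}/\binom{n}{k}$ is the fraction of size-$k$ subsets of trials that succeeded every time, averaged over tasks. The function name `pass_hat_k` and the success counts below are hypothetical, and the estimator is given as commonly described rather than as any benchmark's exact implementation.

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate the chance that k independent attempts at a task ALL succeed."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical counts: successes out of 8 trials on each of four tasks.
counts = [8, 8, 4, 0]
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(counts, 8, k):.3f}")
# prints 0.625, 0.554, 0.504: only the always-solved tasks survive as k grows
```

At $k = 1$ the estimator reduces to the ordinary success rate. The task solved half the time contributes almost nothing by $k = 4$, which is the reliability collapse the metric is designed to expose.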

5. Debugging Multi-Agent Failures (Advanced)

When the matched comparison goes against the pipeline, tracing tells you why, and three failure modes recur. Error propagation is the one Output 32.9.2 quantifies: an early mistake flows downstream because later agents trust the output of earlier ones, so a single bad retrieval poisons every step after it. Miscoordination is two agents working at cross purposes, duplicating effort or contradicting each other because the orchestration handed them overlapping or conflicting instructions; debate and reflection (Section 32.5) are partly attempts to catch this, but they add their own cost. Context inconsistency is agents operating on stale or divergent views of the shared state, the distributed-memory hazard of the shared-state section, where two agents read different versions of the same fact and reason confidently to incompatible conclusions.

Each failure mode has a tracing signature. Error propagation shows as a step that succeeds locally but hands down a subtly wrong artifact, visible only when you inspect what the next step received. Miscoordination shows as redundant or conflicting tool calls across spans. Context inconsistency shows as two agents reading different values for the same memory key within one trajectory. The per-step credit in the demo is the coarse version of this analysis: it told us the executor was the weakest link. A full trace would let you replay the executor's exact input and see whether it failed on its own or because the retriever handed it the wrong context. This is why tracing every agent step, not just logging the final answer, is the price of admission for debugging anything beyond a toy.
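Two of these signatures are mechanical enough to scan for. The sketch below assumes a trace flattened into a list of event dictionaries with `agent`, `kind`, and either `key`/`value` or `tool`/`args` fields; that schema, the agent names, and both function names are hypothetical and would need adapting to whatever your tracing tool actually emits.

```python
from collections import defaultdict

# Hypothetical flattened trace of one trajectory.
trace = [
    {"agent": "retriever", "kind": "memory_read", "key": "refund_policy", "value": "v2"},
    {"agent": "policy_checker", "kind": "memory_read", "key": "refund_policy", "value": "v1"},
    {"agent": "retriever", "kind": "tool_call", "tool": "search", "args": "order 17"},
    {"agent": "responder", "kind": "tool_call", "tool": "search", "args": "order 17"},
    {"agent": "responder", "kind": "memory_read", "key": "customer_tier", "value": "gold"},
]

def inconsistent_reads(events):
    """Context inconsistency: one memory key read with more than one value."""
    seen = defaultdict(set)
    for ev in events:
        if ev["kind"] == "memory_read":
            seen[ev["key"]].add(ev["value"])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}

def redundant_tool_calls(events):
    """Miscoordination: the same tool called with the same arguments more than once."""
    callers = defaultdict(list)
    for ev in events:
        if ev["kind"] == "tool_call":
            callers[(ev["tool"], ev["args"])].append(ev["agent"])
    return {call: agents for call, agents in callers.items() if len(agents) > 1}

print("divergent reads :", inconsistent_reads(trace))
print("duplicate calls :", redundant_tool_calls(trace))
```

These checks only raise candidates. A key that was legitimately updated mid-trajectory would also show two values, so a fuller version would track write events or version stamps before calling a divergent read a bug. Error propagation has no comparable shortcut, since spotting a subtly wrong artifact requires inspecting what the next step received.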

Practical Example: The Multi-Agent Pipeline That Was Quietly Worse Than One Call

Who: An applied-AI team at a fintech company building a customer-support agent.

Situation: They had shipped a four-agent pipeline (intent classifier, retriever, policy checker, responder) and measured 71 percent task success on their internal benchmark, up from an earlier prototype.

Problem: Token cost per resolved ticket had quadrupled, and an executive asked whether the extra agents were actually responsible for the accuracy, or just the extra spend.

Dilemma: Keep the impressive-looking 71 percent and the architecture the team had invested in, or run the comparison nobody wanted, the single strong model given the same total token budget.

Decision: They ran the cost-matched baseline and repeated each configuration over 25 seeds, following the protocol of Code 32.9.2 rather than trusting their single best run.

How: They wired every agent call through a tracing harness, computed per-step credit, and graded open-ended responses with an order-swapped LLM judge calibrated against 200 human labels.

Result: The single agent on the matched budget hit 73 percent, two points above the pipeline, and the traces showed the policy-checker agent was discarding correct answers a third of the time. They collapsed the pipeline to a single call with the policy rules in the prompt, cutting cost by 60 percent and raising success.

Lesson: Without the cost-matched comparison and per-step traces, a more expensive architecture can masquerade as a better one indefinitely. The honest baseline is the cheapest evaluation you can run and the most likely to change your mind.
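The judge-calibration step in the example above can be sketched as an agreement check between judge verdicts and human labels on the same items. Cohen's kappa is one reasonable statistic for this because it discounts agreement expected by chance; the example does not say which statistic the team used, and the ten labels below are made up for illustration.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_counts[lab] * human_counts[lab] for lab in labels) / (n * n)
    if expected == 1.0:              # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Made-up labels: 1 = response judged correct, 0 = judged incorrect.
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
raw_agreement = sum(j == h for j, h in zip(judge, human)) / len(human)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {cohens_kappa(judge, human):.2f}")
```

Raw agreement of 0.80 shrinks to a kappa of 0.60 once chance agreement is removed, which is why a judge should be reported with a chance-corrected figure rather than a bare match rate.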

6. Putting the Evaluation Together (Intermediate)

A defensible evaluation of a distributed agentic system has a fixed shape. Choose the benchmark that matches the deployment (tool-use, web, or software-engineering tasks, not a benchmark that flatters the system), run the whole benchmark many times to get a success rate with variance, trace every step so you can assign step-level credit and debug failures, report cost and latency alongside quality, and compare against a single strong agent at matched cost before claiming the architecture helped. Each of these guards against a specific way of fooling yourself, and skipping any one reintroduces exactly the error the others were meant to catch. The discipline is not new; it is the Chapter 5 evaluation contract applied to a system that happens to be stochastic, open-ended, and multi-step all at once.

Fun Note

The fastest way to make any multi-agent system look brilliant is to evaluate it once, on a task you already watched it solve, against a baseline you crippled by giving it a third of the budget. The fastest way to make it look honest is to run it thirty times and put the variance in the table. The gap between those two numbers is roughly the gap between a demo and a result.

Quality is only one of the three numbers, and the next section makes the other two first-class. Section 32.10 turns to cost, latency, and reliability at scale: how the token bill and the tail latency of a multi-agent system grow as the agent count and the request volume rise, and how to keep an orchestration of stochastic components reliable when any single agent can stall, hallucinate, or time out. Evaluation told you whether the system works; the next section tells you what it costs to keep it working under load.

Exercise 32.9.1: Why the Chain Multiplies (Conceptual)

The pipeline in Output 32.9.2 reached a success rate of $0.623$, almost exactly $0.92 \times 0.85 \times 0.80$. Explain why a sequential trajectory where every step must succeed produces a success rate equal to the product of the per-step probabilities, and what this implies as the number of steps grows. If a fourth agent at $0.90$ success were appended, predict the new end-to-end rate. Then argue, using only this multiplicative structure, why adding agents to a pipeline can lower reliability even when each new agent is individually competent.

Exercise 32.9.2: Make the Comparison Fair, or Unfair, on Purpose (Coding)

Starting from Code 32.9.2, first reproduce the cost-matched verdict, then deliberately make the comparison unfair by giving the single agent only a third of the pipeline's budget (set its cost to $4.0$ and lower its success probability to reflect the smaller budget, say $0.55$) and re-run. Show that the pipeline now appears to win, and explain in one paragraph exactly which construct-mismatch error this introduces. Finally, sweep the executor's per-step probability from $0.70$ to $0.99$ and find the value at which the pipeline first beats the cost-matched single agent, reporting it with the run-to-run variance.

Exercise 32.9.3: How Many Runs Do You Need? (Analysis)

The demo used $R = 30$ runs of $T = 500$ tasks. Treating each run's success rate as a sample, derive the standard error of the mean success rate as a function of $R$ and the per-run standard deviation, and use the reported $\sigma_p \approx 0.023$ to estimate how many runs you would need for the $95\%$ confidence interval on $\bar{p}_{\text{multi}}$ to exclude the single-agent rate of $0.743$. Then discuss the trade-off: more runs tighten the interval but multiply the token bill of Section 32.10, so state how you would choose $R$ when each run costs real money.