Section 40.9: Evaluation | Building Scalable AI

"They asked me to grade an agent that took a hundred steps to answer one question. I read the whole trajectory, declared it brilliant, and only later noticed it had answered the wrong question with great confidence."
A Judge Model, Grading an Agent That Took a Hundred Steps to Answer

Big Picture

Evaluating a distributed agentic system is not one measurement but a stack of them, and the whole stack must be computed at scale, over many trajectories, with both task quality and system efficiency reported side by side. A single accuracy number hides everything that matters: whether the retriever found the right context, whether each tool call was correct, whether the agent actually accomplished the goal or merely produced plausible text, whether the judge that scored it agrees with a human, and what the answer cost in latency and dollars. This section assembles the four evaluation layers introduced piecemeal across the book into one scorecard for the agentic application built in this chapter, then shows how to run that scorecard as a distributed batch job offline and as continuous monitoring online. The methodology is the methodology of Chapter 5; the agentic layer extends it; the parallel execution is the batch-processing pattern of Chapter 6.

The agent assembled in the previous sections of this chapter retrieves over a shared vector store, plans a sequence of steps, calls tools, and synthesizes an answer, with the orchestration spread across services as described in Chapter 32. Now we have to answer a deceptively simple question: is it any good, and is it good enough to ship? A naive practitioner reports one number, the fraction of test queries the agent got right, and stops. That number is necessary but profoundly incomplete. It cannot tell you why the agent failed, it conflates a retrieval miss with a reasoning error, it says nothing about the cost of each answer, and if it was produced by a language-model judge it may not even agree with what a human would call correct. A distributed agentic system has many moving parts, and a serious evaluation measures each part, measures the whole, validates the measurement instrument itself, and measures the systems cost of all of it. We treat these as four layers and build them in order.

Figure 40.9.1: The evaluation stack for the chapter's agentic system. Four layers (component metrics, end-to-end task success, the calibrated LLM judge, and systems metrics) roll up into a single scorecard computed in one pass on one configuration, so that quality and efficiency numbers are directly comparable and every entry is backed by the same run.

1. Layer One: Component Evaluation Beginner

The agent is a pipeline, and a pipeline is only as correct as its stages. Component evaluation isolates each stage and measures it against a metric that the literature for that stage has already settled. The retrieval stage is graded with the information-retrieval metrics developed for distributed retrieval in Chapter 25 and applied to RAG in Section 36.8: recall at $k$ and normalized discounted cumulative gain. For a query with a set $R$ of gold-relevant documents and a retrieved top-$k$ list, recall counts how much of the relevant set was found,

$$\text{recall@}k = \frac{|\,\text{top-}k \cap R\,|}{|R|}, \qquad \text{nDCG@}k = \frac{1}{\text{IDCG}_k}\sum_{i=1}^{k} \frac{2^{\,\text{rel}_i} - 1}{\log_2(i+1)},$$

where $\text{rel}_i$ is the graded relevance of the document at rank $i$ and $\text{IDCG}_k$ is the DCG of the ideal ranking, so that nDCG lands in $[0,1]$ and rewards putting relevant documents near the top. The generation stage is graded for faithfulness, the fraction of claims in the answer that are entailed by the retrieved context rather than invented, the central RAG-specific metric of Section 36.8. The tool-calling stage, which is new to the agentic setting, is graded for tool-call accuracy: of the calls the agent issued, how many selected the right tool with well-formed, correct arguments. These three component scores localize a failure. An answer that is wrong because the right document was never retrieved is a retrieval problem; an answer that is wrong despite correct retrieval is a generation or tool problem. Without the component layer, every failure looks the same.
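The nDCG formula and the tool-call accuracy definition above can be sketched in a few lines. This is a minimal illustration, not the chapter's harness: the function names, the record layout (`tool`, `args`), and the choice to match issued calls to gold calls position by position are all assumptions made for the example.

```python
import math

def dcg_at_k(rels, k):
    """DCG with gain 2^rel - 1 and discount log2(i + 1), ranks counted from 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(ranked_rels, k, all_rels=None):
    """nDCG@k for one query.

    ranked_rels: graded relevance of each retrieved document, in rank order.
    all_rels:    grades of every judged document for the query, used to build
                 the ideal ranking; defaults to ranked_rels (an assumption).
    """
    ideal = sorted(all_rels if all_rels is not None else ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

def tool_call_accuracy(issued, expected):
    """Fraction of issued calls whose tool and arguments match the gold call
    at the same position (position-wise matching is an assumption)."""
    if not issued:
        return 0.0
    correct = sum(1 for got, gold in zip(issued, expected)
                  if got["tool"] == gold["tool"] and got["args"] == gold["args"])
    return correct / len(issued)

# Hypothetical trajectory: the third call picked the wrong tool.
issued = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "lookup_order", "args": {"order_id": "A17"}},
    {"tool": "search", "args": {"q": "shipping"}},
]
expected = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "lookup_order", "args": {"order_id": "A17"}},
    {"tool": "send_email", "args": {"to": "customer"}},
]
print(round(ndcg_at_k([3, 0, 2], k=3), 3))
print(round(tool_call_accuracy(issued, expected), 3))
```

With grades `[3, 0, 2]` the relevant document at rank 3 is discounted, so nDCG comes out at about 0.956 rather than 1; two of the three issued calls match, giving a tool-call accuracy of about 0.667.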

Key Insight: A Component Metric Localizes the Failure; the End-to-End Metric Cannot

End-to-end task success tells you the agent failed; it never tells you which stage failed. Two systems with identical success rates can have opposite remedies: one needs a better retriever, the other a better planner. Component metrics (recall@k, faithfulness, tool-call accuracy) decompose the failure along the pipeline so that engineering effort lands where the binding error actually is. Always compute the component layer alongside the end-to-end layer, never instead of it.

2. Layer Two: End-to-End Task Success Intermediate

Good components do not guarantee a good agent. The stages compose, errors propagate, and the only measurement the user cares about is whether the agent accomplished the goal. The end-to-end layer scores each trajectory, the full sequence of plan, retrieve, call, and synthesize steps the agent took, against the task's success criterion. The headline metric is the task success rate, the fraction of $N$ independent tasks the agent completed, estimated as $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} s_i$ with $s_i \in \{0,1\}$. Because $\hat{p}$ is an estimate over a finite sample, it must be reported with a confidence interval, exactly as Chapter 5 insists for any distributed-system measurement; the normal-approximation (Wald) interval is

$$\hat{p} \pm z_{1-\alpha/2}\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{N}}, \qquad z_{0.975} = 1.96.$$

A success rate of 0.71 reported without its interval is a number pretending to be a fact. Two agents whose intervals overlap substantially are not distinguishable on the evidence you have (a test on the difference between the two proportions is the sharper check when the overlap is slight), and shipping the more expensive one because its point estimate is higher is a measurement error. Beyond the binary outcome, agentic tasks reward trajectory and step evaluation: a task decomposed into subgoals admits partial credit, so an agent that completed three of four subgoals scores $0.75$ rather than $0$. Partial credit and per-step grading turn a coarse pass/fail into a signal dense enough to guide development, and they connect this layer to the trajectory-level agent evaluation of Chapter 32. The cost is that someone, or something, must grade the trajectory, which is where the third layer becomes unavoidable.
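The difference between strict pass/fail and partial credit is easy to see on a handful of trajectories. The sketch below assumes each trajectory has already been graded into a list of per-subgoal booleans; that representation and the function names are hypothetical, chosen only for illustration.

```python
def partial_credit(subgoals_done):
    """Fraction of subgoals completed; an empty subgoal list scores 0."""
    return sum(subgoals_done) / len(subgoals_done) if subgoals_done else 0.0

def strict_success(subgoals_done):
    """Binary s_i: 1 only if every subgoal was completed."""
    return int(bool(subgoals_done) and all(subgoals_done))

# Three illustrative trajectories, four subgoals each (made-up grades).
trajectories = [
    [True, True, True, False],    # three of four subgoals
    [True, True, True, True],     # full success
    [False, False, True, False],  # one of four subgoals
]

credits = [partial_credit(t) for t in trajectories]
strict = [strict_success(t) for t in trajectories]

print("per-trajectory partial credit:", credits)
print("mean partial credit          :", sum(credits) / len(credits))
print("strict success rate          :", sum(strict) / len(strict))
```

Under strict scoring the first and third trajectories are indistinguishable failures; partial credit separates a near miss (0.75) from a trajectory that barely started (0.25).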

3. Layer Three: The LLM-as-Judge and Its Pitfalls Advanced

Grading thousands of free-form trajectories by hand does not scale, so the standard move is to use a strong language model as the judge, prompting it to score each trajectory against a rubric. This is the only practical way to evaluate open-ended agentic output at the volume distributed evaluation demands, and it is also a measurement instrument that can be miscalibrated in ways a ruler never is. The judge has biases the agentic-evaluation literature has catalogued: a position bias that favors the first of two compared answers, a verbosity bias that rates longer answers higher, a self-preference bias toward text from its own model family, and a sycophancy bias toward confident phrasing. A judge that took a hundred steps of reasoning at face value can declare a confidently wrong answer brilliant. The discipline that makes a judge trustworthy is calibration against humans, the same instrument-validation requirement that Chapter 5 places on any proxy metric and that the agentic-orchestration evaluation of Chapter 32 develops for multi-step agents.

Calibration means scoring a sample of trajectories with both the judge and human raters and measuring agreement with a statistic that corrects for chance. Cohen's kappa does exactly that:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed agreement and $p_e$ is the agreement expected if the judge and the human labeled independently. A kappa near $1$ means the judge tracks human judgment; a kappa near $0$ means its apparent agreement is no better than chance, and the judge's scores cannot be trusted to stand in for human ones. Only once kappa clears an acceptable bar (values of $0.6$ to $0.8$ are conventionally read as substantial agreement) does it become defensible to let the judge grade the full benchmark. A judge that has not been calibrated is not a cheaper evaluator; it is an unmeasured one.
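As a worked illustration with made-up counts (not data from this chapter), suppose 100 trajectories are labeled pass or fail by both a human and the judge: both say pass on 40, both say fail on 45, the human passes and the judge fails on 10, and the human fails while the judge passes on 5. The human pass rate is then $0.50$ and the judge pass rate is $0.45$, so

$$p_o = \frac{40 + 45}{100} = 0.85, \qquad p_e = (0.50)(0.45) + (0.50)(0.55) = 0.50, \qquad \kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70.$$

A raw agreement of 85% shrinks to a kappa of $0.70$ once the 50% agreement expected by chance is removed, which is why raw agreement alone overstates how well the judge tracks the human.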

Practical Example: The Judge That Loved Verbose Wrong Answers

Who: An applied-science team shipping a customer-support agent backed by a distributed RAG store.

Situation: Nightly evaluation graded 5,000 trajectories with a language-model judge and reported a steady success rate near 0.80.

Problem: A new agent version scored 0.86 on the judge but generated more customer escalations in production, not fewer.

Dilemma: Trust the judge's higher score and ship, or stop and validate the instrument that produced it, delaying the release.

Decision: They sampled 300 trajectories, had two humans grade them, and computed Cohen's kappa against the judge.

How: Kappa was only 0.41. Error analysis showed the new version wrote longer, more hedged answers, and the judge's verbosity bias rewarded length over correctness; the human labels did not.

Result: They rewrote the judge rubric to penalize unsupported claims, added a faithfulness check from layer one, and saw the recomputed kappa rise to 0.74 before re-running the benchmark. The corrected judge ranked the old version higher, matching production.

Lesson: A judge score is only as good as its agreement with humans. Calibrate before you trust, and recalibrate whenever the agent's output distribution shifts.

4. Layer Four: Systems Metrics, Co-Measured with Quality Intermediate

An agent that answers correctly but takes thirty seconds and a dollar per query may be worse, in production, than one that is slightly less accurate but ten times faster and cheaper. Quality without efficiency is half a picture, which is why this book co-measures them. The systems layer reports latency per step and end-to-end, summarized not by the mean but by tail percentiles, because the user who waits on the slow tail is the user who churns; the $p99$ end-to-end latency, the value below which 99% of requests complete, is the number that governs a latency budget, following the tail-latency reasoning of Chapter 3. Alongside latency sits cost-per-task, which for an agentic system aggregates billed input and output tokens across every step and every tool call:

$$\text{cost-per-task} = \frac{1}{N}\sum_{i=1}^{N}\Big( c_{\text{in}}\, t^{\text{in}}_i + c_{\text{out}}\, t^{\text{out}}_i + c_{\text{tool}}\, u_i \Big),$$

where $t^{\text{in}}_i, t^{\text{out}}_i$ are the input and output tokens summed over the trajectory, $u_i$ is the number of tool calls, and $c_{\text{in}}, c_{\text{out}}, c_{\text{tool}}$ are the per-unit prices. A multi-step agent multiplies token cost by its step count, so cost-per-task is the metric that exposes a planner that wanders. Reliability completes the layer: the fraction of trajectories that finished without an unrecovered error, a timeout, or a tool outage. Quality and these three systems numbers belong in one scorecard, computed on one configuration, so that a quality gain bought with a 5x cost increase is visible at the moment of the trade rather than discovered on the next billing cycle.
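The three systems numbers can be reduced from per-trajectory records in one small function. The sketch below is a minimal example under stated assumptions: the function name and argument layout are hypothetical, the prices are the illustrative ones used in Code 40.9.1, and the five records are invented to show how one slow request moves the tail.

```python
import numpy as np

def systems_metrics(latency_s, in_tok, out_tok, tool_calls, ok,
                    c_in, c_out, c_tool):
    """Reduce per-trajectory records to tail latency, cost-per-task, reliability."""
    latency_s = np.asarray(latency_s, dtype=float)
    # Per-task cost: billed input tokens + output tokens + tool calls.
    cost = (np.asarray(in_tok) * c_in
            + np.asarray(out_tok) * c_out
            + np.asarray(tool_calls) * c_tool)
    return {
        "mean_latency_s": float(latency_s.mean()),
        "p50_latency_s": float(np.percentile(latency_s, 50)),
        "p99_latency_s": float(np.percentile(latency_s, 99)),
        "cost_per_task": float(cost.mean()),
        "reliability": float(np.mean(ok)),   # share finishing without unrecovered error
    }

# Five made-up trajectories; the last one is slow and failed.
m = systems_metrics(
    latency_s=[1.0, 2.0, 3.0, 4.0, 100.0],
    in_tok=[1000] * 5, out_tok=[100] * 5, tool_calls=[1] * 5,
    ok=[True, True, True, True, False],
    c_in=3e-6, c_out=15e-6, c_tool=4e-3,     # illustrative prices, as in Code 40.9.1
)
for name, value in m.items():
    print(f"{name:15s}: {value:.4f}")
```

On this toy sample the median latency is 3 seconds while the mean is 22 and the p99 sits near 96, which is the gap between the typical request and the slow tail that a mean alone blurs.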

Thesis Thread: Evaluation Is Itself a Distributed, Multi-Layer Computation

The scorecard in Figure 40.9.1 is not measured on one machine in one pass over a handful of examples. A serious agentic benchmark runs thousands of trajectories, each itself a distributed computation across retrieval, model, and tool services, and the four metric layers are reduced over all of them. Evaluation thus inherits the same scale-out structure as the system it grades: the work is partitioned across trajectories, executed in parallel, and reduced into summary statistics, exactly the map-then-reduce shape of Chapter 6. Measuring a distributed agent is a distributed job, and both its task quality and its system efficiency are co-measured in that one job.

5. Running the Benchmark as a Parallel Batch Job Intermediate

Each trajectory is independent of every other, which makes the offline benchmark an embarrassingly parallel batch problem. The natural execution model is the one from Chapter 6: map each task in the benchmark set to a worker that runs the full agent and records its trajectory, then reduce the per-trajectory records into the four metric layers. With a benchmark of thousands of tasks and an agent that takes seconds per task, running serially would take hours; sharding the tasks across a pool of workers turns hours into minutes, bounded only by the slowest shard. The reduce step is where the confidence interval and the kappa are computed, over the pooled outcomes, so that the summary statistics reflect the entire suite rather than one shard. The demonstration below performs the reduce step on simulated trajectory outcomes: it computes the task success rate with its interval, the retrieval recall, the judge-versus-human kappa, and the cost-per-task, the four headline numbers of the scorecard, from one set of records.

import numpy as np

rng = np.random.default_rng(7)

# ---- 1. Task success rate with a Wald confidence interval over trajectories ----
N = 600
successes = rng.binomial(1, 0.72, size=N)         # 1 = goal achieved, 0 = failed
p_hat = successes.mean()
se = np.sqrt(p_hat * (1.0 - p_hat) / N)
z = 1.96                                           # 95% normal-approximation interval
ci_lo, ci_hi = p_hat - z * se, p_hat + z * se

# ---- 2. Retrieval recall@k of the agent's retrieval component ----
def recall_at_k(retrieved, relevant):
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant) if relevant else np.nan

Q, recalls = 200, []
for _ in range(Q):
    relevant = rng.choice(1000, size=rng.integers(1, 5), replace=False)
    found = [d for d in relevant if rng.random() < 0.8]   # retriever hit rate 0.8
    retrieved = list(found) + list(rng.choice(1000, size=10, replace=False))
    recalls.append(recall_at_k(retrieved[:10], relevant))
mean_recall = np.nanmean(recalls)

# ---- 3. LLM-as-judge vs human agreement: Cohen's kappa ----
M = 300
human = rng.binomial(1, 0.65, size=M)
flip = rng.random(M) > 0.85                         # judge disagrees 15% of the time
judge = np.where(flip, 1 - human, human)

def cohen_kappa(a, b):
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                            # observed agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())   # chance agreement
    return (po - pe) / (1 - pe)

kappa = cohen_kappa(human, judge)

# ---- 4. Cost per task (a systems metric co-measured with quality) ----
in_tok  = rng.integers(2000, 8000, size=N)
out_tok = rng.integers(200, 1500, size=N)
tool_calls = rng.integers(1, 12, size=N)
cost = in_tok * 3e-6 + out_tok * 15e-6 + tool_calls * 4e-3      # $/tok, $/tok, $/call
cost_per_task = cost.mean()

print(f"task success rate     : {p_hat:.3f}")
print(f"95% CI (Wald)         : [{ci_lo:.3f}, {ci_hi:.3f}]")
print(f"mean recall@10        : {mean_recall:.3f}")
print(f"Cohen's kappa         : {kappa:.3f}")
print(f"cost per task (USD)   : {cost_per_task:.4f}")

Code 40.9.1: The reduce step of a distributed agentic benchmark, computing all four scorecard layers from pooled trajectory records. In a real run the per-trajectory outcomes arrive from workers that each executed the full agent; here they are simulated so the statistics (success rate with interval, recall, kappa, cost) can be reproduced deterministically.

task success rate     : 0.710
95% CI (Wald)         : [0.674, 0.746]
mean recall@10        : 0.783
Cohen's kappa         : 0.719
cost per task (USD)   : 0.0513

Output 40.9.1: The scorecard from one configuration. The success rate carries a 95% interval of about $\pm 0.036$, the retriever recovers 78% of relevant documents, the judge agrees with humans at kappa $0.72$ (substantial, clearing the trust bar), and each task costs roughly five cents; quality and cost are read off the same run.
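Code 40.9.1 covers only the reduce step. The map step that feeds it might be sketched as follows, using a thread pool from the Python standard library on a single machine. Everything about `run_agent` is a hypothetical stub: a real worker would invoke the deployed agent and record its trajectory, and a multi-machine run would replace the thread pool with whatever cluster scheduler the deployment uses.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(task):
    """Hypothetical stand-in for one full agent run.

    A real worker would call the agent service here; this stub fabricates a
    deterministic record so the map-then-reduce shape can be exercised.
    """
    succeeded = task["id"] % 4 != 0
    return {"id": task["id"],
            "success": int(succeeded),
            "tool_calls": 1 + task["id"] % 3}

def run_benchmark(tasks, max_workers=8):
    # Map: trajectories are independent, so they can run concurrently.
    # Threads suit workers that mostly wait on remote model and tool calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        records = list(pool.map(run_agent, tasks))   # map preserves input order

    # Reduce: pool every record before computing suite-level statistics,
    # so the summary reflects the whole suite rather than one shard.
    n = len(records)
    summary = {
        "n": n,
        "success_rate": sum(r["success"] for r in records) / n,
        "mean_tool_calls": sum(r["tool_calls"] for r in records) / n,
    }
    return summary, records

tasks = [{"id": i} for i in range(100)]
summary, records = run_benchmark(tasks)
print(summary)
```

The per-trajectory `records` list is what a reduce step like Code 40.9.1 would consume; keeping the records, not just the summary, is what makes later error analysis and bootstrap resampling possible.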

Library Shortcut: RAGAS and an Eval Harness Compute the Layers for You

Code 40.9.1 spells out the statistics by hand to make them transparent. In practice an evaluation harness wires the layers to your dataset and runs them as a job. RAGAS supplies the retrieval and faithfulness metrics for the retrieval-and-generation stages, and a harness, which can be as simple as a thin wrapper over a dataset runner, handles the map-reduce execution:

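# pip install ragas datasets
# Column names below follow the RAGAS 0.1-style schema; later releases may rename
# them, so check the schema documented for your installed version.
from ragas import evaluate
from ragas.metrics import context_recall, faithfulness, answer_correctness
from datasets import Dataset

# Placeholder inputs (illustrative only); in a real run these are collected
# from the agent's recorded trajectories.
questions          = ["What is the refund window?"]
retrieved_contexts = [["Refunds are accepted within 30 days of purchase."]]
agent_answers      = ["Refunds are accepted within 30 days."]
gold_answers       = ["30 days from purchase."]

# one row per trajectory: question, retrieved contexts, answer, ground truth
ds = Dataset.from_dict({
    "question":     questions,
    "contexts":     retrieved_contexts,   # list[list[str]] per query
    "answer":       agent_answers,
    "ground_truth": gold_answers,
})

# These metrics prompt an LLM, so a judge model must be configured before this call.
report = evaluate(ds, metrics=[context_recall, faithfulness, answer_correctness])
print(report)        # per-metric means over the whole suite, one call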

Code 40.9.2: The retrieval and faithfulness metrics of layer one, now a single evaluate call. RAGAS handles the per-row metric computation, the LLM-judge prompting for faithfulness, and the reduction to suite-level means; you supply the dataset and the metric list. The hand-rolled per-row scoring collapses to one call, and the harness parallelizes the per-row work internally.

6. Online Monitoring and Regression CI for Agents Advanced

The offline benchmark answers "is this version good?" once; production answers "is it still good?" continuously, because the world drifts under a deployed agent in ways no static benchmark sees. Online monitoring instruments the live system with the same systems metrics as layer four (p99 latency, cost-per-task, and reliability, streamed to dashboards) plus quality proxies that can be computed without ground truth: the rate of tool errors, the rate of empty or refused answers, the distribution of trajectory lengths, and a sampled LLM-judge score on live traffic. This is the agentic instance of the fleet monitoring of Chapter 26 and the reliability monitoring of Section 35.7; an agent that starts taking twelve steps where it used to take four, or whose tool-error rate doubles overnight, is degrading even if no single answer is obviously wrong. Monitoring catches the drift that the offline suite, frozen at release time, cannot.
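A rolling-window check on two of these ground-truth-free proxies might look like the sketch below. The class name, the window size, and the "twice the baseline" alert threshold are assumptions made for illustration; a production monitor would likely use the alerting rules of its own metrics stack.

```python
from collections import deque

class DriftMonitor:
    """Compare a rolling window of live trajectories against release-time baselines."""

    def __init__(self, baseline_steps, baseline_tool_error_rate,
                 window=500, ratio=2.0):
        self.baseline_steps = baseline_steps
        self.baseline_err = baseline_tool_error_rate
        self.ratio = ratio                   # alert above ratio x baseline (assumed rule)
        self.steps = deque(maxlen=window)    # most recent trajectory lengths
        self.errors = deque(maxlen=window)   # 1 if the trajectory hit a tool error

    def observe(self, n_steps, had_tool_error):
        self.steps.append(n_steps)
        self.errors.append(1 if had_tool_error else 0)

    def alerts(self):
        if len(self.steps) < self.steps.maxlen:
            return []                        # wait for a full window before judging
        out = []
        if sum(self.steps) / len(self.steps) > self.ratio * self.baseline_steps:
            out.append("trajectory_length")
        if sum(self.errors) / len(self.errors) > self.ratio * self.baseline_err:
            out.append("tool_error_rate")
        return out

# Baseline: four steps per task, 2% tool-error rate (illustrative numbers).
monitor = DriftMonitor(baseline_steps=4.0, baseline_tool_error_rate=0.02, window=100)
for i in range(100):
    # Simulated drift: twelve steps per task, a tool error on every tenth task.
    monitor.observe(n_steps=12, had_tool_error=(i % 10 == 0))
print(monitor.alerts())
```

Both proxies fire here because the simulated traffic triples the step count and raises the tool-error rate to 10%, even though no individual answer was ever checked for correctness.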

The complement to monitoring is regression evaluation in continuous integration. Every change to a prompt, a tool, a model version, or the orchestration graph can silently move the scorecard, so the benchmark suite runs as a gate in CI: a curated set of trajectories is executed on each candidate, the four layers are computed, and the build fails if task success drops below the lower bound of the baseline's confidence interval, if cost-per-task rises beyond a budget, or if judge kappa falls below the trust bar. Because the suite is the same parallel batch job described in subsection 5 above, it fits the CI time budget when sharded across workers. Pinning the judge model version and the random seed makes the comparison construct-matched (the same panel and configuration on each run), so a measured regression is far more likely to be a real change in the agent than noise in the instrument. An agent without a regression gate is an agent that degrades silently between releases.
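The three gate conditions can be expressed as one small function that a CI step calls after the benchmark finishes. This is a sketch under assumptions: the function name is hypothetical, the success check uses the Wald half-width of the baseline from earlier in this section, and the \$0.06 cost budget in the example is invented for illustration.

```python
import math

def regression_gate(baseline_p, baseline_n, cand_p, cand_cost, cand_kappa,
                    cost_budget, kappa_bar=0.6, z=1.96):
    """Return the names of failed checks; an empty list means the build may pass."""
    # Wald half-width of the baseline success rate.
    half_width = z * math.sqrt(baseline_p * (1.0 - baseline_p) / baseline_n)
    failures = []
    if cand_p < baseline_p - half_width:      # success below the baseline's lower bound
        failures.append("task_success")
    if cand_cost > cost_budget:               # cost-per-task over budget
        failures.append("cost_per_task")
    if cand_kappa < kappa_bar:                # judge no longer trusted
        failures.append("judge_kappa")
    return failures

# Baseline 0.71 at N = 600; the candidate is more accurate but far more expensive.
failures = regression_gate(baseline_p=0.71, baseline_n=600,
                           cand_p=0.75, cand_cost=0.190, cand_kappa=0.72,
                           cost_budget=0.06)
print(failures)
```

In a pipeline the script would exit non-zero when the list is non-empty. Returning the names of the failed checks, instead of a bare boolean, tells the author of the change which layer of the scorecard moved.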

Research Frontier: Trustworthy Agentic Evaluation (2024 to 2026)

Because the LLM judge is now load-bearing, a vigorous research line is hardening it. Work on judge bias and calibration quantifies and mitigates position, verbosity, and self-preference effects, with panel-of-judges and debiasing-prompt methods reporting higher human agreement than a single judge. Agentic benchmarks have matured past single-turn question answering toward executable, multi-step tasks: AgentBench, the WebArena and tau-bench lineages, GAIA, and SWE-bench grade whether an agent actually accomplished a goal in a live environment, scoring trajectories and tool use rather than final text alone. A parallel thread on process supervision and trajectory-level reward modeling grades each step of a plan, not only the outcome, which sharpens partial-credit scoring and feeds back into training. The open problem the field is converging on is a benchmark that co-reports task success, judge-human agreement, and per-task cost as one figure, the very scorecard this section assembles, so that a reported quality gain can never hide an unmeasured cost or an uncalibrated judge.

Fun Note

An agent once scored a perfect 1.0 on an internal benchmark, which delighted everyone until someone noticed the benchmark's ground-truth answers had leaked into the agent's retrieval corpus. The agent was not solving the tasks; it was retrieving the answer key. The fix was a single line that excluded the eval set from the index, and the success rate settled to a far more honest 0.68. Contamination is the oldest way to ace a test, and language agents rediscovered it within a week.

The four layers, run as a parallel batch job offline and as monitoring plus a CI gate online, give the chapter's agentic system an honest, multi-faceted grade: each component measured, the whole task measured, the judge that grades it validated against humans, and the cost of all of it co-reported. With evaluation in hand, the chapter turns in Section 40.10 to a staged project that hands the system back to the reader to build out one layer at a time and extend.

Exercise 40.9.1: Which Layer Caught It? Conceptual

For each symptom, name which of the four evaluation layers would first detect it and explain why the others would miss or misattribute it: (a) the agent's answers are fluent and confident but cite documents that do not support them; (b) end-to-end success is unchanged but the average answer now costs three times as much; (c) the offline benchmark reports 0.82 success while production users complain the agent is wrong half the time; (d) success rate jumped from 0.74 to 0.81 between two builds but the new build's answers are merely longer. State, for (c) and (d), what validation step you would run before trusting the number.

Exercise 40.9.2: Tighten the Interval, Calibrate the Judge Coding

Starting from Code 40.9.1, (a) replace the Wald interval with the Wilson score interval and compare the two at $N = 600$ and again at $N = 30$, explaining why the Wilson interval is preferred for small samples and for $\hat{p}$ near $0$ or $1$. (b) Add a second human rater and compute Cohen's kappa between the two humans (inter-rater agreement); then argue why the judge-human kappa should be interpreted relative to the human-human kappa rather than against an absolute bar. (c) Determine how many judged trajectories $M$ you would need so that the judge-human kappa is estimated to within $\pm 0.05$, by bootstrapping the kappa over resamples of the labels.

Exercise 40.9.3: The Cost of a Quality Gain Analysis

An agent variant raises task success from 0.71 (95% CI $[0.674, 0.746]$) to 0.75 but its cost-per-task rises from \$0.051 to \$0.190 because it takes more reasoning steps. Using the interval, argue whether the 0.04 improvement is statistically distinguishable at $N = 600$ and how large $N$ would have to be for it to become so. Then, treating the system as serving one million tasks per day, compute the daily cost difference and state the decision rule you would use to accept or reject the variant. Explain why reporting the success gain without the cost number, or the cost number without the interval, would each lead to a wrong call.