Section 36.8: Evaluation | Building Scalable AI

"They asked me to grade a system spread across a thousand machines, and gave me a single number to do it with. I have since learned to ask for several numbers, and to ask which machine produced each one."
A Metric, Trying to Judge a System Too Big to See at Once

Big Picture

A distributed RAG system cannot be judged by a single score, because it has two intelligences and one machine, and all three can fail independently. The retriever can surface the wrong passages, the generator can write a fluent answer ungrounded in those passages, and the cluster underneath can be too slow, too expensive, or too stale to matter even when both models are correct. Evaluation here means measuring all three layers at once: retrieval quality (did we find the right documents?), answer quality (did the model use them faithfully?), and system efficiency (at what throughput, latency, and cost?). The discipline this section enforces is that these numbers are only comparable when they are co-computed in one pass, on one configuration: one index build, one model, one held-out set, one cluster size. A retrieval score from yesterday's index and a latency from today's hardware do not describe the same system, and stitching them together produces an audit that passes while describing nothing real.

By the time a query reaches the end of this chapter's pipeline it has passed through a sharded inverted index and a sharded vector store (Chapter 25), a reranker, and a distributed generation step, all running across many machines. The previous section assembled that pipeline; this one asks whether it is any good. The trap to avoid is the one the epigraph names: reducing a many-machine system to a single accuracy figure that hides whether the retriever, the generator, or the cluster is the thing that needs fixing. We therefore evaluate on two model layers and one systems layer, and we insist on the methodology from Chapter 5: measure the machine learning quality and the distributed-system efficiency together, on the same config, in the same run.

Figure 36.8.1: The evaluation scorecard for a distributed RAG system. Three layers (retrieval quality, generation quality, system efficiency) are each measured by their own metrics, then rolled into one verdict computed on a single configuration. Reading any single band in isolation hides whether a low end-to-end score comes from the retriever, the generator, or the cluster.

1. Two Intelligences, One Machine, Three Layers Beginner

A RAG system answers a question in two learned steps. First the retriever selects a small set of passages from a corpus too large to read in full; then the generator conditions on those passages to write an answer. Each step is a model, and each can be wrong in its own way, so a competent evaluation reports them separately before combining them. Layer one asks whether the retriever put the right evidence in front of the generator. Layer two asks whether the generator used that evidence faithfully or improvised. These two layers are the machine learning quality of the system, and they are necessary but not sufficient, because a RAG system is also a piece of distributed infrastructure that must answer at volume, within a latency budget, at a defensible cost, over an index that is fresh enough to be correct. That is layer three, the systems quality, and it is the layer the rest of this book has been teaching you to measure. Figure 36.8.1 lays out all three.

The reason to keep the layers distinct is diagnostic. When the end-to-end answer is wrong, the layers tell you where to look: a low recall at layer one means the generator never had a chance; a high recall but low faithfulness at layer two means the evidence was present and ignored; a perfect pair of model scores beside a p99 latency that blows the budget means the models are fine and the cluster is the problem. A single fused score cannot make that distinction, and a system you cannot diagnose is a system you cannot improve.
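The diagnostic reading above can be sketched as a small triage function. This is a minimal illustration rather than part of the chapter's code: the function name `diagnose`, the three thresholds, and the latency budget are all assumed values chosen for the example.

```python
def diagnose(recall_at_k, faithfulness, p99_ms,
             min_recall=0.80, min_faithfulness=0.85, p99_budget_ms=1500.0):
    """Return the layers most likely at fault, checked in pipeline order.

    All thresholds are hypothetical defaults for illustration only.
    """
    suspects = []
    if recall_at_k < min_recall:
        # The generator never received the evidence, so fix retrieval first.
        suspects.append("retriever")
    elif faithfulness < min_faithfulness:
        # Evidence was present, but the answer did not stay grounded in it.
        suspects.append("generator")
    if p99_ms > p99_budget_ms:
        # Independent of model quality: the latency tail blows the budget.
        suspects.append("cluster")
    return suspects or ["none"]

print(diagnose(0.55, 0.60, 900.0))    # low recall
print(diagnose(0.93, 0.70, 900.0))    # good recall, poor faithfulness
print(diagnose(0.93, 0.92, 2400.0))   # good models, slow tail
```

Faithfulness is only inspected when recall passes, because a generator that never saw the evidence cannot be blamed for failing to use it.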

Key Insight: Quality and Efficiency Are One Measurement, Not Two

The temptation is to let the machine learning team report accuracy and the platform team report latency, each on its own benchmark and its own hardware. That split produces numbers that cannot be compared, because they describe different systems. A faithful evaluation of a distributed RAG pipeline co-computes the retrieval metrics, the answer metrics, and the system metrics in one pass over one held-out set on one cluster configuration, and reports them as one artifact. Only then does "this version is better" mean the same thing on every axis: the same index, the same model, the same hardware produced every number you are comparing.

2. Layer One: Retrieval Quality Intermediate

Retrieval quality is an information-retrieval problem with decades of settled metrics, introduced for distributed vector search in Chapter 25 and applied here to the full RAG retriever. We need a held-out set of queries, each annotated with the set of documents that are genuinely relevant, and then we score the ranked list the retriever returns against that ground truth. Three metrics carry most of the weight. The first, recall at $k$, asks what fraction of the relevant documents appear anywhere in the top $k$ results, which is the quantity that bounds how good the generator can possibly be, since it cannot cite evidence it never received:

$$\mathrm{Recall@}k = \frac{\lvert \{\text{relevant docs}\} \cap \{\text{top-}k\ \text{retrieved}\} \rvert}{\lvert \{\text{relevant docs}\} \rvert}.$$

Recall says how much relevant evidence arrived but ignores where in the ranking it landed; for a generator that reads the top results first, rank matters. The mean reciprocal rank captures it by averaging, over $Q$ queries, the reciprocal of the position of the first relevant hit, so a relevant document at rank 1 scores 1 and one at rank 5 scores only $0.2$:

$$\mathrm{MRR} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\mathrm{rank}_q},$$

where $\mathrm{rank}_q$ is the position of the first relevant document for query $q$. To grade the whole ranked list rather than just its first hit, normalized discounted cumulative gain discounts each relevant document by a logarithm of its rank, $\log_2(i+1)$, and divides by the score of the best achievable arrangement, so a perfect ordering scores 1:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}, \qquad \mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},$$

where $\mathrm{rel}_i$ is the relevance of the document at rank $i$ (here $1$ if relevant, $0$ otherwise) and $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ideal ranking. These three are computed for real in Code 36.8.1, alongside the system metrics, so that a single run produces the whole top band of Figure 36.8.1.
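
As a worked instance of the three definitions, take a query with three relevant documents whose top-5 ranking places relevant documents at ranks 1 and 3 and misses the third (the third query in Code 36.8.1 has this shape). With binary relevance:

$$\begin{aligned}
\mathrm{Recall@}5 &= \tfrac{2}{3} \approx 0.667, \qquad \frac{1}{\mathrm{rank}_q} = \frac{1}{1} = 1, \\
\mathrm{DCG@}5 &= \frac{1}{\log_2 2} + \frac{1}{\log_2 4} = 1 + 0.5 = 1.5, \\
\mathrm{IDCG@}5 &= \frac{1}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} \approx 1 + 0.631 + 0.5 = 2.131, \\
\mathrm{nDCG@}5 &= \frac{1.5}{2.131} \approx 0.704.
\end{aligned}$$

The reciprocal rank is perfect because the first hit is at rank 1, while recall and nDCG both register the missing third document.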

3. Layer Two: Generation and Answer Quality Intermediate

Good retrieval is wasted if the generator ignores it. The second layer therefore measures the answer itself against the passages that were retrieved, and three questions organize it. Faithfulness, sometimes called groundedness, asks whether every claim in the answer is supported by the retrieved context; an answer that asserts a fact absent from its evidence is a hallucination regardless of whether the fact happens to be true. Answer relevance asks whether the response actually addresses the question rather than wandering into adjacent, well-grounded but unhelpful territory. The hallucination rate is the answer-level counterpart of faithfulness, aggregated over the held-out set: the fraction of answers containing at least one unsupported claim. It is the number an operator watches most closely because an ungrounded answer in production erodes trust faster than a merely incomplete one.

Unlike the retrieval metrics, these quantities have no closed form, because judging whether a sentence is supported by a passage is itself a language-understanding task. The modern practice is to use a strong language model as the judge: present it with the question, the retrieved context, and the generated answer, and ask it to decompose the answer into atomic claims and mark each as supported or not. The faithfulness score is then the fraction of claims supported. This is the LLM-as-judge pattern, and it is powerful and fallible at once; a judge model has its own biases, can be swayed by answer length or fluency, and must itself be validated against a sample of human labels before its verdicts are trusted at scale. We treat its limitations as a measurement problem, the same way Chapter 5 treats any estimator with variance, and we run the judge as a distributed batch job because evaluating tens of thousands of held-out answers with a large judge model is itself a scale-out workload.
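
Once the judge has returned its per-claim verdicts, the aggregation is simple arithmetic. The sketch below assumes the verdicts are already available as booleans; the `verdicts` dictionary and its answer ids are made up for illustration, and no real judge model is called.

```python
from statistics import mean

# Hypothetical judge output: for each answer, one boolean per atomic claim,
# True if the judge marked the claim as supported by the retrieved context.
verdicts = {
    "a0": [True, True, True, True],
    "a1": [True, False, True],
    "a2": [True, True],
    "a3": [False, True, True, True, True],
}

def faithfulness(claims):
    """Fraction of an answer's claims marked supported (assumes >= 1 claim)."""
    return sum(claims) / len(claims)

per_answer = {a: faithfulness(c) for a, c in verdicts.items()}
mean_faithfulness = mean(per_answer.values())

# An answer counts as hallucinated if any one of its claims is unsupported.
hallucination_rate = mean(0.0 if all(c) else 1.0 for c in verdicts.values())

print(f"mean faithfulness  : {mean_faithfulness:.4f}")
print(f"hallucination rate : {hallucination_rate:.4f}")
```

In this toy sample the mean faithfulness is about 0.87 while half of the answers contain an unsupported claim, so the two numbers are best reported side by side rather than derived from each other.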

Library Shortcut: RAGAS and trec_eval Compute These Metrics for You

Code 36.8.1 implements recall, MRR, and nDCG by hand to make their definitions concrete. In production you reach for established tools. For the retrieval layer, trec_eval (and its Python interface pytrec_eval) is the standard scorer that consumes a TREC-format run file and a qrels file of relevance judgments and emits dozens of metrics including recall, recip_rank, and ndcg_cut. For the generation layer, the ragas library wraps the LLM-as-judge pattern into named metrics:

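# pip install ragas pytrec_eval
import pytrec_eval

qrels = {"q0": {"d12": 1, "d3": 1}}                 # relevance judgments
run   = {"q0": {"d3": 9.1, "d20": 8.4, "d12": 7.7}} # retriever scores
ev = pytrec_eval.RelevanceEvaluator(qrels, {"recall_5", "ndcg_cut_5", "recip_rank"})
print(ev.evaluate(run)["q0"])    # -> recall_5, ndcg_cut_5, recip_rank in one call

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# dataset columns: question, contexts (retrieved), answer
# (column names have changed across ragas releases; check the installed version)
dataset = Dataset.from_dict({                       # made-up single-row example
    "question": ["How do I rotate an API key?"],
    "contexts": [["Keys are rotated from the Security tab of the console."]],
    "answer":   ["Open the Security tab of the console and rotate the key there."],
})
# LLM-as-judge: needs credentials for a judge model to be configured first
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])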

Code 36.8.2: Roughly thirty lines of hand-written metric code from Code 36.8.1 collapse to two library calls. pytrec_eval handles the ranked-list bookkeeping and tie-breaking that TREC standardized; ragas handles the prompt construction, claim decomposition, and judge-model orchestration behind faithfulness and answer relevance.

Practical Example: The Fluent Answer That Failed the Faithfulness Gate

Who: A platform engineer running a web-scale RAG assistant over a documentation corpus.

Situation: A new generator model shipped with higher answer-relevance and human-preference scores, and the team prepared to roll it out.

Problem: Recall at layer one was unchanged, yet the LLM-as-judge faithfulness score fell from 0.91 to 0.78, and the hallucination rate nearly doubled.

Dilemma: Ship the model that users rated more helpful, or block on a faithfulness regression that no single user complaint had yet surfaced.

Decision: They blocked, then audited a sample of the judge's verdicts against human labels and confirmed the regression was real, not a judge artifact.

How: The new model wrote longer, more confident answers that pulled in plausible facts absent from the retrieved passages; relevance rose because the prose was on-topic, but groundedness fell because the prose outran the evidence.

Result: They tuned the prompt to constrain the model to its context and re-ran the full scorecard on one config; faithfulness recovered to 0.90 while relevance held, and only then did the model ship.

Lesson: A more helpful-sounding answer is not a better answer. Without a faithfulness gate co-measured with relevance, the system would have shipped a regression dressed as an improvement.
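
The audit step in the example above, checking the judge's verdicts against human labels, is commonly summarized with a chance-corrected agreement statistic. The sketch below uses Cohen's kappa as one reasonable choice (the source does not name a statistic), and both label lists are invented for illustration.

```python
def cohen_kappa(judge, human):
    """Cohen's kappa for two equal-length lists of binary labels (0 or 1)."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge, p_human = sum(judge) / n, sum(human) / n
    # Agreement expected by chance if the two raters labeled independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:            # both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical audit sample: 1 = claim supported, 0 = claim unsupported.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

agreement = sum(j == h for j, h in zip(judge, human)) / len(human)
kappa = cohen_kappa(judge, human)
print(f"raw agreement: {agreement:.2f}   Cohen's kappa: {kappa:.4f}")
```

For this made-up sample the raw agreement is 0.80 but kappa is only about 0.52, because most claims are supported and two raters would agree often by chance alone. That gap is the reason to report a chance-corrected number before trusting a judge at scale.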

4. Layer Three: System Efficiency, Co-Measured Intermediate

The third layer is the one this book evaluates every system against, and it is where the distribution itself is graded. Four families of numbers matter. Latency is reported as a distribution, not a mean: the p50 is the typical experience and the p99 is the tail that a sharded retriever, which must wait for its slowest shard before it can rank, tends to inflate, exactly the straggler tax that Chapter 3 models. Throughput is the sustained queries per second the fleet delivers under that latency budget. Cost per query divides the hourly cost of the serving nodes by the queries they answer in that hour,

$$\mathrm{cost\text{-}per\text{-}query} = \frac{(\text{nodes}) \times (\text{cost per node-hour})}{\text{queries answered per hour}},$$

and it is the number that decides whether a quality gain is affordable. Freshness, the lag between a document entering the corpus and becoming retrievable, is a quality metric in disguise: a correct answer about a stale index is a wrong answer about the world.
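
The claim that a sharded retriever inflates the latency tail can be illustrated with a small simulation. This is a toy model under stated assumptions, not a measurement: per-shard latencies are drawn independently from a lognormal distribution with invented parameters, and a query's latency is taken to be the maximum over its shards.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable
n_queries, n_shards = 10_000, 16         # hypothetical workload and fan-out

# Assumed per-shard latency in milliseconds: lognormal, so a long right tail.
shard_ms = rng.lognormal(mean=3.0, sigma=0.5, size=(n_queries, n_shards))

single_shard = shard_ms[:, 0]            # latency if one shard answered alone
fan_out = shard_ms.max(axis=1)           # the query waits for its slowest shard

for name, lat in [("1 shard", single_shard), ("16-shard fan-out", fan_out)]:
    p50, p99 = np.percentile(lat, [50, 99])
    print(f"{name:>16}: p50 = {p50:7.1f} ms   p99 = {p99:7.1f} ms")
```

In this model the fan-out median sits well above the single-shard median, because the maximum of sixteen draws is rarely a typical draw. A mean would blur exactly that shift, which is why the layer reports p50 and p99 rather than an average.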

The distribution itself is also graded, with the speedup and efficiency framing from Chapter 3. If a job takes time $T_1$ on one worker and $T_p$ on $p$ workers, the speedup and parallel efficiency are

$$S = \frac{T_1}{T_p}, \qquad E = \frac{S}{p},$$

where $E = 1$ is perfect linear scaling and $E$ below one reflects the communication and straggler overhead of spreading the work. This framing applies twice over in a RAG system: to the online serving fleet, and to the offline evaluation job itself, because scoring a large held-out set is a batch workload that we parallelize across workers exactly as Chapter 6 would parallelize any embarrassingly parallel map. Code 36.8.1 computes $S$ and $E$ from real timings beside the retrieval metrics, so the scorecard reports model quality and distribution efficiency from one run.

import numpy as np

# Ground truth: for each of 5 queries, the set of relevant document ids.
relevant = {0: {"d12", "d3"}, 1: {"d7"}, 2: {"d1", "d9", "d4"},
            3: {"d5"}, 4: {"d2", "d8"}}

# Ranked retrievals returned by the distributed RAG retriever, top-5 per query.
ranked = {0: ["d3", "d20", "d44", "d99", "d7"],   # finds d3, misses d12
          1: ["d11", "d7", "d2", "d9", "d5"],     # one relevant doc at rank 2
          2: ["d9", "d33", "d4", "d6", "d77"],    # finds d9, d4; misses d1
          3: ["d8", "d2", "d6", "d1", "d5"],      # relevant d5 buried at rank 5
          4: ["d2", "d6", "d55", "d9", "d3"]}     # finds d2, misses d8
K = 5

def recall_at_k(ret, rel, k):
    return sum(1 for d in ret[:k] if d in rel) / len(rel)

def reciprocal_rank(ret, rel):                    # 1 / position of first hit
    for i, d in enumerate(ret, start=1):
        if d in rel:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ret, rel, k):
    dcg = sum((1.0 if d in rel else 0.0) / np.log2(i + 1)
              for i, d in enumerate(ret[:k], start=1))
    idcg = sum(1.0 / np.log2(i + 1) for i in range(1, min(len(rel), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0

recalls = [recall_at_k(ranked[q], relevant[q], K) for q in relevant]
rrs     = [reciprocal_rank(ranked[q], relevant[q]) for q in relevant]
ndcgs   = [ndcg_at_k(ranked[q], relevant[q], K) for q in relevant]
print("=== Retrieval quality (averaged over 5 queries) ===")
print(f"Recall@{K} : {np.mean(recalls):.4f}")
print(f"MRR       : {np.mean(rrs):.4f}")
print(f"nDCG@{K}  : {np.mean(ndcgs):.4f}")

# System efficiency: speedup and efficiency of the distributed eval job.
T1 = 1840.0                                       # one worker, full held-out set
timings = {2: 940.0, 4: 487.0, 8: 262.0, 16: 158.0}
print("\n=== Distributed efficiency of the evaluation job itself ===")
print(f"{'workers p':>9} {'T_p (s)':>9} {'speedup S':>10} {'efficiency':>11}")
print(f"{1:>9} {T1:>9.0f} {1.0:>10.2f} {1.0:>11.2f}")
for p in sorted(timings):
    S = T1 / timings[p]
    print(f"{p:>9} {timings[p]:>9.0f} {S:>10.2f} {S / p:>11.2f}")

# Cost per query at the chosen serving point.
cost_per_query = (2.40 * 8) / (320.0 * 3600)      # node-cost * nodes / queries-per-hour
print(f"\ncost-per-query @ 8 nodes, 320 qps : ${cost_per_query:.6f}")

Code 36.8.1: One pass that computes the retrieval layer (recall@5, MRR, nDCG@5) and the system layer (speedup, efficiency, cost-per-query) together, so every number on the scorecard comes from the same configuration. The faithfulness layer of Figure 36.8.1 is added with the LLM-as-judge tool of Code 36.8.2.

=== Retrieval quality (averaged over 5 queries) ===
Recall@5 : 0.7333
MRR       : 0.7400
nDCG@5  : 0.5896

=== Distributed efficiency of the evaluation job itself ===
workers p   T_p (s)  speedup S  efficiency
        1      1840       1.00        1.00
        2       940       1.96        0.98
        4       487       3.78        0.94
        8       262       7.02        0.88
       16       158      11.65        0.73

cost-per-query @ 8 nodes, 320 qps : $0.000017

Output 36.8.1: The retriever recovers 73 percent of relevant documents on average and ranks the first hit near the top (MRR 0.74), but the lower nDCG@5 reveals relevant evidence that is buried deeper in the list or missing from it. The eval job scales near-linearly to 8 workers (efficiency 0.88) and then degrades to 0.73 at 16, the point where coordination overhead starts to eat the gain.

Thesis Thread: The Same Scorecard, Now Across Machines

Evaluation is where scale-out stops being a means and becomes a thing to be measured. The retrieval and answer metrics judge the two AI models, but the speedup and efficiency numbers in Output 36.8.1 judge the distribution itself: how well the work spreads across $p$ workers, and where adding machines stops helping. A capstone system (Chapter 41) is not defended by its accuracy alone but by accuracy reported beside the efficiency of the cluster that achieved it. Throughout this book the combining step was the tax on scale-out; here that tax becomes a column on the scorecard, the $1 - E$ you pay to evaluate at the scale you serve at.

5. Offline at Scale, then Online A/B Advanced

Two evaluation regimes coexist, and a mature RAG program runs both. Offline evaluation replays a fixed held-out set through the pipeline and scores it, which is reproducible, cheap to repeat on every change, and the right gate before anything reaches users. At web scale the held-out set has tens or hundreds of thousands of queries and the judge model is large, so offline evaluation is itself a distributed batch job: shard the held-out queries across workers, score each shard independently, and reduce the per-shard metrics into one aggregate, the same map-then-reduce shape as Chapter 6, with the speedup and efficiency of that job graded exactly as in Output 36.8.1. The methodology, what to hold out, how to control variance, how to validate the judge against humans, is the subject of Chapter 5; this section applies it to a RAG pipeline specifically.
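
The shard, score, and reduce shape described above can be sketched on the five queries of Code 36.8.1. The sketch makes two assumptions for illustration: threads stand in for cluster workers, and the shards are deliberately unequal in size. The function names `score_shard` and `reduce_partials` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

K = 5
# (ranked ids, relevant ids) pairs: the five queries of Code 36.8.1.
queries = [
    (["d3", "d20", "d44", "d99", "d7"], {"d12", "d3"}),
    (["d11", "d7", "d2", "d9", "d5"], {"d7"}),
    (["d9", "d33", "d4", "d6", "d77"], {"d1", "d9", "d4"}),
    (["d8", "d2", "d6", "d1", "d5"], {"d5"}),
    (["d2", "d6", "d55", "d9", "d3"], {"d2", "d8"}),
]
shards = [queries[:2], queries[2:]]      # deliberately unequal shard sizes

def score_shard(shard):
    """Map step: return a partial sum and a count, never a per-shard mean."""
    recall_sum = sum(len(set(ret[:K]) & rel) / len(rel) for ret, rel in shard)
    return recall_sum, len(shard)

def reduce_partials(partials):
    """Reduce step: add the sums and the counts, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

with ThreadPoolExecutor(max_workers=2) as pool:   # threads stand in for workers
    partials = list(pool.map(score_shard, shards))

recall = reduce_partials(partials)
mean_of_means = sum(s / n for s, n in partials) / len(partials)
print(f"Recall@{K} (sum/count reduce)    : {recall:.4f}")
print(f"Recall@{K} (mean of shard means) : {mean_of_means:.4f}")
```

Reducing sums and counts reproduces the 0.7333 of Output 36.8.1, while averaging the per-shard means gives about 0.7361 because the shards differ in size. The same sum-and-count reduction applies to MRR and nDCG.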

Offline scores predict but do not prove production quality, because the held-out set is never the live query distribution. Online A/B testing closes the gap: route a fraction of live traffic to the candidate system, hold the rest on the control, and compare not only the model metrics computed on logged interactions but the system metrics that only production reveals, p99 latency under real load, cost per query at real traffic, and downstream signals like answer acceptance or follow-up rate. The discipline of the offline run carries over: the A and B arms must differ in exactly one thing, so that a difference in outcome is attributable. Changing the retriever and the generator and the cluster size at once produces a result you cannot read, the online analogue of comparing numbers from different configs that the Big Picture warned against.
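
Whether a difference between the two arms is attributable also depends on whether it exceeds sampling noise. As one common check (an assumption here, since the source does not prescribe a test), a two-proportion z-test can compare a binary downstream signal such as answer acceptance. The counts below are invented.

```python
from math import erf, sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function, then the two-sided tail.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical logged outcomes: accepted answers out of requests served.
z, p_value = two_proportion_z(success_a=4210, n_a=10_000,   # control arm
                              success_b=4390, n_b=10_000)   # candidate arm
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

With these made-up counts the candidate's acceptance rate is 1.8 points higher, z is about 2.57, and the two-sided p-value is about 0.01. The test says nothing about cause unless the arms differ in exactly one thing, which is the design discipline the paragraph above insists on.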

Research Frontier: Trustworthy LLM-as-Judge for RAG (2024 to 2026)

The faithfulness and relevance scores of layer two rest on a judge model, and a vigorous research line is making that judge trustworthy enough to gate releases. The RAGAS framework (Es et al., 2023) formalized reference-free faithfulness and answer-relevance scoring, and benchmarks such as RAGTruth (Niu et al., 2024) supply human-annotated hallucination labels to calibrate judges against ground truth. Work on judge bias documents systematic errors, position bias, verbosity bias, self-preference, and proposes panels of judges and bias-correction to counter them; the LLM-as-a-judge survey literature of 2024 to 2025 collects these. A parallel thread builds RAG-specific benchmarks (RGB, CRAG, the 2024 RAG evaluation suites) that stress retrieval robustness, noise tolerance, and refusal-when-unsupported. The frontier question is statistical: how many judge calls, over how diverse a held-out set, are needed for a faithfulness difference to be significant rather than noise, which is the variance-control problem of Chapter 5 applied to a learned, expensive, and biased estimator.
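
One simple way to ask whether a faithfulness difference is signal or noise is a paired bootstrap over the held-out queries. The sketch below is an illustration under stated assumptions: the per-query scores are synthetic draws rather than real judge output, and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_boot = 1000, 2000           # arbitrary sizes for the sketch

# Synthetic per-query faithfulness scores in [0, 1] for two systems judged on
# the SAME held-out queries; real scores would come from the judge model.
control = np.clip(rng.normal(0.88, 0.10, n_queries), 0.0, 1.0)
candidate = np.clip(control + rng.normal(0.01, 0.08, n_queries), 0.0, 1.0)

diff = candidate - control               # paired: one difference per query
idx = rng.integers(0, n_queries, size=(n_boot, n_queries))
boot_means = diff[idx].mean(axis=1)      # mean difference of each resample
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean difference  : {diff.mean():+.4f}")
print(f"95% bootstrap CI : [{ci_low:+.4f}, {ci_high:+.4f}]")
print("CI excludes zero" if ci_low > 0 or ci_high < 0 else "CI includes zero")
```

Pairing matters because both systems are scored on the same queries, so per-query difficulty cancels in the difference and the interval is tighter than one built from two independent samples. The bootstrap captures sampling variance only; it does not correct the judge's own biases.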

Fun Note

A RAG system that hallucinates confidently and a student who bluffs an exam answer share a failure mode: both are fluent, both are on-topic, and both are graded down the instant a grader checks the answer against the source rather than against its vibe. Faithfulness scoring is just the grader finally reading the citations.

6. Reading the Scorecard as One Verdict Beginner

The point of the three layers is to be read together. Table 36.8.1 collects the metrics, what each one tells you, and the chapter that develops it, so that a release decision consults a single scorecard rather than three disconnected dashboards. A candidate ships only when it improves or holds every layer on the same configuration: retrieval at least as good, faithfulness at least as good, and latency, cost, and freshness within budget at the throughput it must serve. A gain on one layer bought by a regression on another is not a gain, it is a trade, and trades are decisions to make in the open with all the numbers present, not surprises to discover in production.

Table 36.8.1: The distributed RAG evaluation scorecard. Three layers (with the system layer split into serving efficiency and the distribution itself), the metrics that grade each, the question each answers, and where the book develops it. A release decision reads all three layers from one run on one configuration.

Layer	Metrics	Question it answers	Developed in
Retrieval quality	Recall@k, nDCG@k, MRR	Did we find the right evidence, and rank it well?	Chapter 25
Answer quality	Faithfulness, answer relevance, hallucination rate	Did the model use the evidence faithfully?	This chapter (LLM-as-judge)
System efficiency	p50/p99 latency, throughput, cost/query, freshness	Fast, affordable, and current enough to serve?	Chapter 3, Chapter 5
Distribution itself	Speedup $S$, efficiency $E$	How well does the work spread across machines?	Chapter 3
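
The ship rule stated above, improve or hold every layer on the same configuration, can be sketched as a gate over two scorecards. Everything in this sketch is hypothetical: the dictionary keys, the budget values, and the numbers. The faithfulness pair echoes the practical example earlier in the section.

```python
HIGHER_IS_BETTER = ("recall_at_k", "ndcg_at_k", "faithfulness")
WITHIN_BUDGET = ("p99_ms", "cost_per_query", "freshness_s")

def release_gate(baseline, candidate, budget):
    """Return the reasons to block a release; an empty list means ship."""
    reasons = []
    if baseline["eval_config"] != candidate["eval_config"]:
        reasons.append("scorecards come from different configurations")
    for m in HIGHER_IS_BETTER:           # model layers must improve or hold
        if candidate[m] < baseline[m]:
            reasons.append(f"{m} regressed: {baseline[m]} -> {candidate[m]}")
    for m in WITHIN_BUDGET:              # system layer must stay in budget
        if candidate[m] > budget[m]:
            reasons.append(f"{m} over budget: {candidate[m]} > {budget[m]}")
    if candidate["qps"] < budget["min_qps"]:
        reasons.append(f"throughput below target: {candidate['qps']} qps")
    return reasons

config = {"held_out_set": "v3", "index_build": "2024-06-01", "nodes": 8}
baseline = {"eval_config": config, "recall_at_k": 0.73, "ndcg_at_k": 0.59,
            "faithfulness": 0.91, "p99_ms": 1200, "cost_per_query": 1.7e-5,
            "freshness_s": 300, "qps": 320}
candidate = dict(baseline, faithfulness=0.78)    # only faithfulness changed
budget = {"p99_ms": 1500, "cost_per_query": 2.0e-5, "freshness_s": 600,
          "min_qps": 300}

reasons = release_gate(baseline, candidate, budget)
print(reasons or "ship")
```

Returning reasons rather than a boolean keeps a trade visible: the caller sees which layer paid for a gain elsewhere and can decide in the open.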

With evaluation in hand, the chapter has what it needs to argue that this web-scale RAG system is not merely assembled but good, on every axis at once. The next section turns from measuring the system to building it, handing the whole pipeline back as a staged project that grows a single-machine baseline into the distributed system one milestone at a time; that construction plan begins in Section 36.9.

Exercise 36.8.1: Which Layer Is Broken? Conceptual

For each symptom, name the layer of Figure 36.8.1 that is most likely at fault and the one metric you would check first: (a) answers are fluent and on-topic but frequently cite facts the retrieved passages do not contain; (b) the assistant gives correct answers about last month's documentation but wrong answers about a feature shipped yesterday; (c) recall@10 is 0.95 and faithfulness is 0.92, yet 3 percent of live requests time out during traffic peaks; (d) the first relevant document is almost always returned, but answers often miss a second relevant source that sits lower in the ranking. Explain why fixing the wrong layer would leave each symptom unchanged.

Exercise 36.8.2: Extend the Scorecard Coding

Starting from Code 36.8.1, add precision@k and a per-query breakdown so the output prints, for each query, its recall, reciprocal rank, and nDCG before the averages. Then add a fourth metric to the system layer: given a freshness log mapping each document to the seconds between its ingestion and its first retrievability, print the p50 and p99 of that lag. Confirm that adding metrics does not change the existing numbers, since every metric must be co-computed from the same retrieval results and timings rather than from separate runs.

Exercise 36.8.3: Where Adding Machines Stops Paying Analysis

In Output 36.8.1, the efficiency $E = S/p$ falls from 0.88 at 8 workers to 0.73 at 16. Estimate the serial fraction $f$ of the evaluation job by fitting Amdahl's law $S = 1 / (f + (1-f)/p)$ to the measured speedups, and use it to predict the speedup at 32 and 64 workers. At what worker count does the marginal speedup from doubling the cluster fall below 1.2, and what does that imply for how many machines you should rent to run nightly offline evaluation? Tie your answer to the communication-cost reasoning of Chapter 3.