Section 25.9: Evaluating Distributed Retrieval

"They told the dashboard I answered in two milliseconds. They forgot to mention I found the wrong document."
An ANN Index Tuned a Little Too Aggressively

Big Picture

A distributed retrieval system has two scoreboards that you must read at the same time: did it find the right items (quality), and did it do so fast and cheaply enough to serve (systems)? Every preceding section of this chapter added a knob that trades one against the other. Approximate nearest-neighbor search trades recall for speed; sharding trades a single latency for a scatter-gather whose tail is set by the slowest shard; caching trades freshness for throughput. Because these knobs couple, no single number describes the system. The discipline this section teaches, the same construct-matched evaluation discipline of Chapter 5, is to report quality and cost as a joint frontier rather than as separate headlines, so that "this retriever is fast" can never hide "at a recall nobody would accept".

Across this chapter we built a retrieval pipeline out of distributed parts: embeddings computed on a fleet of encoders, vectors stored in a sharded database, an approximate index that answers nearest-neighbor queries in sublinear time, replicas that absorb load, a scatter-gather that fans a query across shards, a reranker that sharpens the top results, and caches that short-circuit the repeats. Each part has a dial, and turning a dial moves both quality and cost. The final engineering question is therefore not "is it correct?" but "where on the trade-off surface does it sit, and is that the point we want?" Answering it requires measuring the system along two axes that most teams are tempted to measure separately, and that is exactly the temptation this section exists to break.

1. Two Scoreboards: Quality and Systems (Beginner)

The first axis is retrieval quality: of the items the system returned, how many were the ones it should have returned, and were the best ones near the top? The second axis is systems behavior: how many queries per second can the fleet sustain, what is the tail latency of a query, how long does the index take to build, how much memory does it occupy, and what does a single query cost. A retrieval system is healthy only when both scoreboards are healthy at once, and the central lesson of distributed retrieval is that the two are linked by the very approximations that make scale-out possible. You cannot tune one in isolation without silently moving the other.

This is the construct-matching point from Chapter 5 in its sharpest form. A metric is only meaningful next to the configuration that produced it. Recall measured at one approximate-search setting and latency measured at another describe two different systems and cannot be compared; they must be co-measured in one pass on one configuration. The rest of this section names the metrics on each scoreboard, then shows how to bind them together.

2. Retrieval-Quality Metrics (Beginner)

The headline quality metric for distributed retrieval is recall@k, because approximate search exists precisely to trade recall for speed. Let $R_q$ be the set of truly relevant (or, for ANN, the exact-nearest) items for query $q$, and let $\hat{R}_q^{(k)}$ be the top $k$ the system returned. Recall@k averages, over a query set $Q$, the fraction of the true items that were recovered,

$$\text{recall@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{|\,\hat{R}_q^{(k)} \cap R_q\,|}{\min(k, |R_q|)}.$$

For an ANN index the natural ground truth is the exact brute-force top-$k$, so recall@k measures how much of the exact answer the approximation kept; a recall of $0.9$ at $k=10$ means the index recovered nine of the ten exact neighbors on average. Precision@k, the complementary fraction of returned items that are relevant, matters more when retrieval feeds a downstream consumer with a fixed budget of slots. When the order within the top $k$ matters, as it does when a reranker or a generator reads the list top-down, ranking-aware metrics take over: nDCG discounts a relevant item by the logarithm of its rank so that a correct answer at position one counts more than the same answer at position ten, and MRR rewards getting the single first relevant item as high as possible. These are the same ranking metrics that govern reranking quality in Section 25.7, now read as evaluation rather than as a training objective.
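The per-query pieces of these metrics are short enough to write out. The sketch below is illustrative only: it assumes binary relevance, and the function names and toy document ids are made up for this example. Averaging each function over a query set $Q$ gives the reported recall@k, precision@k, nDCG@k, and MRR.

```python
import math

def recall_at_k(returned, relevant, k):
    """Fraction of the relevant set recovered in the top k, normalised by min(k, |relevant|)."""
    if not relevant:
        return 0.0
    hits = len(set(returned[:k]) & relevant)
    return hits / min(k, len(relevant))

def precision_at_k(returned, relevant, k):
    """Fraction of the k returned slots that hold a relevant item."""
    return len(set(returned[:k]) & relevant) / k

def ndcg_at_k(returned, relevant, k):
    """Binary-relevance nDCG: a hit at 1-based rank r contributes 1 / log2(r + 1)."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc in enumerate(returned[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0

def reciprocal_rank(returned, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was returned."""
    for r, doc in enumerate(returned, start=1):
        if doc in relevant:
            return 1.0 / r
    return 0.0

# Toy query with made-up ids: three relevant documents, five returned.
returned = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d5"}
print("recall@5   ", recall_at_k(returned, relevant, 5))     # 2 of 3 relevant recovered
print("precision@5", precision_at_k(returned, relevant, 5))  # 2 of 5 slots relevant
print("nDCG@5     ", round(ndcg_at_k(returned, relevant, 5), 4))
print("RR         ", reciprocal_rank(returned, relevant))    # first hit at rank 2
```

Note how the same list scores differently on each metric: recall and precision only count the two hits, while nDCG and the reciprocal rank also penalize the fact that neither hit sits at position one.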

Recall, precision, and nDCG measure the retriever in isolation. When retrieval feeds a generator, as it does in retrieval-augmented generation, the metric that ultimately matters is the quality of the generated answer, which depends on retrieval only through the context it supplies. Two end-to-end measures dominate here: groundedness (is every claim in the answer supported by a retrieved passage?) and faithfulness (does the answer avoid asserting anything the retrieved context does not support?). Both are commonly scored with an LLM-as-judge that reads the answer, the context, and the question and rates support, a technique we treat with its calibration caveats in Chapter 5. The crucial structural fact is that a retrieval change can lift recall yet leave answer quality flat (the generator already had enough context) or, worse, raise recall of near-duplicates that crowd out the one passage the answer needed. Quality must therefore be measured at the layer the user actually experiences, not only at the retriever's own output.

Key Insight: Recall Is a Knob, Not a Property

In an exact index, recall is fixed at one. In an approximate index it is a dial you set with nprobe (IVF) or efSearch (HNSW), the same dials introduced in Section 25.4. Turning the dial up recovers more of the exact neighbors and costs more compute and latency; turning it down is faster and recovers fewer. Because recall is a chosen operating point rather than a property of the system, a recall number is meaningless without the latency it was bought at. Report the pair, never the single number.

3. Systems Metrics (Intermediate)

The systems scoreboard begins with query throughput (QPS), the sustained rate the fleet serves under a fixed latency ceiling, and tail latency, reported at the p99 rather than the mean because a user feels the slow queries, not the average ones. Tail latency is where the distributed structure of the system shows itself most plainly: a query fanned across $S$ shards in the scatter-gather of Section 25.5 is only as fast as its slowest shard, so the p99 of the merged query is set by the far tail of the per-shard latency distribution (the slowest of $S$ draws), not by a typical shard's p99. This is the fan-out tail-amplification effect, and it is why adding shards to raise capacity can quietly raise tail latency even as it lowers mean latency.
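A few lines of arithmetic make the amplification concrete. The sketch below assumes shard latencies are independent, so the chance that at least one of $S$ shards exceeds its own p99 is $1 - 0.99^S$, and it checks the effect with a small simulation. The log-normal latency law and its parameters are assumptions chosen for illustration, not measurements from any real system.

```python
import random

def prob_any_shard_slow(num_shards, per_shard_quantile=0.99):
    """P(at least one of S independent shards exceeds its own per-shard quantile)."""
    return 1.0 - per_shard_quantile ** num_shards

def percentile(values, q):
    """Simple index-based percentile on a sorted copy of the data."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

def simulate(num_shards, trials=5000, seed=0):
    """Return (single-shard p99, p99 of the max over num_shards shards)."""
    rng = random.Random(seed)

    def draw():
        # Assumed per-shard latency in ms: log-normal, purely illustrative.
        return rng.lognormvariate(0.0, 0.5)

    single = [draw() for _ in range(trials)]
    merged = [max(draw() for _ in range(num_shards)) for _ in range(trials)]
    return percentile(single, 0.99), percentile(merged, 0.99)

for s in (1, 4, 10, 50):
    shard_p99, merged_p99 = simulate(s)
    print(f"S={s:>2}  P(some shard past its p99)={prob_any_shard_slow(s):.3f}  "
          f"shard p99={shard_p99:.2f} ms  merged p99={merged_p99:.2f} ms")
```

Under the independence assumption, the merged p99 sits at roughly the $0.99^{1/S}$ quantile of a single shard, which is why the merged column keeps climbing even though no individual shard got slower.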

Two offline systems metrics round out the picture. Index build time and memory bound how often you can rebuild the index as the corpus grows and how many vectors fit per node; a graph index such as HNSW is fast to query but heavy to build and to hold in memory, while a quantized IVF-PQ index is lighter in memory but gives up some recall, the same construction trade-offs catalogued in Section 25.4. Finally, cost-per-query ties the systems scoreboard to money: it folds the accelerator and memory time a query consumes into a single figure that, tracked over the fleet, is the number a budget owner cares about. Cost-per-query is the retrieval instance of the cost analysis of Section 5.5, and it is the third axis of the frontier this section is building toward.
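Both quantities reduce to back-of-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: cost-per-query is taken as fleet price divided by sustained throughput, and memory counts only raw vector or code storage, ignoring graph links, codebooks, and inverted-list overhead. Every number in the example (node price, fleet size, QPS, corpus size, dimension, number of subquantizers) is hypothetical.

```python
def cost_per_query(node_price_per_hour, num_nodes, sustained_qps):
    """Fleet dollars per hour divided by queries served per hour."""
    if sustained_qps <= 0:
        raise ValueError("sustained_qps must be positive")
    return (node_price_per_hour * num_nodes) / (sustained_qps * 3600.0)

def flat_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Raw float32 vector storage, ignoring any index structure on top."""
    return num_vectors * dim * bytes_per_component

def pq_code_bytes(num_vectors, num_subquantizers, bits_per_code=8):
    """Product-quantized code storage, ignoring codebooks and list overhead."""
    return num_vectors * num_subquantizers * bits_per_code // 8

# Hypothetical fleet: 8 nodes at $2.50/hour sustaining 4000 QPS in total.
print(f"cost per query: ${cost_per_query(2.50, 8, 4000):.8f}")

# Hypothetical corpus: 100 million vectors of dimension 768.
n, d = 100_000_000, 768
print(f"float32 vectors: {flat_vector_bytes(n, d) / 1e9:.1f} GB")
print(f"PQ codes (64 x 8-bit): {pq_code_bytes(n, 64) / 1e9:.1f} GB")
```

In this made-up setting the raw vectors need 307.2 GB while the PQ codes need 6.4 GB, which is the memory side of the trade that the recall column has to pay for.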

4. The Coupling: Report Recall and Latency Together (Intermediate)

The two scoreboards are not independent dashboards to be read side by side; they are a single surface. The cleanest way to see this is to sweep the recall dial and watch latency and cost move with it. The demonstration below builds a small IVF index in pure Python, computes exact brute-force ground truth, and then for each nprobe setting reports recall@10, a p99 latency, and a derived cost-per-query. Latency here is modeled rather than timed: it is a fixed base cost plus a per-distance-evaluation cost, inflated by a constant straggler factor that stands in for the scatter-gather tail. All three numbers are produced in one pass on one configuration, so they are construct-matched by construction. The joint operating point we care about is the cheapest configuration that meets a recall target,

$$\text{operating point} = \arg\min_{c \,\in\, \mathcal{C}} \; \text{cost}(c) \quad \text{subject to} \quad \text{recall@}k(c) \ge \tau,$$

where $\mathcal{C}$ is the set of configurations (here, the nprobe values) and $\tau$ is the recall floor the application demands. Reporting only $\min_c \text{latency}(c)$ would select the fastest configuration regardless of whether it found anything useful, which is exactly the misleading headline this section warns against.

import math, random
random.seed(7)

DIM, N, NQ, NLIST, K = 16, 4000, 200, 64, 10        # dims, corpus, queries, IVF cells, recall@K

def rand_vec(): return [random.gauss(0, 1) for _ in range(DIM)]
def dist2(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

corpus  = [rand_vec() for _ in range(N)]
queries = [rand_vec() for _ in range(NQ)]
centroids = random.sample(corpus, NLIST)             # coarse quantizer

def nearest_cell(v): return min(range(NLIST), key=lambda i: dist2(v, centroids[i]))
cells = [[] for _ in range(NLIST)]
for idx, v in enumerate(corpus): cells[nearest_cell(v)].append(idx)

# exact ground truth: brute-force top-K neighbours per query
ground_truth = [set(sorted(range(N), key=lambda j: dist2(q, corpus[j]))[:K]) for q in queries]

def ivf_search(q, nprobe):                           # search nprobe nearest cells
    probed = sorted(range(NLIST), key=lambda i: dist2(q, centroids[i]))[:nprobe]
    cand = [j for ci in probed for j in cells[ci]]
    work = NLIST + len(cand)                          # distance evals ~ compute ~ cost
    return set(sorted(cand, key=lambda j: dist2(q, corpus[j]))[:K]), work

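# Latency model (illustrative constants): fixed overhead in ms, cost per distance
# evaluation in microseconds, a straggler multiplier, and dollars per millisecond.
BASE_MS, PER_EVAL_US, TAIL, COST_PER_MS = 1.2, 0.9, 1.35, 0.00004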
print(f"{'nprobe':>6} {'recall@10':>10} {'p99_ms':>8} {'$/query':>10}")
rows = []
for nprobe in [1, 2, 4, 8, 16, 32, 64]:
    hits, lat = 0, []
    for q, gt in zip(queries, ground_truth):
        found, work = ivf_search(q, nprobe)
        hits += len(found & gt)
        lat.append((BASE_MS + work * PER_EVAL_US / 1000.0) * TAIL)   # p99 straggler inflation
    recall = hits / (NQ * K); lat.sort()
    p99 = lat[int(0.99 * (len(lat) - 1))]; cost = p99 * COST_PER_MS
    rows.append((nprobe, recall, p99, cost))
    print(f"{nprobe:>6} {recall:>10.3f} {p99:>8.2f} {cost:>10.5f}")

TARGET = 0.90
knee = min([r for r in rows if r[1] >= TARGET], key=lambda r: r[3])   # argmin cost subject to recall >= TARGET
fast = min(rows, key=lambda r: r[2])
print(f"cheapest at recall>={TARGET}: nprobe={knee[0]} (recall={knee[1]:.3f}, p99={knee[2]:.2f} ms)")
print(f"fastest overall:          nprobe={fast[0]} (recall={fast[1]:.3f}): misleading alone")

Code 25.9.1: A construct-matched retrieval evaluation. For every nprobe the same loop produces recall@10, p99 latency, and cost-per-query together, so no number can be quoted at a configuration other than the one it was measured on.

nprobe  recall@10   p99_ms    $/query
     1      0.266     2.04    0.00008
     2      0.408     2.28    0.00009
     4      0.583     2.55    0.00010
     8      0.771     3.10    0.00012
    16      0.930     4.01    0.00016
    32      0.997     5.28    0.00021
    64      1.000     6.56    0.00026
cheapest at recall>=0.9: nprobe=16 (recall=0.930, p99=4.01 ms)
fastest overall:          nprobe=1 (recall=0.266): misleading alone

Output 25.9.1: The recall-latency-cost frontier. The fastest configuration (nprobe 1, p99 2.04 ms) recovers barely a quarter of the true neighbors; the operating point the application actually wants is the knee at nprobe 16, where 93% recall costs roughly twice the latency and twice the dollars of the misleading headline.

The numbers tell the whole story of the section. Read the p99 column alone and nprobe 1 looks like a triumph at two milliseconds; read recall@10 beside it and that triumph collapses to a system that misses nearly three of every four exact neighbors. The honest summary of this index is not a number but a curve: the frontier traced by the rows of Output 25.9.1, on which every team must choose a point. Figure 25.9.1 draws that frontier so the shape of the trade-off is visible at a glance.
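When a sweep covers more than one knob, some configurations are strictly worse than others and do not belong on the curve. One way to extract the frontier and the constrained operating point from a table of results is sketched below. The rows are copied from Output 25.9.1; the helper names are made up for this example, and the infeasible case returns None as one possible design choice.

```python
# Rows copied from Output 25.9.1: (nprobe, recall@10, p99_ms, cost_per_query)
ROWS = [
    (1, 0.266, 2.04, 0.00008),
    (2, 0.408, 2.28, 0.00009),
    (4, 0.583, 2.55, 0.00010),
    (8, 0.771, 3.10, 0.00012),
    (16, 0.930, 4.01, 0.00016),
    (32, 0.997, 5.28, 0.00021),
    (64, 1.000, 6.56, 0.00026),
]

def pareto_frontier(rows):
    """Keep rows that no other row beats on both recall (higher) and cost (lower)."""
    frontier = []
    for r in rows:
        dominated = any(
            o[1] >= r[1] and o[3] <= r[3] and (o[1] > r[1] or o[3] < r[3])
            for o in rows
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r[3])

def operating_point(rows, recall_floor):
    """Cheapest row whose recall meets the floor, or None if no row qualifies."""
    feasible = [r for r in rows if r[1] >= recall_floor]
    return min(feasible, key=lambda r: r[3]) if feasible else None

print("frontier nprobe values:", [r[0] for r in pareto_frontier(ROWS)])
print("operating point at recall >= 0.90:", operating_point(ROWS, 0.90))
print("operating point at recall >= 1.01:", operating_point(ROWS, 1.01))  # infeasible floor
```

In a single-knob sweep like this one every row lies on the frontier, because recall and cost rise together; the filter earns its keep once a second knob (index type, shard count, quantization) adds configurations that cost more without recovering more.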

Figure 25.9.1: The frontier from Output 25.9.1, drawn so the three coupled metrics appear at once. Latency runs along the horizontal axis, recall up the vertical axis, and cost-per-query grows with marker size. The fast low-recall point at the lower left (nprobe 1) sits far below the target line; the knee (dark marker, nprobe 16) is the cheapest point that clears the recall target, and the upper-right points buy the last few percent of recall at steeply rising latency and cost. Reporting any single axis hides this shape.

Thesis Thread: Evaluation Is Where the Distribution Becomes Visible

Every metric on the systems scoreboard is a distributed-systems quantity, not a single-machine one. The p99 latency is high because a query fans across shards and waits for the slowest (Section 25.5); throughput scales because replicas absorb load in parallel; cost-per-query aggregates accelerator time across the fleet; index build time is the cost of partitioning the corpus across nodes. Evaluating a retriever well is therefore evaluating a distributed system well, the discipline of Chapter 5 applied to the specific machine this chapter built. The recall-latency frontier is the place where scale-out stops being an implementation detail and becomes a number on a dashboard.

5. Online Evaluation, A/B Testing, and Embedding Drift (Advanced)

Offline metrics on a fixed query set tell you how a change behaves on yesterday's data; they cannot tell you how real users respond to it. For that you need online evaluation: route a fraction of live traffic to the new retriever in an A/B test and compare downstream outcomes (click-through, answer acceptance, task success) against the control, with the experiment infrastructure and shadow-deployment machinery that Chapter 26 develops for the whole fleet. A retrieval change that improves offline recall but degrades the live acceptance rate is a real and common outcome, usually because the offline relevance labels and the live objective measure different things, the construct mismatch of Chapter 5 reappearing online.
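Comparing a downstream outcome between control and treatment comes down to asking whether the observed difference is larger than sampling noise. A minimal sketch follows, assuming the outcome is a binary acceptance per query and that a pooled two-proportion z-test is an adequate approximation. The counts are hypothetical, and a production experiment platform would typically add sequential-testing corrections and guardrail metrics that this example omits.

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """Return (rate_b - rate_a, z statistic, two-sided p-value), pooled normal approximation."""
    p_a, p_b = success_a / total_a, success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the standard normal
    return p_b - p_a, z, p_value

# Hypothetical counts: answers accepted out of queries served in each arm.
control_accepted, control_total = 4100, 10000        # current retriever
treatment_accepted, treatment_total = 3950, 10000    # new retriever with higher offline recall

diff, z, p = two_proportion_z(control_accepted, control_total,
                              treatment_accepted, treatment_total)
print(f"acceptance change = {diff:+.4f}, z = {z:.2f}, p = {p:.4f}")
```

With these made-up counts the treatment arm shows a lower acceptance rate despite its better offline recall, which is the kind of outcome the offline scoreboard alone cannot reveal.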

The harder, slower failure is drift. Embeddings are trained on a snapshot of the world; as the corpus and the queries evolve, the geometry the index was built on stops matching the geometry the queries now inhabit, and recall decays even though no code changed. Detecting this requires monitoring the distribution of query and document embeddings over time and alerting when it shifts, the same drift-detection problem posed for streaming features in Section 9.9, now applied to the embedding space rather than to raw features. The remedy, re-embedding the corpus and rebuilding the index with a refreshed model, is expensive (it is the index-build cost from part 3 of this section paid in full), which is exactly why drift must be measured continuously: you rebuild when the evidence says recall has decayed, not on a fixed calendar that either wastes compute or lets quality rot.
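One simple drift signal is the distance between the centroid of a reference sample of query embeddings and the centroid of each recent window. The sketch below uses cosine distance between centroids on synthetic data. The signal, the dimension, the alert threshold, and the way drift is synthesized are all assumptions for illustration; a centroid comparison is a crude detector that misses shifts which leave the mean unchanged, so a real monitor would likely pair it with a richer distribution distance.

```python
import math
import random

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

def drift_score(reference, window):
    """Cosine distance between the centroids of a reference sample and a recent window."""
    return cosine_distance(centroid(reference), centroid(window))

DIM, THRESHOLD = 32, 0.02          # hypothetical dimension and alert threshold

def sample_window(rng, shift, n=500):
    """Synthetic query embeddings: the mean of the first half of the dimensions slides by `shift`."""
    return [[(1.0 - shift if i < DIM // 2 else 1.0) + rng.gauss(0, 0.5) for i in range(DIM)]
            for _ in range(n)]

rng = random.Random(0)
reference = sample_window(rng, shift=0.0)          # snapshot taken when the index was built
for step, shift in enumerate([0.0, 0.25, 0.5, 0.75, 1.0]):
    score = drift_score(reference, sample_window(rng, shift))
    flag = "ALERT" if score > THRESHOLD else "ok"
    print(f"window {step}: shift={shift:.2f}  drift={score:.4f}  {flag}")
```

An alert from a signal like this is a prompt to re-measure recall against fresh ground truth, not a rebuild trigger on its own; the rebuild decision still rests on the recall evidence.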

Practical Example: The Retriever That Got Faster and Worse

Who: A platform engineer running the retrieval tier for a support-assistant RAG product.

Situation: Latency dashboards were red during peak hours, and the on-call goal for the quarter was to cut p99 retrieval latency below five milliseconds.

Problem: Lowering the HNSW efSearch dial took p99 from nine milliseconds to three and closed the latency ticket immediately.

Dilemma: Ship the fast configuration on the strength of the green latency dashboard, or hold it because recall was never measured at the new setting and the latency dashboard had no quality axis beside it.

Decision: They held it and ran the construct-matched sweep of Code 25.9.1 on their own index, co-measuring recall@10 and p99 at every efSearch value in one pass.

How: The sweep showed the three-millisecond setting sat at 0.71 recall, far below the 0.92 the answer-faithfulness judge needed; the knee that met the recall floor landed at four milliseconds, still inside the budget.

Result: They shipped the four-millisecond knee, met the latency goal, and held answer quality flat; a later A/B test confirmed live acceptance was unchanged, where the three-millisecond setting would have dropped it.

Lesson: A latency dashboard without a quality axis beside it is an invitation to ship a fast wrong answer. Co-measure, then choose the knee.

Research Frontier: Benchmarking Retrieval and RAG (2024 to 2026)

Retrieval evaluation has consolidated around shared benchmarks. BEIR (Thakur et al.) made zero-shot retrieval comparable across eighteen heterogeneous datasets, and MTEB (Muennighoff et al.) extended this into a broad embedding leaderboard whose 2024 to 2025 multilingual and long-context editions (MMTEB) now drive most public embedding-model choices. For the generation side, RAGAS (Es et al., 2024) popularized reference-free LLM-as-judge metrics for faithfulness, answer relevance, and context precision, and frameworks such as ARES and the 2024 to 2026 wave of RAG-evaluation surveys probe how well these automated judges track human ratings, since a miscalibrated judge silently corrupts the quality scoreboard. The open frontier is jointly benchmarking retrieval quality and serving cost: most leaderboards still report recall and nDCG with no latency or dollar axis, exactly the single-number trap this section argues against, and recall-latency-cost frontier reporting (in the spirit of the ann-benchmarks project) is only beginning to reach RAG evaluation.

Library Shortcut: BEIR, MTEB, and RAGAS Compute These Metrics for You

Code 25.9.1 computed recall by hand to expose its definition. In practice you do not hand-roll retrieval metrics: beir evaluates a retriever against standard corpora and emits recall, nDCG, and MAP in a few lines; mteb benchmarks an embedding model across dozens of tasks with one evaluate call; and ragas scores end-to-end RAG faithfulness and context precision with LLM-judges. What you must still supply yourself is the systems axis, because these libraries report quality only:

# pip install beir ragas
from beir.retrieval.evaluation import EvaluateRetrieval
# You supply qrels[qid][doc_id] = relevance label and results[qid][doc_id] = score;
# results must come from YOUR sharded retriever, measured WITH its p99
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values=[10])
print(recall["Recall@10"])          # quality axis from the library

from ragas import evaluate
from ragas.metrics import faithfulness, context_precision
# dataset (also supplied by you): questions, retrieved contexts, generated answers,
# and any references the chosen metrics need
rag_scores = evaluate(dataset, metrics=[faithfulness, context_precision])   # answer-quality axis
# the latency / cost-per-query axis you still measure yourself, in the same run

Code 25.9.2: The dozen lines of metric arithmetic in Code 25.9.1 collapse to one EvaluateRetrieval.evaluate call, and RAGAS adds the answer-quality axis; the library handles the metric definitions, but you must still co-record latency and cost in the same run to keep the frontier construct-matched.
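Supplying the systems axis yourself can be as small as timing each retrieval call in the same loop that scores it. The sketch below is one way to do that, assuming a retriever exposed as a callable `search_fn(query, k)` that returns a ranked list of ids. That interface, the helper name `co_measure`, and the toy dictionary retriever are all hypothetical stand-ins for your own service.

```python
import time

def co_measure(search_fn, queries, ground_truth, k=10):
    """Time every query and score its recall in the same pass; return both axes together."""
    latencies_ms, recalls = [], []
    for q, truth in zip(queries, ground_truth):
        start = time.perf_counter()
        returned = search_fn(q, k)                      # the call being evaluated
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if truth:                                       # skip queries with no known answers
            recalls.append(len(set(returned[:k]) & truth) / min(k, len(truth)))
    latencies_ms.sort()
    p99 = latencies_ms[int(0.99 * (len(latencies_ms) - 1))]
    mean_recall = sum(recalls) / len(recalls) if recalls else 0.0
    return {"recall@k": mean_recall, "p99_ms": p99, "n_queries": len(latencies_ms)}

# Toy stand-in retriever: a dictionary lookup with made-up ids.
toy_index = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}

def toy_search(query, k):
    return toy_index[query][:k]

queries = ["q1", "q2"]
ground_truth = [{"a", "c"}, {"y", "w"}]
print(co_measure(toy_search, queries, ground_truth, k=3))
```

Because the latency list and the recall list are filled by the same calls, the two numbers in the returned dictionary cannot come from different configurations, which is the construct-matching property the library metrics alone do not give you.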

Fun Note: The Benchmark That Forgot to Bring a Clock

A retrieval leaderboard once crowned an index that reached state-of-the-art recall. The fine print revealed the winning configuration searched every shard exhaustively and took half a second per query, which is to say it had quietly turned the approximate index back into a brute-force scan and then claimed the speed trophy was irrelevant. The lesson the field keeps relearning: a recall column with no latency column beside it is a high-score table for a game nobody is actually playing.

6. Chapter Summary (Beginner)

This chapter built retrieval-augmented generation as a distributed pipeline and then taught you to evaluate it as one. The arc ran from the embeddings produced on a fleet of encoders (Section 25.2), through the vector databases that store them (Section 25.3) and the approximate-nearest-neighbor index families that search them in sublinear time (Section 25.4), to the sharding and replication that let the index scale and the scatter-gather that fans a query across shards with its tail-latency cost (Section 25.5). On top of that base we layered hybrid search that fuses lexical and dense signals (Section 25.6), retrieve-then-rerank cascades that sharpen the top results (Section 25.7), and the layered caches that absorb the repeats (Section 25.8). This final section closed the loop: a retrieval system is evaluated on recall and latency jointly, never on either alone, because the approximations that buy scale-out also set the recall, and a fast retriever at low recall is a fast wrong answer.

Key Takeaway: RAG Is a Distributed Pipeline, Evaluated on a Frontier

Retrieval-augmented generation is not a model, it is a distributed system: distributed embedding, vector databases, ANN index families (HNSW, IVF, PQ), sharding and replication with scatter-gather and its tail latency, hybrid lexical-plus-dense search, retrieve-then-rerank cascades, and layered caching. Each stage adds a knob that trades quality against cost. The discipline that holds the whole pipeline honest is to evaluate it on recall and latency and cost together, as a single frontier, and to choose the knee that meets your recall floor at the lowest cost, never the fastest point on a dashboard that forgot to plot recall.

Exercise 25.9.1: Why the Single Number Lies (Conceptual)

Using only Output 25.9.1, explain to a manager who wants "the fastest retriever" why shipping the nprobe-1 configuration would be a mistake, in terms a non-specialist understands. Then state the one additional number you would put on the latency dashboard so the mistake becomes impossible to make, and explain why a mean latency would be the wrong systems number to pair it with instead of the p99.

Exercise 25.9.2: Move the Knee (Coding)

Extend Code 25.9.1 to also report nDCG@10 against the exact-neighbor ground truth (rank-discount the recovered neighbors by $1/\log_2(\text{rank}+1)$). Then sweep the recall target $\tau$ over $\{0.80, 0.90, 0.95, 0.99\}$ and print the knee (cheapest configuration meeting each target). Describe how the chosen nprobe and the cost-per-query change as $\tau$ rises, and identify the value of $\tau$ beyond which the last percent of recall more than doubles the cost.

Exercise 25.9.3: Budget the Tail (Analysis)

A query fans across $S = 16$ shards in a scatter-gather. Suppose each shard's latency is independent with a p99 of $4$ ms, and the query waits for the slowest shard. Argue qualitatively why the p99 of the merged query is larger than $4$ ms, and explain how this fan-out tail amplification interacts with the nprobe knob from Output 25.9.1: if raising nprobe to hit a recall target also raises each shard's per-query work, what happens to the merged p99, and what does that imply for choosing the operating point on a sharded index versus a single-node one? Tie your answer to the scatter-gather of Section 25.5.

Project Ideas

Three projects to carry the chapter from reading into building. Each one ends in a frontier or a dashboard, not a single number.

1. The recall-latency-cost frontier of a real sharded ANN service. Stand up a small FAISS or HNSWlib index over a public corpus (for example a BEIR dataset), shard it across two or more processes with a scatter-gather merge, and sweep the recall dial (nprobe or efSearch). For each setting co-measure recall@10, p99 latency, and an estimated cost-per-query in one pass, then plot the three-axis frontier of Figure 25.9.1 from your own numbers. Mark the knee for a recall target you justify, and write one paragraph defending why that point, not the fastest one, is what you would ship.

2. An end-to-end RAG evaluation harness. Wire a retriever to a generator and score the pipeline with RAGAS (faithfulness, context precision) on a question set, while simultaneously logging retrieval recall@10 and the end-to-end p99. Then deliberately lower the recall dial and chart how answer faithfulness degrades as recall falls, finding the recall floor below which the generator can no longer ground its answers. The deliverable is the recall-versus-faithfulness curve that tells you how much retrieval quality the generator actually needs.

3. An embedding-drift monitor. Embed a corpus, build an index, then synthesize drift by shifting the query distribution over time (new topics, new vocabulary). Track an embedding-distribution distance over the stream in the spirit of Section 9.9, alert when it crosses a threshold, and measure how recall@10 against fresh ground truth decays before and after a rebuild. The deliverable is a drift signal that fires early enough to trigger a re-index before users feel the recall loss.