"I started as a notebook over three thousand documents. They kept feeding me the web, and one stage at a time I learned to be a cluster."
A Prototype, Dreaming of the Cluster It Will Grow Into
This chapter taught the web-scale RAG pipeline as a sequence of distributed stages; this final section hands it back to you as a buildable project that grows a single-machine prototype into that pipeline one stage at a time, with a measurable milestone at every step. You will start where every practitioner starts, a single notebook that crawls, cleans, indexes, embeds, retrieves, and generates over a few thousand documents on one machine. Then you will scale each stage out in the order the chapter introduced it: distributed crawl and dedup (Section 36.2, Section 36.3), distributed indexing (Section 36.4), distributed embedding (Section 36.5), sharded scatter-gather retrieval (Section 36.6), a RAG serving fleet (Section 36.7), and an evaluation harness that proves nothing broke (Section 36.8). The point of the exercise is not the corpus; it is that one project, carried end to end, touches all six axes of distribution from Section 1.2. The project is the book in miniature: one system that forces you to make, and defend, every distribution decision the rest of the book makes for you.
A case study read passively teaches the shape of a system; a case study rebuilt teaches its cost. The eight sections before this one walked the web-scale RAG pipeline as a finished artifact, naming the distributed mechanism behind each stage and the chapters that own it. This section inverts that posture. It gives you a staged construction plan in which you begin with a baseline small enough to fit one machine and large enough to be honest, then remove one single-machine assumption per stage, measuring what the distribution bought you before moving on. Each stage draws on a specific earlier chapter, so the project doubles as a guided tour back through the book: when you shard the index you are applying Chapter 25, when you distribute embedding you are applying Chapter 15, and when you serve the fleet you are applying Chapter 24. The discipline that makes the project worth its time is the same one Section 1.1 opened with: distribute a stage only when a ceiling forces it, and prove with a number that the distribution helped.
1. The Baseline You Scale Out From Beginner
Every honest scale-out project begins with a single-machine baseline, for two reasons. The first is correctness: the distributed system must return the same answers as the baseline, and you cannot check that against a baseline you never built. The second is measurement: speedup, the headline metric of Chapter 3, is defined only relative to a one-machine reference time. Your baseline is a single process over a few thousand documents that performs the entire pipeline in memory: crawl a seed list, clean and deduplicate the text, build an inverted index and an embedding matrix, retrieve the top passages for a query, and template them into a prompt for a generator. The corpus is small enough to fit one machine and varied enough that retrieval quality is meaningful; a few thousand Wikipedia or documentation pages is the right size.
The code below is that baseline compressed to one dependency-free file. It embeds the corpus, retrieves by brute-force nearest neighbor, then immediately shows the first scale-out move, sharding the index and replacing the brute-force scan with scatter-gather (Section 36.6), and verifies that the sharded answer is identical to the baseline. That identity, the retrieval analogue of the gradient identity in Section 1.1, is the invariant your project must preserve at every stage.
import hashlib, math, re
DIM = 64 # toy embedding dimension; a real encoder emits 384-1024
DOCS = {
0: "data parallelism splits training gradients across many worker machines",
1: "an inverted index maps each term to a posting list of documents",
2: "scatter gather retrieval fans a query out to every shard then merges",
3: "distributed embedding generation encodes the corpus across a gpu fleet",
4: "rag retrieves passages then conditions a language model on them",
5: "sharding partitions the index so no single machine holds the whole corpus",
}
def _h(tok): # deterministic hash so the demo is reproducible
return int(hashlib.md5(tok.encode()).hexdigest(), 16)
def embed(text): # hashing bag-of-words: a stand-in for an encoder
v = [0.0] * DIM
for tok in re.findall(r"[a-z0-9]+", text.lower()):
v[_h(tok) % DIM] += 1.0
n = math.sqrt(sum(x * x for x in v)) or 1.0
return [x / n for x in v]
def cosine(a, b):
return sum(x * y for x, y in zip(a, b))
corpus = {d: embed(t) for d, t in DOCS.items()}
query = "scatter gather retrieval fans a query out to every shard"
q = embed(query)
# Baseline: one machine, brute-force nearest neighbor over the whole corpus.
baseline = sorted(corpus.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
# First scale-out move: shard the index, scatter the query, gather top-k, merge.
SHARDS, K = [[0, 1, 5], [2, 3, 4]], 3
def shard_topk(ids, qv, k):
return sorted(((d, cosine(qv, corpus[d])) for d in ids),
key=lambda kv: kv[1], reverse=True)[:k]
partials = [shard_topk(s, q, K) for s in SHARDS] # parallel across shards
merged = sorted((p for part in partials for p in part),
key=lambda kv: kv[1], reverse=True)[:K]
print("baseline top-k :", [d for d, _ in baseline[:K]])
print("sharded top-k :", [d for d, _ in merged])
print("identical :", [d for d, _ in merged] == [d for d, _ in baseline[:K]])
context = "\n".join(f"[{i+1}] {DOCS[d]}" for i, (d, _) in enumerate(merged))
print("\nprompt context:\n" + context)
baseline top-k : [2, 1, 0]
sharded top-k : [2, 1, 0]
identical : True
prompt context:
[1] scatter gather retrieval fans a query out to every shard then merges
[2] an inverted index maps each term to a posting list of documents
[3] data parallelism splits training gradients across many worker machines
The temptation is to start distributed, because distributed is the interesting part. Resist it. Without a single-machine baseline you cannot compute a speedup, cannot detect a recall regression, and cannot tell whether the distributed system is correct or merely plausible. The identical top-$k$ in Output 36.9.1 is exactly the kind of check a baseline makes possible. The baseline is cheap, it runs in seconds, and every milestone in this project is measured against it. A scale-out project without a baseline is not a project; it is a hope.
2. Staging the Scale-Out, Milestone by Milestone Intermediate
With the baseline in hand, you scale the pipeline out one stage at a time, in the order of Figure 36.9.1, never advancing until the current stage hits its milestone. The discipline of one-stage-at-a-time matters because it isolates cause and effect: when throughput jumps or recall drops, exactly one thing changed, and you know which chapter's mechanism to blame. Table 36.9.1 is the project plan. Each row names the stage, the binding ceiling that forces the distribution, the chapter that supplies the mechanism, and the measurable milestone that tells you the stage is done.
| Stage | Binding ceiling | Mechanism (chapter) | Milestone to hit |
|---|---|---|---|
| 1. Crawl + clean + dedup | Data too big / slow for one node | MapReduce, MinHash (Ch 6, Section 36.3) | Near-linear crawl throughput; dedup at corpus scale |
| 2. Distributed indexing | Index exceeds one node's memory | Sharded inverted index (Ch 25, Section 36.4) | Corpus larger than any single node holds |
| 3. Distributed embedding | Encoding the corpus is GPU-bound | Data-parallel encode (Ch 15, Section 36.5) | Efficiency $E \ge 0.8$ across the GPU fleet |
| 4. Sharded retrieval | Query latency over a huge index | Scatter-gather, ANN (Ch 12, Ch 25) | Recall@$k$ held within 1 point of baseline |
| 5. RAG serving fleet | Request volume exceeds one server | Replicated serving (Ch 24, Section 36.7) | p99 latency under the SLO at target QPS |
| 6. Evaluation harness | Quality must survive distribution | Distributed eval (Ch 5, Section 36.8) | Answer quality and cost per query reported |
The milestones are quantitative on purpose. "Scaled out the crawler" is not a milestone; "crawl throughput rose from one node's rate to within ten percent of eight times that rate" is. Stages 1 through 3 are throughput-and-capacity stages, judged by speedup and the size of corpus they unlock. Stage 4 is a quality stage, where the move from exact brute-force search to approximate nearest neighbor (Chapter 12) trades a little recall for a large latency win, and the milestone forbids trading away more than one point of recall. Stage 5 is a tail-latency stage, where replication (Chapter 24) buys throughput under a hard p99 budget. Stage 6 closes the loop by proving, with the distributed evaluation methods of Chapter 5, that the fully distributed pipeline still answers questions as well as the baseline did, at a cost per query you can defend.
Who: A graduate student building this exact project as a term capstone on a four-node lab cluster plus two cloud GPUs.
Situation: The single-machine baseline over four thousand documentation pages ran the whole pipeline in under a minute and answered queries acceptably.
Problem: The assignment required a web-scale design, but jumping straight to a distributed everything produced a system that was slower than the baseline and impossible to debug.
Dilemma: Rebuild the whole thing distributed in one leap, fast to write but a black box when it underperformed, or scale one stage at a time against milestones, slower to start but debuggable throughout.
Decision: The staged plan of Table 36.9.1. The student scaled the embedding stage first because profiling showed encoding dominated wall-clock, and stopped at each milestone before touching the next stage.
How: Data-parallel encoding across the two GPUs hit efficiency $0.86$; the index was then sharded across the four nodes; retrieval moved to approximate nearest neighbor with recall held within half a point; the serving fleet was three replicas behind a load balancer.
Result: The final pipeline matched the baseline's answer quality on the held-out questions, ran roughly seven times faster end to end, and, because each stage was measured, every number in the writeup was defensible.
Lesson: Profile to find the binding stage, scale that one first, and never advance past an unmet milestone. A staged build is the only kind whose final numbers you can trust.
3. The Numbers Your Project Must Hit Intermediate
A scale-out project lives or dies by whether the speedups it claims are real, so each milestone is a target you compute, not a feeling. Speedup on $p$ machines is the baseline time over the distributed time, $S(p) = T_1 / T_p$, and parallel efficiency is that speedup per machine, $E(p) = S(p) / p$. For the embedding and crawl stages, which are nearly embarrassingly parallel, your milestone is high efficiency, say $E(p) \ge 0.8$, meaning eight machines deliver at least the work of six and a half. Amdahl's law from Chapter 3 sets the ceiling: if a fraction $f$ of the pipeline is inherently serial (the final merge, the single generator call), then
$$S(p) = \frac{1}{f + \dfrac{1 - f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f}.$$This equation is the reality check on every milestone. If the serial fraction of your pipeline is $f = 0.1$, no number of machines takes you past a tenfold speedup, so a milestone demanding twentyfold is unphysical and the right response is to shrink $f$, not add nodes. For retrieval, the relevant number is sizing rather than speed: a corpus of $N$ documents with $b$ bytes per embedding needs $N b$ bytes of index, and the shard count is $\lceil N b / M \rceil$ for a per-node memory budget $M$. Concretely, $N = 10^9$ passages at $b = 384 \times 4$ bytes is about $1.5$ terabytes of vectors, so with $M = 64$ gigabytes per node you need at least $\lceil 1.5 \times 10^{12} / (64 \times 10^9) \rceil = 24$ shards before any redundancy, which is the stage-2 capacity argument from Section 36.4 made arithmetic. Compute these targets before you build, so the milestone is a prediction you test rather than a result you rationalize.
The hand-rolled scatter-gather in Code 36.9.1 is for understanding; in the real project each stage maps to a framework that handles the distribution for you. Code 36.9.2 names that mapping, turning the staged plan into a near-deployment plan exactly as the design checklist did in Section 1.8:
# Each staged milestone -> the production tool that owns it.
STACK = {
"crawl + dedup": "Spark / Ray Data # shard the crawl, MinHash-LSH dedup",
"distributed index":"pyserini / Elastic # build + shard an inverted index",
"distributed embed":"sentence-transformers + torch DDP # data-parallel encode",
"sharded retrieval":"FAISS-IVF / Qdrant # ANN shards, scatter-gather merge",
"RAG serving fleet": "vLLM + Ray Serve # replicate, batch, autoscale",
"evaluation": "Ray / RAGAS # distributed recall + answer scoring",
}
for stage, tool in STACK.items():
print(f"{stage:20s}-> {tool}")
4. Extension Challenges Worth the Cluster Advanced
Once the six stages hit their milestones you have a working distributed RAG pipeline, and the project becomes a platform for the harder questions the chapter only gestured at. Each extension below adds one capability that a real web-scale system needs, and each reaches into a different part of the book, so finishing them turns the capstone from a pipeline into a system. Add hybrid retrieval, fusing the lexical BM25 scores of Section 36.4 with the dense vector scores of Section 36.6 by reciprocal-rank fusion, and measure the recall lift over either signal alone. Add a reranking stage, a cross-encoder that rescores the top candidates from each shard before generation, and quantify the precision gain against the added latency it costs the stage-5 budget. Handle incremental updates, the problem of a corpus that grows after the index is built, by designing a segment-merge scheme so new documents become searchable without rebuilding every shard, the streaming counterpart from Chapter 9.
Two further extensions matter only when the system is multi-tenant, and they connect the project to the reliability and privacy material of Part VII. If your retrieval fleet aggregates results from shards you do not fully trust, add Byzantine-robust merging so a single corrupted shard cannot poison the top-$k$, the robust-aggregation idea from Chapter 35. If the corpus contains tenant-private documents and you compute corpus-level statistics across tenants, add differential privacy to those statistics so no single document leaks through an aggregate, the mechanism introduced in Chapter 14 and deepened in Chapter 35. Each extension is a small, bounded change to a working system, which is exactly the posture in which distributed-systems concepts are learned best: against a baseline you can measure, in a pipeline you already understand.
The extensions above track live research lines, so a capstone that implements them is working at the current edge. Hybrid and learned retrieval has moved past simple fusion toward learned sparse representations such as SPLADE and late-interaction models in the ColBERT lineage, which keep dense quality at lexical-index cost and are being pushed to billion-passage shards. On the generation side, long-context models have reopened the question of how much to retrieve versus how much to stuff into the prompt, and retrieval-aware training (the RETRO and Self-RAG lineage) folds the retriever and generator into one trained system rather than a pipeline of two. Incremental and streaming indexing is an active systems problem as corpora refresh continuously; fresh-index research targets sub-minute document visibility at web scale. Finally, agentic RAG, where a planning agent issues multiple retrieval rounds and reasons over them, is the bridge from this chapter to the agentic case study in Chapter 40; treating the retriever as a tool an agent calls repeatedly is, as of 2026, the most active frontier in the whole pipeline.
5. Chapter Summary and What You Built Beginner
This section closes Chapter 36, so it is worth stating the through-line the whole chapter built. We began with the problem definition (Section 36.1): answer questions over a web-scale text corpus that no single machine can hold, crawl, clean, index, embed, or serve. From there the chapter walked the pipeline stage by stage, and every stage was the same move applied to a different part of the system, partition the work across machines, move only what must be moved between them, and recombine the result correctly. The crawl, clean, and dedup stages (Sections 36.2, 36.3) distributed the data, applying the MapReduce model of Chapter 6 and MinHash deduplication at corpus scale. Distributed indexing (Section 36.4) and sharded retrieval (Section 36.6) partitioned the index and turned a query into the scatter-gather pattern of Chapter 25. Distributed embedding (Section 36.5) applied the data parallelism of Chapter 15 to encode the corpus across a GPU fleet. RAG integration across a fleet (Section 36.7) replicated the serving path of Chapter 24, and evaluation (Section 36.8) proved the distributed pipeline preserved the quality of the single-machine reference. The chapter is, end to end, one distributed system that spans all six axes of Section 1.2: data, training, model, inference, cluster coordination, and, once an agent drives the retriever, intelligence.
The book's spine is that AI at scale is the engineering of systems whose data, computation, models, inference, and decisions are distributed across many machines, and that each distribution is forced by a ceiling, not chosen for elegance. Web-scale RAG is the clearest single demonstration of that thesis in the book, because one pipeline hits every ceiling in turn: the corpus forces distributing data, the encoder forces distributing training-style compute, a large generator forces distributing the model, the request volume forces distributing inference, the fleet forces cluster coordination, and the agentic extension forces distributing intelligence. The staged project in this section is the thesis made buildable: you do not read about the six axes, you implement them, one ceiling at a time, and watch the single-machine prototype become the cluster it was always going to be. That is the bridge into the remaining case studies, each of which leans hard on a different subset of these same axes.
Web-scale RAG is not six unrelated tricks; it is one distributed system in which every stage is the same partition-move-recombine move applied to a different resource. (1) Distribute the data to crawl, clean, and deduplicate a corpus no node can hold. (2) Shard the index so the corpus exceeds any single machine. (3) Data-parallel the embedding so encoding scales near-linearly with the GPU fleet. (4) Scatter-gather retrieval to query the sharded index while holding recall. (5) Replicate the serving path so request volume meets a hard tail-latency budget. (6) Evaluate across the fleet to prove distribution did not cost quality. Built in this order against milestones, the pipeline exercises all six axes of distribution, which is why it is the book in miniature.
Each idea is sized so that carrying it through the staged plan of Table 36.9.1 becomes a capstone in the sense of Chapter 41. Core build: start from the Code 36.9.1 baseline over a few thousand documents and scale all six stages out on a small cluster (or cloud spot instances), recording the speedup, recall, and cost milestone at each stage; the deliverable is a writeup in which every number is measured against the baseline. Hybrid retrieval: add lexical BM25 to the dense retriever and fuse with reciprocal-rank fusion, reporting the recall lift over either signal alone. Reranking: insert a cross-encoder reranker before generation and plot the precision gain against the latency it adds to the stage-5 p99 budget. Incremental updates: design a segment-merge scheme so new documents become searchable without rebuilding the shards, borrowing the streaming machinery of Chapter 9. Trust and privacy (multi-tenant): add Byzantine-robust top-$k$ merging (Chapter 35) so one corrupt shard cannot poison results, and differentially private corpus statistics (Chapter 14) so no private document leaks through an aggregate. Agentic extension: wrap the retriever as a tool an agent calls across multiple reasoning rounds, the bridge to Chapter 40.
For each of the six stages in Table 36.9.1, state the single binding ceiling (data, model, or throughput) that forces the distribution, and the axis from Section 1.2 it maps to. Then identify the one stage whose milestone is a quality target rather than a speed target, and explain why distributing that stage can make the answer worse, not just slower, if you push the approximation too far. Finally, argue which stage you would scale out first on a real corpus and why profiling, not intuition, should decide that order.
Starting from Code 36.9.1, (a) add a lexical scorer (a simple term-overlap or BM25 score) and fuse it with the cosine score by reciprocal-rank fusion, then show on a handful of queries where the fused ranking differs from the dense-only ranking. (b) Add a third shard and confirm the scatter-gather top-$k$ still equals the brute-force baseline. (c) Replace the exact per-shard scan with an approximate one that scores only a random half of each shard's documents, and measure how often the approximate top-$k$ still matches the exact top-$k$. Report your recall as a fraction over twenty queries and relate it to the stage-4 milestone in Table 36.9.1.
A target corpus has $N = 2 \times 10^9$ passages, each embedded to a $768$-dimensional float32 vector. (a) Compute the total vector volume and the minimum shard count for a per-node memory budget of $M = 48$ gigabytes, before any replication. (b) Suppose the end-to-end pipeline has a serial fraction $f = 0.08$ (the final merge plus one generator call). Using Amdahl's law from Section 3, compute the maximum achievable speedup and the speedup at $p = 16$ and $p = 64$ machines. (c) Given those numbers, state the largest worker count at which parallel efficiency $E(p)$ still exceeds $0.7$, and explain what this implies for which stage-1-through-3 milestone in Table 36.9.1 is actually reachable.