Section 36.1: Problem Definition | Building Scalable AI

"One user pressed Enter. By the time the answer came back, I had fanned the question out across a thousand shards, gathered their best guesses, and reassembled myself before anyone noticed I was never in one place to begin with."
A Single Query, About to Fan Out Across a Thousand Shards

Big Picture

A retrieval-augmented generation system over the open web is the moment every axis of distribution in this book is forced to operate at once, in a single product, under a clock. The corpus is billions of documents that no machine can hold; the embeddings that make it searchable cost thousands of accelerator-hours to compute; the index that serves them spans hundreds of nodes; and the generator that reads the retrieved passages must answer thousands of users per second within a latency budget a person will actually wait for. This chapter takes one such system apart end to end. This opening section states the problem precisely, fixes the scale numbers that make a single machine impossible, decomposes the pipeline into the stages the rest of the chapter builds, and maps each stage onto the six axes of distribution from Chapter 1. Everything that follows is engineering against the requirements written down here.

Retrieval-augmented generation, or RAG, answers a question by first finding relevant passages in a corpus and then conditioning a language model on those passages while it writes the answer. On a laptop, over a few thousand documents, RAG is a weekend project: embed the documents, drop the vectors into an in-memory index, and call a model. The interesting version, and the one this chapter studies, is the version that serves the open web. When the corpus is measured in billions of pages, refreshed continuously, and queried by a live user population, every component that was trivial on the laptop becomes a distributed system in its own right. The purpose of this case study is to show how the machinery built in Parts II through VII assembles into one coherent product, and where the seams between those parts actually fall.

We will treat the system as a concrete target rather than an abstraction. Throughout the chapter, the running specification is a web-scale RAG service with a corpus on the order of ten billion documents, a freshness requirement measured in hours for the most active slice of the web, a query path that must return grounded answers within a latency a user tolerates, and a cost envelope that rules out brute force. Those four pressures, corpus size, freshness, latency and throughput, and cost, are the requirements; the rest of the chapter is the design that satisfies them.

1. The Requirements: Four Numbers That Rule Out One Machine Beginner

A problem definition is only useful if it commits to numbers, because the numbers decide which ceilings bind and therefore which axes of distribution we must reach for. We fix four requirements and carry them through the entire chapter. The first is corpus size: the system indexes on the order of $N_{\text{docs}} = 10^{10}$ documents, each split into a handful of retrievable passages. The second is freshness: a meaningful fraction of the web changes daily, and the most active pages must be re-crawled and re-indexed within hours, not weeks, so the pipeline is never a one-time batch job but a continuously running flow. The third is the service-level objective, or SLO, on the query path: the system must sustain a throughput measured in thousands of queries per second while keeping per-query latency inside a budget on the order of a second, including both retrieval and generation. The fourth is cost: the design must hit the first three at a price that a real operator would pay, which immediately disqualifies any approach that scans the whole corpus per query or re-embeds the whole corpus per update.

These four requirements interact, and the interaction is the whole subject. Freshness fights cost, because re-embedding more often burns more accelerator time. Latency fights corpus size, because a larger index means more candidates to search per query. Throughput fights latency, because every concurrent query competes for the same retrieval and generation hardware. A single machine fails all four at once and for independent reasons: it cannot store the corpus, cannot embed it in any reasonable time, cannot hold the index, and cannot serve the request volume. The next section makes that failure quantitative.

Key Insight: A Case Study Is a Constraint-Satisfaction Problem Across Axes

Earlier chapters each optimized one axis in isolation: how to process data, how to train, how to serve. A real system is the joint problem, where improving one axis pressures another. Web-scale RAG is the cleanest example in the book because all six axes are simultaneously load-bearing and visibly in tension. The skill this chapter teaches is reading a specification, locating which ceiling each requirement hits, and assigning each to the axis and the earlier chapter that owns its remedy, before writing a line of system code.

2. Why a Single Machine Is Not Even Close Beginner

The argument for distribution here is not rhetorical; it falls out of arithmetic on the four requirements. Take the corpus of $N_{\text{docs}} = 10^{10}$ documents, split each into a few passages, and embed every passage into a vector of dimension $d$ stored in half precision. The number of vectors to store and search is

$$N_{\text{vec}} = N_{\text{docs}} \cdot c, \qquad S_{\text{index}} = N_{\text{vec}} \cdot d \cdot b,$$

where $c$ is passages per document, $d$ is the embedding dimension, and $b$ is bytes per dimension. With $c = 3$, $d = 1024$, and $b = 2$ bytes, the index alone runs to tens of terabytes of raw vectors, far past the memory of any single node. The number of shards the index must be split into is the index size divided by the memory each node can devote to it,

$$K_{\text{shard}} = \left\lceil \frac{S_{\text{index}}}{M_{\text{node}}} \right\rceil,$$

which for a node offering tens of gigabytes of fast memory to the index lands in the hundreds. The compute to build the index once is equally decisive. If embedding costs a fixed throughput $r$ passages per second on one accelerator, the total one-pass embedding work is

$$T_{\text{embed}} = \frac{N_{\text{vec}}}{r},$$

independent of how we parallelize it; parallelizing only divides the wall-clock by the fleet size. Finally, the query path obeys Little's law: the number of queries in flight at any instant is the arrival rate times how long each one stays in the system,

$$C = \lambda \cdot L,$$

so a throughput $\lambda$ of thousands of queries per second at a latency $L$ near a second means thousands of queries are being served concurrently, which no single server's compute or memory can hold. Figure 36.1.1 shows the four pressures and the stages they fall on; Code 36.1.1 puts real numbers behind every one of these formulas.

Figure 36.1.1: The end-to-end web-scale RAG pipeline as a distributed dataflow. The offline build path (top) crawls, cleans, embeds, and shards the corpus continuously; the online query path (middle) embeds a query, fans it out across index shards, ranks the candidates, and conditions a sharded generator on the retrieved passages; the evaluation loop (bottom, dashed) measures quality and cost and feeds corrections back into crawling and ranking. Each stage is labeled with its approximate scale (from Code 36.1.1) and the axis of distribution it loads most heavily.

The diagram fixes the vocabulary for the chapter. The top row is the offline build path that runs continuously to satisfy the freshness requirement; the middle row is the online query path that runs per request under the latency SLO; the dashed bottom loop is the evaluation that keeps both honest. The numbers on each box come from the same back-of-envelope model, which we now run with real arithmetic so that no estimate in this chapter is a guess.

import math

N_docs            = 10_000_000_000   # corpus size (requirement 1)
chunks_per_doc    = 3                 # retrievable passages per document
bytes_per_chunk   = 1024             # cleaned text per passage, bytes
dim               = 1024             # embedding dimension d
bytes_per_dim     = 2                # fp16 vector storage, b
embed_rate        = 2000             # passages/sec on one accelerator, r
mem_per_node_gb   = 80              # memory one node devotes to the index

n_vec   = N_docs * chunks_per_doc                       # total vectors N_vec
text_tb = n_vec * bytes_per_chunk / 1e12               # cleaned text storage
S_index = n_vec * dim * bytes_per_dim                  # index size in bytes
vec_tb  = S_index / 1e12
shards  = math.ceil(S_index / (mem_per_node_gb * 1e9)) # K_shard
gpu_h   = n_vec / embed_rate / 3600                    # T_embed in GPU-hours

print(f"total vectors N_vec        : {n_vec:,}")
print(f"cleaned text storage       : {text_tb:,.1f} TB")
print(f"fp16 vector index S_index  : {vec_tb:,.1f} TB")
print(f"index shards K_shard       : {shards:,}")
print(f"one-pass embed T_embed     : {gpu_h:,.0f} GPU-hours")

for fleet in (1, 1000):                                 # wall-clock = T_embed / fleet
    print(f"  embed wall-clock on {fleet:>4} GPUs : {gpu_h/fleet:>10,.1f} h "
          f"({gpu_h/fleet/24:,.1f} days)")

for qps, lat in ((5000, 0.8), (20000, 0.8)):            # Little's law: C = lambda * L
    print(f"  QPS={qps:>6}, latency={lat}s -> queries in flight C = {qps*lat:,.0f}")

Code 36.1.1: The four-requirement back-of-envelope model for the running specification. Each printed quantity corresponds to one formula in this section: $N_{\text{vec}}$, $S_{\text{index}}$, $K_{\text{shard}}$, $T_{\text{embed}}$, the embedding wall-clock under a fleet, and Little's-law concurrency $C$.

total vectors N_vec        : 30,000,000,000
cleaned text storage       : 30.7 TB
fp16 vector index S_index  : 61.4 TB
index shards K_shard       : 768
one-pass embed T_embed     : 4,167 GPU-hours
  embed wall-clock on    1 GPUs :    4,166.7 h (173.6 days)
  embed wall-clock on 1000 GPUs :        4.2 h (0.2 days)
  QPS=  5000, latency=0.8s -> queries in flight C = 4,000
  QPS= 20000, latency=0.8s -> queries in flight C = 16,000

Output 36.1.1: Real output. Building the index once on a single accelerator would take roughly half a year; the vector index alone is over sixty terabytes and must be split across hundreds of nodes; and the query path keeps thousands of requests in flight at once. Every number here is a wall the single machine hits, and each maps to a different axis of distribution.

The output settles the question. The single accelerator needs about $173$ days to embed the corpus once, which alone violates the freshness requirement before any user has typed a query. The index is over sixty terabytes and demands on the order of $768$ shards. The query path keeps thousands of requests in flight simultaneously. None of these is a marginal overshoot that a faster chip would absorb; each is one to three orders of magnitude beyond a single node, and each lands on a different stage of the pipeline. That separation is exactly what makes the six-axis decomposition the right tool.

3. Decomposing the Pipeline Onto the Six Axes Intermediate

The six axes of distribution from Section 1.1, distribute data, distribute training, distribute the model, distribute inference, coordinate the cluster, and distribute intelligence, give us a map onto which every stage of Figure 36.1.1 places cleanly. Mapping the stages is the planning act of the whole chapter: it tells us which earlier part of the book owns each stage and which later section of this chapter develops it. Table 36.1.1 is that map, and it doubles as the chapter's table of contents.

Table 36.1.1: Each pipeline stage assigned to the axis of distribution it loads most heavily, the earlier chapter that owns the underlying machinery, and the later section of this chapter that builds it.

Pipeline stage	Primary axis	Owning earlier chapter	Built in this chapter
Crawl the web	Distribute data	Ch 6 (MapReduce)	Section 36.2
Clean and deduplicate	Distribute data	Ch 7, Ch 8	Section 36.3
Shard and index	Distribute the model	Ch 25 (vector search)	Section 36.4
Chunk and embed	Distribute inference	Ch 15 (encoders)	Section 36.5
Retrieve and rank	Coordinate the cluster	Ch 25	Section 36.6
RAG generation and orchestration	Distribute inference	Ch 24, Ch 32	Section 36.7
Evaluate end to end	Distribute intelligence	Ch 5	Section 36.8
Extend and rebuild	Coordinate the cluster	Ch 41 (capstone)	Section 36.9

Reading the table top to bottom traces the dataflow of Figure 36.1.1, and reading the third column traces the path back through the book. The crawl and the cleaning stages are pure data distribution: they are MapReduce and Spark jobs over a petabyte-class stream, owned by Chapter 6 and Chapter 7, with the storage and loading machinery of Chapter 8 underneath. Embedding is distributed inference applied to a fixed corpus rather than a live request stream, which is why it borrows the encoder and batching techniques of Chapter 15. Sharding the index is distributing a structure too large for one node, the same partitioning logic that splits a model across devices, here applied to the vector store of Chapter 25. Retrieval and ranking are a scatter-gather over those shards, a cluster-coordination problem. Generation is distributed LLM serving from Chapter 24, multiplied across the fleet to meet the throughput SLO. Orchestration that holds the per-query flow together is the agent-orchestration material of Chapter 32. Evaluation closes the loop. No stage is new machinery; the contribution of this chapter is the assembly and the seams.

Thesis Thread: The Scale-Out Moment of an Entire Product

Most chapters mark a single primitive being scaled out. This case study marks the thesis itself: a complete, user-facing product in which scale-out is not one optimization but the only way the product can exist. Six axes that earlier chapters treated separately are here load-bearing at the same instant, and the design is the act of keeping them coherent. When you reach the capstone in Chapter 41, this is the worked example you will imitate: read the requirements, locate each ceiling, assign it to an axis, and defend the assignment with the arithmetic of Code 36.1.1. Scale-out stops being a technique and becomes the shape of the system.

4. The Build Path Versus the Query Path Intermediate

One structural decision organizes the rest of the chapter, and it is visible as the two horizontal bands in Figure 36.1.1. The offline build path and the online query path have opposite cost profiles and must be engineered against opposite metrics. The build path is throughput-bound and latency-tolerant: it processes the whole corpus, so it is measured in total accelerator-hours and dollars, and a passage embedded a few minutes late costs nothing. The query path is latency-bound and throughput-replicated: each request must finish inside the SLO, so it is measured in tail latency and in how many concurrent requests the fleet sustains, and being a few seconds slow costs a user. Conflating these two paths is the most common design error in real RAG systems, because the techniques that make the build path cheap, enormous batches and relaxed deadlines, are exactly the techniques that ruin the query path.

The two paths meet at one artifact: the sharded index. The build path produces it; the query path consumes it. Freshness is precisely the rate at which updates flow from the build path into the index that the query path is already serving, which is why the index cannot be a static file but must be a live, incrementally updated structure, a point Section 36.5 takes up in detail. The concurrency the query path must hold, the $C = \lambda L$ from Output 36.1.1, is the budget that Section 36.7 spends on replicated generation. Keeping the two paths logically separate while letting them share exactly one artifact is the architectural spine of the chapter.

Practical Example: The RAG Demo That Could Not Become a Product

Who: A founding engineer at an early-stage company building a research-assistant product over a large document collection.

Situation: A prototype over fifty thousand documents worked beautifully in a notebook: one embedding pass, an in-memory index, one model call per query.

Problem: The first real customer arrived with two hundred million documents and a contractual sub-second latency requirement, and the notebook architecture fell over on every axis at once.

Dilemma: Scale the prototype up on one very large machine, which postponed the wall by a single order of magnitude and still capped throughput at one server, or re-architect into a distributed build path and query path before the corpus grew further.

Decision: They re-architected, treating embedding as an offline batch job, sharding the index across nodes, and replicating the generator behind a queue, the exact split of Figure 36.1.1.

How: They ran the back-of-envelope model of Code 36.1.1 on the customer's numbers first, which told them the index needed dozens of shards and the embedding pass needed a fleet, and they sized the cluster from those numbers rather than from guesses.

Result: The product met the latency SLO at the customer's volume, and because the build and query paths were separated, growing the corpus tenfold later meant adding shards and embedding workers without touching the serving code.

Lesson: A RAG prototype and a RAG product are different systems. Doing the arithmetic of Section 2 before writing code is what turns the demo into something that survives its first real corpus.

Library Shortcut: The Laptop RAG That Hides Every Distributed Problem

It is worth seeing the system this chapter does not study, because it shows exactly which problems scale removes the luxury of ignoring. A few lines of a modern RAG framework give a working pipeline on a small corpus:

# pip install langchain-community faiss-cpu sentence-transformers
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

emb   = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = FAISS.from_texts(documents, emb)          # embeds + indexes, all in RAM
hits  = store.similarity_search(query, k=5)        # exact scan over the whole index
answer = llm(prompt_with(hits))                    # one model call, one machine

Code 36.1.2: A complete single-machine RAG pipeline in five lines. It silently assumes the corpus fits in RAM, the embedding pass finishes instantly, the index lives in one process, and one model call meets the deadline. Each of those four assumptions is one of the four requirements from Section 1, and each fails at web scale; the rest of this chapter is what those five lines become when none of them hold.

5. What This Chapter Builds, Section by Section Beginner

With the requirements fixed, the scale proven, and the axes mapped, the path through the chapter is set. Section 36.2 builds the distributed crawler that produces the raw corpus. Section 36.3 cleans and deduplicates that corpus at scale, the stage that most determines final answer quality. Section 36.4 builds the sharded lexical index over the cleaned corpus, the term-to-document structure too large for one node. Section 36.5 embeds the cleaned passages across a fleet, spending the $T_{\text{embed}}$ budget of Output 36.1.1 and writing the resulting vectors into the live index, the artifact both paths share. Section 36.6 builds the fan-out retrieval and ranking that turn a query into candidate passages. Section 36.7 conditions a replicated, sharded generator on those passages to produce grounded answers inside the latency SLO, and orchestrates the per-query flow. Section 36.8 closes the loop with end-to-end evaluation of retrieval quality, faithfulness, and cost. Section 36.9 hands the pipeline back as a staged project that grows a single-machine baseline into the full distributed system one milestone at a time. Each section opens with the slice of Figure 36.1.1 it owns and the requirement from Section 1 it must satisfy.

Research Frontier: Where Web-Scale RAG Is Moving (2024 to 2026)

The boundary of this problem is shifting on every axis at once. Long-context generators have reopened the question of how much retrieval is even needed, yet careful studies continue to find that retrieval beats context-stuffing on cost and on faithfulness at web scale, so the pipeline of Figure 36.1.1 remains the economical design. On the index side, learned and disk-resident approximate-nearest-neighbor structures in the DiskANN and SPANN lineage push billions of vectors per node and cut the shard count $K_{\text{shard}}$ that Section 2 computed. On the generation side, RAG is folding into agentic loops that retrieve iteratively and decide when they have read enough, which is where this case study connects forward to the agentic applications of Chapter 40 and the shared-memory retrieval of Chapter 32. The constant across all of it is the four-requirement tension of Section 1: every advance is judged by whether it improves one of corpus size, freshness, latency, or cost without surrendering the other three.

Fun Note: The Half-Year That Becomes Four Hours

Output 36.1.1 says one accelerator needs about $173$ days to embed the corpus once. A fleet of a thousand does the same work in roughly four hours. Nothing about the total compute changed; the $4{,}167$ accelerator-hours are spent either way. Scale-out did not make the work smaller, it made the wait survivable. That trade, identical total cost for a wall-clock a human can stand, is the bargain underneath every chapter of this book, and it is rarely as vivid as it is here.

Exercise 36.1.1: Read the Requirements, Name the Ceilings Conceptual

For each of the four requirements in Section 1 (corpus size, freshness, query latency and throughput, cost), name the single pipeline stage in Figure 36.1.1 that it pressures hardest, the axis of distribution from Table 36.1.1 that stage sits on, and one concrete failure that occurs if that stage is left on a single machine. Then identify which two of the four requirements are in the most direct tension with each other and explain why satisfying one cheaply makes the other harder.

Exercise 36.1.2: Resize the System Coding

Modify Code 36.1.1 to model a corpus ten times larger ($N_{\text{docs}} = 10^{11}$) and an embedding dimension of $1536$ instead of $1024$. Report the new index size $S_{\text{index}}$, shard count $K_{\text{shard}}$, and embedding budget $T_{\text{embed}}$, and compute the fleet size needed to keep the one-pass embedding wall-clock under twelve hours. Then add a freshness model: assume one percent of the corpus must be re-embedded every day and compute the steady-state number of embedding accelerators that requirement alone demands, independent of the initial build.

Exercise 36.1.3: The Concurrency Budget Analysis

Using Little's law $C = \lambda L$ from Section 2, suppose the query path must sustain $\lambda = 12{,}000$ queries per second at a target $p50$ latency of $0.7$ seconds, of which retrieval consumes $0.2$ seconds and generation consumes the rest. Estimate the number of concurrent queries in each of the two stages separately, and argue from those two numbers which stage dominates the fleet sizing. Discuss how the answer changes if a long tail makes the $p99$ latency three times the $p50$, and connect your reasoning to the replicated-serving treatment forthcoming in Section 36.7 and the LLM-serving economics of Chapter 24.