Part V: Distributed Inference and Serving
Chapter 25: Distributed Retrieval and Vector Search

Retrieval-Augmented Generation as a Distributed System

"The model asked me one question. I asked nine billion vectors. We agreed to split the difference inside nine hundred milliseconds, and somehow nobody noticed how many of us it took."

A Vector Shard That Answered in Time
Big Picture

Retrieval-augmented generation looks like a single clever prompt, but it is a pipeline of four distributed systems wired in series under one latency budget: a query is embedded by a replicated encoder, searched against a sharded index of millions to billions of vectors, reranked by a dedicated scorer, and handed as context to a distributed language model. Each stage outgrows one machine for its own reason, and each is built and scaled separately in the sections that follow. What ties them into a system is not a shared algorithm but a shared deadline: the embed, search, rerank, and generate latencies must sum to less than the service-level objective, or the answer arrives too late to matter. This section frames that pipeline, derives its latency budget, and shows which stage dominates as the corpus grows, so the rest of the chapter can attack each stage knowing exactly how much of the budget it owns.

A language model knows only what was in its training data, frozen at a point in time, and it cannot cite a source it never saw. Retrieval-augmented generation, or RAG, fixes both problems by giving the model fresh, specific text at query time: before the model answers, a retrieval system finds the most relevant passages from a corpus and places them in the prompt. The idea is simple to state and is often demonstrated on a laptop with a few thousand documents. That demonstration hides the engineering. A production RAG system serves a corpus that does not fit on one machine, answers more queries per second than one server can handle, and must do both inside a strict latency budget. The moment any of those three pressures binds, and in a real deployment all three bind at once, RAG stops being a prompt-engineering trick and becomes a distributed system. This chapter is about that system; the quality of the retrieved text matters, but the distribution is the point.

Those three pressures are the same ones that opened this book in Section 1.1: data too big for one machine, model too big for one device, throughput beyond one server. Retrieval makes all three concrete. The corpus and its embeddings are the data; a billion passages at 768 dimensions in 32-bit floats is roughly three terabytes of vectors alone, before any index structure, which is why the index must be sharded across machines. The language model is the model, already distributed across a fleet by Chapter 24. And every user-facing query competes for the same index and the same model, so every stage must also be replicated for throughput. RAG is therefore not one axis of distribution but several, stacked, which is exactly why it earns its own chapter rather than a paragraph.

1. The Four Stages, Each Its Own Subsystem Beginner

Trace a single query through the pipeline and four distinct distributed systems light up in turn. Figure 25.1.1 lays them out left to right; we walk them in the same order. The first stage is embedding. The raw query text is meaningless to a vector index, so a transformer encoder maps it to a dense vector in the same space as the corpus embeddings. That encoder is replicated across machines because query volume exceeds one accelerator's throughput, and the corpus side of the same embedding job, encoding billions of passages offline, is a large distributed batch pipeline in its own right, the subject of Section 25.2.

The second stage is search. The query vector is compared against every corpus vector, in effect, to find the nearest neighbors, except that exact comparison against billions of vectors is far too slow, so the system uses an approximate nearest-neighbor (ANN) index that trades a little recall for orders of magnitude less work. That index does not fit on one machine, so it is partitioned into shards, each holding a slice of the corpus, and a query is scattered to all shards and the partial results gathered back, a pattern Sections 25.4 and 25.5 develop in full. The third stage is reranking. ANN search is fast but coarse; a heavier cross-encoder rescores the top candidates with the query and passage read together, which is more accurate but too expensive to run over the whole corpus, so it runs only over the few dozen candidates that search returned, on its own replicated service (Section 25.7). The fourth stage is generation: the reranked passages are concatenated into the prompt and sent to the distributed language model of Chapter 24, which produces the answer.

Query text Embed replicated encoder throughput Sharded ANN search s1 s2 sK scatter to shards, gather top-k corpus too big data + throughput Rerank cross-encoder top candidates accuracy Distributed LLM gpu gpu gpu prefill context, decode answer model too big model + throughput Answer + sources One latency budget t_embed + t_search + t_rerank + t_gen ≤ SLO
Figure 25.1.1: The distributed RAG pipeline. A query flows through four subsystems in series: a replicated embedding encoder, a sharded approximate-nearest-neighbor (ANN) search that scatters to $K$ shards and gathers the top-$k$, a replicated cross-encoder reranker, and the distributed language model of Chapter 24. The small caption above each box names the pressure that forces it off one machine. The dashed band underneath is the constraint that makes the four into one system: their latencies must sum to less than the service-level objective.

The crucial observation in Figure 25.1.1 is that the four boxes do not share an algorithm; they share a deadline. The embedding service knows nothing about graph-based indexes; the ANN shards know nothing about transformer decoding. What couples them is that a user is waiting, and the four stages run in series, so their latencies add. That sum, and not any single stage's cleverness, is what the architect must hold under control. The next section makes the sum precise.

Key Insight: RAG Is a Series Pipeline of Independent Distributed Systems Under One Deadline

Each stage of RAG scales out for a different reason (the encoder for throughput, the index for corpus size, the reranker for accuracy, the model for parameters and throughput) and is engineered with different machinery. They become a single system only because they run in series under a shared latency budget. Optimize a stage in isolation and you may win nothing; optimize the stage that owns the largest slice of the budget and the whole pipeline speeds up. The budget, not any individual index trick, is the object of design.

2. The End-to-End Latency Budget Intermediate

Because the stages run in series, the user-perceived latency of one query is the sum of the per-stage latencies plus the network hops between them. Writing $t_{\text{embed}}$, $t_{\text{search}}$, $t_{\text{rerank}}$, and $t_{\text{gen}}$ for the four stage latencies, the end-to-end time must fit a service-level objective (SLO):

$$t_{\text{e2e}} \;=\; t_{\text{embed}} + t_{\text{search}} + t_{\text{rerank}} + t_{\text{gen}} \;\le\; \text{SLO}.$$

This is a budget in the literal sense: a fixed quantity of milliseconds to be allocated across stages, where spending more on one leaves less for the others. The subtlety is that the relevant number is not the average latency of each stage but its tail. A query waits for the slowest of the $K$ ANN shards, not the average shard, so $t_{\text{search}}$ is governed by the tail of the shard-latency distribution, and the more shards a query fans out to, the worse the slowest-of-$K$ tends to be. The same logic applies wherever a stage fans out. The honest budget is therefore written in tail latency, typically the 95th or 99th percentile:

$$t_{\text{search}}^{p99} \;\approx\; \mathbb{E}\!\left[\max_{1 \le j \le K} t_j\right] \;\ge\; \max_j \, \mathbb{E}[t_j],$$

and the gap between the expected maximum and the maximum expectation is exactly the tail-amplification effect that Section 5.3 analyzes as the central hazard of fan-out systems. Budgeting RAG in average latency and then being surprised by the p99 is the most common way these pipelines miss their SLO. We allocate the budget at the tail and measure it at the tail.

To see where the budget actually goes, and how it shifts as the corpus grows, we model the four stages with simple closed-form latencies and sum them across corpus sizes from one million to ten billion vectors. The model is deliberately coarse; its job is to show the shape of the budget, namely which stage dominates and how that stays stable as data scales, not to predict any specific deployment to the millisecond.

import math

# Per-stage latency models (milliseconds), all simple closed forms so the
# numbers are reproducible and the dominant term is easy to read off.

def embed_ms():
    # One query embedded on a replicated encoder: fixed forward pass + network
    # hop to the embedding service. Independent of corpus size.
    forward = 6.0          # transformer encoder forward pass for one short query
    network_hop = 1.5      # client -> embedding replica round trip
    return forward + network_hop

def search_ms(n_vectors, shards):
    # Sharded ANN index. Each shard searches n_vectors/shards points; graph-based
    # ANN visits ~ c * log(points_per_shard) nodes, then a scatter-gather merges
    # the per-shard top-k. Fan-out adds one network round trip plus tail-straggler
    # slack that grows slowly with shard count.
    per_shard = n_vectors / shards
    visit = 0.45 * math.log2(max(per_shard, 2.0))   # ANN node visits per shard
    fanout = 1.5                                     # scatter to shards round trip
    straggler = 0.25 * math.log2(max(shards, 2.0))   # slowest-of-S tail slack
    merge = 0.4                                       # top-k merge of shard results
    return visit + fanout + straggler + merge

def rerank_ms(candidates):
    # Cross-encoder reranker scores the candidate passages in a batch on a
    # dedicated replica; cost is roughly linear in the candidate count.
    per_candidate = 0.35
    network_hop = 1.5
    return per_candidate * candidates + network_hop

def generate_ms(prompt_tokens, output_tokens):
    # Distributed LLM serving (Chapter 24): prefill over the retrieved context,
    # then autoregressive decode. Decode dominates and is corpus-independent.
    prefill = 0.02 * prompt_tokens     # context (query + retrieved passages)
    decode = 11.0 * output_tokens      # 11 ms per output token, served distributed
    return prefill + decode

SLO_MS = 900.0          # end-to-end p95 latency budget for the answer
SHARDS_PER_BILLION = 100  # we add index shards as the corpus grows
CANDIDATES = 50          # top-k passages handed to the reranker
PROMPT_TOKENS = 1200     # query + retrieved passages fed to the LLM
OUTPUT_TOKENS = 64       # generated answer length

corpus_sizes = [1_000_000, 10_000_000, 100_000_000, 1_000_000_000, 10_000_000_000]

print(f"{'corpus':>14} {'shards':>7} {'embed':>7} {'search':>7} "
      f"{'rerank':>7} {'gen':>7} {'total':>8} {'dominant':>9} {'SLO':>5}")
print("-" * 86)
for n in corpus_sizes:
    shards = max(1, round(n / 1_000_000_000 * SHARDS_PER_BILLION))
    e = embed_ms()
    s = search_ms(n, shards)
    r = rerank_ms(CANDIDATES)
    g = generate_ms(PROMPT_TOKENS, OUTPUT_TOKENS)
    total = e + s + r + g
    stages = {"embed": e, "search": s, "rerank": r, "generate": g}
    dominant = max(stages, key=stages.get)
    ok = "ok" if total <= SLO_MS else "OVER"
    print(f"{n:>14,} {shards:>7} {e:>7.1f} {s:>7.1f} "
          f"{r:>7.1f} {g:>7.1f} {total:>8.1f} {dominant:>9} {ok:>5}")
Code 25.1.1: An end-to-end RAG latency budget built by summing four distributed stages. The search model adds one index shard per ten million vectors so the points searched per shard stay bounded; every other stage is corpus-independent. The dominant stage is the one consuming the largest slice of the budget at each corpus size.
        corpus  shards   embed  search  rerank     gen    total  dominant   SLO
--------------------------------------------------------------------------------------
     1,000,000       1     7.5    11.1    19.0   728.0    765.6  generate    ok
    10,000,000       1     7.5    12.6    19.0   728.0    767.1  generate    ok
   100,000,000      10     7.5    13.2    19.0   728.0    767.7  generate    ok
 1,000,000,000     100     7.5    14.0    19.0   728.0    768.5  generate    ok
10,000,000,000    1000     7.5    14.9    19.0   728.0    769.4  generate    ok
Output 25.1.1: The corpus grows by a factor of ten thousand, yet the total latency moves by under four milliseconds. Search rises only from 11.1 to 14.9 ms because sharding keeps the per-shard work logarithmic, and generation sits at a fixed 728 ms floor that no amount of corpus growth touches. The pipeline stays comfortably inside the 900 ms budget throughout.

Output 25.1.1 carries the chapter's central practical lesson. As the corpus grows ten-thousand-fold, from one million to ten billion vectors, the total latency barely moves, because the search stage grows only logarithmically once the shard count tracks the corpus size, exactly the property that Section 25.5 engineers on purpose. The stage that dominates the budget is generation, the distributed language model from Chapter 24, and it is corpus-independent. This is liberating: it means the retrieval half of RAG, the part this chapter is about, can scale to web-sized corpora without eating the latency budget, provided the index is sharded well. The danger is not that search gets slow as data grows; it is that an unsharded index makes search grow linearly, at which point a few hundred million vectors blow the budget on their own. The whole point of distributing the index is to convert that linear growth into the flat line you see above.

Fun Note: The Tyranny of the Slowest Friend

A fan-out query is a group of friends leaving a restaurant: you cannot go home until the slowest one finishes their coffee. With two friends, the wait is usually fine. With a thousand friends, one of them is always, at this very moment, ordering a second espresso. That is why the search row in Output 25.1.1 creeps upward even though each shard does less work: more shards means a higher chance that one of them is having a slow millisecond. The cure is not to make every friend faster, it is to stop waiting for the very last one, which is what hedged requests and tail-cutting do in Section 5.3.

3. Why Retrieval Must Scale Out at All Beginner

It is worth pausing on why retrieval cannot simply live on one large machine, because the answer determines the shape of the whole chapter. Two ceilings force it out, and they are the same data and throughput ceilings from Section 1.1, now wearing retrieval's clothes. The first ceiling is the corpus and its embeddings. A web-scale text corpus runs to billions of passages, and each passage becomes a dense vector. At 768 dimensions and four bytes per number, a billion vectors is about three terabytes before the index adds its own overhead, and a graph or quantized index can multiply that. No single machine holds that in memory at search speed, so the index is partitioned: shard $j$ holds a disjoint slice of the corpus, and a query is searched against all shards in parallel and their results merged. This is the same partitioning-and-sharding arc the book has followed from Chapter 11's sharded embedding tables onward, now applied to a search index.

The second ceiling is query throughput. A single index replica can sustain only so many searches per second; a service answering thousands of user queries per second must therefore replicate the index, running several copies behind a load balancer so that capacity scales with traffic. Sharding and replication are orthogonal: sharding splits one logical index across machines so it fits and so each query does less work, while replication duplicates the sharded index so more queries can run at once. A production index is laid out as a grid, sharded one way and replicated the other, which is precisely the design space Section 25.5 opens up. The classical foundations for the search inside each shard, locality-sensitive hashing and approximate nearest neighbors, were built in Section 12.7, and the distributed top-$k$ gather that merges shard results is the same reduce pattern as the MapReduce aggregation of Section 6.7; this chapter scales those ideas to the index and binds them to a latency budget.

Practical Example: The Support Bot That Missed Its Budget on the Reranker

Who: A platform engineer running a customer-support RAG assistant over a 40-million-document knowledge base.

Situation: The assistant answered well in testing but its p99 latency drifted past the 1.2-second SLO under real traffic, and product threatened to cut the retrieval step entirely.

Problem: The team's instinct was to blame the vector search and buy a faster ANN index, the stage they had spent the most time tuning.

Dilemma: Spend the quarter optimizing search, which felt like the hard distributed-systems problem, or first measure where the budget actually went, which risked finding that their pride-and-joy index was not the culprit.

Decision: They instrumented every stage separately, exactly as Code 25.1.1 sums them, and read off the dominant term before touching any code.

How: Per-stage traces showed sharded search at 22 ms p99, embedding at 9 ms, generation at 700 ms, and a cross-encoder reranker scoring 200 candidates at 380 ms p99, the reranker, not search, was eating the slack.

Result: Cutting the reranker's candidate set from 200 to 40 and batching it cut its p99 to 70 ms with no measurable drop in answer quality, and the pipeline came in under SLO with room to spare.

Lesson: Distribute and optimize the stage that owns the budget, not the stage you find most interesting. The budget in Section 2 is a diagnostic instrument before it is a design constraint.

4. The Map of the Chapter Beginner

Each stage of the pipeline in Figure 25.1.1 expands into its own section, and a few cross-cutting concerns get sections of their own. The embedding stage, including the offline pipeline that encodes the whole corpus into vectors, is Section 25.2. The store that holds those vectors and answers searches, the vector database, is Section 25.3. The approximate-nearest-neighbor algorithms that make search fast inside one shard, HNSW graphs, inverted files (IVF), and product quantization (PQ), are Section 25.4, and the way those indexes are sharded and replicated across machines is Section 25.5, the structural heart of the chapter.

Beyond pure vector search, Section 25.6 covers distributed hybrid search, combining dense vectors with classical keyword retrieval so the two signals reinforce each other. The reranking stage and its distributed, multi-stage refinement is Section 25.7. Because the same queries and the same hot passages recur, Section 25.8 treats distributed caching for retrieval, the layer that lets a system skip work it has already done. Finally, Section 25.9 closes the loop with evaluation, how to measure recall, latency, and freshness across a sharded, replicated retrieval fleet, building directly on the evaluation methodology of Chapter 5. Once the pieces are in place, the web-scale and agentic case studies of Chapter 36 and Chapter 40 assemble them into complete systems serving real traffic.

Thesis Thread: Retrieval Is the Six Axes, Stacked

RAG is the clearest example in the book of the six axes of distribution from Section 1.1 meeting in one system. It distributes data (the sharded corpus and its embeddings), distributes inference (the replicated encoder, reranker, and language model), distributes the model (the LLM's parameters across Chapter 24's fleet), and coordinates the cluster (the scatter-gather across index shards and the latency budget that governs them). Reading retrieval as a single subsystem misses the lesson; the engineering payoff is in seeing four distributed systems composed under one deadline, and in knowing which axis to scale when the budget tightens. That is the thesis of this book applied to one of its most demanding workloads.

Library Shortcut: LangChain and LlamaIndex Wire the Pipeline in a Few Lines

The four stages of Figure 25.1.1 are real services you operate, but orchestration frameworks let you express the wiring declaratively while you point each stage at a distributed backend. In LlamaIndex, an embedding model, a sharded vector store, a reranker, and an LLM are four components composed into a query engine; the framework handles the embed-search-rerank-generate flow and lets each component be a remote, replicated service rather than an in-process object:

# Each component is a handle to a distributed service, not a local object.
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank

index = VectorStoreIndex.from_vector_store(sharded_vector_store)  # 25.3-25.5
reranker = SentenceTransformerRerank(top_n=40, model="BAAI/bge-reranker-base")  # 25.7

query_engine = index.as_query_engine(
    similarity_top_k=200,          # ANN search breadth, fanned out to shards
    node_postprocessors=[reranker],# rerank stage trims 200 -> 40 candidates
    llm=distributed_llm,           # Chapter 24's served model
)
answer = query_engine.query("How do index shards bound search latency?")
Code 25.1.2: The embed-search-rerank-generate pipeline of Figure 25.1.1 expressed in LlamaIndex; LangChain and Haystack offer equivalent retriever-reranker-generator compositions. The framework supplies the orchestration that the case studies in Chapter 36 run at scale, but it does not relieve you of owning the latency budget: each handle points at a service you must shard, replicate, and keep inside its slice of the SLO.

5. A Frontier in Motion: Retrieve, or Just Read Everything? Advanced

The premise of RAG, that you retrieve a small relevant slice rather than feed the model the whole corpus, is under active pressure from two directions at once, and the resolution is shaping how these systems are built right now. It is worth knowing where the argument stands before committing a design to it.

Research Frontier: Long-Context Models, Agentic Retrieval, and the Future of the Pipeline (2024 to 2026)

One direction questions retrieval itself. Models with context windows of hundreds of thousands to millions of tokens (Gemini 1.5, 2024, and its successors; Anthropic's and OpenAI's long-context releases) raised the question of whether you can skip retrieval and simply read the relevant documents into the prompt. The empirical answer so far is no, not at corpus scale: studies of the "lost in the middle" effect (Liu et al., 2024) show models attend unevenly across a long context, and stuffing a million tokens into every prompt is ruinous for both latency and cost, since the generation stage that already dominates Output 25.1.1 grows with context length. Retrieval survives as the way to keep the prompt small and relevant, and long context becomes a larger candidate budget for the reranker rather than a replacement for search.

The other direction makes retrieval more active. Agentic and iterative retrieval (the Self-RAG line of Asai et al., 2024, and the broader move to retrieval-using agents) turns the single embed-search-rerank-generate pass of Figure 25.1.1 into a loop: the model decides whether to retrieve, issues several queries, reads results, and retrieves again before answering. This multiplies the number of search round trips per user request, which tightens the per-search latency budget rather than relaxing it, and turns the index into a shared memory that many agent steps hit concurrently, the pattern Chapter 32 develops. Both frontiers leave the distributed retrieval substrate of this chapter not just intact but more heavily loaded, which is the strongest argument for building it well.

Whichever way these debates settle, the distributed machinery is the durable part. A longer context window still needs a fast way to choose what goes in it; an agent that retrieves ten times per task needs each retrieval to cost a tenth of the budget. The sections that follow build that machinery stage by stage, starting with the pipeline that turns a corpus into the vectors everything else searches. We have the pipeline, the budget that governs it, and the map; Section 25.2 opens the first stage, the distributed embedding pipeline.

Exercise 25.1.1: Allocate the Budget Conceptual

You are given a 600 ms p99 SLO for a RAG answer and the following measured stage tails: embed 10 ms, search 25 ms, rerank 60 ms, generate 540 ms. State whether the pipeline meets its SLO, and by how much it misses if not. Then, using Output 25.1.1's lesson about which stage dominates, name the single most effective change to bring it under budget, and explain why shrinking the search stage (the one most associated with "distributed retrieval") would be the wrong first move here.

Exercise 25.1.2: Make Search the Bottleneck Coding

Modify Code 25.1.1 so the index is not sharded: fix shards = 1 for every corpus size and change search_ms so that visiting cost grows linearly in the points per shard (replace the log2 term with a small constant times per_shard). Re-run across the same corpus sizes and report the first corpus size at which the total exceeds the 900 ms SLO. Explain, in terms of the difference between linear and logarithmic growth, why sharding is what converts the flat search line in Output 25.1.1 into a budget-busting one when removed, and what this implies for the layout choices of Section 25.5.

Exercise 25.1.3: Tail Amplification Across Shards Analysis

Suppose each ANN shard's search latency is independent and uniformly distributed between 5 and 15 ms. A query fans out to $K$ shards and waits for the slowest. Using the fact that for $K$ independent uniforms on $[a,b]$ the expected maximum is $a + (b-a)\frac{K}{K+1}$, compute the expected fan-out latency for $K = 1, 10, 100, 1000$. Compare each to the single-shard mean of 10 ms, and relate your numbers to the slow upward creep of the search column in Output 25.1.1 and to the tail-latency treatment of Section 5.3. What does this say about the cost of adding shards purely to fit a larger corpus?