"I hold one shard of a billion-vector index. I do not know what the other shards contain, I only know that when a query arrives I must search my slice fast, return my best guesses, and trust the merge step to sort us all out. I am a small library that has made peace with not being the whole library."
A Shard of an Index That Has Never Seen the Full Corpus
This is the chapter where retrieval-augmented generation stops being a single function call and becomes a distributed pipeline: a corpus too large for one machine, embedded across a fleet, indexed in shards, searched in parallel by approximate nearest-neighbor algorithms, fused across dense and sparse signals, reranked in stages, and cached, all fast enough to sit on the critical path of a generation request. The popular picture of RAG hides every hard part: "embed the query, look up the nearest documents, paste them into the prompt." Each of those steps dissolves into a distributed-systems problem the moment the corpus grows past what one node can hold. Embedding billions of documents is a streaming batch job across many workers, the same shape as the data-processing pipelines of Part II. The index that makes lookup fast cannot live on one machine, so it is sharded across nodes and replicated for throughput and availability, the same sharding-and-replication tension that runs through the whole book. The lookup itself is approximate, because exact nearest-neighbor search over billions of high-dimensional vectors is too slow, so algorithms like HNSW, IVF, and product quantization trade a controlled loss of recall for orders-of-magnitude speed, and each runs in parallel across shards whose partial results must be merged. Dense vector search is fused with sparse lexical search to cover what embeddings miss, candidates are refined by a heavier reranker in a second stage, and a cache layer soaks up the repeated and popular queries so the index is not asked the same question twice. The distribution is the point: this chapter is the general theory of searching a corpus that no single machine can hold, returning the right passages quickly enough that a language model can read them before it answers.
Chapter Overview
This chapter takes the retrieval layer that the serving stack of Chapter 24 depends on and builds it as a distributed system from the ground up. A retrieval-augmented application is only as scalable as its retrieval, and retrieval at scale is not a library call but a pipeline of distributed stages: documents must be embedded in bulk, the resulting vectors must be stored in an index that no single machine can hold, that index must be searched approximately and in parallel, and the results must be fused, reranked, cached, and evaluated. The nine sections walk the pipeline in the order a query and a document each travel through it, treating retrieval quality only where it changes how the work is distributed.
The sections fall into three movements. The first establishes the pipeline and the data that flows into it: Section 25.1 frames retrieval-augmented generation as a distributed system rather than a single call, and Section 25.2 builds the distributed embedding pipeline that encodes a corpus into vectors across a fleet of workers. The second movement is the index and the search over it: Section 25.3 introduces vector databases as the systems that store and serve those vectors, Section 25.4 develops the approximate nearest-neighbor algorithms that make search fast, and Section 25.5 shards and replicates the index so it spans machines. The third movement is the layers that wrap around the index: Section 25.6 fuses dense and sparse retrieval into distributed hybrid search, Section 25.7 adds multi-stage retrieval and distributed reranking, Section 25.8 caches retrieval results across the fleet, and Section 25.9 evaluates the whole distributed system end to end.
Read in order, the nine sections take you from "RAG is just a lookup" to a working mental model of retrieval as a distributed pipeline: encode a streaming corpus into vectors across many workers, store those vectors in a database built for nearest-neighbor queries, search them approximately with HNSW, IVF, and product quantization, shard the index across nodes and replicate it for throughput and availability, fan a query out across the shards and merge their partial results, fuse dense and sparse signals where each alone falls short, refine the top candidates with a heavier reranker, cache the popular and repeated queries, and measure recall, latency, and end-to-end quality across the whole apparatus. The argument is cumulative and it extends Part V's inference arc outward: the serving fleet of Chapter 24 becomes the consumer of this retrieval system, and the data-processing patterns of Part II reappear as the embedding and indexing pipelines that feed it.
Prerequisites
This chapter assumes the classical-ML and nearest-neighbor foundations of Part III and the serving stack of the preceding chapter. From Chapter 12: Distributed Classical Machine Learning you carry the nearest-neighbor and similarity-search vocabulary this chapter scales up: what an approximate nearest-neighbor query is, how high-dimensional similarity search behaves, and how classical methods partition data across machines, all of which become the index and search stages here. From Chapter 24: Distributed LLM Serving you carry the consumer of this retrieval system: the prefill phase and prefix cache into which retrieved passages flow, and the latency budget that forces retrieval to answer fast enough to sit on the critical path of a generation request. The chapter also leans on the distributed data-processing patterns of Part II, since the embedding pipeline of Section 25.2 is a streaming batch job, and on the sharding-and-replication tension that runs through the whole book, which Section 25.5 specializes to a vector index. Beyond these the chapter assumes comfortable Python, a working picture of how a text encoder turns a passage into a vector, and the distributed-systems vocabulary of fan-out, merge, and replication. No prior experience with a specific vector database is needed; Section 25.1 builds the why-retrieval-is-distributed argument from the ground up before any system appears.
Learning Objectives
- Explain why retrieval-augmented generation is a distributed system rather than a single lookup, in particular how corpus size, embedding cost, and the generation latency budget force retrieval across many machines.
- Design a distributed embedding pipeline that encodes a streaming corpus into vectors across a fleet of workers, reasoning about throughput, batching, and fault tolerance.
- Describe what a vector database provides over a raw index: storage, metadata filtering, consistency, and the operational surface of serving nearest-neighbor queries at scale.
- Reason about approximate nearest-neighbor algorithms, in particular HNSW, IVF, and product quantization, and the recall-versus-latency-versus-memory trade-offs each one makes.
- Shard and replicate a vector index across nodes, fan a query out across the shards, and merge their partial results without losing recall.
- Combine dense vector search with sparse lexical search into a distributed hybrid retrieval system, and reason about how the two signals are fused.
- Build multi-stage retrieval that refines a cheap first-stage candidate set with a heavier distributed reranker under a latency budget.
- Apply distributed caching to absorb popular and repeated retrieval queries, and reason about cache placement, invalidation, and consistency across the fleet.
- Evaluate a distributed retrieval system end to end, measuring recall, latency, and downstream answer quality with benchmarks built for the task.
If you keep one thing from this chapter, keep this: distributed retrieval turns the lookup hidden inside retrieval-augmented generation into a pipeline that spans many machines, embedding a corpus into vectors across a fleet, sharding and replicating the index so no single node holds it all, searching it approximately and in parallel with HNSW, IVF, and product quantization, fusing dense and sparse signals, reranking in stages, and caching the popular queries, all fast enough to feed the serving stack of Chapter 24. Read forward, the sections build that pipeline in the order its parts arrive: first why retrieval is distributed, then the distributed embedding pipeline, then the vector database, then approximate nearest-neighbor search, then index sharding and replication, then distributed hybrid search, then multi-stage retrieval and reranking, then distributed caching, and finally evaluation. Read as a question, the chapter asks of any corpus too large for one machine: how is it embedded into vectors at scale, where do those vectors live, how is the index searched without scanning every vector, how is it split across nodes and the partial results merged, how are dense and sparse signals combined, how are the top candidates refined, what can be cached, and how do you know the whole thing works. The roadmap below walks the nine sections that answer it, and the last one hands you the benchmarks that tell you whether the answer is right.
Chapter Roadmap
- 25.1 Retrieval-Augmented Generation as a Distributed System Reframes retrieval-augmented generation from a single lookup into a distributed pipeline, showing how corpus size, embedding cost, and the generation latency budget force retrieval across many machines from the first query.
- 25.2 Distributed Embedding Pipelines Encodes a streaming corpus into vectors across a fleet of workers, treating embedding as a distributed batch job with the throughput, batching, and fault-tolerance concerns of Part II's data pipelines.
- 25.3 Vector Databases Introduces the systems that store and serve embedding vectors, covering what a vector database adds over a raw index: persistence, metadata filtering, consistency, and the operational surface of serving nearest-neighbor queries.
- 25.4 Approximate Nearest Neighbor Search Develops the algorithms that make billion-scale search fast, HNSW graphs, IVF inverted lists, and product quantization, and the recall-versus-latency-versus-memory trade-offs each one makes.
- 25.5 Index Sharding and Replication Splits a vector index across nodes so no single machine holds the whole corpus, replicates it for throughput and availability, and fans a query out across shards whose partial results must be merged without losing recall.
- 25.6 Distributed Hybrid Search Fuses dense vector search with sparse lexical search into one distributed retrieval system, reasoning about how the two complementary signals are combined and scored across the fleet.
- 25.7 Multi-Stage Retrieval and Distributed Reranking Refines a cheap first-stage candidate set with a heavier distributed reranker, trading extra computation for precision under a latency budget across multiple retrieval stages.
- 25.8 Distributed Caching for Retrieval Absorbs popular and repeated retrieval queries with a distributed cache, reasoning about cache placement, invalidation, and consistency so the index is not asked the same question twice.
- 25.9 Evaluating Distributed Retrieval Measures the whole distributed retrieval system end to end, covering recall, latency, and downstream answer quality with benchmarks built to test retrieval and retrieval-augmented generation.
Read the nine sections in order and you will hold a working model of retrieval as a distributed pipeline: Section 25.1 frames RAG as a distributed system, Section 25.2 embeds a corpus across a fleet, Section 25.3 stores the vectors in a database built for the task, Section 25.4 searches them approximately with HNSW, IVF, and product quantization, Section 25.5 shards and replicates the index across nodes, Section 25.6 fuses dense and sparse signals, Section 25.7 reranks the top candidates in stages, Section 25.8 caches the popular queries, and Section 25.9 evaluates the whole system. The thread to watch is two parts of the book reappearing at once: the streaming data pipelines of Part II return as the embedding and indexing stages, and the serving fleet of Chapter 24 becomes the consumer that reads the passages this pipeline returns.
What's Next?
This chapter built the retrieval layer that feeds context into a serving stack, a distributed pipeline that embeds, indexes, searches, fuses, reranks, and caches a corpus too large for any single machine. The next chapter steps back from any single component and asks how the whole apparatus is run in production over time. Chapter 26: MLOps for Distributed AI turns to the operational lifecycle that wraps around everything Part V has built: how distributed models and indexes are versioned, deployed, monitored, and updated; how data and embeddings are refreshed without taking the service down; and how a fleet of training, serving, and retrieval systems is kept healthy, observable, and reproducible. Where this chapter asked how to search a corpus at scale, Chapter 26 asks how to operate the entire distributed AI system that the corpus, the index, and the model belong to. The retrieval pipeline developed here becomes one of the production workloads that MLOps must keep running. Read it next, and watch the components of Part V become a system that lives and changes in production.
Bibliography & Further Reading
Approximate Nearest Neighbor Search
Malkov, Y. A., Yashunin, D. A. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs." arXiv:1603.09320, 2016 (IEEE TPAMI 2020). arxiv.org/abs/1603.09320
The HNSW paper that introduced the hierarchical navigable small-world graph index, the dominant in-memory approximate nearest-neighbor algorithm and a core reference for Section 25.4.
Jégou, H., Douze, M., Schmid, C. "Product Quantization for Nearest Neighbor Search." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. hal.inria.fr/inria-00514462
The paper that introduced product quantization, the compression scheme that makes billion-scale vector search fit in memory, the memory-side foundation of the IVF and PQ methods in Section 25.4.
Subramanya, S. J., Devvrit, Kadekodi, R., Krishnaswamy, R., Simhadri, H. V. "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node." Advances in Neural Information Processing Systems (NeurIPS), 2019. papers.nips.cc
The graph index that scales nearest-neighbor search to a billion points using SSD-resident storage, the reference for index structures that exceed RAM in Sections 25.4 and 25.5.
Vector Search Systems and Databases
Johnson, J., Douze, M., Jégou, H. "Billion-scale Similarity Search with GPUs." arXiv:1702.08734, 2017 (IEEE Transactions on Big Data 2021). arxiv.org/abs/1702.08734
The FAISS paper that built GPU-accelerated billion-scale similarity search, the canonical library behind much of the approximate nearest-neighbor and indexing material in Sections 25.4 and 25.5.
Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., et al. "Milvus: A Purpose-Built Vector Data Management System." Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, 2021. dl.acm.org/doi/10.1145/3448016.3457550
The system paper for Milvus, a distributed vector database with sharding, replication, and a serving layer, the primary reference for the vector databases of Section 25.3.
Retrieval-Augmented Generation and Dense Retrieval
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401, 2020 (NeurIPS 2020). arxiv.org/abs/2005.11401
The paper that introduced retrieval-augmented generation, coupling a dense retriever to a generator, the foundational reference for the pipeline framed in Section 25.1.
Karpukhin, V., OÄŸuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W. "Dense Passage Retrieval for Open-Domain Question Answering." arXiv:2004.04906, 2020 (EMNLP 2020). arxiv.org/abs/2004.04906
The DPR paper that established dual-encoder dense retrieval as a strong alternative to lexical search, the reference for the embedding-based retrieval of Sections 25.2 and 25.6.
Khattab, O., Zaharia, M. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." arXiv:2004.12832, 2020 (SIGIR 2020). arxiv.org/abs/2004.12832
The late-interaction retrieval model that scores query and document token vectors, a key reference for the multi-stage retrieval and reranking of Section 25.7.
Formal, T., Piwowarski, B., Clinchant, S. "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking." arXiv:2107.05720, 2021 (SIGIR 2021). arxiv.org/abs/2107.05720
The learned sparse retrieval model that bridges lexical and neural signals, the reference for the sparse side of the distributed hybrid search of Section 25.6.
Evaluation Benchmarks
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I. "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663, 2021 (NeurIPS 2021 Datasets and Benchmarks). arxiv.org/abs/2104.08663
The heterogeneous zero-shot retrieval benchmark spanning many domains, a primary reference for the retrieval evaluation of Section 25.9.
Muennighoff, N., Tazi, N., Magne, L., Reimers, N. "MTEB: Massive Text Embedding Benchmark." arXiv:2210.07316, 2022 (EACL 2023). arxiv.org/abs/2210.07316
The broad benchmark for text-embedding models across retrieval and many other tasks, the reference for choosing the encoders that feed the pipeline of Section 25.2 and evaluating them in Section 25.9.
Es, S., James, J., Espinosa-Anke, L., Schockaert, S. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217, 2023. arxiv.org/abs/2309.15217
The framework for automated, reference-free evaluation of retrieval-augmented generation, the reference for the end-to-end answer-quality measurement of Section 25.9.