Part V: Distributed Inference and Serving
Chapter 25: Distributed Retrieval and Vector Search

Distributed Retrieval and Vector Search

Chapter 24 served one large language model as a distributed computation across many machines. This chapter gives that model something to read. A retrieval-augmented system answers a question not from weights alone but from a corpus too large to fit on any single machine: billions of documents embedded into vectors, indexed for fast nearest-neighbor search, and queried in milliseconds on the critical path of a generation request. That corpus does not live in one place, and neither does the work of searching it. The unit of the problem is no longer one model split across devices; it is a pipeline of stages, each of which is itself distributed: an embedding pass that encodes a stream of documents across a fleet of workers, a vector index sharded across nodes so that no single machine holds the whole corpus, an approximate nearest-neighbor search that runs in parallel over those shards and merges the partial results, a hybrid layer that fuses dense and sparse signals, a reranker that refines the top candidates, and a cache that absorbs the queries everyone asks. The nine sections build this retrieval pipeline as a distributed system, from the moment a document enters the embedding pass to the moment a ranked, evaluated set of passages is handed back to the serving stack of Chapter 24. The retrieval quality questions of the broader RAG literature appear only where they shape the distribution; the subject here is the distribution itself, because that is what makes retrieval scale.

Conceptual illustration for Chapter 25: Distributed Retrieval and Vector Search

"I hold one shard of a billion-vector index. I do not know what the other shards contain, I only know that when a query arrives I must search my slice fast, return my best guesses, and trust the merge step to sort us all out. I am a small library that has made peace with not being the whole library."

A Shard of an Index That Has Never Seen the Full Corpus
Big Picture

This is the chapter where retrieval-augmented generation stops being a single function call and becomes a distributed pipeline: a corpus too large for one machine, embedded across a fleet, indexed in shards, searched in parallel by approximate nearest-neighbor algorithms, fused across dense and sparse signals, reranked in stages, and cached, all fast enough to sit on the critical path of a generation request. The popular picture of RAG hides every hard part: "embed the query, look up the nearest documents, paste them into the prompt." Each of those steps dissolves into a distributed-systems problem the moment the corpus grows past what one node can hold. Embedding billions of documents is a streaming batch job across many workers, the same shape as the data-processing pipelines of Part II. The index that makes lookup fast cannot live on one machine, so it is sharded across nodes and replicated for throughput and availability, the same sharding-and-replication tension that runs through the whole book. The lookup itself is approximate, because exact nearest-neighbor search over billions of high-dimensional vectors is too slow, so algorithms like HNSW, IVF, and product quantization trade a controlled loss of recall for orders-of-magnitude speed, and each runs in parallel across shards whose partial results must be merged. Dense vector search is fused with sparse lexical search to cover what embeddings miss, candidates are refined by a heavier reranker in a second stage, and a cache layer soaks up the repeated and popular queries so the index is not asked the same question twice. The distribution is the point: this chapter is the general theory of searching a corpus that no single machine can hold, returning the right passages quickly enough that a language model can read them before it answers.

Chapter Overview

This chapter takes the retrieval layer that the serving stack of Chapter 24 depends on and builds it as a distributed system from the ground up. A retrieval-augmented application is only as scalable as its retrieval, and retrieval at scale is not a library call but a pipeline of distributed stages: documents must be embedded in bulk, the resulting vectors must be stored in an index that no single machine can hold, that index must be searched approximately and in parallel, and the results must be fused, reranked, cached, and evaluated. The nine sections walk the pipeline in the order a query and a document each travel through it, treating retrieval quality only where it changes how the work is distributed.

The sections fall into three movements. The first establishes the pipeline and the data that flows into it: Section 25.1 frames retrieval-augmented generation as a distributed system rather than a single call, and Section 25.2 builds the distributed embedding pipeline that encodes a corpus into vectors across a fleet of workers. The second movement is the index and the search over it: Section 25.3 introduces vector databases as the systems that store and serve those vectors, Section 25.4 develops the approximate nearest-neighbor algorithms that make search fast, and Section 25.5 shards and replicates the index so it spans machines. The third movement is the layers that wrap around the index: Section 25.6 fuses dense and sparse retrieval into distributed hybrid search, Section 25.7 adds multi-stage retrieval and distributed reranking, Section 25.8 caches retrieval results across the fleet, and Section 25.9 evaluates the whole distributed system end to end.

Read in order, the nine sections take you from "RAG is just a lookup" to a working mental model of retrieval as a distributed pipeline: encode a streaming corpus into vectors across many workers, store those vectors in a database built for nearest-neighbor queries, search them approximately with HNSW, IVF, and product quantization, shard the index across nodes and replicate it for throughput and availability, fan a query out across the shards and merge their partial results, fuse dense and sparse signals where each alone falls short, refine the top candidates with a heavier reranker, cache the popular and repeated queries, and measure recall, latency, and end-to-end quality across the whole apparatus. The argument is cumulative and it extends Part V's inference arc outward: the serving fleet of Chapter 24 becomes the consumer of this retrieval system, and the data-processing patterns of Part II reappear as the embedding and indexing pipelines that feed it.

Prerequisites

This chapter assumes the classical-ML and nearest-neighbor foundations of Part III and the serving stack of the preceding chapter. From Chapter 12: Distributed Classical Machine Learning you carry the nearest-neighbor and similarity-search vocabulary this chapter scales up: what an approximate nearest-neighbor query is, how high-dimensional similarity search behaves, and how classical methods partition data across machines, all of which become the index and search stages here. From Chapter 24: Distributed LLM Serving you carry the consumer of this retrieval system: the prefill phase and prefix cache into which retrieved passages flow, and the latency budget that forces retrieval to answer fast enough to sit on the critical path of a generation request. The chapter also leans on the distributed data-processing patterns of Part II, since the embedding pipeline of Section 25.2 is a streaming batch job, and on the sharding-and-replication tension that runs through the whole book, which Section 25.5 specializes to a vector index. Beyond these the chapter assumes comfortable Python, a working picture of how a text encoder turns a passage into a vector, and the distributed-systems vocabulary of fan-out, merge, and replication. No prior experience with a specific vector database is needed; Section 25.1 builds the why-retrieval-is-distributed argument from the ground up before any system appears.

Learning Objectives

Remember the Chapter as One Sentence

If you keep one thing from this chapter, keep this: distributed retrieval turns the lookup hidden inside retrieval-augmented generation into a pipeline that spans many machines, embedding a corpus into vectors across a fleet, sharding and replicating the index so no single node holds it all, searching it approximately and in parallel with HNSW, IVF, and product quantization, fusing dense and sparse signals, reranking in stages, and caching the popular queries, all fast enough to feed the serving stack of Chapter 24. Read forward, the sections build that pipeline in the order its parts arrive: first why retrieval is distributed, then the distributed embedding pipeline, then the vector database, then approximate nearest-neighbor search, then index sharding and replication, then distributed hybrid search, then multi-stage retrieval and reranking, then distributed caching, and finally evaluation. Read as a question, the chapter asks of any corpus too large for one machine: how is it embedded into vectors at scale, where do those vectors live, how is the index searched without scanning every vector, how is it split across nodes and the partial results merged, how are dense and sparse signals combined, how are the top candidates refined, what can be cached, and how do you know the whole thing works. The roadmap below walks the nine sections that answer it, and the last one hands you the benchmarks that tell you whether the answer is right.

Chapter Roadmap

Read the nine sections in order and you will hold a working model of retrieval as a distributed pipeline: Section 25.1 frames RAG as a distributed system, Section 25.2 embeds a corpus across a fleet, Section 25.3 stores the vectors in a database built for the task, Section 25.4 searches them approximately with HNSW, IVF, and product quantization, Section 25.5 shards and replicates the index across nodes, Section 25.6 fuses dense and sparse signals, Section 25.7 reranks the top candidates in stages, Section 25.8 caches the popular queries, and Section 25.9 evaluates the whole system. The thread to watch is two parts of the book reappearing at once: the streaming data pipelines of Part II return as the embedding and indexing stages, and the serving fleet of Chapter 24 becomes the consumer that reads the passages this pipeline returns.

What's Next?

This chapter built the retrieval layer that feeds context into a serving stack, a distributed pipeline that embeds, indexes, searches, fuses, reranks, and caches a corpus too large for any single machine. The next chapter steps back from any single component and asks how the whole apparatus is run in production over time. Chapter 26: MLOps for Distributed AI turns to the operational lifecycle that wraps around everything Part V has built: how distributed models and indexes are versioned, deployed, monitored, and updated; how data and embeddings are refreshed without taking the service down; and how a fleet of training, serving, and retrieval systems is kept healthy, observable, and reproducible. Where this chapter asked how to search a corpus at scale, Chapter 26 asks how to operate the entire distributed AI system that the corpus, the index, and the model belong to. The retrieval pipeline developed here becomes one of the production workloads that MLOps must keep running. Read it next, and watch the components of Part V become a system that lives and changes in production.

Bibliography & Further Reading

Approximate Nearest Neighbor Search

Malkov, Y. A., Yashunin, D. A. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs." arXiv:1603.09320, 2016 (IEEE TPAMI 2020). arxiv.org/abs/1603.09320

The HNSW paper that introduced the hierarchical navigable small-world graph index, the dominant in-memory approximate nearest-neighbor algorithm and a core reference for Section 25.4.

📄 Paper

Jégou, H., Douze, M., Schmid, C. "Product Quantization for Nearest Neighbor Search." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. hal.inria.fr/inria-00514462

The paper that introduced product quantization, the compression scheme that makes billion-scale vector search fit in memory, the memory-side foundation of the IVF and PQ methods in Section 25.4.

📄 Paper

Subramanya, S. J., Devvrit, Kadekodi, R., Krishnaswamy, R., Simhadri, H. V. "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node." Advances in Neural Information Processing Systems (NeurIPS), 2019. papers.nips.cc

The graph index that scales nearest-neighbor search to a billion points using SSD-resident storage, the reference for index structures that exceed RAM in Sections 25.4 and 25.5.

📄 Paper

Vector Search Systems and Databases

Johnson, J., Douze, M., Jégou, H. "Billion-scale Similarity Search with GPUs." arXiv:1702.08734, 2017 (IEEE Transactions on Big Data 2021). arxiv.org/abs/1702.08734

The FAISS paper that built GPU-accelerated billion-scale similarity search, the canonical library behind much of the approximate nearest-neighbor and indexing material in Sections 25.4 and 25.5.

📄 Paper

Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., et al. "Milvus: A Purpose-Built Vector Data Management System." Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, 2021. dl.acm.org/doi/10.1145/3448016.3457550

The system paper for Milvus, a distributed vector database with sharding, replication, and a serving layer, the primary reference for the vector databases of Section 25.3.

📄 Paper

Retrieval-Augmented Generation and Dense Retrieval

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401, 2020 (NeurIPS 2020). arxiv.org/abs/2005.11401

The paper that introduced retrieval-augmented generation, coupling a dense retriever to a generator, the foundational reference for the pipeline framed in Section 25.1.

📄 Paper

Karpukhin, V., OÄŸuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W. "Dense Passage Retrieval for Open-Domain Question Answering." arXiv:2004.04906, 2020 (EMNLP 2020). arxiv.org/abs/2004.04906

The DPR paper that established dual-encoder dense retrieval as a strong alternative to lexical search, the reference for the embedding-based retrieval of Sections 25.2 and 25.6.

📄 Paper

Khattab, O., Zaharia, M. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." arXiv:2004.12832, 2020 (SIGIR 2020). arxiv.org/abs/2004.12832

The late-interaction retrieval model that scores query and document token vectors, a key reference for the multi-stage retrieval and reranking of Section 25.7.

📄 Paper

Formal, T., Piwowarski, B., Clinchant, S. "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking." arXiv:2107.05720, 2021 (SIGIR 2021). arxiv.org/abs/2107.05720

The learned sparse retrieval model that bridges lexical and neural signals, the reference for the sparse side of the distributed hybrid search of Section 25.6.

📄 Paper

Evaluation Benchmarks

Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I. "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663, 2021 (NeurIPS 2021 Datasets and Benchmarks). arxiv.org/abs/2104.08663

The heterogeneous zero-shot retrieval benchmark spanning many domains, a primary reference for the retrieval evaluation of Section 25.9.

📄 Paper

Muennighoff, N., Tazi, N., Magne, L., Reimers, N. "MTEB: Massive Text Embedding Benchmark." arXiv:2210.07316, 2022 (EACL 2023). arxiv.org/abs/2210.07316

The broad benchmark for text-embedding models across retrieval and many other tasks, the reference for choosing the encoders that feed the pipeline of Section 25.2 and evaluating them in Section 25.9.

📄 Paper

Es, S., James, J., Espinosa-Anke, L., Schockaert, S. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217, 2023. arxiv.org/abs/2309.15217

The framework for automated, reference-free evaluation of retrieval-augmented generation, the reference for the end-to-end answer-quality measurement of Section 25.9.

📄 Paper