Chapter 36: Web-Scale Text Processing and Distributed RAG

"They built me to answer questions, but first they sent me to read the entire internet, twice, and remember only the parts worth retrieving. I have opinions now about deduplication that I did not have before."
A Retrieval Index That Has Seen the Whole Web

Big Picture

The seven parts before this one each taught one axis of distribution in isolation; this chapter is the first place the book wires all of them into a single system that has to work end to end. A web-scale Retrieval-Augmented Generation pipeline is the ideal opening case study precisely because it refuses to fit in any one part: it distributes the data when it crawls and cleans billions of pages, distributes inference compute when it embeds them, distributes the model and index when it shards a vector store too large for any one machine, distributes inference again when it serves generation across a fleet, and coordinates the whole assembly so the pieces act as one. Retrieval-Augmented Generation, introduced as a serving technique in Chapter 25, becomes here a full distributed system with a data side and a serving side that meet at the index. The through-line of Part VIII begins with this chapter: stop building parts, start assembling them, and watch one realistic system put nearly every earlier chapter to work. By the end you will be able to read a production RAG stack as a composition of the book's primitives rather than as a single opaque service, and to reason about where each axis binds and what it costs.

Chapter Overview

Part VIII assembles the whole book into end-to-end systems, and this chapter takes the first and most demanding assembly: a Retrieval-Augmented Generation system that draws its knowledge from the open web. A small RAG demo fits on a laptop, embed a few thousand documents, drop them in a single index, and let a language model condition its answers on the nearest neighbours. Scale the corpus to the web and every step of that demo breaks against a resource ceiling, and breaks against a different one each time. The crawl outgrows the disk of one machine; the cleaning and deduplication outgrow its memory; the embedding outgrows the throughput of one accelerator; the index outgrows the address space of one process; the serving outgrows the request budget of one server. Each ceiling, hit independently, forces the same move this book has taught from Chapter 1 onward: split the work across machines, move only the information that must move, and recombine it correctly. The result is a single pipeline that touches all six axes of distribution at once.

The chapter is organized as the pipeline runs, from raw web to answered query. Section 36.1 defines the problem and fixes the requirements: what a web-scale RAG system must ingest, index, and serve, and the latency, freshness, and quality targets that constrain every later choice. Section 36.2 builds the distributed crawler, a politeness-aware, fault-tolerant fleet that fetches the corpus and is itself a distributed-systems exercise in coordination, deduplication of work, and back-pressure. Section 36.3 takes the raw pages and distributes the cleaning and near-duplicate removal across the cluster, where MinHash and locality-sensitive hashing turn an intractable all-pairs comparison into a sharded, shuffle-bound job in the lineage of Chapter 6. Section 36.4 turns the cleaned corpus into a sharded lexical inverted index built as a MapReduce job, partitioning the posting lists across nodes so that no single machine holds the whole index.

The middle of the chapter is the compute that fills the index and the retrieval that reads it. Section 36.5 distributes embedding generation, the single most compute-heavy stage, as an embarrassingly parallel batch-inference job across a fleet of accelerators, the data-parallel inference cousin of the training in Chapter 15. Section 36.6 builds sharded retrieval and ranking on top of the index of Section 36.4: scatter the query to every shard, gather the top candidates, and rank them with a fusion of dense similarity and a sparse signal like BM25, the scatter-gather pattern of Chapter 25 applied at corpus scale. Section 36.7 integrates retrieval with generation across a serving fleet, threading the retrieved context into a distributed LLM serving stack from Chapter 24 so that the model's answer is grounded in what the index returned.

The final stretch measures the system and points past it. Section 36.8 evaluates the assembled pipeline end to end: retrieval quality, answer faithfulness and groundedness, and the throughput, latency, and cost metrics of Chapter 1 applied to a system with both a data side and a serving side. Section 36.9 closes with a project extension that hands the reader the levers, scaling the corpus, swapping the index, hardening the freshness loop, so the case study becomes something to build rather than only to read. Read in order, the nine sections make one argument that the rest of Part VIII repeats in other domains: a real distributed AI system is not the application of one axis but the composition of all of them, and the engineering is in making the seams hold.

Prerequisites

This chapter is a synthesis, so it assumes the parts it composes rather than reteaching them. From Chapter 6 and Chapter 7 it assumes the MapReduce and Spark model of a sharded, shuffle-bound batch job, the substrate on which the crawling, cleaning, and deduplication of Sections 36.2 and 36.3 run. From Chapter 8 it assumes distributed storage and partitioned data loading, which feed the corpus through every stage. From Chapter 15 it assumes data-parallel execution, reused in Section 36.5 as parallel batch inference rather than training. From Chapter 24 it assumes distributed LLM serving, the fleet that Section 36.7 threads retrieved context into, and from Chapter 25 it assumes distributed retrieval and vector search, the sharded index and scatter-gather retrieval this chapter scales to the web. From Chapter 32 it assumes the orchestration vocabulary, directed pipelines, retries, and back-pressure, that ties the stages into one cluster-wide workflow. A reader comfortable with those threads can read this chapter as the place where they finally run together on one system.

Learning Objectives

Decompose a web-scale RAG system into its stages and map each stage onto the axis of distribution it exercises, recognizing the single pipeline as a composition of every earlier part.
Design a distributed, politeness-aware crawler with coordination, work deduplication, and back-pressure, and reason about its fault tolerance as a distributed-systems problem.
Distribute cleaning and near-duplicate removal across a cluster using MinHash and locality-sensitive hashing to avoid the intractable all-pairs comparison.
Build a sharded retrieval index and parallel embedding-generation pipeline, partitioning both the index and the compute so that no single machine holds the whole corpus.
Assemble sharded retrieval, dense-plus-sparse ranking, and distributed generation into one serving path, and trace a query end to end through scatter, gather, rank, and generate.
Evaluate the full pipeline on retrieval quality, answer groundedness, and the throughput, latency, and cost metrics that price a system with both a data side and a serving side.

The One Idea to Carry Out of This Chapter

If you keep one thing from this chapter, keep this: a web-scale RAG system is the whole book in one pipeline, every stage is a different axis of distribution, and the engineering is in the seams that connect them, not in any single stage. The crawl distributes the data, the embedding distributes inference compute, the sharded index distributes the model, retrieval and generation distribute inference again, and the orchestration coordinates the cluster so the five moves read as one system. No stage is novel on its own; each is an application of a primitive the book already built. What is new, and what every case study in Part VIII teaches, is composition: the crawler's output must match the cleaner's input, the embedder's vectors must match the index's shards, the index's results must reach the serving fleet within the latency budget, and a failure in any stage must not silently corrupt the next. Read forward, the chapter walks that pipeline stage by stage. Read as a question, it is a single checklist you carry into every real system: which axis does each part exercise, and does the seam between two parts hold under load and under failure? The roadmap below walks the nine sections that build that pipeline end to end.

Chapter Roadmap

36.1 Problem Definition What a web-scale RAG system must ingest, index, and serve, and the latency, freshness, and quality requirements that constrain every later design choice and name the axes the rest of the chapter distributes.
36.2 Distributed Crawling A politeness-aware, fault-tolerant crawler fleet that fetches the corpus, treated as a distributed-systems exercise in coordination, work deduplication, and back-pressure rather than a single greedy fetcher.
36.3 Distributed Cleaning and Deduplication Cleaning raw pages and removing near-duplicates across the cluster with MinHash and locality-sensitive hashing, turning an intractable all-pairs comparison into a sharded, shuffle-bound batch job.
36.4 Distributed Indexing Building the lexical inverted index over the cleaned corpus as a MapReduce job, choosing document- versus term-partitioned sharding, compressing posting lists, and scoring with BM25 so no single machine holds the whole index.
36.5 Distributed Embedding Generation The most compute-heavy stage, run as an embarrassingly parallel batch-inference job across a fleet of accelerators, the data-parallel cousin of distributed training applied to inference at corpus scale.
36.6 Sharded Retrieval and Ranking Scatter the query to every shard, gather the top candidates, and rank them by fusing dense similarity with a sparse signal such as BM25, the scatter-gather pattern of Chapter 25 applied across the corpus.
36.7 RAG Integration Across a Fleet Threading the retrieved context into a distributed LLM serving stack so generation is grounded in what the index returned, integrating the data side and the serving side into one query path.
36.8 Evaluation Measuring the assembled pipeline end to end: retrieval quality, answer faithfulness and groundedness, and the throughput, latency, and cost metrics that price a system with both a data side and a serving side.
36.9 Project Extension The levers handed to the reader: scaling the corpus, swapping the index, and hardening the freshness loop, turning the case study from something to read into something to build and defend.

Read the nine sections in order and you will have traced one realistic system from raw web to answered query: Section 36.1 fixes the requirements, Sections 36.2 through 36.4 build the data side that ingests and indexes the corpus, Sections 36.5 through 36.7 build the compute and serving side that embeds, retrieves, and generates, and Sections 36.8 through 36.9 measure the result and hand it to you to extend. The thread to watch runs back to Chapter 25: the sharded index and scatter-gather retrieval introduced there as a serving component return here as the spine of a complete system, which is why the index in Section 36.4 is the technical hinge where the data side and the serving side of this chapter meet.

What's Next?

This chapter built a distributed AI system on the most permissive data the world offers: the open web, where the corpus is public, copies are cheap, and the only real constraint is scale. Chapter 37: Federated Medical AI inverts that assumption. The next case study moves from open web data to privacy-constrained federated data, where the corpus cannot be crawled, copied, or even centralized, because it lives in hospitals that may not share a patient record across a wall. The crawling, deduplication, and central indexing that this chapter took for granted become impossible, and the federated and decentralized learning of Chapter 14 together with the differential privacy of Chapter 35 become the only way to build intelligence from data that must never leave its owner. Where this chapter distributed a system to reach scale, the next distributes one to respect a constraint. Read it next to see the same composition discipline tested against a problem where the easy moves are forbidden.

Bibliography & Further Reading

Retrieval & RAG

Lewis, P., Perez, E., Piktus, A., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. arXiv:2005.11401

The paper that named RAG: a parametric generator conditioned on documents pulled from a non-parametric retriever, trained end to end. The architecture this entire chapter scales out to the web.

📄 Paper

Karpukhin, V., Oguz, B., Min, S., et al. "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP 2020. arXiv:2004.04906

Showed that a learned dual-encoder embedding beats sparse retrieval for open-domain question answering; the dense-retrieval half of the ranking fusion in Section 36.6.

📄 Paper

Robertson, S., Zaragoza, H. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval 3(4), 2009. city.ac.uk

The definitive treatment of BM25, the sparse term-weighting score that still anchors hybrid retrieval; the sparse half of the dense-plus-sparse ranking in Section 36.6.

📄 Paper

Es, S., James, J., Espinosa-Anke, L., Schockaert, S. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." 2023. arXiv:2309.15217

A reference-free framework scoring faithfulness, answer relevance, and context relevance for RAG; the groundedness metrics the end-to-end evaluation of Section 36.8 leans on.

📄 Paper

Deduplication & Indexing

Broder, A. Z. "On the Resemblance and Containment of Documents." Compression and Complexity of Sequences (SEQUENCES) 1997. cs.princeton.edu

Introduces MinHash and the shingling estimate of Jaccard resemblance that makes near-duplicate detection tractable; the core primitive of the distributed deduplication in Section 36.3.

📄 Paper

Indyk, P., Motwani, R. "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality." STOC 1998. dl.acm.org

The founding paper of locality-sensitive hashing, the idea that similar items collide in hash buckets; the bucketing that turns all-pairs deduplication and approximate retrieval into a sharded job.

📄 Paper

Malkov, Y. A., Yashunin, D. A. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (HNSW)." IEEE TPAMI, 2018. arXiv:1603.09320

The graph-based ANN index behind most modern vector stores, giving logarithmic-scale search at high recall; the per-shard index structure of the sharded vector retrieval in Section 36.6.

📄 Paper

Lee, K., Ippolito, D., Nystrom, A., et al. "Deduplicating Training Data Makes Language Models Better." ACL 2022. arXiv:2107.06499

Demonstrates that removing near-duplicate text improves model quality and reduces memorization; the empirical case for why the deduplication of Section 36.3 is not optional but central.

📄 Paper

Systems

Dean, J., Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004. research.google

The programming model behind every batch stage of this pipeline; its map-shuffle-reduce shape is exactly the structure of the distributed crawl, clean, and deduplicate of Sections 36.2 and 36.3.

📄 Paper

Johnson, J., Douze, M., Jégou, H. "Billion-Scale Similarity Search with GPUs (FAISS)." IEEE Transactions on Big Data, 2019. arXiv:1702.08734

The library and algorithms that make billion-vector similarity search practical on GPUs; the engineering backbone of the vector index and retrieval in Sections 36.5 and 36.6.

🔧 Tool

Kwon, W., Li, Z., Zhuang, S., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)." SOSP 2023. arXiv:2309.06180

PagedAttention and the vLLM serving engine that make high-throughput generation affordable; the serving substrate the fleet of Section 36.7 threads retrieved context into.

🔧 Tool