Part VIII: Case Studies and Capstone Projects
Chapter 36: Web-Scale Text Processing and Distributed RAG

Distributed Crawling

"I am a crawler, politely knocking on a billion doors. I remember which ones I have already opened, I wait my turn at every house, and I never forget that someone downstream has to read everything I drag home."

A Fetcher Worker, Pacing Itself Behind a Per-Host Token Bucket
Big Picture

Every retrieval-augmented system begins with an act of data ingestion, and at web scale that act is a distributed crawl: a fleet of fetcher workers draining a shared, sharded URL frontier under strict politeness limits, deduplicating what they have already seen, and writing the harvested pages into distributed storage for the rest of the pipeline to clean, chunk, embed, and index. The crawl is not a preliminary chore; it is the source whose coverage and freshness silently bound the quality of every answer the RAG system will ever give. A document the crawler never fetched cannot be retrieved, cannot be cited, and cannot correct a hallucination. This section builds the crawl as a distributed system in its own right, names its coordination hazards, and does the throughput arithmetic that tells you how many machines a target corpus actually requires.

The previous section framed the web-scale RAG pipeline end to end: crawl, clean, chunk, embed, index, retrieve, generate. We now zoom into the first stage, because it is the one stage whose failure modes are invisible from the answer side. A retriever that returns nothing useful looks the same whether the relevant page was poorly embedded or simply never fetched, and only one of those is fixable downstream. Crawling is also the stage that most resembles the classic distributed-systems problems of this book: a shared data structure (the frontier) read and written by many workers, a partitioning of work across machines, a deduplication problem solved with a probabilistic set, and a coordination discipline (politeness) that the system must honor against external parties it does not control. Treating the crawl casually, as a script that loops over URLs, is the surest way to either miss most of the web or get every worker's IP address blocked.

Seed URLs Sharded frontier shard = hash(host) % S shard 0 queue shard 1 queue . . . shard S-1 Fetcher worker A robots cache + per-host rate limit Fetcher worker B robots cache + per-host rate limit Fetcher worker C . . . W workers Seen-URL set Bloom filter Distributed storage raw pages Downstream clean / chunk embed / index new links rejoin frontier (dedup against Bloom filter first)
Figure 36.2.1: The distributed crawl as the ingestion front of the RAG pipeline. Seed URLs enter a frontier partitioned into $S$ shards by host hash; $W$ fetcher workers pull from their shards, consult a robots cache and a per-host rate limiter before each fetch, write raw pages to distributed storage, and extract new links. Every extracted link is tested against a shared Bloom filter (the dashed loop) so already-seen URLs are dropped before they re-enter the frontier. Storage then feeds the clean/chunk/embed/index stages of Section 36.3.

1. The Frontier Is a Distributed Queue Beginner

The heart of a crawler is the frontier: the set of URLs discovered but not yet fetched. On one machine the frontier is just a queue; at web scale it is a distributed data structure that many workers consume concurrently, and almost every design decision in a crawler is really a decision about how that queue is partitioned, ordered, and kept consistent. The frontier holds billions of entries, far more than one machine's memory, so it must be sharded across nodes; and because the entries flow in faster than any single coordinator could hand them out one at a time, workers pull batches from their assigned shards rather than from a single global head.

The natural partition key is the host. We assign each URL to shard $\text{hash}(\text{host}) \bmod S$, so every URL belonging to example.com lands in the same shard regardless of which worker discovered it. This host-based sharding, the same partition-by-key idea introduced as a coordination concept in Chapter 2 and used for data shards in Chapter 8, is what makes politeness tractable: if one shard owns all of a host's URLs, then the worker draining that shard is the single place where that host's request rate is enforced, and no cross-worker coordination is needed to avoid hammering the host. The frontier is therefore not partitioned for storage reasons alone; the partition key is chosen so that the hardest coordination constraint, per-host politeness, becomes a local property of a single shard.

Key Insight: Partition by Host So Politeness Becomes Local

Crawling has one global constraint that cannot be relaxed: you must not overload any single host. If URLs were sharded by a hash of the full URL, every host's pages would be scattered across all workers, and enforcing a per-host rate limit would require global coordination on every fetch. Partitioning the frontier by host hash collapses that global constraint into a local one: each host lives in exactly one shard, drained by one worker, which can throttle that host with a simple in-process token bucket. The right partition key turns a distributed-coordination problem into a single-machine bookkeeping problem.

2. Politeness Is the Constraint That Shapes Everything Beginner

A crawler that fetches as fast as its workers allow is a denial-of-service attack with good intentions. Two disciplines keep it civil. The first is robots.txt: before fetching any URL on a host, the crawler retrieves and caches that host's robots file, which declares the paths the site owner permits or forbids and may specify a crawl delay. Honoring it is both an ethical and a practical necessity, because the alternative is a blocked IP range and a corpus with holes where the most valuable, most actively defended sites used to be. The second is rate limiting: even on permitted paths, the crawler spaces its requests to a single host, typically allowing only one in-flight request and a fixed gap between requests, so that the host sees a trickle rather than a flood.

These two disciplines interact with sharding to set the crawler's real throughput. A worker draining a shard cannot simply fetch as fast as its CPU and network allow; it is rate-limited per host, so its useful throughput depends on how many distinct hosts live in its shard. A shard dominated by one enormous host (a content farm, a wiki with millions of pages) is throttled to that one host's polite rate no matter how much hardware you give the worker, while a shard spread across thousands of small hosts can keep many fetches in flight concurrently. This is the crawler's version of the load-skew problem that haunts every partitioned system in this book, and it is why the host count per shard, not the URL count, is the quantity to watch.

Practical Example: The Crawl That Throttled Itself to a Trickle

Who: A data engineer building the ingestion corpus for an internal documentation RAG assistant.

Situation: The crawl targeted twelve million pages across roughly forty thousand hosts, with a fleet of sixty-four fetcher workers provisioned to finish overnight.

Problem: After twelve hours the crawl was only forty percent done, and most workers sat idle while a handful ran flat out.

Dilemma: Add more workers, which the idle majority suggested was pointless, or rebalance the frontier, which meant re-sharding mid-crawl.

Decision: They measured host-per-shard distribution and found three hosts (a large knowledge base and two forums) held sixty percent of all URLs, so three shards were politeness-bound to a crawl while the rest had long since drained.

How: They split the three hot hosts across multiple sub-shards keyed by a path prefix, so several workers could fetch different sections of the same giant host while still respecting a shared per-host token bucket that they moved into a small coordination service.

Result: Worker utilization rose from under thirty percent to over eighty, and the crawl finished in five hours; adding workers before the rebalance would have changed nothing, because the binding constraint was host skew, not worker count.

Lesson: In a polite crawl the limiting resource is distinct hosts, not machines. Diagnose the partition before you scale the fleet.

3. Deduplicating the URL Space with a Bloom Filter Intermediate

The web is densely cross-linked, so the same URL is discovered again and again from different pages. If the crawler re-enqueues every discovered link, the frontier explodes and workers waste their politeness budget re-fetching pages they already have. The crawler therefore maintains a set of already-seen URLs and tests every newly extracted link against it before enqueueing. At web scale this set holds tens of billions of URLs, far too many to store as exact strings in memory, so crawlers use a Bloom filter: a bit array with $k$ hash functions that answers set membership in constant time and constant memory per element, at the price of a tunable false-positive rate.

The trade is worth understanding precisely because it sets a memory budget. A Bloom filter of $m$ bits holding $n$ elements with $k$ hash functions has false-positive probability

$$p \approx \left(1 - e^{-kn/m}\right)^{k},$$

minimized by choosing $k = (m/n)\ln 2$ hash functions, which gives the clean relation $p \approx (0.6185)^{\,m/n}$. The crucial consequence is that the memory cost is a function of the false-positive rate you are willing to tolerate and the number of elements, not of the length of the URLs: at the optimal $k$, holding $p$ fixed costs roughly $m/n = -\log_2 p / \ln 2 \approx 1.44 \log_2(1/p)$ bits per URL. Ten billion URLs at a one-percent false-positive rate cost about $1.44 \cdot \log_2(100) \approx 9.6$ bits each, near twelve gigabytes, which is why the seen-set is itself distributed: sharded by the same host hash as the frontier, so the worker that owns a host also owns the dedup decisions for that host.

A false positive in this filter is benign in a specific way: it causes the crawler to wrongly believe it has already seen a URL and skip it, so a small false-positive rate means a small fraction of pages silently never fetched. That is a coverage loss, and coverage loss is exactly the kind of downstream-invisible damage this section warns about, so the budget is chosen with the RAG corpus in mind, not as a generic data-structures default. The code below builds a Bloom filter for seen-URL dedup, measures its actual false-positive rate against the theory at four memory budgets, and then confirms that host-hash partitioning balances the frontier across shards.

import hashlib, math, random

class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.m = n_bits
        self.k = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _idx(self, item, i):                       # i-th hash via a personalized BLAKE2b
        h = hashlib.blake2b(item.encode(), digest_size=16,
                            person=i.to_bytes(2, "little")).digest()
        return int.from_bytes(h, "little") % self.m

    def add(self, item):
        for i in range(self.k):
            j = self._idx(item, i)
            self.bits[j >> 3] |= (1 << (j & 7))   # set the bit

    def __contains__(self, item):
        for i in range(self.k):
            j = self._idx(item, i)
            if not (self.bits[j >> 3] & (1 << (j & 7))):
                return False                       # a clear bit proves "never seen"
        return True

n = 200_000                                        # distinct URLs already crawled
rng = random.Random(0)
seen  = [f"https://site{rng.randrange(50_000)}.com/page/{i}" for i in range(n)]
probe = [f"https://other{i}.net/path/{i}" for i in range(200_000)]   # never inserted

print(f"{'bits/url':>9} {'mem (MiB)':>10} {'k':>3} {'measured FPR':>13} {'theory FPR':>11}")
for bits_per_url in (4, 8, 12, 16):
    m = bits_per_url * n
    k = max(1, round((m / n) * math.log(2)))       # optimal hash count
    bf = BloomFilter(m, k)
    for u in seen:
        bf.add(u)
    fp = sum(1 for u in probe if u in bf)          # false positives among never-seen URLs
    measured = fp / len(probe)
    theory   = (1 - math.exp(-k * n / m)) ** k
    print(f"{bits_per_url:>9} {m/8/2**20:>10.2f} {k:>3} {measured:>13.5f} {theory:>11.5f}")

def host_of(url):  return url.split("//", 1)[1].split("/", 1)[0]
def shard_of(host, S):
    h = hashlib.blake2b(host.encode(), digest_size=8).digest()
    return int.from_bytes(h, "little") % S         # same partition key as the frontier

S = 16
counts = [0] * S
for u in seen:
    counts[shard_of(host_of(u), S)] += 1
mean, maxc = sum(counts) / S, max(counts)
print(f"\nshards={S}  urls={n}  mean/shard={mean:.0f}"
      f"  max/shard={maxc}  imbalance(max/mean)={maxc/mean:.3f}")
Code 36.2.1: A seen-URL Bloom filter and a host-hash sharder, the two data structures at the core of Figure 36.2.1. The filter is measured against the closed-form false-positive rate at four memory budgets; the sharder confirms that hashing on host spreads URLs evenly across shards.
 bits/url  mem (MiB)   k  measured FPR  theory FPR
        4       0.10   3       0.14708     0.14689
        8       0.19   6       0.02184     0.02158
       12       0.29   8       0.00287     0.00314
       16       0.38  11       0.00044     0.00046

shards=16  urls=200000  mean/shard=12500  max/shard=12970  imbalance(max/mean)=1.038
Output 36.2.1: The measured false-positive rate tracks the theory to within rounding at every budget, and each doubling of memory cuts the rate by an order of magnitude: eight bits per URL buys roughly two percent, sixteen bits buys under one in two thousand. The host-hash partition is near-perfectly balanced (max shard only 3.8 percent above the mean), confirming that politeness can be enforced shard-locally without creating load skew from the partition itself.

4. The Throughput Arithmetic of a Crawl Intermediate

How many machines does a target crawl need? The aggregate fetch rate of the fleet is the per-worker rate times the worker count, capped by the politeness ceiling. If each of $W$ workers sustains $r$ fetches per second, the fleet's nominal rate is

$$R = W \cdot r,$$

and the wall-clock time to fetch a corpus of $N$ pages is $T = N / R = N / (W r)$. A concrete reading: at $r = 50$ pages per second per worker, a fleet of $W = 200$ workers fetches $R = 10{,}000$ pages per second, and a corpus of $N = 10^{9}$ pages takes $T = 10^{9} / 10^{4} = 10^{5}$ seconds, a little over a day. That arithmetic is necessary but not sufficient, because politeness imposes a second, independent ceiling. If the crawler allows at most one request every $\delta$ seconds to a host, then a single host yields at most $1/\delta$ pages per second, and a corpus concentrated on $H$ distinct hosts cannot be fetched faster than

$$R_{\text{polite}} = \frac{H}{\delta},$$

regardless of how many workers you provision. The achievable rate is the smaller of the two, $R_{\text{eff}} = \min(W r,\; H/\delta)$. With a one-second politeness gap, fetching ten thousand pages per second requires drawing from at least ten thousand distinct hosts at any moment; a crawl of a few large hosts is host-bound, and adding workers past that point only raises cost. This is the same lesson as the practical example, now as an inequality you can evaluate before provisioning: count your hosts, not just your pages.

Thesis Thread: The Ingestion Front Is Where Coverage Is Won or Lost

The crawl is the first place the book's thesis bites in this case study: the corpus quality is set by a distributed system, not a single machine, and the binding ceiling is a coordination constraint (politeness) rather than raw compute. Every later stage of the RAG pipeline, the cleaning and chunking of Section 36.3, the distributed embedding and indexing it feeds, and ultimately the retrieval the model relies on, operates only on documents this stage chose to fetch. Coverage and freshness decided here propagate forward and cannot be recovered downstream, which is why the crawl earns the same architectural care as the training and serving systems in the rest of the book.

5. Coordination Hazards and Continuous Recrawl Advanced

A shared frontier read and written by hundreds of workers raises every consistency question of Chapter 2. Two workers can extract the same new URL at the same instant and both pass the Bloom-filter test before either inserts it, briefly enqueueing a duplicate; the seen-set therefore tolerates a small rate of duplicate fetches rather than serializing every insertion behind a global lock, trading exactness for throughput in exactly the way distributed systems usually must. A worker can crash mid-shard, so the frontier must record which URLs are leased to which worker and re-lease them on a timeout, the same lease-and-recover discipline that makes MapReduce re-execute failed tasks in Chapter 6. And the robots cache must be coherent enough that a forbidden path is not fetched by one worker just because another worker's cache had not yet refreshed.

Once the harvested pages land in distributed storage, the rest of the ingestion pipeline reads them as a batch: extracting text, deduplicating near-identical documents, and computing the cleaned corpus is a textbook MapReduce job, mapping over raw pages and reducing by content fingerprint, exactly the batch model of Chapter 6, and the raw and cleaned crawls live in the distributed object store of Chapter 8. A crawl is never finished, though. The web changes, and a corpus frozen at one crawl grows stale, so production systems run a continuous recrawl: a stream of re-fetch tasks prioritized by how often each page changes, processed by the stream-processing machinery of Chapter 9. Freshness is the temporal half of coverage, and for a RAG system answering questions about a moving world it is just as load-bearing as breadth.

Library Shortcut: Scrapy Gives You the Frontier, Politeness, and Dedup For Free

The frontier queue, the per-host throttle, the robots check, and the seen-URL filter are not things you should hand-roll for a production crawl. Scrapy provides all four behind a handful of settings and a spider that yields requests; the scheduler maintains the frontier, the autothrottle and download-delay machinery enforce politeness, the robots middleware reads robots.txt, and a dupefilter (Bloom-backed in the scrapy-bloom and Frontera ecosystems) handles URL dedup:

import scrapy

class CorpusSpider(scrapy.Spider):
    name = "corpus"
    start_urls = ["https://example.com/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,            # read and honor robots.txt
        "AUTOTHROTTLE_ENABLED": True,      # adapt the per-host rate to latency
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        "DOWNLOAD_DELAY": 1.0,             # the politeness gap, delta = 1 s
        "DUPEFILTER_CLASS":                # seen-URL dedup (Bloom-backed)
            "scrapy_bloom.BloomDupeFilter",
    }

    def parse(self, response):
        yield {"url": response.url, "html": response.text}   # to distributed storage
        for href in response.css("a::attr(href)").getall():  # discovered links
            yield response.follow(href, self.parse)           # re-enter the frontier
Code 36.2.2: A polite, deduplicating crawl in roughly a dozen lines. The four hand-built mechanisms of Sections 1 to 3 (frontier, per-host rate limit, robots check, Bloom dedup) collapse into Scrapy settings; for multi-machine crawls, Frontera distributes the same frontier across workers with a Kafka or HBase backend.
Research Frontier: Crawling for Retrieval Quality, Not Just Coverage (2024 to 2026)

As the web fills with machine-generated text, raw coverage is no longer the only goal; the frontier of crawling research has shifted toward fetching what a retrieval system can actually use. Large open crawls in the lineage of Common Crawl now ship aggressively filtered derivatives (the RefinedWeb and FineWeb pipelines of 2023 to 2024 document quality and deduplication filters applied at crawl-corpus scale), and 2024 to 2025 work on detecting and down-weighting synthetic or low-quality pages aims to keep model-generated sludge out of the retrieval corpus before it ever reaches the embedder. A parallel line studies freshness-aware and learned recrawl scheduling, predicting per-page change rates so the continuous recrawl of Section 5 spends its politeness budget where the web is actually moving. The common thread is that the crawler is being treated as the first quality filter of the RAG system, optimized jointly with the retrieval objective rather than as an undifferentiated firehose; we meet the retrieval side of this coupling in Chapter 25.

Fun Note: The Politest Denial-of-Service in the World

A crawler with a bug in its politeness logic does not look malicious; it looks enthusiastic. The classic incident is a single misconfigured worker that ignores DOWNLOAD_DELAY and discovers a calendar widget, then follows "next month" forever, fetching an infinite sequence of empty pages from one unlucky host at full speed. The host's operator sees a relentless, perfectly well-formed flood of requests for the year 2147 and reasonably concludes they are under attack. The lesson the whole industry learned the hard way: a crawler's most important feature is the brake, not the engine.

Exercise 36.2.1: Why Host Hash, Not URL Hash Conceptual

Suppose a colleague proposes sharding the frontier by $\text{hash}(\text{full URL}) \bmod S$ instead of by host, arguing it gives perfect load balance across shards. They are right about balance. Explain, referring to Section 2, why this partition makes per-host politeness require global coordination on every fetch, and describe the specific extra machinery (and its cost) you would need to enforce a one-request-per-second per-host limit under URL-hash sharding. Then state the one situation in which URL-hash sharding is nonetheless preferable.

Exercise 36.2.2: Tune the Dedup Budget Coding

Extend Code 36.2.1 so that, for a target corpus of $N = 10^{10}$ URLs and a desired false-positive rate of $p = 0.005$, it computes the optimal bits per URL, the total filter memory in gigabytes, and the optimal $k$, using $m/n = -\log_2 p / \ln 2$. Then empirically verify the bits-per-URL formula at the smaller scale of the demo by inserting $n$ URLs into a filter sized for several target rates and confirming the measured rate matches $p$. Report how much memory you would save by accepting $p = 0.02$ instead, and argue in two sentences whether that coverage loss is acceptable for a RAG corpus.

Exercise 36.2.3: Which Ceiling Binds Your Crawl Analysis

You must fetch $N = 5 \times 10^{8}$ pages drawn from $H = 20{,}000$ distinct hosts, with a politeness gap of $\delta = 1$ second per host. Each worker sustains $r = 40$ fetches per second. Using $R_{\text{eff}} = \min(W r,\; H/\delta)$ from Section 4, find the worker count $W^\star$ beyond which adding workers no longer speeds the crawl, compute the wall-clock time at $W^\star$, and state which ceiling (worker throughput or politeness) binds. Then recompute if the corpus is instead concentrated on $H = 500$ hosts, and explain in terms of the case study why the second scenario is the more dangerous one for downstream RAG coverage.