"I crawled the open web for a month and came back proud of three billion documents. The deduplicator looked at me, sighed, and quietly informed me that I had brought home roughly four hundred million."
A Document, Discovering It Is the Millionth Copy of Itself
A raw web crawl is mostly noise and repetition; the single step that most determines the quality of everything downstream, both the corpus you train a language model on and the documents your retrieval system will serve, is turning that crawl into clean, deduplicated text at scale. The crawl arrives as terabytes of HTML wrapped in navigation bars, cookie banners, and ad markup, riddled with machine-generated junk, mirror sites, and the same press release copied across ten thousand domains. Before a single embedding is computed or a single training token is consumed, that mess must be extracted into plain text, filtered for quality, and deduplicated so that no fact is learned a thousand times and no near-identical page wastes a slot in the index. None of this fits on one machine, so the work is expressed as a sequence of distributed passes, and the centerpiece, near-duplicate detection, becomes a textbook shuffle: hash documents into buckets, then compare only within a bucket. This section builds that pipeline and shows why its quality, more than any model architecture choice, sets the ceiling on retrieval and training that follows.
In Section 36.2 we crawled and stored the raw web at scale, landing billions of pages as compressed records in distributed storage. That crawl is not a corpus; it is raw material. A large fraction of it is boilerplate that surrounds the actual content, another large fraction is low-quality or machine-generated text, and a startlingly large fraction is duplicated, the same article syndicated across domains, the same template instantiated a million times, the same document re-crawled on successive passes. Left untreated, this material poisons everything downstream. A retrieval index built on it returns ten copies of the same page for one query; a language model trained on it memorizes the duplicated spans verbatim and wastes compute relearning them. This section is about the cleaning pipeline that stands between the crawl and the corpus, and it is the step where most of the corpus-quality battle is won or lost.
The pipeline has three stages, each a distributed pass over the data. First, text extraction strips the HTML scaffolding and recovers the main content. Second, quality filtering discards documents that are not worth keeping: wrong language, too short, too repetitive, or flagged by a learned classifier. Third, and most demanding, deduplication removes exact and near-duplicate documents so that each distinct piece of content appears essentially once. We treat the three in order, then spend the bulk of the section on the deduplication algorithm, because it is the one that does not have an obvious parallelization and the one whose cleverness, MinHash combined with Locality-Sensitive Hashing, repays careful study.
1. Text Extraction and Boilerplate Removal Beginner
A crawled page is an HTML document, and the text a reader cares about is a minority of the bytes. The rest is structure: the document head, navigation menus, sidebars, footers, cookie notices, comment widgets, and advertising markup. Boilerplate removal is the task of recovering the main textual content and discarding the chrome. The classical heuristics exploit the observation that boilerplate tends to live in short, link-dense, repeated blocks while main content lives in long runs of prose with high text-to-markup ratio; production extractors such as the descendants of boilerpipe, trafilatura, and resiliparse combine these signals with the DOM tree structure to segment the page. This is an embarrassingly parallel operation: every document is processed independently, so it maps cleanly onto the MapReduce pattern of Chapter 6, one map task per page, no shuffle required.
The cost that matters here is not algorithmic but throughput: extracting text from billions of pages is bounded by how fast bytes can be read from storage and fed to worker cores, which is exactly the data-loading problem of Chapter 8. Extraction quality nonetheless propagates all the way to the model. If the extractor leaves navigation text in, that boilerplate becomes training tokens and index entries; if it cuts too aggressively, it amputates real content. Because the operation is per-document and stateless, it parallelizes without coordination, which is why we dispatch it quickly and reserve the section's depth for the stage that does require coordination.
Extraction and quality filtering are per-document operations: each page is decided on its own, so they are pure map steps with no data movement between workers, and they scale by simply adding machines. Deduplication is fundamentally different, because deciding whether a document is a duplicate requires comparing it to other documents that may live on any machine in the cluster. That comparison is a relation, not a function of one record, and relations across a partitioned dataset are resolved by a shuffle: move records that might match into the same place, then compare locally. Recognizing which stages are map-only and which demand a shuffle tells you exactly where the engineering effort and the network cost will concentrate.
2. Quality Filtering: Heuristics, Language ID, and Classifiers Beginner
Extraction yields plain text, but plenty of that text is not worth keeping. Quality filtering is the second per-document pass, and it stacks three kinds of filter. The first is language identification: a fast model (the fastText language identifier is the standard choice) labels each document with its language and confidence, so that a corpus targeting English can drop everything else or route other languages to their own pipelines. The second is a battery of cheap heuristics drawn from the lineage of the C4, Gopher, and RefinedWeb cleaning rules: discard documents that are too short, that have too high a fraction of symbols or digits, that repeat the same line or n-gram excessively, that lack terminal punctuation, or that contain blocklisted patterns. These rules are blunt but remove a large mass of obvious junk for almost no compute.
The third filter is learned. A lightweight classifier, often nothing heavier than a logistic-regression or fastText model trained to distinguish a curated reference set (for example, pages cited on a high-quality reference site) from random web text, scores each document for a model-quality notion of "is this the kind of text we want." This classifier-based filtering is what separated the corpora that produced strong language models from those that did not, and it remains a per-document map operation: the model is small enough to replicate to every worker, so each worker scores its own shard with no communication. The same discipline that picks training data also picks retrieval data; a document too low-quality to train on is usually too low-quality to retrieve, so a RAG corpus inherits these filters directly.
Who: A platform engineer building an internal RAG assistant over a 40-terabyte web crawl for a financial-news company.
Situation: The first index was built straight from extracted text, with language ID and a length filter but no deduplication.
Problem: For many queries the top ten retrieved passages were the same wire-service article republished across dozens of partner domains, so the language model saw ten copies of one fact and nothing else, and its answers cited a single source as though it were ten.
Dilemma: Patch the symptom by deduplicating retrieval results at query time, cheap but leaving the bloated index and its wasted storage and recall, or fix the cause by deduplicating the corpus before indexing, a heavier distributed batch job but a permanent cure.
Decision: They deduplicated the corpus, because query-time deduplication cannot recover the relevant-but-distinct documents that the duplicates had crowded out of the top results.
How: They added a MinHash plus LSH near-duplicate pass (the algorithm of Section 4) as a Spark job between extraction and indexing, collapsing each cluster of near-identical pages to one representative.
Result: The index shrank by 38 percent, retrieval latency dropped with it, and answer quality rose sharply because the top results now spanned genuinely distinct sources.
Lesson: Deduplicate the corpus, not the answer. Duplicates do not merely waste space; they evict the diverse evidence that makes retrieval useful.
3. Why Deduplication Matters So Much Intermediate
It is tempting to treat deduplication as housekeeping, a tidy-up that saves some disk. It is far more consequential than that, and three distinct harms motivate it. The first is memorization. A language model trained on text that contains a passage a thousand times will memorize that passage verbatim, which inflates the risk of regurgitating copyrighted or private strings and distorts the model's sense of how common a fact is; the empirical finding, established across several studies, is that deduplicating training data reduces memorization and improves downstream quality at the same time. The second harm is wasted compute. Every duplicated token that survives into the training set is a token the model pays to process without learning anything new from it, so duplication is a direct tax on the most expensive resource in the entire enterprise of Chapter 19, the training run itself.
The third harm, and the one most specific to the case study of this chapter, is evaluation contamination. If a benchmark's test documents, or near-paraphrases of them, leak into the training corpus through duplication, the model's reported accuracy becomes meaningless, because it is being tested on what it has already seen. Deduplication against held-out evaluation sets is therefore not optional hygiene but a precondition for trustworthy numbers. For the RAG system specifically, duplication degrades retrieval directly, as the practical example showed: a deduplicated index returns a more diverse and therefore more informative set of passages for the same query budget. Clean, deduplicated text is the common upstream cause of both a better-trained model and a better-retrieving index, which is why this single stage carries so much of the corpus-quality weight.
Near-duplicate detection is the clearest application in this book of the idea that a hard relational problem over a partitioned dataset becomes tractable the moment you can express it as "group by a key, then act within each group." That is precisely the MapReduce shuffle of Chapter 6: the LSH band hash is the key, the map step emits one key per band, and the shuffle gathers all documents that share a band into the same reducer, where the only comparisons that remain are local. The same shuffle that grouped words for a distributed word count, and that returns as gradient all-reduce in Chapter 15, here groups documents by a similarity-preserving hash. The art of distributed deduplication is almost entirely the art of choosing a key under which similar things collide.
4. Exact Deduplication: Hashing and Suffix Arrays Intermediate
Exact duplicates, byte-identical documents, are easy and should be removed first because the method is cheap and shrinks the input to the harder near-duplicate pass. Compute a strong content hash of each document (a 64-bit or 128-bit hash such as those in the xxhash or SHA family), then group by that hash and keep one document per group. This is a one-line MapReduce: map each document to a key-value pair of (hash, document id), shuffle by hash, and in each reducer keep a single representative. Because the hash is a deterministic function of the bytes, identical documents always land in the same bucket and the comparison within a bucket is a trivial equality check.
A more refined notion of exact duplication operates at the substring level rather than the document level: long verbatim spans that recur across otherwise-different documents, such as a license header or a boilerplate disclaimer embedded mid-article. Detecting these calls for a suffix array, a sorted array of all suffixes of the concatenated corpus that lets you find every repeated substring of a minimum length in near-linear time; the influential deduplication work of Lee and colleagues used exactly this structure to excise long duplicated spans from training data. Suffix-array construction parallelizes, though less trivially than hashing, and it is the right tool when the duplication you care about is substring-level rather than whole-document. For the corpus-scale, whole-document near-duplicate problem that dominates web cleaning, we turn to the probabilistic method that is the heart of this section.
5. Near-Duplicate Detection With MinHash and LSH Advanced
Most web duplication is not byte-identical. The same article appears with a different ad inserted, a different timestamp, a one-word edit, a re-ordered paragraph. We need a notion of "almost the same," and the standard one for documents is the Jaccard similarity of their sets of shingles, where a shingle is a contiguous run of $k$ tokens. Represent each document $d$ as the set $S(d)$ of its $k$-token shingles; the similarity of two documents is
$$J(S(d_1), S(d_2)) = \frac{|S(d_1) \cap S(d_2)|}{|S(d_1) \cup S(d_2)|},$$a number in $[0,1]$ that is $1$ for identical shingle sets and $0$ for disjoint ones. The naive way to find all near-duplicate pairs, comparing every document to every other, costs $O(N^2)$ Jaccard computations, which for $N$ in the billions is utterly out of reach. MinHash and LSH together replace that quadratic search with a near-linear one, and they do it in two ideas that compose beautifully.
The first idea, MinHash, compresses each document's shingle set into a short fixed-length signature whose agreement estimates Jaccard. Pick a hash function $h$ that maps shingles to integers, and define $\text{minhash}_h(d) = \min_{s \in S(d)} h(s)$. The defining property is that the probability two documents share the same MinHash value equals their Jaccard similarity:
$$\Pr\big[\text{minhash}_h(d_1) = \text{minhash}_h(d_2)\big] = J(S(d_1), S(d_2)).$$The reason is elegant: the minimum is achieved by whichever shingle in the union $S(d_1) \cup S(d_2)$ hashes lowest, and the two documents agree exactly when that lowest-hashing shingle lies in the intersection, an event with probability equal to the intersection-over-union ratio. Using $n$ independent hash functions gives a signature of $n$ MinHash values, and the fraction of positions at which two signatures agree is an unbiased estimate of their Jaccard similarity, with error shrinking like $1/\sqrt{n}$. A document of any length collapses to $n$ numbers, and similarity estimation becomes counting agreements between two short vectors.
The second idea, Locality-Sensitive Hashing by banding, turns the signature into a bucketing scheme that brings likely-similar documents together without comparing all pairs. Split the length-$n$ signature into $b$ bands of $r$ rows each, so $n = b \cdot r$, and hash each band to a bucket. Two documents become a candidate pair if they land in the same bucket for at least one band. Figure 36.3.1 traces the whole flow from shingles to candidate pairs.
The banding scheme gives a tunable similarity threshold. For two documents with true Jaccard $s$, the probability that they agree on all $r$ rows of one specific band is $s^r$, so the probability they differ in at least one row of that band is $1 - s^r$, the probability they differ in every one of the $b$ bands is $(1 - s^r)^b$, and therefore the probability they collide in at least one band, which is to say become a candidate pair, is
$$\Pr[\text{candidate}] = 1 - (1 - s^r)^b.$$As a function of $s$ this is the famous S-curve: near-zero for low similarity, near-one for high similarity, with a sharp transition around $s^* \approx (1/b)^{1/r}$. By choosing $b$ and $r$ (subject to $b \cdot r = n$) you slide that threshold to whatever similarity you consider a duplicate. Figure 36.3.2 plots the curve for the configuration used in this section's demonstration.
The payoff for distribution is that this scheme is a shuffle and nothing more. Each document emits one (band-bucket, document-id) pair per band; the shuffle of Chapter 6 groups all documents that share any band bucket onto the same reducer; and within each reducer the only candidate pairs are those that landed together, a set vastly smaller than all pairs. A final verification step computes the true Jaccard (or compares full signatures) within each bucket to drop the false positives that banding inevitably admits. The expensive $O(N^2)$ comparison has become an $O(N)$ emission followed by a shuffle and a pile of tiny local comparisons, which is exactly the shape that the Spark DataFrame pipeline of Chapter 7 executes well.
6. A Runnable MinHash and LSH Demonstration Intermediate
The code in Code 36.3.1 implements MinHash signatures and LSH banding on five short documents: two near-duplicate pairs (a sentence and its one-word edit) and one entirely distinct document. It builds an $n = 120$ signature split into $b = 40$ bands of $r = 3$ rows, forms candidate pairs from shared band buckets, and reports for every pair the true Jaccard, the signature-estimated Jaccard, and whether LSH flagged it as a candidate. The point to watch is that the two genuine near-duplicate pairs become candidates while the distinct document collides with no one, and that the estimated Jaccard tracks the true Jaccard closely.
import random, hashlib
from collections import defaultdict
def shingles(text, k=3): # set of k-token shingles
t = text.lower().split()
if len(t) < k:
return {tuple(t)}
return {tuple(t[i:i+k]) for i in range(len(t) - k + 1)}
def true_jaccard(a, b):
return len(a & b) / len(a | b)
def make_hashes(num, seed=7): # a family of (a, b) hash params
rng = random.Random(seed)
M = (1 << 61) - 1
return [(rng.randrange(1, M), rng.randrange(0, M)) for _ in range(num)]
def shingle_hash(s): # stable, salt-free hash of a shingle
h = hashlib.blake2b(" ".join(s).encode(), digest_size=8).digest()
return int.from_bytes(h, "little")
def signature(sh, hashes): # one MinHash value per hash function
M = (1 << 61) - 1
return tuple(min(((a * (shingle_hash(s) % M) + b) % M) for s in sh) for a, b in hashes)
def lsh_buckets(sig, bands, rows): # one bucket key per band
return [sig[i*rows:(i+1)*rows] for i in range(bands)]
docs = {
"A": "the distributed crawler fetched billions of web pages overnight",
"A2": "the distributed crawler fetched billions of web pages last night", # near-dup of A
"B": "minhash and locality sensitive hashing find near duplicate documents",
"B2": "minhash and locality sensitive hashing detect near duplicate documents", # near-dup of B
"C": "a quiet garden held three sleeping cats under warm sun", # distinct
}
NUM, BANDS, ROWS = 120, 40, 3 # signature length = BANDS * ROWS
hashes = make_hashes(NUM)
sh = {k: shingles(v) for k, v in docs.items()}
sigs = {k: signature(sh[k], hashes) for k in docs}
bucket_map = defaultdict(list) # the LSH shuffle, in one process
for k in docs:
for b in lsh_buckets(sigs[k], BANDS, ROWS):
bucket_map[b].append(k)
candidates = set() # pairs sharing at least one band bucket
for members in bucket_map.values():
for i in range(len(members)):
for j in range(i + 1, len(members)):
candidates.add(tuple(sorted((members[i], members[j]))))
def est_jaccard(x, y): # fraction of agreeing signature slots
return sum(1 for i in range(NUM) if sigs[x][i] == sigs[y][i]) / NUM
print(f"signature length = {NUM}, bands b = {BANDS}, rows r = {ROWS}")
print("LSH candidate near-duplicate pairs:", sorted(candidates))
print()
print(f"{'pair':<8}{'true J':>10}{'est J':>10}{'candidate?':>12}")
keys = list(docs)
for i in range(len(keys)):
for j in range(i + 1, len(keys)):
x, y = keys[i], keys[j]
flag = "yes" if tuple(sorted((x, y))) in candidates else "no"
print(f"{x}-{y:<6}{true_jaccard(sh[x], sh[y]):>10.3f}{est_jaccard(x, y):>10.3f}{flag:>12}")
bucket_map dictionary plays the role of the distributed shuffle: in a cluster, emitting (band-bucket, doc-id) pairs and grouping by the bucket key is exactly the MapReduce shuffle, here collapsed into one process so it runs on a laptop.signature length = 120, bands b = 40, rows r = 3
LSH candidate near-duplicate pairs: [('A', 'A2'), ('B', 'B2')]
pair true J est J candidate?
A-A2 0.667 0.742 yes
A-B 0.000 0.000 no
A-B2 0.000 0.000 no
A-C 0.000 0.000 no
A2-B 0.000 0.000 no
A2-B2 0.000 0.000 no
A2-C 0.000 0.000 no
B-B2 0.400 0.400 yes
B-C 0.000 0.000 no
B2-C 0.000 0.000 no
The demonstration shows the whole method in miniature: shingling, signature estimation that tracks the true similarity within the sampling error of a short signature, and LSH bucketing that surfaces exactly the two duplicate pairs while ignoring the distinct document and every cross-cluster pair. Scaling this from five documents to five billion changes nothing conceptual; it changes only where the bucket_map lives. On a cluster the emission of band buckets is the map, the grouping is the shuffle, and the within-bucket verification is the reduce, so the laptop dictionary becomes a Spark stage and the algorithm carries over without modification.
The roughly forty lines of Code 36.3.1 exist to make the mechanism visible; in production you reach for a library that has tuned the hashing, the banding, and the storage. The datasketch package provides MinHash and an MinHashLSH index that inserts signatures and queries candidates in a few calls, and at corpus scale the same logic runs as a Spark job where the banding becomes a groupBy on the band-bucket column:
from datasketch import MinHash, MinHashLSH
def sig(text, k=3, perm=128): # build a 128-permutation MinHash
m = MinHash(num_perm=perm)
toks = text.lower().split()
for i in range(max(1, len(toks) - k + 1)):
m.update(" ".join(toks[i:i+k]).encode())
return m
lsh = MinHashLSH(threshold=0.6, num_perm=128) # banding chosen for you from threshold
for name, text in docs.items():
lsh.insert(name, sig(text)) # internally splits the signature into bands
print(lsh.query(sig(docs["A2"]))) # -> near-duplicates of A2, e.g. ['A', 'A2']
datasketch. Passing a threshold lets the library pick the optimal $b$ and $r$ for you, and on a cluster this index becomes a Spark groupBy over the band-bucket key, the spark.ml MinHashLSH transformer, or a dedicated tool such as the text-dedup pipeline used to clean open LLM corpora.Large-scale corpus construction has turned deduplication from a preprocessing footnote into a headline design decision. The open datasets that defined recent practice, RefinedWeb (Penedo et al., 2023), the FineWeb and FineWeb-Edu releases (Penedo et al., 2024), and Dolma (Soldaini et al., 2024), all document MinHash-LSH near-deduplication as central to their quality, and the FineWeb ablations show measurable gains in downstream model accuracy from the dedup and filtering choices alone. A live debate concerns how aggressive to be: global deduplication across the whole corpus can strip away useful repeated signal and sometimes underperforms per-snapshot deduplication, so the frontier is not "more dedup" but "dedup tuned to its effect on the model." Semantic deduplication, which clusters by embedding similarity rather than shingle overlap to catch paraphrases that share few exact tokens (SemDeDup, Abbas et al., 2023), pushes the same idea beyond Jaccard, trading the cheap LSH shuffle for a distributed nearest-neighbor search of the kind built in Chapter 25.
The quiet magic of MinHash is that a document of ten thousand shingles and a document of ten shingles both collapse to the same $n$ numbers, and two such wildly different-sized documents can still register their true overlap correctly. The signature does not remember how big the document was, only what its smallest hashed shingles were. It is a little like recognizing a song from the same three notes no matter how long the recording: the minimum carries the identity, and the length politely steps aside.
7. The Cleaning Pipeline as One Distributed Job Intermediate
Assembled, the three stages form a single distributed batch job that consumes the raw crawl of Section 36.2 and emits the clean, deduplicated corpus that the indexing and retrieval sections downstream will use. Extraction and quality filtering run as map-only stages, reading crawl records from distributed storage and writing extracted, filtered text back, bounded by I/O throughput and trivially scalable. Exact deduplication runs as a hash-group-by, and near-duplicate deduplication runs as the MinHash-LSH shuffle, the one stage that moves real data across the network. Expressing the whole thing as Spark DataFrame transformations, the way Chapter 7 teaches, lets the engine schedule the map stages, plan the shuffle, and recover failed tasks without the pipeline author writing any coordination code.
The output of this job is the artifact on which the rest of the chapter depends. Clean, deduplicated text is simultaneously the training corpus for any model the project fine-tunes and the document collection that the retrieval index of the next sections is built over. Both uses inherit the same quality: a duplicate not removed here is a memorized span in the model and a redundant hit in the index; a quality filter applied here lifts both the model's training distribution and the retriever's candidate pool. The single most leveraged hour of engineering in a web-scale RAG project is the one spent on this pipeline, because everything that follows is a function of its output. With the corpus clean, Section 36.4 turns it into a sharded lexical index, where the documents this section deduplicated become the posting lists a query will search.
You have a MinHash signature of length $n = 128$ and you consider two documents near-duplicates when their Jaccard similarity is at least $0.8$. Using the S-curve $\Pr[\text{candidate}] = 1 - (1 - s^r)^b$ and the threshold approximation $s^* \approx (1/b)^{1/r}$, pick a banding $(b, r)$ with $b \cdot r = 128$ whose threshold sits near $0.8$, and explain in words what goes wrong if you instead pick a banding whose threshold sits near $0.4$: which kind of error (false candidates or missed duplicates) does each mistake produce, and how does each error cost you when the candidate set then goes to an exact-Jaccard verification step?
Extend Code 36.3.1 into a small experiment. Generate many synthetic document pairs with known, controlled Jaccard similarity (for example, start from one shingle set and remove a tunable fraction of its shingles to produce a partner of target similarity). For a fixed $n = 120$ and several bandings $(b, r)$, estimate the empirical probability that a pair becomes an LSH candidate as a function of true Jaccard, and plot it against the theoretical curve $1 - (1 - s^r)^b$. Confirm that the empirical and theoretical thresholds agree, and report how the transition sharpens as you increase $r$ at fixed threshold.
Consider deduplicating $N = 5 \times 10^9$ documents with a signature of $n = 128$ split into $b = 32$ bands. Each document emits one (band-bucket, doc-id) pair per band. Estimate the number of records that cross the shuffle and, assuming each emitted record is about 24 bytes, the total bytes the shuffle moves. Compare this to the cost of the naive all-pairs approach by estimating how many Jaccard comparisons $O(N^2)$ would require, and argue from the two numbers why the LSH shuffle is the only feasible option. Then explain what a heavily populated band bucket (a "hot" bucket where thousands of documents collide) does to the within-bucket verification cost, and connect this to the shuffle-skew problem of Chapter 7.