Section 6.7: MinHash and Locality-Sensitive Hashing

"I compared every document to every other document. Halfway through the second trillion comparisons, I began to suspect there was a better way."
A Reducer That Counted Too High

Big Picture

Finding near-duplicate documents in a web-scale corpus is a similarity-join problem, and the naive solution compares every pair, which is quadratic and therefore impossible at scale. MinHash and Locality-Sensitive Hashing replace that impossible comparison with two ideas that fit the MapReduce model exactly: compress each document into a short signature that preserves set similarity, then hash similar signatures into the same bucket so that only documents already likely to be near-duplicates are ever compared. This is not a niche trick. Deduplicating the petabytes of text used to pretrain a modern language model is a mandatory cleaning step, and it is built on the algorithms in this section. The same banding idea reappears later as approximate nearest-neighbor search inside vector retrieval. We develop it here as a concrete MapReduce job: map documents to signatures, band the signatures into buckets, reduce the buckets to candidate pairs.

Earlier sections of this chapter computed exact answers: an exact word count, an exact sort, an exact join. This section confronts a problem where the exact answer is the obstacle rather than the goal. Suppose you hold a billion documents and you want every pair that is "almost the same" so you can drop the redundant copies. The exact formulation, compare all pairs and keep the similar ones, requires on the order of $N^2 / 2$ comparisons. For a billion documents that is roughly $5 \times 10^{17}$ comparisons, and no cluster of any size makes that tractable. The way out is to stop insisting on examining every pair, and instead arrange the computation so that dissimilar pairs are never compared at all. That arrangement is the subject of this section, and it is the first place in the book where we trade an exact computation for a probabilistic one to make scale possible, a trade that Section 6.8 generalizes into a whole toolkit of approximate algorithms.

1. Why Deduplication Decides Model Quality Beginner

A web crawl is full of repetition. The same news article appears on hundreds of syndicating sites, boilerplate headers and footers recur across every page of a domain, and scraped forums echo the same answer in thread after thread. If this redundancy is fed verbatim into a pretraining corpus, the model sees some passages thousands of times more often than others, which wastes compute, inflates the apparent size of the dataset, encourages memorization of the over-represented text, and can leak benchmark questions that happen to be duplicated across the web. Removing near-duplicates, not just exact duplicates, measurably improves the resulting model and reduces the tokens needed to train it. Clean data is not a luxury applied after the interesting work; at corpus scale it is the interesting work, and it is why a data-processing chapter sits early in a book about training large models. The downstream payoff is developed when we reach foundation-model data pipelines in Chapter 19.

The unit of comparison is a document represented as a set. The standard construction is to slide a window of $k$ consecutive words (or characters) across the text and collect the resulting shingles, so that "the distributed system scaled out" with $k = 3$ becomes the set {"the distributed system", "distributed system scaled", "system scaled out"}. Two documents that share most of their shingles are near-duplicates; two that share few are unrelated. The natural measure of set overlap is the Jaccard similarity, and everything in this section is built to estimate it cheaply and to find high-Jaccard pairs without examining all pairs.

Key Insight: Turn an Impossible Join Into Two Cheap MapReduce Passes

Near-duplicate detection is a similarity self-join, and the exact self-join is quadratic in the number of documents. The MinHash plus LSH method never forms that quadratic set. It spends one linear pass to compress each document into a short signature (map), and a second pass to group signatures that are likely to match into shared buckets (shuffle), so the only pairs ever compared are the ones that already collided in a bucket (reduce). The quadratic blowup is replaced by work proportional to the number of documents plus the number of genuinely similar pairs.

2. Jaccard Similarity and the MinHash Estimator Intermediate

Let document $A$ and document $B$ be sets of shingles. Their Jaccard similarity is the size of their intersection over the size of their union,

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

This is exactly the quantity we want, and computing it directly for one pair is easy. The trouble is that storing the full shingle set of every document is expensive, and comparing all pairs of sets is the quadratic cost we are trying to avoid. MinHash attacks both problems at once by replacing each set with a short fixed-length signature whose agreement rate estimates $J$ directly.

Pick a random permutation $\pi$ of the universe of all possible shingles, and define $h(A) = \min_{x \in A} \pi(x)$, the smallest rank that any element of $A$ receives under $\pi$. The defining property of this min-hash is that the element achieving the minimum over the union $A \cup B$ is equally likely to be any element of the union, and the hashes of $A$ and $B$ agree exactly when that minimizing element lies in the intersection. Therefore

$$\Pr[\, h(A) = h(B) \,] = \frac{|A \cap B|}{|A \cup B|} = J(A, B).$$

A single min-hash is a coin whose probability of heads is precisely the Jaccard similarity. Using $n$ independent permutations gives a signature of $n$ min-hashes per document, and the fraction of the $n$ positions where the two signatures agree is an unbiased estimator of $J$. Because the positions are independent, the estimator's variance is $J(1 - J)/n$, so the error shrinks like $1/\sqrt{n}$: quadruple the signature length and you halve the typical error. A signature of a few hundred integers stands in for a shingle set of any size, which is the compression that makes the whole pipeline fit in memory and in the shuffle.

The code below makes both claims concrete. It builds two near-duplicate documents, computes their true Jaccard exactly, then estimates it from MinHash signatures of growing length, repeating each length two hundred times so the shrinking spread of the estimator is visible. It then runs the LSH banding stage that the next section explains, finding near-duplicate candidate pairs in a small corpus.

Figure 6.7.1: The MinHash plus LSH pipeline as a MapReduce job. The map stage turns each document into a shingle set and then a short MinHash signature. The signature is split into $b$ bands of $r$ rows; each band emits a key, and the shuffle groups documents whose band matches into the same bucket. The reduce stage emits a candidate pair for every two documents sharing a bucket. A document that collides with nobody (doc3) is never compared and is dropped from the candidate set.

import numpy as np

def shingles(text, k=3):                       # set of k-word windows
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

doc_a = ("the distributed system scaled out across many machines and kept "
         "every worker busy processing its own shard of the training corpus")
doc_b = ("the distributed system scaled out across several machines and kept "
         "each worker busy processing its own shard of the training corpus")

A, B = shingles(doc_a), shingles(doc_b)
true_jaccard = len(A & B) / len(A | B)          # exact Jaccard, for reference

rng = np.random.default_rng(7)
universe = sorted(A | B)
idx = {s: i for i, s in enumerate(universe)}
P = 2_147_483_647                              # a large prime modulus

def make_hashes(n):                            # n random hash functions a*x+b mod P
    return rng.integers(1, P, size=n), rng.integers(0, P, size=n)

def signature(members, a, b):                  # MinHash: min over the set per hash
    rows = np.array([idx[s] for s in members])
    return ((np.outer(a, rows) + b[:, None]) % P).min(axis=1)

print("true Jaccard            :", f"{true_jaccard:.4f}")
print(" sig length |  mean MinHash est |  est std dev (across 200 trials)")
for n in [16, 64, 256, 1024, 4096]:
    ests = []
    for _ in range(200):                       # repeat to expose shrinking variance
        a, b = make_hashes(n)
        ests.append(np.mean(signature(A, a, b) == signature(B, a, b)))
    ests = np.array(ests)
    print(f"  {n:>6d}    |      {ests.mean():.4f}       |          {ests.std():.4f}")

Code 6.7.1: MinHash as an unbiased Jaccard estimator. Each random hash plays the role of one permutation; the fraction of signature positions where two documents agree estimates $J(A, B)$, and averaging over many trials at each length exposes how the estimator's spread shrinks.

true Jaccard            : 0.5200
 sig length |  mean MinHash est |  est std dev (across 200 trials)
 -----------|-------------------|---------------------------------
      16    |      0.5184       |          0.1323
      64    |      0.5259       |          0.0650
     256    |      0.5245       |          0.0322
    1024    |      0.5213       |          0.0164
    4096    |      0.5211       |          0.0077

Output 6.7.1: The mean estimate stays pinned near the true Jaccard of $0.52$ at every signature length, confirming the estimator is unbiased. The standard deviation roughly halves each time the signature length quadruples, the $1/\sqrt{n}$ decay predicted by the variance $J(1 - J)/n$.

The mean column never drifts from the true value of $0.52$, which is what "unbiased" means: more positions do not move the target, they only tighten the spread around it. The standard deviation falls from $0.13$ at sixteen hashes to $0.008$ at four thousand, a clean halving for every quadrupling, exactly as the variance formula promises. A signature of a few hundred integers already estimates Jaccard to within a few percent, and it does so in a fixed amount of space no matter how long the documents are.

Fun Note: One Coin, Tossed in Lockstep

It helps to picture a single min-hash as a coin that two documents toss together. The coin is rigged: it lands heads exactly when the rarest shingle under this particular permutation happens to live in both documents, which is the Jaccard probability. One toss tells you almost nothing. Two hundred tosses of two hundred independently rigged coins, all reporting the same hidden bias, pin that bias down. MinHash is just the discipline of tossing many rigged coins and counting heads.

3. LSH Banding: Finding Candidates Without Comparing All Pairs Advanced

MinHash shrinks each document to a short signature, but comparing all pairs of signatures is still quadratic. Locality-Sensitive Hashing removes the quadratic step by hashing signatures so that similar ones collide and dissimilar ones almost never do. Split each signature of length $n$ into $b$ bands of $r$ rows, so $n = b \cdot r$. For each band, hash the $r$ values in that band to a bucket; two documents land in the same bucket for a given band only if all $r$ of their values in that band are identical. Two documents become a candidate pair if they share a bucket in at least one band.

The probability analysis is what makes this work as a tunable filter. For two documents with Jaccard $s$, each row matches with probability $s$, so an entire band of $r$ rows matches with probability $s^r$, a band fails to match with probability $1 - s^r$, all $b$ bands fail with probability $(1 - s^r)^b$, and therefore the pair becomes a candidate with probability

$$\Pr[\text{candidate}] = 1 - (1 - s^{r})^{b}.$$

As a function of $s$ this is an S-curve with a sharp threshold near $s \approx (1/b)^{1/r}$. Below the threshold the candidate probability is almost zero, above it almost one. By choosing $b$ and $r$ you place the threshold exactly where you want the line between "near-duplicate" and "unrelated" to fall. A common choice for aggressive deduplication is to put the threshold around $0.8$, so that only genuinely redundant documents survive the filter while unrelated pairs are discarded before any direct comparison happens.

Expressed as MapReduce, the banding stage is the most natural job imaginable. The map function takes a document, computes its signature, and emits one key-value pair per band, where the key is the (band index, band hash) and the value is the document id. The shuffle groups by that key, so every bucket arrives at a reducer with the list of documents that fell into it. The reduce function emits a candidate pair for each pair of documents in a bucket, optionally verifying the true similarity before keeping it. This is the same map, shuffle, reduce skeleton from Section 6.2, with band keys playing the role that words played in word count.

# continued from Code 6.7.1; builds a small corpus and runs the LSH banding job
from collections import defaultdict

base = ("machine learning models trained on web scale text corpora require "
        "careful deduplication of the pretraining data before any training begins")
syn = {"careful": "thorough", "require": "need", "begins": "starts",
       "models": "networks", "trained": "fitted"}

def variant(text, swaps):                      # make a controlled near-duplicate
    out, done = [], 0
    for w in text.split():
        if done < swaps and w in syn:
            out.append(syn[w]); done += 1
        else:
            out.append(w)
    return " ".join(out)

corpus = {"doc0": base, "doc1": variant(base, 1), "doc2": variant(base, 2),
          "doc3": "completely unrelated content about gardening tomatoes in summer heat",
          "doc4": base + " and it must be reproducible"}

n_hashes, bands = 128, 32                       # 32 bands of 4 rows
rows = n_hashes // bands
a128, b128 = make_hashes(n_hashes)
gshingles = sorted(set().union(*(shingles(t) for t in corpus.values())))
gidx = {s: i for i, s in enumerate(gshingles)}

def sig128(text):
    r = np.array([gidx[s] for s in shingles(text)])
    return ((np.outer(a128, r) + b128[:, None]) % P).min(axis=1)

sigs = {d: sig128(t) for d, t in corpus.items()}

buckets = defaultdict(list)                      # MAP + SHUFFLE: emit (band, hash) keys
for d, s in sigs.items():
    for band in range(bands):
        chunk = tuple(s[band * rows:(band + 1) * rows].tolist())
        buckets[(band, hash(chunk))].append(d)

candidates = set()                               # REDUCE: pairs sharing any bucket
for members in buckets.values():
    for i in range(len(members)):
        for j in range(i + 1, len(members)):
            candidates.add(tuple(sorted((members[i], members[j]))))

print("LSH candidate near-duplicate pairs (128 rows, 32 bands of 4):")
for pa, pb in sorted(candidates):
    sa, sb = shingles(corpus[pa]), shingles(corpus[pb])
    print(f"   {pa} ~ {pb}   true Jaccard = {len(sa & sb) / len(sa | sb):.3f}")

Code 6.7.2: The LSH banding stage as a MapReduce job. Each document emits one (band, hash) key per band (map and shuffle); each bucket with more than one member yields candidate pairs (reduce). The unrelated doc3 shares no band with any other document and never enters the candidate set.

LSH candidate near-duplicate pairs (128 rows, 32 bands of 4):
   doc0 ~ doc1   true Jaccard = 0.714
   doc0 ~ doc2   true Jaccard = 0.636
   doc0 ~ doc4   true Jaccard = 0.783
   doc1 ~ doc2   true Jaccard = 0.714
   doc1 ~ doc4   true Jaccard = 0.577
   doc2 ~ doc4   true Jaccard = 0.519

Output 6.7.2: Every recovered pair is a true near-duplicate (Jaccard above $0.5$), and the four redundant variants of the same base document are all linked. The unrelated doc3 produced no candidate pair, so it was filtered out before any direct comparison, which is precisely the quadratic work the method avoids.

The output recovers exactly the near-duplicate pairs among the four variants of the base document and produces nothing for the unrelated doc3, which never shared a band with anything. On five documents the saving is invisible, but the structure is the point: the number of (band, hash) keys emitted is linear in the number of documents, the shuffle groups them, and the only comparisons performed are inside buckets that already collided. Scale the corpus to a billion documents and the candidate set stays proportional to the number of genuinely similar pairs rather than to $N^2$. Figure 6.7.1 traces this flow from documents through signatures and bands into buckets, and the reduce arrow back to candidate pairs is the only place a direct comparison ever happens.

Library Shortcut: datasketch Builds the MinHash LSH Index for You

Code 6.7.1 and Code 6.7.2 spell out the permutations, the signature, and the banding so the mechanism is visible. In production you do not hand-roll any of it. The datasketch library implements MinHash and an LSH index that handles signature construction, band hashing, and bucket lookup, so inserting documents and querying for near-duplicates is a few lines:

from datasketch import MinHash, MinHashLSH

def make_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):                  # same shingles as above
        m.update(sh.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.6, num_perm=128)  # picks b and r to hit the threshold
mh = {d: make_minhash(t) for d, t in corpus.items()}
for d, m in mh.items():
    lsh.insert(d, m)

print(lsh.query(mh["doc0"]))                   # near-duplicates of doc0, no all-pairs scan

Code 6.7.3: The same near-duplicate search with datasketch. The roughly forty lines of explicit permutation, signature, and banding logic in Code 6.7.1 and Code 6.7.2 collapse to a handful, and the library chooses the band layout from a target Jaccard threshold and stores the buckets for fast querying.

Practical Example: Deduplicating a Crawl Before a Pretraining Run

Who: A data engineer on a team preparing a multi-terabyte web crawl for language-model pretraining.

Situation: The raw crawl held roughly two billion documents, and an exact-duplicate hash pass removed only a fraction of the redundancy, because syndicated articles and near-identical boilerplate differed by a few words and so hashed differently.

Problem: An all-pairs Jaccard comparison would have been on the order of $10^{18}$ comparisons, which no budget or cluster could absorb, yet leaving the near-duplicates in would waste training compute and risk benchmark leakage.

Dilemma: Keep the redundant data and accept a worse model and inflated cost, or attempt the impossible quadratic join, or adopt an approximate method that might miss some duplicates or admit some false candidates.

Decision: They ran MinHash with 128 permutations and LSH banding tuned to a Jaccard threshold near $0.8$, as a two-stage Spark job over the cluster, accepting the small approximation in exchange for a tractable cost.

How: The map stage shingled each document and emitted band keys; the shuffle grouped documents into buckets; the reduce stage formed candidate pairs and verified their true Jaccard before unioning them into duplicate clusters, keeping one representative per cluster.

Result: The pipeline ran in linear time over the corpus, removed a large fraction of near-duplicate text that the exact-hash pass had missed, and cut the effective token count enough to shorten the pretraining run, with no measurable loss of genuine content.

Lesson: At corpus scale the right answer to a quadratic similarity join is to not compute it. MinHash plus LSH turns "compare everything" into "compare only what already collided", and the small probability of a missed pair is a price worth paying for a job that finishes.

4. The Same Banding Returns as Vector Search Intermediate

The banding idea you just built for set similarity is one instance of a general principle: design a hash family that makes nearby points collide, and you turn nearest-neighbor search into a bucket lookup. MinHash is the locality-sensitive family for Jaccard distance over sets. Other distances have their own families, for example random-projection hashing for cosine and Euclidean distance over dense vectors, but the LSH banding mechanism is identical, group by hash, compare only within buckets. This is why the algorithm reappears far from data cleaning.

When a retrieval system must find the embedding vectors closest to a query among billions of stored vectors, the exact search is again quadratic and again avoided by approximate nearest-neighbor methods, of which LSH is the conceptual ancestor. We meet approximate nearest-neighbor search as a tool for classical models in Chapter 12, and as the indexing engine of distributed vector databases in Chapter 25. The banding you implemented in Code 6.7.2 and the bucket lookup in Code 6.7.3 are the same shape as the index probe inside a vector database; only the hash family and the distance change.

Thesis Thread: One Hashing Idea, Scaled Across the Book

Locality-Sensitive Hashing is a primitive that returns scaled out. Here it deduplicates a training corpus as a MapReduce job, where the band key is the shuffle key and the bucket is the reducer's input. In Chapter 25 the same banding becomes a sharded index that serves nearest-neighbor queries across a fleet at inference time, where the bucket lookup is a network fan-out to index shards. When you see approximate nearest-neighbor search later, recognize the S-curve of Section 3: a hash family chosen so that similar things collide and dissimilar things do not, with the threshold tuned by how many bands you use.

5. Where the Approximation Bites Intermediate

Because the method is probabilistic, it makes two kinds of error, and an honest pipeline budgets for both. A false negative is a near-duplicate pair that never shares a bucket, possible because every one of its bands happened to differ; its probability is $(1 - s^r)^b$, small for high-similarity pairs but never exactly zero. A false positive is an unrelated pair that collides in some band by chance and enters the candidate set; the reduce stage controls these cheaply by recomputing the true Jaccard on each candidate and discarding the ones below threshold, since the candidate set is small. The tuning knobs $b$ and $r$ trade these errors against each other and against cost: more bands raise recall and the candidate count, longer bands raise precision and the false-negative rate. The S-curve of Section 3 is the design tool, and the communication cost of the shuffle, which carries $b$ keys per document, is the budget it is tuned against, the kind of cost model developed in Chapter 3.

There is also a data-skew hazard specific to the shuffle. A boilerplate fragment shared by millions of documents produces a single enormous bucket, and the reducer that receives it faces a near-quadratic blowup inside that one bucket, the straggler problem this chapter keeps returning to. Practical systems cap bucket sizes, sample within giant buckets, or strip common boilerplate before shingling. The lesson is that the algorithm removes the global quadratic but can reintroduce a local one through skew, which is exactly the shuffle-skew problem that Chapter 7 treats in depth for Spark.

Research Frontier: Near-Deduplication of Pretraining Corpora (2024 to 2026)

Data-centric work on language models has made corpus deduplication a first-class research topic rather than a preprocessing afterthought. The lineage runs from the demonstration that deduplicated training data improves models (Lee et al., 2022) through large open pipelines that publish their MinHash-LSH recipes: the RefinedWeb and FineWeb efforts document fuzzy deduplication over trillions of tokens, and the Dolma and SlimPajama corpora ship reproducible MinHash deduplication code. Current questions are sharper than "should we deduplicate". They ask how aggressive deduplication should be before it removes useful diversity, how to deduplicate across languages and modalities, how to make the billion-document MinHash-LSH shuffle cheaper and skew-resistant, and how semantic near-duplicate detection over embeddings, an LSH cousin, complements the surface-form MinHash approach. The 2024 to 2026 open-data releases treat the MinHash-LSH job in this section as the workhorse and compete largely on how well they tune and scale it.

You now have a method that replaces a quadratic similarity join with two linear MapReduce passes, the probability theory that explains why MinHash signatures estimate Jaccard and why LSH banding finds the right candidates, and the cost and skew caveats that keep it honest. This is the first deliberately approximate algorithm in the book, and it opens a door. The next section widens it, building a family of small-space sketches, counting distinct items with HyperLogLog and frequencies with count-min, that answer otherwise impossible questions over streams and corpora. That family continues in Section 6.8.

Exercise 6.7.1: Place the Threshold Conceptual

You have signatures of length $n = 128$ and want candidate generation to switch on near a Jaccard of $s = 0.8$, using the approximation that the S-curve's threshold sits near $(1/b)^{1/r}$ with $n = b \cdot r$. Evaluate the candidate probability $1 - (1 - s^r)^b$ at $s = 0.8$ and at $s = 0.5$ for the layouts $(b, r) = (32, 4)$ and $(b, r) = (16, 8)$. Which layout more sharply separates near-duplicates from merely similar documents, and what does choosing a longer band $r$ do to the false-negative rate for genuine duplicates? Explain in terms of the S-curve's steepness.

Exercise 6.7.2: Measure the Candidate Saving Coding

Extend Code 6.7.2 to a synthetic corpus of $2000$ documents, built as $200$ clusters of $10$ near-duplicates each (generate each cluster from a random base sentence with a few word swaps) plus pure noise documents. Count the number of candidate pairs the LSH job produces and compare it to the $\binom{2000}{2}$ pairs an all-pairs scan would examine. Report the ratio, then add the reduce-stage verification that recomputes true Jaccard and keeps only pairs above $0.6$. State how many true near-duplicate pairs you recovered and how many you missed (false negatives), and relate the miss count to $(1 - s^r)^b$.

Exercise 6.7.3: Diagnose the Skewed Bucket Analysis

Suppose one boilerplate paragraph appears verbatim in $1\%$ of a billion-document corpus, so its shingles are shared by ten million documents. Argue why these documents collide in nearly every band and therefore land in one giant bucket, and estimate the number of candidate pairs that one bucket alone generates in the reduce stage. Explain why this single bucket can dominate the entire job's runtime even though the global method is linear, and propose two mitigations (one at shingling time, one at bucket-processing time). Connect your answer to the shuffle-skew treatment in Chapter 7.