Section 6.3: Key-Value Computation: Word Count and Inverted Indexing

"They asked me to count words. I emitted a billion little ones each saying 'I am here', went to sleep, and woke up holding the dictionary of an entire civilization."
A Mapper That Underestimated Its Corpus

Big Picture

Almost every large-scale data computation for AI is one shape: each record emits a few key-value pairs, the system gathers all pairs that share a key onto one machine, and a reducer collapses each key's pile into an answer. Word count and the inverted index are the two computations that make this shape concrete, and they are not toy examples kept around for nostalgia. The inverted index is the data structure underneath every search engine and every retrieval system, and the corpus statistics that fall out of word count (term and document frequencies) are the raw material for cleaning training data and for the TF-IDF features that classical text models still run on. This section shows that both computations are the same map (emit a posting) plus the same reduce (collect by key), so that once you can write one, you can write the family.

The previous section took apart the MapReduce execution model into its three phases, map, shuffle, and reduce, and showed how the shuffle is the load-bearing step that moves data between machines. This section puts that machinery to work on the two computations that defined MapReduce in the first place. Word count is the canonical "hello world" of distributed data processing, and the inverted index is the canonical real workload that word count is secretly a special case of. Treating them together makes a point that the rest of the chapter depends on: the algorithm you write is almost entirely in the choice of what key to emit, and the framework supplies the rest.

We care about these two computations for a reason specific to this book. Building an AI system at scale almost always begins with a corpus that has to be understood before it can be trained on. You need to know how often each term appears, how many documents contain it, which documents mention which entities, and which records are near-duplicates of which others. Every one of those questions is a key-value computation over the corpus, and the inverted index is the structure that turns "which documents contain this term?" from a full scan into a lookup. That is exactly the question a retrieval system answers at query time, which is why the index you build here returns in Chapter 25 as the backbone of distributed retrieval.

1. Word Count Is an Emit-and-Collect Beginner

Counting how often each word appears in a corpus sounds trivial until the corpus does not fit on one machine. The single-machine version is a loop that increments a dictionary entry per word; the distributed version cannot keep one shared dictionary, because the corpus is spread across many nodes and no single node can hold the running totals for a web-scale vocabulary while the data streams past. MapReduce solves this by refusing to share any state during the scan. Each mapper reads its own shard of documents and, for every word it sees, emits a single key-value pair: the word as the key, and the number one as the value. A mapper never looks up a total; it only ever says "here is one more occurrence of this word."

The shuffle then does the only globally coordinated work. It routes every emitted pair to a reducer chosen by the key, so that all the ones for the word cat arrive together at one reducer, no matter which document or which machine produced them. The reducer receives a key and the full list of values for that key, and it sums them. Formally, if document $d$ contributes a count $c_{t,d}$ of term $t$, the reducer for term $t$ computes the corpus term frequency

$$\mathrm{tf}(t) = \sum_{d \in \text{corpus}} c_{t,d},$$

and the sum decomposes across documents exactly the way the gradient sum decomposed across shards in Section 1.1. The mappers compute the per-document terms in parallel, blind to one another; the reducer performs the regrouping. Nothing about the answer depends on how the documents were partitioned across mappers, which is precisely what makes the computation safe to distribute.

Key Insight: The Algorithm Is the Key You Emit

In a key-value computation you do not write the parallelism, the partitioning, or the data movement; the framework supplies all three. What you write is two short functions and, above all, one decision: what key does the map step emit? Emit the word and you get word count. Emit the word paired with the document it came from and you get an inverted index. Emit a normalized fingerprint and you get deduplication. The shuffle groups by whatever key you chose, so the key is the algorithm. Most of the computations in this chapter differ only in that one line.

2. The Inverted Index Is the Same Pattern, One Key Wider Beginner

An inverted index maps each term to the list of documents that contain it, the inverse of the obvious "forward" layout that maps each document to the words it contains. That inversion is what makes search fast: to answer "which documents mention retrieval?" you read one short posting list instead of scanning every document in the corpus. The list of document identifiers attached to a term is called its postings list, and building it is the workhorse of every text-retrieval system ever shipped.

The build is word count with a richer payload. The map step still emits one pair per term occurrence, but now the value carries the document identity rather than a bare one. The reducer still collects everything for a key, but instead of summing it collapses the collected document identifiers into a sorted, de-duplicated set, the postings list. Counting the length of that set gives the document frequency $\mathrm{df}(t)$, the number of distinct documents containing the term, which is a different and more useful statistic than the raw term frequency: a word can appear a thousand times but in only one document. Figure 6.3.1 traces a handful of documents through the map, shuffle, and reduce phases to show the postings lists assembling.

Figure 6.3.1: Building an inverted index as map, shuffle, reduce. Each document is mapped to a stream of (term, doc) postings (orange), the shuffle groups every posting by term, and each reducer (green) collapses one term's group into a sorted postings list and a document-frequency count. Word count is the identical diagram with the value doc replaced by the constant 1 and the reducer summing instead of collecting.

Reading Figure 6.3.1 left to right makes the shared skeleton visible. The map column is per-document and embarrassingly parallel; the shuffle is the single cross-machine step that gathers a term's postings from wherever they were produced; the reduce column turns each gathered group into one output row. Swap the emitted value from a document identifier to the constant one, and swap the reducer from "collect into a set" to "sum," and the very same picture computes word count. The two computations are one template with two fillings.

Fun Note: Why "Inverted"?

The name sounds like the index is broken. It is not; it is inverted relative to the document itself. A book's body reads document-to-words (page 12 contains "gradient"); the index at the back reads word-to-pages ("gradient" appears on pages 12, 88, 201). The back-of-book index is a hand-built inverted index, and a search engine is the same idea with a few billion more pages and no one willing to compile it by hand.

3. Building the Index by Hand Intermediate

The code below builds a small inverted index and computes document frequencies from a four-document corpus, written as an explicit map, shuffle, and reduce so the framework's role is visible. The map function yields one (term, doc_id) posting per token, a plain Python defaultdict stands in for the shuffle that groups postings by key, and the reduce step collapses each group into a sorted document set and a document-frequency count. It then uses the index to answer a two-term query by intersecting postings lists, and feeds the document frequencies into the inverse-document-frequency term of TF-IDF.

from collections import defaultdict
import math

# A tiny corpus: each document is (doc_id, text).
corpus = [
    ("d1", "the cat sat on the mat"),
    ("d2", "the dog sat on the log"),
    ("d3", "the cat chased the dog"),
    ("d4", "a quick brown dog"),
]

def tokenize(text):
    return text.lower().split()

# MAP: every document emits one posting (term, doc_id) per token occurrence.
def map_doc(doc_id, text):
    for term in tokenize(text):
        yield (term, doc_id)

# SHUFFLE: group every emitted posting by its key (the term).
grouped = defaultdict(list)
for doc_id, text in corpus:
    for term, did in map_doc(doc_id, text):
        grouped[term].append(did)

# REDUCE: collapse each term's postings into a sorted, de-duplicated
# document set plus a document frequency (how many documents contain the term).
inverted_index, doc_freq = {}, {}
for term, postings in grouped.items():
    docs = sorted(set(postings))
    inverted_index[term] = docs
    doc_freq[term] = len(docs)

print("INVERTED INDEX  (term -> documents)")
for term in sorted(inverted_index):
    print(f"  {term:<7} df={doc_freq[term]}  -> {inverted_index[term]}")

# A query is now a set intersection over posting lists, not a corpus scan.
hits_cat = set(inverted_index.get("cat", []))
hits_dog = set(inverted_index.get("dog", []))
print("\nQUERY 'cat' AND 'dog' -> documents:", sorted(hits_cat & hits_dog))

# The same df values feed the IDF term of TF-IDF.
N = len(corpus)
print("\nIDF = log(N / df), N =", N)
for term in ["the", "cat", "quick"]:
    print(f"  {term:<7} df={doc_freq[term]}  idf={math.log(N / doc_freq[term]):.3f}")

Code 6.3.1: An inverted index built as explicit map, shuffle, and reduce over a four-document corpus, then used for a boolean query and for the IDF statistic. The map_doc generator is the only line that decides the algorithm; emitting (term, 1) and summing in the reducer would make this word count instead.

INVERTED INDEX  (term -> documents)
  a       df=1  -> ['d4']
  brown   df=1  -> ['d4']
  cat     df=2  -> ['d1', 'd3']
  chased  df=1  -> ['d3']
  dog     df=3  -> ['d2', 'd3', 'd4']
  log     df=1  -> ['d2']
  mat     df=1  -> ['d1']
  on      df=2  -> ['d1', 'd2']
  quick   df=1  -> ['d4']
  sat     df=2  -> ['d1', 'd2']
  the     df=3  -> ['d1', 'd2', 'd3']

QUERY 'cat' AND 'dog' -> documents: ['d3']

IDF = log(N / df), N = 4
  the     df=3  idf=0.288
  cat     df=2  idf=0.693
  quick   df=1  idf=1.386

Output 6.3.1: The postings lists and document frequencies for the corpus. The query cat AND dog resolves to a single set intersection returning d3, and the IDF column shows the index's payoff for AI: the ubiquitous term the earns a near-zero weight while the rare term quick earns a high one.

The last block of Output 6.3.1 is where the corpus statistics become an AI feature. Term frequency-inverse document frequency weights a term in a document as $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\!\big(N / \mathrm{df}(t)\big)$, where $N$ is the number of documents and $\mathrm{df}(t)$ the document frequency the reducer just produced. The logarithm makes a term that appears in nearly every document (like the, with $\mathrm{df}=3$ out of $N=4$) almost worthless as a discriminator and a rare term (like quick) highly informative. The entire numerator of that weight, the global $\mathrm{df}(t)$ over a corpus too large for one machine, is exactly what the reduce step computes. Feature engineering itself now lives with the storage and loading machinery of Chapter 8; the point here is that the feature's raw ingredient is a distributed key-value reduction.

Thesis Thread: The Shuffle Is the Combine Step, Again

The "group all values for a key onto one machine" step you just simulated with a defaultdict is the MapReduce shuffle, and it is the same combining motion as the all-reduce of Section 1.1: partial results computed independently per shard, then merged by key. The shuffle moves variable-length postings keyed by term; the all-reduce moves fixed-length gradient vectors keyed by parameter position. Chapter 4 shows that they are members of one family of collective operations, and the cost of this cross-machine grouping, not the map or reduce arithmetic, is what governs whether either computation scales.

4. Why This One Pattern Generalizes Intermediate

The reason word count and the inverted index are taught first is not that text is special; it is that the emit-and-collect shape is general. Any computation that can be phrased as "for each record, produce some (key, payload) pairs, then for each key combine its payloads" fits the template, and a surprising fraction of corpus-scale AI preprocessing can be phrased that way. Counting label frequencies in a training set is word count with labels as keys. Grouping records by a join key, the subject of Section 6.4, is the inverted index with the join key in place of the term. Finding near-duplicate documents, which keeps web-scale training corpora from being polluted with repeats, emits a content fingerprint as the key so that duplicates collide at the same reducer, a technique we reach in the locality-sensitive hashing section later in this chapter.

The constraint that makes the pattern both general and scalable is associativity. Because the reducer combines a key's payloads in whatever order they arrive, and because a partial combine of a subset gives the same result as combining everything at once, the framework is free to start reducing on each mapper before the shuffle (a combiner) and to re-run any failed task without corrupting the answer. That associativity is what links this section to fault tolerance: a reducer that dies mid-merge can be restarted from the same inputs and produce the same postings list, which is why MapReduce can run a word count across thousands of unreliable machines and still get the right dictionary. We make that re-execution guarantee precise in the fault-tolerance section that closes the chapter.

Practical Example: The Deduplication Pass Before a Trillion-Token Run

Who: A data engineer on a team assembling a pretraining corpus for a foundation model.

Situation: A raw web crawl of several hundred terabytes was destined for a language-model training run, with budget for a single pass over the data.

Problem: The crawl was riddled with near-duplicate pages (mirrors, boilerplate, syndicated articles), and training on duplicates wastes compute and memorizes junk, but no single machine could hold a hash set over hundreds of terabytes.

Dilemma: Build a bespoke distributed hashing service, weeks of systems work, or express the deduplication as a key-value job on the cluster they already ran their inverted indexes on.

Decision: They wrote it as a MapReduce job, because grouping documents by a content fingerprint is structurally identical to the inverted index in Code 6.3.1, only the emitted key changed.

How: The map step emitted (fingerprint, doc_id) for each document; the shuffle gathered colliding fingerprints together; the reducer kept one representative document per fingerprint group and dropped the rest, writing out document frequencies of each fingerprint as a duplication report.

Result: The pass removed roughly a fifth of the corpus by volume in a single shuffle, ran on the existing cluster with no new infrastructure, and produced a per-fingerprint count that flagged the worst boilerplate for the next crawl.

Lesson: When a corpus-scale data problem can be phrased as "group by some key, then combine," it is the inverted index wearing a different key, and it inherits that pattern's parallelism and fault tolerance for free.

Library Shortcut: One Pass in Spark Replaces the Hand-Rolled Shuffle

Code 6.3.1 spelled out the shuffle with a defaultdict so the grouping was visible. On a real cluster you never write that loop; a data engine performs the shuffle, the parallel scheduling, and the fault-tolerant re-execution for you. In PySpark, the entire inverted-index build is a flat-map to postings followed by a group-by-key, and it runs unchanged on one core or a thousand:

# Spark distributes the shuffle, scheduling, and re-execution across the cluster.
docs = sc.parallelize([("d1", "the cat sat on the mat"),
                       ("d2", "the dog sat on the log")])

postings = docs.flatMap(lambda d: [(t, d[0]) for t in d[1].lower().split()])
index = (postings
         .groupByKey()                       # the shuffle: group postings by term
         .mapValues(lambda ds: sorted(set(ds)))   # reduce: de-duplicated postings list
         .collectAsMap())

Code 6.3.2: The inverted index of Code 6.3.1 in three lines of PySpark. The explicit defaultdict shuffle and the per-key reduce loop collapse into groupByKey().mapValues(...), and Spark supplies the cross-machine data movement and fault tolerance that Chapter 7 unpacks in full.

5. From Postings Lists to Retrieval at Scale Advanced

The index in Output 6.3.1 fits in a dictionary, but a real one does not. A web-scale inverted index has billions of terms and posting lists millions of documents long, so the index itself must be partitioned across machines, and the partitioning choice shapes how queries run. Partitioning by term (each machine owns a disjoint set of terms and their full posting lists) makes a single-term query touch one machine but forces multi-term queries to ship long lists between machines. Partitioning by document (each machine holds the full index for its slice of the corpus) makes every query fan out to all machines but keeps each posting list local. This is the same sharding-by-key versus sharding-by-row tension that recurs for embedding tables in Chapter 12 and for model parameters in Chapter 16; the index is just the first place it bites.

Modern retrieval for AI extends the inverted index rather than replacing it. A boolean or TF-IDF inverted index matches exact terms, so a query for "automobile" misses a document that only says "car." Dense retrieval answers that by indexing learned vector embeddings and searching by nearest neighbor, and production systems run a hybrid: the term-based inverted index from this section for exact and rare matches, alongside a vector index for semantic matches, with the two score lists fused. Both are still emit-and-collect builds, one keyed by term and one keyed by a quantized vector cell, and both shard the same two ways. Chapter 25 builds out the distributed vector side; the postings list you assembled here is the half that has been carrying search for fifty years and is not going anywhere.

Research Frontier: Learned and Hybrid Sparse Indexes (2024 to 2026)

The inverted index is in the middle of a quiet revival driven by retrieval-augmented generation. Learned sparse-retrieval models in the SPLADE lineage predict term weights and document expansions with a transformer, then write the result into a classic inverted index, so the postings list is built by a neural network but queried with the same fast intersection as Code 6.3.1. Production search now standardly fuses these sparse signals with dense vectors using reciprocal rank fusion, and 2024 to 2026 work on the BGE-M3 family trains a single model to emit dense, sparse, and multi-vector representations together for one-pass hybrid indexing. The engineering message for distributed systems is that the term-keyed inverted index is not being retired for AI retrieval; it is being fed by better mappers, and it still has to be sharded and shuffled exactly as this section describes. We evaluate these hybrid retrievers in Chapter 25 and put one to work in the web-scale RAG case study of Chapter 36.

With word count and the inverted index in hand, you have the two computations that the rest of this chapter generalizes. The next section keeps the same key-value skeleton but asks the reducer to do more than collect: it adds aggregation with combiners, filtering during the map, and the secondary sort that lets a reducer see its values in a chosen order. That progression starts in Section 6.4.

Exercise 6.3.1: One Template, Two Computations Conceptual

State precisely the two changes to Code 6.3.1 that turn the inverted-index build into a word-count job: what the map step emits as the value, and what the reduce step does with the collected values. Then explain why, despite producing different outputs, both computations require the identical shuffle, and what that shared shuffle implies about which of the three phases dominates the running time as the corpus grows. Identify one further computation from Section 4 (label counting, joining, or deduplication) and give the single key it must emit.

Exercise 6.3.2: Positional Postings and Phrase Queries Coding

Extend Code 6.3.1 so that each posting carries not just the document identifier but also the position of the term within the document, emitting (term, (doc_id, position)). Have the reducer build a positional postings list per term. Then write a function that answers the phrase query "sat on" by intersecting the postings of sat and on and keeping only documents where on appears exactly one position after sat. Run it on the corpus and report which documents match. Explain how much larger the positional index is than the plain one and why phrase search needs it.

Exercise 6.3.3: Term Partitioning Versus Document Partitioning Analysis

Consider a two-term conjunctive query A AND B on an index sharded across $M$ machines. Under term partitioning, the posting lists for $A$ and $B$ live on (at most) two machines; under document partitioning, every one of the $M$ machines holds a slice of both lists. For each scheme, estimate the number of machines contacted and the volume of posting data that must move between machines to compute the intersection, in terms of the posting-list lengths $|A|$ and $|B|$ and the shard count $M$. State which scheme you would choose for a latency-sensitive search service and which for a throughput-oriented batch analytics job, and connect your reasoning to the partitioning trade-off named in Section 5.