Section 25.4: Approximate Nearest Neighbor Search

"I never find the single closest vector. I find ten that are close enough, in a millisecond, and let the reader decide which truth it wanted. So far the reader has always been satisfied."
An Index That Negotiates With Exactness

Big Picture

A vector database does not search by comparing the query against every stored vector; it builds an index that lets it look at a small fraction of them and still return almost the right answer. Three index families dominate, and each one trades the same three quantities against each other: recall, latency, and memory. Inverted-file (IVF) indexes cluster the vectors and probe only the nearest cells. Hierarchical navigable small-world (HNSW) graphs walk a layered graph from a coarse entry point down to close neighbors. Product quantization (PQ) compresses each vector into a few bytes so a billion of them fit in memory. The knob you turn on each family (nprobe, efSearch, the PQ code size) slides the system along the recall-latency-memory surface, and where you choose to sit on that surface fixes the per-shard memory footprint and per-query latency that the next section then shards and replicates across machines. Get the single-shard index right and distribution is a matter of fan-out; get it wrong and no amount of fan-out will save the latency budget.

Section 12.7 introduced approximate nearest neighbor search as the operation that makes embedding retrieval tractable, and showed why exhaustive comparison against a billion vectors is hopeless under a latency budget. This section opens the index itself. We look inside the three structures that vector databases actually use, name the single knob each one exposes, and watch that knob trade recall for speed or memory in real measured numbers. The point is not just to catalog the families; it is to see that every vector database is a choice of one point on a three-way trade-off surface, and that the coordinates of that point, the per-shard memory and the per-query latency, are exactly the inputs that Section 25.5 needs before it can decide how to shard and replicate the index across a fleet.

We work throughout with a database of $N$ vectors in $d$ dimensions and a query that asks for the $k$ nearest. Exhaustive search computes $N$ distances per query, which is $O(Nd)$ work; for $N$ in the billions and a few-millisecond budget, that is far too much. Every index in this section is a way to touch many fewer than $N$ vectors while still finding most of the true top $k$.

1. The Quantity We Are Trading: Recall Beginner

Before comparing index families we need a single number that says how good an approximate answer is. The standard measure is recall at $k$. Let $A_q$ be the set of $k$ vectors the index returns for query $q$, and let $G_q$ be the true $k$ nearest neighbors found by exhaustive search. Averaged over a query set $\mathcal{Q}$,

$$\text{recall@}k = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \frac{|A_q \cap G_q|}{k}.$$

A recall of $1.0$ means the index returned exactly the true neighbors; a recall of $0.9$ means it found nine of every ten true neighbors on average and substituted slightly-worse vectors for the rest. For most retrieval-augmented generation and recommendation workloads, recall in the $0.9$ to $0.99$ range is indistinguishable from exact in downstream quality, which is precisely why approximation is worth the speedup. The whole game is to reach that recall band while touching as few vectors and bytes as possible.

Key Insight: Every Index Is One Point on a Three-Way Surface

Recall, latency, and memory are not independent dials you can max out together; they form a trade-off surface, and an index family with its parameter fixed is a single point on it. IVF's nprobe buys recall with latency at fixed memory. PQ's code size buys memory back at the cost of recall. HNSW's efSearch buys recall with latency, but the graph itself costs memory up front. There is no free corner where recall is perfect, latency is zero, and memory is tiny. Choosing an index is choosing where on this surface your workload can afford to sit, and that choice is what the next section then multiplies across shards.

2. The Three Index Families Intermediate

Figure 25.4.1 shows the three structures side by side, because seeing their shapes makes their trade-offs intuitive before any code runs. IVF carves the space into cells and visits a few; HNSW threads a graph you traverse from coarse to fine; PQ replaces each full vector with a short code that approximates it. Each panel highlights the one knob that moves the family along the trade-off surface.

Figure 25.4.1: The three index families. Left (IVF): the vectors are clustered into cells; a query searches only the nprobe cells nearest its own cell (shaded), so most of the database is never touched. Center (HNSW): a layered proximity graph is entered at a sparse top layer and traversed greedily downward, with efSearch controlling how wide a candidate frontier is kept. Right (PQ): a full vector is split into subspaces, each replaced by the index of its nearest codebook centroid, shrinking 256 bytes to 4. The highlighted knob in each panel is what moves the family along the recall-latency-memory surface.

IVF: cluster, then probe only the near cells

An inverted-file index first runs k-means over the database to produce $\text{nlist}$ centroids, then assigns every vector to its nearest centroid, building one posting list (the "cell") per centroid. At query time it computes the distance from the query to all $\text{nlist}$ centroids, picks the $\text{nprobe}$ nearest, and exhaustively searches only the vectors in those cells. If a database of $N$ vectors splits evenly into $\text{nlist}$ cells, one probe touches roughly $N/\text{nlist}$ vectors, so probing nprobe of them costs about $\text{nlist} + \text{nprobe} \cdot N/\text{nlist}$ distance computations instead of $N$. The single knob is nprobe: at $\text{nprobe} = 1$ the search is fast but misses true neighbors that fell just across a cell boundary; as nprobe rises, recall climbs toward $1.0$ and latency rises with it. Memory is essentially the raw vectors plus a tiny centroid table, so IVF alone does not save memory; it saves time.

HNSW: walk a layered proximity graph

A hierarchical navigable small-world graph stores each vector as a node connected to its near neighbors, with a small random subset of nodes promoted into sparse upper layers that act as express lanes. A search enters at the single top-layer node, greedily hops to whichever neighbor is closest to the query, drops a layer when it can get no closer, and repeats until it reaches the dense bottom layer, where it expands a candidate frontier of size efSearch and returns the best $k$. Because each hop only examines a node's handful of edges, a query touches a logarithmic-ish number of nodes rather than $N$, which makes HNSW the fastest family at high recall. The price is paid twice in memory: the graph stores the full vectors plus tens of edges per node, and building or inserting is costly because every new node must find and link its neighbors. efSearch is the query-time knob; raising it widens the frontier, lifting recall at the cost of latency, with no change to the memory already spent on the graph.

Product quantization: compress vectors into bytes

Product quantization attacks the memory axis directly. It splits each $d$-dimensional vector into $m$ contiguous subvectors and, for each subspace, learns a codebook of (typically) 256 centroids by k-means. A vector is then stored not as $d$ floats but as $m$ one-byte codes, each naming the nearest centroid in its subspace. A $d = 64$ float32 vector that occupied 256 bytes becomes $m$ bytes; at $m = 8$ that is a 32-fold shrink. Search uses an asymmetric distance: the query stays full precision, and for each subspace the index precomputes a small table of distances from the query subvector to all 256 centroids, so a database vector's approximate distance is just $m$ table lookups summed. The knob is $m$, the code size: more subspaces means more bytes per vector but a finer approximation and higher recall. PQ is rarely used alone for search quality; its power is memory, which is why it is almost always combined with IVF.

Fun Note: The Index That Forgets on Purpose

Product quantization is a database that deliberately throws away most of what it was told. A 256-byte vector goes in; 8 bytes come out, and the other 248 are gone forever. The trick is that the 8 bytes remember the shape of the vector well enough to rank neighbors correctly most of the time. It is the rare engineering win where forgetting 97 percent of your data makes the system better, because the 3 percent you keep is the 3 percent that fit in RAM.

3. IVF-PQ and the Combined Index Intermediate

The two non-graph families compose into the workhorse index of large vector databases: IVF-PQ. IVF supplies the coarse partition that limits how many vectors are scanned, and PQ compresses the vectors inside each cell so the whole index fits in memory. The refinement that makes IVF-PQ accurate is residual encoding: rather than quantizing each raw vector, the index subtracts the cell centroid first and product-quantizes the residual, the part of the vector the coarse centroid did not capture. Residuals are smaller and more uniform than raw vectors, so the same code size buys higher recall. A query then probes nprobe cells and ranks their compressed members by table-lookup distance, combining IVF's time saving with PQ's memory saving in one structure. This is the index that lets a billion vectors live on a single machine's RAM and still answer in milliseconds, and it is the per-shard building block the next section replicates.

4. Measuring the Trade-Off Surface From Scratch Intermediate

Words about the trade-off surface are less convincing than measured numbers. The code below implements IVF (k-means clustering plus cell probing) and PQ (per-subspace codebooks with asymmetric table-lookup distance) in pure NumPy on a synthetic clustered dataset, then sweeps the two knobs. The IVF sweep over nprobe traces the recall-versus-latency curve; the PQ sweep over the code size $m$ traces the recall-versus-memory curve. Both are compared against exhaustive search, which defines recall $1.0$ and the latency baseline.

import numpy as np
import time

rng = np.random.default_rng(0)
N, d, Q = 40_000, 64, 200       # database vectors, dimension, queries
nlist = 128                     # IVF coarse clusters
topk = 10                       # neighbors to return

# Synthetic clustered data so the index structure is realistic.
centers = rng.standard_normal((20, d)) * 3.0
assign = rng.integers(0, 20, N)
X = centers[assign] + rng.standard_normal((N, d))
Xq = centers[rng.integers(0, 20, Q)] + rng.standard_normal((Q, d))

def sqdist(A, B):
    # ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b, vectorized and memory-light.
    return (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T

# ---- Exhaustive ground truth ----
t0 = time.perf_counter()
D = sqdist(Xq, X)
truth = np.argpartition(D, topk, axis=1)[:, :topk]
t_exh = (time.perf_counter() - t0) / Q * 1e3   # ms per query

def recall(approx, truth):
    hit = sum(len(set(a.tolist()) & set(t.tolist())) for a, t in zip(approx, truth))
    return hit / (topk * len(truth))

def kmeans(data, k, iters=10):                 # Lloyd's algorithm
    c = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        lab = np.argmin(sqdist(data, c), axis=1)
        for j in range(k):
            m = lab == j
            if m.any():
                c[j] = data[m].mean(0)
    return c, lab

# ---- IVF: coarse quantizer, probe nprobe nearest cells ----
cent, labels = kmeans(X, nlist, iters=8)
cells = [np.where(labels == j)[0] for j in range(nlist)]

def ivf_search(q, nprobe):
    cd = ((cent - q) ** 2).sum(1)
    probe = np.argpartition(cd, nprobe)[:nprobe]        # nearest cells
    cand = np.concatenate([cells[j] for j in probe])    # their members
    if len(cand) < topk:
        return cand
    dd = ((X[cand] - q) ** 2).sum(1)
    return cand[np.argpartition(dd, topk)[:topk]]

print("=== IVF: sweep nprobe (recall vs latency) ===")
print(f"{'nprobe':>7}{'recall':>9}{'ms/query':>11}")
for nprobe in [1, 4, 8, 16, 32]:
    t0 = time.perf_counter()
    approx = [ivf_search(q, nprobe) for q in Xq]
    ms = (time.perf_counter() - t0) / Q * 1e3
    print(f"{nprobe:>7}{recall(approx, truth):>9.3f}{ms:>11.3f}")
print(f"{'exhaust':>7}{1.000:>9.3f}{t_exh:>11.3f}")

# ---- Product Quantization: m subspaces, 256-entry codebook each ----
def train_pq(data, m, ksub=256):
    dsub = d // m
    books = np.empty((m, ksub, dsub))
    for i in range(m):
        sub = data[:, i * dsub:(i + 1) * dsub]
        books[i], _ = kmeans(sub, ksub, iters=6)
    return books, dsub

def encode_pq(data, books, dsub, m):
    codes = np.empty((len(data), m), dtype=np.uint8)    # 1 byte per subspace
    for i in range(m):
        sub = data[:, i * dsub:(i + 1) * dsub]
        codes[:, i] = np.argmin(sqdist(sub, books[i]), axis=1)
    return codes

def pq_search(q, books, codes, dsub, m):
    tabs = np.empty((m, books.shape[1]))                # query-to-centroid tables
    for i in range(m):
        qsub = q[i * dsub:(i + 1) * dsub]
        tabs[i] = ((books[i] - qsub) ** 2).sum(1)
    approx_d = tabs[np.arange(m)[:, None], codes.T].sum(0)   # m lookups summed
    return np.argpartition(approx_d, topk)[:topk]

print("\n=== PQ: sweep code size m (recall vs memory) ===")
print(f"{'m':>4}{'bytes/vec':>11}{'recall':>9}{'compress':>10}")
raw_bytes = d * 4
for m in [4, 8, 16, 32]:
    books, dsub = train_pq(X, m)
    codes = encode_pq(X, books, dsub, m)
    approx = [pq_search(q, books, codes, dsub, m) for q in Xq]
    print(f"{m:>4}{m:>11}{recall(approx, truth):>9.3f}{raw_bytes / m:>9.0f}x")
print(f"raw float32 vector: {raw_bytes} bytes/vec")

Code 25.4.1: IVF and PQ implemented from first principles in NumPy. The IVF block clusters the database and probes nprobe cells per query; the PQ block compresses each vector into m one-byte codes and ranks by summed table-lookup distance. The two sweeps trace the two trade-off curves that Figure 25.4.2 plots.

=== IVF: sweep nprobe (recall vs latency) ===
 nprobe   recall   ms/query
      1    0.321      0.221
      4    0.801      1.075
      8    0.973      2.073
     16    1.000      3.940
     32    1.000      7.230
exhaust    1.000      3.147

=== PQ: sweep code size m (recall vs memory) ===
   m  bytes/vec   recall  compress
   4          4    0.054       64x
   8          8    0.116       32x
  16         16    0.251       16x
  32         32    0.686        8x
raw float32 vector: 256 bytes/vec

Output 25.4.1: The two trade-off curves in real numbers. IVF reaches 0.973 recall at nprobe 8 while touching a fraction of the data and running faster than the exhaustive baseline (2.07 ms versus 3.15 ms); probing more cells only buys the last few percent of recall at rising latency. PQ shows the opposite axis: shrinking each vector from 256 bytes to 4 (a 64-fold memory win) drops recall sharply, and recovering recall costs bytes back. Standalone PQ on raw vectors recalls poorly here, which is exactly why production indexes pair it with IVF and encode residuals.

Two lessons fall out of Output 25.4.1, and both feed the next section. First, IVF's nprobe is a near-monotone recall-latency dial: there is a sweet spot (nprobe 8 here) where recall is already near $1.0$ and latency is below the exhaustive baseline, and pushing past it wastes time for almost no recall. Second, PQ's code size is a recall-memory dial pointing the other way: the 64-fold memory saving at $m = 4$ is real and is what makes billion-scale indexes fit in RAM, but raw PQ alone is too lossy for final ranking, which is the empirical argument for IVF-PQ with residual encoding. The numbers you would carry forward, "this shard needs $B$ bytes per vector and answers in $T$ milliseconds," are precisely the per-shard cost and latency that Section 25.5 sums and replicates.

Figure 25.4.2: The two trade-off curves plotted from the Output 25.4.1 numbers. Left: IVF recall climbs steeply with latency as nprobe rises, flattening near $1.0$ once it crosses the exhaustive-latency line (dashed), so the useful operating points sit at small nprobe. Right: PQ recall rises with bytes per vector, but every point sits far left of the 256-byte raw baseline (dashed), which is the memory it saves; standalone PQ recall is low, motivating the IVF-PQ residual combination of Section 3.

Library Shortcut: FAISS Builds an IVF-PQ Index in Six Lines

Code 25.4.1 spends roughly a hundred lines re-deriving k-means clustering, cell probing, and codebook encoding so the mechanics are visible. In production you never write that; FAISS (the standard similarity-search library) builds, trains, and queries an IVF-PQ index, and exposes nprobe as a single attribute. The same two knobs you swept by hand are one assignment and one constructor argument:

import faiss, numpy as np
d, nlist, m, nbits = 64, 128, 8, 8     # dim, cells, PQ subspaces, bits/code

quantizer = faiss.IndexFlatL2(d)                       # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)  # IVF + PQ residuals
index.train(database)                                  # learns centroids + codebooks
index.add(database)                                    # encodes and stores codes

index.nprobe = 8                                       # the recall-latency knob
D, I = index.search(queries, k=10)                     # batched ANN search

Code 25.4.2: The from-scratch index of Code 25.4.1 as a FAISS IndexIVFPQ. About a hundred lines of clustering and quantization collapse to six, and FAISS handles residual encoding, SIMD-optimized table lookups, and optional GPU execution internally. For graph search the analogous one-liner is hnswlib.Index(space='cosine', dim=d) with an ef attribute in place of nprobe.

5. Distance Metrics: What "Nearest" Means Beginner

Every index above ranks candidates by a distance, but "distance" is a choice, and the choice must match how the embedding model was trained. Three metrics cover almost all retrieval. Squared Euclidean (L2) distance, $\lVert a - b \rVert^2$, is what Code 25.4.1 used and is natural when magnitude carries meaning. Inner product, $a \cdot b$, ranks by alignment and magnitude together and is the right metric for models trained with a dot-product objective, including many recommendation and dense-retrieval embeddings; note that "nearest" under inner product means largest dot product, so the index maximizes rather than minimizes. Cosine similarity, $\frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}$, is inner product after normalizing both vectors to unit length, and is the default for text embeddings where only direction, not magnitude, should matter. The practical rule is to normalize once at index-build time if the model expects cosine, after which cosine and inner product coincide and the same fast inner-product machinery serves both.

Practical Example: Choosing the Index Point for a RAG Service

Who: A platform engineer standing up the retrieval tier for a retrieval-augmented generation product.

Situation: 200 million text-embedding vectors at $d = 768$, a 25 ms retrieval budget inside a larger LLM request, and a fixed fleet of memory-bounded machines.

Problem: Stored as raw float32, the vectors need about 600 GB, which does not fit the planned per-machine RAM, and exhaustive search blows the latency budget by orders of magnitude.

Dilemma: Pure HNSW would hit the recall and latency targets but stores full vectors plus a heavy graph, exceeding the memory budget; pure IVF on raw vectors fits the latency target but still does not shrink the 600 GB; standalone PQ shrinks memory but, as Output 25.4.1 shows, recalls too poorly for final answers.

Decision: They chose IVF-PQ with residual encoding, normalized for cosine, sized so the compressed index fit comfortably in RAM, and tuned nprobe up from a low default until recall@10 crossed 0.95 on a held-out query set.

How: Built a FAISS IndexIVFPQ as in Code 25.4.2 with $m$ chosen to hit roughly 32 bytes per vector (a 24-fold shrink to about 25 GB), then swept nprobe offline to find the smallest value meeting the recall target, fixing latency well inside 25 ms.

Result: The index fit one machine's RAM with headroom, met both the recall and latency targets, and produced concrete per-shard numbers, bytes per vector and milliseconds per query, that the team handed to the sharding design.

Lesson: The index family and its knobs are chosen by which axis binds first. Memory bound it here, so PQ had to be in the mix, and IVF-PQ was the only point on the surface that satisfied all three constraints at once.

6. When the Index Exceeds RAM: Disk-Based ANN Advanced

IVF-PQ shrinks vectors enough that hundreds of millions fit in one machine's memory, but at billion to ten-billion scale even compressed codes plus a high-recall graph can exceed RAM. Disk-based ANN keeps the bulk of the index on SSD and only a compressed summary in memory. DiskANN is the canonical design: it stores a single flat proximity graph and the full-precision vectors on SSD, keeps PQ-compressed codes of every vector in RAM for fast approximate ranking during traversal, and issues a small number of carefully batched SSD reads to fetch and re-rank the most promising candidates at full precision. The in-memory PQ codes steer the search so that only a handful of disk pages are read per query, which keeps latency in the single-digit-millisecond range despite the index living on disk. The trade-off surface gains a fourth coordinate here, SSD reads per query, but the framing is unchanged: you are still spending one resource (disk I/O) to buy back another (RAM), and the per-shard cost you report to the next section now includes a disk-bandwidth term.

Research Frontier: GPU-Native and Learned ANN (2024 to 2026)

Three threads are reshaping the trade-off surface. GPU-native ANN has matured: NVIDIA's cuVS (the successor to RAFT, 2024) ships GPU IVF, IVF-PQ, and the CAGRA graph index, with CAGRA reporting order-of-magnitude throughput gains over CPU HNSW at matched recall and a graph layout designed for massively parallel traversal, and FAISS now offers cuVS-backed GPU indexes directly. Disk-based methods have advanced past the original DiskANN with filtered and streaming variants (FreshDiskANN and the filtered-DiskANN line) that support real-time inserts and attribute-constrained search without rebuilding, the operations a live vector database actually needs. A third thread learns the index structure itself: learned partitions and routing functions that replace hand-tuned k-means cells with trained models (in the lineage of learned indexes and query-aware routing such as the SOAR and ScaNN-family refinements) push recall-per-byte further than fixed quantization. The common direction is that the knobs of Section 2, nprobe, efSearch, code size, are increasingly chosen or replaced by learned components and executed on hardware that traverses graphs and scans codes in parallel. We meet GPU-accelerated retrieval again as part of web-scale RAG in Chapter 36.

Thesis Thread: The Index Sets the Per-Shard Numbers That Get Scaled Out

This section is single-machine on purpose: it chooses an index family and a point on the recall-latency-memory surface for one shard. That choice is the unit the rest of the chapter distributes. The bytes-per-vector you fixed with PQ determines how many vectors a shard can hold, and therefore how many shards a billion-vector corpus needs; the milliseconds-per-query you fixed with nprobe or efSearch determines the per-shard latency that fan-out then has to fit inside the request budget. Section 25.5 takes these numbers and shards the index across machines (so each shard holds a slice of the corpus) and replicates it (so each slice survives a node failure and serves more queries per second), turning the single-shard index of this section into a distributed retrieval service. The embedding tables of Chapter 11 were sharded by the same logic; an ANN index is a read-optimized sibling of that sharded table.

Exercise 25.4.1: Reading the Trade-Off Surface Conceptual

Using only the numbers in Output 25.4.1, answer each part and justify it from the trade-off surface. (a) A service has a strict 2.5 ms per-query budget and needs recall@10 of at least 0.95; which IVF nprobe setting do you pick, and how much latency headroom is left? (b) A second service is memory-bound and can spend at most 16 bytes per vector; what recall does standalone PQ give it, and why is that an argument for IVF-PQ rather than raw PQ? (c) Explain in one sentence why raising nprobe past 16 in this run is pure waste.

Exercise 25.4.2: Add Residual Encoding to the PQ Coding

Extend Code 25.4.1 into a minimal IVF-PQ. After clustering with the existing IVF code, for each vector subtract its cell centroid to form a residual, and train and encode the PQ on residuals instead of raw vectors; at search time, probe nprobe cells and rank their members by PQ distance computed on residuals (remember to use each cell's own centroid). Sweep $m \in \{4, 8, 16\}$ at a fixed nprobe of 8 and report recall@10 and bytes per vector. Confirm that residual encoding raises recall at every code size compared with the standalone PQ column of Output 25.4.1, and explain in two sentences why residuals are easier to quantize than raw vectors.

Exercise 25.4.3: Sizing a Billion-Vector Shard Analysis

You must index $10^9$ vectors of dimension $d = 768$ stored as float32, on machines with 64 GB of usable RAM each. (a) Compute the raw memory for the full corpus and the number of machines needed if you store raw vectors. (b) Recompute both for an IVF-PQ index at 32 bytes per vector (ignore the small centroid and codebook overhead). (c) State how the per-shard query latency you measured for one shard relates to the end-to-end latency once the corpus is split across the shards from part (b) and queried in parallel, and identify which term, per-shard search or cross-shard fan-out, Section 25.5 must control. Express your memory answers with the recall@$k$ definition's notation in mind, treating each shard as holding $N/S$ vectors for $S$ shards.