Part V: Distributed Inference and Serving
Chapter 25: Distributed Retrieval and Vector Search

Index Sharding and Replication

"I hold one sixteenth of all human knowledge and not a vector more. When a query arrives, fifteen siblings and I each shout our best ten guesses at the coordinator, and we are only ever as fast as the slowest of us."

A Shard That Believes It Is the Whole Model
Big Picture

When a vector index outgrows one machine's memory, you split it into shards and search them all in parallel, then merge their partial answers; the index becomes a distributed system whose correctness is a merge and whose latency is a maximum. A billion 768-dimensional vectors in float32 occupy roughly three terabytes, far beyond any single server's memory, so the index must be partitioned across many machines. A query then scatters to every shard, each returns its local top-$k$, and a coordinator gathers and merges those into the global top-$k$. That pattern is exactly the broadcast-then-gather collective you met in Chapter 4, now carrying nearest-neighbor candidates instead of gradients. This section shows how shard count trades per-machine work against tail latency, how the choice of sharding strategy trades work against recall, and how replication buys throughput and availability on top of a sharded index.

The previous section built the per-machine search engine: HNSW, IVF, and PQ turn an exact nearest-neighbor scan into an approximate one that runs in milliseconds on a single node. That machinery assumes the whole index fits in one machine's memory. It usually does not. A production retrieval corpus for a large-scale RAG system or a recommendation candidate generator runs from hundreds of millions to tens of billions of vectors, and the index plus its graph or inverted-list overhead overflows even a high-memory server. The remedy is the same one the rest of this book applies to data, to parameters, and to models: partition the structure across machines and coordinate them. For a vector index, that partitioning is called sharding, and it is the core distributed-systems content of this chapter. Partitioning as a general technique was introduced in Section 2.3; here we specialize it to an index that every query must consult. Figure 25.5.1 draws the scatter-gather that results, and the sections below quantify each of its three knobs in turn.

Coordinator scatter query, gather top-k Shard 1 Shard 2 Shard 3 Shard 4 vectors 0..M vectors M..2M vectors 2M..3M vectors 3M..N replica A / B replica A / B replica A / B replica A / B Merge: global top-k k from each of 4 shards = 4k candidates keep the best k overall scatter (solid) gather (dashed)
Figure 25.5.1: Scatter-gather over a sharded vector index. The coordinator broadcasts the query to all four shards (solid arrows); each shard searches only its slice of the $N$ vectors and returns its local top-$k$ (dashed arrows); the coordinator merges the $4k$ candidates into the global top-$k$. Each shard sits behind replicas (green) that hold identical copies for throughput and availability. The pattern is the broadcast-and-gather collective of Section 4.7 carrying neighbor candidates.

1. Sharding: One Query, Every Shard Beginner

To shard an index is to partition its $N$ vectors into $S$ disjoint subsets, one per machine, so that each shard holds about $N/S$ vectors and fits in one machine's memory. The arithmetic that forces this is blunt: a billion 768-dimensional float32 vectors need about 3 TB for the raw vectors alone, more once an HNSW graph or IVF lists are added, while a large server holds a few hundred gigabytes. Splitting across sixteen or thirty-two shards brings each piece within reach. This is the same partitioning move that data shards (Section 2.3) and sharded parameters (Chapter 16) make; the twist for retrieval is that a nearest-neighbor query has no idea in advance which shard holds its answer.

Because the nearest neighbors of an arbitrary query vector can land in any shard, a correct search must consult every shard. The coordinator broadcasts the query to all $S$ shards, each runs its local approximate search and returns its own top-$k$ candidates, and the coordinator gathers the $S \times k$ candidates and keeps the best $k$ overall. This is scatter-gather, and it is the broadcast-then-gather collective from Section 4.7 with nearest-neighbor candidates as the payload. The merge is correct because nearest-neighbor ranking is decomposable: the global top-$k$ by similarity is always contained in the union of the per-shard top-$k$ sets, so merging local winners never misses a global winner that a shard actually found.

Key Insight: Sharding Makes Search a Scatter-Gather, So Its Latency Is a Maximum

A sharded query is not faster because the shards are smaller; it is parallel because the shards run at once. The coordinator cannot return until the last shard replies, so the query's latency is the maximum over shard latencies, not the average. Per-shard work shrinks as $1/S$, which is why sharding lets the index grow, but the end-to-end latency is governed by the slowest of $S$ independent searches. Adding shards to fit a bigger index therefore quietly raises the tail, and managing that tail is the central engineering tension of a sharded index.

2. Random Versus Clustered Sharding Intermediate

How you assign vectors to shards decides how much work each query does. The two ends of the spectrum trade balance against selectivity. Random or round-robin sharding hashes each vector to a shard independently of its content, giving every shard nearly equal size and an equal share of every query's neighbors. Load is balanced and recall is high, but the query must probe all $S$ shards, so total work does not fall as you add shards; only the per-shard slice shrinks. Clustered or semantic sharding instead groups similar vectors together, for example by the IVF coarse centroid a vector falls under, so that a query's likely neighbors concentrate in a few shards. A router then sends the query to only those shards, cutting total work sharply, at the risk of missing true neighbors that sit just across a shard boundary.

The boundary risk is real and quantifiable. If a query's true neighbors straddle the edge between two clusters and the router probes only the single nearest cluster's shard, the neighbors on the far side are simply never seen, and recall drops. The standard fix is to probe the few nearest shards rather than one, the same $n_{\text{probe}}$ dial that IVF exposes within a single machine (Section 25.4), now lifted to the cluster level. Probing more shards recovers recall and converges to the all-shards behavior of random sharding, at the cost of more work. Clustered sharding is therefore most attractive when the embedding space has strong cluster structure and the router can be trusted, and least attractive when queries are adversarial or the data is near-uniform.

Fun Note: The Shard That Cried "Found It"

Clustered sharding has a failure mode with a human flavor. Route every query to the one shard the router is most confident about, and that shard becomes a hotspot: it answers nearly every request while its fifteen siblings idle, like an over-eager intern who volunteers for everything and then drowns. Random sharding has the opposite personality, sixteen equally busy workers who never specialize. Most production systems sit between the two, clustering loosely and probing a handful of shards, so that no single shard believes it is the whole index.

3. The Tail-Latency Problem Intermediate

Because a scatter-gather query waits for its slowest shard, the number of shards directly controls the tail of the latency distribution. Suppose each shard's search time is a random variable with cumulative distribution $F$, independent across shards. The query latency is the maximum of $S$ such variables, and its distribution is

$$P(\text{latency} \le t) = F(t)^S, \qquad \text{so} \qquad P(\text{latency} > t) = 1 - F(t)^S.$$

The probability that some shard exceeds a threshold $t$ grows with $S$: even if each shard is slow only one time in a hundred, with $S = 100$ shards the chance that at least one is slow on a given query is $1 - 0.99^{100} \approx 0.63$. This is the tail-latency amplification that Section 5.3 introduced for distributed serving in general, and it is the same straggler effect that Section 2.7 analyzed for any fan-out computation. More shards make each search cheaper but make a slow query more likely, so the median latency falls while the 99th-percentile latency can rise. A retrieval service is judged on its tail, not its median, so the shard count is bounded from above by the latency budget, not only from below by the memory budget.

Mitigations are exactly the straggler remedies of Section 2.7, specialized to retrieval. Hedged requests send a duplicate query to a second replica of any shard that has not answered within a short delay and take whichever returns first, trimming the tail at the cost of extra load. A coordinator may also return early once it has heard from enough shards to be confident the global top-$k$ is settled, accepting a small recall loss for a bounded latency. Both are throughput-for-latency trades, and both depend on replication, which is the subject of Section 5.

4. A Sharded Index You Can Run Intermediate

The code below builds a 200,000-vector index, shards it, and runs scatter-gather queries, measuring three quantities at once: the per-query work (vectors searched), the recall against the exact top-$k$, and the tail latency (the maximum over shard latencies under a simple cost model). It first sweeps the shard count under random sharding, then switches to clustered sharding and sweeps how many shards each query probes, then reports how replicas multiply throughput. Everything is pure NumPy so the scatter-gather logic is visible rather than hidden inside a library.

import numpy as np

rng = np.random.default_rng(7)
N, d, k = 200_000, 64, 10            # vectors in the index, dimensions, neighbors wanted
data = rng.standard_normal((N, d)).astype(np.float32)
queries = rng.standard_normal((300, d)).astype(np.float32)

def topk(scores, k):                  # indices of the k largest scores, sorted high-first
    idx = np.argpartition(scores, -k)[-k:]
    return idx[np.argsort(scores[idx])[::-1]]

# Exact top-k over the WHOLE index, the answer a sharded search is graded against.
truth = [set(topk(data @ q, k).tolist()) for q in queries]

# Per-shard latency model: fixed scatter overhead + per-vector work x straggler skew.
def shard_latency(n, skew):
    return 0.05 + n * 2.0e-6 * skew    # milliseconds (illustrative)

def build(S, strategy):
    if strategy == "random":           # round-robin-like: balanced, every shard probed
        assign = rng.integers(0, S, size=N)
    else:                              # clustered: group by sign bits (semantic-ish)
        bits = int(np.log2(S))
        code = np.zeros(N, dtype=np.int64)
        for b in range(bits):
            code = code * 2 + (data[:, b] > 0)
        assign = code % S
    return [np.where(assign == s)[0] for s in range(S)]

def scatter_gather(shards, q, targets):
    cand_idx, cand_score, lat = [], [], []
    for s in targets:                  # scatter: each shard returns its LOCAL top-k
        m = shards[s]
        if m.size == 0:
            continue
        sc = data[m] @ q
        loc = topk(sc, min(k, m.size))
        cand_idx.append(m[loc]); cand_score.append(sc[loc])
        lat.append(shard_latency(m.size, 1.0 + 0.6 * rng.random()))
    ci = np.concatenate(cand_idx); cs = np.concatenate(cand_score)
    merged = ci[topk(cs, k)]           # gather: coordinator merges into global top-k
    work = sum(shards[s].size for s in targets)
    return set(merged.tolist()), work, max(lat), sum(lat)   # tail = slowest shard

print(f"{'shards':>6} {'strategy':>10} {'work/q':>8} {'recall@10':>10} {'tail ms':>8} {'sum ms':>8}")
for S in [1, 4, 16, 64]:
    sh = build(S, "random")
    rec = wk = tail = tot = 0.0
    for q, gt in zip(queries, truth):
        got, work, t, sm = scatter_gather(sh, q, range(S))
        rec += len(got & gt) / k; wk += work; tail += t; tot += sm
    n = len(queries)
    print(f"{S:>6} {'random':>10} {wk/n:>8.0f} {rec/n:>10.3f} {tail/n:>8.3f} {tot/n:>8.2f}")

S = 16; shc = build(S, "clustered"); bits = int(np.log2(S))
def route(q, n_probe):                 # route to the matching shard, plus its neighbors
    base = 0
    for b in range(bits):
        base = base * 2 + (1 if q[b] > 0 else 0)
    return [(base + j) % S for j in range(n_probe)]
for n_probe in [1, 2, 4, 16]:
    rec = wk = 0.0
    for q, gt in zip(queries, truth):
        got, work, t, _ = scatter_gather(shc, q, route(q, n_probe))
        rec += len(got & gt) / k; wk += work
    n = len(queries)
    print(f"clustered S=16, probe {n_probe:>2}: work/q={wk/n:>6.0f}  recall@10={rec/n:.3f}")

for R in [1, 2, 4]:                    # replicas share load: throughput scales, tail unchanged
    qps = R * 1000.0 / shard_latency(N // 16, 1.3)
    print(f"replicas R={R}: serving throughput ~= {qps:>8.0f} queries/sec (16 shards)")
Code 25.5.1: A sharded vector index with explicit scatter-gather. The scatter_gather function searches each target shard locally, merges the per-shard top-$k$ into a global top-$k$, and reports tail latency as the maximum over shard latencies. The three sweeps isolate shard count, sharding strategy, and replication.
shards   strategy   work/q  recall@10  tail ms   sum ms
     1     random   200000      1.000    0.571     0.57
     4     random   200000      1.000    0.197     0.72
    16     random   200000      1.000    0.089     1.32
    64     random   200000      1.000    0.060     3.72

clustered S=16, probe  1: work/q= 12490  recall@10=0.167
clustered S=16, probe  2: work/q= 24999  recall@10=0.238
clustered S=16, probe  4: work/q= 49985  recall@10=0.339
clustered S=16, probe 16: work/q=200000  recall@10=1.000

replicas R=1: serving throughput ~=    12121 queries/sec (16 shards)
replicas R=2: serving throughput ~=    24242 queries/sec (16 shards)
replicas R=4: serving throughput ~=    48485 queries/sec (16 shards)
Output 25.5.1: Random sharding holds recall at 1.000 because every shard is probed, and the tail latency per shard falls from 0.571 ms at $S=1$ to 0.060 ms at $S=64$ as each shard shrinks; the rising sum ms column shows total work climbing with fixed per-shard overheads. Clustered sharding cuts work from 200,000 to 12,490 vectors per query when probing one shard, but recall collapses to 0.167 because neighbors across shard boundaries are missed; probing more shards trades work back for recall until it matches random sharding at probe 16. Replicas scale throughput linearly while leaving per-query latency untouched.

The output makes the three trades concrete. Shard count lowers the per-shard slice and so the per-shard tail, but the climbing sum ms warns that more shards mean more fixed overheads and, in a real network, more chances for a straggler. Clustered sharding's first row is the headline: a 16-fold reduction in work for a steep recall cost, recovered only by probing more shards. And replication is the clean win, multiplying throughput with no latency penalty. The catch that the cost model cannot show is the one Section 5 addresses next: keeping replicas consistent as the index is updated.

Thesis Thread: Scatter-Gather and Tail Latency Return, Carrying Vectors

The all-reduce of Section 1.1 was the book's first collective; scatter-gather is its retrieval cousin, broadcasting a query and gathering candidates instead of summing gradients. The tail-latency amplification you just measured is the same maximum-over-workers cost that bounds every fan-out computation in Section 2.7 and that Section 5.3 taught you to measure. A sharded vector index is a clean, self-contained instance of the book's recurring lesson: partition for capacity, pay in communication and tail latency, and replicate to recover throughput. The same three-way balance reappears for sharded recommendation embeddings in Chapter 38.

5. Replication: Throughput, Availability, and Consistency Intermediate

Sharding lets the index grow but does nothing for query throughput or availability: each shard is a single point of failure, and a shard saturates once its query rate exceeds what one machine can serve. Replication answers both. Each shard is copied to $R$ machines that hold identical vectors, and the coordinator routes each query to one replica of every shard, balancing load across replicas. Throughput then scales with $R$, as Output 25.5.1 shows, because $R$ copies of a shard serve $R$ times the queries, while per-query latency is unchanged because a query still touches exactly one replica of each shard. Availability improves for the same reason: if one replica of a shard fails, the coordinator routes to another, and the index stays fully searchable as long as one replica of every shard survives.

Replication's hard part is not reads but writes. A vector index is rarely static; documents are added, deleted, and re-embedded, and every update must reach all $R$ replicas of the affected shard. The system therefore faces the consistency choice of Section 2.3. Strong consistency makes an update visible on all replicas before acknowledging it, so every replica answers identically, at the cost of slower writes and reduced availability during a partition. Eventual consistency lets replicas apply updates asynchronously, so a freshly inserted vector may appear on one replica before another, which means two identical queries routed to different replicas can briefly return slightly different neighbors. Most vector databases choose eventual consistency for the index and accept this transient divergence, because retrieval is approximate already and a few seconds of staleness rarely changes which documents a RAG pipeline retrieves. Where freshness is critical, a common pattern keeps a small, strongly consistent buffer of very recent vectors that every query also scans, merging its results with the large eventually-consistent main index.

Practical Example: Sharding a RAG Index That Outgrew Its Box

Who: A platform engineer running the retrieval tier for an enterprise RAG assistant over internal documents.

Situation: The corpus grew from 40 million to 600 million chunks as more departments onboarded, and the single-node vector index began swapping to disk, pushing p99 retrieval latency past two seconds.

Problem: The 600-million-vector HNSW index needed roughly 1.8 TB of memory, well beyond the 512 GB on the serving box, so it could not stay on one machine.

Dilemma: Shard into many small pieces to fit memory comfortably and minimize per-shard search time, which raises the tail because the query waits for the slowest of many shards, or use a few large shards that keep the fan-out small but leave each shard's search slower and its memory tight.

Decision: They chose 12 shards with random assignment for balanced load and added 2 replicas per shard, sizing the shard count from the p99 latency budget rather than packing each machine to its memory limit.

How: They deployed the index in Milvus, which handles shard placement and the scatter-gather merge internally, set the replica count to 2, and enabled hedged requests so a slow shard's query was retried against its second replica after a short delay.

Result: p99 retrieval latency fell from 2.1 s to 95 ms, the index served three times the previous query rate from the added replicas, and a single node failure no longer took retrieval down because every shard had a live replica.

Lesson: Let the latency budget pick the shard count and the throughput target pick the replica count; do not pack machines to their memory ceiling and then discover the tail.

Library Shortcut: Milvus and Vespa Shard and Replicate for You

Code 25.5.1 spells out scatter-gather, shard assignment, and the merge in about forty lines. Production vector databases reduce that to a few configuration values and run the scatter-gather, the global-top-$k$ merge, replica routing, and update propagation internally. In Milvus, sharding and replication are collection-level settings:

from pymilvus import Collection

# Milvus partitions the collection into shards at creation and serves queries
# with an internal scatter-gather; replicas are loaded for throughput + HA.
col = Collection("docs", shards_num=12)        # 12 index shards (set at create time)
col.load(replica_number=2)                     # 2 replicas per shard, query-load balanced

# A single search call fans out to all shards, merges the per-shard top-k, and
# returns the global top-k. The scatter-gather of Code 25.5.1 is one line here.
hits = col.search(query_vectors, "embedding",
                  param={"metric_type": "IP", "params": {"nprobe": 16}},
                  limit=10)
Code 25.5.2: The same sharded, replicated search as Output 25.5.1, expressed in Milvus. Roughly forty lines of explicit scatter-gather collapse to shards_num, replica_number, and one search call; the engine owns shard placement, the global-top-$k$ merge, replica routing, and update propagation. Vespa exposes the same controls through its distribution and content-group settings, and both let you tune shard count against the latency budget without touching application code.

6. Balancing Throughput, Latency, Recall, and Cost Advanced

A sharded, replicated index is governed by four quantities that cannot all be maximized at once. Shard count $S$ sets memory headroom and per-shard search time but raises the tail; replica count $R$ sets throughput and availability but multiplies hardware cost; the sharding strategy and probe count set the work-recall trade; and the total machine count $S \times R$ sets the bill. A useful way to hold these together is to read each design choice as moving along one axis while pinning the others, exactly as Output 25.5.1 isolated shard count, then strategy, then replicas. Table 25.5.1 summarizes which knob moves which quantity, and in which direction the cost lands.

Table 25.5.1: The four knobs of a sharded, replicated vector index and what each one trades. No single setting optimizes all four quantities, so the design is chosen against an explicit latency and cost budget.
KnobRaising it improvesRaising it costsBounded by
Shard count $S$Memory headroom, per-shard search timeHigher tail latency, more fixed overheadThe p99 latency budget
Replica count $R$Throughput, availabilityHardware cost ($S \times R$ machines)The query-rate target and budget
Probe count (clustered)RecallMore work per query, higher latencyThe recall target
Clustering strengthSelectivity (less work per query)Recall risk at shard boundariesEmbedding cluster structure

The discipline this section teaches is to fix the budgets first and let them choose the knobs, not the reverse. The latency budget caps $S$, the throughput target sets $R$, the recall target sets the probe count, and the remaining freedom is spent minimizing $S \times R$. That is the same budget-first reasoning the chapter applies to retrieval as a whole, and it carries directly into the distributed hybrid search of the next section, where a second index (sparse lexical) is sharded and queried alongside the dense one and the two result sets must be merged.

Research Frontier: Disaggregated and Out-of-Core Vector Indexes (2024 to 2026)

The memory cost that forces sharding is itself under attack. Disk-based and out-of-core graph indexes in the DiskANN lineage keep most of the index on NVMe SSD and only a compressed summary in memory, so a single machine can host billions of vectors and the shard count needed for a given corpus drops sharply; recent work pushes these designs toward streaming updates and filtered search. A parallel line disaggregates storage from compute, as in systems such as Milvus 2.x and the research around serverless vector search, so that shards, replicas, and query nodes scale independently and the index can be re-sharded without re-ingesting data. On the algorithm side, learned routing and partition assignment (for example SOAR and related learned-partition methods, 2023 to 2025) aim to make clustered sharding hit higher recall at lower probe counts by partitioning vectors so that each query's neighbors concentrate in fewer shards. The common thread is loosening the memory-per-machine constraint and the work-recall trade that this section framed, so that future sharded indexes scatter to fewer shards and pay a smaller tail for the same capacity. We revisit web-scale instances of these systems in Chapter 36.

Exercise 25.5.1: Tail Latency From Shard Count Conceptual

Assume each shard answers within its latency target with probability $p = 0.99$ on any given query, independently across shards. Using $P(\text{all fast}) = p^{S}$, compute the probability that a query is fast (no slow shard) for $S = 1, 4, 16, 64, 256$. At which shard count does more than half of all queries hit at least one slow shard? Explain in one sentence why this makes the p99 latency budget, not the memory budget, the binding constraint on shard count for a large index, and connect your answer to the maximum-over-workers cost of Section 2.7.

Exercise 25.5.2: Measure the Boundary Recall Loss Coding

Extend Code 25.5.1 so that clustered sharding uses true $k$-means cluster assignment instead of the sign-bit code: fit $S$ centroids on the data with any simple $k$-means, assign each vector to its nearest centroid's shard, and route each query to the shards of its n_probe nearest centroids. Sweep n_probe from 1 to $S$ and plot recall against work per query. Compare the recall-versus-work curve to the sign-bit version in Output 25.5.1 and explain why a content-aware partition reaches a given recall at lower work than the crude sign-bit partition.

Exercise 25.5.3: Size a Sharded, Replicated Index Analysis

You must serve a 2-billion-vector index of 1024-dimensional float32 vectors. Each machine has 256 GB of usable memory, an HNSW index adds 40 percent overhead over the raw vectors, you must answer 8,000 queries per second, one replica of a shard sustains 1,500 queries per second, and you require at least 2 replicas of every shard for availability. Compute the minimum shard count from memory, the minimum replica count from the throughput target, the total machine count $S \times R$, and the resulting fan-out per query. Then argue, using the tail-latency model of Section 3, whether you would deploy exactly that shard count or fewer larger shards, and what you would measure first to decide.