Part I: Foundations of Distributed AI
Chapter 2: Distributed Systems Concepts for AI

Partitioning, Sharding, and Replication

"They split me into eight pieces so I would fit, then copied each piece three times so I would survive. I have never been so divided and so redundant at once."

A Shard That Believes It Is the Whole Model
Big Picture

There are only two fundamental ways to spread data and state across machines, and every distributed AI system is built from a mixture of the two. You either partition (cut one large thing into disjoint pieces so each machine holds a part) or you replicate (copy the same thing onto several machines so any of them can answer). Partitioning buys capacity and parallelism: a dataset, a parameter table, or a search index too big for one node fits when sliced across many. Replication buys availability and read throughput: a copy survives the loss of its original, and many copies answer many readers at once. The two pull in opposite directions, partitioning fragments state while replication duplicates it, and the tension between them, plus the consistency cost replication imposes, organizes the rest of this section. Every later mechanism in the book, from data loaders to sharded optimizers to vector indexes, is a specialization of these two moves.

The previous section established that a distributed system is a collection of machines coordinating through messages, and that no single machine holds the whole picture. This section asks the immediate next question: given that the state must live on many machines, how do we decide which byte goes where? Two answers cover the entire design space. We can divide the state into disjoint pieces and scatter the pieces, which we call partitioning or sharding, or we can copy the state and place the copies on multiple machines, which we call replication. Real systems do both at once, and the interesting engineering is in choosing the partition function, the replication factor, and the consistency rules that hold the copies together. We take the two moves in turn, then show how they compose, and close by tracing where each one resurfaces in the AI chapters ahead.

Partitioning (sharding) disjoint pieces, capacity and parallelism Full dataset: rows A through L Node 1 rows A-D Node 2 rows E-H Node 3 rows I-L each row lives on exactly one node; total capacity = sum of the nodes Replication identical copies, availability and read throughput Full dataset: rows A through L Node 1 rows A-L Node 2 rows A-L Node 3 rows A-L every row lives on all three nodes; survives node loss, serves three readers
Figure 2.3.1: The two fundamental moves. Partitioning (left) cuts one dataset into disjoint shards, so total capacity grows with the number of nodes but each row has a single home. Replication (right) copies the full dataset onto every node, so any node can answer and the loss of one node loses nothing, at the cost of storing the data several times. Real systems combine both: partition for capacity, then replicate each partition for safety.

1. Partitioning: Cutting One Dataset Into Disjoint Shards Beginner

Partitioning, also called sharding, splits a single logical dataset or structure into disjoint subsets and assigns each subset to a different node. The word shard pictures it well: one pane of glass broken into pieces that together make up the whole, with no piece overlapping another. A partition function maps each item to exactly one shard, so a key of interest has a single home and a reader knows where to look. The payoff is twofold. Capacity grows with the number of nodes, because the data that would not fit on one machine now occupies many. Throughput grows too, because independent shards can be scanned, updated, or trained on in parallel without coordinating. The cost is that any operation spanning many keys, a join, a global aggregation, a nearest-neighbor search, must now touch many shards and combine their answers, which is exactly the communication tax that Chapter 3 teaches you to price.

Two partition functions dominate practice, and they trade the same two properties against each other: how evenly load is spread, and whether items that are close in key space land near each other. Hash partitioning sends item with key $k$ to shard $\text{hash}(k) \bmod M$ for $M$ shards. A good hash scatters keys uniformly, so each shard receives roughly $1/M$ of the items regardless of how skewed the original keys were; this is the default when you want balanced load and only ever look keys up one at a time. Range partitioning instead assigns contiguous key intervals to shards: keys $[a, b)$ to shard 1, $[b, c)$ to shard 2, and so on. Range partitioning keeps neighboring keys together, which makes range scans and ordered iteration cheap, but it invites hot spots when traffic concentrates in one interval (think of timestamps, where today's range absorbs every new write). Table 2.3.1 lays the trade-off out directly.

Table 2.3.1: Hash versus range partitioning. The choice is governed by whether your workload is dominated by point lookups (favor hash) or range scans and ordered access (favor range), and by how much you fear load skew.
PropertyHash partitioningRange partitioning
Key placement$\text{hash}(k) \bmod M$, scatteredcontiguous intervals, ordered
Load balanceeven, even for skewed keysuneven if traffic concentrates in a range
Point lookupone shard, $O(1)$ routingone shard via interval search
Range scantouches all shardstouches only the spanning shards
Typical AI useshuffled training shards, embedding rowstime-ordered logs, sorted feature stores
Key Insight: The Partition Function Is a Design Decision, Not a Detail

Choosing hash versus range, and choosing how many shards, fixes the load balance and the cost of every multi-key operation for the life of the system. A hash partition spreads load evenly but makes ordered scans expensive; a range partition makes scans cheap but can pile all the new traffic onto one node. There is no neutral default: you are trading even load against locality, and you should pick the side your workload actually needs. Most failures attributed to "the database being slow at scale" are really a partition function matched to the wrong access pattern.

2. Consistent Hashing: Rebalancing When Nodes Join and Leave Intermediate

Plain hash partitioning has a hidden flaw that surfaces the moment the cluster changes size. The shard of key $k$ is $\text{hash}(k) \bmod M$, and that formula depends on $M$. Add one node so $M$ goes from 8 to 9, and almost every key now hashes to a different shard, because changing the modulus reshuffles the entire mapping. In a static cluster this never matters, but real clusters grow, shrink, and replace failed machines constantly, and re-homing nearly all the data every time a single node joins is ruinous: it saturates the network, evicts caches, and stalls the workload for the duration of the move. We want the opposite property, that adding the $(M{+}1)$-th node moves only about $1/(M{+}1)$ of the keys, the bare minimum needed to give the newcomer its fair share.

Consistent hashing delivers exactly that property. Picture the hash output space as a ring (the integers modulo $2^{32}$, wrapped end to end). Each node is hashed to one or more positions on the ring, and each key is hashed to a position too; a key belongs to the first node found by walking clockwise from the key's position. The crucial consequence is locality of change: when a node joins, it lands at some point on the ring and takes over only the keys in the arc between it and its clockwise neighbor. Every other key keeps its old home untouched. To keep load even and avoid a single node owning a long arc by bad luck, each physical node is placed at many positions, called virtual nodes; the more virtual nodes, the smoother the load distribution and the closer the moved fraction gets to the ideal $1/(M{+}1)$. The demonstration below builds both schemes on the same 100,000 keys, adds one node to an 8-node cluster, and counts how many keys change homes.

import hashlib, bisect

def h(s):                                     # stable hash, independent of run
    return int(hashlib.md5(str(s).encode()).hexdigest(), 16)

NUM_KEYS = 100_000
keys = [f"key-{i}" for i in range(NUM_KEYS)]

# Plain hash (modulo) partitioning: shard = hash(k) % M
def mod_assign(M):
    return {k: h(k) % M for k in keys}

before_mod = mod_assign(8)
after_mod  = mod_assign(9)                    # one node joins: M goes 8 -> 9
moved_mod  = sum(1 for k in keys if before_mod[k] != after_mod[k])

# Consistent hashing on a ring, with virtual nodes for smooth load
RING   = 1 << 32
VNODES = 200                                  # replicas per node smooth the ring

def build_ring(node_ids):
    ring = sorted((h(f"node-{n}-vn-{v}") % RING, n)
                  for n in node_ids for v in range(VNODES))
    return [p for p, _ in ring], [n for _, n in ring]

def ch_assign(node_ids):
    pos, owners = build_ring(node_ids)
    return {k: owners[bisect.bisect(pos, h(k) % RING) % len(pos)] for k in keys}

before_ch = ch_assign(range(8))
after_ch  = ch_assign(range(9))               # add node 8 to the ring
moved_ch  = sum(1 for k in keys if before_ch[k] != after_ch[k])

print(f"keys total                    : {NUM_KEYS}")
print(f"nodes before / after          : 8 -> 9 (one node joins)")
print(f"plain hash mod  keys moved     : {moved_mod:>7}  ({100*moved_mod/NUM_KEYS:5.1f}% of all keys)")
print(f"consistent hash keys moved     : {moved_ch:>7}  ({100*moved_ch/NUM_KEYS:5.1f}% of all keys)")
print(f"ideal minimum (1/9 of keys)    : {NUM_KEYS//9:>7}  ({100/9:5.1f}%)")
print(f"reduction factor               : {moved_mod/moved_ch:5.1f}x fewer keys moved")
Code 2.3.1: Plain modulo hashing versus consistent hashing under a single node addition. Both schemes assign the same 100,000 keys; the only difference is how the mapping reacts when the cluster grows from 8 nodes to 9.
keys total                    : 100000
nodes before / after          : 8 -> 9 (one node joins)
plain hash mod  keys moved     :   88863  ( 88.9% of all keys)
consistent hash keys moved     :   11280  ( 11.3% of all keys)
ideal minimum (1/9 of keys)    :   11111  ( 11.1%)
reduction factor               :   7.9x fewer keys moved
Output 2.3.1: Adding one node forces plain hashing to re-home 88.9% of all keys, but consistent hashing moves only 11.3%, a hair above the 11.1% ideal of $1/9$, for a 7.9x reduction. The virtual-node smoothing is what pins the consistent-hash figure so close to the theoretical floor.

The numbers tell the whole story. Plain modulo hashing scrambled nearly nine tenths of the keys to absorb one new machine, while consistent hashing moved only the keys the new machine had to take, a touch over one ninth. That gap is the difference between a cluster that can grow and shrink gracefully and one that seizes up every time its membership changes. This is why consistent hashing (introduced for distributed web caches and popularized by the Dynamo storage system) underlies the partitioning layer of countless distributed databases, caches, and, as we will see, the embedding and vector stores that AI systems lean on.

Fun Note: The 89% That Did Not Need to Move

In the plain-hash run, 88,863 keys changed homes to make room for a single new node. Almost none of them had to: only about 11,111 keys genuinely belonged to the newcomer. The other 77,000-odd moves were pure churn, data shuttled across the network for no reason other than that $\bmod 8$ and $\bmod 9$ disagree. Consistent hashing's entire contribution is refusing to do that pointless work.

3. Replication: Copying State for Availability and Read Throughput Beginner

Partitioning answers "how do we fit and parallelize?" but it does nothing for survival: if the single node holding shard 3 dies, shard 3 is gone, and any operation needing those keys fails. Replication answers the survival question by copying state onto several nodes. With a replication factor of three, every piece of data lives on three machines, so the system tolerates the loss of two of them without losing the data, and three readers can be served at once from the three copies. Replication is the source of the high availability that production systems advertise, and it is the reason a well-built service stays up while individual machines underneath it crash, reboot, and get replaced. The two motives reinforce each other: copies provide both fault tolerance (a copy survives the loss of the original) and read scaling (many copies share the read load).

The catch is that copies must agree. The instant you have more than one copy of a mutable value, a write has to reach all of them, and until it does, the copies disagree: a reader hitting a stale copy sees an old value. This is the consistency cost of replication, and it is not a bug to be fixed but a permanent tension to be managed. You can insist that every write reach every copy before it is acknowledged, which keeps copies identical but makes writes slow and blocks them entirely when a copy is unreachable. Or you can let writes return after reaching some copies and propagate to the rest in the background, which keeps writes fast and available but lets readers occasionally see stale data, a posture called eventual consistency. The synchronous-versus-asynchronous choice you meet here returns throughout the book, as synchronous versus asynchronous gradient updates in Chapter 10 and as bounded-staleness parameter reads in Chapter 11. The deeper impossibility result that forces this trade-off, that a replicated system cannot be simultaneously consistent and available when the network partitions, is the CAP theorem, formalized in Section 2.5.

Key Insight: Replication Trades Consistency for Availability, and You Choose Where on the Dial

A single copy is always consistent and never available under failure; many copies are highly available but consistent only as fast as writes can propagate. There is no setting that gives you instant writes, always-fresh reads, and survival of arbitrary failures all at once. Every replicated system, a database, a parameter server, a model registry, a vector index, picks a point on this dial, and naming that point (strong, bounded-staleness, or eventual consistency) is the first thing to ask about any replicated component you depend on.

4. Composing the Two: Partition for Capacity, Replicate for Safety Intermediate

Production systems never choose between partitioning and replication; they layer them. First partition the state so it fits and parallelizes, then replicate each partition so the loss of a node does not lose a shard. A dataset of one petabyte across 100 nodes might be cut into 100 shards of ten terabytes each, with each shard copied onto three nodes, giving 300 storage assignments that survive any two simultaneous failures while still letting 100 shards be read in parallel. The replication factor controls how many failures you tolerate; the shard count controls how much you parallelize; the partition function controls how evenly load lands. These three knobs, set independently, describe essentially every distributed storage and state layer you will meet, including the ones built specifically for AI.

Composition also inherits the costs of both moves. Partitioning's cost is that multi-shard operations must gather from many places; replication's cost is that every write must fan out to several copies and the copies can disagree. A system that both shards and replicates pays both taxes, which is why the cheapest distributed operation is always the one that touches a single shard's single up-to-date copy, and why so much distributed-systems engineering is about arranging for the common case to be exactly that. Keeping these costs explicit, rather than hidden behind a convenient abstraction, is the habit this chapter is trying to build, and Chapter 6 shows the first large payoff when MapReduce turns partitioned data into parallel computation.

5. Where Partitioning and Replication Return in AI Intermediate

The two moves are not background infrastructure that AI happens to sit on; they are load-bearing in the AI methods themselves, and the same vocabulary recurs at every layer of the stack. The clearest case is the training data. A web-scale corpus is partitioned into data shards so that many workers each stream a disjoint slice in parallel, which is precisely the data-parallel setup from Section 2.2 seen from the storage side; Chapter 8 builds the distributed loaders that feed those shards to GPUs without starving them, and a replication factor on the underlying object store keeps a shard available when a storage node fails.

Partitioning reappears, far less obviously, inside the model. When an embedding table or a parameter vector is too large for one accelerator, it is sharded across devices exactly like a dataset: Chapter 11 shards billion-row embedding tables across parameter servers using the hash partitioning of this section, and Chapter 16 shards the parameters, gradients, and optimizer state of a single giant model across GPUs in the ZeRO and FSDP schemes, where the "shard" is a slice of one weight tensor rather than a slice of data. At serving time the pattern returns once more: a vector index too large to search on one node is split into index shards that are searched in parallel and whose partial results are merged, the design Chapter 25 develops for distributed retrieval and approximate nearest-neighbor search. Table 2.3.2 collects these recurrences so you can see one idea wearing many costumes.

Table 2.3.2: One concept, many AI incarnations. Each row is a place later in the book where the partitioning or replication of this section does the heavy lifting under a different name.
AI incarnationWhat is partitioned or replicatedWhere developed
Data shardstraining corpus split across loaders; copies for durabilityChapter 8
Sharded embeddingsbillion-row embedding tables hash-partitioned across serversChapter 11
Sharded parametersone model's weights, gradients, optimizer state split across GPUsChapter 16
Vector index shardsnearest-neighbor index split, searched in parallel, results mergedChapter 25
Thesis Thread: Sharding Is the Same Cut, From Data to Model to Index

This section's partitioning move is one of the book's signature arcs. Introduced here as a concept, it is deepened into data shards and sharded embedding tables (Chapter 8, Chapter 11), then transformed into model shards and vector-index shards (Chapter 16, Chapter 25). When you reach those chapters, notice that the questions are always the ones raised here: what is the partition function, how even is the load, and how do we recombine the pieces? Recognizing the same cut under each new name is how a reader turns four hard chapters into one idea applied four times.

Library Shortcut: A Production Ring in a Few Lines

The 25-line consistent-hash ring in Code 2.3.1 is enough to teach the idea, but you would not hand-roll it in production, where you also need weighting, replication placement, and thread safety. The maintained uhashring library gives you the same ring, virtual nodes included, in a handful of lines, and the HashRing object handles node addition and removal with the minimal-movement guarantee built in:

from uhashring import HashRing                 # pip install uhashring

ring = HashRing(nodes=[f"node-{i}" for i in range(8)])   # 8 nodes, vnodes by default
home = ring.get_node("key-42")                # which node owns this key
ring.add_node("node-8")                        # join: only ~1/9 of keys re-home
ring.remove_node("node-3")                     # leave: only that node's keys move
Code 2.3.2: The same ring as Code 2.3.1 via uhashring. Roughly two dozen lines of ring construction and bisection collapse to a constructor call, and the library handles virtual-node weighting and the add and remove rebalancing that Output 2.3.1 measured by hand.
Practical Example: The Cache Cluster That Could Not Grow

Who: A platform engineer running the feature cache in front of a recommendation model.

Situation: A 16-node in-memory cache held precomputed user feature vectors, keyed by user id with plain hash(user_id) % 16 routing.

Problem: Black-Friday traffic needed the cache scaled to 24 nodes, but adding nodes flushed almost the entire cache, and the resulting flood of cold-cache misses hammered the backing store and tripled tail latency for an hour.

Dilemma: Scale the cache and eat a painful cache-cold window during the worst possible traffic, or freeze the cluster size and risk running out of memory as traffic climbed.

Decision: Neither: they replaced modulo routing with consistent hashing so the cluster could grow without invalidating itself.

How: They swapped the routing layer for a consistent-hash ring with 200 virtual nodes per machine (the uhashring approach of Code 2.3.2), then added the eight new nodes one at a time during a low-traffic window.

Result: Each node addition re-homed only about $1/N$ of the keys instead of nearly all of them, matching the 11% figure of Output 2.3.1; the hit rate barely dipped and the backing store never noticed the scale-up.

Lesson: If a cluster will ever change size, and almost all do, the partition function must move the minimum number of keys. Consistent hashing is not an optimization here; it is the difference between a cache that scales and one that self-destructs when you touch it.

6. The Frontier: Sharding the Largest Models and Indexes Advanced

Partitioning and replication are old ideas, but the scale of modern AI keeps forcing new variants of them, and the 2024 to 2026 literature is full of sharding strategies invented because the thing being split got too large for the previous one. On the training side, the FSDP and ZeRO family (Zhao et al., 2023, PyTorch FSDP; and the ongoing ZeR++ and per-parameter sharding work in DeepSpeed through 2024) shard not just data but the optimizer state, gradients, and parameters of a single model across hundreds of GPUs, then re-gather each shard just in time for the layer that needs it, treating a weight tensor exactly as this section treats a dataset. On the serving side, distributed vector search has become a research field of its own: systems in the lineage of DiskANN and the 2024 wave of disaggregated and GPU-accelerated vector databases (such as the partitioned-index designs surveyed for billion-scale retrieval-augmented generation) shard a billion-vector index across machines and tune the partition function to keep each query touching as few shards as possible, because every extra shard a query fans out to is added tail latency. The common research question across both is sharper than the classic one: not merely how to split state, but how to split it so that the recombination, the all-gather of weights or the merge of partial search results, stays cheap as the shard count climbs. We meet the training answer in Chapter 16 and the serving answer in Chapter 25.

Research Frontier: Partition Functions That Minimize Recombination (2024 to 2026)

The newest work treats the partition function as something to optimize for the cost of putting the pieces back together, not just for even load. In sharded training, per-parameter FSDP and ZeRO++ (DeepSpeed, 2023 to 2024) reduce the communication of re-gathering shards by quantizing and overlapping the all-gather, so the shard boundary is chosen to hide its own recombination cost. In billion-scale vector search, learned and clustering-based partitioners (the IVF and graph-partition designs behind 2024 disaggregated vector databases and RAG retrieval stacks) place vectors so that a query's nearest neighbors cluster onto few shards, cutting the fan-out that drives tail latency. Both lines share one premise with this section: the partition function is the most consequential decision in a sharded system, and the frontier is choosing it so the merge step, the all-gather or the result-union, never becomes the bottleneck. We build the tools to evaluate these claims in Chapter 3 and Chapter 4.

We now have the third concept of this chapter in hand: state lives on many machines either by partitioning (disjoint shards for capacity and parallelism, with consistent hashing to rebalance gracefully) or by replication (copies for availability and read throughput, at the cost of keeping them consistent), and real systems compose the two. What we have not yet confronted is what happens when the machines holding those shards and copies fail, which is not an exceptional event but the normal condition of a large cluster. Turning replication from a throughput trick into a survival strategy, and deciding what "still correct" means when copies disagree during a failure, is the subject of Section 2.4.

Exercise 2.3.1: Match the Partition Function to the Access Pattern Conceptual

For each store, decide whether hash or range partitioning fits better and justify the choice in terms of the trade-off in Table 2.3.1: (a) a feature store queried only by exact user id for point lookups; (b) a time-series log of training metrics almost always read as "the last hour"; (c) an embedding table whose rows are looked up individually by item id but whose write traffic is heavily skewed toward a few popular items. For (c), explain why hash partitioning helps the skew but a single very hot item can still overload its shard, and name one mitigation.

Exercise 2.3.2: Measure the Rebalancing Curve Coding

Extend Code 2.3.1 to grow the consistent-hash cluster one node at a time from 4 nodes up to 32, and at each step record the fraction of keys that move. Plot or tabulate the moved fraction against the cluster size and confirm it tracks the predicted $1/(M{+}1)$. Then rerun with VNODES set to 1, 10, and 200 and report how the variance of the per-node load (not just the moved fraction) shrinks as virtual nodes increase. Explain in one paragraph why too few virtual nodes can leave one physical node owning a disproportionate arc of the ring.

Exercise 2.3.3: Price the Composed System Analysis

A 500-terabyte embedding store is sharded into $S$ shards, each replicated $R$ times across nodes that hold 4 terabytes apiece. (a) Write the number of nodes needed as a function of $S$ and $R$ and the per-node shard size, and check it for $S = 250$, $R = 3$. (b) A point lookup reads one up-to-date replica of one shard; a write must reach all $R$ replicas. Express the read and write fan-out and state how each scales as you raise $R$ for more fault tolerance. (c) Argue from these expressions why raising $R$ improves availability but worsens write cost and consistency latency, connecting your answer to the consistency dial of Section 3. We make the cost side of this rigorous in Chapter 3.