"They promised me I would hold the whole model. They did not mention that ten thousand workers would all want their slice of me at once, every millisecond, forever."
A Parameter Server Under Mild Staleness
A parameter server stops being a server the moment its single machine runs out of memory to hold the parameters or bandwidth to serve them; the fix is to split the parameters across many server nodes by key, so that the workers' traffic fans out and the system's capacity grows with the number of servers instead of being capped by one. In the previous section we built the push-pull contract between workers and one logical server. Now we make that server real at scale. One box cannot hold a terabyte of embeddings, and even if it could, every worker pulling and pushing through one network card would queue behind every other worker. Sharding the key space across $S$ servers turns a central bottleneck into aggregate bandwidth, the same partitioning idea introduced in Section 2.3. The catch, which this section makes concrete with a runnable measurement, is that sharding only balances the load when the keys are accessed evenly; one popular item embedding can pin an entire shard.
The push-pull architecture of Section 11.2 treated the server as a single logical store of parameters: workers pull the current weights, compute gradients, and push them back. That logical picture is correct and we keep it. What we did not say is how many physical machines sit behind that logical store. The simplest answer, one machine, is where every parameter-server story begins and where it almost immediately ends, because a single server hits two ceilings at once. It can hold only as many parameters as fit in its memory, and it can serve only as many bytes per second as its network interface delivers. A recommendation model with billions of embedding rows blows past the first ceiling; a training job with thousands of workers all pulling each step blows past the second. This section is about removing both ceilings by the same move the rest of the book keeps making: partition the work across more machines.
1. Why One Central Server Cannot Hold Beginner
Consider what a single parameter server must do each training step. Every one of $W$ workers pulls a copy of the parameters it needs and later pushes a gradient of the same shape. If the model has $P$ parameters of 4 bytes each, and for simplicity every worker touches the whole model, the server must move on the order of $2 \cdot W \cdot P \cdot 4$ bytes per step, out for the pull and back in for the push. The two quantities that determine whether this is survivable are the server's memory, which must hold $4P$ bytes of parameters plus optimizer state, and the server's network bandwidth $B$, which sets a floor of $2 W P \cdot 4 / B$ seconds on every step no matter how fast the workers compute.
Both terms grow in directions you do not control. Memory grows with the model: modern recommendation systems carry embedding tables far larger than any single machine's RAM, a problem we confront directly in Section 11.6. Bandwidth pressure grows with the worker count: doubling the workers to train faster also doubles the bytes the one server must shovel, so past a point adding workers makes each step slower, not faster. A single central server is therefore not a small version of a distributed parameter server; it is a different thing that stops working at exactly the scale where you needed it.
One server offers a fixed memory budget and a fixed bandwidth $B$, but the load on it grows with both the model size $P$ and the worker count $W$. The per-step service time $2 W P \cdot 4 / B$ rises without bound while $B$ stays put, so a central server is a bottleneck by construction, not by misconfiguration. Sharding is the structural fix: spread the parameters over $S$ servers and the aggregate bandwidth becomes $S \cdot B$, a ceiling that you raise by buying more machines rather than a faster card.
2. Sharding the Key Space Across Servers Beginner
The parameters of a model are addressable by key: weight matrix $W_3$ row $4{,}812$, or embedding row for item $98{,}214$, or simply a flat index into a parameter vector. Because parameters are keyed, we can partition them exactly the way Section 2.3 partitions any keyed dataset: assign each key to one of $S$ servers by a deterministic function of the key. The cleanest such function hashes the key and takes the result modulo $S$, so that server $s$ owns the keys with $\text{hash}(k) \bmod S = s$. Each server holds roughly $1/S$ of the parameters and answers only for its own keys.
A worker that needs keys spread across the model no longer talks to one machine. It groups the keys it wants by their owning shard, sends one pull request to each relevant server, and waits for the parts to come back; a push fans out the same way, routing each gradient to the server that owns its key. The logical push-pull contract is unchanged, but underneath it the traffic is now spread over $S$ network cards instead of funneled through one. If $W$ workers each touch the whole model uniformly, every server sees about $1/S$ of the total bytes, and the per-step service floor falls from $2 W P \cdot 4 / B$ to $2 W P \cdot 4 / (S B)$. That is the whole payoff: the central bottleneck becomes aggregate bandwidth that scales with the server count.
The hash-modulo rule of Figure 11.3.1 is the simplest placement, and it has one well-known weakness: if you change $S$ (add or remove a server), almost every key's $\text{hash}(k) \bmod S$ changes, so the whole table reshuffles. Production systems avoid that with consistent hashing, which maps keys and servers onto the same ring so that adding a server moves only about $1/S$ of the keys instead of nearly all of them. We use plain modulo in this section's demo because the load-balancing behavior we want to expose is identical and the code stays readable; the consistent-hashing refinement is a placement detail layered on top of the same idea, covered as a partitioning technique in Section 2.3.
3. The Hot Key: Skew Comes Back Intermediate
Sharding balances the load only under an assumption that real workloads violate: that keys are accessed uniformly. They are not. In a recommendation model the embedding for a blockbuster movie or a viral product is pulled and updated far more often than the embedding for an obscure one. Whichever shard happens to own that hot key inherits a flood of traffic that the hash function spread nowhere, because all those requests carry the same key and the same key maps to one shard. This is precisely the data-skew problem that haunted the Spark shuffle in Section 7.7: a uniform partitioning function gives a uniform load only on a uniform key distribution, and a skewed distribution concentrates the work on whichever partition holds the heavy keys.
The next program makes the effect measurable. It shards a key space of 100,000 keys across $S = 8$ servers with the same hash-modulo rule, then issues two million access operations under two distributions: uniform, where every key is equally likely, and skewed, where 60 percent of accesses hit a single hot key and the remaining 40 percent spread evenly. It counts how many accesses land on each shard. Nothing about the sharding changes between the two runs; only the access pattern does.
import hashlib, random
S = 8 # number of server shards
KEYS = 100_000 # distinct parameter keys (e.g. embedding rows)
ACCESSES = 2_000_000 # total push/pull operations issued by the workers
def shard_of(key):
# consistent placement: hash the key, take it modulo the shard count
h = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
return int.from_bytes(h, "big") % S
random.seed(0)
# ---- Uniform access: each pull/push touches a key drawn evenly ----
uniform = [0] * S
for _ in range(ACCESSES):
k = random.randrange(KEYS) # every key equally likely
uniform[shard_of(k)] += 1
# ---- Skewed access: one hot key (a popular item embedding) dominates ----
# 60% of accesses hit a single hot key; the rest spread uniformly.
HOT_KEY = 42
skewed = [0] * S
for _ in range(ACCESSES):
k = HOT_KEY if random.random() < 0.60 else random.randrange(KEYS)
skewed[shard_of(k)] += 1
def report(name, load):
total = sum(load)
mean = total / S
hi = max(load)
bars = " ".join(f"{c/total:5.1%}" for c in load)
print(name)
print(f" per-shard share : {bars}")
print(f" busiest shard : {hi:,} accesses ({hi/total:.1%} of all)")
print(f" imbalance (max/mean): {hi/mean:.2f}x")
print()
print(f"servers S = {S}, keys = {KEYS:,}, accesses = {ACCESSES:,}")
print(f"hot key {HOT_KEY} lives on shard {shard_of(HOT_KEY)}")
print()
report("UNIFORM access", uniform)
report("SKEWED access (60% on one hot key)", skewed)
servers S = 8, keys = 100,000, accesses = 2,000,000
hot key 42 lives on shard 2
UNIFORM access
per-shard share : 12.5% 12.6% 12.5% 12.4% 12.4% 12.5% 12.5% 12.5%
busiest shard : 252,527 accesses (12.6% of all)
imbalance (max/mean): 1.01x
SKEWED access (60% on one hot key)
per-shard share : 5.0% 5.0% 65.0% 5.0% 5.0% 5.0% 5.0% 5.0%
busiest shard : 1,299,769 accesses (65.0% of all)
imbalance (max/mean): 5.20x
The uniform run confirms the promise of Section 2: eight shards, eight near-equal slices, an imbalance of 1.01x that would let the system serve roughly eight times the traffic of one central server. The skewed run confirms the threat: shard 2, which happens to own key 42, carries 65 percent of all accesses and runs 5.20 times hotter than the average shard. The aggregate bandwidth $S B$ is real, but a single worker's step is gated by the slowest shard it must wait for, and that shard is the hot one. The fix is not more shards (the hot key would still land on exactly one of them); it is to break the hot key itself, by replicating its parameter across several servers for reads or splitting its updates, which is the subject of the next two callouts and of replication below.
Add a wildly popular new item to a recommendation catalog and you can watch one parameter shard light up like a switchboard during a telethon. Engineers have a name for the keys that do this, "celebrity keys" or "hot keys", and a folklore remedy that sounds like a contradiction: store the celebrity in several places at once so no single shard has to take all its calls. The hot key does not care how cleverly you hashed it; fame routes to a single address unless you deliberately give it more.
4. Replication: Availability and Read Throughput Intermediate
Sharding solves capacity but introduces a fragility: each parameter now lives on exactly one server, so if that server crashes, the slice of the model it held is gone and training stalls. It also means a hot key's reads all hit one machine. Replication addresses both. Keep $r$ copies of each shard on $r$ different servers; a write must update the copies (or a designated primary that propagates the change), but a read can be served by any replica. For availability this means losing one server costs you nothing, because a replica still holds the shard, the fault-tolerance property we develop fully in Section 11.8. For read throughput it means a hot key's pull traffic can be spread across $r$ replicas instead of crushing one, turning the 5.20x imbalance of Output 11.3.1 into something closer to $5.20 / r$ for reads.
Replication is not free, and the tension is the familiar one from Section 2.3: more replicas give more read throughput and more resilience but cost more memory ($r$ copies of the parameters) and make writes more expensive, because a gradient push must reach every replica to keep them consistent, or accept that replicas drift slightly between syncs. Parameter servers exploit a property that ordinary databases cannot: machine learning training already tolerates a little staleness, so replicas need not be perfectly synchronized after every push. That tolerance is what makes aggressive replication and asynchronous updates practical here, and it is the bridge to Section 11.4, which asks how synchronized the updates across servers and workers really need to be.
Who: A machine learning platform engineer at a video-streaming service running the recommendation model.
Situation: The item-embedding table reached 400 GB and no longer fit on the single parameter-server box that had served it since launch.
Problem: Training workers stalled waiting on that one server, and a planned catalog expansion would push the table past 700 GB, well beyond any single machine.
Dilemma: Rent one enormous high-memory instance, simple but capped and costly, or shard the embedding table across many ordinary servers, cheaper and scalable but requiring keyed routing, replication, and a hot-key story.
Decision: They sharded the table across sixteen servers by hashing the item id, because memory was the binding ceiling and only sharding raises it by adding machines rather than buying a bigger one.
How: Workers grouped each batch's item ids by shard and issued one pull per relevant shard; the three hottest items were replicated across four servers each so their read traffic spread, exactly the remedy Output 11.3.1 motivates.
Result: The table now fits with headroom, per-step wait time fell because traffic fanned out across sixteen network cards, and the previously saturated shards dropped to roughly even load once the celebrity items were replicated.
Lesson: Shard for capacity, then replicate the few hot keys for balance and availability. Sharding alone raises the ceiling; only attacking the hot key flattens the load beneath it.
5. The Li et al. Design and Aggregate Bandwidth Advanced
The architecture this section describes is the one Mu Li and colleagues formalized in their 2014 parameter-server design, the reference point for nearly every keyed parameter store since. Their system splits parameters across a group of server nodes, each owning a range of the key space, with a server manager that tracks which node holds which keys and replicates ranges for fault tolerance. Workers hold a shard of the training data and a routing layer that turns a set of keys into per-server requests, exactly the fan-out of Figure 11.3.1. Two design choices make it scale: parameters and gradients are communicated as sparse key-value pairs, so a worker moves only the keys it actually touched rather than the whole model, and updates are aggregated on the server side, so a server applies the combined effect of many workers' pushes to its slice.
The reason the design scales is the arithmetic of Section 2 read in reverse. A central server offers bandwidth $B$; the sharded server offers $S B$, because $S$ independent machines each serve their slice in parallel over their own network card. As long as the access pattern is reasonably uniform, throughput rises almost linearly with $S$, which is why these systems were demonstrated on hundreds of server nodes serving models with billions of parameters. The qualifier "reasonably uniform" is the entire content of Section 3: the aggregate bandwidth is a ceiling the workload reaches only when no single shard is hot. This same sharded-store-with-aggregation pattern returns, transformed, when we shard a single model's weights and optimizer state across devices in Chapter 16; the parameter server shards a key-value store across machines, ZeRO and FSDP shard one contiguous weight tensor across accelerators, but the move (split by key or index, fan out the traffic, aggregate the result) is the same.
The partitioning of a keyed dataset across machines, introduced abstractly in Section 2.3, is here applied to the parameters themselves: the model becomes a partitioned key-value store. The same idea, with the same skew caveat, will return as sharded model state in Chapter 16 and as sharded vector indexes in Chapter 25. Each time you meet a system that "scales to billions" of something, look for the key it shards on and ask what happens when one key gets hot; the answer is almost always the lesson of Output 11.3.1.
The hot-key problem of Output 11.3.1 is an active engineering frontier precisely because recommendation embedding tables keep growing into the trillions of parameters. Recent industrial systems attack skew with hardware-software co-design: NVIDIA's HugeCTR and its HierarchicalKV backend (2024) cache the hottest embedding rows in GPU high-bandwidth memory while the cold tail lives in host memory or on remote shards, so a celebrity key is served from the fastest tier rather than crushing one cold shard. Meta's work on terabyte-scale embeddings reports frequency-aware sharding that places hot rows on dedicated, replicated shards and even quantizes or dimension-reduces the long cold tail to fit the memory budget. Parallel lines on "in-network aggregation" push gradient combining into programmable switches to relieve the busiest server's inbound bandwidth. The common thread for 2024 to 2026 is that uniform hash sharding is treated as a baseline to be corrected with access-frequency information, never as the final word; we apply the same frequency-aware thinking to embedding tables directly in Section 11.6 and to distributed recommendation in Chapter 38.
Code 11.3.1 routed keys to shards by hand to expose the load, but you do not implement a sharded parameter server from scratch in production. A Ray actor is a stateful process pinned to a node; declaring $S$ of them gives you $S$ server shards, and the routing collapses to choosing an actor by the key's hash:
# pip install ray
import ray
ray.init()
@ray.remote
class ParamShard: # one actor == one server shard
def __init__(self): self.params = {}
def pull(self, keys): return {k: self.params.get(k, 0.0) for k in keys}
def push(self, grads): # server-side aggregation
for k, g in grads.items(): self.params[k] = self.params.get(k, 0.0) - 0.05 * g
S = 8
shards = [ParamShard.remote() for _ in range(S)] # S processes, S nodes
def shard_for(key): return shards[hash(key) % S] # route by key
ray.get(shard_for(42).push.remote({42: 1.3})) # push lands on one shard
pull and push, and restart of a failed actor, leaving only the key-to-shard rule for you to choose.6. What Sharding Bought, and What It Did Not Intermediate
Sharding the parameter server converts a fixed-capacity central machine into a system whose memory and bandwidth grow with the server count $S$. That is a real and necessary win: it is the only way a 400 GB or 7 TB embedding table runs at all, and it is the only way thousands of workers pull each step without queuing behind one network card. Replication layers availability and read scaling on top, at the cost of memory and write coordination, with machine learning's tolerance for staleness softening the coordination cost in a way ordinary storage systems cannot enjoy.
What sharding did not buy is freedom from skew. The aggregate bandwidth $S B$ is a ceiling reached only when the load spreads evenly, and real keyed workloads, recommendation embeddings most of all, are sharply skewed. The hot key is the same adversary that unbalanced the MapReduce and Spark shuffles, met again at the parameter layer, and it is handled by the same family of remedies: replicate the heavy key, split its updates, or route its traffic specially. With the physical layout of the server settled (how many machines, partitioned by key, replicated for safety), the next question is one of timing: when many workers and many server shards all push and pull at once, how tightly must their updates be synchronized? That is where Section 11.4 begins.
In Output 11.3.1, shard 2 carries 65 percent of the traffic because it owns the single hot key. Suppose you double the server count from 8 to 16 to relieve it. Explain precisely why the hot shard's share of traffic does not fall in proportion, and what the imbalance (max/mean) would do instead. Then name two mechanisms from Section 4 that do reduce the hot shard's load, and state which one helps reads, which helps writes, and which helps availability.
Extend Code 11.3.1. Detect the hottest key (the one with the most accesses) and replicate it across $r = 4$ shards, routing each read of that key to one of its $r$ replicas chosen at random while keeping all other keys on their single shard. Re-run the skewed workload and report the new per-shard load and imbalance. By how much did replicating one key reduce the busiest shard's share, and what is the residual imbalance once the hot key is spread? Discuss why this helps reads but would complicate writes.
A model has $P = 2 \times 10^9$ parameters of 4 bytes each, $W = 500$ workers each pulling and pushing the full model every step, and each server has network bandwidth $B = 12.5$ gigabytes per second. Using the per-step service floor $2 W P \cdot 4 / (S B)$, compute the minimum seconds per step for a single central server ($S = 1$) and for a sharded server with $S = 50$, assuming uniform access. Then argue, from Output 11.3.1, how a single hot key carrying 65 percent of traffic changes the effective per-step floor for the sharded case, and why that brings you back to the remedies of Section 4. We make this bandwidth accounting rigorous in Chapter 10.