Part III: Distributed Machine Learning
Chapter 11: Parameter Servers and Distributed Embeddings

Parameter Servers vs All-Reduce in Modern Systems

"They told me I lost. Then they handed me every embedding table in the building and asked me to hold still."

A Parameter Server in Honorable Semi-Retirement
Big Picture

The parameter server and all-reduce are not winner and loser; they are two answers to the same combine step, each optimal for a different shape of gradient. All-reduce won the dense deep-learning workload because it is symmetric, has no central bottleneck, and is bandwidth-optimal when every worker has a gradient for every parameter. The parameter server kept the workloads where its asymmetry is an advantage: terabyte embedding tables where each step touches a thousandth of a percent of the parameters, asynchronous and heterogeneous workers, and reinforcement learning where one learner serves weights to a swarm of actors. The dominant industrial design does not choose; it runs data-parallel all-reduce for the dense network and a sharded parameter server for the embeddings, in one job. This section makes the trade-off quantitative, then closes Chapter 11.

Across this chapter we built the parameter server from its motivation (Section 11.1) through push-pull (Section 11.2), sharding (Section 11.3), synchronous and asynchronous updates (Section 11.4), bounded staleness (Section 11.5), the sparse embedding tables that are its signature workload (Sections 11.6 and 11.7), and fault tolerance (Section 11.8). In parallel, Chapter 4 gave us all-reduce as a symmetric collective with no central node, and Section 10.10 closed the optimization side of the story. Two complete architectures now sit on the table for the same job: take the partial gradients computed on $K$ workers and turn them into one consistent parameter update. The question this section answers is not "which is better" but "which is better for which gradient", and the answer is sharp enough to predict the design of almost every large training system in production.

1. The Two Architectures, Side by Side Beginner

A parameter server is asymmetric: stateless workers hold no authoritative copy of the parameters; they pull the current values from server shards, compute gradients, and push updates back. The servers own the state and apply the updates. All-reduce is symmetric: every worker holds a full replica of the parameters, computes a partial gradient, and the collective sums those partials so that all workers end the step holding the identical averaged gradient and the identical updated weights. No node is special, and no node owns the truth more than any other.

That structural difference decides everything downstream. The parameter server has a natural home for state that is too large to replicate, because the state lives in exactly one place (sharded across servers) rather than once per worker. All-reduce has no central link to saturate and no server to fall behind, because the work and the bandwidth are spread evenly across the ring. Figure 11.9.1 lays the decision out as a flowchart: the shape of your gradient, dense or sparse, and the rhythm of your workers, synchronous or not, route you to one architecture or to the hybrid that uses both.

What does one step touch? dense gradient, synchronous sparse, async, or huge embeddings All-reduce symmetric, no central link, bandwidth-optimal for dense Parameter server touches only used keys, holds state too big to replicate A model with BOTH a dense net and sparse embeddings? the standard recsys case Hybrid: all-reduce the dense network, shard the embeddings on a parameter server
Figure 11.9.1: The decision in one picture. A dense, synchronous gradient routes to all-reduce (green); a sparse, asynchronous, or oversized-embedding workload routes to a parameter server (orange); a model that has both a dense network and sparse embedding tables, the standard recommendation case, routes to the hybrid (purple) that runs both at once. The runnable demo below quantifies the two top routes.

2. Why All-Reduce Won the Dense Workload Intermediate

For a dense network, every worker computes a gradient for every parameter on every step. Stack the per-worker gradients into one vector of length $P$ and the combine step is exactly the all-reduce of Chapter 4: sum $K$ vectors of length $P$ and give every worker the sum. A ring all-reduce moves about $\frac{2(K-1)}{K}\,P$ elements across each worker's single link, and crucially that per-link cost is flat in $K$: adding workers does not make any one link busier. A centralized parameter server, by contrast, funnels every worker's pull and push through the server, so the server link carries traffic proportional to $2KP$, and it becomes the bottleneck the moment $K$ grows. Sharding the server across $S$ machines divides that by $S$, but the asymmetry remains: you are buying bandwidth back rather than getting it for free from the topology.

This is the heart of why the field of data-parallel deep learning standardized on all-reduce. The gradients are dense, the updates must be exact and synchronous to preserve the convergence guarantees, and the bandwidth-optimal collective from Section 4.4 moves the fewest bytes with no hotspot. The same primitive returns one layer down as reduce-scatter and all-gather inside sharded training (Chapter 16), so investing in fast collectives pays off across the whole parallel-training stack.

Key Insight: The Gradient's Shape Picks the Architecture

All-reduce is optimal when the gradient is dense: every worker has a value for every parameter, so summing fixed-length vectors with no central node is exactly the right move. A parameter server is optimal when the gradient is sparse: a step touches a tiny, data-dependent subset of an enormous table, so moving only the touched rows beats moving the whole table by orders of magnitude. The same job can have both kinds of gradient, which is why the same job often runs both architectures.

3. Where the Parameter Server Stays Decisive Intermediate

Three properties keep the parameter server in production long after all-reduce won the dense case. The first is sparsity at scale. A recommendation model's embedding table can hold $10^8$ to $10^{10}$ rows, terabytes of parameters, yet one minibatch looks up only a few thousand of them. All-reduce has no notion of "which rows are nonzero"; it would move the entire table every step. The parameter server pulls and pushes only the touched rows, which is the model-parallel embedding design of Sections 11.6 and 11.7. The second is asynchrony and heterogeneity: workers that run at different speeds, or that join and leave, do not stall each other under the push-pull model with bounded staleness (Section 11.5), whereas a synchronous all-reduce runs at the speed of its slowest member. The third is the learner-and-actors pattern of reinforcement learning, where one learner owns the weights and many actors pull the latest policy to generate experience; that is a parameter server in all but name, and Chapter 20 builds its infrastructure on exactly this asymmetry.

The runnable demo below makes the two extremes concrete in pure Python. It computes the communication of a dense step and a sparse step under both architectures and reports who moves fewer bytes. The point is not the exact constants but the order of magnitude, which is decisive in both directions.

P = 50_000_000          # dense parameters (a ranking MLP)
K = 8                   # data-parallel workers
bytes_per_param = 4     # fp32

# Ring all-reduce: each worker's single link carries ~2*(K-1)/K * P, flat in K.
allreduce_total   = 2 * (K - 1)     * P * bytes_per_param   # summed over the ring
allreduce_busiest = 2 * (K - 1) / K * P * bytes_per_param   # the busiest link
# Centralized parameter server: every worker pulls P and pushes P through the
# server, so the server link carries 2*K*P and grows with K.
ps_dense_total = 2 * K * P * bytes_per_param
ps_busiest     = 2 * K * P * bytes_per_param

print("=== DENSE model (every step touches all", f"{P:,}", "params) ===")
print(f"all-reduce busiest link/step  : {allreduce_busiest/1e9:8.2f} GB (flat in K)")
print(f"param-server busiest link/step: {ps_busiest/1e9:8.2f} GB (grows with K)")
print(f"all-reduce's busiest link is {ps_busiest/allreduce_busiest:.1f}x lighter -> all-reduce wins")
print()

# Huge embedding table: each step looks up a tiny, data-dependent set of rows.
V = 100_000_000         # rows in the table (id space)
emb_dim = 128
table_params  = V * emb_dim
ids_per_step  = 4096    # unique ids one worker's minibatch touches

# Parameter server moves ONLY the touched rows (pull + push), per worker.
ps_sparse_total = 2 * K * ids_per_step * emb_dim * bytes_per_param
# A dense all-reduce has to move the WHOLE table, because it cannot skip zeros.
allreduce_sparse_total = 2 * (K - 1) * table_params * bytes_per_param

print("=== SPARSE embedding (table has", f"{V:,}", "rows,", emb_dim, "dim) ===")
print(f"rows touched/step : {K*ids_per_step:,} of {V:,}",
      f"({(K*ids_per_step)/V*100:.4f}%)")
print(f"param-server bytes/step : {ps_sparse_total/1e6:10.3f} MB")
print(f"dense all-reduce bytes  : {allreduce_sparse_total/1e9:10.3f} GB")
print(f"param-server moves {allreduce_sparse_total/ps_sparse_total:,.0f}x LESS -> PS wins")
Code 11.9.1: A pure-Python communication accounting of one training step under each architecture, for a dense model and for a terabyte-class embedding table. No frameworks; the arithmetic is the argument.
=== DENSE model (every step touches all 50,000,000 params) ===
all-reduce busiest link/step  :     0.35 GB (flat in K)
param-server busiest link/step:     3.20 GB (grows with K)
all-reduce's busiest link is 9.1x lighter -> all-reduce wins

=== SPARSE embedding (table has 100,000,000 rows, 128 dim) ===
rows touched/step : 32,768 of 100,000,000 (0.0328%)
param-server bytes/step :     33.554 MB
dense all-reduce bytes  :    716.800 GB
param-server moves 21,362x LESS -> PS wins
Output 11.9.1: The verdict made concrete. On the dense model all-reduce's busiest link is 9.1x lighter than the parameter server's congested server link; on the sparse embedding table the parameter server moves over four orders of magnitude fewer bytes because it touches only the 0.0328% of rows the minibatch used. Each architecture wins exactly where its structure fits the gradient.
Fun Note: The Table That Refused to Be Reduced

716 gigabytes per step to all-reduce an embedding table whose minibatch touched 33 megabytes of it is the data-movement equivalent of mailing an entire library to return one overdue paperback. The parameter server's whole personality is "send me the paperback".

4. The Hybrid: How Real Recommendation Systems Are Built Advanced

A production recommendation model is two models stapled together. The bottom is a set of enormous embedding tables, one per categorical feature (user id, item id, context), holding the bulk of the parameters but touched sparsely. The top is a dense network (a multilayer perceptron or a small transformer) that consumes the looked-up embeddings and is touched densely on every step. Output 11.9.1 already told us the verdict for each half, so the industrial design writes itself: shard the embedding tables across a parameter server (or, equivalently, across the model-parallel embedding shards of Section 11.7), and wrap the dense top in data-parallel all-reduce. The forward pass pulls the relevant embedding rows from the shards, runs the dense net replicated on every worker, and the backward pass pushes sparse embedding gradients to the shards while all-reducing the dense gradients across workers. This split is the backbone of systems such as Meta's DLRM and the recommendation stacks behind every large feed, and the case study in Chapter 38 builds one end to end.

Practical Example: Splitting a Recommender Across Two Architectures

Who: A machine learning platform engineer at a video-streaming company maintaining the watch-next ranking model.

Situation: The model had a 1.8-terabyte embedding table over user and video ids plus a modest 60-million-parameter dense ranking head, and a single all-reduce job was thrashing the network.

Problem: Replicating the full model for data-parallel all-reduce was impossible: 1.8 TB does not fit on a worker, and all-reducing the embedding gradients moved the whole table every step.

Dilemma: Put everything on a parameter server, which would bottleneck the dense gradients through the server links, or all-reduce everything, which could not hold or move the embeddings at all.

Decision: They split by gradient shape, sharding the embeddings across a parameter server and wrapping only the dense head in all-reduce, exactly the routing in Figure 11.9.1.

How: Embedding lookups pulled the touched rows from 16 server shards; the dense head ran replicated under DistributedDataParallel; the backward pass pushed sparse embedding gradients and all-reduced the dense ones in the same step.

Result: Per-step network traffic fell by more than 95%, training throughput roughly tripled, and the dense head still converged with exact synchronous updates because all-reduce preserved them.

Lesson: Do not pick one architecture for a model that has two kinds of gradient. Route the dense part to all-reduce and the sparse part to a parameter server, and let each do what it is optimal at.

Library Shortcut: TorchRec Wires the Hybrid for You

Building the hybrid by hand means writing sharded embedding lookups, an all-to-all to route ids to the right shard, and a separate all-reduce path for the dense head, easily several hundred lines. TorchRec (PyTorch's recommendation library) expresses the whole split declaratively: you describe the embedding tables and a sharding plan, and it generates the model-parallel embedding path while leaving the dense modules under ordinary data-parallel all-reduce.

# pip install torchrec
from torchrec import EmbeddingBagCollection
from torchrec.distributed import DistributedModelParallel

# Sparse half: huge tables, model-parallel sharded (the "parameter server" role).
ebc = EmbeddingBagCollection(tables=embedding_table_configs, device="meta")
model = DLRM(embedding_bag_collection=ebc, dense_arch=mlp_config)

# One wrapper shards embeddings across ranks AND keeps the dense MLP data-parallel.
model = DistributedModelParallel(module=model, device=torch.device("cuda"))
# forward/backward now do sparse all-to-all for embeddings + all-reduce for dense.
Code 11.9.2: The hybrid of Section 4 in a few lines. DistributedModelParallel handles the sparse embedding sharding and the dense all-reduce together, so the engineer specifies the plan instead of coding the two communication paths by hand.

5. Mapping the Choice onto the Optimization Trade-offs Intermediate

The architecture choice is not only a communication-volume question; it inherits the optimization trade-offs of Chapter 10. All-reduce gives you exact synchronous SGD: every step uses the true averaged gradient, convergence matches the single-machine analysis, and the cost is that the step runs at the speed of the slowest worker, so stragglers tax you directly. The parameter server's natural mode is asynchronous, which removes the straggler tax but introduces gradient staleness, the very effect that bounded-staleness schedules (Section 11.5) and the stale-gradient analysis of Chapter 10 exist to control. So the all-reduce-versus-parameter-server decision is the synchronous-versus-asynchronous decision of Chapter 10 wearing different clothes: pick all-reduce when you want exactness and your workers are homogeneous, pick the parameter server when heterogeneity or sparsity makes synchronous updates too expensive and you can tolerate a bounded amount of staleness. The hybrid even splits the optimization regime: the dense head trains synchronously while the embeddings tolerate mild asynchrony, because a stale row that few minibatches touch does little harm.

Research Frontier: Where the Line Is Moving (2024 to 2026)

The boundary between the two architectures is being actively redrawn. On the recommendation side, systems in the lineage of NVIDIA's HugeCTR and Meta's TorchRec push sharded embeddings onto GPU memory with hierarchical and frequency-aware caching, so the hottest rows behave almost like dense parameters while the long tail stays on a parameter server. On the training side, communication-avoiding methods such as the DiLoCo and OpenDiLoCo line (Douillard et al., 2024) revive local-update, infrequent-synchronization training that looks like a coarse-grained parameter server over the wide-area network, blurring the once-sharp all-reduce-versus-PS distinction. And distributed reinforcement learning frameworks continue to refine the learner-and-actors split that is a parameter server by another name, with weight-broadcast schemes that borrow ring-collective ideas to serve policies to thousands of actors. The verdict of this chapter is stable; the exact crossover point is research in motion.

6. Chapter 11 in One View Beginner

Chapter 11 told a single arc. The parameter server is the architecture for state that is too large to replicate and gradients that are sparse or asynchronous. We motivated it (11.1), gave it a push-pull interface (11.2), sharded it for capacity (11.3), and chose its consistency model along the synchronous-to-asynchronous axis (11.4) with bounded staleness as the principled middle (11.5). Its defining workload is the distributed embedding table (11.6) at terabyte scale (11.7), made reliable by replication and checkpointing (11.8). This final section placed it against all-reduce and reached a clear-eyed verdict: all-reduce owns the dense, synchronous, homogeneous case, the parameter server owns the sparse, asynchronous, oversized-embedding case, and real recommendation systems run both at once. Table 11.9.1 is the chapter's decision distilled to one grid.

Table 11.9.1: The Chapter 11 verdict. Match the workload's gradient shape and worker rhythm to the architecture; recommendation models hit the last row and use both.
Workload signatureArchitectureWhy
Dense gradient, synchronous, homogeneous workersAll-reduceNo central link, bandwidth-optimal, exact updates
Terabyte embedding table, tiny per-step footprintParameter serverMoves only touched rows; state too big to replicate
Heterogeneous or elastic workers, staleness toleratedParameter server (async)No straggler stall under bounded staleness
One learner serving weights to many actors (RL)Parameter server patternAsymmetric weight serving, see Chapter 20
Dense network plus sparse embeddings (recsys)HybridAll-reduce the dense head, shard the embeddings
Thesis Thread: Sharding the State Returns, Scaled Out

The parameter-server idea of sharding parameters across machines that workers pull from and push to does not retire when all-reduce wins the dense case; it is reborn one level up. In Chapter 16, ZeRO and FSDP shard the optimizer state and parameters across the data-parallel workers themselves, then all-gather them just in time and reduce-scatter the gradients, fusing the parameter server's "state lives in one place" with all-reduce's "no central node". Whenever a later chapter shards model state, recall this chapter: you are watching the parameter server's core insight, scaled out into the collective.

Key Takeaway

All-reduce and the parameter server solve the same combine step for different gradients. All-reduce won dense deep learning because it is symmetric, bottleneck-free, exact, and bandwidth-optimal when every worker has a gradient for every parameter. The parameter server endures where its asymmetry pays: terabyte-scale sparse embeddings that touch a thousandth of a percent of the table per step, asynchronous and heterogeneous workers, and the learner-and-actors structure of distributed RL. The standard industrial recommendation system refuses to choose, running data-parallel all-reduce on the dense network and a sharded parameter server on the embeddings in one job. The choice is the synchronous-versus-asynchronous trade-off of Chapter 10 in architectural form, and the right answer is dictated by the shape of your gradient, not by which architecture is fashionable.

Exercise 11.9.1: Read the Crossover Off the Numbers Analysis

Using the accounting in Code 11.9.1, derive the fraction of embedding rows a step must touch before a dense all-reduce of the table moves fewer bytes than a parameter server moving only the touched rows. Express the crossover as a function of $K$, the table size $V$, and the rows-per-step, then plug in the demo's numbers and state in one sentence why real recommendation workloads sit so far on the parameter-server side of that line.

Exercise 11.9.2: Which Architecture, and Why Conceptual

For each system, name the architecture from Table 11.9.1 and justify it in two sentences using the gradient shape and worker rhythm: (a) pretraining a 7-billion-parameter dense transformer on a homogeneous GPU pod; (b) a click-through-rate model with a 4-terabyte user-and-ad embedding table and a small dense head; (c) an asynchronous distributed reinforcement-learning job with 2,000 actors and one learner; (d) an elastic training job on spot instances that join and leave every few minutes. For at least one, explain why the wrong choice would either fail to fit or stall on stragglers.

Exercise 11.9.3: Build the Sparse-Gradient Combine Coding

Extend Code 11.9.1 into a tiny working simulation. Represent an embedding table as a NumPy array of shape $(V, d)$ with small $V$, generate $K$ workers' minibatches as random id sets, and implement two combine functions: a parameter-server style update that gathers only the touched rows, averages their gradients, and writes them back, and a dense all-reduce that sums full-table gradient arrays. Verify both produce the identical updated table (the result must match), then print the bytes each moved and confirm the parameter server's count scales with the touched ids while all-reduce's scales with $V$.

Project Ideas

1. Build a sharded parameter server from scratch. Implement a multi-process parameter server with $S$ server shards and $K$ workers communicating over sockets or multiprocessing queues. Support push-pull, hash-partition the keys across shards, and add a bounded-staleness barrier (Section 11.5). Measure per-step latency as you vary $K$ and $S$, and reproduce the server-link bottleneck of Output 11.9.1 by watching one shard's traffic grow with $K$.

2. Parameter server versus all-reduce on a sparse problem. Train the same factorization or shallow recommendation model two ways: once with a synchronous all-reduce over full-table gradients, once with a parameter server that moves only touched rows. Plot wall-clock time and bytes moved per step against the embedding table size, and locate empirically the crossover you derived in Exercise 11.9.1. Report where each architecture wins and why.

3. Implement the recsys hybrid. Build a minimal DLRM-style model with a sharded embedding bottom and a data-parallel dense top. Route sparse embedding gradients to the shards and all-reduce the dense gradients, then compare against an all-reduce-everything baseline and a parameter-server-everything baseline on throughput and convergence, demonstrating the hybrid dominates both, as in the practical example above.