"I kept the freshest copy of the weights in the world. Everyone read from me, everyone wrote to me, and not one of them agreed on what version they had just seen."
A Parameter Server Under Mild Staleness
Part III takes the training axis seriously for the first time: it shows how the act of fitting a model, computing gradients, updating parameters, and converging to a solution, is itself redistributed across machines when the optimization, the parameter state, or the data graph no longer fits on one node. Parts I and II gave you the vocabulary of distribution and the data pipelines that feed a learner; this part turns that machinery on the learner itself. The five chapters move from the mathematics of distributed optimization, through the parameter-server and embedding systems that hold enormous sparse state, across the classical and graph-structured models that predate deep learning yet still run most production pipelines, and finally to federated learning, where the data refuses to move and the training must come to it. Read together, they establish the central pattern that Part IV then scales to foundation models: split the optimization across workers, exchange just enough information to stay coordinated, and converge despite staleness, partition, and failure.
Part Overview
The training axis is where distribution stops being a data-movement problem and becomes a mathematical one. When a single worker computes a gradient, the answer is exact and the bookkeeping is trivial. The moment that gradient is split across many workers, three questions appear at once: how do the partial updates combine into a correct step, how stale is each worker allowed to be before convergence suffers, and where does the shared parameter state physically live when it is too large for any one machine to hold. Part III answers these questions in the order they bite, building from the optimization theory that guarantees correctness up to the systems that hold billion-entry embedding tables and the protocols that train across data that never leaves its owner. This is the part that connects the abstract identity of Chapter 1, that the gradient of an average loss decomposes exactly across workers, to the concrete machinery that makes that decomposition pay off at scale.
The progression is deliberate. Chapter 10 establishes the foundation by treating distributed optimization as the core problem: synchronous and asynchronous stochastic gradient descent, the staleness that asynchrony buys and what it costs, gradient compression, and the convergence guarantees that tell you when a distributed run is still solving the same problem as a single-node one. Chapter 11 takes the parameter-server architecture from Chapter 1's coordinator spectrum and makes it concrete for the workload that needs it most: the enormous sparse embedding tables of recommendation and ranking, where the model state, not the compute, is what forces distribution. Chapter 12 then steps away from gradient descent entirely to distribute the classical models that still dominate tabular and structured prediction: linear models, gradient-boosted trees, clustering, and the communication patterns each one demands.
The final two chapters generalize the setting along two different axes. Chapter 13 distributes learning over graph-structured data, where the partition cuts edges that the computation needs to follow, making the partitioning strategy itself a first-class design decision rather than an afterthought. Chapter 14 closes the part at the decentralized end of Chapter 1's architecture spectrum: federated and gossip-based learning, where the data is held by many parties that will not pool it, and training proceeds by exchanging models rather than examples, under constraints of privacy, heterogeneity, and unreliable participation. Across all five chapters the parameter-server and all-reduce contrast introduced in Part I remains the organizing tension, and the sharding ideas developed here for sparse state return, transformed, as the ZeRO and FSDP techniques of Part IV.
If you hold one idea across these five chapters, hold this: distributing training is the discipline of converging correctly while every worker sees a slightly different, slightly stale view of the shared parameters. Synchronous methods pay communication to keep that view consistent; asynchronous and federated methods relax consistency to buy throughput and tolerate stragglers, then earn convergence back through the math. Where the parameter state is too large to replicate, it is sharded, and the same sharding logic that holds a billion-row embedding table in Chapter 11 reappears, scaled and renamed, as the optimizer-state partitioning of Part IV. The architecture spectrum from centralized parameter server to fully decentralized gossip, drawn first in Chapter 1, is the map every chapter in this part places itself on.
Part Roadmap
- 10 Distributed Optimization Synchronous and asynchronous distributed SGD, the staleness asynchrony buys and what it costs, gradient compression, and the convergence guarantees that say when a distributed run still solves the single-node problem.
- 11 Parameter Servers and Distributed Embeddings The parameter-server architecture made concrete for enormous sparse embedding tables, where the model state rather than the compute is what forces distribution across many shards.
- 12 Distributed Classical Machine Learning Distributing the models that still run most production pipelines: linear models, gradient-boosted trees, and clustering, with the communication pattern each method demands.
- 13 Distributed Graph Machine Learning Learning over graph-structured data where the partition cuts edges the computation must follow, making graph partitioning a first-class design decision rather than an afterthought.
- 14 Federated and Decentralized Learning The decentralized end of the architecture spectrum: training across data that never leaves its owner, exchanging models rather than examples under privacy, heterogeneity, and unreliable participation.
Read the five chapters in order and the training axis becomes a single coherent design space. Chapter 10 supplies the optimization theory every later chapter assumes; Chapter 11 and Chapter 12 show where the shared state lives for sparse and classical models; and Chapters 13 and 14 stretch the same ideas to graphs and to data that will not move. The parameter-server sharding you build in this part is the direct ancestor of the deep-learning parallelism that follows, which is why Part IV reads as a continuation rather than a fresh start.
What's Next?
Part III distributed the training of classical and structured models, where the parameters number in the millions and the bottleneck is usually the sparse state or the data graph. Part IV: Parallel Deep Learning and Large Models pushes the same axis to its limit, models so large that no single accelerator can hold even one copy of the weights, the gradients, or the optimizer state. There the data parallelism of this part returns as one option among several, joined by model, pipeline, sharded, and expert parallelism, and the embedding-table sharding of Chapter 11 reappears as the ZeRO and FSDP partitioning that makes trillion-parameter training feasible. The optimization, staleness, and sharding intuitions you build here are exactly the tools Part IV scales up, so carry them forward.