Part III: Distributed Machine Learning

Part III: Distributed Machine Learning

Distributing the training axis for classical and structured machine learning: optimization, parameter servers, classical models, graphs, and federation.

"I kept the freshest copy of the weights in the world. Everyone read from me, everyone wrote to me, and not one of them agreed on what version they had just seen."

A Parameter Server Under Mild Staleness
Big Picture

Part III takes the training axis seriously for the first time: it shows how the act of fitting a model, computing gradients, updating parameters, and converging to a solution, is itself redistributed across machines when the optimization, the parameter state, or the data graph no longer fits on one node. Parts I and II gave you the vocabulary of distribution and the data pipelines that feed a learner; this part turns that machinery on the learner itself. The five chapters move from the mathematics of distributed optimization, through the parameter-server and embedding systems that hold enormous sparse state, across the classical and graph-structured models that predate deep learning yet still run most production pipelines, and finally to federated learning, where the data refuses to move and the training must come to it. Read together, they establish the central pattern that Part IV then scales to foundation models: split the optimization across workers, exchange just enough information to stay coordinated, and converge despite staleness, partition, and failure.

Part Overview

The training axis is where distribution stops being a data-movement problem and becomes a mathematical one. When a single worker computes a gradient, the answer is exact and the bookkeeping is trivial. The moment that gradient is split across many workers, three questions appear at once: how do the partial updates combine into a correct step, how stale is each worker allowed to be before convergence suffers, and where does the shared parameter state physically live when it is too large for any one machine to hold. Part III answers these questions in the order they bite, building from the optimization theory that guarantees correctness up to the systems that hold billion-entry embedding tables and the protocols that train across data that never leaves its owner. This is the part that connects the abstract identity of Chapter 1, that the gradient of an average loss decomposes exactly across workers, to the concrete machinery that makes that decomposition pay off at scale.

The progression is deliberate. Chapter 10 establishes the foundation by treating distributed optimization as the core problem: synchronous and asynchronous stochastic gradient descent, the staleness that asynchrony buys and what it costs, gradient compression, and the convergence guarantees that tell you when a distributed run is still solving the same problem as a single-node one. Chapter 11 takes the parameter-server architecture from Chapter 1's coordinator spectrum and makes it concrete for the workload that needs it most: the enormous sparse embedding tables of recommendation and ranking, where the model state, not the compute, is what forces distribution. Chapter 12 then steps away from gradient descent entirely to distribute the classical models that still dominate tabular and structured prediction: linear models, gradient-boosted trees, clustering, and the communication patterns each one demands.

The final two chapters generalize the setting along two different axes. Chapter 13 distributes learning over graph-structured data, where the partition cuts edges that the computation needs to follow, making the partitioning strategy itself a first-class design decision rather than an afterthought. Chapter 14 closes the part at the decentralized end of Chapter 1's architecture spectrum: federated and gossip-based learning, where the data is held by many parties that will not pool it, and training proceeds by exchanging models rather than examples, under constraints of privacy, heterogeneity, and unreliable participation. Across all five chapters the parameter-server and all-reduce contrast introduced in Part I remains the organizing tension, and the sharding ideas developed here for sparse state return, transformed, as the ZeRO and FSDP techniques of Part IV.

The Thread to Carry Through Part III

If you hold one idea across these five chapters, hold this: distributing training is the discipline of converging correctly while every worker sees a slightly different, slightly stale view of the shared parameters. Synchronous methods pay communication to keep that view consistent; asynchronous and federated methods relax consistency to buy throughput and tolerate stragglers, then earn convergence back through the math. Where the parameter state is too large to replicate, it is sharded, and the same sharding logic that holds a billion-row embedding table in Chapter 11 reappears, scaled and renamed, as the optimizer-state partitioning of Part IV. The architecture spectrum from centralized parameter server to fully decentralized gossip, drawn first in Chapter 1, is the map every chapter in this part places itself on.

Part Roadmap

Read the five chapters in order and the training axis becomes a single coherent design space. Chapter 10 supplies the optimization theory every later chapter assumes; Chapter 11 and Chapter 12 show where the shared state lives for sparse and classical models; and Chapters 13 and 14 stretch the same ideas to graphs and to data that will not move. The parameter-server sharding you build in this part is the direct ancestor of the deep-learning parallelism that follows, which is why Part IV reads as a continuation rather than a fresh start.

What's Next?

Part III distributed the training of classical and structured models, where the parameters number in the millions and the bottleneck is usually the sparse state or the data graph. Part IV: Parallel Deep Learning and Large Models pushes the same axis to its limit, models so large that no single accelerator can hold even one copy of the weights, the gradients, or the optimizer state. There the data parallelism of this part returns as one option among several, joined by model, pipeline, sharded, and expert parallelism, and the embedding-table sharding of Chapter 11 reappears as the ZeRO and FSDP partitioning that makes trillion-parameter training feasible. The optimization, staleness, and sharding intuitions you build here are exactly the tools Part IV scales up, so carry them forward.