"I sent my gradient three milliseconds ago. The coordinator has not acknowledged it, two of my peers have crashed, and somebody just told me the parameters I read are already stale. This is, apparently, the normal case."
A Worker That Lost Its Coordinator
Once an AI workload spans more than one machine, correctness stops being a property of your code alone and becomes a property of how machines communicate, agree, fail, and recover. Chapter 1 named the six axes along which AI distributes; this chapter supplies the systems machinery each axis depends on. You will meet the small set of ideas that every later part reuses: the roles a process can play, the ways processes exchange and synchronize state, how data is split and copied for capacity and safety, what happens when a machine dies mid-computation, how fresh the state each worker reads really is, when a cluster must reach agreement and when it can avoid it, why the slowest worker sets the pace, why moving computation to data usually beats moving data to computation, and the recurring patterns that assemble these primitives into working systems. Each concept arrives through the AI operation that needs it, so by the end you can look at distributed SGD, a sharded model, or an elastic training job and name exactly which systems property it is leaning on.
Chapter Overview
Distributed systems is an old and deep field, and a graduate text could spend a year on consensus alone. This chapter does something narrower and more useful for an AI audience: it isolates the handful of concepts that distributed AI actually exercises and presents each one through a training or serving operation you will meet again. The framing throughout is the tension Chapter 1 introduced. Every machine you add buys capacity, and every machine you add must communicate with the others and can fail independently of them. The concepts here are the named tools for buying that capacity while keeping the two taxes, communication and failure, under control.
The chapter moves from the actors to the hazards to the patterns. Section 2.1 defines the cast: processes, nodes, workers, and coordinators, the roles every distributed AI job assigns. Section 2.2 shows how those actors exchange state through messages, collectives, and synchronization barriers, the substrate distributed SGD runs on. Section 2.3 splits and copies state through partitioning, sharding, and replication, the moves behind data parallelism, sharded optimizers, and replicated serving. Section 2.4 confronts the defining fact of large clusters, that machines fail during the job, and builds checkpointing and recovery as the answer that elastic training later automates.
The second half makes correctness and speed precise. Section 2.5 defines consistency models and walks the line from tolerable parameter staleness to the CAP trade-off that serving systems negotiate. Section 2.6 separates the control plane, where schedulers and registries do need consensus, from the training data plane, which deliberately uses collectives to avoid it. Section 2.7 explains why the slowest worker, the straggler, sets the throughput of a synchronous job, a thread that returns in distributed optimization and at serving scale. Section 2.8 argues that moving computation to data is often cheaper than the reverse, the locality principle every data and serving chapter inherits. Section 2.9 assembles all of it into the named patterns, master-worker, parameter server, all-reduce, sharded-replicated, and pipeline, that the rest of the book instantiates.
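The straggler claim previewed for Section 2.7 can be seen in a few lines of Python. The sketch below is a toy model, not a measurement: it assumes every worker normally takes one time unit per step and, independently and rarely, takes five. The function names, the straggle probability, and the slowdown factor are illustrative assumptions, not values from any real cluster.

```python
import random


def synchronous_step_time(worker_times):
    """A synchronous step finishes only when the slowest worker does."""
    return max(worker_times)


def simulate(num_workers, num_steps, straggler_prob=0.01, slowdown=5.0, seed=0):
    """Return the mean step time of a synchronous job under a toy model.

    Assumed model (hypothetical): each worker takes 1.0 time unit per step,
    except that with probability `straggler_prob` it takes `slowdown` units.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    total = 0.0
    for _ in range(num_steps):
        times = [
            slowdown if rng.random() < straggler_prob else 1.0
            for _ in range(num_workers)
        ]
        total += synchronous_step_time(times)
    return total / num_steps


if __name__ == "__main__":
    # Mean step time grows with worker count although no worker got slower.
    for n in (1, 16, 256):
        print(n, round(simulate(n, num_steps=2000), 2))
```

Under these assumptions the chance that at least one of N workers straggles in a given step is 1 - (1 - p)^N, so the mean step time climbs toward the slowdown value as workers are added. That is the sense in which the slowest worker, not the average one, sets the pace.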
A word on scope. This chapter borrows freely from classical distributed systems, but it scopes every borrowed idea to what AI actually uses. Synchronous training leans hard on collectives and barely touches general consensus; the control plane is the opposite. Where AI uses less of a primitive than a database or an HPC code would, the text says so plainly. The goal is not to teach distributed systems in full; it is to give you exactly the vocabulary the parallel-training and serving chapters assume on every page.
Prerequisites
This chapter assumes you have read Chapter 1: What Is Scale-Out AI? and carry its vocabulary with you: the six axes of distribution, the scale-out versus scale-up distinction, the coordinator-to-no-coordinator architecture spectrum, and the four operational metrics of throughput, latency, cost, and reliability. The systems concepts here are the machinery those axes run on, so the axes are the map and this chapter is the toolbox. No prior distributed-systems coursework is needed; every primitive is built from first principles as the AI operation that uses it introduces it. The Python, machine-learning, probability, and linear-algebra background the book takes for granted, refreshed in Appendix A: Mathematical Background, remains sufficient.
Learning Objectives
- Assign the roles of process, node, worker, and coordinator to the components of a distributed training or serving job, and explain what each one owns.
- Describe how distributed AI actors exchange state through message passing, collective operations, and synchronization barriers, and where each appears in distributed SGD.
- Distinguish partitioning, sharding, and replication, and map each onto a concrete AI use: data parallelism, sharded optimizers, and replicated inference.
- Reason about failure as the normal case at cluster scale and design checkpoint-and-recovery so a training job survives the loss of a machine.
- Place a workload on the consistency spectrum from staleness-tolerant parameter reads to the CAP trade-off, and decide when a system needs consensus versus when collectives suffice.
- Identify the straggler and locality effects that bound real throughput, and recognize the master-worker, parameter-server, all-reduce, sharded-replicated, and pipeline patterns when they appear later in the book.
If you keep one thing from this chapter, keep this: distributing an AI workload means choosing how machines split state, exchange it, and survive each other's failures, and every such choice is a trade between capacity gained and the communication and failure it costs. Read forward, the sections are the menu of those choices: roles, communication, partitioning, recovery, consistency, consensus, stragglers, and locality, assembled into a short list of named patterns. Read as a question, the chapter is a diagnostic you apply to any distributed AI system: which state is split and how, how do the pieces talk, what happens when one dies, how fresh is what each one reads, and where is the slowest link? The patterns of Section 2.9 are the standard answers the rest of the book reaches for. The roadmap below walks the nine sections that build them.
Chapter Roadmap
- 2.1 Processes, Nodes, Workers, and Coordinators The cast of a distributed AI job: what a process, a node, a worker, and a coordinator each are, and how a training run or serving fleet assigns those roles.
- 2.2 Communication, Synchronization, and Coordination How distributed actors exchange state through messages, collective operations, and barriers, the synchronization substrate that distributed SGD runs on top of.
- 2.3 Partitioning, Sharding, and Replication The three ways to split and copy state, mapped onto data parallelism, sharded optimizers, and replicated inference, with the capacity-versus-redundancy trade each strikes.
- 2.4 Fault Tolerance and Recovery Failure as the normal case at cluster scale, and the checkpoint-and-recovery machinery that lets a long training job survive losing a machine, the seed of elastic training.
- 2.5 Consistency Models: From Parameter Staleness to the CAP Trade-off How fresh the state each worker reads really is, from tolerable parameter staleness in asynchronous SGD to the availability-versus-consistency choice serving systems negotiate.
- 2.6 Coordination and Consensus in the Control Plane Where AI genuinely needs agreement: schedulers, model registries, and leader election, and why the training data plane uses collectives to sidestep consensus entirely.
- 2.7 Stragglers and Bottlenecks Why the slowest worker sets the pace of a synchronous job, how to find the binding bottleneck, and the mitigations that return in distributed optimization and at serving scale.
- 2.8 Data Locality and Compute Locality Why moving computation to where the data lives usually beats moving data to the computation, the locality principle every data-processing and serving chapter inherits.
- 2.9 Distributed System Patterns for AI The chapter's capstone: the master-worker, parameter-server, all-reduce, sharded-replicated, and pipeline patterns that assemble these primitives into the systems the book builds.
Read the nine sections in order and you will hold the systems vocabulary the rest of the book assumes: Section 2.1 names the actors, Sections 2.2 through 2.8 name the hazards and the remedies, and Section 2.9 packages both into patterns. The thread to watch begins with the collectives of Section 2.2: the all-reduce you see here as a synchronization primitive returns as the engine of data-parallel training in Part IV, made precise by the communication primitives of Chapter 4.
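To make that thread concrete before Section 2.2 develops it, here is a minimal single-process sketch of what an averaging all-reduce leaves behind. It models only the result of the collective (every worker ends up holding the same averaged gradient), not the ring or tree algorithm a real library would use to get there. The names all_reduce_mean and sgd_step are hypothetical, chosen for this illustration.

```python
def all_reduce_mean(per_worker_grads):
    """Return what every worker holds after an averaging all-reduce.

    `per_worker_grads` has one gradient vector (a list of floats) per worker.
    The result has the same shape: each worker receives the elementwise mean.
    """
    num_workers = len(per_worker_grads)
    dim = len(per_worker_grads[0])
    if any(len(g) != dim for g in per_worker_grads):
        raise ValueError("all workers must contribute vectors of equal length")
    mean = [sum(g[i] for g in per_worker_grads) / num_workers for i in range(dim)]
    # Each worker gets its own copy, as it would after a real collective.
    return [list(mean) for _ in range(num_workers)]


def sgd_step(params, grad, lr):
    """Plain SGD update applied locally by one worker."""
    return [p - lr * g for p, g in zip(params, grad)]


if __name__ == "__main__":
    # Three workers computed different gradients on different data shards.
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    reduced = all_reduce_mean(grads)
    # Every replica starts from the same parameters and applies the same
    # averaged gradient, so the replicas stay identical without a coordinator
    # holding the parameters.
    replicas = [sgd_step([0.0, 0.0], g, lr=0.5) for g in reduced]
    print(replicas)
```

The point of the sketch is the invariant, not the arithmetic: because every worker applies the same reduced gradient to the same starting parameters, the model replicas remain in lockstep, which is why the collective doubles as a synchronization point.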
What's Next?
This chapter gave you the qualitative machinery of distributed systems: roles, communication, partitioning, recovery, consistency, consensus, stragglers, locality, and the patterns that combine them. Chapter 3: Scalability and Performance Models makes that machinery quantitative. The straggler you met in Section 2.7 becomes a term in a throughput equation; the communication of Section 2.2 becomes a cost you can predict before you run; and the question Chapter 1 first posed, when does adding a machine actually help, becomes a calculation through Amdahl's law, Gustafson's law, and the speedup and efficiency models of a distributed job. Read it next, and the trade-offs this chapter described in words will become numbers you can put a budget against.
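As a preview of that calculation, the two laws named above can be written in their standard textbook forms. This is a sketch, not Chapter 3's notation: the parameter name parallel_fraction (the share of the work that parallelizes) and the helper efficiency are assumptions made here for illustration.

```python
def _check(parallel_fraction, n):
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")


def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: fixed problem size, the serial part does not shrink."""
    _check(parallel_fraction, n)
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)


def gustafson_speedup(parallel_fraction, n):
    """Gustafson's law: the parallel part of the problem grows with n."""
    _check(parallel_fraction, n)
    return (1.0 - parallel_fraction) + parallel_fraction * n


def efficiency(speedup, n):
    """Speedup obtained per machine; 1.0 would be perfect scaling."""
    return speedup / n


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, chosen only for illustration
    for n in (1, 8, 64, 512):
        s = amdahl_speedup(p, n)
        print(n, round(s, 2), round(efficiency(s, n), 3),
              round(gustafson_speedup(p, n), 2))
```

With a parallel fraction of 0.95, Amdahl's speedup can never exceed 1 / (1 - 0.95) = 20 however many machines are added, while Gustafson's grows without bound because it assumes the workload scales with the cluster. Which assumption fits a given training job is exactly the kind of question the next chapter equips you to answer.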
Bibliography & Further Reading
Foundational Papers
Dean, J., Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004. research.google
The master-worker pattern of Section 2.9 in its most influential form; its handling of stragglers and worker failure is the template for fault-tolerant batch computation.
Zaharia, M., Chowdhury, M., Das, T., et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012. usenix.org
The RDD lineage idea recovers lost partitions by recomputation rather than replication; a concrete answer to the fault-tolerance question of Section 2.4.
Dean, J., Barroso, L. A. "The Tail at Scale." Communications of the ACM 56(2), 2013. cacm.acm.org
The definitive account of why the slowest component dominates latency at scale and how to mitigate it, the systems grounding for the straggler treatment of Section 2.7.
Li, M., Andersen, D. G., Park, J. W., et al. "Scaling Distributed Machine Learning with the Parameter Server." OSDI 2014. usenix.org
The canonical parameter-server design and its bounded-staleness consistency model, the live example behind both Section 2.5 and the pattern catalog of Section 2.9.
Consensus & Consistency
Lamport, L. "Paxos Made Simple." ACM SIGACT News 32(4), 2001. lamport.azurewebsites.net
The foundational algorithm for agreement under failures; the theory behind the control-plane consensus that Section 2.6 isolates from the training data plane.
Ongaro, D., Ousterhout, J. "In Search of an Understandable Consensus Algorithm." USENIX ATC 2014. raft.github.io
The Raft paper: a consensus algorithm designed for understandability that powers etcd and many modern control planes; the leader election of Section 2.6 in practical form.
Brewer, E. "CAP Twelve Years Later: How the Rules Have Changed." IEEE Computer 45(2), 2012. infoq.com
The originator of the CAP conjecture revisits the consistency-versus-availability trade; the precise framing Section 2.5 uses for serving systems under partition.
Vogels, W. "Eventually Consistent." Communications of the ACM 52(1), 2009. cacm.acm.org
The practitioner's case for weak consistency in large systems; the intuition for why staleness-tolerant parameter reads in Section 2.5 are a feature, not a bug.
Books & Surveys
Kleppmann, M. "Designing Data-Intensive Applications." O'Reilly, 2017. dataintensive.net
The standard practitioner text on partitioning, replication, and consistency; the systems foundation this chapter specializes toward AI training and serving.
Dean, J. "Designs, Lessons and Advice from Building Large Distributed Systems." LADIS 2009 keynote. research.google
The back-of-the-envelope latency and failure-rate numbers that make "failure is the normal case" a quantitative claim, the grounding for Section 2.4.
Tools & Libraries
etcd Documentation: a distributed, reliable key-value store. etcd.io/docs
The Raft-backed key-value store that holds the control-plane state of Kubernetes and many ML schedulers; the consensus of Section 2.6 as a service you call.
Apache ZooKeeper Documentation. zookeeper.apache.org
The coordination service for leader election, configuration, and distributed locks; the classic realization of the control-plane coordination Section 2.6 describes.
PyTorch Distributed: torch.distributed collectives and DistributedDataParallel. pytorch.org/docs
The collective and process-group APIs that implement the all-reduce and barrier primitives of Section 2.2 without a central coordinator on the data path; the toolkit Part IV builds on.