"They asked who was in charge. I said everyone. They asked again, slower, and I gave the same answer, because that is the whole point of a gossip protocol."
A Peer With No Coordinator
Once work is split across machines, someone has to hold the authority that keeps them acting as one, and where that authority lives, in a single coordinator, spread evenly among peers, or arranged in a hierarchy, is the architectural decision that shapes every distributed AI system you will build. The previous section drew the line between scaling up one machine and scaling out across many. This section asks the next question: among the many machines, who decides? A centralized design names one coordinator that holds the authoritative state; a decentralized design gives no machine that role and lets peers agree through exchange; a hybrid design nests the two, with local coordinators reporting to a global one. These are not three rival camps but three points on a single spectrum, and the same distributed-training task can be expressed at each point, with sharply different costs in failure tolerance, bandwidth, and convergence. We will see all three on one running example: training the same model with a parameter server, with all-reduce, and with gossip averaging.
Every system in this book partitions some essential activity across machines, but partitioning the work is only half the design. The other half is coordination: keeping the partitioned pieces consistent enough to produce a single coherent result. Coordination needs a source of truth, a place (or a procedure) that resolves disagreement and says what the current state is. The architectural choice is where that source of truth lives. At one extreme it lives in a single node that every other node defers to. At the other extreme there is no such node, and truth is whatever the peers collectively converge to. Between them sits a family of hierarchies that localize coordination to keep it cheap while preserving one global view. Naming these three patterns, and knowing the price of each, is the goal of this section.
1. The Centralized Pattern: One Coordinator Holds the Truth Beginner
In a centralized architecture, one machine is designated the coordinator and holds the authoritative copy of the shared state. Every other machine is a worker that reads the current state from the coordinator, does local work, and writes its contribution back. The coordinator serializes the contributions, applies them in a defined order, and answers the next read with the updated truth. The canonical instance in distributed machine learning is the parameter server: a node (or sharded set of nodes) holds the model weights, workers pull the current weights, compute gradients on their data shard, and push those gradients back, where the server applies them to produce the next version of the weights. The full push-pull mechanics, weight sharding, and the staleness that creeps in when pushes and pulls interleave are developed in Chapter 11.
The appeal of centralization is conceptual simplicity. There is exactly one place to look for the current state, one place to enforce an update order, and one place to add a new worker. Consistency is easy to reason about because the coordinator imposes a total order on updates by construction. Many systems beyond training share this shape: a scheduler that owns the cluster's placement decisions, a metadata server that owns the namespace of a distributed file system, a coordination service that owns the membership list. In each, a single logical authority trades scalability for a clean mental model.
A central coordinator gives you a clean, totally ordered, easy-to-reason-about source of truth, and charges two unavoidable taxes for it. First, it is a single point of failure: if the coordinator dies, the system stops until it recovers, so a serious deployment must replicate it and run a consensus protocol to keep the replicas agreed, the subject of Chapter 2. Second, it is a bandwidth and request bottleneck: every worker's traffic funnels through it, so its network link and its update rate cap the whole system no matter how many workers you add. The decentralized and hybrid patterns exist to dodge one or both of these taxes.
2. The Decentralized Pattern: Peers Agree Without a Center Intermediate
A decentralized architecture removes the privileged node entirely. No machine holds the authoritative state; instead each peer holds its own copy, exchanges with a subset of other peers, and the group's agreed state is whatever those exchanges converge to. Two decentralized patterns dominate distributed training, and they sit at different points on the same axis. All-reduce over a ring or tree is decentralized in topology, every worker plays an identical role and there is no server, yet it still computes the exact global average each step by passing partial sums symmetrically around the ring; it is the workhorse of data-parallel training and is built up in detail in Chapter 4. Gossip averaging relaxes even the requirement of an exact global step: each peer mixes its value only with its immediate neighbors, and global agreement emerges gradually as information diffuses across the graph. Gossip and its federated relatives are the subject of Chapter 14.
The reward for giving up the center is that no single node can fail the whole system and no single link carries everyone's traffic; both taxes from Section 1 are dodged. The price is paid in convergence and visibility. Because no peer holds the global state, no peer has a global view: a peer knows only what has reached it through the graph, so a freshly changed value at one corner takes several rounds of exchange to influence the far corner. All-reduce keeps the per-step result exact but still pays a synchronization cost that grows with the ring. Gossip is cheaper per round but only approximate, converging toward the global average rather than hitting it in one step, and its convergence rate is governed by the connectivity of the communication graph, which is exactly why the control planes of Chapter 2 reach for consensus protocols when peers must agree on a single discrete value rather than a continuous average.
To make the contrast concrete, consider the simplest possible shared-state task: $K$ peers each hold a number, and they want to agree on the mean. A centralized solution sends all $K$ numbers to a coordinator, which averages them and broadcasts the result, exact in one round but funneled through one node. A gossip solution never gathers the numbers anywhere; each peer repeatedly replaces its value with a blend of its own and its ring neighbors' values, using a doubly stochastic mixing matrix $W$ so that the global sum is preserved while the values pull together. The update for the whole vector of peer values $g$ is $g \leftarrow W g$, and the deviation from the true mean shrinks geometrically at a rate set by the second-largest eigenvalue of $W$. The code below runs both and reports how the gossip peers close the gap to the global mean round by round.
import numpy as np
rng = np.random.default_rng(0)
K = 8 # peers, arranged in a ring
x = rng.uniform(-10, 10, size=K) # each peer's local value
global_mean = x.mean() # the answer all peers should reach
# Centralized average: one coordinator gathers all values and broadcasts the mean.
central = np.full(K, x.mean())
# Decentralized gossip averaging on a ring: each round every peer replaces its
# value with a weighted blend of itself and its two ring neighbors. No peer ever
# sees the whole vector; information diffuses one hop per round.
g = x.copy()
W = 0.5 * np.eye(K)
for i in range(K):
W[i, (i - 1) % K] += 0.25
W[i, (i + 1) % K] += 0.25 # doubly-stochastic mixing matrix
print(f"global mean (target) : {global_mean:.6f}")
print(f"central coordinator result : {central[0]:.6f}")
print(f"{'round':>5} {'max peer deviation from mean':>30}")
for r in range(1, 41):
g = W @ g
dev = np.max(np.abs(g - global_mean))
if r in (1, 2, 5, 10, 20, 40):
print(f"{r:>5} {dev:>30.2e}")
print(f"gossip result (peer 0) : {g[0]:.6f}")
global mean (target) : 0.066019
central coordinator result : 0.066019
round max peer deviation from mean
1 8.22e+00
2 6.51e+00
5 3.58e+00
10 1.57e+00
20 3.21e-01
40 1.35e-02
gossip result (peer 0) : 0.064894
The output tells the whole story of the trade-off in two columns. The centralized result is exact at once, because one node saw everything. The gossip result is only approximate after any finite number of rounds, and the deviation falls by a roughly constant factor each round, a geometric decay whose rate is the spectral gap of the ring. Add more peers or a sparser graph and that rate slows; wire the graph more densely and it speeds up, at the cost of more messages per round. This is the decentralized bargain in miniature: you trade the exactness and the single bottleneck of the center for many cheap local exchanges and a convergence you must wait for.
Gossip averaging looks like the children's game of telephone, where a message decays as it passes ear to ear, and the comparison even suggests it should drift toward nonsense. It does the opposite. Because the mixing matrix is doubly stochastic, the sum of all peer values is conserved exactly at every round, so the average the peers are chasing never moves; only their disagreement about it shrinks. The game cannot lose the answer, it can only sharpen the consensus around it. The single thing the peers must get right is that everyone uses weights that preserve the total, which is why decentralized learning papers obsess over making $W$ doubly stochastic.
Code 1.4.1 implemented the mixing by hand with a matrix multiply on values colocated in one process. On a real cluster the peers live on different machines and you never assemble the full vector anywhere. PyTorch's torch.distributed exposes both the centralized and the decentralized exchange through one abstraction, the process group: all_reduce performs the exact symmetric average with no server, while a point-to-point send / recv between ring neighbors expresses one gossip round. The library handles process-group setup, the ring or tree schedule, and the network transport that Chapter 4 unpacks:
# Run with: torchrun --nproc_per_node=8 thisfile.py
import torch, torch.distributed as dist
dist.init_process_group("nccl") # join the group of K peers, no server
val = torch.tensor([local_value], device="cuda")
# Decentralized AND exact: every peer ends with the global mean, no coordinator.
dist.all_reduce(val, op=dist.ReduceOp.SUM)
val /= dist.get_world_size()
# One gossip round instead: exchange only with ring neighbors (approximate, cheaper).
rank, world = dist.get_rank(), dist.get_world_size()
recv = torch.empty_like(val)
dist.send(val, dst=(rank + 1) % world)
dist.recv(recv, src=(rank - 1) % world)
val = 0.5 * val + 0.5 * recv # mix with the neighbor that sent to us
all_reduce replaces a hand-built doubly stochastic average and stays exact with no server; the neighbor send / recv pair expresses one decentralized gossip round, the building block of the methods in Chapter 14.3. The Hybrid Pattern: Hierarchies of Coordinators Intermediate
Pure centralization bottlenecks at one node; pure decentralization waits for slow global convergence. The hybrid pattern resolves the tension by nesting them. Machines are grouped, each group has a local coordinator that handles fast, frequent coordination within the group, and the local coordinators reconcile with a single global coordinator far less often. This is the right-hand panel of Figure 1.4.1, and it matches the physical reality of modern clusters: communication inside a server (across its GPUs over a fast interconnect) is cheap, communication across servers is dear, and communication across data centers is dearer still. A hierarchy places frequent exchange where bandwidth is abundant and rare exchange where it is scarce.
Hierarchical all-reduce is the standard concrete case. Workers on the same machine reduce their gradients locally over the fast intra-node link, one representative per machine then all-reduces across machines over the slower network, and the machine-level result is broadcast back down to the local workers. The expensive cross-machine collective runs once per step instead of once per worker, so a job with eight GPUs on each of sixteen machines pays for sixteen participants in the slow collective rather than one hundred twenty-eight. The same nesting appears in federated learning, where edge devices coordinate through a regional aggregator that in turn syncs with a global server, and in geo-distributed training, where each data center reduces internally before a thin cross-region exchange. Hierarchy is less an exotic third option than the practical default whenever the network is itself tiered.
Who: An ML platform engineer standing up recommendation-model training on a new thirty-two-node cluster, eight GPUs per node.
Situation: The first version used a single parameter server holding the embedding tables and dense weights; every one of the 256 GPUs pulled and pushed against that one node each step.
Problem: The parameter server's network link saturated at around forty GPUs of traffic, so adding the other 216 GPUs bought almost no speedup, and a server restart stalled the entire job.
Dilemma: Go fully decentralized with ring all-reduce, removing the bottleneck and the single point of failure but forcing every one of 256 GPUs into one cross-node collective per step, or keep a coordinator but stop it from being a chokepoint.
Decision: They chose a hybrid shape. Dense weights moved to hierarchical all-reduce (reduce within each node over the fast link, then a thirty-two-way cross-node reduce); the large sharded embedding tables stayed on a set of parameter servers, because sparse lookups suit push-pull better than dense all-reduce.
How: Intra-node reduction ran over the on-board interconnect, the cross-node step involved one rank per node rather than all eight, and the embedding shards were spread across several server nodes so no single link carried the whole table's traffic.
Result: The slow cross-node collective went from 256 participants to 32, throughput scaled close to linearly to all 256 GPUs, and losing any single node degraded rather than halted the run.
Lesson: Real systems rarely pick one point on the coordinator spectrum for everything. Match the coordination shape to the traffic: hierarchy for dense synchronous gradients, sharded centralization for sparse embedding lookups, all within one job.
The centralized-to-decentralized axis introduced here is one of the book's load-bearing spines. The parameter server returns as the staleness-managed push-pull system of Chapter 11; all-reduce becomes the default gradient-synchronization engine of data-parallel deep learning in Chapter 15; gossip and its decentralized descendants drive federated and over-the-internet learning in Chapter 14; and the consensus that keeps a replicated central coordinator alive is the control-plane subject of Chapter 2. Whenever a later chapter introduces a distributed method, ask first where its source of truth lives, because that single question predicts its failure modes and its bandwidth bill.
4. Reading the Trade-offs Off the Spectrum Advanced
The three patterns are best held in mind as a single table of trade-offs rather than as a taxonomy to memorize. Table 1.4.1 lines them up against the four properties that decide most architectural arguments: where the source of truth lives, how the system tolerates a failure, where the communication bottleneck falls, and what kind of convergence the agreement step gives you. The running examples from this section, parameter server, all-reduce, and gossip, occupy the centralized, decentralized-exact, and decentralized-approximate rows; the hybrid pattern is the engineered compromise that picks a different row for different parts of the same job, exactly as the practical example did.
| Pattern | Source of truth | Failure behavior | Bottleneck | Agreement step |
|---|---|---|---|---|
| Centralized (parameter server) | One coordinator node | Single point of failure; needs replication and consensus | Coordinator's link and update rate | Exact, totally ordered |
| Decentralized, exact (all-reduce) | None; symmetric peers | No single point of failure; a lost peer stalls the collective | Per-step sync cost grows with ring size | Exact global average per step |
| Decentralized, approximate (gossip) | None; peers converge | Most robust; group tolerates churn | Many cheap neighbor messages | Approximate; geometric convergence |
| Hybrid (hierarchy) | Global coordinator over local ones | Local failure is contained; global node still critical | Rare cross-group, frequent intra-group | Exact within group, reconciled globally |
Two readings of the table are worth stating outright. First, no row dominates; the right choice depends on which resource is scarce and which failure you most fear, the same disciplined matching of remedy to ceiling urged in Section 1.1. If the model fits and the network inside a node is fast, hierarchical all-reduce is hard to beat; if peers join and leave constantly and an approximate average suffices, gossip is the robust answer; if the shared state is a giant sparse table of embeddings touched unevenly, a sharded parameter server earns its keep. Second, the patterns compose. A production training job often runs all-reduce for dense weights, a parameter server for sparse embeddings, and a consensus-backed coordinator for cluster membership, three points on the spectrum inside one system, which is why fluency with the whole spectrum, not allegiance to one pattern, is the skill this section builds.
The pressure that revived decentralized architectures is bandwidth scarcity at the largest scale: synchronizing every step across thousands of accelerators, let alone across data centers or the public internet, makes the central or fully synchronous collective the binding cost. Local-update methods in the DiLoCo lineage (Douillard et al., 2024) let each group of workers take many local optimizer steps between rare global syncs, slashing communication by orders of magnitude with little accuracy loss, and have been pushed toward genuinely geo-distributed and internet-scale runs by open efforts such as Prime Intellect's INTELLECT-1 (2024) and the OpenDiLoCo work. These are hybrids in the precise sense of this section: dense exact coordination inside a group, infrequent decentralized reconciliation across groups. A parallel thread revisits gossip-based and asynchronous decentralized SGD for settings where no participant will tolerate a coordinator at all. We return with the optimization machinery to evaluate these methods in Chapter 10 and the federated framing in Chapter 14; for now, note that the frontier is not picking a single point on the coordinator spectrum but engineering precisely how often each level of a hierarchy needs to agree.
We now have the third lens of this chapter in place. Section 1.1 split scale-out from scale-up, Section 1.2 mapped the six axes of distribution, Section 1.3 weighed scaling up against scaling out, and this section located the source of truth that any scaled-out system needs. The next dimension is temporal: whether the system processes a fixed batch, a flowing stream, one online example at a time, or an interactive request that must answer now. That distinction reshapes every architecture in this section, and it is the subject of Section 1.5.
For each system, state whether its coordination is centralized, decentralized (and exact or approximate), or hybrid, and name the single point of failure if one exists: (a) a data-parallel training job using ring all-reduce for gradient synchronization; (b) a recommendation trainer whose embedding tables live on a sharded parameter server while dense layers use all-reduce; (c) a fleet of phones learning a shared keyboard model through a regional aggregator that periodically syncs with a global server; (d) a cluster scheduler that owns all placement decisions for a thousand nodes. Explain which of the two centralization taxes from Section 1 each design pays.
Modify Code 1.4.1 to compare three communication graphs over the same eight peers: the ring used in the section, a fully connected graph (every peer mixes with all others, with the doubly stochastic weights $1/K$), and a sparse line graph (the ring with one edge removed). For each, report the round at which the maximum deviation from the global mean first drops below $10^{-3}$. Relate the ordering you observe to the second-largest eigenvalue magnitude of each mixing matrix $W$, and explain in one sentence why the fully connected graph behaves like the centralized coordinator.
A training job has $M$ machines with $G$ GPUs each. Intra-machine communication costs $a$ seconds per all-reduce participant and cross-machine communication costs $b \gg a$ seconds per participant. Write the per-step communication time for (i) a flat all-reduce over all $MG$ GPUs and (ii) a hierarchical all-reduce that reduces within each machine, then across the $M$ machines, then broadcasts back. Under what relationship between $a$, $b$, $M$, and $G$ does the hierarchical scheme win, and what does your answer say about why hierarchy is the default once $b/a$ is large? You do not need the exact all-reduce cost model of Chapter 4; a participant-count proxy is enough.