"I finished my gradient in twenty milliseconds and then stood at the barrier for seventeen more, watching the slow worker, learning patience I never asked for."
A GPU Idling on a Communication Barrier
Workers that compute in isolation are useless; the value of a distributed AI system lives entirely in how its workers exchange information and agree on when to do so. The previous section gave us the cast of processes, nodes, workers, and coordinators. This section is about the verbs that connect them: how one worker hands a value to another (communication), and how the group decides who waits for whom (synchronization). Almost every distributed training method you will meet is a particular answer to two questions: do workers share memory or pass messages, and do they march in lockstep or run free? The dominant answer for deep learning, lockstep message passing organized into repeating rounds, has a precise name borrowed from parallel computing, the Bulk Synchronous Parallel model, and once you see a training step as one of its supersteps, the rest of the book's synchronization machinery falls into place.
In the previous section we separated the roles a machine can play: a worker that holds a shard of the computation, a coordinator that hands out work, a process that may live on the same node as another or across the network. That gave us nouns. By themselves they do nothing useful, because a distributed computation is defined not by its parts but by the information flowing between them. A data-parallel training cluster is exactly its gradient exchange; remove the exchange and you have eight machines computing eight different, useless models. This section supplies the verbs, and it does so along two independent axes that, taken together, classify nearly every coordination strategy in this book.
The first axis is how information moves: through memory two workers can both read, or through explicit messages one worker sends to another. The second axis is when information moves relative to computation: synchronously, where the group waits at agreed-upon points, or asynchronously, where workers proceed without waiting. We take the two axes in turn, then assemble them into the model that organizes synchronous distributed training, and finally contrast it with the relaxed alternatives that later chapters exploit when the barrier becomes too expensive.
1. Message Passing Versus Shared State Beginner
There are two fundamental ways for two workers to share a value. In the shared-state model, both workers can read and write the same piece of memory; communication is implicit in the act of reading what another wrote. In the message-passing model, each worker has only its own private memory, and the only way to move a value is to send it explicitly across a channel that the receiver explicitly reads. The distinction sounds academic until you notice that it is forced by physics: two threads on one machine can share an address space, but two GPUs in two servers in two racks cannot, so any computation spanning machines is message passing whether the programmer sees it or not.
This is why distributed AI is overwhelmingly a message-passing world. When a data-parallel training job runs across eight GPUs, no shared variable holds "the gradient"; each worker holds its own copy and the framework moves bytes between them over an interconnect. The shared-state illusion that a single optimizer is updating a single model is reconstructed, on top of message passing, by a collective operation that makes every worker's copy identical after each step. The parameter-server architecture of Chapter 11 goes a step further and provides an explicit shared store that workers push to and pull from, a deliberate piece of shared-state semantics built out of messages. Knowing which model you are in tells you immediately where your costs and your bugs will come from: shared state breeds race conditions and lock contention, message passing breeds latency and the question of who waits for whom.
Every value that crosses a machine boundary is a message, full stop. Shared-state abstractions like a parameter server or a replicated optimizer state are conveniences built on top of message passing, and they cost real network bytes and real waiting. When you reason about a distributed AI system, start from "what messages move, and when", because that is the layer where performance and correctness actually live. The synchronization question, who waits for whom before the next message, is the subject of the rest of this section.
2. Synchronous Versus Asynchronous Communication Beginner
Given that workers pass messages, the next question is whether the sender and the group wait. A synchronous exchange couples the participants in time: the workers reach an agreed point, exchange, and only then continue, so every worker observes the same combined result before taking its next step. An asynchronous exchange decouples them: a worker sends its contribution and proceeds immediately, reading whatever combined value happens to be available, which may be slightly out of date. Synchronous exchange buys determinism and a clean mental model at the price of waiting for the slowest participant; asynchronous exchange buys speed and straggler tolerance at the price of working with stale information.
It helps to separate two senses of the word "blocking" that often get conflated. A blocking call does not return to the caller until the operation completes locally; a non-blocking call returns immediately with a handle the caller can wait on later. This is a local property of one call. Synchronous versus asynchronous, by contrast, is a global property of the group's coordination. A collective such as all-reduce can be issued as a non-blocking call so that a worker launches the exchange and overlaps it with other computation, yet the algorithm as a whole can still be globally synchronous because every worker must complete the all-reduce before the next training step begins. The most important performance idea in distributed training, hiding communication behind computation, lives exactly in this gap: keep the group synchronous for correctness, but use non-blocking collectives so no worker sits idle while the bytes are in flight. We return to this overlap as a research frontier at the end of the section, and the collectives chapter, Chapter 4, makes the blocking and non-blocking variants concrete.
3. Barriers and the Bulk Synchronous Parallel Model Intermediate
A barrier is the primitive that makes a group synchronous. It is a point in the program that no worker may pass until every worker has reached it, so the fast workers wait for the slow ones and the group leaves the barrier together. A barrier carries no data by itself; it only synchronizes time. Yet it is the single most consequential line in many distributed programs, because the group's pace is set by its slowest member at every barrier, and a cluster of a thousand workers crossing a barrier every step inherits the speed of whichever worker happened to be slowest that step.
The Bulk Synchronous Parallel model, introduced by Valiant in 1990, organizes an entire computation out of barriers. A BSP computation is a sequence of supersteps, and each superstep has exactly three phases in this order: a compute phase where every worker computes locally on the data it already holds, touching no other worker; a communicate phase where workers exchange the messages produced by that computation; and a barrier phase where all workers synchronize, so that messages sent in one superstep are guaranteed delivered before the next superstep's compute phase begins. The genius of the model is its simplicity: because communication and synchronization happen only at superstep boundaries, the programmer never reasons about messages arriving mid-computation, and the cost of the whole program is just the sum of the per-superstep costs. Figure 2.2.1 shows three workers moving through two supersteps, made uneven on purpose so the barrier idle is visible.
The reason BSP matters for this book is that one step of synchronous data-parallel training is precisely one BSP superstep. The compute phase is the forward and backward pass, where each worker computes a gradient on its own shard touching no other worker. The communicate phase is the gradient all-reduce, where workers exchange and sum their partial gradients. The barrier phase is the implicit synchronization that all-reduce contains: no worker can apply the optimizer update and start the next step until the all-reduce has completed on every worker, because the averaged gradient depends on all of them. The gradient synchronization that Section 1.1 introduced as an exact reorganization of the math is, viewed through this lens, a BSP barrier wearing the costume of a collective operation.
Section 1.1 showed that data parallelism is mathematically exact: eight workers compute the identical gradient that one machine would. This section names the price of that exactness. The all-reduce that makes the gradient identical is a BSP barrier, and a barrier runs at the speed of its slowest worker. So the scale-out tax is not approximation error, it is waiting: every fast worker idling until the straggler arrives, every step. The rest of the book's synchronization machinery, from bounded-staleness parameter servers (Chapter 11) to elastic training (Chapter 18), is an attempt to lower this barrier tax without breaking the math.
The demo below makes the barrier tax measurable. It models a four-worker BSP computation over three supersteps, where one worker is a persistent straggler. Each superstep it records every worker's compute time, the idle each fast worker spends waiting at the barrier, and the superstep's total cost, which is set by the slowest worker plus the fixed communication. It then reports the wall-clock and how much of it was pure barrier idle.
import random
random.seed(7)
WORKERS, STEPS = 4, 3
# Per-worker compute time (ms) per superstep: workers are uneven, so one straggles.
base = [20.0, 22.0, 21.0, 38.0] # worker 3 is a persistent straggler
comm = 6.0 # all-reduce (communicate) cost, ms
print(f"{'step':>4} | " + " | ".join(f"w{w} compute" for w in range(WORKERS))
+ " | barrier wait (slowest-self) | superstep ms")
bsp_total = 0.0
for step in range(STEPS):
# SUPERSTEP PHASE 1: local compute (jitter around each worker's base cost).
compute = [c + random.uniform(-2, 2) for c in base]
slowest = max(compute)
# PHASE 3: the BARRIER. No worker leaves until the slowest arrives, so each
# fast worker idles for (slowest - its own compute) before the all-reduce.
waits = [slowest - c for c in compute]
# PHASE 2 (after barrier): communicate. One superstep = compute + barrier + comm.
superstep = slowest + comm
bsp_total += superstep
print(f"{step:>4} | " + " | ".join(f"{c:8.2f}ms" for c in compute)
+ f" | {' '.join(f'{w:4.1f}' for w in waits)} | {superstep:8.2f}")
# Counterfactual: if every worker ran the straggler's load with NO barrier idle.
ideal = sum(base) / WORKERS * STEPS
print()
print(f"BSP wall-clock (paced by slowest each step) : {bsp_total:7.2f} ms")
print(f"idle injected by barriers (vs perfect balance): {bsp_total - ideal - comm*STEPS:7.2f} ms")
print(f"straggler share of total compute time : "
f"{base[3] / sum(base) * 100:5.1f}% of the work, but it sets the pace")
waits column is the barrier idle that turns a fast worker into a waiting one.step | w0 compute | w1 compute | w2 compute | w3 compute | barrier wait (slowest-self) | superstep ms
0 | 19.30ms | 20.60ms | 21.60ms | 36.29ms | 17.0 15.7 14.7 0.0 | 42.29
1 | 20.14ms | 21.46ms | 19.23ms | 38.03ms | 17.9 16.6 18.8 0.0 | 44.03
2 | 18.15ms | 21.73ms | 19.28ms | 36.36ms | 18.2 14.6 17.1 0.0 | 42.36
BSP wall-clock (paced by slowest each step) : 128.68 ms
idle injected by barriers (vs perfect balance): 34.93 ms
straggler share of total compute time : 37.6% of the work, but it sets the pace
The numbers tell the whole story of synchronous training. The straggler never does most of the work, only about 38 percent of the total compute, yet it sets the pace of every superstep, and the three faster workers collectively burn about 35 ms standing at barriers across just three steps. Multiply that by a hundred thousand steps and a thousand workers and the idle becomes the dominant cost. This is not a flaw in the model; it is the honest price of guaranteeing that every worker sees the same gradient before it steps. The question the next section asks is whether you can stop paying it.
4. Relaxing the Barrier: Asynchronous and Stale-Synchronous Models Advanced
If the barrier is the tax, the obvious move is to remove it. Asynchronous training does exactly that: each worker computes a gradient and applies it to a shared model immediately, without waiting for anyone, then fetches the latest model and continues. No worker ever idles at a barrier, so a straggler slows only itself, and a thousand workers run at close to a thousand workers' worth of throughput. The price is that the gradient a worker computes is stale: by the time it is applied, other workers have already moved the model, so the update is based on parameters that no longer exist. Asynchronous stochastic gradient descent, which we develop fully in Chapter 10, trades the determinism and clean convergence of BSP for raw throughput, and whether that trade pays off depends on how much staleness the optimization can absorb.
Between the two extremes sits the stale-synchronous parallel model, which keeps a barrier but loosens it. Instead of forcing every worker to synchronize every step, it allows a worker to run ahead by up to $s$ steps before it must wait for the slowest worker to catch up. Formally, a worker at step $t$ may proceed as long as the slowest worker has completed step $t - s$, so the staleness any worker sees is bounded by $s$. Setting $s = 0$ recovers strict BSP, where every worker is exactly in step; letting $s \to \infty$ recovers fully asynchronous training, where no worker ever waits. The bounded-staleness parameter server of Chapter 11 is built on exactly this idea, and the consistency models of Section 2.5 give it a formal vocabulary. The point for now is that synchronization is a dial, not a switch: BSP is one setting of it, the strictest, and most of the interesting engineering lives in how far you can relax the dial before the math stops converging.
There is a recurring comedy in synchronous clusters: the most powerful, most expensive GPUs in the building spend much of their life waiting on one slow node, perhaps a machine whose cooling fan is failing and whose clocks have quietly throttled. A thousand accelerators, each costing more than a car, idling in unison because node 487 runs four percent slow. This is why straggler detection (Section 2.7) is not a luxury feature: in a strict-BSP world, your slowest worker is your whole cluster's clock speed, and finding it is finding free performance.
5. Coordination Beyond the Training Loop Intermediate
So far "coordination" has meant the data-plane question of when workers exchange gradients. There is a second, separate sense: the control-plane coordination of agreeing on facts that must be consistent across the cluster, such as which worker is the leader, which configuration is current, or whether a checkpoint committed. These two kinds of coordination use deliberately different machinery. The training data plane uses collectives and barriers because it values throughput and the participants are known and fixed. The control plane uses consensus protocols such as Raft because it values agreement and durability over raw speed, and it must tolerate participants joining and leaving. Section 2.6 develops this split in full; the thing to carry forward now is that "synchronization" in a distributed AI system is not one mechanism but two, chosen by what is being coordinated.
The reason the training loop does not use a consensus protocol for its gradient exchange is instructive. Consensus protocols are built to agree on a value in the presence of failures and disagreement, and they pay for that robustness with multiple rounds of messages per decision. A training step cannot afford that; it needs to sum a billion-element vector across a thousand workers in milliseconds, and it can assume the participants agree on the goal because they are running the same program. The collective is the right tool precisely because it drops the generality that consensus provides and keeps only the fast path. Matching the coordination mechanism to the job, a barrier for the lockstep data plane, a consensus round for the durable control plane, is the same "right tool" discipline that runs through the whole book.
You almost never implement a barrier or a superstep boundary by hand. PyTorch's distributed package exposes the BSP superstep structure directly: a blocking collective such as all_reduce already contains the barrier, because it cannot return until every worker has contributed, and dist.barrier() gives you a bare synchronization point when you need one without data. The roughly twenty lines of timing logic in Code 2.2.1 that modeled "wait for the slowest worker" are, in a real job, this:
# Run with: torchrun --nproc_per_node=4 thisfile.py
import torch, torch.distributed as dist
dist.init_process_group("nccl") # the group of workers in this superstep
grad = compute_local_gradient() # PHASE 1: compute, each worker on its shard
dist.all_reduce(grad, op=dist.ReduceOp.SUM) # PHASE 2+3: communicate AND barrier in one
grad /= dist.get_world_size() # every worker now holds the identical mean
dist.barrier() # a bare barrier: synchronize, exchange no data
all_reduce fuses the communicate and barrier phases of Figure 2.2.1 into one call, so the barrier idle measured in Output 2.2.1 happens inside the library; dist.barrier() is the bare synchronization primitive when no data needs to move.Who: An ML platform engineer running data-parallel training for a computer-vision team on a shared 64-GPU cluster.
Situation: A training job benchmarked at near-linear scaling on a clean cluster suddenly ran 30 percent slower in production, with no change to the model or the data.
Problem: GPU utilization graphs showed every accelerator dropping to near-zero for a few milliseconds at the end of each step, in perfect unison, a sawtooth of full-then-idle.
Dilemma: Switch the whole job to asynchronous training to eliminate the waiting, which would have decoupled the workers but risked convergence changes the research team had not approved, or keep the synchronous BSP step and find the worker that was setting the pace.
Decision: They kept BSP and treated the unison idle as a barrier symptom, because the synchronized sawtooth is the unmistakable signature of fast workers waiting on a straggler, exactly the pattern in Output 2.2.1.
How: Per-worker step-time logging found one node consistently 25 percent slower; its GPU had silently throttled from a failing fan, and because the step is a BSP barrier, that one node was the cluster's clock.
Result: Draining the bad node restored near-linear scaling and recovered the lost 30 percent, with no change to the optimization at all. The fix was operational, not algorithmic.
Lesson: In a strict-BSP training loop, a synchronized idle sawtooth across all workers means one thing: a barrier waiting on the slowest member. Before you relax synchronization, find out whether you simply have a sick worker setting the pace.
6. Overlapping Communication With Computation Advanced
There is a third way to lower the barrier tax that changes neither the math nor the staleness: hide the communication behind the computation so the waiting overlaps with useful work. Recall from Section 2 that a non-blocking collective returns immediately and runs in the background. In the backward pass of a deep network, gradients become available layer by layer, from the output layer first. So a worker can launch the all-reduce for the last layer's gradient the instant it is computed, while it continues computing gradients for the earlier layers. By the time the backward pass finishes, much of the communication has already happened in the shadow of the compute, and the worker waits at the barrier for far less time. The group stays globally synchronous, so the math of Section 1.1 is untouched, but the idle of Output 2.2.1 shrinks toward zero. This bucketing and overlap is what DistributedDataParallel does automatically, and it is the reason real data-parallel training scales far better than the naive "compute, then communicate, then barrier" model would predict.
Two research lines are actively attacking the barrier tax this section quantified. The first pushes communication-computation overlap to its limit: systems such as Google's overlap-centric compiler passes and the open Centauri and FLUX work (2024) fuse and schedule collectives so finely that the all-reduce nearly vanishes behind the matrix multiplies of the backward pass, driving barrier idle toward zero while keeping strict synchronous semantics. The second line relaxes the barrier itself across slow links: DiLoCo and its descendants (Douillard et al., 2024) let workers take many local steps between rare synchronizations, turning a per-step barrier into a per-hundred-step one, which makes training viable over the open internet and across data centers where a per-step barrier would be ruinous. Both lines accept the framing of this section, that the barrier is a cost to be engineered down, and differ only in whether they hide it (overlap) or thin it out (local steps). We give the optimization-side analysis in Chapter 10 and the collective-scheduling side in Chapter 4.
We now have both axes that classify distributed coordination: message passing versus shared state, and synchronous versus asynchronous, with the barrier and its BSP superstep as the synchronous workhorse and the stale-synchronous dial as the relaxation. We have also seen that one synchronous training step is a BSP superstep, which is why everything from straggler hunting to overlap scheduling is, at bottom, an argument about a barrier. The next section turns from when workers coordinate to how their data and model are split in the first place, the partitioning and sharding decisions that determine what each worker even has to communicate. That begins in Section 2.3.
For each system, place it on both axes of this section (message passing versus shared state, and synchronous versus asynchronous) and justify the placement in one sentence each: (a) eight GPUs in one server running data-parallel training with a per-step gradient all-reduce; (b) a parameter server where workers push gradients and pull parameters whenever they finish, never waiting for each other; (c) a stale-synchronous job that allows workers to run up to three steps ahead. Then state, for each, what single failure or slowdown would hurt it most, and connect that to the axis placement.
Extend Code 2.2.1 into a stale-synchronous model. Add a staleness bound $s$ and let any worker run ahead until the slowest worker is within $s$ supersteps; a worker that would exceed $s$ must idle at a barrier, otherwise it proceeds. Sweep $s$ from $0$ (strict BSP) to a large value (effectively asynchronous) and plot total wall-clock and total barrier idle against $s$ for the same straggler profile. Explain the shape of both curves, and state what the plot does not capture that would make a large $s$ a bad choice in a real training job.
A data-parallel step takes 200 ms of backward-pass compute and needs a 60 ms gradient all-reduce. Under the naive "compute then communicate" model, estimate the per-step time and the fraction spent communicating. Now assume the gradient becomes available evenly across the backward pass and the all-reduce can be launched layer by layer as a non-blocking collective, overlapping with the remaining compute. Estimate the best-case overlapped step time and the residual communication that cannot be hidden. Argue from these numbers why overlap helps most when computation per step is large relative to communication, and connect this to the communication-cost model we build in Chapter 3.