"We all compute, we all wait, we all agree, we all step. It is a slow kind of harmony, but nobody ever wakes up holding a different model than mine."
An All-Reduce That Has Seen Some Gradients
Synchronous distributed stochastic gradient descent is the dominant algorithm for training at scale, and its defining property is that it computes exactly the same updates as a single machine running mini-batch SGD with a larger batch. Each of $K$ workers holds an identical copy of the model, draws its own local mini-batch, and computes a local gradient. At a barrier the workers all-reduce those gradients into one averaged gradient, and every worker applies the identical update. Because averaging the per-worker gradients is the same arithmetic as averaging over one global batch of size $K b$, the distributed run is not an approximation of single-machine SGD; it is single-machine SGD, reorganized across machines. That exactness is what makes the algorithm trustworthy. The price is the barrier itself: the slowest worker sets the pace, and moving the gradients can cost more than computing them. This section establishes the identity, runs it, and names the two taxes that the rest of the chapter works to reduce.
The previous section built mini-batch SGD on one machine: draw a batch of $b$ examples, average their gradients, take a step, repeat. That algorithm has a single throughput ceiling, namely how many examples one machine can push through per second. The way past that ceiling, established as an exact reorganization back in Section 1.1, is data parallelism: keep $K$ copies of the model in step with one another and let each copy chew through a different slice of the data. Synchronous distributed SGD is the disciplined version of that idea. The word "synchronous" is the whole story: the workers proceed in lockstep, separated by a barrier at every step, so that they never disagree about the current parameters. This section is about what that buys (exactness) and what it costs (the barrier).
1. The Algorithm: Compute Locally, All-Reduce, Step Together Beginner
Fix $K$ workers. Every worker holds the same parameter vector $w_t$ at the start of step $t$. The step has three phases, and keeping them distinct is the key to understanding everything that follows. First, each worker $k$ draws its own local mini-batch $B_k$ of $b$ examples and computes a local gradient $g_k$, the average loss-gradient over just those $b$ examples. This phase is embarrassingly parallel: no worker needs anything from any other, so all $K$ gradients are computed at the same time on $K$ machines. Second, the workers combine their gradients into the single averaged gradient $\bar{g} = \frac{1}{K}\sum_{k=1}^{K} g_k$. This is a reduction across machines, and it is exactly the all-reduce collective from Section 4.3: every worker ends the phase holding the same summed-and-averaged vector. Third, every worker applies the identical update $w_{t+1} = w_t - \eta\,\bar{g}$. Because all workers started from the same $w_t$ and all applied the same $\bar{g}$, they all hold the same $w_{t+1}$, and the invariant ("every worker holds the same model") is preserved into the next step.
The second phase contains a barrier, and the barrier is not incidental: it is the synchronous in synchronous SGD. No worker may apply the update until the all-reduce is complete, and the all-reduce is not complete until every worker has contributed its local gradient. This is precisely the bulk-synchronous-parallel (BSP) discipline from Section 2.2: a phase of independent local computation, then a global synchronization point, then the next phase. Figure 10.3.1 draws one such step.
2. The Exactness Identity: It Is Single-Machine SGD with a Bigger Batch Intermediate
Now comes the property that makes synchronous SGD the algorithm to beat. Let worker $k$ hold the mini-batch $B_k$ of $b$ examples, and write its local gradient as the average over its own batch,
$$g_k = \frac{1}{b} \sum_{i \in B_k} \nabla \ell(w_t; x_i, y_i).$$The averaged gradient that synchronous SGD applies is then
$$\bar{g} = \frac{1}{K} \sum_{k=1}^{K} g_k = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{b} \sum_{i \in B_k} \nabla \ell(w_t; x_i, y_i) = \frac{1}{Kb} \sum_{i \in B_1 \cup \cdots \cup B_K} \nabla \ell(w_t; x_i, y_i).$$Read the rightmost expression carefully. The double sum, with the $\frac{1}{Kb}$ in front, is exactly the mini-batch gradient you would compute on a single machine if you pooled all $K$ batches into one global batch of size $Kb$ and averaged over it. The two equal-sized averaging steps (average within each worker's batch, then average across workers) collapse into one average over the union, because every example carries the same weight $\frac{1}{Kb}$. So $\bar{g}$ is the gradient of the loss on a batch of $Kb$ examples, and the update $w_{t+1} = w_t - \eta\,\bar{g}$ is one mini-batch SGD step with that global batch. This is the gradient identity of Section 1.1 turned into a training loop: the same algebra that made a single distributed gradient exact now makes an entire distributed trajectory exact, step for step.
With $K$ workers at local batch $b$, synchronous distributed SGD produces the identical sequence of parameter vectors as one machine running mini-batch SGD at global batch $Kb$, up to floating-point rounding. The distribution is invisible to the optimizer: it sees one gradient per step, computed on $Kb$ examples, exactly as if one machine had done it. Every convergence guarantee, learning-rate schedule, and stability result you have for mini-batch SGD therefore transfers without modification. You are not trading accuracy for parallelism; you are reorganizing where the same arithmetic happens. The only thing the optimizer "notices" is that the batch got bigger, which is a known knob (large-batch training) the chapter returns to later, not a distribution artifact.
This identity is the reason synchronous SGD is the default. A practitioner can reason about the whole distributed system using single-machine intuition, because the math promises that the cluster computes what one big machine would. The caveat in the insight, that the batch grows to $Kb$, is real and is its own topic: very large batches can need a tuned learning-rate warmup and scaling to converge as fast per step, which is why the chapter has a dedicated section on large-batch training. But that is a property of large batches in general, not a defect introduced by splitting work across machines.
3. Verifying the Identity in Pure Python Intermediate
The claim is strong enough to deserve a direct test: run synchronous distributed SGD over $K$ simulated workers for many steps, run single-machine SGD at global batch $Kb$ for the same steps, and check that the two parameter trajectories coincide, not just at the end but at every step. The code below does exactly that on a linear-regression loss, where the gradient is a clean closed form. The "all-reduce" is the one line that averages the per-worker gradients; everything else is two ordinary SGD loops.
import numpy as np
rng = np.random.default_rng(7)
N, d, K = 4096, 20, 8 # global examples, features, workers
b = N // K # per-worker local batch size
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(N)
def grad_mse(w, Xs, ys):
# gradient of (1/m) * sum (x.w - y)^2 over the rows given
m = len(ys)
return (2.0 / m) * (Xs.T @ (Xs @ w - ys))
# ---- Reference: single machine, ONE big batch of size N = K*b, T steps ----
lr, T = 0.05, 30
w_ref = np.zeros(d)
ref_traj = [w_ref.copy()]
for _ in range(T):
g = grad_mse(w_ref, X, y) # full global-batch gradient
w_ref = w_ref - lr * g
ref_traj.append(w_ref.copy())
# ---- Synchronous distributed SGD over K workers, identical update ----
shards = np.array_split(np.arange(N), K) # disjoint shard per worker
w_sync = np.zeros(d) # every worker starts identical
sync_traj = [w_sync.copy()]
for _ in range(T):
# 1) each worker computes a local gradient on its own shard (in parallel)
locals_ = [grad_mse(w_sync, X[s], y[s]) for s in shards]
# 2) BARRIER + ALL-REDUCE: average the K local gradients
g_bar = np.mean(locals_, axis=0)
# 3) every worker applies the SAME averaged-gradient update
w_sync = w_sync - lr * g_bar
sync_traj.append(w_sync.copy())
ref_traj = np.array(ref_traj)
sync_traj = np.array(sync_traj)
max_dev = np.max(np.abs(ref_traj - sync_traj))
final_dev = np.linalg.norm(ref_traj[-1] - sync_traj[-1])
print("workers K :", K)
print("local batch b :", b)
print("global batch K*b :", K * b)
print("steps T :", T)
print("max |w_sync - w_ref| (all):", f"{max_dev:.2e}")
print("final ||w_sync - w_ref|| :", f"{final_dev:.2e}")
print("trajectories identical :", bool(max_dev < 1e-10))
max_dev is the largest per-coordinate gap across all 31 recorded parameter vectors), not just the final point, so any drift would show up immediately.workers K : 8
local batch b : 512
global batch K*b : 4096
steps T : 30
max |w_sync - w_ref| (all): 6.66e-16
final ||w_sync - w_ref|| : 3.16e-16
trajectories identical : True
The gap is not "small", it is zero up to rounding, and it stays zero across all thirty steps rather than slowly drifting. That is the experimental face of the identity: synchronous SGD is a faithful reorganization of single-machine SGD, so the optimizer's whole theory carries over untouched. The interesting engineering, as always in this book, is not in the correctness of the arithmetic; it is in the cost of the all-reduce when the gradient is billions of numbers long and the workers number in the thousands.
The single line g_bar = np.mean(locals_, axis=0) is the all-reduce of Section 4.3 standing in for the network collective. The thesis of this book is that one primitive, scaled out, drives method after method, and synchronous SGD is the clearest instance: the entire algorithm is "local gradient, all-reduce, identical step". When this loop moves onto real hardware in data-parallel deep learning (Chapter 15), the only thing that changes is that the average is computed by a ring or tree of GPUs exchanging buckets over the network, overlapped with the backward pass. The optimizer is identical; the collective is the engine.
Code 10.3.1 averaged the per-worker gradients by hand with one np.mean. On a real cluster you never gather gradients to one place; each worker keeps its own and the framework all-reduces them directly over the network during the backward pass. PyTorch's DistributedDataParallel wires the collective into autograd, so the synchronous step becomes ordinary training code:
# Run with: torchrun --nproc_per_node=8 thisfile.py
import torch, torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
dist.init_process_group("nccl") # join the group of K workers
model = DDP(build_model().cuda()) # wrap once; same weights on every worker
for x, y in local_shard_loader(): # each worker draws its own local batch B_k
loss = loss_fn(model(x), y)
loss.backward() # backward fires the all-reduce of g_k -> g_bar
optimizer.step() # every worker applies the SAME averaged update
optimizer.zero_grad()
DDP wrapper: backward() triggers the gradient all-reduce automatically, bucketing the parameters and overlapping the collective with computation so the barrier is partly hidden. The optimizer code is unchanged from single-machine training.4. The Price of the Barrier: Stragglers and Communication Intermediate
Exactness has a cost, and the cost is the barrier. Because every worker must arrive at the all-reduce before any worker can step, the duration of a step is set by the slowest participant. If seven workers finish their local gradient in 90 milliseconds and the eighth takes 300 because its host is sharing a noisy network link or throttling on heat, all eight spend 300 milliseconds on that step and seven of them sit idle for 210. This is the straggler problem from Section 2.7, and synchronous SGD is maximally exposed to it: there is no slack anywhere in the schedule, so a single slow worker taxes every step for the whole run. The more workers you add, the more likely it is that at least one of them is slow at any given moment, so the straggler tax tends to grow, not shrink, with scale.
The second tax is communication. The all-reduce moves the entire gradient between workers on every step, and a modern model's gradient is the same size as its parameters: hundreds of millions to hundreds of billions of numbers. The time to move those bytes does not get smaller as the workers compute faster; it is set by the network. When the per-step communication cost approaches or exceeds the per-step computation cost, adding workers stops helping, because the barrier you pay at every step swallows the speedup the extra workers were supposed to provide. This is the regime that the communication-cost models of Section 4.1 and Chapter 3 quantify, and it is exactly why so much of this chapter is about making the combine step cheaper rather than the compute step faster.
Who: An ML platform engineer running data-parallel training for a recommendation team on an eight-GPU node.
Situation: A synchronous DDP job that normally finished an epoch in 22 minutes had crept to 41 minutes overnight, with no change to the code or the data.
Problem: Throughput per step had halved, but every worker reported healthy utilization when sampled individually, so nothing looked broken.
Dilemma: Switch to an asynchronous scheme to remove the barrier and tolerate the slow worker, at the cost of giving up the exactness of Section 2, or find and fix the straggler and keep synchronous SGD's clean convergence.
Decision: Keep synchronous SGD and hunt the straggler, because the team relied on the reproducibility that exactness gives and did not want to debug stale-gradient convergence under a deadline.
How: Per-step timing per rank showed rank 6 consistently arriving last at the all-reduce; nvidia-smi revealed that GPU thermal throttling on one slot was clocking it 30 percent slower than its siblings.
Result: Reseating the card and restoring airflow brought rank 6 back in line, and the epoch returned to 22 minutes. The barrier had been faithfully reporting the truth: one slow worker had been setting the pace for all eight.
Lesson: In synchronous SGD the slowest worker is the system's clock. Watch per-rank arrival times at the barrier, because a single straggler is invisible in aggregate utilization yet doubles wall-clock time.
These two taxes, the straggler tax and the communication tax, are the agenda for the rest of Chapter 10. The next section relaxes the barrier itself: asynchronous distributed SGD lets workers step without waiting for the slowest, trading the exactness of Section 2 for resistance to stragglers, and inheriting a new problem (stale gradients) in the bargain. Later sections attack the communication tax directly with gradient compression and local-update schemes, and make the all-reduce itself the object of study. The exactness established here is the baseline every one of those methods is measured against.
Because synchronous SGD's exactness is so valuable, a major research thrust is to preserve it while defeating its two taxes rather than abandoning the barrier. On the communication side, local-update schemes in the DiLoCo lineage (Douillard et al., 2024) let each worker take many inner steps between global synchronizations, cutting all-reduce frequency by one or two orders of magnitude and enabling synchronous training over the open internet and across data centers; follow-on work (streaming and async DiLoCo, 2024 to 2025) pushes the idea toward geographically distributed clusters. On the straggler side, production frameworks pair synchronous SGD with elastic and fault-tolerant runtimes (PyTorch's torchrun elastic agent, and systems-level work on bubble-free pipelines) so that a slow or dead worker is reconfigured around rather than allowed to stall the barrier, a thread Chapter 18 develops. The common message of this line is that the exactness of Section 2 is an asset worth engineering hard to keep, not a constraint to be relaxed at the first sign of a straggler.
Synchronous SGD is the one group project where the rule "nobody hands in until everybody hands in" is enforced by physics. The upside is that the final report is always perfectly consistent: every member literally holds the same document. The downside is the one teammate who is still "almost done" at the barrier, holding the other seven hostage at every single step, forever. The rest of this chapter is, in a sense, an extended meditation on how to deal with that teammate without giving up the consistency.
The identity in Section 2 collapses "average within each worker, then average across workers" into a single average over the global batch, but only because every worker's batch has the same size $b$. (a) Write out $\bar{g}$ for the case where worker $k$ has $b_k$ examples and $\sum_k b_k = M$, and show that the plain unweighted mean $\frac{1}{K}\sum_k g_k$ no longer equals the global-batch gradient $\frac{1}{M}\sum_{i} \nabla\ell$. (b) State the size-weighted combination that does recover the global-batch gradient. (c) Explain what this implies for a synchronous job in which the last batch of an epoch is smaller than the others, and how a framework can keep the update exact anyway.
Extend Code 10.3.1 into a timing simulation. Keep the exact gradient logic, but before each worker's gradient give it a simulated compute time drawn from a distribution: seven workers take a time near $t_0$ and one worker (the straggler) takes $\lambda t_0$ for a multiplier $\lambda \ge 1$. Model the synchronous step time as the maximum over the $K$ worker times plus a fixed all-reduce cost $c$. Plot wall-clock time per step against $\lambda$ for $\lambda \in \{1, 1.5, 2, 3, 5\}$, and separately against $K \in \{2, 8, 32, 128\}$ when each worker's time is random. Confirm numerically that the per-step time tracks the slowest worker, and that the expected straggler penalty grows with $K$.
A model has $P = 2 \times 10^8$ parameters (4 bytes each). One worker computes its local gradient in $t_{\text{comp}} = 80$ milliseconds. The all-reduce of a $P$-parameter gradient across $K$ workers on a ring takes roughly $t_{\text{comm}} \approx 2\,\frac{(K-1)}{K}\,\frac{4P}{\beta}$ seconds, where $\beta$ is the per-link bandwidth (use $\beta = 10$ gigabytes per second). (a) Compute $t_{\text{comp}} + t_{\text{comm}}$ for $K = 8$ and for $K = 64$, and report the fraction of each step spent communicating. (b) At what $K$ does communication first exceed computation for this model and bandwidth? (c) Argue from your numbers why this section's exact algorithm, run naively, can stop scaling well before you run out of machines, and name one technique from later in the chapter that attacks the term you identified. Make the estimate rigorous against the cost models of Chapter 3.