"They kept asking me to send fewer bits. I asked them how few. Nobody had read the theorem that already answered them."
An All-Reduce That Has Seen Some Gradients
Distributed optimization spends two distinct resources, computation and communication, and there is a hard floor on how little communication any algorithm can use to reach a given accuracy. Everything earlier in this chapter, synchronous and asynchronous SGD, all-reduce, stale gradients, gradient compression, and local updates, can be read as moves in one game: reach error $\epsilon$ while exchanging as few rounds and as few bits as possible. This section makes the resource explicit. We count communication rounds and bits per round, not just floating-point operations; we write the convergence rate of SGD as a function of iterations and gradient noise and see exactly how distribution changes its constants; we state the communication-computation trade-off of local-update methods as a formal object; and we state, at a teaching level, the lower bounds proving you cannot communicate arbitrarily little. The conclusion that governs the rest of Part III and Part IV: clever systems tricks move you toward the bound, never past it.
The previous sections of this chapter treated communication as a cost to be reduced by engineering: overlap it with computation (Section 10.5), compress it (Section 10.7), or amortize it across several local steps (also Section 10.7). Each of those is a tactic. This section supplies the strategy by asking the question that bounds them all: what is the least communication any method could possibly get away with? Once you can write down that floor, every tactic gets a yardstick. A compression scheme that halves the bytes is good; a local-update schedule that needs ten times fewer rounds is better; but neither can drop below the floor that the structure of the problem imposes, and knowing where the floor is tells you when to stop optimizing and start accepting.
1. Two Resources, Not One: Rounds and Bits Beginner
In single-machine optimization the only resource that matters is computation, usually counted in gradient evaluations or floating-point operations. The moment the work is distributed, a second resource appears and quickly dominates: communication. We measure it along two independent axes. The first is the number of communication rounds, meaning how many times the workers must synchronize, each round paying the network's fixed startup latency $\alpha$. The second is the number of bits per round, meaning how large each synchronized message is, paying the inverse-bandwidth cost $\beta$ per word. This is exactly the $\alpha$-$\beta$ cost model introduced in Chapter 3 and applied to collectives in Chapter 4; here we use it to score whole optimization algorithms rather than single messages.
The two axes call for different remedies, and confusing them is a common mistake. Gradient compression (Section 10.7) attacks the bits-per-round axis: it shrinks each message while keeping the number of rounds fixed. Local-update methods attack the rounds axis: they keep each message full size but communicate far less often. A method can be excellent on one axis and useless on the other, so an honest accounting always reports both. The cost of running an iterative optimizer for $R$ rounds, each exchanging a $P$-word model, is approximately
$$T_{\text{comm}} \;\approx\; R \,\bigl(\alpha + \beta \, b(P)\bigr),$$where $b(P)$ is the bits actually sent per round after any compression. Reducing $R$ and reducing $b(P)$ are separate levers, and the lower bounds we reach in Section 4 constrain them separately.
The cost of a distributed optimizer is not its FLOP count; it is the number of synchronization rounds times the bits per round. Two algorithms can do identical gradient work yet differ by orders of magnitude in wall-clock time because one synchronizes every step and the other synchronizes every hundredth step. When you analyze a distributed method, count rounds first: latency $\alpha$ is paid once per round regardless of message size, and on a real cluster $\alpha$ is often the term that dominates for the small, frequent messages that synchronous SGD produces.
2. How Distribution Reshapes the SGD Convergence Rate Intermediate
To talk about "communication needed to reach accuracy $\epsilon$" we first need the rate at which stochastic gradient descent reaches $\epsilon$ at all. For a convex objective optimized by SGD with stochastic gradients of variance bounded by $\sigma^2$, the classic result is that after $T$ iterations the expected sub-optimality decays as
$$\mathbb{E}\bigl[f(\bar w_T) - f^\star\bigr] \;=\; O\!\left(\frac{1}{\sqrt{T}}\right) \;+\; O\!\left(\frac{\sigma^2}{\sqrt{T}}\right),$$so two terms govern the error: an optimization term that shrinks as you take more steps, and a noise term set by the variance $\sigma^2$ of the gradient estimates. This is the rate that Section 10.2 established for mini-batch SGD on one machine. The crucial fact for distribution is what happens to $\sigma^2$ when $K$ workers each compute a gradient on independent data and average the results: the variance of the averaged gradient falls like $\sigma^2 / K$. More workers do not change the shape of the rate; they shrink the constant in front of the noise term. This linear variance reduction is the entire statistical reason to use many workers, and it is the same averaging identity that made data parallelism exact in Section 1.1.
For a $\mu$-strongly-convex, $L$-smooth objective the picture sharpens into a form we can turn into a round count. Writing the condition number as $\kappa = L/\mu$, full-batch gradient descent reaches error $\epsilon$ in $O\!\left(\kappa \log(1/\epsilon)\right)$ iterations, and accelerated (momentum) methods improve this to $O\!\left(\sqrt{\kappa}\,\log(1/\epsilon)\right)$. In a distributed setting where every iteration requires one synchronization, those iteration counts are the round counts. Distribution helps the noise constant through the $\sigma^2/K$ term, but the logarithmic-in-$\epsilon$, square-root-in-$\kappa$ dependence on rounds does not vanish: it is the part that the lower bounds of Section 4 prove is irreducible. The benefit of more workers is real but bounded, exactly the "when distribution helps and when it hurts" tension first raised in Section 1.1 and quantified by the performance models of Chapter 3.
3. The Communication-Computation Trade-Off as a Formal Object Intermediate
Local-update methods, the local-SGD family of Section 10.7, make a precise bargain: each worker takes $H$ local gradient steps before the workers average their models, so one communication round now covers $H$ iterations of computation. If a fully synchronous run needs $T$ iterations, the local-update run needs only about $T/H$ rounds, an $H$-fold reduction in the rounds axis. That is not free. Letting workers drift for $H$ steps before reconciling introduces a discrepancy between their local models, which inflates the effective noise and slows convergence per round; the analysis shows the optimization term degrades in a way that grows with $H$. The trade-off is therefore a genuine formal object, not a vibe: rounds scale as $T/H$ while the per-round progress shrinks, and there is an optimal $H$ that minimizes wall-clock time for a given network. Picking $H$ too small wastes latency on needless rounds; picking it too large wastes computation chasing a drifting average.
Figure 10.9.1 plots this as a region. The horizontal axis is communication rounds, the vertical axis is accuracy reached, and the dashed frontier is the lower bound: no point below it is achievable by any algorithm. Full-communication SGD lives near the frontier but pays for it in rounds; local-update methods slide left into the interior, buying fewer rounds with extra computation, but they too are stopped by the same frontier. The demo below measures real points in this picture.
4. Lower Bounds: You Cannot Communicate Arbitrarily Little Advanced
The optimistic reading of Section 3 is that clever scheduling can drive the round count as low as you like. It cannot, and the reason is a line of results on the communication complexity of distributed and convex optimization. The central message, stated at a teaching level without the full proofs, is an existence claim about a floor: for the class of $L$-smooth, $\mu$-strongly-convex problems split across machines whose data is connected over a network, any first-order algorithm that reaches accuracy $\epsilon$ must perform at least
$$R \;=\; \Omega\!\left(\sqrt{\kappa}\,\log\tfrac{1}{\epsilon}\right)$$communication rounds, where $\kappa = L/\mu$ is the condition number. The $\log(1/\epsilon)$ factor says high accuracy fundamentally costs more rounds; the $\sqrt{\kappa}$ factor says badly conditioned problems fundamentally cost more rounds; and the $\Omega$ says this is a floor, not a target. No amount of overlap, compression, or local stepping removes it, because it follows from how slowly information about the global optimum can propagate through the workers, independent of implementation. Companion results bound the bits: reaching $\epsilon$ also requires a minimum total number of bits exchanged, so even infinitely many rounds of one-bit messages cannot win. These bounds descend from the convex-optimization complexity theory of Nemirovski and Yudin and were sharpened for the explicitly distributed setting by Tsitsiklis and Luo and, more recently, by Arjevani and Shamir and by Scaman and colleagues, whose graph-dependent bounds tie the round count to the network's spectral gap.
Every communication-reduction technique in this chapter, overlap, compression, sparsification, local updates, moves a real algorithm closer to the lower-bound frontier. None of them moves the frontier. The practical consequence is liberating rather than discouraging: once a method is operating within a small constant factor of $\Omega(\sqrt{\kappa}\,\log(1/\epsilon))$ rounds, further engineering on the communication axis has almost nothing left to win, and your effort is better spent on the problem's conditioning $\kappa$ (preconditioning, normalization) or on accepting a looser $\epsilon$. The bound tells you when to stop.
Two practical levers do remain, and they are exactly the terms in the bound. Improving the conditioning $\kappa$, through preconditioning or normalization, lowers the floor itself, which is why so much of large-batch training (Section 10.8) is really about keeping $\kappa$ in check. And relaxing the target $\epsilon$ moves you off the steep part of the $\log(1/\epsilon)$ curve. What you cannot do is hold $\kappa$ and $\epsilon$ fixed and wish the round count below the frontier.
5. Measuring the Trade-Off and the Frontier Intermediate
The code below makes the abstract frontier concrete on a strongly-convex quadratic with stochastic gradients. It runs distributed SGD across $K = 16$ workers and measures the number of communication rounds needed to reach a fixed accuracy $\epsilon$, sweeping the local-step count $H$ from full communication ($H = 1$) up to $H = 32$. For each setting it also reports the gradient evaluations per worker, so the computation paid for fewer rounds is visible. Finally it prints the lower-bound reference $\sqrt{\kappa}\,\log(1/\epsilon)$, the floor that no row can cross.
import numpy as np
# Minimize the strongly-convex quadratic f(w) = 0.5 (w - w*)^T H (w - w*) by SGD.
# Each stochastic gradient carries zero-mean noise of variance sigma^2. K workers
# average their iterates at every COMMUNICATION ROUND. Local-step SGD takes H local
# steps between rounds, trading more computation per round for fewer rounds.
rng = np.random.default_rng(0)
d = 20
eigs = np.linspace(1.0, 10.0, d) # mu = 1, L = 10, so kappa = 10
H = np.diag(eigs)
mu, L = eigs.min(), eigs.max()
w_star = rng.standard_normal(d)
K = 16 # workers
sigma2 = 0.2 # per-step gradient-noise variance
lr = 0.05 # fixed step size
def f_gap(w):
diff = w - w_star
return 0.5 * diff @ (H @ diff)
def grad(w): # true gradient + zero-mean noise
return H @ (w - w_star) + np.sqrt(sigma2) * rng.standard_normal(d)
def rounds_to_eps(local_steps, epsilon, max_rounds=400000):
w = np.zeros(d)
rounds = 0
while f_gap(w) > epsilon and rounds < max_rounds:
states = [w.copy() for _ in range(K)]
for _ in range(local_steps): # H local steps on each worker
for k in range(K):
states[k] -= lr * grad(states[k])
w = np.mean(states, axis=0) # all-reduce / averaging = one round
rounds += 1
return rounds, f_gap(w)
epsilon = 5e-3
print(f"target epsilon = {epsilon:.0e}, K = {K} workers, kappa = {L/mu:.0f}\n")
print(f"{'local_steps_H':>14} {'comm_rounds':>12} {'grad_evals/worker':>18} {'final_gap':>12}")
for Hsteps in [1, 2, 4, 8, 16, 32]:
r, g = rounds_to_eps(Hsteps, epsilon)
label = "1 (full-comm)" if Hsteps == 1 else str(Hsteps)
print(f"{label:>14} {r:>12d} {r*Hsteps:>18d} {g:>12.2e}")
# Lower-bound reference for mu-strongly-convex, L-smooth first-order optimization.
kappa = L / mu
lb = np.sqrt(kappa) * np.log(1.0 / epsilon)
print(f"\nlower-bound rounds reference : ~{lb:.0f} (Omega(sqrt(kappa) log(1/eps)))")
local_steps, the number of local iterations $H$ between averaging rounds.target epsilon = 5e-03, K = 16 workers, kappa = 10
local_steps_H comm_rounds grad_evals/worker final_gap
1 (full-comm) 26 26 4.17e-03
2 17 34 4.59e-03
4 9 36 3.20e-03
8 5 40 4.07e-03
16 3 48 3.14e-03
32 1 32 3.20e-03
lower-bound rounds reference : ~17 (Omega(sqrt(kappa) log(1/eps)))
The numbers reproduce the whole story of this section in one table. Rounds shrink roughly in proportion to $H$, the formal trade-off of Section 3; the gradient-evaluation column shows that the rounds are bought with computation, not magic; and full-communication SGD already operates within a small factor of the lower-bound reference, confirming that for a well-conditioned problem there is little frontier left to chase. The local-step rows reach the target with fewer rounds because the extra local computation does real optimization work, but none of them violates the floor, exactly as the theory of Section 4 requires.
Who: A distributed-systems engineer at a startup training a mid-size language model across two data centers on different continents.
Situation: The cross-continent link had a 140-millisecond round-trip, so every synchronous all-reduce round paid a latency $\alpha$ that dwarfed the gradient compute time.
Problem: The team had spent two weeks tuning gradient compression to shrink each message, but wall-clock training barely moved, because the bottleneck was the number of rounds, not the bits per round.
Dilemma: Keep investing in compression on the bits axis, or switch effort to the rounds axis with local-update steps, which risked a small loss in final accuracy.
Decision: They estimated the round floor from $\sqrt{\kappa}\,\log(1/\epsilon)$, saw their synchronous run was using far more rounds than the floor demanded, and adopted local-SGD with $H$ chosen to cut rounds toward that floor.
How: Each worker took several dozen local steps between averaging rounds, communicating roughly thirty times less often, with a small slowdown in per-round progress that the analysis of Section 3 predicted.
Result: Wall-clock training time fell by more than half, dominated now by computation rather than latency, with a negligible accuracy gap. Further compression work was shelved as near-frontier and not worth the engineering.
Lesson: Identify which axis binds before optimizing. When latency dominates, rounds are the resource; the lower bound tells you how few rounds are even possible, and how much room is left to win.
Code 10.9.1 builds the local-step averaging by hand to expose the rounds. In production you do not maintain per-worker state and average it yourself; a framework runs the local steps and the periodic all-reduce for you. PyTorch ships a post-local-SGD optimizer hook that takes $H$ local steps between communications, collapsing the entire round-management loop to a few lines of setup:
# Run with: torchrun --nproc_per_node=16 thisfile.py
import torch.distributed as dist
import torch.distributed.algorithms.model_averaging.averagers as averagers
dist.init_process_group("nccl") # join the K workers
averager = averagers.PeriodicModelAverager(period=8, warmup_steps=100) # H = 8
for step, batch in enumerate(loader):
loss = model(batch).loss
loss.backward()
optimizer.step() # local step, no communication
averager.average_parameters(model.parameters()) # all-reduce only every 8th step
period=8 setting. The dozen-line manual averaging loop collapses to a configured averager, and the library handles the process group, the every-$H$-th all-reduce, and the warmup that keeps early training synchronous.6. Where the Bound Leaves Us Beginner
The communication lower bound is the theoretical backbone the rest of this chapter rests on. It explains why all-reduce SGD cannot be made free, why local updates and compression help but eventually run out of room, and why large-batch methods are partly a fight over the conditioning term $\kappa$ that sets the floor. It also reframes the systems chapters ahead: the topology-aware collectives of Chapter 4 and the sharded communication schedules of Chapter 16 are engineering that drives the constant factor down toward the bound, not a way around it. Knowing the floor turns "communicate less" from an open-ended wish into a question with an answer.
The current frontier is about reaching the $\Omega(\sqrt{\kappa}\,\log(1/\epsilon))$ round floor under harsher conditions than the clean convex theory assumes. DiLoCo and its successors (Douillard et al., 2024) push local-update training to genuinely geo-distributed and over-the-internet settings, communicating hundreds of times less often than synchronous SGD while staying close to the same loss, and 2025 streaming-DiLoCo variants overlap the rare communication with computation to shave the remaining latency. In parallel, decentralized and graph-based analyses tighten the lower bounds themselves: Scaman-style results make the round count depend on the network's spectral gap, so a poorly connected topology raises the floor, and recent work derives matching upper-bound algorithms that provably hit it. A separate thread proves lower bounds for the nonconvex, deep-learning-relevant regime, where the clean $\sqrt{\kappa}$ scaling is replaced by bounds in terms of gradient smoothness and noise. The unifying theme is that the bound is no longer just a cautionary theorem; it is a design target that 2024 to 2026 methods are explicitly engineered to meet.
Every distributed-training team eventually has the meeting where someone proposes squeezing communication a little further, then someone else proposes squeezing it further still, and the whiteboard fills with schemes. The lower bound is the adult in the room. It does not say your schemes are bad; it says there is a number below which all of them fail at once, and it hands you that number. The first time you compute $\sqrt{\kappa}\,\log(1/\epsilon)$ for your actual problem and discover you are already within a factor of two of it, the meeting ends early, and that is the theorem doing its job.
With the floor in hand, the chapter's final section steps back from any single method and asks how convergence speed, communication cost, and the practical realities of a cluster combine into the choices an engineer actually makes. That synthesis, weighing the rate of Section 10.2 against the costs of this section, is the subject of Section 10.10.
A team reports that their gradient-compression scheme cut bytes per all-reduce by 16x but barely changed wall-clock training time, while a colleague's local-SGD change with $H = 50$ cut wall-clock time in half. Using the cost model $T_{\text{comm}} \approx R(\alpha + \beta\,b(P))$ from Section 1, explain which axis each change attacks and why, on a high-latency network where $\alpha$ dominates, the local-SGD change won. State what kind of network would have made the compression change the more effective of the two.
Modify Code 10.9.1 to sweep the condition number by changing the top eigenvalue (try $L \in \{5, 10, 20, 40\}$ with $\mu = 1$). For each $L$, record the full-communication round count and the lower-bound reference $\sqrt{\kappa}\,\log(1/\epsilon)$, and plot both against $\kappa$. Confirm that the measured round count grows with $\kappa$ in the same direction as the bound, and discuss why the constant factor between them is not exactly one. Then push the local-step count $H$ very large (say 256) and explain, from the final-gap column, what eventually stops local-SGD from reaching $\epsilon$ in a single round.
You are training a model where measurements suggest an effective condition number $\kappa \approx 10^4$ and you target $\epsilon = 10^{-6}$ relative error. Using $R = \sqrt{\kappa}\,\log(1/\epsilon)$ as the round floor, estimate the minimum number of communication rounds any first-order method could need. If your synchronous run currently uses $2\times10^5$ rounds, by what factor are you above the floor, and is local stepping or preconditioning the better lever to pursue first? Now suppose accelerated methods are unavailable and you must use the non-accelerated $O(\kappa\log(1/\epsilon))$ rate: recompute the floor and comment on how much the loss of acceleration costs you in rounds.