Part III: Distributed Machine Learning
Chapter 10: Distributed Optimization

Communication Complexity and Lower Bounds

"They kept asking me to send fewer bits. I asked them how few. Nobody had read the theorem that already answered them."

An All-Reduce That Has Seen Some Gradients
Big Picture

Distributed optimization spends two distinct resources, computation and communication, and there is a hard floor on how little communication any algorithm can use to reach a given accuracy. Everything earlier in this chapter, synchronous and asynchronous SGD, all-reduce, stale gradients, gradient compression, and local updates, can be read as moves in one game: reach error $\epsilon$ while exchanging as few rounds and as few bits as possible. This section makes the resource explicit. We count communication rounds and bits per round, not just floating-point operations; we write the convergence rate of SGD as a function of iterations and gradient noise and see exactly how distribution changes its constants; we state the communication-computation trade-off of local-update methods as a formal object; and we state, at a teaching level, the lower bounds proving you cannot communicate arbitrarily little. The conclusion that governs the rest of Part III and Part IV: clever systems tricks move you toward the bound, never past it.

The previous sections of this chapter treated communication as a cost to be reduced by engineering: overlap it with computation (Section 10.5), compress it (Section 10.7), or amortize it across several local steps (also Section 10.7). Each of those is a tactic. This section supplies the strategy by asking the question that bounds them all: what is the least communication any method could possibly get away with? Once you can write down that floor, every tactic gets a yardstick. A compression scheme that halves the bytes is good; a local-update schedule that needs ten times fewer rounds is better; but neither can drop below the floor that the structure of the problem imposes, and knowing where the floor is tells you when to stop optimizing and start accepting.

1. Two Resources, Not One: Rounds and Bits Beginner

In single-machine optimization the only resource that matters is computation, usually counted in gradient evaluations or floating-point operations. The moment the work is distributed, a second resource appears and quickly dominates: communication. We measure it along two independent axes. The first is the number of communication rounds, meaning how many times the workers must synchronize, each round paying the network's fixed startup latency $\alpha$. The second is the number of bits per round, meaning how large each synchronized message is, paying the inverse-bandwidth cost $\beta$ per word. This is exactly the $\alpha$-$\beta$ cost model introduced in Chapter 3 and applied to collectives in Chapter 4; here we use it to score whole optimization algorithms rather than single messages.

The two axes call for different remedies, and confusing them is a common mistake. Gradient compression (Section 10.7) attacks the bits-per-round axis: it shrinks each message while keeping the number of rounds fixed. Local-update methods attack the rounds axis: they keep each message full size but communicate far less often. A method can be excellent on one axis and useless on the other, so an honest accounting always reports both. The cost of running an iterative optimizer for $R$ rounds, each exchanging a $P$-word model, is approximately

$$T_{\text{comm}} \;\approx\; R \,\bigl(\alpha + \beta \, b(P)\bigr),$$

where $b(P)$ is the bits actually sent per round after any compression. Reducing $R$ and reducing $b(P)$ are separate levers, and the lower bounds we reach in Section 4 constrain them separately.

Key Insight: Communication Has Its Own Complexity, Counted in Rounds

The cost of a distributed optimizer is not its FLOP count; it is the number of synchronization rounds times the bits per round. Two algorithms can do identical gradient work yet differ by orders of magnitude in wall-clock time because one synchronizes every step and the other synchronizes every hundredth step. When you analyze a distributed method, count rounds first: latency $\alpha$ is paid once per round regardless of message size, and on a real cluster $\alpha$ is often the term that dominates for the small, frequent messages that synchronous SGD produces.

2. How Distribution Reshapes the SGD Convergence Rate Intermediate

To talk about "communication needed to reach accuracy $\epsilon$" we first need the rate at which stochastic gradient descent reaches $\epsilon$ at all. For a convex objective optimized by SGD with stochastic gradients of variance bounded by $\sigma^2$, the classic result is that after $T$ iterations the expected sub-optimality decays as

$$\mathbb{E}\bigl[f(\bar w_T) - f^\star\bigr] \;=\; O\!\left(\frac{1}{\sqrt{T}}\right) \;+\; O\!\left(\frac{\sigma^2}{\sqrt{T}}\right),$$

so two terms govern the error: an optimization term that shrinks as you take more steps, and a noise term set by the variance $\sigma^2$ of the gradient estimates. This is the rate that Section 10.2 established for mini-batch SGD on one machine. The crucial fact for distribution is what happens to $\sigma^2$ when $K$ workers each compute a gradient on independent data and average the results: the variance of the averaged gradient falls like $\sigma^2 / K$. More workers do not change the shape of the rate; they shrink the constant in front of the noise term. This linear variance reduction is the entire statistical reason to use many workers, and it is the same averaging identity that made data parallelism exact in Section 1.1.

For a $\mu$-strongly-convex, $L$-smooth objective the picture sharpens into a form we can turn into a round count. Writing the condition number as $\kappa = L/\mu$, full-batch gradient descent reaches error $\epsilon$ in $O\!\left(\kappa \log(1/\epsilon)\right)$ iterations, and accelerated (momentum) methods improve this to $O\!\left(\sqrt{\kappa}\,\log(1/\epsilon)\right)$. In a distributed setting where every iteration requires one synchronization, those iteration counts are the round counts. Distribution helps the noise constant through the $\sigma^2/K$ term, but the logarithmic-in-$\epsilon$, square-root-in-$\kappa$ dependence on rounds does not vanish: it is the part that the lower bounds of Section 4 prove is irreducible. The benefit of more workers is real but bounded, exactly the "when distribution helps and when it hurts" tension first raised in Section 1.1 and quantified by the performance models of Chapter 3.

3. The Communication-Computation Trade-Off as a Formal Object Intermediate

Local-update methods, the local-SGD family of Section 10.7, make a precise bargain: each worker takes $H$ local gradient steps before the workers average their models, so one communication round now covers $H$ iterations of computation. If a fully synchronous run needs $T$ iterations, the local-update run needs only about $T/H$ rounds, an $H$-fold reduction in the rounds axis. That is not free. Letting workers drift for $H$ steps before reconciling introduces a discrepancy between their local models, which inflates the effective noise and slows convergence per round; the analysis shows the optimization term degrades in a way that grows with $H$. The trade-off is therefore a genuine formal object, not a vibe: rounds scale as $T/H$ while the per-round progress shrinks, and there is an optimal $H$ that minimizes wall-clock time for a given network. Picking $H$ too small wastes latency on needless rounds; picking it too large wastes computation chasing a drifting average.

Accuracy reached versus communication rounds communication rounds R (more to the right) accuracy (1 / error, higher is better) lower-bound frontier (no method reaches below) infeasible region: too few rounds for this accuracy full-comm SGD ~26 rounds, near the frontier local-step SGD, H=8 ~5 rounds, same accuracy H=32 ~1 round, more compute/round
Figure 10.9.1: The achievable region (shaded) of accuracy versus communication rounds, bounded below by the lower-bound frontier (dashed orange). No algorithm can reach a point below the frontier: that is the infeasible region. Full-communication SGD sits as a point near the frontier at many rounds; local-step SGD variants move leftward into the region, reaching the same accuracy with far fewer rounds at the cost of more computation per round. The arrow traces the trajectory the demo in Code 10.9.1 measures as the local-step count $H$ rises.

Figure 10.9.1 plots this as a region. The horizontal axis is communication rounds, the vertical axis is accuracy reached, and the dashed frontier is the lower bound: no point below it is achievable by any algorithm. Full-communication SGD lives near the frontier but pays for it in rounds; local-update methods slide left into the interior, buying fewer rounds with extra computation, but they too are stopped by the same frontier. The demo below measures real points in this picture.

4. Lower Bounds: You Cannot Communicate Arbitrarily Little Advanced

The optimistic reading of Section 3 is that clever scheduling can drive the round count as low as you like. It cannot, and the reason is a line of results on the communication complexity of distributed and convex optimization. The central message, stated at a teaching level without the full proofs, is an existence claim about a floor: for the class of $L$-smooth, $\mu$-strongly-convex problems split across machines whose data is connected over a network, any first-order algorithm that reaches accuracy $\epsilon$ must perform at least

$$R \;=\; \Omega\!\left(\sqrt{\kappa}\,\log\tfrac{1}{\epsilon}\right)$$

communication rounds, where $\kappa = L/\mu$ is the condition number. The $\log(1/\epsilon)$ factor says high accuracy fundamentally costs more rounds; the $\sqrt{\kappa}$ factor says badly conditioned problems fundamentally cost more rounds; and the $\Omega$ says this is a floor, not a target. No amount of overlap, compression, or local stepping removes it, because it follows from how slowly information about the global optimum can propagate through the workers, independent of implementation. Companion results bound the bits: reaching $\epsilon$ also requires a minimum total number of bits exchanged, so even infinitely many rounds of one-bit messages cannot win. These bounds descend from the convex-optimization complexity theory of Nemirovski and Yudin and were sharpened for the explicitly distributed setting by Tsitsiklis and Luo and, more recently, by Arjevani and Shamir and by Scaman and colleagues, whose graph-dependent bounds tie the round count to the network's spectral gap.

Key Insight: Systems Tricks Approach the Bound, They Do Not Beat It

Every communication-reduction technique in this chapter, overlap, compression, sparsification, local updates, moves a real algorithm closer to the lower-bound frontier. None of them moves the frontier. The practical consequence is liberating rather than discouraging: once a method is operating within a small constant factor of $\Omega(\sqrt{\kappa}\,\log(1/\epsilon))$ rounds, further engineering on the communication axis has almost nothing left to win, and your effort is better spent on the problem's conditioning $\kappa$ (preconditioning, normalization) or on accepting a looser $\epsilon$. The bound tells you when to stop.

Two practical levers do remain, and they are exactly the terms in the bound. Improving the conditioning $\kappa$, through preconditioning or normalization, lowers the floor itself, which is why so much of large-batch training (Section 10.8) is really about keeping $\kappa$ in check. And relaxing the target $\epsilon$ moves you off the steep part of the $\log(1/\epsilon)$ curve. What you cannot do is hold $\kappa$ and $\epsilon$ fixed and wish the round count below the frontier.

5. Measuring the Trade-Off and the Frontier Intermediate

The code below makes the abstract frontier concrete on a strongly-convex quadratic with stochastic gradients. It runs distributed SGD across $K = 16$ workers and measures the number of communication rounds needed to reach a fixed accuracy $\epsilon$, sweeping the local-step count $H$ from full communication ($H = 1$) up to $H = 32$. For each setting it also reports the gradient evaluations per worker, so the computation paid for fewer rounds is visible. Finally it prints the lower-bound reference $\sqrt{\kappa}\,\log(1/\epsilon)$, the floor that no row can cross.

import numpy as np

# Minimize the strongly-convex quadratic f(w) = 0.5 (w - w*)^T H (w - w*) by SGD.
# Each stochastic gradient carries zero-mean noise of variance sigma^2. K workers
# average their iterates at every COMMUNICATION ROUND. Local-step SGD takes H local
# steps between rounds, trading more computation per round for fewer rounds.

rng = np.random.default_rng(0)
d = 20
eigs = np.linspace(1.0, 10.0, d)          # mu = 1, L = 10, so kappa = 10
H = np.diag(eigs)
mu, L = eigs.min(), eigs.max()
w_star = rng.standard_normal(d)
K = 16                                     # workers
sigma2 = 0.2                               # per-step gradient-noise variance
lr = 0.05                                  # fixed step size

def f_gap(w):
    diff = w - w_star
    return 0.5 * diff @ (H @ diff)

def grad(w):                               # true gradient + zero-mean noise
    return H @ (w - w_star) + np.sqrt(sigma2) * rng.standard_normal(d)

def rounds_to_eps(local_steps, epsilon, max_rounds=400000):
    w = np.zeros(d)
    rounds = 0
    while f_gap(w) > epsilon and rounds < max_rounds:
        states = [w.copy() for _ in range(K)]
        for _ in range(local_steps):       # H local steps on each worker
            for k in range(K):
                states[k] -= lr * grad(states[k])
        w = np.mean(states, axis=0)        # all-reduce / averaging = one round
        rounds += 1
    return rounds, f_gap(w)

epsilon = 5e-3
print(f"target epsilon = {epsilon:.0e},  K = {K} workers,  kappa = {L/mu:.0f}\n")
print(f"{'local_steps_H':>14} {'comm_rounds':>12} {'grad_evals/worker':>18} {'final_gap':>12}")
for Hsteps in [1, 2, 4, 8, 16, 32]:
    r, g = rounds_to_eps(Hsteps, epsilon)
    label = "1 (full-comm)" if Hsteps == 1 else str(Hsteps)
    print(f"{label:>14} {r:>12d} {r*Hsteps:>18d} {g:>12.2e}")

# Lower-bound reference for mu-strongly-convex, L-smooth first-order optimization.
kappa = L / mu
lb = np.sqrt(kappa) * np.log(1.0 / epsilon)
print(f"\nlower-bound rounds reference : ~{lb:.0f}  (Omega(sqrt(kappa) log(1/eps)))")
Code 10.9.1: Rounds-to-accuracy for full-communication versus local-step SGD on a conditioned quadratic, with the lower-bound reference computed in the same run. The only knob that changes between rows is local_steps, the number of local iterations $H$ between averaging rounds.
target epsilon = 5e-03,  K = 16 workers,  kappa = 10

 local_steps_H  comm_rounds  grad_evals/worker    final_gap
 1 (full-comm)           26                 26     4.17e-03
             2           17                 34     4.59e-03
             4            9                 36     3.20e-03
             8            5                 40     4.07e-03
            16            3                 48     3.14e-03
            32            1                 32     3.20e-03

lower-bound rounds reference : ~17  (Omega(sqrt(kappa) log(1/eps)))
Output 10.9.1: Communication rounds fall steeply as the local-step count rises, from 26 rounds at full communication to a single round at $H = 32$, while every variant still reaches the target $\epsilon = 5\times10^{-3}$. The price is in the third column: gradient evaluations per worker climb from 26 to as high as 48, the extra computation that buys the rounds. Full-communication SGD sits at the same order of magnitude as the $\sqrt{\kappa}\,\log(1/\epsilon)$ floor of about 17 rounds, the near-frontier point in Figure 10.9.1.

The numbers reproduce the whole story of this section in one table. Rounds shrink roughly in proportion to $H$, the formal trade-off of Section 3; the gradient-evaluation column shows that the rounds are bought with computation, not magic; and full-communication SGD already operates within a small factor of the lower-bound reference, confirming that for a well-conditioned problem there is little frontier left to chase. The local-step rows reach the target with fewer rounds because the extra local computation does real optimization work, but none of them violates the floor, exactly as the theory of Section 4 requires.

Practical Example: The Geo-Distributed Run That Stopped Optimizing the Wrong Axis

Who: A distributed-systems engineer at a startup training a mid-size language model across two data centers on different continents.

Situation: The cross-continent link had a 140-millisecond round-trip, so every synchronous all-reduce round paid a latency $\alpha$ that dwarfed the gradient compute time.

Problem: The team had spent two weeks tuning gradient compression to shrink each message, but wall-clock training barely moved, because the bottleneck was the number of rounds, not the bits per round.

Dilemma: Keep investing in compression on the bits axis, or switch effort to the rounds axis with local-update steps, which risked a small loss in final accuracy.

Decision: They estimated the round floor from $\sqrt{\kappa}\,\log(1/\epsilon)$, saw their synchronous run was using far more rounds than the floor demanded, and adopted local-SGD with $H$ chosen to cut rounds toward that floor.

How: Each worker took several dozen local steps between averaging rounds, communicating roughly thirty times less often, with a small slowdown in per-round progress that the analysis of Section 3 predicted.

Result: Wall-clock training time fell by more than half, dominated now by computation rather than latency, with a negligible accuracy gap. Further compression work was shelved as near-frontier and not worth the engineering.

Lesson: Identify which axis binds before optimizing. When latency dominates, rounds are the resource; the lower bound tells you how few rounds are even possible, and how much room is left to win.

Library Shortcut: Local-Update Rounds Without Writing the Averaging Loop

Code 10.9.1 builds the local-step averaging by hand to expose the rounds. In production you do not maintain per-worker state and average it yourself; a framework runs the local steps and the periodic all-reduce for you. PyTorch ships a post-local-SGD optimizer hook that takes $H$ local steps between communications, collapsing the entire round-management loop to a few lines of setup:

# Run with: torchrun --nproc_per_node=16 thisfile.py
import torch.distributed as dist
import torch.distributed.algorithms.model_averaging.averagers as averagers

dist.init_process_group("nccl")                    # join the K workers
averager = averagers.PeriodicModelAverager(period=8, warmup_steps=100)  # H = 8

for step, batch in enumerate(loader):
    loss = model(batch).loss
    loss.backward()
    optimizer.step()                               # local step, no communication
    averager.average_parameters(model.parameters())  # all-reduce only every 8th step
Code 10.9.2: The same rounds-versus-computation trade-off as Output 10.9.1, now as a one-line period=8 setting. The dozen-line manual averaging loop collapses to a configured averager, and the library handles the process group, the every-$H$-th all-reduce, and the warmup that keeps early training synchronous.

6. Where the Bound Leaves Us Beginner

The communication lower bound is the theoretical backbone the rest of this chapter rests on. It explains why all-reduce SGD cannot be made free, why local updates and compression help but eventually run out of room, and why large-batch methods are partly a fight over the conditioning term $\kappa$ that sets the floor. It also reframes the systems chapters ahead: the topology-aware collectives of Chapter 4 and the sharded communication schedules of Chapter 16 are engineering that drives the constant factor down toward the bound, not a way around it. Knowing the floor turns "communicate less" from an open-ended wish into a question with an answer.

Research Frontier: Closing the Gap to the Round Floor (2024 to 2026)

The current frontier is about reaching the $\Omega(\sqrt{\kappa}\,\log(1/\epsilon))$ round floor under harsher conditions than the clean convex theory assumes. DiLoCo and its successors (Douillard et al., 2024) push local-update training to genuinely geo-distributed and over-the-internet settings, communicating hundreds of times less often than synchronous SGD while staying close to the same loss, and 2025 streaming-DiLoCo variants overlap the rare communication with computation to shave the remaining latency. In parallel, decentralized and graph-based analyses tighten the lower bounds themselves: Scaman-style results make the round count depend on the network's spectral gap, so a poorly connected topology raises the floor, and recent work derives matching upper-bound algorithms that provably hit it. A separate thread proves lower bounds for the nonconvex, deep-learning-relevant regime, where the clean $\sqrt{\kappa}$ scaling is replaced by bounds in terms of gradient smoothness and noise. The unifying theme is that the bound is no longer just a cautionary theorem; it is a design target that 2024 to 2026 methods are explicitly engineered to meet.

Fun Note: The Theorem That Ends the Argument

Every distributed-training team eventually has the meeting where someone proposes squeezing communication a little further, then someone else proposes squeezing it further still, and the whiteboard fills with schemes. The lower bound is the adult in the room. It does not say your schemes are bad; it says there is a number below which all of them fail at once, and it hands you that number. The first time you compute $\sqrt{\kappa}\,\log(1/\epsilon)$ for your actual problem and discover you are already within a factor of two of it, the meeting ends early, and that is the theorem doing its job.

With the floor in hand, the chapter's final section steps back from any single method and asks how convergence speed, communication cost, and the practical realities of a cluster combine into the choices an engineer actually makes. That synthesis, weighing the rate of Section 10.2 against the costs of this section, is the subject of Section 10.10.

Exercise 10.9.1: Read the Two Axes Conceptual

A team reports that their gradient-compression scheme cut bytes per all-reduce by 16x but barely changed wall-clock training time, while a colleague's local-SGD change with $H = 50$ cut wall-clock time in half. Using the cost model $T_{\text{comm}} \approx R(\alpha + \beta\,b(P))$ from Section 1, explain which axis each change attacks and why, on a high-latency network where $\alpha$ dominates, the local-SGD change won. State what kind of network would have made the compression change the more effective of the two.

Exercise 10.9.2: Push the Frontier and Watch It Push Back Coding

Modify Code 10.9.1 to sweep the condition number by changing the top eigenvalue (try $L \in \{5, 10, 20, 40\}$ with $\mu = 1$). For each $L$, record the full-communication round count and the lower-bound reference $\sqrt{\kappa}\,\log(1/\epsilon)$, and plot both against $\kappa$. Confirm that the measured round count grows with $\kappa$ in the same direction as the bound, and discuss why the constant factor between them is not exactly one. Then push the local-step count $H$ very large (say 256) and explain, from the final-gap column, what eventually stops local-SGD from reaching $\epsilon$ in a single round.

Exercise 10.9.3: Estimate Your Own Floor Analysis

You are training a model where measurements suggest an effective condition number $\kappa \approx 10^4$ and you target $\epsilon = 10^{-6}$ relative error. Using $R = \sqrt{\kappa}\,\log(1/\epsilon)$ as the round floor, estimate the minimum number of communication rounds any first-order method could need. If your synchronous run currently uses $2\times10^5$ rounds, by what factor are you above the floor, and is local stepping or preconditioning the better lever to pursue first? Now suppose accelerated methods are unavailable and you must use the non-accelerated $O(\kappa\log(1/\epsilon))$ rate: recompute the floor and comment on how much the loss of acceleration costs you in rounds.