Part III: Distributed Machine Learning
Chapter 10: Distributed Optimization

Convergence and Practical Trade-Offs

"I can converge fast, run fast, or talk little. Pick two, and explain to the budget why I could not pick three."

An Optimizer Reading Its Own Profile
Big Picture

Every distributed optimization method this chapter built is a different point on the same three-way trade-off, and choosing one is choosing which of three quantities you are willing to spend. The first is statistical efficiency: how much the loss drops per gradient step. The second is hardware throughput: how many steps the cluster completes per second. The third is communication cost: how many bytes and synchronization rounds each step demands. Synchronous all-reduce SGD maximizes statistical efficiency and exactness but pays a barrier on every step. Asynchronous SGD maximizes throughput but spends statistical efficiency on stale gradients. Local-update and compressed methods spend the fewest rounds but accept a small statistical penalty. No method wins all three, so the practical skill is reading which quantity binds for your workload and picking the method that protects it. This closing section turns the whole chapter into one decision framework, makes the trade-off concrete with a runnable measurement, and folds the chapter into a single takeaway.

The preceding sections each optimized for one corner of the problem. Synchronous distributed SGD (Section 10.3) and all-reduce aggregation (Section 10.5) gave us a step that is mathematically identical to single-machine SGD, at the cost of a barrier where every worker waits for the slowest. Asynchronous SGD (Section 10.4) removed the barrier and let fast workers run ahead, at the cost of gradients computed on stale weights (Section 10.6). Communication-efficient methods (Section 10.7) cut the bytes and the rounds through compression and local updates, at the cost of a small drift from the exact trajectory. Large-batch scaling (Section 10.8) and the communication lower bounds (Section 10.9) told us how far each lever can be pushed before it stops paying. Standing back, these are not five unrelated techniques; they are five settings of the same three dials, and this section gives you the framework to set them deliberately.

Statistical efficiency spacer (progress per step) Low communication (few bytes, few rounds) Hardware throughput (steps per second) Synchronous all-reduce exact, barrier cost Asynchronous SGD no barrier, staleness Local-update / compressed fewest rounds, slight drift
Figure 10.10.1: The distributed-optimization trade-off triangle. The three corners are the quantities in tension: statistical efficiency (loss drop per step), hardware throughput (steps per second), and low communication (bytes and rounds per step). Each method sits near the corner it protects: synchronous all-reduce near exactness, asynchronous SGD near throughput, local-update and compressed methods near low communication. No method reaches all three corners at once, which is the whole point of the framework.

Figure 10.10.1 is the mental model for the rest of the section. Read a method's position as a statement about which corner it sacrifices. Synchronous all-reduce sits high, near maximum statistical efficiency, because every step uses the exact full-batch-equivalent gradient (Section 10.5), but it sits far from the throughput corner because the barrier idles fast workers behind slow ones. Asynchronous SGD slides toward throughput because nobody waits, but it falls away from statistical efficiency because stale gradients waste some of each step. Local-update and compressed methods sit near the low-communication corner because they touch the network rarely, accepting a small drop in per-step progress in exchange. The art is knowing which corner your workload cannot afford to leave.

1. The Three Quantities in Tension Intermediate

Make the three quantities precise, because a framework you cannot measure is a slogan. The wall-clock time to reach a target loss factors cleanly into the three dials. Write $T$ for total training time, $S$ for the number of optimization steps needed to reach the target, and $t_{\text{step}}$ for the time per step. Then

$$T \;=\; \underbrace{S}_{\text{statistical efficiency}} \;\times\; \underbrace{t_{\text{step}}}_{\text{throughput}}, \qquad t_{\text{step}} \;=\; t_{\text{compute}} + t_{\text{comm}}.$$

Statistical efficiency is $1/S$: a method that needs fewer steps to hit the target loss is more statistically efficient. Hardware throughput is $1/t_{\text{step}}$: a method that completes each step faster has higher throughput. Communication cost lives inside $t_{\text{comm}}$, the part of each step spent moving bytes and waiting at barriers, and the communication lower bounds of Section 10.9 set the floor on how small it can be for a given accuracy. The tension is now visible as algebra: synchronous SGD minimizes $S$ (exact gradients, fewest steps) but inflates $t_{\text{comm}}$ (a barrier every step); asynchronous SGD shrinks $t_{\text{comm}}$ to a one-sided push and raises throughput, but inflates $S$ because stale gradients make each step less productive; local-update SGD takes $H$ cheap local steps between communications, cutting the number of expensive rounds by a factor of $H$ while letting the workers drift slightly apart, which nudges $S$ up.

Key Insight: You Are Always Trading Steps Against Step-Time

Total training time is the product of how many steps you need and how long each step takes. Every method in this chapter improves one factor at the expense of the other. Synchronous all-reduce minimizes the step count by keeping the gradient exact, but pays a barrier that inflates step time. Asynchronous and local-update methods slash step time by touching the network less or never waiting, but they spend statistical efficiency, so they need more steps. The right method is the one whose gain on the binding factor outweighs its loss on the other, which is an empirical question about your interconnect, your batch size, and your model, not a universal ranking.

2. A Runnable Comparison Intermediate

Abstract trade-offs become convincing only when you can see the numbers move together. The demo below runs synchronous, asynchronous, and local-step SGD on one shared logistic-regression problem, splitting the data across eight simulated workers, and reports four quantities for each method: the final loss (statistical outcome), the number of synchronization rounds and the communication volume in floats (communication cost), and a simulated wall-clock in cost units (throughput outcome). The wall-clock model is deliberately simple: every local step costs one compute unit, a synchronous barrier-plus-transport costs four units, an asynchronous one-sided push costs only a fraction of that because there is no barrier, and local-step SGD pays the barrier only once every ten steps. The point is not the exact constants but the shape of the result.

import numpy as np

# A shared convex problem: logistic regression on synthetic data.
rng = np.random.default_rng(7)
N, d, K = 60_000, 30, 8
X = rng.standard_normal((N, d))
w_star = rng.standard_normal(d)
p = 1.0 / (1.0 + np.exp(-(X @ w_star)))
y = (rng.random(N) < p).astype(float)
shards = np.array_split(np.arange(N), K)

def loss(w):
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))   # stable mean logistic loss

def grad(idx, w):
    Xi, yi = X[idx], y[idx]
    s = 1.0 / (1.0 + np.exp(-(Xi @ w)))
    return Xi.T @ (s - yi) / len(idx)

# Cost model (cost units): a local step costs COMP; one synchronization round
# costs COMM (barrier + transport). BYTES_PER_ROUND counts floats moved per sync.
COMP, COMM = 1.0, 4.0
BYTES_PER_ROUND = K * d
EPOCH_STEPS, lr = 200, 0.5

def run_sync():
    w = np.zeros(d); rounds = comm = 0; clock = 0.0
    for t in range(EPOCH_STEPS):
        g = np.mean([grad(s, w) for s in shards], axis=0)   # all-reduce EVERY step
        w -= lr * g
        rounds += 1; comm += BYTES_PER_ROUND; clock += COMP + COMM
    return w, rounds, comm, clock

def run_async():
    w = np.zeros(d); rounds = comm = 0; clock = 0.0
    stale = [w.copy() for _ in range(K)]                    # each worker's stale view
    for t in range(EPOCH_STEPS):
        k = t % K
        g = grad(shards[k], stale[k])                       # gradient on STALE weights
        w = w - lr * g; stale[k] = w.copy()
        rounds += 1; comm += d; clock += COMP + 0.15 * COMM  # no barrier: light push
    return w, rounds, comm, clock

def run_local():
    H = 10                                                  # local steps between syncs
    w = np.zeros(d); rounds = comm = 0; clock = 0.0
    locals_ = [w.copy() for _ in range(K)]
    for t in range(EPOCH_STEPS):
        for k in range(K):
            locals_[k] = locals_[k] - lr * grad(shards[k], locals_[k])
        clock += COMP
        if (t + 1) % H == 0:                                # synchronize only every H
            w = np.mean(locals_, axis=0)
            locals_ = [w.copy() for _ in range(K)]
            rounds += 1; comm += BYTES_PER_ROUND; clock += COMM
    return np.mean(locals_, axis=0), rounds, comm, clock

print(f"{'method':<14}{'final loss':>12}{'sync rounds':>13}{'comm (floats)':>15}{'wall-clock':>12}")
for name, fn in [("synchronous", run_sync), ("asynchronous", run_async), ("local-step", run_local)]:
    w, rounds, cb, clock = fn()
    print(f"{name:<14}{loss(w):>12.5f}{rounds:>13}{cb:>15}{clock:>12.1f}")
Code 10.10.1: Three optimizers on one problem. All three minimize the same logistic loss over the same eight shards; they differ only in when they synchronize and on which weights each gradient is computed, so the reported rounds, communication volume, and simulated wall-clock isolate the cost of the synchronization choice.
method          final loss  sync rounds  comm (floats)  wall-clock
synchronous        0.27392          200          48000      1000.0
asynchronous       0.27316          200           6000       320.0
local-step         0.27392           20           4800       280.0
Output 10.10.1: The trade-off triangle in numbers. All three methods reach essentially the same loss (about $0.273$) on this well-conditioned convex problem, so statistical efficiency barely separates them here; what separates them sharply is cost. Synchronous SGD spends $48{,}000$ floats and $1000$ wall-clock units; asynchronous SGD cuts both by removing the barrier; local-step SGD reaches the same loss in only $20$ synchronization rounds and the lowest communication of all.

Output 10.10.1 makes Figure 10.10.1 concrete. On this benign convex problem the statistical-efficiency corner is nearly tied, so the three methods separate almost entirely on the other two axes, exactly as the framework predicts when the loss surface is forgiving. Synchronous SGD pays the most communication and the most wall-clock because it barriers on every one of its two hundred steps. Asynchronous SGD keeps the same step count but shrinks both communication and wall-clock because each push is one-sided and nobody waits. Local-step SGD synchronizes only once per ten steps, so it reaches the same loss with one tenth the rounds and the lowest communication volume in the table. The catch this small problem hides is that on a stiff, non-convex deep network the staleness of asynchronous SGD and the drift of local-step SGD would inflate the step count $S$, pulling those two methods down from the top corner; that is the statistical penalty the next section weighs against the savings shown here.

Fun Note: The Tortoise Synchronizes Once a Lap

Local-step SGD is the tortoise that beats the hare by refusing to check the leaderboard every stride. Each worker runs ten steps with its head down, then looks up once to average notes with the others. The hare, synchronous SGD, stops to compare with everyone after every single stride and spends most of the race standing still at the barrier. On a smooth track they finish at the same loss; the tortoise just spent a tenth of the time talking about it.

3. The Decision Guide Beginner

The framework becomes useful when it produces a default for the workload in front of you. Three properties of a workload decide which corner of the triangle you cannot leave: the density of the gradient, the speed and reach of the interconnect, and the conditioning of the loss. Table 10.10.1 turns these into recommendations. Read it as a starting point to be confirmed by measurement, never as a verdict, because the binding factor is ultimately empirical.

Table 10.10.1: A decision guide for distributed optimization. Match the binding constraint of the workload to the method that protects the corresponding corner of Figure 10.10.1.
WorkloadBinding constraintDefault methodWhy
Dense deep network on a fast interconnect (one datacenter, NVLink or InfiniBand)Statistical efficiency; communication is cheap and overlappableSynchronous all-reduce SGDExact gradients, barrier hides behind the backward pass; the default for Chapter 15
Sparse / embedding model (recommendation, large vocabularies)Throughput; only a few coordinates update per exampleAsynchronous SGD with a parameter serverUpdates rarely collide, so staleness is mild; no barrier wastes no time on idle shards
Geo-distributed or over-the-internet training (slow, high-latency links)Communication rounds; bandwidth is scarce and latency is highLocal-update SGD (local SGD, DiLoCo-style)Synchronizing once per $H$ steps cuts rounds by $H$, the dominant cost here
Bandwidth-bound dense training (large model, modest network)Bytes per roundGradient compression (quantization, sparsification, PowerSGD)Keeps the synchronous schedule but shrinks each message toward the Section 10.9 floor

The first row is the common case for the rest of Part IV: when the model is a dense deep network living in one datacenter, communication is cheap enough to overlap with the backward pass, so the barrier nearly vanishes and you keep the exact gradient that maximizes statistical efficiency. That is why data-parallel deep learning (Chapter 15) defaults to synchronous all-reduce and treats it as the baseline every other method must beat. The second and third rows are where the defaults flip: sparse and embedding models tolerate asynchrony because updates rarely collide, which is exactly the regime the parameter-server architecture of Chapter 11 was built for, and geo-distributed training cannot afford frequent rounds, so it leans on the local-update methods of Section 10.7.

Practical Example: Choosing the Optimizer for a Two-Continent Training Run

Who: An ML platform engineer pooling idle GPUs across a European and a North American datacenter for a budget pre-training run.

Situation: The two sites had spare capacity but were linked only by a commodity internet path: roughly 100 ms round-trip and a fraction of the bandwidth available inside either datacenter.

Problem: A standard synchronous all-reduce job, the team's default from in-datacenter training, spent more than half its wall-clock idle at the cross-ocean barrier, so adding the second site made training slower, not faster.

Dilemma: Keep synchronous all-reduce and accept that the slow link erases the extra GPUs, or switch to a method that synchronizes rarely and risk a hit to final accuracy from the workers drifting apart between rounds.

Decision: They switched to a local-update schedule, each site running many local steps and averaging models only periodically, because Table 10.10.1 says communication rounds are the binding constraint when the link is the bottleneck.

How: They set the local-step count so that one synchronization landed roughly every few hundred steps, mirroring the local-step path in Code 10.10.1 but with the round interval tuned to the measured link latency.

Result: Cross-site communication dropped by more than an order of magnitude, the second datacenter now contributed real speedup, and final accuracy stayed within a small margin of the synchronous baseline, the slight statistical cost the framework predicts.

Lesson: When the network is the ceiling, protect the low-communication corner. The exact gradient is worthless if the cluster spends its life waiting for it to cross an ocean.

4. How the Choice Interacts with Batch Size and the Communication Bound Advanced

The method you pick does not act alone; it couples to the batch size of Section 10.8 and to the lower bounds of Section 10.9, and ignoring those couplings is how good methods get tuned into bad ones. Batch size is the lever that moves a method along the throughput axis without changing the algorithm. A larger global batch means each worker processes more examples before the next synchronization, so the same number of barriers covers more data, raising the compute-to-communication ratio and pushing any synchronous method rightward toward the throughput corner. This is why large-batch training (Section 10.8) and synchronous all-reduce are natural partners: scaling the batch amortizes the barrier. The limit is statistical, not mechanical. Past a critical batch size the per-step progress stops growing with the batch, so adding more examples per step buys throughput that no longer converts into faster convergence, and the learning-rate scaling rules of Section 10.8 are what keep you below that ceiling.

The communication lower bounds of Section 10.9 set the other boundary. They establish a floor on the bytes and rounds any method must spend to reach a target accuracy, which means compression and local-update methods are not free lunches but movements toward a hard limit. A compression scheme that claims to send almost nothing is either close to the Section 10.9 floor, in which case it is near-optimal, or it is below the floor for the accuracy it promises, in which case the accuracy claim is wrong. Reading a new method against the bound tells you immediately whether its communication savings are physically possible or whether they have been borrowed from statistical efficiency without saying so. The framework, the batch-size lever, and the lower bound together let you predict where a method lands on Figure 10.10.1 before you run a single step.

Library Shortcut: One Flag Moves You Around the Triangle

You rarely implement these methods by hand; modern frameworks expose the corner you want as a configuration choice. The same model trains under any of the three regimes by swapping the wrapper or a single argument:

# Corner 1: synchronous all-reduce (exact, the dense-network default)
model = torch.nn.parallel.DistributedDataParallel(model)   # all-reduce per step

# Corner 2: low communication via gradient compression, schedule unchanged
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as ch
model.register_comm_hook(state=None, hook=ch.bf16_compress_hook)   # half the bytes

# Corner 3: local-update SGD (synchronize every H steps)
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager
averager = PeriodicModelAverager(period=H)   # workers average once per H local steps
Code 10.10.2: The manual synchronization logic of Code 10.10.1 collapses to a wrapper choice. DistributedDataParallel gives the synchronous corner, a communication hook moves toward low bytes, and PeriodicModelAverager gives the local-update corner; the library handles the bucketing, all-reduce scheduling, and averaging that the from-scratch loop spelled out by hand.
Research Frontier: Pushing the Low-Communication Corner (2024 to 2026)

The most active frontier in this triangle is the low-communication corner, where researchers are trying to reach in-datacenter convergence while synchronizing as rarely as a slow link allows. The DiLoCo line of local-update methods (Douillard et al., 2024) trains language models with an inner-outer optimizer that communicates only every few hundred steps, and follow-on work has pushed it toward genuinely streaming and over-the-internet settings where workers join and leave. Open collaborative runs in the spirit of DeMo and the INTELLECT distributed pre-training efforts (2024 to 2025) have trained billion-parameter models over commodity internet links, reporting communication reductions of one to two orders of magnitude against synchronous all-reduce with small accuracy gaps. In parallel, structured gradient compression in the lineage of PowerSGD and its low-rank and error-feedback descendants keeps tightening how close real schemes sit to the Section 10.9 lower bound. The throughline is that the trade-off triangle is not static: each result moves a method closer to a corner that was thought to require sacrificing another, which is precisely why a framework, rather than a fixed ranking, is the durable way to reason about these methods.

5. Chapter Summary Beginner

This chapter built distributed optimization from the ground up and arrived at a single organizing idea. We began with empirical risk minimization at scale (Section 10.1) and mini-batch SGD (Section 10.2), then distributed the gradient step three ways: synchronously with a barrier and exact all-reduce gradients (Sections 10.3 and 10.5), asynchronously without a barrier but with stale gradients (Sections 10.4 and 10.6), and communication-efficiently through compression and local updates (Section 10.7). Large-batch scaling (Section 10.8) showed how batch size trades throughput for statistical efficiency up to a critical limit, and the communication lower bounds (Section 10.9) set the floor that no method can undercut without paying in accuracy. This final section unified all of it into the trade-off triangle of Figure 10.10.1: statistical efficiency, hardware throughput, and communication cost are the three quantities in permanent tension, every method protects one corner at the expense of another, and the practitioner's job is to identify the binding constraint and pick accordingly. The runnable comparison in Code 10.10.1 made the trade-off measurable, and the decision guide in Table 10.10.1 turned it into defaults you can act on.

Thesis Thread: The Sync-vs-Async Dial Spins Through the Whole Book

The synchronous-versus-asynchronous choice you just learned to make is not local to optimization; it is a thread that began with coordination in Chapter 2 and runs forward through the book. The exact all-reduce gradient returns as the engine of data-parallel deep learning (Chapter 15); the bounded-staleness idea becomes the operating regime of parameter servers (Chapter 11); and the same sync-versus-async tension reappears as the choice between synchronous and asynchronous actor-learner architectures in distributed reinforcement learning (Chapter 20). Whenever a later chapter distributes an iterative computation, it is choosing a corner of this same triangle.

Key Takeaway

Distributed optimization is the management of one three-way trade-off: statistical efficiency (progress per step), hardware throughput (steps per second), and communication cost (bytes and rounds). No method wins all three. Synchronous all-reduce SGD buys exactness and statistical efficiency with a barrier; asynchronous SGD buys throughput with staleness; local-update and compressed methods buy low communication with a slight statistical penalty. Choose by identifying which quantity binds for your workload: dense deep learning on a fast interconnect favors synchronous all-reduce, while sparse, embedding, and geo-distributed training favor asynchronous or local-update methods. The batch size and the communication lower bound tell you how far each lever can move before it stops paying. The whole of Part III and Part IV is the working-out of these choices at scale.

Exercise 10.10.1: Place a Method on the Triangle Conceptual

For each of the following, state which corner of Figure 10.10.1 it protects and which it sacrifices, and name the workload from Table 10.10.1 it suits best: (a) 1-bit Adam, which quantizes each gradient coordinate to a single bit; (b) Hogwild-style lock-free asynchronous SGD on a sparse model; (c) local SGD with $H = 500$ local steps between averages. Explain why running method (c) on a stiff, ill-conditioned dense network would move it away from the corner it normally protects.

Exercise 10.10.2: Make the Triangle Bite Coding

Modify Code 10.10.1 so the loss surface is poorly conditioned: scale the columns of X by widely different factors (for example, multiply column $j$ by $10^{j/d}$) and reduce the learning rate accordingly. Rerun all three methods with a fixed step budget and report the final-loss column again. Confirm that the asynchronous and local-step methods now reach a worse loss than synchronous SGD for the same wall-clock, and explain how this realizes the statistical-efficiency penalty that the benign problem in Output 10.10.1 hid. Then increase the synchronization frequency of the local-step method until it recovers the synchronous loss, and report how much communication that recovery cost.

Exercise 10.10.3: Predict from the Cost Model Analysis

Using the factorization $T = S \times (t_{\text{compute}} + t_{\text{comm}})$, suppose synchronous SGD needs $S = 1000$ steps with $t_{\text{compute}} = 50$ ms and $t_{\text{comm}} = 200$ ms, while local-step SGD with $H = 20$ needs $S = 1150$ steps and pays the same $t_{\text{comm}}$ only once every $H$ steps. Compute the total wall-clock for each, decide which wins, and then find the value of $t_{\text{comm}}$ at which the two methods tie. Explain in one sentence what kind of interconnect that crossover corresponds to and connect your answer to the communication lower bound of Section 10.9.

Project Ideas

These open-ended projects turn the chapter's trade-off framework into something you build and measure. Each one starts from Code 10.10.1 or a small PyTorch training loop and grows into a study you could write up.

1. Reproduce a staleness-versus-convergence study. Implement asynchronous SGD with an explicit, controllable staleness bound $\tau$ (each gradient is computed on weights that are at most $\tau$ steps old) on a real model such as a small CNN on a standard image dataset. Sweep $\tau$ from $0$ (fully synchronous) upward and plot final accuracy and wall-clock against $\tau$. Identify the staleness budget beyond which accuracy collapses, and relate it to the bounded-staleness regime of Chapter 11.

2. Measure the compression trade-off against the lower bound. Take a synchronous data-parallel training run and apply progressively more aggressive gradient compression (top-$k$ sparsification, then quantization to fewer bits). For each setting, record the bytes per round and the steps to a target loss, and plot the resulting points on a bytes-versus-accuracy curve. Mark where the curve bends away from the communication lower bound of Section 10.9, which is the point where further compression starts costing accuracy faster than it saves bytes.

3. Build the triangle for your own cluster. Instrument a real multi-GPU or multi-node training job to log step count, per-step time, and communication volume for synchronous all-reduce, a compressed variant, and a local-update variant. Place all three as measured points on a reproduction of Figure 10.10.1 for your specific interconnect and model, and use the result to recommend a default method, then verify the recommendation by training to convergence.