"They asked me to send a billion numbers every step. I sent ten that mattered and a sticky note of what I owed them. Nobody noticed the difference except the network bill."
A Gradient That Learned to Pack Light
Synchronous distributed SGD is correct but expensive: every step ships a full gradient vector across the network, and for a billion-parameter model that traffic, not the arithmetic, is what limits the speedup from adding workers. Communication-efficient optimization attacks that tax along two independent dimensions. You can send fewer bytes per round, by quantizing each coordinate to a handful of bits, by transmitting only the few coordinates that matter, or by sending a low-rank sketch of the gradient. Or you can communicate fewer rounds, by letting each worker take several local optimizer steps before it synchronizes. Both directions trade a little per-step progress for a large drop in network cost, and a single trick, error feedback, recovers almost all of the lost progress by remembering what each compression threw away. This section makes that trade quantitative and shows a top-k compressor matching full-gradient training while sending a tenth of the numbers.
The previous section traced what happens when gradients arrive late: stale and delayed updates still converge, but only if the staleness stays bounded. That discussion took the size of each message for granted. Here we attack the size itself. The communication tax was named in Chapter 4, where the cost of a single all-reduce was shown to grow with the length of the vector being reduced; Chapter 3 turned that into the $\alpha$-$\beta$ model, in which the per-step communication time is $\alpha + \beta P$ for a $P$-parameter gradient. The arithmetic of a training step is fixed by the model, but the bandwidth term $\beta P$ is something we can engineer down, and for large models it dominates. Every method in this section is a way of shrinking either $P$ (bytes per round) or the number of rounds, while keeping the optimizer converging to the same place.
1. Quantization: Fewer Bits per Coordinate Intermediate
The most direct way to shrink a gradient is to spend fewer bits on each number. A standard gradient stores every coordinate as a 32-bit float; if a coarse approximation of each coordinate suffices for the optimizer, we can transmit one, two, or a handful of bits instead and cut the bandwidth term by an order of magnitude. The earliest production instance, 1-bit SGD, sent only the sign of each coordinate (scaled by a per-bucket magnitude) and folded the rounding loss back through error feedback; it shipped in speech-recognition training where the bandwidth between servers was the binding constraint. The principle generalizes: replace the true gradient $g$ with a quantized $Q(g)$ that is cheap to send.
What separates a good quantizer from a lossy one is unbiasedness. QSGD makes the quantization stochastic, so that each coordinate rounds up or down at random with probabilities chosen to satisfy $\mathbb{E}[Q(g)] = g$. An unbiased compressor preserves the expected gradient, so SGD still descends in the right direction on average; the price is added variance, which the convergence analysis absorbs as a slightly smaller effective step. For a coordinate $g_i$ with magnitude bounded by $\lVert g \rVert$, a stochastic quantizer to one of $s$ levels rounds to the nearer level $\ell$ with probability proportional to its closeness:
$$Q(g_i) = \lVert g \rVert \cdot \operatorname{sign}(g_i) \cdot \xi_i, \qquad \mathbb{E}[\xi_i] = \frac{|g_i|}{\lVert g \rVert}, \qquad \text{so } \mathbb{E}[Q(g_i)] = g_i.$$TernGrad pushes the same idea to three levels $\{-1, 0, +1\}$ times a shared scale, an extreme that still trains deep networks because the stochastic rounding keeps the estimate unbiased. The common thread is that quantization touches every coordinate but spends almost nothing on each, so the index bookkeeping is trivial; what you give up is precision, recovered in expectation by randomization and in practice by error feedback.
A compressor $C$ is safe for SGD when it is unbiased, $\mathbb{E}[C(g)] = g$, because the optimizer only ever sees gradients in expectation. Unbiasedness converts the lost information into extra variance rather than a systematic error, and variance is the currency SGD already trades in. Biased compressors (top-k is biased, since it always keeps the same large coordinates) are not safe on their own; they need error feedback to stay unbiased over time. The whole design space of gradient compression is the management of this one distinction: keep the estimate unbiased, by randomization or by feedback, and convergence survives.
2. Sparsification: Send Only What Matters Intermediate
Quantization keeps every coordinate and cheapens each. Sparsification does the opposite: it keeps every retained coordinate at full precision but transmits only a small subset. The observation that motivates it is empirical and striking. In deep-network training, the gradient is dominated by a few large entries; the vast majority of coordinates are near zero at any given step. Top-$k$ sparsification exploits this by sending only the $k$ largest-magnitude coordinates (with their indices) and zeroing the rest, where $k$ might be one percent of $P$ or less.
The famous result is that this is far more aggressive than it sounds. Deep Gradient Compression reported that transmitting roughly $0.1\%$ of gradient coordinates per step, with the right corrections, matched the accuracy of dense training across image and language models, a hundredfold to thousandfold reduction in the bytes per round. The corrections matter because top-$k$ is biased: a coordinate that is consistently medium-sized would be perpetually dropped, and its information would vanish. The fix is to never truly discard a dropped coordinate but to bank it. Each worker keeps an error buffer $e_t$, the running sum of everything compression has thrown away, and adds it back before the next compression. Formally, with a compressor $C$ and step $\eta$,
$$\tilde{g}_t = C\!\left(g_t + e_t\right), \qquad e_{t+1} = \left(g_t + e_t\right) - \tilde{g}_t, \qquad w_{t+1} = w_t - \eta\, \tilde{g}_t.$$The buffer $e_{t+1}$ holds exactly the residual the compressor refused to send; on a later step, once those banked coordinates have grown large enough to enter the top-$k$, they are transmitted in full. No gradient information is lost, only delayed, which is why error feedback turns an aggressively biased compressor into a convergent optimizer. We make this mechanism concrete in Section 4 and watch it work in Code 10.7.1.
Error feedback runs the books like an honest bartender. Every step, the compressor only pours the gradient coordinates it can afford to send and writes the rest on a tab (the error buffer). The tab never gets erased; it just grows until some coordinate's running total is large enough to make the cut, and then it gets paid in full. Over a whole training run the optimizer settles every cent it ever owed, just not on the step it incurred the charge. Convergence proofs are, in essence, the statement that the bar always closes its tab.
3. Low-Rank Compression: Send a Sketch Advanced
A third family treats the gradient of a weight matrix as a matrix, not a flat vector, and exploits the fact that a matrix can be approximated by a low-rank product. For a layer whose gradient is a matrix $G \in \mathbb{R}^{m \times n}$, a rank-$r$ approximation $G \approx P Q^\top$ with $P \in \mathbb{R}^{m \times r}$ and $Q \in \mathbb{R}^{n \times r}$ replaces $mn$ numbers with $r(m+n)$ numbers, a large saving whenever $r \ll \min(m, n)$. PowerSGD computes such a factorization cheaply with a single step of power iteration per round, reusing the previous step's factors as a warm start so the iteration converges essentially for free as training proceeds.
The decisive systems property of PowerSGD is that its compressed form is itself all-reducible. Top-$k$ sparsification produces different index sets on different workers, so combining the messages needs a sparse, irregular collective that is awkward on the GPU interconnects that Chapter 4 describes. PowerSGD's factors $P$ and $Q$ are dense and have the same shape on every worker, so they go straight through a standard all-reduce, the same primitive a dense gradient uses. That compatibility is why PowerSGD became a default option in production data-parallel training: it shrinks the message and keeps the fast collective. Like top-$k$, it is biased, and it relies on the same error-feedback buffer to stay convergent.
Implementing low-rank compression by hand means intercepting every gradient bucket, running power iteration, all-reducing the factors, and managing a per-parameter error buffer, a few hundred lines that must interleave correctly with the backward pass. PyTorch ships it as a one-line communication hook on DistributedDataParallel, so the same compressed all-reduce the chapter derives becomes a registration call:
import torch
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as psgd
model = torch.nn.parallel.DistributedDataParallel(model) # standard DDP wrapper
state = psgd.PowerSGDState( # rank and warm-start config
process_group=None, matrix_approximation_rank=4,
start_powerSGD_iter=1000, # warm up dense, then compress
)
model.register_comm_hook(state, psgd.powerSGD_hook) # every all-reduce now low-rank
4. Error Feedback, in Code Intermediate
The claims above are quantitative, so we test them. The program below trains a linear-regression model with eight data-parallel workers two ways. The first sends each worker's full gradient every step. The second applies top-$k$ sparsification, keeping only the ten largest of one hundred coordinates per worker, with a per-worker error-feedback buffer following the recurrence from Section 2. We compare the final loss and count the coordinates each method puts on the wire.
import numpy as np
rng = np.random.default_rng(0)
N, d, K = 20_000, 100, 8 # examples, features, workers
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(N)
shards = np.array_split(np.arange(N), K) # one data shard per worker
lr, steps = 0.05, 60
keep = 10 # top-k: send only 10 of 100 coords
def grad_shard(w, s):
"""Mean-squared-error gradient on worker s's shard only."""
Xs, ys = X[s], y[s]
return (2.0 / len(s)) * (Xs.T @ (Xs @ w - ys))
def loss(w):
return float(np.mean((X @ w - y) ** 2))
def topk(v, k):
"""Keep the k largest-magnitude coordinates of v, zero the rest."""
out = np.zeros_like(v)
idx = np.argpartition(np.abs(v), -k)[-k:]
out[idx] = v[idx]
return out
def run(compress):
w = np.zeros(d)
err = [np.zeros(d) for _ in shards] # per-worker error-feedback memory
sent = 0
for _ in range(steps):
agg = np.zeros(d)
for j, s in enumerate(shards):
g = grad_shard(w, s)
if compress:
corrected = g + err[j] # add back what was dropped before
msg = topk(corrected, keep) # transmit only k coordinates
err[j] = corrected - msg # remember the residual locally
sent += keep
else:
msg = g
sent += d
agg += msg
w = w - lr * (agg / K) # average the received messages
return loss(w), sent
L_full, sent_full = run(compress=False)
L_comp, sent_comp = run(compress=True)
print(f"workers K : {K}")
print(f"coordinates per gradient : {d} (top-k keeps {keep})")
print(f"uncompressed final loss : {L_full:.6f}")
print(f"top-k + error-fb loss : {L_comp:.6f}")
print(f"loss gap : {abs(L_comp - L_full):.2e}")
print(f"coords sent uncompressed : {sent_full}")
print(f"coords sent compressed : {sent_comp}")
print(f"communication reduction : {100 * (1 - sent_comp / sent_full):.1f}%")
g + err, banks the residual in err, and transmits only the ten retained coordinates; the sent counter measures the network traffic each scheme actually generates.workers K : 8
coordinates per gradient : 100 (top-k keeps 10)
uncompressed final loss : 0.247179
top-k + error-fb loss : 0.246605
loss gap : 5.74e-04
coords sent uncompressed : 48000
coords sent compressed : 4800
communication reduction : 90.0%
The result is the section's thesis in three numbers. Ten coordinates per worker per step instead of a hundred is a tenfold cut in the bandwidth term $\beta P$, and the model reached the same loss. The discarded ninety coordinates were not lost, only deferred through the error buffer, which is why the gap is rounding-level rather than a visible degradation. Scale this from a hundred parameters to a billion and the saved bytes are the difference between a network-bound training step and a compute-bound one.
Who: A platform engineer training a recommendation model across two data centers in different regions.
Situation: The dataset was pinned to one region for compliance, but the only free GPUs were in another, so gradients had to cross a wide-area link every step.
Problem: The inter-region link delivered roughly a tenth of the intra-cluster bandwidth, and dense gradient all-reduce stretched each step until the GPUs sat idle most of the time waiting on the network.
Dilemma: Move the data (blocked by compliance), rent same-region GPUs at a premium, or compress the gradients hard enough that the slow link stopped being the bottleneck.
Decision: They compressed, combining top-$k$ sparsification with error feedback for the large embedding gradients and PowerSGD for the dense layers, accepting a slightly higher step count for far cheaper steps.
How: They registered a PowerSGD communication hook on the dense parameters and a custom top-$k$ hook on the embedding gradients, each with its own error buffer, and tuned $k$ until the link stopped being saturated.
Result: Per-step wire traffic fell by about $20\times$, the GPUs went from mostly idle to mostly busy, and final model quality was within noise of a same-region dense baseline, exactly the trade Output 10.7.2 shows in miniature.
Lesson: When the link is slow, the right move is not always faster hardware; compressing the message can turn a network-bound run back into a compute-bound one, and error feedback is what keeps the aggressive compression honest.
5. Local Updates: Fewer Rounds, Not Smaller Messages Intermediate
Quantization, sparsification, and low-rank compression all shrink the message while keeping one synchronization per step. Local-update methods attack the orthogonal axis: they cut the number of synchronizations. In local SGD, each worker takes $H$ ordinary optimizer steps on its own data before any communication, and only then are the worker models averaged. Setting $H = 1$ recovers synchronous distributed SGD; setting $H = 10$ communicates a tenth as often. Where compression divides the bytes per round, local SGD divides the rounds, and the two compose: a compressed message sent every tenth step is cheap on both axes.
The cost is statistical. Between synchronizations the workers drift apart, each descending the slightly different landscape of its own shard, so the averaged model makes a little less progress per gradient evaluated than fully synchronous SGD would. The trade is governed by $H$: larger $H$ means fewer rounds but more drift, and the sweet spot depends on how much faster communication is than computation. This is the same bytes-versus-rounds-versus-progress trade as compression, viewed from the rounds axis. Local SGD is the algorithmic core of federated averaging, where the workers are phones or hospitals that can communicate only rarely, and we develop that setting in full in Chapter 14, where infrequent communication is not an optimization but a hard constraint of the deployment.
Every method in this section turns the same knob: trade a little per-step progress for a large cut in communication cost. Compression shrinks the bytes per round; local updates shrink the rounds; error feedback recovers the progress that aggressive compression would otherwise lose. This is the bandwidth term $\beta P$ from the $\alpha$-$\beta$ model of Chapter 3, now treated as a design variable rather than a fixed cost. The same trade reappears scaled out across the book: as compressed all-reduce in data-parallel deep learning, as the rare-synchronization constraint of federated learning (Chapter 14), and as the geo-distributed training that lets a model train across continents. Whenever a later method claims to make distributed training cheaper, ask which of the two axes it cuts and what per-step progress it pays.
6. The Unifying View and Its Limits Advanced
Lined up together, the families form a single design space with two axes and one safeguard. Quantization and sparsification and low-rank compression all reduce bytes-per-round; local SGD reduces rounds; error feedback is the correction that lets the biased, aggressive members of either axis stay convergent. Table 10.7.1 places each method on this map, naming what it sends, what it costs, and whether it needs feedback to converge.
| Method | What it cuts | What it sends | Needs error feedback? |
|---|---|---|---|
| 1-bit SGD / QSGD / TernGrad | Bytes per round | Every coordinate, 1 to 8 bits | Helpful (required for 1-bit) |
| Top-$k$ / Deep Gradient Compression | Bytes per round | $k$ largest coordinates plus indices | Yes (biased compressor) |
| PowerSGD (low-rank) | Bytes per round | Dense factors $P$, $Q$ of rank $r$ | Yes (biased compressor) |
| Local SGD / FedAvg | Rounds | Full model every $H$ steps | No (averaging is unbiased) |
The limits are as important as the savings. Compression helps only when communication, not computation, is the bottleneck; on a fast intra-node interconnect with a small model, compressing the gradient can cost more in encoding and decoding than it saves on the wire, and dense all-reduce wins. Sparsification's index overhead erodes its savings when $k$ is not tiny, and irregular sparse collectives are slower per byte than the dense all-reduce that PowerSGD preserves. Local SGD's drift grows with data heterogeneity, so the more unlike the workers' shards, the smaller the $H$ you can afford. None of these methods is free, and each is the right tool only when its specific bottleneck binds, the same discipline of matching the remedy to the ceiling that opened the book. We quantify exactly when each pays off with the communication-complexity bounds of the next sections.
Local updates have moved from a bandwidth optimization to the enabling idea behind training across the open internet. DiLoCo (Douillard et al., 2024) pairs hundreds of local inner steps with an infrequent outer optimizer and reaches the quality of fully synchronous training while communicating orders of magnitude less, and streaming and asynchronous follow-ups (DiPaCo, Streaming DiLoCo, 2024 to 2025) relax the remaining synchronization so that geographically scattered, loosely connected clusters can co-train one model. On the bytes axis, 1-bit and sub-1-bit optimizers continue to advance: 1-bit Adam and 0/1 Adam compress not just the gradient but the optimizer's momentum and variance statistics, and recent work fuses quantization with error feedback that provably tracks full-precision Adam. The two axes are now routinely combined: compressed messages sent only every few hundred local steps, which is what makes a single training run spanning data centers on different continents a practical proposition rather than a thought experiment. The throughline is the one this section drew, communication is a budget to be spent deliberately, and the frontier is spending ever less of it per unit of learning.
Top-$k$ sparsification is a biased compressor: $\mathbb{E}[C(g)] \neq g$ in general. Argue informally why, without error feedback, a coordinate whose true gradient is consistently medium-sized (never large enough to enter the top-$k$, never zero) would be systematically ignored, and what that does to the model along that coordinate's direction. Then explain, using the recurrence $e_{t+1} = (g_t + e_t) - \tilde{g}_t$, how the error buffer guarantees that such a coordinate is eventually transmitted in full. Contrast this with an unbiased quantizer like QSGD, which converges without feedback, and state in one sentence the property that distinguishes the two cases.
Starting from Code 10.7.2, sweep keep over $\{1, 2, 5, 10, 25, 50, 100\}$ and plot the final loss against the number of coordinates sent. Identify the value of keep below which accuracy starts to degrade visibly. Then disable error feedback (set err[j] to zero every step) and repeat the sweep; quantify how much more aggressively you can compress when feedback is on. Report the compression ratio at which the two curves diverge and relate it to the section's claim that feedback turns a biased compressor convergent.
Using the $\alpha$-$\beta$ communication model from Chapter 3, the cost of one synchronization of a $P$-coordinate gradient is $\alpha + \beta P$, where $\alpha$ is the per-round latency and $\beta$ the per-coordinate transfer time. Write the per-step communication cost for (a) top-$k$ compression that sends $k$ coordinates every step, and (b) local SGD that sends all $P$ coordinates every $H$ steps. Under what relationship between $\alpha$, $\beta P$, $k$, and $H$ does compression beat local updates, and when does the latency term $\alpha$ make local updates the better choice? Explain why combining both (compressed messages sent every $H$ steps) dominates either alone when both $\alpha$ and $\beta P$ are large.