"They told me my job was to minimize the average loss. Nobody mentioned the average would be scattered across two thousand machines that all wanted to go first."
An Optimizer With Too Many Workers
Almost every model you will ever train is fit by minimizing one quantity: the average loss over a finite set of training examples. That single objective, empirical risk, is what stochastic gradient descent attacks, what parameter servers store, and what every parallel-training trick in this chapter is ultimately serving. Its defining feature is the reason this whole chapter exists: because the objective is an average over examples, so is its gradient, and an average splits across machines without losing a thing. This section states the optimization problem precisely, shows exactly which part of its structure makes data parallelism possible, separates the well-behaved convex case from the non-convex reality of deep networks, and sets up the attack the rest of the chapter mounts on it: stochastic gradients (Section 10.2), synchronized across workers (Section 10.3) or run loose (Section 10.4), with communication as the tax we spend the chapter learning to reduce.
Training a model, stripped of its framework and its jargon, is the act of choosing parameters that make a chosen loss small on the data you have. You do not get to measure loss on the true distribution of all possible inputs, because you never see that distribution; you see a finite sample. So you settle for the next best thing, the average loss over the examples in front of you, and you search for the parameters that minimize it. That average has a name, empirical risk, and minimizing it, empirical risk minimization or ERM, is the optimization problem underneath linear regression, logistic regression, support vector machines, gradient-boosted trees, and every deep network in this book. Chapter 1 used the gradient of this objective to show that data parallelism is exact; this chapter takes that same objective and builds the full distributed-optimization toolkit around it.
1. The Objective Behind All Training Beginner
Fix a model with parameters $w$ and a per-example loss $\ell(w; x_i, y_i)$ that measures how badly the model with parameters $w$ predicts the label $y_i$ for input $x_i$. Given $N$ training pairs, the empirical risk is the average of that loss over the sample, and ERM is the problem of choosing $w$ to make it as small as possible:
$$\min_{w} \; L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(w; x_i, y_i) + \lambda R(w).$$The optional term $\lambda R(w)$ is regularization, a penalty on the size or complexity of $w$ (most often $R(w) = \tfrac{1}{2}\lVert w \rVert^2$, ridge or weight decay) that discourages overfitting the finite sample. Different models simply plug in different losses: squared error $\ell = \tfrac{1}{2}(w^\top x_i - y_i)^2$ gives linear regression, the logistic loss gives logistic regression, the hinge loss gives a linear support vector machine, and the cross-entropy of a deep network's output gives a neural classifier. The shape of $\ell$ changes everything about how hard the search is, but the form of the objective, an average over examples plus a penalty, does not change at all. That invariance is what lets one chapter on optimization serve models as different as a linear regressor and a transformer.
To minimize $L$ we follow its gradient downhill, and here the averaging structure carries straight through differentiation. Because the derivative of a sum is the sum of the derivatives, the gradient of the empirical risk is itself an average of per-example gradients:
$$\nabla L(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(w; x_i, y_i) + \lambda \nabla R(w).$$This is the identity Section 1.1 used to prove data parallelism is exact (Section 1.1), and it is the hinge on which this entire chapter swings. Read it as a recipe for distribution: to compute $\nabla L$, you sum a contribution from every example, and a sum can be partitioned across machines and recombined by addition. The regularizer $\lambda \nabla R(w)$ is a function of $w$ alone, not of the data, so it costs nothing to distribute; every worker can add its share locally.
2. Why the Sum Is Separable, Exactly Intermediate
The property that makes ERM distributable is that its objective is a separable sum: a total formed by adding one independent term per example, where no term depends on which other examples happen to share the dataset. Separable sums decompose along any partition of the index set. Split the $N$ examples into $K$ disjoint shards $S_1, \dots, S_K$, hand shard $S_k$ to worker $k$, and let that worker compute the unnormalized partial sum over only its own rows,
$$g_k = \sum_{i \in S_k} \nabla \ell(w; x_i, y_i).$$Then $\sum_{i=1}^{N} \nabla \ell = g_1 + g_2 + \cdots + g_K$, because addition is associative and commutative and does not care how the terms were grouped. Dividing the summed partials by $N$ (and adding the cheap regularizer gradient) recovers $\nabla L(w)$ exactly. The combine step, summing one vector held on each worker and sharing the result, is the collective operation called all-reduce, introduced in Chapter 4 and turned into a training loop in Chapter 15. Figure 10.1.1 traces exactly this path from the single objective at the top to the recombined gradient at the bottom.
Data parallelism is not a clever approximation that happens to work; it is a direct consequence of the empirical risk being a sum with one independent term per example. The gradient inherits that sum, and a sum can be split across any number of workers and reassembled by addition with zero loss. Every synchronous data-parallel method in this book, from all-reduce SGD to FSDP, is at bottom this one fact about separable sums, dressed in the engineering needed to do the addition fast and reliably over a network. When an objective is not a clean per-example sum (a ranking loss over pairs, a contrastive loss over a batch, a graph loss over edges), distributing it takes extra care, which is exactly why later chapters treat those cases separately.
The identity $\nabla L = \frac{1}{N}\sum_i \nabla \ell_i$ first appeared in Section 1.1 as the one-line proof that scale-out training is exact. Here it becomes the foundation of an entire chapter: Section 10.2 replaces the full sum with a random mini-batch to get stochastic gradient descent, Section 10.3 keeps the sum exact but computes it across synchronized workers, Section 10.4 lets workers fall out of step, and the communication-efficiency sections trim the cost of the all-reduce that reassembles it. Whenever a method in this chapter claims to "distribute training", check what it does to this sum; the answer is always either compute it exactly across machines, approximate it with a sample, or compress the messages that carry the partial sums.
3. Convex and Non-Convex Objectives Intermediate
Separability tells us the gradient distributes, but it says nothing about whether following that gradient actually finds a good minimum. That question turns on the shape of $L$, and the decisive distinction is convexity. An objective is convex if its graph is bowl-shaped: the line segment between any two points on the surface lies on or above the surface, so the function curves the same way everywhere. Linear regression with squared loss, logistic regression, and the linear SVM are all convex in their parameters. Convexity is a gift, because a convex objective has no spurious local minima; every local minimum is a global minimum, and gradient descent from any starting point converges to the same optimal loss. For distributed optimization this means the convergence guarantees are clean and provable: synchronous distributed SGD on a convex objective converges to the global optimum at rates we can write down, and those rates are what Chapter 3's cost models combine with communication time to predict end-to-end training speed.
Deep networks break convexity. Composing linear maps with nonlinear activations produces an objective with many local minima, saddle points, flat regions, and sharp valleys; the bowl becomes an egg carton. Non-convex ERM has no guarantee that gradient descent reaches the global optimum, and different runs from different initializations land in different basins. Yet the same averaged-gradient machinery still applies, because separability is a property of the sum, not of its curvature. We still compute $\nabla L$ as an average of per-example gradients, still split that average across shards, still all-reduce it. What changes is the theory we can offer: for non-convex objectives we prove convergence only to a stationary point (a place where the gradient is near zero), not to a global minimum, and the empirical fact that this works well enough to train every modern model is one of the happy surprises of the field. The good news for this chapter is that the distribution mechanics are identical in both regimes; convexity affects what you can prove about the destination, not how the workers cooperate to take each step.
A modern network's loss surface lives in a space of billions of dimensions, so the cozy picture of a bumpy egg carton is a polite fiction. In very high dimensions, a randomly chosen critical point is overwhelmingly likely to be a saddle (down in some directions, up in others) rather than a true local minimum, which is part of why gradient methods rarely get permanently stuck. The optimizer is less a marble settling into a bowl than a hiker on a ridge-laced landscape who keeps finding one more direction that goes down. Distribution does not change the terrain; it just puts a thousand hikers on it who must agree, at every step, on which way is down.
4. Computing the Gradient as a Sum of Shard Gradients Intermediate
The claim that the full-data gradient equals the sum of per-shard gradients is easy to state and worth verifying in code, because the rest of the chapter rests on it. The program below sets up a regularized logistic-regression ERM objective on synthetic binary-classification data, then computes its gradient two ways: once the single-machine way over all $N$ rows, and once as $K$ independent shard computations whose partial sums are added and divided by $N$, exactly the all-reduce-then-normalize path of Figure 10.1.1. It reports how far apart the two answers are.
import numpy as np
rng = np.random.default_rng(7)
N, d, K = 120_000, 40, 6 # examples, features, workers (shards)
lam = 1e-2 # L2 regularization strength
# Synthetic binary-classification data for regularized logistic regression.
X = rng.standard_normal((N, d))
w_star = rng.standard_normal(d)
p = 1.0 / (1.0 + np.exp(-(X @ w_star)))
y = (rng.random(N) < p).astype(float) # labels in {0, 1}
w = rng.standard_normal(d) * 0.1 # the point where we evaluate the gradient
def reg_logistic_grad(Xb, yb, w):
"""Gradient contribution of a block of rows to the FULL-data objective.
The data term is summed (not averaged) so blocks compose by addition; the
regularizer is the per-row share lam*w scaled by the block's row count."""
sigma = 1.0 / (1.0 + np.exp(-(Xb @ w)))
data_term = Xb.T @ (sigma - yb) # sum over this block's rows
reg_term = lam * w * Xb.shape[0] # this block's share of N * lam * w
return data_term + reg_term
# Single-machine gradient: one process sees all N rows.
full = reg_logistic_grad(X, y, w) / N
# Data-parallel gradient: K workers, each blind to the others' rows.
shards = np.array_split(np.arange(N), K)
partials = [reg_logistic_grad(X[s], y[s], w) for s in shards] # K workers
allreduced = np.sum(partials, axis=0) / N # all-reduce, then /N
print("workers K :", K)
print("shard sizes :", [len(s) for s in shards])
print("max abs difference :", f"{np.max(np.abs(allreduced - full)):.2e}")
print("relative error :", f"{np.linalg.norm(allreduced - full) / np.linalg.norm(full):.2e}")
print("gradients identical? :", bool(np.allclose(allreduced, full)))
full) and once as six blind shard contributions recombined by summation and division by $N$ (allreduced). The data term is summed rather than averaged inside each block precisely so that the blocks compose by addition, and the regularizer is split by row count so the shares add back to the full penalty.workers K : 6
shard sizes : [20000, 20000, 20000, 20000, 20000, 20000]
max abs difference : 1.65e-15
relative error : 6.33e-15
gradients identical? : True
np.allclose reports them identical. Six workers, each blind to five sixths of the data, jointly compute the exact gradient of the regularized objective.The difference is not merely small; it is zero up to floating-point rounding, and np.allclose confirms the two gradients are identical for every practical purpose. This is the same exactness Section 1.1 demonstrated for linear regression, now shown to survive the addition of a nonlinear logistic loss and an $L^2$ regularizer, because none of that disturbs the separable-sum structure. The only subtlety the code makes explicit is bookkeeping: each shard sums its data-term contributions rather than averaging them, and the regularizer is apportioned by row count, so that the partial sums add back to exactly $N \nabla L$ before the single division by $N$. Get that bookkeeping wrong (for instance by averaging within each shard and then averaging across shards of unequal size) and the exactness breaks, a failure you will provoke deliberately in Exercise 10.1.2.
In Code 10.1.1 we wrote the per-shard gradient by hand and combined the partials with an explicit sum. In a real training job you write neither piece. Autograd computes $\nabla \ell$ from the loss expression, and a distributed wrapper performs the cross-worker sum as part of the backward pass, so the separable-sum recombination of Figure 10.1.1 happens with no code from you at all:
# Run with: torchrun --nproc_per_node=6 thisfile.py
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
dist.init_process_group("nccl") # join the group of K workers
model = DDP(nn.Linear(40, 1).cuda()) # logistic regression as a 1-layer net
loss_fn = nn.BCEWithLogitsLoss() # mean cross-entropy over the shard
for xb, yb in this_workers_shard_loader: # each worker sees only its rows
loss = loss_fn(model(xb).squeeze(), yb)
loss.backward() # autograd builds the per-shard gradient AND all-reduces it
optimizer.step() # every worker now steps with the exact full-batch direction
np.sum of Code 10.1.1 collapse into one loss.backward() call. DistributedDataParallel fires the all-reduce automatically during backpropagation and divides by the world size, so the dozen lines of manual gradient assembly become a standard training step; the framework also overlaps the network transport with computation, as Chapter 15 details.Who: A machine learning engineer building a search-ranking model at a travel marketplace.
Situation: The team moved their pointwise click-prediction model to eight-worker data-parallel training and saw the exact, identical gradients that ERM theory promises, with linear speedup.
Problem: When they switched the objective to a pairwise ranking loss (penalizing every booked listing ranked below a skipped one), accuracy across workers drifted and the speedup vanished.
Dilemma: Keep the clean pointwise loss that distributes perfectly but ranks worse, or keep the pairwise loss that ranks better but whose terms couple examples across shard boundaries, so a naive per-shard sum silently dropped the cross-shard pairs.
Decision: They kept the pairwise objective but redefined the unit of sharding: instead of sharding individual examples, they sharded whole query groups, so every pair in a loss term lived on the same worker and the sum became separable again.
How: The data loader grouped rows by query id and assigned entire groups to workers; within a group the loss was a clean sum, and across groups the groups were independent, restoring the separable structure of Section 2.
Result: Exact gradients and linear speedup returned, with the better-ranking loss, because the objective was once again a sum of independent per-shard terms.
Lesson: Data parallelism is exact when the objective is a separable sum over the sharding unit. If a loss couples examples, shard at the granularity that keeps each coupled term whole, or the distributed gradient stops matching the single-machine one.
5. Setting Up the Chapter's Attack on This Objective Beginner
We now have the target. Every method in this chapter is a different way of minimizing the empirical risk $L(w)$ across machines, and they differ in how they handle two costs that Code 10.1.1 hid because everything ran in one process. The first cost is computing the full sum at all: with millions or billions of examples, evaluating $\nabla L$ exactly every step is wasteful, so Section 10.2 replaces it with the gradient of a small random mini-batch, an unbiased sample of the full average, giving stochastic gradient descent. The second cost is communication: once the gradient is split across workers, the partial sums must be combined every step, and that exchange takes real network time. Section 10.3 performs the combine synchronously so every worker steps in lockstep with an exact averaged gradient, while Section 10.4 lets workers run asynchronously, trading exactness for the elimination of waiting, at the cost of the stale gradients that Chapter 11 manages with bounded staleness.
Communication is the thread that runs through all of it. Because the combine step moves a vector the size of the model every step, its cost grows with both model size and worker count, and for large models it can dominate the computation it serves. The communication-efficient methods of Section 10.7 shrink that cost by compressing, quantizing, or sparsifying the messages, or by communicating less often (local SGD), and Section 10.9 asks how far that can be pushed in principle through communication-complexity lower bounds. The federated setting of Chapter 14 takes the same ERM objective to its communication-starved extreme, where data never leaves the devices that hold it. Throughout, the object being optimized never changes; what changes is how cleverly the workers cooperate to evaluate and follow its averaged gradient.
The classical assumption behind synchronous distributed ERM is a fast, reliable interconnect inside one datacenter. A vigorous recent line asks whether the same averaged objective can be optimized across machines connected only by the public internet, where the all-reduce of Figure 10.1.1 would be ruinously slow if run every step. DiLoCo (Douillard et al., 2024) and its descendants let each worker take hundreds of local optimization steps between rare synchronizations, turning the per-step communication of standard data-parallel training into per-hundreds-of-steps communication while preserving convergence on the shared objective; follow-up work on streaming and asynchronous variants pushes this toward genuinely geo-distributed and intermittently connected training. A complementary thread (the lineage of PowerSGD and low-bit gradient compression) shrinks the bytes in each combine step rather than its frequency. Both treat the cost of recombining the separable sum, the all-reduce in Code 10.1.1, as a quantity to be engineered down, and we develop the tools to evaluate them in Section 10.7 and Chapter 3.
The plan for the chapter, then, is set. We have the objective (empirical risk), the structural fact that makes it distributable (a separable sum whose gradient is an average), the distinction that governs what we can prove about the result (convex versus non-convex), and a verified demonstration that the distributed gradient equals the single-machine one exactly. The next section takes the first and most consequential step away from the exact full-data gradient: sampling it.
For each objective, state whether the empirical risk is a separable sum with one independent term per training example, and therefore whether per-example data parallelism is exact without special handling: (a) ridge regression, squared loss plus $\tfrac{\lambda}{2}\lVert w \rVert^2$; (b) a batch-wise contrastive loss where each example's loss depends on every other example in its batch through a softmax over similarities; (c) the log-likelihood of an independent-and-identically-distributed dataset under a probabilistic model; (d) a graph node-classification loss that includes a smoothness penalty over edges connecting nodes. For the non-separable cases, name a sharding unit (as in the Practical Example) that would restore separability, or explain why none cleanly does.
Modify Code 10.1.1 so the six shards have very unequal sizes (for example one shard of 100,000 rows and five of 4,000). First combine the shards by taking a plain unweighted average of each shard's mean gradient, and measure the error against full. Then fix it two ways: a size-weighted average of the shard means, and the sum-of-partials-divided-by-$N$ form already in the code. Confirm both fixes recover the exact gradient. Explain in two sentences why the unweighted average of means fails under size imbalance while the sum-then-normalize form does not, and what this implies for a real cluster where workers receive unequal numbers of rows.
Consider running the same synchronous distributed SGD on (a) a convex logistic-regression ERM objective and (b) a non-convex two-layer-network ERM objective, each launched from many random initializations on the identical data and worker configuration. Predict, and justify from the convexity discussion in Section 3, how the final training losses across the random restarts will differ between the two objectives. Then state which guarantee distributed SGD can offer in each case (global minimum versus stationary point), and explain why this difference in what you can prove about the destination does not change how the workers cooperate to take each step.