Part IV: Parallel Deep Learning and Large Models
Chapter 15: Data-Parallel Deep Learning

Why Deep Learning Needs Distributed Training

"They asked me to learn from a trillion tokens by Thursday. I am one GPU. I did the arithmetic, and then I asked for seven hundred friends."

A GPU Idling on a Communication Barrier
Big Picture

Modern deep learning is no longer a workload that fits on one accelerator; the compute a state-of-the-art model demands has grown by many orders of magnitude faster than any single chip has, so training across many devices is mandatory, not optional. Three things explode together: the dataset (epochs over terabytes), the model (parameters beyond one device's memory), and the resulting wall-clock (months on a single GPU). Part IV is the engineering of spreading that one training job across many machines: this chapter replicates the model and splits the data (data parallelism), Chapter 16 splits the model itself, and Chapter 17 splits sparse experts. This first section establishes why the move is forced, and recalls that the central form of it is mathematically exact, so the rest of the chapter is about making it fast with real GPUs and PyTorch.

By the start of Part IV you have already seen distributed training in principle. Chapter 10 developed synchronous and asynchronous SGD as optimization algorithms, and Section 1.1 proved, in one short calculation, that data-parallel gradient descent is an exact reorganization of single-machine training rather than an approximation. What changes now is the hardware and the stakes. Part III treated the worker as an abstract compute node; Part IV makes it a concrete GPU with finite memory, a finite FLOP rate, and a finite link to its neighbors, and it makes the model a real deep network trained with PyTorch. The question stops being "is data parallelism correct?" and becomes "given these chips and this network, how do we make it actually fast?"

The reason this question is now unavoidable is a simple mismatch of growth rates. The compute used to train a leading model has risen relentlessly across the deep-learning era, doubling on a timescale of months rather than the years that single-chip throughput needs to double. A single accelerator, however capable, has been falling further behind the frontier every year. When the work a model needs is hundreds of times what one chip can deliver in a tolerable time, you do not wait; you distribute. Three distinct pressures drive that gap, and it pays to keep them separate because each one calls for a different kind of parallelism.

Three pressures on one GPU Dataset size terabytes, many epochs Model size params beyond one device Wall-clock time months on a single GPU Data parallelism replicate model, split data Model / pipeline / sharded split the model itself This part Chapter 15 this chapter Chapter 16 model and sharded Chapter 17 expert parallelism Dataset size and wall-clock are relieved by replicating and splitting data; model size is relieved by splitting the model across devices.
Figure 15.1.1: The three pressures that push deep learning off a single GPU and the parallelism each one calls for. Dataset size and wall-clock are addressed by data parallelism (replicate the model, split the data), the subject of this chapter. Model size, where the parameters and their optimizer state simply do not fit on one device, is addressed by splitting the model itself across devices, the subject of Chapter 16, with expert parallelism (Chapter 17) as a sparse cousin. A foundation-model run uses all of them at once.

1. The Compute Gap That Forces the Move Beginner

Start with the quantity that matters: total training compute, measured in floating-point operations. A useful and widely used estimate is that training a dense transformer with $P$ parameters on $D$ tokens costs about $C \approx 6 P D$ FLOPs, counting the forward and backward passes together. The factor of six is an engineering rule of thumb, but it captures the scaling that matters: compute grows with both the model and the data, so when both grow, the product explodes. A seven-billion-parameter model trained on a trillion tokens needs roughly $6 \times (7 \times 10^9) \times 10^{12} \approx 4 \times 10^{22}$ FLOPs, and frontier runs sit orders of magnitude above even that.

Now put that against one accelerator. A current high-end GPU advertises on the order of $10^{15}$ FLOPs per second at peak in low precision, but real training never sustains peak; a healthy job achieves a model FLOPs utilization (MFU) of perhaps forty percent, so the sustained rate is closer to $4 \times 10^{14}$ FLOPs per second. Dividing the budget by the rate gives the single-GPU wall-clock, and the answer is brutal: more than a thousand days, well over three years, for a model that is by today's standards modest. No project waits three years for one training run. The only way the number becomes tolerable is to bring many accelerators to bear at once, which is exactly what data parallelism does by replicating the model on each of $K$ devices and giving each a different slice of the data.

The code below makes this arithmetic concrete. It estimates the single-GPU time for the seven-billion-parameter job, then the time for an ideal $K$-GPU data-parallel run (perfect linear speedup) and a more realistic one in which a simple communication-overhead model erodes the speedup as $K$ grows. It is pure arithmetic, no GPU required, so you can run it anywhere and see months collapse to days.

P = 7.0e9            # parameters (7 billion)
tokens = 1.0e12      # training tokens (1 trillion)
flops_per_token = 6.0 * P                 # standard 6*N*D forward+backward estimate
total_flops = flops_per_token * tokens

gpu_peak = 1.0e15    # 1 PFLOP/s peak (bf16, dense)
mfu = 0.40           # model FLOPs utilization actually achieved
gpu_eff = gpu_peak * mfu                   # sustained FLOP/s per GPU

def days(seconds):
    return seconds / 86400.0

single = total_flops / gpu_eff
print("total training FLOPs :", f"{total_flops:.2e}")
print("sustained per-GPU    :", f"{gpu_eff:.2e} FLOP/s")
print("single-GPU time      :", f"{days(single):.1f} days  ({days(single)/30.0:.1f} months)")
print()

for K in (8, 64, 256, 1024):
    ideal = single / K                     # perfect linear data-parallel speedup
    # Communication erodes the speedup: all-reduce time grows while per-GPU
    # compute shrinks, so scaling efficiency falls as the worker count rises.
    eff = 1.0 / (1.0 + 0.04 * (K ** 0.5))  # heuristic efficiency factor
    real = ideal / eff
    print(f"K={K:>4}  ideal={days(ideal):7.2f} d   "
          f"efficiency={eff*100:4.1f}%   realistic={days(real):7.2f} d   "
          f"speedup={single/real:6.1f}x")
Code 15.1.1: A from-scratch training-time estimator. It needs only the FLOPs budget ($6PD$) and a sustained per-GPU rate; the loop contrasts the ideal $K$-fold speedup with a realistic one in which communication overhead grows with the worker count.
total training FLOPs : 4.20e+22
sustained per-GPU    : 4.00e+14 FLOP/s
single-GPU time      : 1215.3 days  (40.5 months)

K=   8  ideal= 151.91 d   efficiency=89.8%   realistic= 169.10 d   speedup=   7.2x
K=  64  ideal=  18.99 d   efficiency=75.8%   realistic=  25.07 d   speedup=  48.5x
K= 256  ideal=   4.75 d   efficiency=61.0%   realistic=   7.79 d   speedup= 156.1x
K=1024  ideal=   1.19 d   efficiency=43.9%   realistic=   2.71 d   speedup= 449.1x
Output 15.1.1: Single-GPU training takes more than forty months; 256 GPUs bring it under eight days and 1024 GPUs under three. The widening gap between the ideal and realistic columns is the efficiency tax that gradient communication imposes, and shrinking it is the technical work of the rest of this chapter.

Two facts jump out of Output 15.1.1, and they organize everything that follows. First, distribution is the difference between a feasible project and an impossible one: forty months becomes days. Second, the speedup is not free. At 1024 GPUs the realistic run takes more than twice the ideal, because every step pays to combine gradients across all workers, and that cost grows as the cluster grows. The right column never quite keeps pace with the left, and the size of that gap is precisely what Section 15.2 onward teaches you to measure and minimize.

Key Insight: Compute Demand Outgrew the Chip, So Distribution Is Mandatory

Training compute for leading models has grown by many orders of magnitude and continues to double on a timescale of months, far faster than any single accelerator's throughput improves. The result is a structural gap: the work a competitive model requires is hundreds to thousands of times what one GPU can deliver in an acceptable time. Multi-device training is therefore not an optimization you reach for when convenient; it is the baseline condition of modern deep learning. The open question is never whether to distribute but how efficiently, because the communication needed to keep many replicas in sync is the tax that decides how much of the ideal speedup you actually keep.

2. Three Things Explode at Once Beginner

The single-GPU wall-clock in Output 15.1.1 is the symptom; the disease has three independent causes, drawn in Figure 15.1.1. The first is dataset size. Pretraining corpora are measured in terabytes of text, images, or video, and a training run sweeps over them for one or more epochs. One GPU reads and processes that stream serially, and at a fixed throughput per device the only way to finish sooner is to have more devices each chew through a different shard. This is the pressure that data parallelism targets head-on: replicate the model, split the data, and the dataset is consumed $K$ times faster.

The second is model size. A foundation model's parameters, plus the optimizer state (momentum and variance for Adam, roughly doubling or tripling the parameter memory) and the activations saved for the backward pass, routinely exceed the memory of a single accelerator. A seven-billion-parameter model in mixed precision already pushes a 40 GB device; a model an order of magnitude larger simply cannot be loaded onto one chip at all, no matter how long you are willing to wait. This pressure is different in kind: no amount of patience helps, because the obstacle is capacity, not speed. It is the subject of Chapter 16, which splits the model itself across devices with tensor, pipeline, and sharded-data parallelism.

The third is the wall-clock that results when the first two combine. Even a model that fits comfortably on one device can demand months of single-GPU time once the dataset is large, exactly as Code 15.1.1 showed. Here distribution is about throughput, not capacity: the model is happy on one chip, but you want the answer in days, so you run $K$ replicas in lockstep. Recognizing which pressure binds for a given job is the first design decision of Part IV, because it dictates which parallelism to reach for, and the three are not mutually exclusive. A frontier run faces all three at once and must compose data, model, and expert parallelism into a single strategy, which is why Chapter 19 can only be written after the chapters that own each axis.

Fun Note: The Patience That Patience Cannot Buy

There is a tempting fantasy that you could simply leave one GPU running and collect your frontier model after a long vacation. Output 15.1.1 sells that fantasy at forty months for a merely mid-size model, and the wall-clock for an actually-large one runs into decades. But the model-size pressure is worse than slow: a model that does not fit in device memory never starts at all, so no amount of waiting produces a result. Time can be traded for speed; it cannot be traded for capacity. That asymmetry is why Part IV needs two different kinds of parallelism rather than one very patient engineer.

3. Synchronous Data-Parallel SGD Is Exact Intermediate

Before we spend a chapter making data parallelism fast, it is worth recalling why it is worth making fast at all: it computes the right answer exactly, not approximately. The argument is the gradient identity from Section 1.1, and it is short enough to restate. The training loss over a minibatch $B$ is an average of per-example losses, and the gradient of an average is the average of the gradients:

$$\nabla L_B(w) = \frac{1}{|B|} \sum_{i \in B} \nabla \ell(w; x_i, y_i).$$

Split the minibatch into $K$ equal shards, one per worker, and let worker $k$ compute the average gradient over its own shard $B_k$. Averaging those $K$ per-worker gradients reconstructs the full-minibatch gradient exactly, because an average of averages over equal-size disjoint groups is the overall average. Each worker holds the same model replica, runs forward and backward on its shard, and the workers then combine their gradients with an all-reduce, the collective introduced in Chapter 4: every worker ends up holding the summed (then averaged) gradient and applies the identical update. This is synchronous data-parallel SGD, and it is mathematically indistinguishable from training on one giant device that held the whole minibatch, a fact Section 10.3 established as an optimization result.

Thesis Thread: The All-Reduce Becomes a Training Loop

The all-reduce you summed by hand in Section 1.1 and analyzed as a collective in Chapter 4 is now promoted to the beating heart of a real training step. In data-parallel deep learning it fires once per iteration, on a vector as long as the model has parameters, across every GPU in the job, while the backward pass is still running. This chapter turns that one collective into a practical PyTorch training loop; Chapter 16 splits it into reduce-scatter and all-gather for sharded training, and Chapter 17 swaps it for all-to-all when experts live on different machines. The primitive does not change; the scale and the schedule do.

Because the math is exact, this chapter spends none of its effort defending correctness and all of it on practicality: how to launch $K$ GPU processes, how to shard the data loader so no two workers see the same example, how to overlap the gradient all-reduce with the backward pass so the network cost hides behind computation, and how to diagnose why your measured speedup falls short of the ideal column in Output 15.1.1. Those are systems problems, and PyTorch already solves most of them for you.

Library Shortcut: DistributedDataParallel and torchrun

Everything described above, replicating the model, sharding the data, all-reducing the gradients each step, and averaging them, is packaged in PyTorch's DistributedDataParallel (DDP). You wrap your model once, and DDP registers backward hooks that fire the gradient all-reduce automatically, bucketed and overlapped with the backward pass, so you never call the collective yourself. The launcher torchrun spawns one process per GPU and wires up the process group:

# Launch with:  torchrun --nproc_per_node=8 train.py
import torch, torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # join the K-GPU process group
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyTransformer().to(local_rank)
model = DDP(model, device_ids=[local_rank])     # gradient all-reduce now automatic

for batch in sharded_loader:                    # each rank sees a different shard
    loss = model(batch).loss
    loss.backward()                             # all-reduce overlaps with backward
    optimizer.step(); optimizer.zero_grad()
Code 15.1.2: The whole data-parallel machinery in a dozen lines. DDP turns the manual sum-and-divide of Section 1.1 into an automatic, bucketed, computation-overlapped all-reduce; Section 15.6 opens this box and shows exactly what it does on each backward pass.

4. The Shape of Part IV Beginner

This chapter is the foundation of a four-axis story, and it helps to see the whole map before diving in. Data parallelism, developed here, is the workhorse: it relieves the dataset-size and wall-clock pressures by replicating the model and splitting the data, and it is the form of parallelism almost every training job uses, often as one layer of a larger scheme. When the model itself overflows one device, Chapter 16 introduces model, pipeline, and sharded-data parallelism (tensor splitting, pipeline stages, and the ZeRO/FSDP family) to spread the parameters, gradients, and optimizer state across devices. Chapter 17 adds expert parallelism, where a mixture-of-experts model routes each token to a few of many expert subnetworks that live on different machines, a sparse relative of data parallelism that trades a dense all-reduce for an all-to-all.

These axes compose. A large run might place an eight-way tensor-parallel model on each node, run pipeline stages across a few nodes, and then replicate that whole arrangement data-parallel across hundreds of such groups, all coordinated so the gradients reduce correctly. The chapters that follow build the pieces one at a time so that the composite, treated in Chapter 19, is assembled from parts you already understand. For this chapter, though, the model fits on one device and only data and time are scarce, which is the cleanest possible setting and the right place to start.

Practical Example: The Pretraining Run That Could Not Wait Three Years

Who: A research engineer at an applied-AI lab tasked with pretraining a seven-billion-parameter language model from scratch on an internal corpus.

Situation: A prototype training loop ran correctly on a single 80 GB GPU and produced sensible loss curves, but a back-of-the-envelope FLOPs estimate put the full trillion-token run at over forty months.

Problem: The model fit on one device, so memory was not the binding constraint; the binding constraint was wall-clock. A three-year run was incompatible with a project measured in quarters.

Dilemma: Shrink the dataset to finish on one GPU sooner and accept a weaker model, or distribute the run across many GPUs and take on a multi-process training loop, a sharded data pipeline, and a fast interconnect to keep the gradient all-reduce from dominating.

Decision: They distributed, because the only scarce resources were data throughput and time, exactly the pressures data parallelism relieves, and the exactness of synchronous SGD meant the distributed model would match the single-GPU one numerically.

How: They wrapped the model in DistributedDataParallel, sharded the data loader across 256 GPUs with a distributed sampler, and launched with torchrun, measuring scaling efficiency at each cluster size to find where communication began to erode the speedup.

Result: The run finished in just under eight days at roughly sixty percent scaling efficiency, matching the 256-GPU row of Output 15.1.1, and the final loss was indistinguishable from a small single-GPU control trained on the same data.

Lesson: When the model fits but time does not, data parallelism is the direct remedy, and the only real engineering is keeping the communication tax small enough that the speedup stays close to ideal.

5. What Data Parallelism Costs, and Where This Chapter Goes Intermediate

The gap between the ideal and realistic columns of Output 15.1.1 is not an accident of the toy formula; it is a real and unavoidable feature of synchronous training, and naming its causes sets the agenda for the chapter. Every step, each of $K$ workers must exchange a gradient vector as long as the model has parameters, and the time that exchange takes does not shrink as you add workers; in the simplest analysis it grows. Meanwhile the compute per worker does shrink as $K$ rises, because each worker handles a smaller slice of the batch. So the ratio of communication to computation worsens with scale, which is exactly why scaling efficiency falls from ninety percent at eight GPUs to forty-four percent at a thousand in the output above. The communication-cost models of Chapter 3 make this trade-off quantitative, and ring and tree all-reduce algorithms from Chapter 4 are the first line of defense.

The rest of Chapter 15 is a sustained attack on that gap. Section 15.2 sets the scene with the hardware hierarchy of single-GPU, multi-GPU, and multi-node training, because where a worker sits relative to its peers determines how expensive the all-reduce is. Later sections develop gradient bucketing and the overlap of communication with the backward pass, the production DDP path, and a careful look at the practical bottlenecks (data loading, load imbalance, and the network) that keep real jobs below the ideal line. The goal throughout is to keep the realistic column of Output 15.1.1 as close to the ideal column as the hardware allows.

Research Frontier: Training at Frontier Scale (2024 to 2026)

The pressure this section describes is most visible at the frontier, where training runs have crossed into tens of thousands of accelerators. The Llama 3 herd (Dubey et al., 2024) was trained on a cluster of up to 16,000 H100 GPUs, and its technical report is unusually open about the engineering: hardware failures interrupted the run frequently enough that elastic recovery, treated in Chapter 18, became a first-class concern rather than an afterthought. At that scale the communication tax dominates, so a parallel research line attacks it directly: low-communication optimizers in the DiLoCo lineage (Douillard et al., 2024) let replicas take many local steps between syncs, pushing data-parallel training toward genuinely geo-distributed clusters, and scaling-law work in the tradition of Chinchilla (Hoffmann et al., 2022) tells practitioners how to spend a fixed compute budget between model size and data so that the expensive run is not wasted. Each thread treats the cost of the gradient combine, the very tax visible in Output 15.1.1, as a quantity to be engineered down, not accepted.

We now have the motivation for Part IV: a compute gap that makes single-GPU training infeasible, three distinct pressures (data, model, wall-clock) that push the work onto many devices, and the reassurance that the central remedy, synchronous data-parallel SGD, is exact rather than approximate. What remains is to make it fast on real hardware. That begins with understanding where the GPUs sit and how they talk, which is the work of Section 15.2.

Exercise 15.1.1: Which Pressure Binds? Conceptual

For each training job, state which of the three pressures from Section 2 (dataset size, model size, wall-clock) is the binding one, and whether you would reach first for data parallelism (this chapter) or model/sharded parallelism (Chapter 16): (a) fine-tuning a 1-billion-parameter model that fits easily on one GPU, on a small labeled set, but needed by tomorrow morning; (b) pretraining a 175-billion-parameter model whose parameters plus Adam state exceed any single device's memory; (c) training a 3-billion-parameter model for many epochs over a 20-terabyte image corpus. Explain why distributing along the wrong axis would not relieve the binding pressure.

Exercise 15.1.2: Re-derive the Wall-Clock Coding

Starting from Code 15.1.1, change the model to 70 billion parameters and the dataset to 2 trillion tokens, and recompute the single-GPU time and the realistic time at $K \in \{256, 1024, 4096\}$. Then replace the heuristic efficiency factor with a slightly more principled one: assume per-step compute is proportional to $1/K$ and per-step communication is constant, and define efficiency as compute divided by compute-plus-communication. Pick the communication constant so that $K=256$ yields about sixty percent efficiency, and report how efficiency now scales to $K=4096$. Which column of your table, ideal or realistic, would you quote to a manager, and why?

Exercise 15.1.3: The Exactness Claim Under Uneven Shards Analysis

Section 3 argues that averaging $K$ per-worker gradients reconstructs the full-minibatch gradient exactly when the shards are equal in size. Suppose instead that one worker is given twice as many examples as the others (a load-imbalance scenario you will meet again in Chapter 18). Show with the averaging identity that a plain unweighted mean of the per-worker average gradients is now biased, and write the size-weighted combination that restores exactness. Explain in one or two sentences why DDP avoids this problem entirely by having every worker process the same number of samples per step, and what it must do with leftover examples at the end of an epoch.