Part IV: Parallel Deep Learning and Large Models
Chapter 15: Data-Parallel Deep Learning

Single-GPU, Multi-GPU, and Multi-Node Training

"On one node we whispered over NVLink and finished before lunch. Then they spread us across the building, and now every gradient takes the scenic route through the switch room."

A GPU Idling on a Communication Barrier
Big Picture

Distributed training runs on a hierarchy of links whose bandwidth falls by an order of magnitude or more at every level, and the single most important placement decision you make is to keep the heaviest communication on the fastest link. A lone accelerator pays no communication cost at all. Pack eight of them into one server and they talk over NVLink and NVSwitch at hundreds of gigabytes per second, so synchronization nearly disappears behind compute. Spread those same eight across eight machines and every gradient now crawls through the datacenter network, an order of magnitude slower, and the synchronization step that was free indoors becomes the thing that dominates the step. This section maps that hierarchy, quantifies the penalty of crossing it, and introduces the launch vocabulary (process, rank, world size, local rank) you use to tell the framework which GPUs sit on which fast link.

The previous section established why deep learning is pushed onto many devices in the first place: models, optimizer state, and the throughput needed to chew through web-scale data all outgrow one accelerator. This section answers the next question, which is where those devices physically live and what it costs to make them act as one. The answer is not a single number. It is a staircase. Compute happens on accelerators that are very fast, but those accelerators are wired together by links that differ enormously in speed depending on whether two GPUs share a board, share a chassis, or sit in different racks. Every distributed-training design is, at bottom, an attempt to route the unavoidable communication onto the fastest wire available and to keep it off the slow ones.

We will build the picture in three steps that mirror the hardware itself: one GPU as the zero-communication baseline, one node of several GPUs joined by a high-bandwidth intra-node fabric, and many nodes joined by the comparatively slow datacenter network. At each step the computation per device stays roughly the same; what changes, and changes sharply, is the cost of the step that combines results. By the end you will be able to read a cluster's topology and predict, before launching anything, where your scaling efficiency will leak away.

Single GPU no communication GPU baseline: 100% of work stays local Single node, 4 GPUs NVLink / NVSwitch, ~300 GB/s NV Switch fast link: sync hides behind compute Many nodes InfiniBand / Ethernet, ~12 GB/s node A node B slow link: cross-node all-reduce dominates
Figure 15.2.1: The training hardware hierarchy. Left: a single GPU pays no communication cost. Center: four GPUs in one node exchange gradients over NVLink and an NVSwitch at hundreds of gigabytes per second (green, fast). Right: GPUs in different nodes must cross the datacenter network at roughly a tenth of that bandwidth (red dashed, slow), so a cross-node all-reduce becomes the bottleneck. The placement rule of this section is to keep the heaviest collective on the green links.

1. The Single GPU: The Zero-Communication Baseline Beginner

A single accelerator is the reference point against which every distributed configuration is measured, and its defining property is that it communicates with nothing. The forward pass, the backward pass, and the optimizer update all read and write the same on-device memory, so there is no gradient to send, no peer to wait for, and no network to saturate. Whatever throughput the device delivers is pure compute. This is why we always quote scaling efficiency relative to one GPU: if one device processes $T_1$ samples per second, then $K$ devices in an ideal world would process $K T_1$, and the fraction of that ideal you actually achieve is your efficiency.

The single-GPU baseline also fixes the quantity that all later communication must hide behind: the per-step compute time. If the backward pass takes 250 milliseconds on one device, then in data-parallel training that 250 milliseconds is the budget within which the gradient exchange must complete if it is to be free. Communication that finishes inside the backward pass is invisible; communication that spills past it is added directly to the step time. Holding the compute fixed and watching the communication grow as we climb the hierarchy is the cleanest way to see the penalty, and it is exactly the experiment we run in Section 4.

2. The Single Node: Many GPUs on One Fast Fabric Beginner

The first rung up from one GPU is several GPUs inside a single server. A typical training node holds eight accelerators, and they are not merely sharing a slow peripheral bus. They are wired together by a dedicated high-bandwidth fabric: NVLink connects GPUs directly, and an NVSwitch lets every GPU in the box reach every other at full NVLink speed, giving effective all-to-all bandwidth in the hundreds of gigabytes per second. This is one to two orders of magnitude faster than the network between machines, and it is the single most valuable resource in distributed training. The interconnect substrate that distinguishes these tiers, from on-board links to the datacenter fabric, is laid out in detail in Section 4.2.

Because intra-node bandwidth is so high, an eight-GPU data-parallel job on one node behaves almost like a single very large accelerator. The gradient all-reduce that synchronizes the eight replicas runs over NVSwitch and completes in a few milliseconds, comfortably inside the backward pass, so scaling efficiency stays near the ceiling. This is the regime where adding GPUs feels free, and it is why the most common first move in scaling a training job is to fill one node before reaching for a second. The collective that does this synchronizing, ring or tree all-reduce, is exactly the primitive built from scratch in Chapter 4; here it simply runs on the fastest wire in the building.

Key Insight: Bandwidth Falls Off a Cliff at the Node Boundary

The defining fact of distributed-training hardware is not that links are slow; it is that they are slow in tiers. GPUs on one board talk over NVLink; GPUs in one chassis talk through an NVSwitch at comparable speed; GPUs in different chassis must cross the datacenter network, which is typically an order of magnitude slower. The node boundary is a bandwidth cliff. Every parallelism-placement decision in Part IV reduces to one rule: arrange the work so that the collective moving the most bytes per step stays inside a node, and only the cheaper, less frequent communication is allowed to cross the cliff.

3. Many Nodes: Crossing the Datacenter Network Intermediate

To go beyond eight or sixteen GPUs you must use multiple nodes, and now the gradients have to leave the chassis. The link between nodes is the datacenter network, either InfiniBand (common in HPC and AI clusters, with bandwidths around 100 to 400 gigabits per second and remote direct memory access that bypasses the CPU) or Ethernet (cheaper, more ubiquitous, often slower and higher latency). Even a fast InfiniBand link delivers on the order of 12 to 50 gigabytes per second per node, a fraction of intra-node NVSwitch bandwidth, and it is shared by every GPU on the node that wants to talk off-box. The cross-node all-reduce, which on one node was a rounding error, becomes the most expensive operation in the training step.

This is why topology matters and why frameworks and schedulers go to such lengths to learn it. A well-designed multi-node all-reduce is hierarchical: it first reduces within each node over the fast NVSwitch fabric, then performs a single smaller reduction across nodes over the slow network, then broadcasts the result back down inside each node. Doing so means only one node's worth of data crosses the cliff instead of every GPU's, which can cut cross-node traffic by the number of GPUs per node. Achieving that requires the runtime to know which ranks are co-located, which is precisely the topology-aware placement problem developed in Section 4.9 and handed to the cluster scheduler in later chapters of this part.

Fun Note: The Slowest Link Sets the Pace

A training cluster is a relay race where the baton is a gradient and one runner in every lap has to jog across the parking lot while the rest sprint indoors. It does not matter how blazing the indoor sprinters are; the lap time is set by the parking-lot crossing. Buying faster GPUs to speed up a network-bound job is the classic mistake of hiring faster sprinters for a race whose bottleneck is the parking lot.

4. Quantifying the Inter-Node Penalty Intermediate

Numbers make the cliff concrete. A ring all-reduce over $K$ GPUs moves, per GPU, about $2\frac{K-1}{K}$ times the gradient size, regardless of how many GPUs participate, because the per-GPU traffic saturates near twice the payload. The time to complete it is therefore governed almost entirely by the bandwidth of the link each hop crosses, plus a small per-hop latency. The model below computes that time for the same eight-GPU, one-billion-parameter job under two layouts: all eight GPUs on one node over NVSwitch, and the same eight spread one-per-node over InfiniBand. It then converts each communication time into a scaling efficiency against a fixed 250-millisecond compute budget, using $\text{efficiency} = \frac{t_{\text{compute}}}{t_{\text{compute}} + t_{\text{comm}}}$.

"""Model all-reduce time and scaling efficiency: 8 GPUs intra-node vs across nodes."""

# Ring all-reduce moves 2*(K-1)/K * (message bytes) per GPU, independent of K's
# growth: the per-GPU traffic approaches 2x the gradient size. What changes
# between the two layouts is the LINK each hop crosses.
def ring_allreduce_seconds(grad_bytes, link_gbytes_per_s, latency_us, K):
    hops = 2 * (K - 1)                       # K-1 reduce-scatter + K-1 all-gather
    chunk = grad_bytes / K                   # ring sends one chunk per hop
    bw = link_gbytes_per_s * 1e9             # GB/s -> bytes/s
    transfer = hops * chunk / bw             # bandwidth term
    latency = hops * latency_us * 1e-6       # per-hop startup cost
    return transfer + latency

P = 1_000_000_000                            # 1B-parameter model
grad_bytes = P * 2                           # fp16 gradients, 2 bytes each
K = 8

# Intra-node: 8 GPUs on one server linked by NVSwitch (~300 GB/s effective, sub-us).
intra = ring_allreduce_seconds(grad_bytes, 300.0, 1.0, K)
# Inter-node: 8 GPUs spread one-per-server over 100 Gb/s InfiniBand (~12.5 GB/s, ~5us).
inter = ring_allreduce_seconds(grad_bytes, 12.5, 5.0, K)

# Per-step compute the all-reduce must hide behind (a fixed accelerator number).
compute_s = 0.250                            # 250 ms of backward compute per step

def efficiency(comm_s):
    # Ideal step = compute only. Real step = compute + exposed communication.
    return compute_s / (compute_s + comm_s)

print(f"model parameters        : {P:,}")
print(f"gradient payload (fp16) : {grad_bytes/1e9:.2f} GB")
print(f"compute per step        : {compute_s*1e3:.0f} ms")
print()
print(f"intra-node all-reduce   : {intra*1e3:7.2f} ms   (NVSwitch ~300 GB/s)")
print(f"inter-node all-reduce   : {inter*1e3:7.2f} ms   (InfiniBand ~12.5 GB/s)")
print(f"inter/intra slowdown    : {inter/intra:7.1f}x")
print()
print(f"scaling efficiency intra: {efficiency(intra)*100:6.1f} %")
print(f"scaling efficiency inter: {efficiency(inter)*100:6.1f} %")
print(f"efficiency lost crossing nodes: {(efficiency(intra)-efficiency(inter))*100:5.1f} percentage points")
Code 15.2.1: A first-principles model of ring all-reduce time on two link speeds, converted to scaling efficiency against a fixed compute budget. The only inputs that differ between the two runs are the link bandwidth and per-hop latency; the gradient payload and GPU count are identical.
model parameters        : 1,000,000,000
gradient payload (fp16) : 2.00 GB
compute per step        : 250 ms

intra-node all-reduce   :   11.68 ms   (NVSwitch ~300 GB/s)
inter-node all-reduce   :  280.07 ms   (InfiniBand ~12.5 GB/s)
inter/intra slowdown    :    24.0x

scaling efficiency intra:   95.5 %
scaling efficiency inter:   47.2 %
efficiency lost crossing nodes:  48.4 percentage points
Output 15.2.1: The same eight GPUs synchronizing the same two-gigabyte gradient finish in 11.68 ms on one node but 280 ms across nodes, a 24x slowdown. The fast layout hides the all-reduce behind compute and holds 95.5% efficiency; the slow layout exposes it and drops to 47.2%, losing 48 percentage points purely to where the GPUs were placed.

The lesson is stark and it is entirely about placement, not about the GPUs. Identical hardware doing identical math loses nearly half its efficiency the moment the gradient has to cross the node boundary instead of staying on the fabric. This single calculation justifies the dominant heuristic of Part IV: fill a node before you span nodes, and when you must span nodes, make sure the framework reduces within each node first so that only one node's payload ever touches the slow link. The hierarchical all-reduce described in Section 3 is what converts the 280-millisecond catastrophe back toward the 12-millisecond ideal.

Thesis Thread: Scale-Out Lives or Dies on the Communication Tier

This section is the clearest statement of the book's central tension. Scale-out works because computation partitions perfectly across machines, as the exact-gradient identity of Section 1.1 proved. Scale-out is hard because the combining step does not partition for free: its cost is set by the slowest link any byte must cross. Every method in the rest of Part IV, from gradient bucketing and overlap (Section 15.5) to the sharded collectives of Chapter 16, is a maneuver to keep the heaviest communication on the fastest tier of this hierarchy. When you meet a new parallelism strategy later, ask first: which collective does it run, and which link does that collective cross?

5. The Launch Model: Processes, Ranks, World Size, Local Rank Intermediate

To place work on the right link you need a vocabulary for naming the workers, and PyTorch (like MPI before it) gives every participating process four pieces of identity. The full set of co-operating processes is the world, and its size, the world size, is the total number of workers, usually one per GPU. Each process has a global rank, an integer from $0$ to world size minus one, that uniquely names it across the whole job. Within a single node, each process also has a local rank, which says which GPU on that machine it owns. These ideas are the training-loop face of the process-and-identity concepts introduced in Section 2.1; here they acquire a hardware meaning, because the local rank is exactly what tells a process which GPU on the fast fabric it controls.

The distinction between global rank and local rank is what makes topology-aware collectives possible. A hierarchical all-reduce groups processes by node, and it knows which processes are co-located precisely because they share a node and differ only in local rank. Get the local rank wrong and two processes may try to drive the same GPU, or the runtime may treat co-located GPUs as if they were remote and route their traffic over the network needlessly. The mapping from $(\text{node}, \text{local rank})$ to global rank is therefore not bookkeeping; it is the data structure that decides which gradients stay on NVSwitch and which cross the cliff.

Library Shortcut: torchrun Computes Ranks and World Size for You

You never assign ranks by hand. The torchrun launcher spawns one process per GPU, sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables for each, and wires up the rendezvous so the processes can find one another. Your script reads those variables and binds itself to the GPU named by its local rank. A multi-node job that would otherwise need a hand-rolled launcher, an address-discovery protocol, and per-process environment plumbing collapses to one command per node plus a few lines of setup; torchrun handles process spawning, the rendezvous handshake, and restart-on-failure.

# Launch on each node, e.g. 2 nodes x 8 GPUs = world size 16:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train.py
import os, torch, torch.distributed as dist

dist.init_process_group("nccl")                 # join the world over NCCL
rank       = dist.get_rank()                     # global id, 0 .. world_size-1
world_size = dist.get_world_size()               # total workers across all nodes
local_rank = int(os.environ["LOCAL_RANK"])       # which GPU on THIS node
torch.cuda.set_device(local_rank)                # bind to the fast-fabric GPU we own

if rank == 0:
    print(f"world size {world_size}: rank 0 leading the job")
Code 15.2.2: The launch boilerplate shown illustratively (it requires GPUs and a running rendezvous to execute). torchrun populates the environment variables; the script binds each process to the GPU named by its local rank, the hook that keeps intra-node collectives on the fast fabric.

6. Reading the Efficiency Drop in Practice Advanced

Putting the pieces together, the efficiency of a data-parallel job is a staircase that steps down each time the dominant collective is forced onto a slower tier. Inside one node you sit near the top of the staircase, where Output 15.2.1 showed 95% efficiency. The first multi-node hop is the largest single drop, because gradients now traverse the network for the first time; subsequent nodes add less, since a well-built hierarchical all-reduce already crosses the cliff only once per node. The practical consequence is that the curve of efficiency versus GPU count has a visible knee at the node boundary, and a measured scaling study (the subject of Section 15.9) will show it plainly.

Two design levers move that knee. The first is reducing the bytes that must cross: gradient compression, lower precision, and the bucketed overlap of Section 15.5 all shrink or hide the payload. The second is ensuring the framework actually performs the within-node reduction first, which depends on correct local-rank assignment and topology-aware grouping. When a real job scales worse than the hierarchy predicts, the cause is almost always one of these two: too many bytes crossing, or the runtime failing to keep intra-node traffic off the network. Diagnosing which one binds is the everyday work of scaling a training run, and it is exactly the muscle this section builds.

Practical Example: The 64-GPU Run That Scaled Worse Than 8

Who: An ML platform engineer at a foundation-model startup bringing up a new training cluster.

Situation: An eight-GPU single-node run hit 94% scaling efficiency, so the team expanded the same job to 64 GPUs across eight nodes expecting a near-linear speedup.

Problem: The 64-GPU run delivered only about 3.5x the single-node throughput instead of the hoped-for 8x, and GPU utilization sat low while the network saturated.

Dilemma: Buy faster GPUs and more of them (a scale-up reflex that would not touch a network-bound step), or invest in understanding why the cross-node all-reduce was so expensive and whether it could be made hierarchical.

Decision: They treated it as a placement and topology problem, because Code 15.2.1 predicted exactly this kind of collapse the moment gradients crossed the node boundary without a within-node reduction first.

How: They verified local-rank assignment so co-located GPUs were grouped, enabled NCCL's hierarchical all-reduce so each node reduced internally over NVSwitch before the single cross-node exchange, and switched gradients to fp16 to halve the bytes on the wire.

Result: Cross-node traffic fell by roughly the eight GPUs per node, the 64-GPU run climbed past 6x the single-node throughput, and the network stopped being the bottleneck.

Lesson: When scaling stalls at the node boundary, the fix lives in the communication tier, not the compute tier. Keep the heaviest collective on the fast fabric and let only one node's payload cross the cliff.

Research Frontier: Training Across the Cliff Without Paying for It (2024 to 2026)

If the node boundary is a bandwidth cliff, a vigorous research line asks how far you can push training across it while paying as little as possible. Low-communication optimizers in the DiLoCo lineage (Douillard et al., 2024) let each node, or each cluster, take many local steps between rare cross-node synchronizations, trading a little statistical efficiency for an order-of-magnitude reduction in inter-node traffic, and have been demonstrated training language models over ordinary internet links. Topology-aware collective libraries continue to improve the hierarchical all-reduce, with NCCL and successors auto-detecting NVLink and NVSwitch layouts to keep traffic on-fabric, while emerging interconnects such as NVLink-based multi-node switches push the fast tier itself across more devices. Complementary work overlaps the cross-node exchange so aggressively with compute that the exposed communication of Output 15.2.1 nearly vanishes. The unifying theme is that the hierarchy of this section is not fixed: research is steadily moving the bandwidth cliff outward and teaching optimizers to need it less.

We now have the map of where training hardware lives and what it costs to make it cohere: one GPU as the free baseline, a node of GPUs on a fast fabric where synchronization hides behind compute, and the datacenter network where the same synchronization can dominate. With the launch vocabulary of ranks and world size in hand, we are ready to turn this hierarchy into an actual algorithm. The next section, Section 15.3, builds data parallelism proper: how each worker holds a full model replica, processes its own shard, and synchronizes gradients with the all-reduce whose cost we just learned to read off the topology.

Exercise 15.2.1: Where Does the Knee Fall? Conceptual

You have access to nodes of eight GPUs each, joined by NVSwitch inside a node and 100 Gb/s InfiniBand between nodes. For a data-parallel job, sketch the qualitative shape of scaling efficiency versus total GPU count from 1 to 64, and mark where you expect the largest single drop. Explain in terms of the bandwidth tiers of Section 2 and 3 why the first multi-node hop costs more efficiency than going from, say, 16 to 24 GPUs, assuming the runtime performs a hierarchical all-reduce. State one configuration choice that would move the knee to the right.

Exercise 15.2.2: Re-Run the Cliff Coding

Reproduce Code 15.2.1, then extend it. First add a third layout: 64 GPUs as eight nodes of eight, using a hierarchical all-reduce that reduces within each node over NVSwitch and exchanges only one node's payload across the InfiniBand link. Compute its all-reduce time and efficiency, and compare it to a naive flat all-reduce over all 64 GPUs on the slow link. Then sweep the gradient payload from 0.25 GB to 8 GB and report at which payload the inter-node flat configuration drops below 50% efficiency. Explain why the hierarchical layout degrades so much more gracefully.

Exercise 15.2.3: Compute-to-Communication Crossover Analysis

Using the efficiency model $\frac{t_{\text{compute}}}{t_{\text{compute}} + t_{\text{comm}}}$ from Section 4, derive the per-step compute time $t_{\text{compute}}$ at which an inter-node all-reduce of a two-gigabyte fp16 gradient (use the 280 ms from Output 15.2.1) would still achieve 80% efficiency. Interpret the result: what does it say about the model and batch sizes for which multi-node data parallelism remains worthwhile on a slow network, and how does the answer change if gradient compression halves the 280 ms? Connect your reasoning to the communication-cost model of Chapter 3.