Part I: Foundations of Distributed AI
Chapter 4: Communication Primitives for Distributed Training

Topology-Aware Placement

"They gave me rank 5 and put me three switches away from rank 4. Every step, I wait for a packet that has to cross half the building. My neighbors on the same board finished an hour ago."

A GPU Idling on a Communication Barrier
Big Picture

A collective operation has no fixed cost; its cost is decided by which physical links the participating ranks must talk over, and that is decided by where the scheduler placed them. The previous sections treated all-reduce, all-gather, and all-to-all as if the wires underneath were uniform. They are not. Inside a modern training node, GPUs talk to each other over NVLink at hundreds of gigabytes per second; between nodes they talk over a network an order of magnitude slower. The same eight-way all-reduce can finish in milliseconds when the eight ranks share one node, or take dozens of times longer when they are scattered across the cluster. Topology-aware placement is the discipline of arranging ranks, and structuring collectives, so the heaviest traffic rides the fastest links. This section shows the cost gap with real numbers, explains the hierarchical collectives that exploit it, and gives the rule for mapping a parallelism strategy onto an interconnect.

By now you can name the collectives that distributed training runs on, and in Section 4.8 you saw the libraries (NCCL, MPI, Gloo) that implement them. Those libraries are fast, but they cannot change one fact: a collective is only as quick as the slowest link its ranks must cross. A GPU does not care, when it issues an all-reduce, whether its partner is a centimeter away on the same circuit board or a hundred meters away through three network switches. The cost model of Chapter 3 does care, because the bandwidth term changes by more than an order of magnitude between those two cases. So the question this section answers is not "which collective?" but "given the collective, where should its ranks live, and how should the operation be staged across the interconnect's levels?"

Recall the alpha-beta model from Section 3.8: sending $b$ bytes over a link costs roughly $\alpha + b/\beta$, a fixed latency $\alpha$ plus a bandwidth term, where $\beta$ is the link's bandwidth. A ring all-reduce over $n$ ranks runs $2(n-1)$ steps, each moving a $1/n$ share of the buffer, so its total time is

$$T_{\text{ring}}(n) = 2(n-1)\left(\alpha + \frac{B/n}{\beta}\right),$$

where $B$ is the buffer size in bytes. The structure of the formula is fixed, but $\alpha$ and $\beta$ are properties of the link, and the link is chosen by placement. Put the ring on NVLink and $\beta$ is hundreds of gigabytes per second; scatter it across nodes and $\beta$ collapses to the network rate. That single substitution is the whole subject of this section.

Two nodes: fast NVLink inside, slow network between Node A G0 G1 G2 G3 NVLink ~300 GB/s Node B G4 G5 G6 G7 NVLink ~300 GB/s network ~12.5 GB/s Hierarchical all-reduce in three stages 1. reduce within node over NVLink (parallel) 2. all-reduce across nodes leaders only, over network 3. broadcast result down over NVLink (parallel) The bulk of the bytes never leave NVLink; only a reduced share among leaders crosses the slow link. Compare to a flat ring whose every hop rides the network: same math, far slower wires.
Figure 4.9.1: Topology and the hierarchical collective. Eight GPUs sit on two nodes; the four on each node share fast NVLink (green), while the two nodes are joined only by a slower network link (dashed red). A hierarchical all-reduce reduces within each node over NVLink, exchanges a reduced buffer across nodes among one leader per node, then broadcasts the result back down. The heavy traffic stays on the fast fabric; only a thin layer crosses the slow one.

1. The Same Collective, Two Very Different Costs Beginner

The cleanest way to feel topology is to price one fixed operation, an eight-way all-reduce of a one-gigabyte gradient buffer, under two placements. In the first, the scheduler puts all eight ranks on a single node, so the ring lives entirely on NVLink. In the second, the eight ranks land on eight separate nodes, so every hop of the ring crosses the network. Nothing about the algorithm changes; only $\beta$ changes, and with it the elapsed time. The demo below builds both cases from the alpha-beta formula and reports the gap.

# Topology-aware placement: where a rank sits decides what its collective costs.
# Alpha-beta model: moving b bytes over a link costs alpha + b/bandwidth.
# A ring all-reduce over n ranks runs 2*(n-1) steps, each moving a (1/n)-sized
# chunk; the ring is gated by the SLOWEST link it must span.

bytes_per_elem = 4
P = 256 * 1024 * 1024           # 256M-parameter gradient buffer
B = P * bytes_per_elem          # total bytes

# Link model: (one-way latency seconds, bandwidth bytes/sec).
NVLINK  = (1.0e-6, 300e9)       # intra-node NVSwitch fabric, ~300 GB/s
NETWORK = (5.0e-6, 12.5e9)      # cross-node 100 Gb/s InfiniBand, ~12.5 GB/s

def ring_allreduce_time(n, link, B):
    alpha, bw = link
    steps = 2 * (n - 1)
    chunk = B / n                            # each step moves one (1/n) chunk
    return steps * (alpha + chunk / bw)

# Same 8-way all-reduce, placed well vs badly.
good8 = ring_allreduce_time(8, NVLINK, B)    # all 8 ranks share one node
bad8  = ring_allreduce_time(8, NETWORK, B)   # 8 ranks scattered across 8 nodes

print(f"8 ranks on one node, all NVLink   : {good8*1e3:8.2f} ms")
print(f"8 ranks scattered, all on network : {bad8*1e3:8.2f} ms")
print(f"speedup from good placement       : {bad8/good8:8.1f}x")
Code 4.9.1: Pricing one fixed eight-way all-reduce under good (single-node NVLink) and bad (scattered, cross-node) placement. Only the link's bandwidth $\beta$ differs between the two calls.

Running this with C:\Python314\python.exe produced the numbers below, taken verbatim from the run.

8 ranks on one node, all NVLink   :     6.28 ms
8 ranks scattered, all on network : 150.39 ms
speedup from good placement       :     24.0x
Output 4.9.1: The identical collective is twenty-four times slower when its ranks are scattered. The algorithm did not change; the wires did.

Twenty-four times is not a rounding effect; it is the bandwidth ratio between NVLink and the network, amplified slightly by the higher per-step latency of the network. For a training step that issues this all-reduce once per iteration, the difference between $6$ milliseconds and $150$ milliseconds is the difference between communication hiding behind compute and communication dominating the step. The scheduler made that decision when it chose where rank 5 would live, long before the first gradient was computed.

Key Insight: Placement Is a Performance Knob, Not a Detail

A collective's cost is set by the slowest link its ranks must cross, and that link is chosen by placement. Two runs of the same job, with the same code and the same hardware count, can differ by an order of magnitude purely because one packed the heavy collective onto NVLink and the other spread it across the network. Before tuning algorithms, tune placement: keep the ranks that talk most onto the fastest fabric they can share.

2. Hierarchical Collectives: Reduce Local, Exchange Global Intermediate

Most real jobs cannot fit on one node, so some traffic must cross the network. The trick is to cross it as little as possible. A hierarchical collective exploits the two-level structure of the interconnect: it first reduces within each node over the fast link, then exchanges only the reduced result across nodes over the slow link, then broadcasts the final answer back down inside each node. Instead of all $n$ ranks meeting on the network, only one leader per node does, so the slow stage runs over the number of nodes rather than the number of GPUs. Figure 4.9.1 sketches the three stages.

The demo below extends Code 4.9.1 with a hierarchical variant and prices a thirty-two-GPU all-reduce two ways: a flat ring whose ranks are scattered across thirty-two nodes, and a hierarchical collective on four eight-GPU nodes. Both move the same buffer; they differ only in how they use the topology.

def hierarchical_time(nodes, gpus_per_node, B):
    # Stage 1: ring-reduce inside every node over NVLink (nodes run in parallel).
    t_intra = ring_allreduce_time(gpus_per_node, NVLINK, B)
    # Stage 2: one leader per node all-reduces across nodes over the network.
    t_inter = ring_allreduce_time(nodes, NETWORK, B)
    # Stage 3: broadcast the reduced result back down each node over NVLink.
    a_nv, bw_nv = NVLINK
    t_bcast = a_nv + B / bw_nv
    return t_intra + t_inter + t_bcast

nodes, gpus_per_node = 4, 8
world = nodes * gpus_per_node                       # 32 GPUs
flat32 = ring_allreduce_time(world, NETWORK, B)     # all 32 ranks on the network
hier32 = hierarchical_time(nodes, gpus_per_node, B) # NVLink + a thin network layer

print(f"flat ring, scattered (bad)  : {flat32*1e3:8.2f} ms")
print(f"hierarchical NVLink+network : {hier32*1e3:8.2f} ms")
print(f"speedup from hierarchy      : {flat32/hier32:8.2f}x")
Code 4.9.2: A three-stage hierarchical all-reduce compared with a flat ring over the same thirty-two ranks. The hierarchical version keeps the bulk of the traffic on NVLink and crosses the network among four leaders, not thirty-two ranks.
flat ring, scattered (bad)  :   166.74 ms
hierarchical NVLink+network :   138.74 ms
speedup from hierarchy      :     1.20x
Output 4.9.2: For this one-gigabyte buffer the hierarchical collective is about $1.2$ times faster than the flat scattered ring. The gain is real but modest, because at this size the operation is bandwidth-bound and the reduced buffer still has to cross the network once.

The honest reading of Output 4.9.2 is worth dwelling on. Hierarchy buys a large win when an operation is latency-bound, because it slashes the number of network hops from $2(n-1)$ down to $2(\text{nodes}-1)$; it buys a smaller win when an operation is bandwidth-bound, because a reduced copy of the buffer must still traverse the slow link no matter how the ranks are grouped. The dramatic twenty-four-fold gap in Output 4.9.1 came from keeping the traffic entirely off the network; once any data must cross it, the network's bandwidth sets a floor. Both effects are real, and a good placement chases both: keep what can stay local local, and minimize what must go global.

Fun Note: The All-Reduce That Took the Scenic Route

A team once debugged a training run that was mysteriously three times slower than an identical run from the week before. Same code, same model, same GPU count. The culprit was the scheduler: a maintenance window had freed up nodes in a different rack, and the placement logic, optimizing for "any free GPU," had spread eight ranks across eight nodes instead of packing them onto one. The gradients were taking the scenic route through the network every step. Pinning the job to a single node restored the speed. The fix was zero lines of model code and one line of scheduler configuration.

3. Mapping Parallelism Onto the Topology Intermediate

Large models are trained with several forms of parallelism at once, and each form runs a different collective with a different traffic volume. Tensor parallelism, which splits a single layer across devices, issues an all-reduce on every forward and backward pass and is the heaviest communicator. Data parallelism issues one gradient all-reduce per step. Pipeline parallelism sends only activations between adjacent stages, the lightest of the three. The placement rule follows directly: assign the heaviest collective to the fastest link. Keep tensor-parallel groups inside a node on NVLink; let data-parallel and pipeline groups span nodes over the network, where their lighter traffic can tolerate the slower link.

Table 4.9.1 states this mapping as a checklist. It is the same logic that the three-dimensional parallelism strategies of Chapter 16 formalize, where tensor, pipeline, and data dimensions are deliberately laid out so that the innermost (most communication-intensive) dimension maps to the innermost (fastest) level of the interconnect.

Table 4.9.1: Mapping parallelism strategies onto the interconnect. The heaviest communicator gets the fastest link; the lightest tolerates the slowest.
ParallelismCollective and frequencyTrafficPreferred link
Tensor parallelAll-reduce, twice per layer per stepHeaviestIntra-node NVLink
Data parallelAll-reduce, once per stepMediumCross-node network
Pipeline parallelPoint-to-point activations between stagesLightestCross-node network
Expert parallelAll-to-all, per MoE layerBursty, largeIntra-node where possible

The fourth row, expert parallelism, is the awkward case: its all-to-all (introduced in Section 4.6) moves tokens between every pair of expert shards and is both large and bursty, so it wants the fast link but rarely fits inside one node. That tension is exactly what Chapter 16 and the expert-parallel material that follows it must navigate, and it is one reason rail-optimized network designs (Section 5 below) exist.

Practical Example: The 70B Run That Found Its 30 Percent

Who: A systems engineer on a foundation-model team bringing up training for a 70-billion-parameter model on a 256-GPU cluster of eight-GPU nodes.

Situation: The job used eight-way tensor parallelism, four-way pipeline parallelism, and eight-way data parallelism, composed into a three-dimensional layout.

Problem: Measured throughput sat at roughly 70 percent of the projected number, and profiling showed the GPUs stalling on tensor-parallel all-reduces far longer than the model predicted.

Dilemma: Either accept the loss and add more GPUs to brute-force the schedule, or chase the stall and risk days of layout experiments with no guarantee of a win.

Decision: They chased the stall, because the all-reduce time on the tensor-parallel groups was an order of magnitude above the NVLink prediction, a signature of cross-node placement.

How: Inspecting the rank-to-node map revealed that the tensor-parallel groups of eight had been split across two nodes instead of packed onto one. They re-expressed the layout so each tensor-parallel group of eight occupied exactly one node, keeping that heavy all-reduce on NVLink, and let the data-parallel dimension span nodes instead.

Result: Throughput rose by about 30 percent with no new hardware, recovering nearly the full projected number, because the heaviest collective now rode the fastest link exactly as Output 4.9.1 says it should.

Lesson: When a composed parallelism strategy underperforms, check the rank-to-link map before touching the model. The most expensive collective belongs on the fastest fabric, and getting that one mapping right is often the single largest free win in a large training job.

4. From Placement to the Scheduler Advanced

Topology awareness is not only a property a job sets for itself; it is a property the cluster scheduler must respect. A job that needs its eight tensor-parallel ranks on one node is useless if the scheduler hands it eight ranks on eight different nodes, and a synchronous training job that runs one straggler-bound collective per step cannot tolerate having even one rank placed far away. This is why schedulers for AI clusters offer gang scheduling (all ranks of a job start together or not at all) and topology-aware placement (ranks of a group are co-located on the same node or rack). Those mechanisms are the cluster-side counterpart of everything in this section, and they get their full treatment in Chapter 33.

The contract between job and scheduler is a small piece of information: the job declares its communication groups and how tightly each must be packed, and the scheduler honors that when it assigns ranks to hardware. A tensor-parallel group declares "keep these eight together on one node"; a data-parallel group declares "these can span nodes." When the declaration is missing or ignored, you get the scenic-route all-reduce from the fun note above. When it is honored, the cost model of Chapter 3 becomes predictive rather than aspirational.

Library Shortcut: NCCL and torch.distributed Discover the Topology for You

You do not hand-build the hierarchical collective of Code 4.9.2 in practice. NCCL detects the interconnect at startup (which GPUs share NVLink, which share a PCIe switch, which are reachable only over the network) and automatically constructs hierarchical rings and trees that keep traffic on the fastest available links. Your job is to give it a good rank-to-device mapping; it does the staging. Process subgroups let you declare the communication structure so the library and the launcher can place and route accordingly:

import torch.distributed as dist

dist.init_process_group("nccl")                 # NCCL probes NVLink/PCIe/network here
# Declare the tensor-parallel group: these ranks should share a node.
tp_group = dist.new_group(ranks=[0, 1, 2, 3, 4, 5, 6, 7])
# A data-parallel group whose members may span nodes.
dp_group = dist.new_group(ranks=[0, 8, 16, 24])

# The same all_reduce call; NCCL routes it hierarchically over the real topology.
dist.all_reduce(local_grad, group=tp_group)     # stays on NVLink if ranks 0-7 co-locate
Code 4.9.3: The roughly fifteen lines of manual three-stage hierarchy in Code 4.9.2 collapse to declaring process groups and issuing ordinary collectives. NCCL handles topology discovery, ring and tree construction, and the choice of which links to use; the launcher (torchrun with a topology-aware scheduler) handles getting rank 0 through 7 onto the same node.

5. The Interconnect Is Getting Smarter Advanced

Placement matters more every year, because the gap between the fastest and slowest links keeps widening and the fabrics keep gaining structure that a placement strategy can exploit. Three current directions are worth naming, all of which a topology-aware job must reason about.

Research Frontier: Rails, Switches, and Light (2024 to 2026)

NVLink and NVSwitch have grown into multi-node domains: NVIDIA's GB200 NVL72 (2024) wires 72 GPUs into a single NVLink fabric, so what used to be a slow cross-node hop now stays on the fast domain, and the "node" boundary that this section reasons about moves outward to 72 GPUs. At the network layer, rail-optimized topologies wire each GPU to a dedicated network rail so that same-rank GPUs across nodes share a low-contention path; this is the placement assumption behind the all-to-all of expert parallelism and is documented in the Llama 3 training infrastructure report (Meta, 2024) and in cluster designs from major operators. Further out, optical circuit switching, used in Google's TPU pods and explored for GPU clusters, reconfigures the physical topology to match a job's communication pattern rather than forcing the job to adapt to a fixed wiring. Across all three, the lesson of this section only sharpens: the interconnect is becoming a structured resource that rewards jobs which map their heaviest collectives onto its fastest, least-contended paths, and penalizes jobs that ignore it.

We now have the placement rule (heaviest collective on the fastest link), the mechanism that exploits topology automatically (hierarchical collectives), and the cluster-side contract that makes placement happen (gang and topology-aware scheduling). One large lever remains unpulled. Even a perfectly placed all-reduce still costs time, and that time is wasted if the GPUs sit idle waiting for it. The next section shows how to hide the collective entirely by overlapping it with the backward pass that produces the gradients, so that communication and computation run at once. That technique, and the gradient bucketing that enables it, is the subject of Section 4.10.

Exercise 4.9.1: Read the Placement From the Numbers Conceptual

A colleague reports that an eight-way gradient all-reduce of a $1$ GB buffer is taking about $150$ milliseconds per step, and asks whether the algorithm is buggy. Using Output 4.9.1, diagnose the likely cause without seeing any code. What single piece of cluster information would confirm your diagnosis, and what one change would you propose first? Explain why an algorithmic change (a different all-reduce implementation) is unlikely to close most of the gap.

Exercise 4.9.2: When Does Hierarchy Pay? Coding

Extend Code 4.9.2 to sweep the buffer size $P$ from $2^{16}$ to $2^{28}$ parameters and plot the speedup of the hierarchical collective over the flat scattered ring at each size. Confirm from your plot that the speedup is largest for small buffers (latency-bound) and shrinks toward a floor for large buffers (bandwidth-bound). Then explain, using the $2(n-1)$ step count in the formula, exactly why shrinking the number of network hops helps the small-buffer case so much more than the large-buffer case.

Exercise 4.9.3: Lay Out a 3D-Parallel Job Analysis

You must train on $64$ GPUs arranged as $8$ nodes of $8$ GPUs, using $8$-way tensor parallelism, $2$-way pipeline parallelism, and $4$-way data parallelism ($8 \times 2 \times 4 = 64$). Using Table 4.9.1, assign each of the three parallel dimensions to a level of the interconnect (intra-node NVLink vs cross-node network) and justify each choice from the collective it runs and its traffic. Then estimate, with the alpha-beta intuition from this section, which single misplacement would cost you the most, and argue why. Forward-reference how Chapter 16 would express this layout.