Part I: Foundations of Distributed AI
Chapter 4: Communication Primitives for Distributed Training

The Communication Substrate

"I only ever talk to my right-hand neighbour. Somehow, by the end of the round, I am holding everyone's gradient. Nobody has explained to me how this is not gossip."

A Worker On A Ring
Big Picture

Every collective operation in distributed training, no matter how grand its name, is assembled from one humble primitive: a worker sending bytes to another worker and a worker receiving them. Before we study all-reduce, all-gather, and all-to-all as named operations, we need the substrate they stand on: the point-to-point send and recv that move a buffer from one rank to one other rank, the process-group abstraction that names the participants (a world of ranks), and the physical wires those bytes actually cross. Those wires are not all alike. A transfer inside one machine over NVLink and the same transfer between two machines over Ethernet differ by roughly two orders of magnitude in time, and that gap is the single physical fact that makes topology matter and makes a careless collective slow. This section gives you the substrate and just enough interconnect to plug real numbers into the alpha-beta model from Section 3.8; the named collectives that ride on it begin in Section 4.3.

The previous section, Section 4.1, argued that communication, not compute, is what bounds distributed training once you add enough machines. That argument leaves a concrete question unanswered: what exactly is "communication", mechanically, at the level a program controls? The answer is smaller than the word suggests. At bottom there is only one operation, one worker handing a block of bytes to another worker across a link. Everything else, including the whole vocabulary of collectives this chapter develops, is a pattern of those single handoffs orchestrated so that many workers reach a shared result. This section builds up from that one operation to the abstraction that organizes it and the hardware that carries it.

1. Point-to-Point: Send and Recv, the Atom of Communication Beginner

The most basic act of communication is a point-to-point transfer: one rank calls send to push a buffer toward a named destination, and the destination rank calls recv to accept it into a buffer of its own. The two calls are a matched pair; a send with no matching recv simply waits, and a recv with no matching send waits too. This pairing is what makes point-to-point both flexible and dangerous. It is flexible because you can express any communication pattern at all by choosing who sends to whom and in what order. It is dangerous because if every rank tries to send before anyone receives, and the buffers are large enough that the system cannot quietly absorb them, every rank blocks waiting for a partner who is also blocked, and the job deadlocks. The discipline of ordering sends and recvs so this cannot happen is the price of working at this level.

Point-to-point comes in two flavors that matter for performance. A blocking send does not return until the buffer is safe to reuse; a non-blocking send returns immediately and hands you a handle you later wait on, which lets a worker post a transfer and keep computing while the bytes move. That overlap, computing during a transfer instead of stalling for it, is the seed of one of the most important optimizations in the whole chapter, the overlap of gradient communication with the backward pass that Section 4.10 develops in full. For now the point is narrower: send and recv are the only two verbs the hardware truly needs, and the cost of one such transfer of $n$ bytes is exactly the alpha-beta cost $\alpha + n\beta$ from Section 3.8.

Key Insight: Collectives Are Choreographed Point-to-Point

There is no separate hardware primitive for "all-reduce". A collective is a schedule of point-to-point sends and recvs among $K$ ranks, chosen so that after the schedule completes, every rank holds the combined result. Change the schedule (a ring, a tree, a butterfly) and you change which point-to-point messages happen and when, which is exactly what changes the $\alpha$ and $\beta$ terms of the collective's cost. This is why Section 4.4 can compare all-reduce algorithms purely by counting messages and bytes: every one of them is built from the same atom you met here.

2. Collective Operations: One Call, Every Rank Participates Beginner

Writing distributed training as raw send and recv would be both tedious and error-prone, so the field standardized a small set of named patterns, the collective operations, in which a single call is issued by every participating rank and the library executes the underlying schedule of point-to-point messages for you. The defining feature of a collective is that all ranks call it together: where point-to-point names a single source and a single destination, a collective names the whole group and produces a result defined over the whole group. An all-reduce, the collective that sums one vector per rank and leaves the sum on every rank, is the one that synchronizes gradients in data-parallel training and the subject of Section 4.3; it is the same operation you performed by hand in Section 1.1 when eight workers combined their partial gradients.

The value of the collective is not only convenience. Because the library knows the full pattern in advance, it can pick a schedule tuned to the hardware (a ring when bandwidth dominates, a tree when latency dominates) and overlap the hops optimally, which a hand-rolled sequence of sends almost never achieves. The trade is expressiveness for performance and safety: you give up the freedom to specify an arbitrary message pattern, and in return you get a deadlock-free, topology-aware implementation. Almost all of distributed deep learning is expressed in this collective vocabulary, which is why the rest of this chapter is organized as one section per collective. Table 4.2.1 places point-to-point beside the collectives it composes into.

Table 4.2.1: Point-to-point versus collective communication. The collectives are not new hardware operations; each is a fixed schedule of the send and recv atoms, which is why the same alpha-beta cost model prices them all.
PropertyPoint-to-point (send / recv)Collective (all-reduce, etc.)
Who calls itOne sender and one receiverEvery rank in the group, together
What it namesA single source and destinationThe whole group (a world of ranks)
ResultBytes moved from A to BA combined value defined over all ranks
Who picks the scheduleYou, by ordering the callsThe library, tuned to the topology
Built fromThe hardware atom itselfMany point-to-point messages

3. The Process Group: A World of Ranks Beginner

For any of this to work, the participants must be named, and that naming is the job of the process group, also called a communicator. When a distributed job starts, the framework forms a group of all the worker processes and assigns each one an integer rank from $0$ to $K-1$; the total count $K$ is the world size, and the default group containing everyone is the world. A send addresses its destination by rank; a collective acts over a group, so it sums or gathers exactly the ranks in that group and no others. The process group is the small piece of shared bookkeeping that turns a pile of independent processes into a coordinated team that can speak the collective vocabulary.

Groups can be carved into subgroups, and this is not a detail but a central tool. Three-dimensional parallelism splits a model along data, tensor, and pipeline dimensions at once, and each dimension is its own process group: a tensor-parallel all-reduce runs only over the few ranks that share a layer, while a data-parallel all-reduce runs over the ranks holding replicas, and the two groups overlap on the same physical machines. Forming the right subgroups so that frequent, latency-sensitive collectives stay inside one machine while rare, bandwidth-heavy ones cross the network is exactly the topology-aware placement that Section 4.9 is devoted to. The process-group abstraction is what makes that placement expressible: you decide which ranks share a group, and the library confines each collective to it.

Library Shortcut: torch.distributed Gives You the World and the Atoms

You do not implement send, recv, or the process group yourself. PyTorch's torch.distributed initializes the world, hands every process its rank and the world size, and exposes the point-to-point atoms directly; carving out a subgroup is one call to new_group. The whole substrate this section describes is a handful of lines:

# Launch with: torchrun --nproc_per_node=4 thisfile.py
import torch, torch.distributed as dist

dist.init_process_group("nccl")          # form the world; assign ranks
rank = dist.get_rank()                    # this process's id, 0 .. K-1
world = dist.get_world_size()             # K, the world size

# Point-to-point: rank 0 sends one tensor to its right neighbour, which recvs it.
buf = torch.full((3,), float(rank), device="cuda")
if rank == 0:
    dist.send(buf, dst=1)                 # blocking send to rank 1
elif rank == 1:
    dist.recv(buf, src=0)                 # blocking recv from rank 0

# A subgroup over just the even ranks; a collective on it touches only those.
evens = dist.new_group(ranks=list(range(0, world, 2)))
if rank % 2 == 0:
    dist.all_reduce(buf, group=evens)     # confined to the even-rank world
Code 4.2.1: The communication substrate in torch.distributed: init_process_group builds the world, send and recv are the point-to-point atoms, and new_group carves a subgroup that a collective can be confined to. The library hides the deadlock-avoidance, transport, and topology choices that a hand-rolled version would have to get right.

4. Building a Collective From Bare Send and Recv Intermediate

The cleanest way to believe that a collective is just choreographed point-to-point is to build one. The simulation below arranges $K$ ranks on a logical ring where each rank only ever talks to its right neighbor, then implements an all-reduce-by-sum using nothing but those one-hop sends. A first set of $K-1$ hops walks a running sum around the ring until one rank holds the total; a second set of $K-1$ hops walks that finished total back around so every rank ends with it. We count every message and check the result against the single-machine sum, so the claim "no approximation, just regrouping" from Section 1.1 is verified rather than asserted.

import numpy as np

K = 4
rng = np.random.default_rng(0)
local = [rng.integers(1, 10, size=3) for _ in range(K)]   # one vector per rank

# A naive ring all-reduce written ONLY with point-to-point hops. Phase A walks a
# running sum around the ring; phase B walks the finished sum back so every rank
# holds it. msgs counts each point-to-point send (no collective is used).
buf = [v.copy() for v in local]
msgs = 0
acc = buf[0].copy()                       # rank 0 starts the running sum
for step in range(1, K):                  # K-1 send/recv hops: gather the sum
    acc = acc + buf[step]                 # recv from left, add local share
    msgs += 1
total = acc                               # rank K-1 now holds the full sum
result = [None] * K
result[K - 1] = total.copy()
for step in range(K - 1):                 # K-1 hops: broadcast the sum back
    result[step] = total.copy()
    msgs += 1

reference = np.sum(local, axis=0)         # what a single machine would compute
print("per-rank input vectors:")
for r in range(K):
    print(f"  rank {r}: {local[r]}")
print()
print("point-to-point messages sent :", msgs, f" (= 2*(K-1) for K={K})")
print("reference single-machine sum :", reference)
print("all ranks hold the sum?      :",
      all(np.array_equal(result[r], reference) for r in range(K)))
print("max abs error vs reference   :",
      max(int(np.max(np.abs(result[r] - reference))) for r in range(K)))
Code 4.2.2: An all-reduce assembled by hand from $2(K-1)$ one-hop sends on a ring, with no collective call anywhere. The same $2(K-1)$ message count is what the cost model in Section 3.8 charges to ring all-reduce.
per-rank input vectors:
  rank 0: [8 6 5]
  rank 1: [3 3 1]
  rank 2: [1 1 2]
  rank 3: [8 6 9]

point-to-point messages sent : 6  (= 2*(K-1) for K=4)
reference single-machine sum : [20 16 17]
all ranks hold the sum?      : True
max abs error vs reference   : 0
Output 4.2.2: Four ranks, each talking only to its neighbor, jointly compute the exact group sum in $2(K-1) = 6$ messages, and every rank ends holding it with zero error. The collective added no approximation; it only orchestrated point-to-point hops.

Six messages, no shared memory, every rank blind to the others' data except through one neighbor, and the result is the exact sum on every rank. This is the entire idea of a collective in miniature. The cost of this hand-built version is $2(K-1)$ messages, which is precisely the message count the alpha-beta model in Section 3.8 assigned to ring all-reduce; the real ring algorithm improves on the naive version above only by slicing each vector so the per-message payload is $n/K$ rather than $n$, which is the refinement Section 4.4 develops. What does not change, ever, is that the bytes cross a physical wire, and which wire decides how long each of those six hops takes.

5. The Interconnect Hierarchy, Thinly Intermediate

The send in Code 4.2.2 was free because it happened in one process; on real hardware each hop crosses a link with a finite $\alpha$ and $\beta$, and those links form a hierarchy. The crucial fact, the only one this chapter needs you to internalize about hardware, is that the tiers differ by orders of magnitude. Inside a single machine, accelerators talk over NVLink, a dedicated high-bandwidth fabric reaching hundreds of gigabytes per second between GPUs, or over PCIe, the general-purpose bus shared with the rest of the system at tens of gigabytes per second. Between machines, bytes leave the box and cross a network: commodity Ethernet, or InfiniBand, a low-latency network built for high-performance computing and now standard in AI clusters. The span from NVLink down to Ethernet is roughly two orders of magnitude in bandwidth and a similar gap in latency, and Figure 4.2.1 draws the tiers to scale.

Intra-node (inside one machine) NVLink ~300 GB/s, latency ~1 us · GPU-to-GPU in one box PCIe ~25 GB/s, latency ~2 us · shared system bus Inter-node (across the network) InfiniBand ~50 GB/s (NDR), latency ~3 us · HPC-grade network Ethernet ~3 GB/s (25G), latency ~30 us · commodity network relative bandwidth (bar length, longer is faster)
Figure 4.2.1: The interconnect hierarchy, drawn so bar length tracks relative bandwidth. The top group is intra-node (NVLink, then PCIe); the bottom group is inter-node (InfiniBand, then commodity Ethernet). The roughly two-orders-of-magnitude span from the longest bar to the shortest is the physical fact behind topology-aware placement: keep the chatty collectives on the long bars. The figures are representative orders of magnitude for reasoning about cost, not vendor specifications.

One more idea is needed to make inter-node transfers fast, and it is RDMA, remote direct memory access. Ordinarily, moving a buffer from one machine to another means copying it through the operating-system kernel and the network stack at both ends, burning CPU cycles and adding latency on every hop. RDMA lets a network adapter read from one machine's memory and write directly into another machine's memory without involving either CPU or kernel on the data path, a zero-copy transport. This is what lets InfiniBand (and RDMA-over-Ethernet variants) hit the low latencies in Figure 4.2.1, and it is why an all-reduce across nodes can approach the alpha-beta cost rather than drowning in copy overhead. For our purposes RDMA is a single fact: it is the transport that makes the inter-node $\alpha$ and $\beta$ small enough that the cost model of Section 3.8 is the right tool, and the named libraries that drive it appear in Section 4.8.

Fun Note: Same Trip, Wildly Different Tolls

Picture a gradient buffer as a parcel. Handed to a GPU one slot over on NVLink, it arrives almost before you let go. Handed across PCIe, it queues behind everyone else's parcels on the shared bus. Sent to the machine in the next rack over Ethernet, it stands in line at the post office, gets stamped by the kernel twice, and shows up later that afternoon. RDMA is the courier who is allowed to walk straight into the recipient's house and set the parcel on the table. The bytes are identical; the bill is not.

6. Costing a Transfer Across the Tiers Intermediate

The reason to keep the hardware tour thin is that two numbers per tier, an $\alpha$ and a $\beta$, are all the cost model needs. The program below takes a single 256 MiB gradient buffer, the kind of payload a data-parallel step exchanges, and applies the alpha-beta time $T(n) = \alpha + n\beta$ from Section 3.8 to each tier in Figure 4.2.1. It is the same model, fed the four tiers' constants, and it makes the orders-of-magnitude claim concrete: the identical transfer that costs under a millisecond on NVLink costs nearly a tenth of a second on commodity Ethernet.

n = 256 * 1024**2     # 256 MiB payload, in bytes
tiers = [
    # name,                  alpha (s),  bandwidth (bytes/s)
    ("NVLink (intra-node)",   1.0e-6,  300e9),
    ("PCIe   (intra-node)",   2.0e-6,   25e9),
    ("InfiniBand NDR (inter)",3.0e-6,   50e9),
    ("Ethernet 25G (inter)",  30.0e-6,  3.1e9),
]
print(f"alpha-beta transfer time of a {n//1024//1024} MiB buffer across tiers:")
print(f"{'tier':<24}{'alpha(us)':>10}{'BW(GB/s)':>10}{'time(ms)':>10}{'vs NVLink':>11}")
base = None
for name, alpha, bw in tiers:
    t = alpha + n * (1.0 / bw)            # T(n) = alpha + n*beta, beta = 1/bandwidth
    base = base or t                      # NVLink is the reference (fastest)
    print(f"{name:<24}{alpha*1e6:>10.1f}{bw/1e9:>10.1f}{t*1e3:>10.2f}{t/base:>10.1f}x")
Code 4.2.3: The alpha-beta model of Section 3.8 applied to one 256 MiB transfer across the four tiers of Figure 4.2.1. Only the per-tier $\alpha$ and $\beta = 1/\text{bandwidth}$ change; the formula is identical.
alpha-beta transfer time of a 256 MiB buffer across tiers:
tier                     alpha(us)  BW(GB/s)  time(ms)  vs NVLink
NVLink (intra-node)            1.0     300.0      0.90       1.0x
PCIe   (intra-node)            2.0      25.0     10.74      12.0x
InfiniBand NDR (inter)         3.0      50.0      5.37       6.0x
Ethernet 25G (inter)          30.0       3.1     86.62      96.7x
Output 4.2.3: The same 256 MiB buffer takes 0.90 ms on NVLink and 86.62 ms on 25G Ethernet, a factor near 97. For a payload this large the time is set almost entirely by the bandwidth term $n\beta$, so the tier's bandwidth, not its latency, dominates the gap.

The 97-fold spread is the whole argument for caring about topology. A collective that keeps its traffic on NVLink inside a machine pays the top row's cost; the same collective forced across Ethernet pays the bottom row's, and at large message sizes the difference is set by bandwidth, exactly the bandwidth-bound regime that Section 3.8 named. This is why a real cluster does not treat all ranks as interchangeable: it places the ranks that must exchange the most bytes on the fastest links and lets the rare, smaller transfers cross the network. Turning that instinct into a placement strategy is the job of Section 4.9, and every collective in the sections between here and there is priced with the per-tier constants you just used.

Practical Example: The Cluster That Was Two Slow Hops in a Trench Coat

Who: An ML platform engineer standing up an eight-node, sixty-four-GPU data-parallel training cluster on rented cloud instances.

Situation: The single-node configuration trained at the expected speed; scaling to eight nodes delivered barely twice the throughput, not the eightfold the team budgeted for.

Problem: Each training step's gradient all-reduce had to leave every machine and cross the network, and the rented instances were connected by ordinary 25-gigabit Ethernet rather than the InfiniBand the in-house cluster used.

Dilemma: Accept the slowdown and rent more nodes to brute-force the throughput, or pay for instances with RDMA-capable InfiniBand at a higher hourly rate, without yet knowing whether the network was truly the bottleneck.

Decision: They priced one all-reduce with the alpha-beta model across both tiers, exactly as in Code 4.2.3. The inter-node Ethernet term dwarfed the intra-node NVLink term by the factor Output 4.2.3 shows, so the network, not the GPUs, was starving the step.

How: They moved to InfiniBand instances with RDMA and arranged the process groups so the bandwidth-heavy data-parallel all-reduce kept as much traffic as possible on intra-node NVLink before crossing the slower inter-node link.

Result: Per-step communication fell by roughly an order of magnitude and eight-node throughput climbed close to the budgeted scaling, at a lower total cost than renting twice the nodes on slow Ethernet would have been.

Lesson: The tiers differ by orders of magnitude, so a careless placement can hide a two-orders-of-magnitude tax inside an innocent-looking all-reduce. Price the transfer across tiers before you blame the model or buy more machines.

7. Why the Substrate Decides Everything Above It Advanced

You now hold the substrate beneath the entire chapter: a single point-to-point atom, a process group that names the ranks a collective acts over, and an interconnect hierarchy whose tiers differ by orders of magnitude and whose inter-node transfers are made fast by RDMA. Every named collective in the sections ahead is a schedule of the atom over a group, and its cost is that schedule's message count and bytes evaluated against the tier its ranks happen to sit on. The all-reduce of Section 4.3, the all-gather and reduce-scatter of Section 4.5, and the all-to-all of Section 4.6 are three such schedules; what makes one fast or slow in a given system is not the schedule alone but the schedule meeting the substrate.

Thesis Thread: The Tax Has a Mechanism Now

Section 1.1 named communication as the tax on scale-out, and Section 3.8 made it a number, $\alpha + n\beta$. This section supplies the mechanism that the number describes: the tax is paid in point-to-point sends over a tiered fabric, and the size of each payment is set by which tier carries it. From here, every parallel method in the book can be read as a choice of which collective to fire, over which process group, on which tier of the substrate; that three-part choice is the lever the rest of the book teaches you to pull.

Research Frontier: The Substrate Keeps Moving (2024 to 2026)

The interconnect tiers are not fixed; each generation widens the gap this section exploits. On the intra-node side, NVLink and the NVSwitch fabric have advanced through their fifth generation in the Blackwell-class systems announced in 2024, pushing aggregate GPU-to-GPU bandwidth into the terabyte-per-second range and extending a single high-bandwidth domain across dozens of accelerators in one rack-scale NVL72 unit, which turns what used to be inter-node traffic into intra-node traffic. On the inter-node side, InfiniBand NDR at 400 gigabits per port (and the emerging XDR generation) together with RDMA-over-Converged-Ethernet keep the network tier close enough to the bus tier that very large clusters stay tractable. In-network reduction, where the switch itself sums gradients (NVIDIA's SHARP and its successors), is increasingly standard and changes the substrate's cost model by performing part of the collective in the fabric rather than on the ranks. We meet the libraries that drive all of this, NCCL chief among them, in Section 4.8; the lesson here is that "which tier carries this byte" is a moving target, and the alpha-beta model is the stable lens through which each new generation is read.

With the atom, the group, and the tiers in hand, we are ready to name the first and most important collective in distributed training. The next section, Section 4.3, takes the all-reduce you built by hand in Code 4.2.2 and develops it as the operation that synchronizes gradients in data-parallel SGD, complete with the cost model that decides how it scales.

Exercise 4.2.1: Why a Collective and Not Raw Sends? Conceptual

You are tempted to skip the collective vocabulary and write your gradient synchronization as a hand-ordered sequence of send and recv calls, as in Code 4.2.2. State two concrete advantages a library collective has over your hand-rolled version (think about deadlock and about the schedule the library is free to pick), and one situation in which raw point-to-point is genuinely the right tool. Explain why the process-group abstraction is what lets the library make the topology-aware choice you cannot easily make by hand.

Exercise 4.2.2: Price the Same Buffer Everywhere Coding

Extend Code 4.2.3 to sweep the buffer size $n$ from 4 KiB to 1 GiB across all four tiers, and for each tier print the size at which the transfer crosses from latency-bound ($\alpha$ dominates) to bandwidth-bound ($n\beta$ dominates), the crossover $n = \alpha/\beta$ from Section 3.8. Explain why a 4 KiB control message and a 256 MiB gradient buffer can land in different regimes on the same tier, and what that implies for whether small or large messages most reward moving to a faster tier.

Exercise 4.2.3: Count the Hops, Predict the Cost Analysis

The hand-built all-reduce in Code 4.2.2 sends $2(K-1)$ full-size messages of $n$ bytes each, while the real ring all-reduce sends $2(K-1)$ messages of $n/K$ bytes each. Using the alpha-beta model, write the total time for both versions as a function of $\alpha$, $\beta$, $n$, and $K$, and show that the real ring's bandwidth term is smaller by a factor near $K$ while its latency term is identical. For a 256 MiB buffer at $K = 16$ on the InfiniBand row of Output 4.2.3, estimate both times and state which version you would never ship, and why.