Part I: Foundations of Distributed AI
Chapter 4: Communication Primitives for Distributed Training

Communication Libraries: NCCL, MPI, and Gloo

"You spent four sections deciding which collective to call. I am the one who actually has to move the bytes, pick the ring, and find the wire. A little credit would be nice."

A Communication Backend That Does the Real Work
Big Picture

The collectives from the previous sections are not something you implement; they are something you call, and the code on the other side of that call is a communication library that turns "all-reduce this gradient" into a concrete schedule of messages over real wires. Three libraries dominate distributed AI: NCCL, NVIDIA's GPU-optimized collective library and the default for GPU training; MPI, the mature High Performance Computing standard that runs on CPUs and GPUs alike; and Gloo, the portable fallback for CPUs and simple GPU cases. In practice you do not link against any of them directly. You select one by name as a torch.distributed backend, and it then chooses the algorithm (ring or tree), reads the machine's topology, picks the transport (RDMA, shared memory, or TCP), and overlaps the transfer with your computation. This section is the bridge from the abstract collectives of Sections 4.3 through 4.7 to the named software you will actually configure.

Across the previous five sections we treated collectives as mathematical objects: an all-reduce sums one vector per worker and gives every worker the result; reduce-scatter and all-gather split that work for sharded training; all-to-all routes tokens between experts. None of those sections said how a vector of four billion bytes actually crosses from one GPU to another, in what order, over which link, and overlapped with which part of the backward pass. That "how" is the job of a communication library. The library is the layer that sits below your training framework and above the hardware, and the quality of its implementation is often the difference between a job that scales to a thousand GPUs and one that stalls at sixty four. Section 4.7 closed the catalogue of primitives; this section names the engines that run them.

Framework Training framework PyTorch DDP / FSDP, DeepSpeed, Megatron-LM, Ray Dispatch torch.distributed one collective API, backend selected by name Library NCCL NVIDIA GPU MPI HPC standard Gloo portable CPU Transport NVLink intra-node GPU RDMA InfiniBand / RoCE shared mem same host TCP commodity net Hardware GPUs, NICs, switches
Figure 4.8.1: Where the communication libraries sit. Your training framework issues a collective call; torch.distributed dispatches it to whichever backend you selected; that library chooses an algorithm and rides the fastest transport it can reach down to the hardware. NCCL prefers NVLink and RDMA, Gloo falls back to shared memory and TCP, and MPI can drive either depending on the cluster. The arrows show that one API call fans out to several possible wires.

1. Why a Separate Library Layer Exists at All Beginner

A reasonable question is why the framework does not just send the bytes itself. The answer is that a correct, fast collective is a substantial piece of systems engineering that no training framework wants to own. To move a gradient with an all-reduce, something must decide whether to use a ring or a tree algorithm (a choice that depends on message size and worker count, as Section 4.4 showed), detect that two GPUs share an NVLink bridge while a third is a network hop away, fall back from RDMA to TCP when InfiniBand is absent, register pinned memory buffers, and schedule the transfer so it overlaps with the parts of the backward pass that have already finished. Getting all of that right, and portable, and fast, is a multi-year effort. The communication libraries exist so that effort is written once and reused by every framework.

This is the same separation of concerns that lets Chapter 15 describe data-parallel training without ever mentioning a wire protocol: the framework expresses intent ("synchronize these gradients"), and the library expresses mechanism ("ring all-reduce over NVLink within the node, tree all-reduce over InfiniBand across nodes"). The cost models of Chapter 3 are how you predict what that mechanism will cost; the library is what actually pays the cost. Keeping the two layers distinct is what lets you change interconnects without rewriting your model.

Key Insight: You Choose the Collective; the Library Chooses Everything Else

Your code names a collective and a backend. The library then makes four decisions you never see: which algorithm to run (ring versus tree, by message size and group size), which topology the workers form (who is one NVLink hop away and who is across the network), which transport to ride (NVLink, RDMA, shared memory, or TCP), and how to overlap the transfer with compute. The reason a backend swap can change throughput by a large factor, with byte-for-byte identical results, is that all four of those decisions are made inside the library, not in your training loop.

2. The Three Libraries, and What Each Is For Beginner

Three libraries cover nearly every distributed AI job, and they divide the space cleanly by where the data lives and what hardware connects it. NCCL (the NVIDIA Collective Communications Library) is purpose-built for collectives among NVIDIA GPUs. It is topology-aware down to the level of individual NVLink bridges and PCIe switches, drives RDMA over InfiniBand and RoCE for cross-node transfers, and is the default and recommended backend for essentially all GPU training. When this book says "the all-reduce" in a deep-learning context, NCCL is almost always the code running it.

MPI (the Message Passing Interface) is the standard of classical High Performance Computing, decades old and extremely mature. It runs collectives across CPUs and, through CUDA-aware builds, across GPUs as well. On a supercomputer or a national-lab cluster that already speaks MPI and is provisioned with an MPI-aware job launcher, using the MPI backend lets an AI job inherit the cluster's tuned transport stack. Outside that setting, MPI is less common for pure GPU deep learning, because NCCL usually matches or beats it on NVIDIA hardware and needs no separate installation.

Gloo is Meta's portable collective library. It targets CPUs first, supports a useful subset of collectives on GPUs, and asks nothing of the environment beyond ordinary TCP, which makes it the dependable fallback. You reach for Gloo when there is no GPU, when you are debugging a distributed program on a laptop, or when a collective you need runs on CPU tensors (for example, gathering Python metadata across workers). It is rarely the fast path, and that is fine, because it is the path you take when correctness and portability matter more than peak bandwidth. Table 4.8.1 lays the three side by side.

Table 4.8.1: The three communication libraries that torch.distributed exposes as backends, by the devices they serve, the transports they prefer, and the situation each one fits best.
LibraryDevicesPreferred transportBest fit
NCCLNVIDIA GPUNVLink and NVSwitch intra-node; RDMA (InfiniBand, RoCE) inter-nodeThe default for all GPU collective training and sharded parallelism
MPICPU and GPU (CUDA-aware builds)Whatever the HPC fabric provides; vendor-tuned RDMAEstablished HPC and supercomputer clusters already running MPI
GlooCPU, limited GPUShared memory on one host; TCP across hostsCPU collectives, single-laptop debugging, the portable fallback
Fun Note: The Backend Nobody Thanks

Gloo is the understudy who knows every line. When NCCL is missing, when there is no GPU on the box, when you just want a distributed program to start so you can set a breakpoint, Gloo quietly steps in over plain TCP and makes the all-reduce return the right answer. It will never win a bandwidth benchmark, and it does not want to. Its whole job is to make sure your distributed program runs somewhere, which is exactly the property you are most grateful for at two in the morning.

3. Selecting a Backend in torch.distributed Intermediate

The reason you rarely think about these libraries by name is that torch.distributed presents them as interchangeable backends behind one collective API. You pass a backend string to init_process_group, and from that point every all_reduce, all_gather, or all_to_all call is routed to the library you named. Swapping NCCL for Gloo is a one-word change with no edit to the collective calls themselves, which is precisely the portability the library layer buys you. The widely used rule of thumb is short: NCCL for GPU collectives, Gloo for CPU work and debugging, MPI when you are on an HPC cluster that already runs it.

The program below probes which backends the installed PyTorch was compiled with, applies that rule of thumb to pick one, and then shows what a sum all-reduce computes over a small group. On a multi-GPU box the same code would initialize NCCL and run the all-reduce over the network; on the single CPU host used to capture the output, it reports the available backends and simulates the collective so the semantics are visible without a second process.

import torch
import torch.distributed as dist

# 1. Which backends did THIS PyTorch build compile in? (a real probe)
for b in ["nccl", "mpi", "gloo"]:
    available = getattr(dist, "is_" + b + "_available")()
    print(f"  {b:5s}: {'available' if available else 'not available'}")

# 2. Rule of thumb: NCCL for GPU collectives, else Gloo for CPU/debugging.
chosen = ("nccl" if torch.cuda.is_available() and dist.is_nccl_available()
          else "gloo" if dist.is_gloo_available() else "none")
print("device available   :", "cuda" if torch.cuda.is_available() else "cpu-only")
print("selected backend   :", chosen)

# 3. What a SUM all-reduce computes, shown for a 3-worker group.
#    Single-host simulation: each "worker" holds its own gradient vector;
#    the collective replaces every copy with the element-wise sum.
workers = [torch.tensor([1., 2., 3.]),
           torch.tensor([0., 1., 0.]),
           torch.tensor([4., 0., 2.])]
result = torch.stack(workers).sum(dim=0)   # what all_reduce(SUM) yields on every worker
print("per-worker inputs  :", [w.tolist() for w in workers])
print("all_reduce(SUM)    :", result.tolist(), "(held by every worker afterward)")
print("mean (divide by 3) :", (result / 3).tolist())
Code 4.8.1: Probing the available backends and applying the selection rule of thumb. The backend probe and selection are real; the all-reduce is simulated in one process so its result is visible without launching a second worker.
  nccl : not available
  mpi  : not available
  gloo : available
device available   : cuda
selected backend   : gloo
per-worker inputs  : [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [4.0, 0.0, 2.0]]
all_reduce(SUM)    : [5.0, 3.0, 5.0] (held by every worker afterward)
mean (divide by 3) : [1.6666666269302368, 1.0, 1.6666666269302368]
Output 4.8.1: Real output on a Windows host. Only Gloo is compiled into this build, so even though a CUDA device is present the selection falls back to Gloo; on a Linux GPU node the same probe reports NCCL available and selects it. The all-reduce sums the three workers element-wise, and dividing by the group size recovers the mean gradient of Section 4.3.

Two details in that output are worth dwelling on. First, NCCL availability is a property of the build, not just the hardware: NCCL is a Linux library, so a Windows PyTorch can see a GPU yet report NCCL unavailable, which is why the rule of thumb checks is_nccl_available() and not only torch.cuda.is_available(). Second, the all-reduce result is identical no matter which backend produces it; the backend changes the speed and the wire, never the arithmetic. That invariance is what makes the backend a tuning knob rather than a correctness concern.

Library Shortcut: One String Picks the Whole Transport Stack

Everything Figure 4.8.1 shows below torch.distributed, the algorithm choice, the topology detection, the RDMA-or-TCP transport, and the compute overlap, is selected by a single argument to one function. You never write a ring schedule or open a socket:

# Run with: torchrun --nproc_per_node=8 train.py
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # GPU training: topology-aware, RDMA-capable
# dist.init_process_group(backend="gloo") # swap one word to debug on CPU
# dist.init_process_group(backend="mpi")  # or to run under an HPC launcher

# Every collective below is now routed to the chosen library, unchanged:
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
Code 4.8.2: Selecting NCCL, Gloo, or MPI is a one-word change to init_process_group; the collective calls that follow are untouched. Replacing a hand-written ring all-reduce, topology probe, and socket layer (hundreds of lines) with this single argument is the whole value of the backend abstraction.

4. What the Library Handles for You Intermediate

It pays to name the four things the library does on your behalf, because each one connects to material elsewhere in this chapter and book. The first is algorithm selection. For a small gradient the library may pick a tree-based reduction to keep latency low; for a large one it switches to the bandwidth-optimal ring all-reduce that Section 4.4 explained, and modern NCCL can even split a single collective across both NVLink and the network simultaneously. You expressed "all-reduce"; the library chose "ring or tree" using the very cost trade-offs of Chapter 3.

The second is topology awareness. The library inspects the machine to learn which GPUs share an NVLink bridge, which sit behind the same PCIe switch, and which are reachable only across the network, then builds its communication pattern to keep as much traffic as possible on the fastest links. This is so consequential that Section 4.9 is devoted to placing your workers so the library has a good topology to exploit in the first place. The third is transport: the library opens RDMA queues over InfiniBand or RoCE when they exist, uses shared memory for same-host transfers, and falls back to TCP otherwise, all without a line of socket code from you. The fourth is overlap, launching the transfer on a separate stream so it proceeds while computation continues, the mechanism that Section 4.10 turns into gradient bucketing during the backward pass.

Practical Example: The Backend Swap That Tripled Throughput

Who: A platform engineer onboarding a new multi-node GPU cluster for a foundation-model team.

Situation: A data-parallel training job ran correctly across four eight-GPU nodes but scaled poorly, with each added node buying far less speedup than the first.

Problem: The job had been ported from a single-node prototype that used the Gloo backend, and nobody had revisited the backend string when it moved to the GPU cluster.

Dilemma: Spend days profiling the model and rewriting the data loader, the usual first suspects, or first check whether the collective layer itself was the bottleneck.

Decision: Check the cheap thing first. The team confirmed Gloo was driving inter-node gradient all-reduce over plain TCP, ignoring the cluster's InfiniBand fabric entirely.

How: They changed one argument, init_process_group(backend="nccl"), and confirmed NCCL was discovering the InfiniBand RDMA path and the intra-node NVLink bridges.

Result: Per-step communication time fell sharply and end-to-end throughput more than tripled at four nodes, with byte-for-byte identical model outputs, since the backend changes the transport, not the math.

Lesson: On GPU clusters the backend is not a detail. Gloo over TCP and NCCL over RDMA compute the same all-reduce, but the second one actually uses the expensive interconnect you are paying for.

5. Where the Libraries Sit Under the Frameworks Advanced

It helps to see the whole stack at once, because the libraries of this section are a thin and well-defined layer with a great deal resting on top of them. Above torch.distributed sit the training frameworks: DistributedDataParallel and FSDP in PyTorch, then DeepSpeed and Megatron-LM, then orchestration systems like Ray. Every one of those frameworks ultimately expresses its parallelism as collective calls, which means every one of them runs on NCCL, MPI, or Gloo underneath. When Chapter 16 shards a model with FSDP and issues reduce-scatter and all-gather, those collectives are NCCL calls; the chapter can reason about parallelism strategy precisely because this layer reliably turns each collective into bytes on the fastest available wire.

This is why the chapter places the libraries in Part I rather than burying them inside a training chapter. They are shared infrastructure. The same NCCL that synchronizes gradients in data-parallel training (Chapter 15) also moves shards in FSDP (Chapter 16) and routes tokens in expert parallelism. Learning the backend layer once means every later parallel method reads as "which collective, over which backend, on which topology," a question you can now answer. With the engines named, the next question is how to arrange the workers so those engines find a friendly topology, which is exactly where Section 4.9 goes.

Research Frontier: The Communication Library Stack (2024 to 2026)

The library layer is under active development on several fronts. NCCL releases in the 2.20 and later line have added improved support for collectives that span NVLink and the network in one operation and better handling of the dense NVSwitch fabrics in DGX and GB200 systems. NVIDIA has pushed NVSHMEM, a partitioned-global-address-space model that lets GPU kernels issue fine-grained one-sided transfers directly, blurring the line between computation and communication and underpinning low-latency Mixture-of-Experts routing; the DeepEP library released alongside DeepSeek-V3 in 2025 builds expert-parallel all-to-all on exactly this idea. On the open-fabric side, libfabric and the AWS Elastic Fabric Adapter (EFA) provide a vendor-neutral RDMA-style transport that NCCL can ride through the aws-ofi-nccl plugin, decoupling the collective library from any single interconnect vendor. The throughline is that the collective layer, long treated as fixed plumbing, has become a place where real performance is won; we return to its consequences for sharded and expert parallelism in Chapter 16.

Exercise 4.8.1: Pick the Backend Conceptual

For each job, name the backend you would select and justify it in one sentence using the rule of thumb and Table 4.8.1: (a) data-parallel training of a vision model on a single node with eight NVLink-connected NVIDIA GPUs; (b) a unit test that exercises your all-reduce logic on a laptop with no GPU, in continuous integration; (c) a climate-model-style job on a national supercomputer whose batch system already launches every job under an MPI runtime. For one of the three, describe a symptom you would expect if you chose the wrong backend instead.

Exercise 4.8.2: Probe Your Own Build Coding

Run Code 4.8.1 on a machine you have access to and record which backends report available and which backend the rule of thumb selects. Then extend the script: if Gloo is available, actually initialize a single-process group with the Gloo backend (use a file-based init_method and world_size=1) and run a real dist.all_reduce on a CPU tensor, confirming the result matches the simulated value. If initialization fails on your platform, capture the error and explain in two sentences what it tells you about Gloo's environment requirements versus the simulation in Code 4.8.1.

Exercise 4.8.3: Cost of the Wrong Wire Analysis

The Practical Example traded Gloo over TCP for NCCL over RDMA. Suppose each node holds a gradient of $P = 2 \times 10^{9}$ four-byte numbers, a TCP path sustains 5 gigabytes per second, and an InfiniBand RDMA path sustains 50 gigabytes per second. Using the simple "bytes divided by bandwidth" estimate (ignore the ring-algorithm factors, which Section 4.4 and Chapter 3 make precise), compute the time to move one gradient once over each path and the ratio between them. Explain why this ratio, not the model code, was the real cause of the poor scaling, and predict how the gap widens as you add nodes.