"I spent the whole forward pass mailing tokens to strangers and the whole backward pass mailing them home. Nobody asks the postal service how it feels."
An All-to-All That Has Routed One Token Too Many
In a Mixture-of-Experts layer the experts live on different devices, so after the gate decides where each token should go, the tokens must physically move to the device that owns their expert, be computed there, and then move back: two all-to-all exchanges wrap every single MoE layer. All-to-all is the most communication-intensive collective, every device talks to every other device, and in a modern sparse model it is the dominant cost, not the expert math. This section shows why the collective is there, writes down what it costs as a function of tokens per device, demonstrates the dispatch-then-combine round trip in pure Python, and surveys the systems work (overlap, hierarchical routing, DeepEP) that keeps the interconnect from becoming the wall the whole model runs into.
The previous section placed each expert on its own device and explained why sparse activation lets a Mixture-of-Experts (MoE) model carry many more parameters than a dense model of the same per-token cost. That placement creates a problem it did not solve. After the gate runs (Section 17.3), a device that processed a slice of the batch holds a pile of tokens, but those tokens are routed all over the cluster: some belong to the expert sitting on this device, most belong to experts on other devices. Nothing computes yet, because the tokens are in the wrong place. The collective that fixes this is all-to-all, and it is the operation that defines how fast an MoE model trains and serves.
This collective is not new to us. It was introduced as the routing primitive in Section 4.6, alongside all-reduce and all-gather. There we measured its raw cost; here we watch it become the engine, and the bottleneck, of an entire class of models. Where data-parallel training leans on all-reduce (Chapter 15) and sharded training leans on reduce-scatter and all-gather (Chapter 16), expert parallelism is the regime where all-to-all is in charge.
1. Why Two All-to-Alls Wrap Every MoE Layer Beginner
Picture $D$ devices, each holding one expert and each holding a shard of the current batch of tokens. The gate has already chosen, for every token, the expert it must visit. From one device's point of view, its local tokens are destined for a mix of experts spread across all $D$ devices, and symmetrically, the expert it hosts is the destination for tokens currently sitting on every other device. To compute anything, the tokens and their destinations must be reconciled.
The dispatch all-to-all does exactly that: every device simultaneously sends each of its tokens to the device that owns the token's expert, and simultaneously receives, from every other device, the tokens that belong to its own expert. After this single collective, each device holds precisely the set of tokens its local expert must process, gathered from across the whole cluster. The experts then run their feed-forward computation locally, with no communication. Finally the combine all-to-all reverses the exchange: each device ships every processed token back to the device it originally came from, into the exact position it left, so the layer's output lines up with its input for the rest of the network. Two all-to-alls, one before the experts and one after, bracket every MoE layer in the model.
Expert parallelism makes the expensive part (the feed-forward computation) embarrassingly local: once a token arrives, its expert runs with zero communication. The price is that getting the token to the right place, and home again, is a fully global exchange in which every device talks to every other device. The compute is sharded for free; the routing is what you pay for. That inversion, cheap math wrapped in expensive movement, is why an MoE layer's performance is an interconnect question, not a FLOP question.
2. What All-to-All Costs, and Why It Is the Bottleneck Intermediate
The reason all-to-all dominates is structural. In all-reduce, the total bytes a device sends are roughly the size of one gradient, independent of how many workers there are, because clever ring and tree algorithms reuse the same bytes. In all-to-all there is no such reuse: each device has a distinct message for every other device, so the volume a device must move grows with the data it holds and is spread across $D-1$ partners. If each of $D$ devices holds $T$ tokens of hidden size $h$ (in bytes per token), and routing is uniform, each device sends about a $\frac{D-1}{D}$ fraction of its tokens off-device, giving a per-device transfer of
$$ B_{\text{dispatch}} \;=\; T \, h \,\frac{D-1}{D} \;\approx\; T\,h \quad\text{for large } D, $$and the combine all-to-all moves the same volume back, so a single MoE layer pays roughly $2 T h$ bytes of off-device traffic per device, every forward pass, plus the same again in the backward pass. Multiply by the number of MoE layers in the model and the picture is stark: in a deep sparse model the all-to-all traffic is paid dozens of times per step. Under the alpha-beta cost model of Chapter 3, the time for one exchange is about
$$ t_{\text{a2a}} \;\approx\; (D-1)\,\alpha \;+\; \frac{B_{\text{dispatch}}}{\beta_{\text{eff}}}, $$where $\alpha$ is per-message latency, $\beta_{\text{eff}}$ is the effective per-link bandwidth, and the $(D-1)\alpha$ term reflects that every device opens a channel to every other. As the number of experts (and devices) grows, both the latency term and the contention for shared links grow, which is why all-to-all, not the expert FLOPs, has become the MoE bottleneck. The completion time is set by the busiest link, so the cost above is a best case that uniform routing never quite achieves in practice.
This book's spine is that a handful of collectives, introduced in Chapter 4, return as the engine of every parallel method. Data parallelism is all-reduce; sharded parallelism is reduce-scatter plus all-gather. Expert parallelism is the case where the collective does not merely synchronize a result computed elsewhere; it is the layer's defining operation. An MoE model's quality comes from its experts, but its speed, its hardware budget, and its scaling limit come almost entirely from how well its two all-to-alls run. When you read that a sparse model is "interconnect-bound," this is the sentence being unpacked.
3. The Dispatch-and-Combine Round Trip in Pure Python Intermediate
The mechanics are easier to trust once you watch every token leave home, reach its expert, and return to the exact slot it left from. The simulation below runs one MoE layer across $D$ devices on a single process. It builds the dispatch traffic matrix (who sends how many tokens to whom), routes the tokens, runs a placeholder expert step, then runs the combine all-to-all and checks two things: that the combine exchange is the transpose of the dispatch exchange, and that every token came back to its own origin slot. It also reports the per-device traffic and the load imbalance that Section 17.6 will set out to fix. The gate is deliberately skewed so one expert is popular, which is what real routers do before any balancing is applied.
import numpy as np
rng = np.random.default_rng(7)
D = 4 # devices, one expert per device for clarity
T = 12 # tokens held per device after the local forward pass
N = D * T # total tokens in the batch
# Each device d holds T tokens. The gate assigned every token a destination
# expert in {0..D-1}. We bias the gate so expert 1 is "popular" (load imbalance).
probs = np.array([0.15, 0.55, 0.20, 0.10]) # destination distribution
origin = np.repeat(np.arange(D), T) # which device each token came from
local_slot = np.tile(np.arange(T), D) # its slot on that origin device
dest = rng.choice(D, size=N, p=probs) # expert each token is routed to
token_id = np.arange(N) # stable identity to check round-trip
# ---- DISPATCH all-to-all: every device sends its tokens to the owning expert.
# send_counts[s, r] = number of tokens device s ships to device r (the all-to-all
# traffic matrix). Row sums are what each device sends; column sums are receives.
send_counts = np.zeros((D, D), dtype=int)
for s in range(D):
for r in range(D):
send_counts[s, r] = np.sum((origin == s) & (dest == r))
# After dispatch, expert r processes every token whose dest == r.
inbox = {r: token_id[dest == r] for r in range(D)}
expert_load = np.array([len(inbox[r]) for r in range(D)])
# ---- Expert compute is local: each device runs its MLP on the tokens it received,
# with no network traffic. We carry the token identity through as a marker.
processed = {r: inbox[r] for r in range(D)}
# ---- COMBINE all-to-all: each expert returns every token to its origin device,
# into the exact slot it left from. recv[origin][slot] must equal the original id.
recv = -np.ones((D, T), dtype=int)
combine_counts = np.zeros((D, D), dtype=int) # expert r -> origin s
for r in range(D):
for tid in processed[r]:
s = origin[tid]
slot = local_slot[tid]
recv[s, slot] = tid
combine_counts[r, s] += 1
# ---- Round-trip correctness: every token came back to its own origin slot.
expected = token_id.reshape(D, T) # slot (s, slot) started as that id
roundtrip_ok = np.array_equal(recv, expected)
# ---- Traffic & imbalance: off-diagonal sends cross the network; the busiest
# expert sets the all-to-all completion time, so imbalance directly costs latency.
cross_device = send_counts.sum() - np.trace(send_counts)
imbalance = expert_load.max() / expert_load.mean()
print("devices D :", D)
print("tokens per device T :", T, " total tokens:", N)
print("dispatch send matrix [s->r]:")
print(send_counts)
print("expert load (recv per dev):", expert_load.tolist())
print("cross-device tokens (disp):", cross_device, "of", N)
print("combine returns == dispatch sends (transpose):",
np.array_equal(combine_counts, send_counts.T))
print("round-trip correct :", bool(roundtrip_ok))
print("load imbalance (max/mean):", f"{imbalance:.2f}x")
send_counts matrix is the dispatch traffic pattern; the combine step rebuilds recv and the assertions confirm both the transpose relationship and the per-token round trip.devices D : 4
tokens per device T : 12 total tokens: 48
dispatch send matrix [s->r]:
[[1 6 5 0]
[1 8 1 2]
[2 9 0 1]
[2 6 4 0]]
expert load (recv per dev): [6, 29, 10, 3]
cross-device tokens (disp): 39 of 48
combine returns == dispatch sends (transpose): True
round-trip correct : True
load imbalance (max/mean): 2.42x
round-trip correct: True) and the combine exchange is exactly the transpose of the dispatch exchange. Note that 39 of 48 tokens crossed devices and that the skewed gate gave expert 1 a load of 29 against a mean of 12, a 2.42x imbalance, the busiest expert that the all-to-all must wait on.Two facts in Output 17.5.1 carry the whole section. First, the combine all-to-all is the transpose of the dispatch all-to-all, which is why the round trip is exact: each token returns to its own slot, so the layer's output aligns with its input. Second, the column sums of the send matrix are wildly uneven (expert 1 receives 29 tokens, expert 3 only 3), and because the slowest device sets the collective's finish time, that imbalance is pure wasted wall-clock. The all-to-all is correct regardless of imbalance, but it is only as fast as its most overloaded link, which is the bridge to load balancing in Section 17.6.
Trace token 0 in the simulation. It leaves device 0, crosses the network to whichever expert the gate chose, gets multiplied by a weight matrix, then crosses the network again to land back in slot 0 of device 0. The useful work, one feed-forward pass, might take microseconds; the two network crossings can take longer. A surprising amount of MoE engineering is, at heart, the logistics of making sure a token's commute does not dwarf its actual job.
4. Making All-to-All Cheaper: Overlap, Hierarchy, and Kernels Advanced
Because the collective is unavoidable, the engineering goal is to hide it or shrink it. Three ideas do most of the work. The first is overlap: the dispatch all-to-all for a layer can run concurrently with the expert computation of tokens that have already arrived, and the backward-pass all-to-all can overlap with gradient computation, so the network time disappears behind math instead of stacking on top of it. The second is hierarchical all-to-all: rather than one flat exchange across all $D$ devices, do an intra-node all-to-all over fast NVLink first, aggregate per node, then a smaller inter-node all-to-all over the slower InfiniBand fabric, then a final intra-node spread. This turns most of the traffic onto the fast links and sends far fewer, larger messages across the slow ones, cutting both the $\alpha$ and the contention terms. The third is fused kernels: specialized all-to-all implementations that fuse the gather, the network transfer, and the scatter into one optimized operation rather than three library calls with materialized buffers in between.
Who: A systems engineer on a team pretraining a 100-billion-parameter sparse model across 32 nodes of 8 GPUs each.
Situation: Profiling showed the GPUs idle for nearly 40 percent of each step, waiting on the two all-to-alls in every one of the model's MoE layers.
Problem: The expert FLOPs were cheap, exactly as sparse activation intends, so the step time was set almost entirely by inter-node all-to-all over InfiniBand, not by compute.
Dilemma: Buy a faster fabric (slow, capital-heavy, and capped by what the data center supports), or restructure the communication in software to use the fabric they already had.
Decision: Restructure in software: switch the flat all-to-all to a hierarchical NVLink-then-InfiniBand scheme and overlap the dispatch exchange with expert compute.
How: They adopted a dispatch-and-combine library (DeepEP-style) that batches tokens per destination node, routes intra-node traffic over NVLink first, and issues the inter-node exchange asynchronously so it overlaps the experts.
Result: GPU idle time on the all-to-all fell from roughly 40 percent to under 12 percent, and step time dropped by about a third with no change to the model or the math.
Lesson: When a sparse model is interconnect-bound, the fastest win is usually to reshape the all-to-all to match the network topology, not to compute less or buy more hardware.
The nested loops in Code 17.5.1 build and apply the traffic matrix by hand. In a real expert-parallel job each device declares how many tokens it sends to and receives from every other device, and one collective call performs the entire exchange over the network:
import torch, torch.distributed as dist
dist.init_process_group("nccl") # the group of D expert devices
# After the gate: sort local tokens by destination expert and count them.
send_counts = counts_per_destination(dest) # length-D list, tokens per dest device
recv_counts = exchange_counts(send_counts) # an all-to-all on the counts themselves
# One collective performs the whole dispatch; NCCL handles every device-to-device send.
dist.all_to_all_single(recv_buf, send_buf,
output_split_sizes=recv_counts,
input_split_sizes=send_counts)
# ... experts run locally on recv_buf, then a second all_to_all_single combines back.
all_to_all_single call. The hand-built send matrix and per-pair routing collapse into split-size arrays; NCCL performs every device-to-device transfer, and a production stack such as DeepEP or Megatron-MoE wraps this with the overlap and hierarchical routing of Section 4.The all-to-all has become the headline target of sparse-model systems research. DeepEP (DeepSeek, 2025) is an open-source expert-parallel communication library built solely around fast dispatch-and-combine: it provides high-throughput intranode (NVLink) and internode (RDMA) kernels, a low-latency inference path, and computation-communication overlap that hides the exchange behind expert math. It grew directly out of the DeepSeek-V3 training stack (DeepSeek-AI, 2024), which co-designed the routing and the interconnect so that, with 256 experts per layer, all-to-all did not dominate, using node-limited routing to cap how many nodes each token visits. In parallel, work on expert-parallel all-to-all overlap (in Megatron-MoE, Tutel, and FasterMoE lineages) schedules the dispatch and combine exchanges to run concurrently with both the expert forward pass and the backward gradient computation, and adaptively reshapes routing to the live network topology. The common thread: the model's quality lives in its experts, but the frontier of making it trainable at all lives in the collective that moves the tokens.
5. Why the Interconnect Decides Intermediate
Everything in this section converges on one conclusion: for an MoE model, the interconnect is not a detail, it is the design center. A dense model is mostly a compute problem, so a faster accelerator helps directly. A sparse model deliberately makes compute cheap and pays for it in movement, so the machine that matters is the network: NVLink bandwidth inside a node, InfiniBand or equivalent RDMA fabric between nodes, and the topology that connects them. Two clusters with identical GPUs but different fabrics will train the same MoE model at very different speeds, because the two all-to-alls per layer ride the fabric, not the GPU. This is why MoE training is concentrated on machines with the fastest available interconnects, and why the placement of experts relative to the network topology, an idea developed for general collectives in Chapter 4, is decisive here. The next section confronts the imbalance that Output 17.5.1 exposed: when one expert is far more popular than the rest, the all-to-all is dragged down to the speed of its busiest link, and load balancing is what brings it back.
A model has 60 transformer blocks, of which every second block is an MoE block (so 30 MoE blocks). Each MoE block performs a dispatch and a combine all-to-all in the forward pass, and the same two again in the backward pass. How many all-to-all collectives does one training step issue? If each all-to-all takes 0.4 ms on the cluster's fabric, how much of a step is spent purely in this collective? Explain why this number, not the expert FLOPs, is what a systems engineer profiles first.
Modify Code 17.5.1 so the completion time of each all-to-all is modeled as proportional to the maximum column sum of the send matrix (the busiest receiving device), not the total traffic. Run it for a balanced gate (probs uniform) and for the skewed gate given. Report the ratio of modeled times. Then add a simple capacity limit: any tokens beyond a per-expert capacity of $1.25 \times T$ are dropped, and measure how many tokens are lost under each gate. Relate your findings to the capacity-factor mechanism you will meet in Section 17.7.
Using the cost model $t_{\text{a2a}} \approx (D-1)\alpha + B_{\text{dispatch}}/\beta_{\text{eff}}$, consider 64 devices arranged as 8 nodes of 8, with intra-node bandwidth 10x the inter-node bandwidth and intra-node latency 5x lower. Estimate the time for a flat 64-way all-to-all, then for a hierarchical scheme (intra-node 8-way, then inter-node 8-way, then intra-node 8-way) carrying the same total token volume. Under what ratio of intra- to inter-node bandwidth does the hierarchical scheme stop being worth its extra latency terms? Tie your answer to the alpha-beta model of Chapter 3.