"They built a hundred of us and promised every token its pick of two. Most days I sit idle, fully credentialed, waiting for a router to remember I exist."
An Expert Nobody Routed To
Dense scaling ties model size to compute: make the model ten times bigger and every token costs ten times more to process. Sparse scaling cuts that tie. By routing each token to only a few of many expert sub-networks, a model can hold far more parameters while keeping the arithmetic per token roughly fixed. This single idea, conditional computation, is why most frontier models of 2024 to 2026 are sparse mixtures of experts rather than dense stacks. It buys capacity almost for free in floating-point operations, but it does not buy it for free in distribution: the experts no longer fit on one device, so they are sharded across machines, and sending each token to its chosen remote experts turns the forward pass into an all-to-all exchange. This chapter is about that exchange. This first section establishes the dense-versus-sparse trade in numbers, then frames the rest of the chapter as a new collective joining the data-parallel and sharded axes you already know.
By the end of Chapter 16 you had three ways to spread one model across machines. Tensor parallelism splits each matrix multiply across devices, pipeline parallelism assigns different layers to different stages, and sharded data parallelism (ZeRO and FSDP) splits the parameters, gradients, and optimizer state across the data-parallel group and gathers them just in time. All three share an assumption so quiet it is easy to miss: every parameter in the model participates in processing every token. The model is dense. Splitting it across machines changes who stores and who computes each piece, but the total arithmetic per token is fixed by the parameter count. This section questions that assumption, because relaxing it is what the entire chapter is built on, and because the relaxation creates the fourth distribution axis that complements the three from Chapter 16.
1. Why Dense Scaling Hits a Wall (Beginner)
Consider the feed-forward block that sits inside every Transformer layer: two linear maps with a nonlinearity between them. If the hidden width of the model is $d_{\text{model}}$ and the inner width is $d_{\text{ff}}$, the block holds roughly $P = 2\, d_{\text{model}}\, d_{\text{ff}}$ parameters, and pushing one token through it costs about $2P$ floating-point operations, because a matrix-vector product spends two operations (one multiply, one add) per parameter. The dependence is the whole story: in a dense model the FLOPs per token and the parameter count are the same number up to a factor of two. You cannot move one without moving the other.
This is the wall. Suppose a dense model gives good results and you want a model with ten times the capacity, more parameters to memorize more of the world. In a dense design, that model costs ten times as much arithmetic for every token, in training and again at every step of inference, forever. The bill is not paid once; it is paid on every token the model ever processes. Capacity and cost are welded together, and for the trillion-parameter regime that frontier systems now target, the welded version is simply unaffordable to train and to serve.
In a dense model, compute per token is proportional to the parameter count, $\text{FLOPs/token} \approx 2P$. Making the model bigger to hold more knowledge makes every token more expensive by the same factor, on every forward pass for the life of the model. Sparse scaling exists to break this proportionality: it lets $P$ grow while the per-token FLOPs stay near constant, so capacity and compute can be dialed independently.
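The two-operations-per-parameter rule behind $\text{FLOPs/token} \approx 2P$ can be checked by counting scalar operations in a toy feed-forward block. This is a minimal sketch under stated assumptions: the sizes are tiny and illustrative, ReLU stands in for the nonlinearity, and `matvec_counting_ops` is a hypothetical helper written for this example.

```python
def matvec_counting_ops(W, x):
    """Multiply matrix W (rows x cols) by vector x, counting scalar operations."""
    ops = 0
    y = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi   # one multiply and one add per weight
            ops += 2
        y.append(acc)
    return y, ops

d_model, d_ff = 8, 28       # toy sizes chosen for illustration only

W_in = [[0.01] * d_model for _ in range(d_ff)]    # first linear map:  d_ff x d_model
W_out = [[0.01] * d_ff for _ in range(d_model)]   # second linear map: d_model x d_ff

x = [1.0] * d_model
h, ops_in = matvec_counting_ops(W_in, x)
h = [max(0.0, v) for v in h]                      # ReLU (assumed); its d_ff ops are ignored
y, ops_out = matvec_counting_ops(W_out, h)

P = 2 * d_model * d_ff
total_ops = ops_in + ops_out
print(f"parameters P = {P}, counted FLOPs = {total_ops}, ratio = {total_ops / P}")
```

Every weight is touched exactly once per token and costs one multiply and one add, so the count comes out to twice the parameter count whatever the sizes.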
2. Conditional Computation: Pay Only for What You Use (Beginner)
The escape is to stop running every parameter on every token. Replace the single feed-forward block with $E$ parallel copies, called experts, and add a small learned router that, for each token, picks the $k$ experts most worth running (typically $k = 1$ or $k = 2$, with $E$ in the tens or hundreds). The token flows through only its chosen experts; the rest sit idle for that token. This is conditional computation: the computation performed is conditioned on the input, not fixed in advance. Figure 17.1.1 shows the contrast directly, with two of eight experts lit and the other six dark.
The accounting changes in a precise and pleasant way. The model now stores $E$ feed-forward blocks, so its parameter count is roughly $E$ times the dense block, $P_{\text{total}} \approx E \cdot 2\, d_{\text{model}}\, d_{\text{ff}}$. But any single token runs only $k$ of them, so its arithmetic is $\text{FLOPs/token} \approx k \cdot 2 \cdot 2\, d_{\text{model}}\, d_{\text{ff}}$, set by $k$ and not by $E$. The ratio of stored capacity to per-token compute is therefore about $E / k$, a number you control directly. The router itself is tiny, a single linear map from the token to $E$ scores, so it adds a negligible slice of FLOPs while deciding where the expensive arithmetic goes. Section 17.2 builds the mixture-of-experts layer in full, and Section 17.3 studies the router, which is where the real subtlety lives.
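As a preview of the mechanism described above, here is a toy top-$k$ router in plain Python. It is a sketch, not the layer that Section 17.2 builds. The stand-in experts simply scale their input, the router weights are random, and renormalizing the gate weights over the chosen experts is one common convention assumed here. The helper names (`route_top_k`, `moe_forward`, `make_expert`) are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # The router is a single linear map from the token to one score per expert.
    scores = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)   # renormalize over the chosen experts (assumed)
    return [(e, probs[e] / norm) for e in top]

def moe_forward(token, router_weights, experts, k):
    """Run only the chosen experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    chosen = route_top_k(token, router_weights, k)
    for e, gate in chosen:
        y = experts[e](token)           # the other E - k experts never run
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, [e for e, _ in chosen]

def make_expert(scale):
    # Stand-in expert: scales the token by a constant instead of running an FFN.
    return lambda x: [scale * v for v in x]

random.seed(0)
d_model, E, k = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(E)]
experts = [make_expert(i + 1) for i in range(E)]

token = [random.gauss(0, 1) for _ in range(d_model)]
output, used = moe_forward(token, router_weights, experts, k)
print(f"experts run: {sorted(used)} of {E}")
```

The loop in `moe_forward` visits only $k$ experts, so the cost of the expensive part is independent of $E$. Only the router's score computation grows with $E$, and it is a single small matrix-vector product.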
It helps to see the numbers rather than the symbols. The program below tallies parameters and per-token FLOPs for a dense feed-forward block and for three mixture-of-experts variants built from the same block, using the dimensions of a typical large Transformer. It prints stored parameters, FLOPs per token, and active (used) parameters per token for each configuration.
def ffn_params(d_model, d_ff):
    # Two linear layers of an FFN: (d_model x d_ff) + (d_ff x d_model).
    return 2 * d_model * d_ff

def ffn_flops_per_token(d_model, d_ff):
    # ~2 FLOPs per parameter (one multiply, one add) for one token.
    return 2 * ffn_params(d_model, d_ff)

d_model, d_ff = 4096, 14336  # a GPT-style block (d_ff ~ 3.5 * d_model)

# --- Dense reference: one FFN, every token uses every parameter. ---
dense_params = ffn_params(d_model, d_ff)
dense_flops = ffn_flops_per_token(d_model, d_ff)

print(f"{'config':<28}{'FFN params':>16}{'FLOPs/token':>16}{'active params':>16}")
print("-" * 76)
print(f"{'dense (1 FFN)':<28}{dense_params:>16,}{dense_flops:>16,}{dense_params:>16,}")

# --- MoE: E experts, route each token to top-k. ---
for E, k in [(8, 2), (64, 2), (256, 8)]:
    moe_params = E * ffn_params(d_model, d_ff)          # ALL experts stored
    active = k * ffn_params(d_model, d_ff)              # only k run per token
    moe_flops = k * ffn_flops_per_token(d_model, d_ff)  # FLOPs scale with k, not E
    print(f"{f'MoE E={E}, top-{k}':<28}{moe_params:>16,}{moe_flops:>16,}{active:>16,}")

print("-" * 76)
E, k = 64, 2
print(f"E={E}, top-{k}: parameters x{E:>3} FLOPs/token x{k} "
      f"capacity-per-FLOP gain x{E//k}")
config                            FFN params     FLOPs/token   active params
----------------------------------------------------------------------------
dense (1 FFN)                    117,440,512     234,881,024     117,440,512
MoE E=8, top-2                   939,524,096     469,762,048     234,881,024
MoE E=64, top-2                7,516,192,768     469,762,048     234,881,024
MoE E=256, top-8              30,064,771,072   1,879,048,192     939,524,096
----------------------------------------------------------------------------
E=64, top-2: parameters x 64 FLOPs/token x2 capacity-per-FLOP gain x32
Read the two middle rows of Output 17.1.1 together. The jump from $E=8$ to $E=64$ multiplies stored parameters eightfold, from 0.9 billion to 7.5 billion in a single block, yet the FLOPs-per-token column does not move at all, because both route to $k=2$ experts. That is the decoupling stated as a measurement: a wider pool of experts adds capacity and memory while leaving the per-token arithmetic exactly where it was. The last row shows the other knob: raising $k$ to 8 does raise FLOPs, because $k$, not $E$, sets the compute. A modeler turns $E$ to buy capacity and turns $k$ to spend compute, and the two dials are finally independent.
A dense model is a single overworked generalist who must personally weigh in on every word. A mixture of experts is a large committee with a brisk chair: for each word the chair (the router) wakes the two members most likely to have something useful to say and lets everyone else nap. The committee knows enormously more in total, yet any given decision still only troubles two people. The catch, which the rest of the chapter is about, is that the committee is too big to fit in one room, so the chair has to phone members in other buildings.
3. Why Capacity Without Compute Is Worth Having (Intermediate)
A fair objection: if most experts are idle on most tokens, are the extra parameters doing anything? They are, and the reason is that different tokens choose different experts. Over a batch of thousands of tokens, the router spreads work across the whole pool, so every expert is busy on the tokens that suit it even though each individual token touches only $k$. The capacity is real because it is amortized across the input distribution, not because any one token uses all of it. Specialization is the payoff: experts drift toward different kinds of input (a rough intuition, since the router is learned end to end and the specialization is soft rather than labeled), and the model as a whole holds more distinct competence than a dense model of equal per-token cost.
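The amortization argument can be illustrated with a small simulation. The sketch below replaces the learned router with a uniform random choice of $k$ distinct experts per token. That is an assumption made purely for illustration, since a trained router is not uniform (which is why load balancing gets its own section later). It then tallies how many tokens each expert receives over a batch.

```python
import random
from collections import Counter

random.seed(0)
E, k, num_tokens = 64, 2, 4096

load = Counter()
for _ in range(num_tokens):
    # Stand-in router (assumption): k distinct experts chosen uniformly at random.
    for expert in random.sample(range(E), k):
        load[expert] += 1

busy = len(load)
print(f"experts that received at least one token: {busy} of {E}")
print(f"tokens per expert: min {min(load.values())}, "
      f"mean {num_tokens * k / E:.0f}, max {max(load.values())}")
```

Each token touches only 2 of the 64 experts, yet across the batch every expert is handed on the order of a hundred tokens. The capacity is used by the batch as a whole, not by any single token.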
This is why the comparison that matters is not "MoE versus a dense model of the same parameter count" but "MoE versus a dense model of the same per-token FLOPs". On the FLOPs-matched comparison, the sparse model wins, because for the same compute budget per token it carries far more parameters and therefore more knowledge. That is the comparison frontier labs actually run, and it is why the sparse design has taken over. The cost it imposes is not arithmetic; it is memory and movement, which is to say, distribution.
The dense-to-sparse shift is visible across the frontier of 2024 to 2026. Mistral's Mixtral 8x7B (Jiang et al., 2024) made the recipe concrete and open: eight experts per layer, top-2 routing, roughly 47 billion stored parameters but only about 13 billion active per token, matching or beating much larger dense models at a fraction of the inference FLOPs. DeepSeek-V3 (DeepSeek-AI, 2024) pushed the design to 671 billion total parameters with about 37 billion active, using many fine-grained experts plus a few always-on shared experts and an auxiliary-loss-free load-balancing scheme. xAI's Grok-1 was released as a 314-billion-parameter mixture of experts, and Alibaba's Qwen and Google's Gemini families ship MoE variants. The common thread is the trade in Output 17.1.1: enormous stored capacity, modest active compute. The open problems are equally consistent, namely how to route tokens stably, how to keep experts evenly loaded, and how to move tokens to experts across machines without the all-to-all becoming the bottleneck, which Sections 17.3 through 17.6 take up in turn.
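The stored-versus-active figures quoted above can be turned into ratios with a few lines of arithmetic. The sketch uses only the rounded parameter counts given in the text, so the results are approximate.

```python
# (stored parameters, active parameters per token), rounded figures from the text.
models = {
    "Mixtral 8x7B": (47e9, 13e9),
    "DeepSeek-V3": (671e9, 37e9),
}

ratios = {}
for name, (stored, active) in models.items():
    ratios[name] = stored / active
    print(f"{name:<14} active fraction {active / stored:6.1%}   "
          f"stored/active {stored / active:5.1f}x")
```

Mixtral's whole-model ratio of about 3.6 sits below the $E/k = 4$ of its expert layers because the attention layers and embeddings are not replicated per expert. Every token runs them, so they count toward both totals.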
4. Model Scale Becomes a Distribution Problem (Intermediate)
Look again at Output 17.1.1. The $E=256$, top-8 configuration stores 30 billion parameters in a single layer's worth of experts, and a real model stacks dozens of such layers. That total does not fit in the memory of one accelerator, and there is no version of scaling up a single device that fixes it. The experts must be spread across machines. This is the moment the chapter turns from a modeling idea into a systems problem: the very thing that made sparse models cheap in FLOPs, holding many experts, makes them expensive in memory and forces them across the network.
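A back-of-the-envelope calculation makes the memory claim concrete. The byte width, layer count, and device capacity below are assumptions chosen for illustration (16-bit weights, a hypothetical 32 MoE layers, a hypothetical 80 GB accelerator). Only the weights are counted, so gradients, optimizer state, and activations would add to the total.

```python
import math

d_model, d_ff, E = 4096, 14336, 256

bytes_per_param = 2      # assumption: 16-bit (bf16/fp16) weights
num_moe_layers = 32      # assumption: hypothetical model depth
device_memory_gb = 80    # assumption: hypothetical accelerator capacity

params_per_layer = E * 2 * d_model * d_ff
layer_gb = params_per_layer * bytes_per_param / 1e9
model_gb = layer_gb * num_moe_layers
devices_needed = math.ceil(model_gb / device_memory_gb)

print(f"expert weights per layer: {layer_gb:,.1f} GB")
print(f"expert weights, {num_moe_layers} layers: {model_gb:,.1f} GB")
print(f"devices needed for weights alone: {devices_needed}")
```

Under these assumptions a single layer's experts already take most of one device, and the full stack needs a couple of dozen devices before any activation or optimizer memory is counted.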
The natural placement is to put different experts on different devices, a scheme called expert parallelism, developed in Section 17.4. It is a genuinely new axis, distinct from the three in Chapter 16. Tensor parallelism splits one operation across devices; pipeline parallelism splits the layer sequence into stages; sharded data parallelism splits the parameter state but still runs every parameter on every token. Expert parallelism splits the parameters by which token will use them, a partition that only exists because the computation is conditional. The four axes compose: a frontier training job runs tensor, pipeline, sharded-data, and expert parallelism at once, a combination Chapter 19 assembles in full.
Placing experts on different devices creates an immediate communication pattern. A token arrives on the device that holds its position in the sequence, but the experts it was routed to live on other devices. So before the experts run, every device must ship each of its tokens to whichever devices hold that token's chosen experts, and after the experts run, the results must come back to where the tokens started. Every device sends some data to every other device and receives some from every other device. That pattern is the all-to-all collective, the defining communication of this chapter, detailed in Section 17.5 and built on the all-to-all primitive introduced back in Section 4.6.
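The dispatch-compute-return pattern just described can be simulated in one process, with nested lists standing in for the network. This is a sketch under stated assumptions. The `all_to_all` function here is a hypothetical pure-Python stand-in for the collective, not a library call. The stand-in experts multiply by a constant, routing choices are hard-coded, and gate weights are omitted so the combine step is a plain sum.

```python
def all_to_all(send):
    """send[src][dst] is what src ships to dst; returns recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

num_devices, experts_per_device = 4, 2   # 8 experts, 2 per device

def owner(expert):
    return expert // experts_per_device  # device that stores this expert

# tokens[d] lists (token_id, value, chosen_experts) for tokens resident on device d.
tokens = [
    [(0, 1.0, [0, 5]), (1, 2.0, [3, 6])],
    [(2, 3.0, [1, 2]), (3, 4.0, [7, 0])],
    [(4, 5.0, [4, 5]), (5, 6.0, [2, 6])],
    [(6, 7.0, [1, 7]), (7, 8.0, [3, 4])],
]

def empty_buffers():
    return [[[] for _ in range(num_devices)] for _ in range(num_devices)]

# 1. Dispatch: each device sorts its tokens into one send buffer per destination.
send = empty_buffers()
for src, local in enumerate(tokens):
    for token_id, value, chosen in local:
        for e in chosen:
            send[src][owner(e)].append((token_id, e, value))
recv = all_to_all(send)

# 2. Compute: each device runs only the experts it stores.
back = empty_buffers()
for dev in range(num_devices):
    for src in range(num_devices):
        for token_id, e, value in recv[dev][src]:
            assert owner(e) == dev
            back[dev][src].append((token_id, (e + 1) * value))  # stand-in expert e

# 3. Return: a second all-to-all brings outputs back to each token's home device.
returned = all_to_all(back)
combined = [dict() for _ in range(num_devices)]
for dev in range(num_devices):
    for src in range(num_devices):
        for token_id, y in returned[dev][src]:
            combined[dev][token_id] = combined[dev].get(token_id, 0.0) + y

for dev in range(num_devices):
    print(f"device {dev}: {combined[dev]}")
```

Two exchanges bracket the expert computation, one to dispatch and one to return. In both, every device holds a separate buffer for every other device, which is what distinguishes all-to-all from the reduction-style collectives of earlier chapters.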
Each parallel axis in this book is defined by the collective it leans on. Data-parallel training (Chapter 15) is built on all-reduce, the summing of one gradient vector per worker first seen in Section 1.1. Sharded training (Chapter 16) leans on reduce-scatter and all-gather. Expert parallelism is defined by all-to-all: every device exchanges a different slice of tokens with every other device to route them to remote experts. When you meet a new MoE system, ask first how it implements and overlaps its all-to-all, because that collective, introduced in Section 4.6, is the axis on which this entire chapter turns. Sparse scaling does not remove communication; it changes which collective dominates.
Code 17.1.1 only counts parameters and FLOPs; it does not run a model. Building a real routed expert layer from scratch, with a gating network, top-$k$ selection, token dispatch, and the cross-device all-to-all, is a few hundred lines that Sections 17.2 through 17.5 unpack. Production frameworks collapse it to a configuration choice. In DeepSpeed-MoE, wrapping a feed-forward module as a distributed mixture of experts is essentially one constructor call:
# pip install deepspeed ; the experts are sharded across the expert-parallel group.
import deepspeed.moe.layer as moe

expert_layer = moe.MoE(
    hidden_size=4096,            # d_model
    expert=feed_forward_module,  # the dense FFN to replicate into experts
    num_experts=64,              # E: stored capacity
    k=2,                         # top-k routing: per-token compute
    ep_size=8,                   # experts sharded across 8 devices (expert parallel)
)
# DeepSpeed builds the gate, the top-k dispatch, and the all-to-all that moves
# tokens to remote experts and the expert outputs back, all inside .forward().
Code 17.1.2: MoE constructor; ep_size is the expert-parallel degree that Section 17.4 explains, and the library schedules the all-to-all of Section 17.5 internally.
Who: An applied-research team at a startup serving a code-completion assistant on a fixed inference budget.
Situation: Their dense 13-billion-parameter model was accurate enough to be useful but plateaued on rare languages and uncommon library APIs, where it lacked the knowledge to autocomplete well.
Problem: A dense 60-billion-parameter model closed the gap in offline tests, but it cost roughly $4.6\times$ the FLOPs per token, which blew the latency budget and quadrupled the serving bill.
Dilemma: Ship the dense 60B and miss the latency target and the budget, or stay on the dense 13B and keep failing on the long tail of rare inputs.
Decision: They trained a mixture of experts sized like the $E=64$, top-2 row of Output 17.1.1: many times the stored capacity of the 13B model, but only about $2\times$ its active feed-forward parameters per token, so per-token compute stayed far closer to the original than the dense 60B's $4.6\times$.
How: They replaced the feed-forward blocks with routed experts using a framework MoE layer (Code 17.1.2), sharded the experts across eight inference devices with expert parallelism, and accepted an all-to-all exchange per MoE layer as the new cost to manage.
Result: Long-tail completion quality rose to within a point of the dense 60B model while per-token FLOPs stayed near the 13B baseline; the binding cost moved from arithmetic to the all-to-all latency, which they then optimized with the techniques of Section 17.5.
Lesson: When the gap is knowledge rather than reasoning depth, sparse capacity buys what dense compute cannot afford. The bill does not vanish; it relocates from FLOPs to communication, which is a bill this book teaches you to manage.
5. What This Chapter Builds (Beginner)
The plan follows the trade from this section straight into its consequences. Section 17.2 constructs the mixture-of-experts layer that replaces a dense feed-forward block. Section 17.3 studies routing and gating, where a token is matched to its experts, and the instabilities that come with a learned router. Sections 17.4 and 17.5 are the distribution core: sharding experts across machines, then the all-to-all that moves tokens to them and the outputs back. The remaining sections handle what breaks at scale, namely uneven expert load (17.6), capacity limits and dropped tokens (17.7), and serving a sharded sparse model under a latency budget (17.8), before 17.9 weighs sparse against dense distributed models on even terms and closes the chapter.
The thread to hold onto is the one this section opened with. Sparse scaling decouples how much a model knows from how much it computes per token, which is a modeling win, but it pays for that win in memory and movement, which is a distribution problem. Every later section in the chapter is a way of paying that bill efficiently, and the currency, more than in any earlier chapter, is the all-to-all collective. We turn first to the layer itself in Section 17.2.
Using only Output 17.1.1, answer without rerunning the code. (a) Which two rows have identical FLOPs per token, and what single design choice explains it? (b) A colleague proposes $E=128$, top-2 with the same $d_{\text{model}}$ and $d_{\text{ff}}$. State its stored parameters, FLOPs per token, and active parameters per token by reasoning from the existing rows, not by recomputing the products. (c) Explain in one sentence why comparing this MoE to a dense model of the same stored parameter count is the wrong comparison, and what the right one is.
Extend Code 17.1.1 to model the DeepSeek-style design of a few always-on shared experts plus the routed top-$k$ experts. Add a parameter $s$ for the number of shared experts that run on every token, and update both the FLOPs-per-token formula (now $(k+s)$ experts run) and the stored-parameter formula (now $E+s$ experts stored). Print a row for $E=64$, $k=2$, $s=2$ and compare its FLOPs per token and stored parameters to the plain $E=64$, top-2 row. State in one sentence what the shared experts cost in compute and what you would expect them to buy.
A sparse layer stores $E$ experts but runs $k$, so its capacity-to-compute ratio is $E/k$. Suppose each expert lives on its own device and routing a token to a remote expert costs a fixed communication time $\tau$ per token per chosen expert, while running an expert costs compute time $c$ per token. Write the per-token time as a function of $k$, $\tau$, and $c$ (ignore the all-to-all's algorithmic cleverness). Argue from your expression at what point increasing $k$ to improve quality is dominated by communication rather than compute, and explain why raising $E$ at fixed $k$ does not change the per-token time but does change the memory and load-balancing burden that Section 17.6 must manage.