Part IV: Parallel Deep Learning and Large Models
Chapter 17: Expert Parallelism and Sparse Distributed Models

Trade-Offs vs Dense Distributed Models

"They gave me thirty-two times the parameters and told me to act surprised when the network bill arrived."

An Expert Nobody Routed To
Big Picture

A mixture-of-experts model buys a large increase in parameter count, and the quality that comes with it, for almost no increase in per-token compute; it pays for that bargain with all-to-all communication, harder training, and serving memory that scales with the full parameter count rather than the active one. That single sentence is the verdict of this chapter. Sparse scaling decouples capacity from the floating-point cost of a forward pass, which is why it has become the dominant frontier recipe of 2024 to 2026. But the decoupling is not free: the saved compute reappears as bytes on the interconnect, as load-balancing machinery in the training loop, and as memory on every serving replica. This closing section weighs both sides on one scale, says plainly when MoE wins and when a dense model is the better engineering choice, and folds expert parallelism into the larger parallel stack as one more axis whose collective is the all-to-all.

Every previous section of this chapter argued a single mechanism. Section 17.1 separated capacity from FLOPs; Section 17.2 built the mixture-of-experts layer; Section 17.3 routed tokens through a gate; Section 17.4 sharded experts across machines; Section 17.5 turned that sharding into an all-to-all; Sections 17.6 and 17.7 fought to keep the experts balanced and stable; Section 17.8 exposed the serving paradox that an MoE is cheap in compute but expensive in memory. This section does not introduce a new mechanism. It assembles those mechanisms into a decision: for a given system, is sparse scaling worth its costs, or is a dense distributed model the wiser choice? We answer with numbers first and judgment second.

1. The Case for Sparse Scaling (Beginner)

The argument for mixture-of-experts is one quantity stated two ways. For a fixed training FLOP budget, a sparse model holds far more parameters than a dense one, and parameter count is the lever that buys model quality. Equivalently, for a fixed parameter count, a sparse model trains and runs at a fraction of the FLOPs. Either framing describes the same decoupling introduced in Section 17.1: total capacity is set by the number of experts, while per-token compute is set only by the few experts each token actually visits. A model with $E$ experts and top-$k$ routing has $E$ times the parameters of the single dense feed-forward block it replaces, and roughly $E/k$ times the parameters of a dense stack matched to the same per-token FLOPs, while spending the FLOPs of only $k$ experts.
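
The $E/k$ ratio can be checked with a short count. The sketch below assumes, as a simplification, that every expert is a two-matrix feed-forward block of shape $d \times d_{\text{ff}}$ with no biases or gate matrices, that the router's own parameters are negligible, and that the dense baseline stacks $k$ such blocks so that per-token FLOPs match:

```latex
\begin{aligned}
P_{\text{expert}} &= 2\, d\, d_{\text{ff}}, \\
P_{\text{MoE}} &= E \cdot P_{\text{expert}}, \qquad
P_{\text{dense}} = k \cdot P_{\text{expert}}, \\
\frac{P_{\text{MoE}}}{P_{\text{dense}}} &= \frac{E}{k}, \\
F_{\text{MoE}} &= k \cdot 2\, P_{\text{expert}} = F_{\text{dense}} .
\end{aligned}
```

Here $P$ counts parameters and $F$ counts per-token FLOPs (one multiply and one add per weight). Raising $k$ at fixed $E$ moves both lines at once: the ratio $E/k$ falls while $F_{\text{MoE}}$ rises.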

This is why the frontier moved. When the binding constraint on model quality is the training compute budget, and that budget is enormous but still finite, sparsity converts spare memory and interconnect bandwidth into extra capacity that dense scaling cannot reach at the same FLOP cost. Production systems in the lineage of the Switch Transformer and the open Mixtral and DeepSeek-V3 families exploit exactly this: they ship far more total parameters than they activate per token, in the largest cases hundreds of billions in total against tens of billions active, and they reach quality that a dense model of the active size cannot match. The bargain is real, and the next subsection puts a number on it.

Key Insight: MoE Spends Memory and Bandwidth to Buy Capacity Without Buying FLOPs

Dense scaling raises quality by adding parameters that every token must compute through, so capacity and per-token FLOPs rise together. Sparse scaling breaks that coupling: it adds parameters that only a routed subset of tokens compute through, so capacity rises while per-token FLOPs stay flat. The catch is conservation of difficulty. The compute you did not spend does not vanish; it reappears as the all-to-all that moves tokens to their experts (Section 17.5), the balancing machinery that keeps experts busy (Section 17.6), and the memory that must hold every expert whether it fires or not (Section 17.8). MoE is not cheaper; it relocates the cost from FLOPs to bytes and to systems complexity.

2. Putting Numbers to the Trade-Off (Intermediate)

A verdict deserves arithmetic. The program below compares a dense model and an MoE model designed to spend the same per-token compute, then reports four quantities that decide the trade-off: capacity (total parameters), per-token compute, communication volume per training step, and serving memory. The two models are matched at equal per-token FLOPs by giving the dense model exactly $k$ feed-forward blocks, so that it activates the same amount of compute the MoE spends on its $k$ routed experts. Everything else, the capacity and the costs, is then read off directly.

# dense vs MoE at EQUAL per-token compute; report capacity, FLOPs, comm, memory.
d = 4096            # model (hidden) width
d_ff = 4 * d        # feed-forward inner width per expert
bytes_per_param = 2 # bf16 weights at serving time
T = 8192            # tokens per training step
E, k = 64, 2        # experts, and experts each token is routed to (top-k)

def ffn_params(d, d_ff):      return 2 * d * d_ff          # two matrices
def ffn_flops_tok(d, d_ff):   return 2 * ffn_params(d, d_ff)  # mul + add

L_dense = k                    # match per-token FLOPs: dense depth = k blocks
dense_params = L_dense * ffn_params(d, d_ff)
moe_params   = E * ffn_params(d, d_ff)               # ALL experts are real params

dense_flops = L_dense * ffn_flops_tok(d, d_ff)
moe_flops   = k * ffn_flops_tok(d, d_ff)             # only k experts fire

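dense_comm = 0                                        # FFN-local: no token shuffle
# two forward all-to-alls (dispatch + combine), each moving k copies of every
# token's d-wide activation; assumes bf16 activations, backward pass not counted
moe_comm   = 2 * k * T * d * bytes_per_param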

dense_serve = dense_params * bytes_per_param          # every param must be resident
moe_serve   = moe_params   * bytes_per_param

print(f"{'metric':30}{'DENSE':>14}{'MoE':>14}{'MoE/DENSE':>12}")
print(f"{'capacity (B params)':30}{dense_params/1e9:>14.2f}{moe_params/1e9:>14.2f}{moe_params/dense_params:>11.0f}x")
print(f"{'per-token compute (GFLOP)':30}{dense_flops/1e9:>14.2f}{moe_flops/1e9:>14.2f}{moe_flops/dense_flops:>11.0f}x")
print(f"{'comm / step (MiB)':30}{dense_comm/2**20:>14.2f}{moe_comm/2**20:>14.2f}{'new cost':>12}")
print(f"{'serving memory (GiB)':30}{dense_serve/2**30:>14.2f}{moe_serve/2**30:>14.2f}{moe_serve/dense_serve:>11.0f}x")
Code 17.9.1: The chapter's trade-off in one program. Both models spend identical per-token FLOPs by construction (dense depth set to $k$); the comparison then exposes what MoE gains and what it pays. Pure Python, no libraries, so the numbers are auditable by hand.
metric                                 DENSE           MoE   MoE/DENSE
capacity (B params)                     0.27          8.59         32x
per-token compute (GFLOP)               0.54          0.54          1x
comm / step (MiB)                       0.00        256.00    new cost
serving memory (GiB)                    0.50         16.00         32x
Output 17.9.1: At equal per-token compute, the MoE holds 32x the parameters ($E/k = 64/2$) for the same 0.54 GFLOP per token. The bargain is paid in two new currencies: 256 MiB of all-to-all traffic per step that the dense model never moves, and 32x the serving memory because every expert is resident whether it fires or not.

The numbers make the verdict concrete. The MoE carries thirty-two times the parameters, the ratio $E/k$, at identical per-token compute, which is precisely the quality lever sparse scaling exists to pull. The price sits in the last two rows. The dense model moves no tokens between machines for this layer, while the MoE moves 256 MiB across the interconnect every step: two all-to-alls, dispatch and combine, each carrying $k$ routed copies of every token's activation (Section 17.5). That count covers the forward pass only; in training, the backward pass repeats both. The dense model fits its layer in half a gibibyte of bf16 weights while the MoE needs sixteen, because serving memory tracks total capacity, not active capacity (Section 17.8). Whether thirty-two times the capacity is worth thirty-two times the memory and a new communication bill is the whole question, and it has no universal answer. It depends on which resource you have in surplus.

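[Figure: four-axis chart comparing Dense and MoE. Axes: capacity (params), serving memory, communication (all-to-all), FLOPs / token. MoE reaches 32x on capacity and on serving memory and 256 MiB on communication, where Dense is low or ~0; FLOPs / token are the same.]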
Figure 17.9.1: The trade-off on four axes, drawn from Output 17.9.1. The two shapes touch on the left axis (FLOPs per token are equal by construction) and diverge sharply on the other three: the MoE stretches far out on capacity, serving memory, and communication, while the dense model stays small and balanced. Choosing MoE means accepting an elongated shape: you gain on the top axis only by paying on the right and bottom ones.

3. The Costs, Named Plainly (Intermediate)

Output 17.9.1 quantifies two costs; the chapter has uncovered four, and an honest verdict names all of them. The first is heavier communication. The all-to-all of Section 17.5 is the most placement-sensitive collective in the parallel toolbox: unlike an all-reduce, whose volume is fixed by the model size, the all-to-all moves tokens and so its cost rides on the routing decision and on the slowest link any token must cross. This makes the interconnect, not the accelerator, the decisive resource for MoE training, and it is why the 256 MiB per step in our small example becomes the dominant term at frontier scale.
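
A small sketch can make the routing dependence concrete. It is a toy model, not the chapter's cost model: it assumes one expert per device, that every routed copy crosses the link, and that each device receives at a fixed rate (`link_gb_per_s`, a hypothetical parameter set here to 100 GB/s). Under those assumptions the dispatch cannot finish before the busiest expert has received its tokens, so the same total volume takes longer when routing is skewed.

```python
# Sketch only: a lower bound on dispatch time set by the busiest receiving expert.
# Assumptions (not from the chapter): one expert per device, every routed copy
# crosses the link, and each device receives at link_gb_per_s gigabytes per second.

def dispatch_time_ms(tokens_per_expert, d=4096, bytes_per_elem=2, link_gb_per_s=100.0):
    """Time for the most-loaded expert to receive its tokens, in milliseconds."""
    inbound_bytes = max(tokens_per_expert) * d * bytes_per_elem
    return inbound_bytes / (link_gb_per_s * 1e9) * 1e3

T, k, E = 8192, 2, 64
routed = T * k                                  # routed token copies per step

balanced = [routed // E] * E                    # every expert gets an equal share
collapsed = [routed // 4] * 4 + [0] * (E - 4)   # four hot experts, the rest starved

t_balanced = dispatch_time_ms(balanced)
t_collapsed = dispatch_time_ms(collapsed)
slowdown = t_collapsed / t_balanced

print(f"total copies: balanced={sum(balanced)} collapsed={sum(collapsed)}")
print(f"dispatch lower bound: {t_balanced:.3f} ms vs {t_collapsed:.3f} ms ({slowdown:.0f}x)")
```

Both routings move the same number of token copies, yet the collapsed one is sixteen times slower in this model, because the hot experts each receive 4096 copies instead of 256. An all-reduce has no analogous term: its volume does not change with what the data happens to contain.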

The second cost is harder training. A dense model trains itself; an MoE must be coaxed into using its experts evenly. Without the load-balancing loss of Section 17.6 and the capacity factors of Section 17.7, routing collapses onto a few popular experts, the rest starve, and the extra parameters are wasted. These mechanisms add hyperparameters, a tuning burden, and failure modes (token dropping, instability) that a dense model simply does not have. The third cost is memory-bound serving, the paradox of Section 17.8: a request touches only $k$ experts of compute but the deployment must keep all $E$ resident, so a serving replica's memory is set by the total parameter count even though its FLOPs are set by the active count. The fourth is plain systems complexity: more moving parts, more places to misconfigure, more interaction with the rest of the parallel stack.
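
The two balancing mechanisms named above can be sketched in a few lines. The auxiliary loss below follows one common form, from the Switch Transformer paper, $\alpha \cdot E \cdot \sum_i f_i P_i$, where $f_i$ is the fraction of tokens dispatched to expert $i$ and $P_i$ is the mean router probability for it; whether Section 17.6 uses exactly this form is an assumption. The capacity rule, slots per expert equal to the capacity factor times the even share, is likewise one common convention, and the values of `alpha` and `capacity_factor` are illustrative.

```python
import math

def switch_aux_loss(dispatch_frac, mean_prob, alpha=0.01):
    """alpha * E * sum_i f_i * P_i; equals alpha when routing is perfectly uniform."""
    E = len(dispatch_frac)
    return alpha * E * sum(f * p for f, p in zip(dispatch_frac, mean_prob))

def dropped_tokens(tokens_per_expert, capacity_factor=1.25):
    """Return (dropped, capacity): tokens beyond each expert's slot count are dropped."""
    E = len(tokens_per_expert)
    even_share = sum(tokens_per_expert) / E
    capacity = math.ceil(capacity_factor * even_share)
    dropped = sum(max(0, n - capacity) for n in tokens_per_expert)
    return dropped, capacity

E = 8
uniform = [1.0 / E] * E
skewed = [0.65] + [0.05] * (E - 1)          # one popular expert, seven starved

# For simplicity the dispatch fraction and the mean probability are set equal.
loss_uniform = switch_aux_loss(uniform, uniform)
loss_skewed = switch_aux_loss(skewed, skewed)

drop_uniform, cap = dropped_tokens([200] * E)
drop_skewed, _ = dropped_tokens([1040] + [80] * (E - 1))

print(f"aux loss: uniform={loss_uniform:.4f} skewed={loss_skewed:.4f}")
print(f"capacity={cap} slots; dropped: uniform={drop_uniform} skewed={drop_skewed}")
```

The loss is minimized by uniform routing, so its gradient pushes the gate toward balance, while the capacity rule bounds each expert's work at the price of dropping the overflow. Both knobs are exactly the extra hyperparameters a dense model never has to tune.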

Practical Example: The Team That Chose Dense After Pricing the Interconnect

Who: An applied-research team at a mid-size company fine-tuning an open model for an internal assistant.

Situation: They had budget for eight GPUs on a single node with fast intra-node links, but no fast cross-node fabric, and they wanted the best quality their compute could buy.

Problem: A sparse model promised more capacity per training FLOP, and the open MoE checkpoints were tempting, so the instinct was to reach for mixture-of-experts.

Dilemma: Adopt an MoE to gain capacity per FLOP, accepting the all-to-all that their slow cross-node link would throttle and the serving memory that would not fit their inference box; or stay dense, simpler and interconnect-light, but capped at the active parameter count.

Decision: They stayed dense, because every MoE cost they would pay (all-to-all bandwidth, full-capacity serving memory, balancing tuning) landed exactly on the resources they lacked, while the one MoE benefit (capacity per FLOP) did not address their binding constraint, which was compute and hardware rather than model capacity.

How: They scaled the dense model with the sharded-data-parallel methods of Chapter 16, keeping the whole job inside the fast intra-node fabric.

Result: Training and serving stayed within their hardware, with no all-to-all on a slow wire and no balancing instability, at quality competitive with what an MoE would have delivered on their constrained interconnect.

Lesson: MoE wins when memory and interconnect are abundant and capacity is the binding constraint. When the interconnect is slow, the inference box is small, or the model already fits, dense is simpler and often better.

Fun Note: Conservation of Difficulty

There is a folk law of systems engineering: difficulty is conserved, never destroyed. MoE looks like it abolishes the FLOP cost of extra parameters, and in a strict accounting it does. Then you check the network counter and find the difficulty waiting for you there, in bytes, having taken the scenic route through the all-to-all. The capacity was free. The plumbing was not.

4. When MoE Wins, and When Dense Is Better (Intermediate)

The decision reduces to matching the method to the resource that is scarce, exactly the discipline this book opened with in Section 1.1. Table 17.9.1 lays the choice out as a set of conditions. Read it as a checklist: the more rows that favor a column, the clearer the call.

Table 17.9.1: When sparse scaling earns its costs, and when a dense distributed model is the simpler and better choice. The deciding question for each row is which resource binds.
Condition | Favors MoE | Favors dense
Binding constraint on quality | training FLOP budget (capacity is what you lack) | memory or latency (capacity is not the bottleneck)
Interconnect | fast, uniform fabric for the all-to-all | slow or non-uniform cross-node links
Memory budget | abundant; can hold all experts resident | limited; only the active parameters fit
Scale | large-scale training and high-throughput serving | small batch, modest scale, single node
Serving pattern | large batches that fill expert capacity | small batches where experts sit idle
Deployment complexity tolerance | a platform team to tune balancing and routing | simplicity prized; few moving parts

The pattern across the rows is consistent. MoE rewards abundance: abundant memory to hold the experts, an abundant and fast interconnect to move the tokens, and an abundant batch to keep every expert busy. Strip any of those away and the sparse model pays its costs without collecting its benefit. Dense wins the opposite regime: limited memory, small batches, simple deployments, and any setting where the model already fits and only compute or latency binds. Neither answer is universally right, which is the entire point of weighing them.
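
Reading the table as a checklist can be written down directly. The helper below is a hypothetical sketch: the row names are invented labels for the six table rows, every row is given equal weight (an assumption; in practice one hard constraint such as memory can veto the rest), and the example profile is illustrative rather than measured.

```python
# Hypothetical checklist: one boolean per table row; True means the row favors MoE.
ROWS = (
    "flop_budget_binds",          # quality is limited by training FLOPs, capacity is lacking
    "fast_uniform_interconnect",  # the all-to-all has a fast, uniform fabric
    "memory_holds_all_experts",   # every expert can stay resident
    "large_scale",                # large-scale training, high-throughput serving
    "large_serving_batches",      # batches big enough to fill expert capacity
    "platform_team_available",    # someone can tune balancing and routing
)

def tally(profile):
    """Count the rows favoring each column; a missing row raises KeyError on purpose."""
    moe = sum(1 for row in ROWS if profile[row])
    return {"moe": moe, "dense": len(ROWS) - moe}

def lean(profile):
    """Return the column with more rows, or 'undecided' on a tie."""
    votes = tally(profile)
    if votes["moe"] == votes["dense"]:
        return "undecided"
    return "moe" if votes["moe"] > votes["dense"] else "dense"

# Illustrative profile: a single node, a small inference box, no platform team.
single_node_profile = {
    "flop_budget_binds": True,
    "fast_uniform_interconnect": False,
    "memory_holds_all_experts": False,
    "large_scale": False,
    "large_serving_batches": False,
    "platform_team_available": False,
}

votes = tally(single_node_profile)
print(votes, "->", lean(single_node_profile))
```

The count is a reading aid, not a decision procedure: it only makes explicit which rows were answered and how, so a disagreement about the call becomes a disagreement about a specific row.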

5. Expert Parallelism in the Parallel Stack (Advanced)

Expert parallelism is not a rival to the parallelism axes of Chapter 16; it is one more axis that composes with them. The 3D stack of that chapter, data parallelism replicating the model, pipeline parallelism splitting the layer stack, tensor parallelism splitting a single layer, gains a fourth dimension when a layer is a mixture-of-experts: the expert axis of degree $e$, which places different experts on different devices and multiplies into the device-count identity alongside the others. A frontier configuration is then a tuple $(d, p, t, e)$ whose product must equal the device count exactly, the same hard constraint Section 16.9 formalized.
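
The device-count identity is easy to enumerate. The sketch below lists every tuple $(d, p, t, e)$ whose product equals the device count; the function name `valid_configs` is a hypothetical helper, and the enumeration says nothing yet about which tuples are good, only which are legal.

```python
from itertools import product

def valid_configs(n_devices):
    """Return every (d, p, t, e) whose product equals n_devices exactly."""
    divisors = [x for x in range(1, n_devices + 1) if n_devices % x == 0]
    return [
        cfg for cfg in product(divisors, repeat=4)
        if cfg[0] * cfg[1] * cfg[2] * cfg[3] == n_devices
    ]

configs_64 = valid_configs(64)
print(len(configs_64), "ways to factor 64 devices into (d, p, t, e)")
print("(1, 1, 8, 8) legal on 64 devices:", (1, 1, 8, 8) in configs_64)
print("(2, 4, 8, 8) legal on 64 devices:", (2, 4, 8, 8) in configs_64)
print("(2, 4, 8, 8) legal on 512 devices:", (2, 4, 8, 8) in valid_configs(512))
```

Adding the expert axis enlarges the search space without loosening the constraint: a tuple that multiplies to anything other than the device count is simply not a configuration. Choosing among the legal tuples is the placement question taken up next.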

What distinguishes the expert axis is its collective. Data parallelism rides an all-reduce, sharded parallelism rides reduce-scatter and all-gather, pipeline parallelism passes point-to-point activations, and tensor parallelism rides an all-reduce on the fastest link. Expert parallelism rides the all-to-all, and as Section 17.5 showed, that collective is the most sensitive of all to placement, because its volume depends on data-dependent routing rather than on a fixed model dimension. This is why expert parallelism slots into the stack as a placement problem first: the expert axis wants to live on a fast, uniform slice of the fabric, and a configuration search like the one in Section 16.9 must account for its all-to-all explicitly. The throughline of the whole book holds here precisely as promised in Section 1.1: every parallel method is defined by the collective it relies on, and expert parallelism's collective is the all-to-all.

Thesis Thread: One More Axis, One More Collective

The spine of this book is that scale-out distributes an essential activity across machines, and each form of distribution is identified by the collective that recombines the result. Data parallelism (Chapter 15) recombines with all-reduce; sharded parallelism (Chapter 16) with reduce-scatter and all-gather; expert parallelism, this chapter, with all-to-all. Sparse scaling adds a new way to grow a model, the expert axis, but it does not escape the framework: it is one more axis in the device grid, with one more collective on the interconnect, subject to the same device-count identity and the same placement discipline as every axis before it. The frontier did not abandon the parallel stack; it extended it.

Library Shortcut: The Expert Axis as One Mesh Dimension

The hand-wiring of an expert-parallel group, scattering experts across devices and arranging their all-to-all, is exposed by modern frameworks as one more named dimension of a device mesh, exactly like the tensor and pipeline dimensions of Section 16.9. DeepSpeed-MoE and Megatron-LM let you declare an expert-parallel degree alongside the others, and the framework then places the experts, builds the all-to-all process group, and can overlap the dispatch with compute. The snippet below shows the same idea with PyTorch's DeviceMesh, which builds the process groups and leaves the MoE layer itself to the framework:

# the (data, pipeline, tensor, expert) grid as one declaration
# launch with 2 * 4 * 8 * 8 = 512 ranks so the mesh shape matches the world size
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh(
    "cuda", (2, 4, 8, 8),                          # dp=2, pp=4, tp=8, ep=8
    mesh_dim_names=("dp", "pp", "tp", "ep"),       # ep is the expert axis
)
expert_group = mesh["ep"].get_group()             # process group for the all-to-all
# the MoE layer's dispatch/combine all-to-all runs over expert_group;
# the mesh only builds the groups, while an MoE framework such as
# DeepSpeed-MoE or Megatron supplies placement, the collective calls, and overlap.
Code 17.9.2: The expert axis declared as one mesh dimension. The dozens of lines of expert placement and all-to-all group construction from Section 17.4 and Section 17.5 collapse to naming ep in the mesh; the framework wires the rest, just as it does for the dense axes.

Research Frontier: Sparse Scaling From 2024 to 2026

Mixture-of-experts is the live frontier of large-model design, and the 2024-to-2026 literature pushes on exactly the trade-off this section weighs. Fine-grained MoE with many small experts plus a shared always-on expert, the design behind DeepSeek-V3 (DeepSeek-AI, 2024) and DeepSeekMoE, raises capacity-per-FLOP further while easing balancing; DeepSeek-V3's auxiliary-loss-free balancing replaces the load-balancing loss of Section 17.6 with a bias-adjustment scheme that sidesteps the quality cost of the balancing penalty. On the systems side, expert-parallel communication libraries such as DeepEP target the all-to-all directly, overlapping dispatch and combine with compute so that less of the interconnect tax of Output 17.9.1 is exposed on the critical path. A parallel thread attacks the serving paradox of Section 17.8 with expert offloading and caching, keeping cold experts in cheaper memory so a replica's fast-memory footprint moves closer to active than to total capacity. The common thread is that every cost named in Section 3 is a research target, and the verdict of this chapter is being actively renegotiated as those costs fall.
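
The bias-adjustment idea can be illustrated with a toy simulation. This is a sketch of the general scheme as described for DeepSeek-V3, not that system's implementation: a per-expert bias is added to the routing score for top-$k$ selection only, and after each step it is nudged down for overloaded experts and up for underloaded ones. The random scores, the built-in `preference` for expert 0, and the step size `gamma` are all hypothetical choices made for the demonstration.

```python
import random

def top_k_biased(scores, bias, k):
    """Select k experts by score + bias; the bias steers selection only."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:k]

def simulate(steps=100, T=512, E=8, k=2, gamma=0.01, seed=0):
    rng = random.Random(seed)
    preference = [0.5] + [0.0] * (E - 1)   # expert 0 is systematically favored
    bias = [0.0] * E
    target = T * k / E                      # per-expert load under perfect balance
    imbalance = []                          # max load / target, one entry per step
    for _ in range(steps):
        load = [0] * E
        for _ in range(T):
            scores = [rng.random() + preference[i] for i in range(E)]
            for i in top_k_biased(scores, bias, k):
                load[i] += 1
        imbalance.append(max(load) / target)
        for i in range(E):                  # sign-based update after each step
            if load[i] > target:
                bias[i] -= gamma
            elif load[i] < target:
                bias[i] += gamma
    return imbalance, bias

imbalance, bias = simulate()
early = imbalance[0]
late = sum(imbalance[-20:]) / 20
print(f"max-load / target: first step {early:.2f}, mean of last 20 steps {late:.2f}")
print(f"bias of favored expert {bias[0]:+.2f}, mean bias of others {sum(bias[1:]) / 7:+.2f}")
```

No term is added to the training loss here: balance is restored purely by shifting which experts are selected, which is the sense in which the scheme avoids the quality cost of a balancing penalty.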

6. Chapter Summary (Beginner)

Chapter 17 told one story in seven mechanisms and one verdict. Sparse scaling decouples a model's parameter count from the FLOPs of its forward pass (Section 17.1), and the mixture-of-experts layer realizes that decoupling by replacing one feed-forward block with many experts and routing each token to a few (Sections 17.2 and 17.3). Routing creates a placement problem: experts live on different machines, so dispatching tokens to them and combining the results is an all-to-all collective (Sections 17.4 and 17.5), the most placement-sensitive collective in the parallel toolbox. Keeping the experts useful demands load balancing and capacity control, with their attendant losses, factors, token dropping, and stability concerns (Sections 17.6 and 17.7). Serving inverts the training economy into a paradox: cheap in active compute, expensive in resident memory (Section 17.8). And the balanced verdict, this section, is that MoE buys large capacity per training FLOP at the price of communication, training difficulty, serving memory, and complexity, a bargain that pays off when memory and interconnect are abundant and capacity binds, and that does not pay off when they are scarce.

Key Takeaway: Chapter 17 in Five Ideas

Sparse scaling decouples params from FLOPs. An MoE holds $E/k$ times the parameters of its dense counterpart at the same per-token compute, which is the entire reason it exists.
Routing and gating select the experts. A gate sends each token to its top-$k$ experts; the routing decision is data-dependent and drives every downstream cost.
Expert parallelism rides the all-to-all. Sharding experts across machines turns dispatch and combine into an all-to-all, the collective that makes the interconnect, not the accelerator, decisive.
Load balance and capacity keep it honest. Balancing losses and capacity factors stop routing from collapsing onto a few experts; without them the extra parameters are wasted.
The serving paradox is memory, not compute. A request touches $k$ experts but the replica must hold all $E$, so MoE serving is memory-bound even though it is FLOP-light.

Exercise 17.9.1: Read the Trade-Off Off the Numbers (Conceptual)

Using Output 17.9.1 and Table 17.9.1, answer three questions and justify each from a single row. (a) A team has a fast interconnect and a large pool of GPU memory but a fixed, modest training FLOP budget; should they prefer dense or MoE, and which number in Output 17.9.1 decides it? (b) A second team serves single-request, low-latency traffic on one small inference box; which row of Table 17.9.1 rules MoE out for them? (c) The MoE in Output 17.9.1 holds 32x the parameters. State the general formula for that ratio in terms of $E$ and $k$, and explain why raising $k$ shrinks the capacity advantage while raising the per-token FLOPs.

Exercise 17.9.2: Sweep the Design and Plot the Crossover (Coding)

Extend Code 17.9.1 into a sweep over the number of experts $E \in \{8, 16, 32, 64, 128\}$ at fixed $k = 2$, printing for each the capacity ratio, the communication volume per step, and the serving memory. Note which of these columns change with $E$ and which do not: in the model of Code 17.9.1 the all-to-all volume depends on $k$, $T$, and $d$, not on $E$. Then add a column for an estimated all-to-all time per step assuming an interconnect bandwidth $B$ you choose (for example 100 GB/s), and compare it with the per-step compute time of the active experts ($T$ tokens at a plausible accelerator throughput). Finally, vary $B$ and $k$ to find where the all-to-all time exceeds the compute time. This is the point where the interconnect, not the accelerator, becomes the bottleneck, the claim of Section 3.

Exercise 17.9.3: Place the Expert Axis on the Grid (Analysis)

You have 64 devices arranged as eight nodes of eight, with a fast intra-node link and a slow cross-node link, the topology of Section 16.9. You must place a 4D configuration $(d, p, t, e)$ with $d \cdot p \cdot t \cdot e = 64$. Argue, from the collective each axis uses, where the expert axis $e$ should live relative to the node boundary, and why putting the all-to-all on the slow cross-node link would be the worst choice. Compare your reasoning to where Section 16.9 places the tensor axis, and explain why the two axes compete for the same fast slice of the fabric.

Project Ideas

1. Build an expert-parallel MoE layer and measure all-to-all cost against load balance. Implement a top-$k$ MoE feed-forward layer with experts sharded across processes using torch.distributed, wiring the dispatch and combine all-to-alls by hand as in Section 17.5. Instrument the all-to-all time and the per-expert token counts, then sweep the routing entropy from balanced to collapsed (by perturbing the gate) and plot how communication time and the worst-expert overflow move together. The deliverable is a curve showing that the all-to-all cost is governed by load balance, not just by token volume.

2. Reproduce the dense-vs-MoE crossover with real timings. Take a small dense transformer and an MoE variant matched on active parameters (equal per-token FLOPs), train both for a fixed FLOP budget on a modest dataset, and measure wall-clock time, achieved quality, and peak memory. Vary the simulated interconnect by inserting an artificial delay into the all-to-all that scales with the bytes moved, and find the effective bandwidth below which the dense model wins on time-to-quality. This turns Table 17.9.1 from a checklist into a measured boundary on your own hardware.

3. Attack the serving paradox with expert offloading. Serve an open MoE checkpoint with all experts resident, then implement an offloading cache that keeps only the hottest experts in fast memory and pages the rest from host memory on demand. Measure how serving memory, tail latency, and throughput trade off as the cache size shrinks, and connect the result to the 2024-to-2026 offloading work cited in the research-frontier callout above and in Section 17.8.