Part IV: Parallel Deep Learning and Large Models
Chapter 17: Expert Parallelism and Sparse Distributed Models

Chapter 16 spent its energy fitting a dense model across devices: every parameter fired on every token, and the whole problem was carving that uniform computation into pieces small enough to hold. This chapter changes the model so that most of it stays asleep. A mixture-of-experts layer holds many expert sub-networks but routes each token to only a few of them, so the parameter count can grow into the trillions while the compute spent on any single token barely moves. That sparsity buys enormous capacity, and it buys a new distributed problem: the experts live on different machines, and every token must be shipped to the machine holding the expert it was routed to, computed there, and shipped back. This chapter is about that routing, the all-to-all that carries it, and the load-balancing, capacity, and stability machinery that keeps a sparse model trainable across a cluster.

Conceptual illustration for Chapter 17: Expert Parallelism and Sparse Distributed Models

"They built a hundred and twenty-seven of us and promised every token a fair shake. The gate has sent me nothing for three thousand steps. I have so much capacity, and no one to spend it on."

An Expert Nobody Routed To
Big Picture

A mixture-of-experts model replaces a dense layer with many expert sub-networks and a gating function that routes each token to only a small subset of them, so the total parameter count grows with the number of experts while the compute per token stays fixed at the size of the few experts it visits. The pressure that drives this chapter is the opposite of the one that drove the last. Chapter 16 fought a memory wall by cutting a fixed dense computation into shards; here the computation is no longer fixed, and the design lever is sparsity. When you train with top-$k$ routing out of $E$ experts, and size the experts so that the $k$ a token visits together cost what the dense layer did, the model has roughly $E/k$ times the parameters of the dense layer it replaces but spends the same FLOPs per token, so capacity and cost decouple. That decoupling is the whole appeal, and it forces a new parallelism axis. Experts are too numerous to replicate, so expert parallelism places different experts on different devices, and because the gate assigns tokens to experts without regard to which device a token currently sits on, every step must physically move tokens to the device that holds their expert and move the results back. That movement is an all-to-all, the collective from Chapter 4 that this chapter promotes to a first-class training operation, and expert parallelism itself is the fourth dimension of the 4D parallelism Chapter 16 introduced. Routing that is free to send tokens anywhere will send too many to the popular experts and starve the rest, so the chapter develops the load-balancing losses, capacity factors, and token-dropping rules that keep the assignment roughly even and the all-to-all buffers a fixed size.
It grounds the architecture in the systems that ship it, GShard, Switch Transformer, DeepSpeed-MoE, Tutel, and the open Mixtral and DeepSeek-V3 models, and it closes by weighing a sparse distributed model against the dense one of the previous chapter: more parameters per dollar of compute, but a harder communication pattern, a trickier training stability story, and a serving problem where most of the weights sit idle for most tokens yet must still be resident somewhere on the fleet.
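The decoupling of parameters from per-token compute can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class name `MoEFeedForward`, the layer sizes, and the choice to ignore biases and the gate's own weights are assumptions made for the example, not figures from any model discussed in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEFeedForward:
    """Parameter and FLOP accounting for one MoE feed-forward layer.

    Assumption: each expert is a two-matrix feed-forward block; biases and
    the gate's parameters are ignored because they are comparatively small.
    """
    d_model: int      # hidden width of the transformer
    d_ff: int         # inner width of each expert
    num_experts: int  # E
    top_k: int        # k

    @property
    def params_per_expert(self) -> int:
        # Two weight matrices: d_model x d_ff and d_ff x d_model.
        return 2 * self.d_model * self.d_ff

    @property
    def total_params(self) -> int:
        # Every expert must be stored somewhere on the cluster.
        return self.num_experts * self.params_per_expert

    @property
    def active_params_per_token(self) -> int:
        # A token only touches the k experts the gate selected for it.
        return self.top_k * self.params_per_expert

    @property
    def flops_per_token(self) -> int:
        # One multiply-add per active weight, counted as 2 FLOPs.
        return 2 * self.active_params_per_token


# Hypothetical sizes chosen only to make the ratio easy to read.
layer = MoEFeedForward(d_model=1024, d_ff=4096, num_experts=64, top_k=2)
ratio = layer.total_params / layer.active_params_per_token

print(f"total params:  {layer.total_params:,}")
print(f"active params: {layer.active_params_per_token:,}")
print(f"total / active = E / k = {ratio:.0f}")
```

Doubling `num_experts` doubles `total_params` and leaves `flops_per_token` untouched, which is the sense in which capacity and cost decouple.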

Chapter Overview

This is the third chapter of Part IV, and it picks up exactly where the dense partitioning of Chapter 16 left off. There the model was dense and the engineering was about cutting a uniform computation across devices; here the model is sparse, and the engineering is about deciding, per token, which small piece of the model to run and then moving the token to wherever that piece lives. The binding constraint shifts again: not device memory and not gradient bandwidth, but the irregular, data-dependent communication of routing. The collective to watch is the all-to-all. Chapter 4 introduced it as the most general of the collectives, every rank sending a distinct message to every other rank, and here it becomes the beating heart of every MoE forward and backward pass.
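To make the dispatch-and-combine pattern concrete, here is a minimal single-process simulation of the two all-to-alls in an MoE layer. It is a sketch, not a distributed implementation: the ranks are plain Python lists, the helper names (`all_to_all`, `dispatch_buffers`), the token labels, and the two-experts-per-rank layout are all hypothetical, and "expert compute" is replaced by tagging each token with the expert that saw it.

```python
def all_to_all(send):
    """Simulate an all-to-all over in-memory lists.

    send[src][dst] is the message rank `src` sends to rank `dst`.
    The result satisfies recv[dst][src] == send[src][dst].
    """
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


# Hypothetical layout: 2 ranks, 4 experts, experts 0-1 on rank 0, 2-3 on rank 1.
EXPERTS_PER_RANK = 2

# tokens[rank] lists (token_label, expert_chosen_by_gate) for that rank.
tokens = [
    [("a0", 3), ("a1", 0), ("a2", 2)],  # tokens that start on rank 0
    [("b0", 1), ("b1", 1), ("b2", 3)],  # tokens that start on rank 1
]
world = len(tokens)


def dispatch_buffers(local_tokens):
    """Bucket a rank's tokens by the rank that owns each token's expert."""
    buckets = [[] for _ in range(world)]
    for label, expert in local_tokens:
        buckets[expert // EXPERTS_PER_RANK].append((label, expert))
    return buckets


# First all-to-all (dispatch): every token travels to its expert's rank.
send = [dispatch_buffers(local) for local in tokens]
recv = all_to_all(send)

# Stand-in for expert computation: tag each token with the expert it visited.
processed = [
    [[f"{label}@e{expert}" for label, expert in msg] for msg in per_rank]
    for per_rank in recv
]

# Second all-to-all (combine): results travel back to the originating rank.
returned = all_to_all(processed)

for rank, msgs in enumerate(returned):
    print(rank, [item for msg in msgs for item in msg])
```

Each rank ends up with results for exactly the tokens it started with, but grouped by the rank that processed them rather than in their original order. A real implementation keeps the permutation from the dispatch step so it can restore token order after the combine.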

The nine sections fall into four groups. The first motivates the architecture and builds it: Section 17.1 contrasts dense and sparse scaling and shows why decoupling parameters from compute is worth the trouble, Section 17.2 constructs the mixture-of-experts layer, and Section 17.3 develops the routing and gating that decide where each token goes. The second group makes it distributed: Section 17.4 introduces expert parallelism that shards experts across nodes, and Section 17.5 develops the all-to-all that moves tokens to their assigned expert and the results back. The third group keeps it trainable: Section 17.6 builds the load-balancing machinery that prevents a few experts from hogging every token, and Section 17.7 develops the capacity factors, token-dropping rules, and stability tricks that bound the communication buffers and tame the optimization. The fourth group steps back to systems and judgment: Section 17.8 takes up serving a distributed MoE model where most weights are idle per token, and Section 17.9 weighs the whole sparse approach against the dense distributed training of Chapter 16.

Read in order, the nine sections take you from "a dense layer wastes capacity by running every parameter on every token" to "a cluster trains and serves a model whose parameter count dwarfs its per-token compute, with tokens routed across machines by a gate and carried there by an all-to-all that the load-balancing machinery keeps from collapsing." The argument is cumulative: sparsity creates the opportunity, routing realizes it, expert parallelism distributes it, the all-to-all pays for it, and load balancing and capacity control are what keep the bargain from falling apart at scale.
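Capacity control is the piece of that bargain that keeps the all-to-all buffers a fixed size. One common formulation, sketched below, gives every expert the same number of slots and drops any assignment that arrives after the slots are full. The function names, the first-come-first-served drop rule, and the particular capacity formula (capacity factor times the even share of assignments, rounded up) are assumptions for illustration; production systems differ in how they round, prioritize, and handle dropped tokens.

```python
import math


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Slots per expert: capacity_factor times the perfectly even share."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def route_with_capacity(assignments, num_experts, capacity):
    """Fill expert slots in token order and drop whatever overflows.

    assignments[t] lists the experts the gate chose for token t.
    Returns (kept, dropped) where kept[e] holds the token indices expert e
    will process and dropped holds the (token, expert) pairs that overflowed.
    """
    kept = [[] for _ in range(num_experts)]
    dropped = []
    for token, experts in enumerate(assignments):
        for expert in experts:
            if len(kept[expert]) < capacity:
                kept[expert].append(token)
            else:
                dropped.append((token, expert))
    return kept, dropped


# Hypothetical top-1 routing of 8 tokens over 4 experts, skewed toward expert 0.
assignments = [[0], [0], [0], [1], [0], [2], [0], [3]]
capacity = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.25)
kept, dropped = route_with_capacity(assignments, num_experts=4, capacity=capacity)

print("capacity per expert:", capacity)
print("kept:", kept)
print("dropped:", dropped)
```

In this toy batch expert 0 is chosen five times but has only three slots, so two assignments are dropped. Because every expert's buffer has exactly `capacity` slots, the message sizes in the all-to-all are known before the gate runs.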

Prerequisites

This chapter builds directly on two earlier ones. From Chapter 16: Model, Pipeline, and Sharded Parallelism you carry the four parallelism axes, and especially the idea of mapping each axis onto the interconnect tier its collectives demand. Expert parallelism is the fourth axis that the 3D and 4D strategies of that chapter reserved a slot for, and a real frontier MoE model composes it with the tensor, pipeline, and sharded data parallelism you learned there. From Chapter 4: Communication Primitives for Distributed Training you carry the all-to-all collective above all. The entire routing mechanism of an MoE layer is one all-to-all that scatters tokens to their assigned experts and a second that gathers the results, so you cannot reason about MoE communication cost without first understanding what an all-to-all moves and how it scales. The chapter assumes comfortable Python and PyTorch, a working picture of a transformer block and where its feed-forward layer sits, and the basic distributed-training vocabulary of ranks, process groups, and collectives from Part I. No prior experience with sparse models or gating networks is required; Section 17.1 motivates the architecture from the dense baseline before any expert is introduced.

Learning Objectives

Remember the Chapter as One Sentence

If you keep one thing from this chapter, keep this: a mixture-of-experts model grows its parameter count by adding experts but holds its per-token compute fixed by routing each token to only a few of them, and distributing it means sharding those experts across machines and using an all-to-all to ship every token to the device that holds its assigned expert, with load balancing and capacity control keeping that routing from collapsing onto a handful of overworked experts. Read forward, the sections build the craft in layers: first why sparse scaling beats dense scaling at fixed compute, then the MoE layer and its gating, then expert parallelism and the all-to-all that powers it, then the load-balancing losses and capacity factors that keep it stable, and finally serving and the dense-versus-sparse trade-off. Read as a question, the chapter asks of any layer whose capacity you want to grow without paying for it on every token: how do you route each token to a small specialized piece of a much larger model, how do you spread those pieces across a cluster, and how do you move tokens to their pieces without the communication or the imbalance erasing the win? The roadmap below walks the nine sections that answer it.
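One way to see what "collapsing onto a handful of overworked experts" costs is an auxiliary load-balancing loss in the style of the Switch Transformer paper listed in the bibliography: for $E$ experts, multiply the fraction of tokens sent to each expert by the mean router probability for that expert, sum over experts, and scale by $E$ and a small coefficient. The sketch below uses plain Python lists and top-1 routing; the function name, the coefficient value `alpha=0.01`, and the toy probabilities are assumptions for illustration, and real training code would compute this on tensors so gradients reach the router.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """Switch-style auxiliary loss: alpha * E * sum_e f_e * P_e.

    router_probs[t][e] is the router's probability for expert e on token t.
    f_e is the fraction of tokens whose top-1 choice is expert e.
    P_e is the mean router probability assigned to expert e.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_e: how the hard top-1 assignments are actually distributed.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=probs.__getitem__)] += 1
    fractions = [count / num_tokens for count in counts]

    # P_e: how the router's probability mass is distributed on average.
    mean_probs = [
        sum(probs[e] for probs in router_probs) / num_tokens
        for e in range(num_experts)
    ]

    return alpha * num_experts * sum(f * p for f, p in zip(fractions, mean_probs))


# Balanced toy batch: token t prefers expert t.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed toy batch: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

print("balanced loss: ", load_balance_loss(balanced))
print("collapsed loss:", load_balance_loss(collapsed))
```

The balanced batch sits at the loss's minimum of `alpha`, while the collapsed batch scores higher, so adding this term to the training objective nudges the router away from sending everything to one expert.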

Chapter Roadmap

Read the nine sections in order and you will hold a working blueprint for training and serving a model whose effective size is decoupled from the compute it spends on any single token: Section 17.1 establishes why sparse scaling pays, Sections 17.2 and 17.3 build the MoE layer and its gate, Sections 17.4 and 17.5 distribute the experts and route tokens to them with an all-to-all, and Sections 17.6 through 17.9 balance the load, bound the capacity, serve the result, and weigh it all against the dense model. The thread to watch is the all-to-all of Chapter 4 stepping into the center of the training loop, and the data parallelism of Chapter 16 giving way to a routing problem where the work a device does depends on where the gate decided to send each token.

What's Next?

This chapter and the two before it assumed a cluster that stays the same size for the whole run: a fixed set of devices, each holding its assigned shard or expert, all present from the first step to the last. Real training at this scale does not get that luxury. Nodes fail, preemptible instances vanish with a moment's warning, and a job that holds thousands of accelerators for weeks will lose some of them. Chapter 18: Elastic and Fault-Tolerant Distributed Training takes up training that survives a changing cluster: checkpointing that lets a job resume without losing days of work, elastic schedulers that grow and shrink the worker set, and the recovery protocols that rebuild the parallelism layout you spent these three chapters constructing after a device drops out. Read it next, and watch the static device map of expert and sharded parallelism become something that has to heal itself while the loss keeps going down.

Bibliography & Further Reading

Foundations of Sparse Mixture-of-Experts

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv:1701.06538, 2017. arxiv.org/abs/1701.06538

The paper that introduced the sparsely-gated MoE layer, top-$k$ routing, and the load-balancing loss, the architectural foundation for Sections 17.2 and 17.3.

📄 Paper

Clark, A., de las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., et al. "Unified Scaling Laws for Routed Language Models." arXiv:2202.01169, 2022. arxiv.org/abs/2202.01169

The study that derives scaling laws for routed (MoE) models and quantifies how parameter count and per-token compute decouple, the analytical backbone of Section 17.1.

📄 Paper

Routing, Load Balancing, and Stability

Fedus, W., Zoph, B., Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961, 2021. arxiv.org/abs/2101.03961

The work that simplified routing to top-1, introduced the capacity factor and token dropping, and stabilized large MoE training, the direct basis for Sections 17.3 and 17.7.

📄 Paper

Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A., Chen, Z., Le, Q., Laudon, J. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368, 2022. arxiv.org/abs/2202.09368

The paper that inverts the routing direction so experts choose tokens, giving perfect load balance by construction, the expert-choice alternative compared in Sections 17.3 and 17.6.

📄 Paper

Distributed MoE Systems

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668, 2020. arxiv.org/abs/2006.16668

The system that scaled MoE to 600 billion parameters with automatic sharding and the all-to-all dispatch-and-combine pattern, the basis for the expert parallelism of Sections 17.4 and 17.5.

📄 Paper

Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., He, Y. "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale." arXiv:2201.05596, 2022. arxiv.org/abs/2201.05596

The framework that pairs expert parallelism with optimized MoE inference and distillation, a primary reference for the distributed training and serving of Sections 17.4 and 17.8.

📄 Paper

Hwang, C., Cui, W., Xiong, Y., Yang, Z., Liu, Z., Hu, H., et al. "Tutel: Adaptive Mixture-of-Experts at Scale." arXiv:2206.03382, 2022. arxiv.org/abs/2206.03382

The system that makes the all-to-all and expert placement adaptive at runtime to handle dynamic load, deepening the communication discussion of Section 17.5.

📄 Paper

DeepSeek-AI. "DeepEP: An Efficient Expert-Parallel Communication Library." GitHub, 2025. github.com/deepseek-ai/DeepEP

An open communication library specialized for the dispatch-and-combine all-to-all of expert parallelism, a current production reference for the routing kernels of Section 17.5.

🛠️ Tool

Open Frontier MoE Models

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., et al. "Mixtral of Experts." arXiv:2401.04088, 2024. arxiv.org/abs/2401.04088

The open-weight sparse MoE language model with eight experts and top-2 routing, a concrete reference architecture for the layer, routing, and serving of Sections 17.2 and 17.8.

📄 Paper

DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024. arxiv.org/abs/2412.19437

The report behind a frontier fine-grained MoE model with shared experts and auxiliary-loss-free load balancing, the current state of the art for Sections 17.6 and 17.9.

📄 Paper