Part V: Distributed Inference and Serving
Chapter 24: Distributed LLM Serving

Serving Distributed MoE Models

"They batched a thousand requests together just so two of them would finally have enough company to make me worth waking up."

An Expert Idling Below Tile Size

Big Picture

A sparse Mixture-of-Experts model is the hardest kind of model to serve well at decode time. Every expert must stay resident across the fleet even though each token touches only a few, and the all-to-all that routes tokens to their experts moves so little data per request that fixed latency, rather than bandwidth or compute, sets the pace. Chapter 17 built MoE as a training win, decoupling model capacity from per-token compute. Serving inherits all of that machinery and collides with the opposite cost structure. The FLOP savings that made training cheap do nothing for the memory wall that sizes the cluster, and the all-to-all that was a tolerable overhead in training becomes a per-token latency tax that small decode batches cannot hide. This section is the deployment-side companion to Section 17.8: it shows quantitatively why small-batch MoE decode is all-to-all-bound, and how large batches, hot-expert replication, all-to-all overlap, and expert offloading turn a memory-and-communication-bound sparse model into a high-throughput service.

The earlier chapters of this part sized a dense serving deployment from per-node economics. Chapter 22 established the per-GPU baseline (quantization, paged KV cache, the memory budget a single accelerator brings), and Chapter 23 multiplied that baseline across a fleet. A dense model serves predictably: its FLOPs per token are fixed, so more requests mean more GPUs in a clean linear way. A sparse MoE model breaks that clean accounting on two axes at once. The fleet is sized not by arithmetic but by total parameters, because all experts must be loadable; and the latency per token is set not by compute but by a collective, the all-to-all from Chapter 4, which returns here as the dominant cost at exactly the batch sizes decode runs at. This is the same all-to-all that the thesis thread of this book has tracked from a hand-written sum in Chapter 1 to expert routing in Chapter 17; at serving time it surfaces as the thing you optimize a deployment around.

1. The Serving Paradox, Restated for Decode (Beginner)

Section 17.8 stated the MoE serving paradox in its general form: cheap per token, heavy in total. With $E$ experts and top-$k$ routing, a token's feed-forward block uses only $k$ of the $E$ experts, so the active parameter count per token is a fraction $k/E$ of the layer's total. But routing is data-dependent and decided at inference time, so the system must be ready to send any token to any expert. Every expert is therefore loaded into device memory somewhere on the fleet, all the time. Resident weight memory scales with the full expert count while compute scales only with the active experts, and the ratio of what you provision to what a token consumes is

$$\frac{M_{\text{resident}}}{b \, F_{\text{token}} / 2} = \frac{E}{k},$$

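where $M_{\text{resident}}$ is the resident expert weight memory in bytes, $b$ is bytes per parameter, and $F_{\text{token}}$ is the per-token FLOPs through the experts. Each active parameter costs two FLOPs (a multiply and an add), so $b \, F_{\text{token}} / 2$ is the weight memory a single token actually computes against. A model with $E = 64$ and $k = 2$ keeps thirty-two times more expert parameters resident than any single token pays compute for, so the dense habit of sizing a fleet from FLOPs is off by that factor. Serving is sized by the memory wall first.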
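As a worked check of the ratio, the sketch below counts expert parameters only and assumes two FLOPs per active parameter. The per-expert size of 50M parameters is an illustrative number chosen here, not taken from any model; it cancels out of the ratio.

```python
def provisioning_ratio(num_experts: int, top_k: int,
                       params_per_expert: float, bytes_per_param: float) -> float:
    """Resident expert bytes divided by the bytes one token actually computes against."""
    resident_bytes = num_experts * params_per_expert * bytes_per_param
    flops_per_token = 2.0 * top_k * params_per_expert    # multiply + add per active weight
    active_bytes = bytes_per_param * flops_per_token / 2.0
    return resident_bytes / active_bytes

# Illustrative sizes only: 64 experts, top-2 routing, 50M parameters per expert, bf16.
ratio = provisioning_ratio(64, 2, 50e6, 2)
print(f"provisioned / consumed = {ratio:.0f}x")
```

Changing `params_per_expert` or `bytes_per_param` leaves the result unchanged, which is the point: the ratio depends only on $E$ and $k$.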

What decode adds to this picture is a second, sharper problem that training never faced. Training runs at enormous batch sizes: tens of thousands of tokens flow through each MoE layer per step, so every expert receives a healthy crowd of tokens, fills its GEMM tiles, and amortizes the all-to-all over a large message. Autoregressive decode is the opposite. Each in-flight request contributes exactly one token per step, so the per-layer batch equals the number of concurrent requests, often a few dozen. Spread $k$ times that handful of tokens across $E$ experts and most experts receive a fraction of a token's worth of work. The expert kernels run far below their efficient tile size, and the all-to-all ships tiny messages dominated by fixed per-hop latency. The sparsity that was pure profit at training scale becomes a structural inefficiency at decode scale. We make the cost of distribution explicit, as Chapter 3 taught, so the trade-off can be reasoned about with numbers rather than instinct.

Key Insight: Small-Batch MoE Decode Is All-to-All-Bound, the Mirror Image of Training

Training feeds each MoE layer tens of thousands of tokens, so every expert fills its GEMM tiles and the all-to-all moves a large, bandwidth-efficient message. Decode feeds each layer one token per concurrent request, often a few dozen total; after top-$k$ routing across $E$ experts, most experts see a fraction of a tile and the all-to-all moves a tiny, latency-bound message. The fixed per-layer collective latency, repeated across every MoE layer, then dominates time-per-output-token (TPOT). MoE decode efficiency is therefore a batching problem: the cure is to give each expert enough tokens that its kernel and its all-to-all message are both worth their fixed cost.

2. Experts Resident, Few Tokens in Flight (Beginner)

Figure 24.8.1 shows the decode-time arrangement and why it is awkward. The experts of each MoE layer are sharded across the serving GPUs in an expert-parallel layout, every GPU permanently holding its slice of experts, exactly the inference-time face of the expert parallelism from Chapter 17. When a small decode batch arrives, the router on each GPU picks $k$ experts per token, the first all-to-all ships each token to the GPU owning its chosen expert, the experts compute, and a second all-to-all returns the results. The experts are stationary; the tokens travel. At decode batch sizes the picture is stark: a handful of tokens scatters across a wide fleet of experts, so most experts compute on one or two tokens or sit idle this step, while the all-to-all links carry near-empty messages whose cost is almost entirely the latency of setting up the transfer.

[Figure 24.8.1 diagram. Decode step: 4 concurrent requests, 1 token each, top-2 routing across 4 expert-parallel GPUs. Home GPUs hold tokens A, B, C, and D; a latency-bound all-to-all carries tiny messages to the serving GPUs, where the experts stay resident. GPU 1 (E0, E1) receives 2 tokens, GPU 2 (E2, E3) receives 3 tokens, GPU 3 (E4, E5) receives 0 tokens and idles, GPU 4 (E6, E7) receives 3 tokens. The 8 token-expert assignments spread over 8 experts, so most experts run a sub-tile GEMM and one GPU sits idle; all experts stay resident regardless of how few fire.]
Figure 24.8.1: Why MoE decode is awkward. Four concurrent requests emit four tokens; top-2 routing produces eight token-expert assignments scattered across eight resident experts on four GPUs. Most experts receive one or two tokens (a sub-tile GEMM) and one GPU receives none, yet every expert stays loaded because the next token could route anywhere. The all-to-all that moves these few tokens carries near-empty messages whose cost is the fixed setup latency, not bandwidth, so it dominates the step time. Contrast this with training, where each expert would receive thousands of tokens.
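How many experts sit idle in a step like this can be estimated directly. The sketch below assumes uniform random routing (each token picks $k$ distinct experts with equal probability), which real routers do not satisfy exactly; under that assumption a given expert is idle with probability $(1 - k/E)^B$. The function names are illustrative.

```python
import random

def idle_fraction_analytic(batch: int, num_experts: int, top_k: int) -> float:
    """P(a given expert receives no token) under uniform top-k routing."""
    return (1.0 - top_k / num_experts) ** batch

def idle_fraction_simulated(batch: int, num_experts: int, top_k: int,
                            trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    idle = 0
    for _ in range(trials):
        hit = set()
        for _ in range(batch):
            hit.update(rng.sample(range(num_experts), top_k))  # k distinct experts per token
        idle += num_experts - len(hit)
    return idle / (trials * num_experts)

for B in (4, 16, 64):
    analytic = idle_fraction_analytic(B, 64, 2)
    simulated = idle_fraction_simulated(B, 64, 2)
    print(f"B={B:>3}  idle experts: analytic {analytic:.3f}, simulated {simulated:.3f}")
```

With 64 experts and top-2 routing, most experts are idle at a batch of 4 and a noticeable fraction still are at 64, which is the regime the figure illustrates at small scale.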

3. Modeling Decode Efficiency as Batch Size Grows (Intermediate)

The cleanest way to see the all-to-all bound and its cure is to put numbers to it. We model one MoE layer at decode. With batch size $B$ (the number of concurrent decode requests, one token each) and top-$k$ routing across $E$ experts, the mean tokens landing on an expert is

$$t_{\text{expert}} = \frac{B \, k}{E}.$$

The per-layer all-to-all costs a fixed latency $\alpha$ plus a bandwidth term proportional to the bytes it moves, $t_{\text{a2a}} = \alpha + t_{\text{expert}} \cdot H \cdot b / \beta$, where $H$ is the hidden size, $b$ the bytes per element, and $\beta$ the link bandwidth. The expert GEMM pays a fixed weight-load floor plus per-token arithmetic, and that arithmetic only dominates the floor once a tile of $\tau$ tokens is close to full. Time-per-output-token is the per-layer maximum of these two terms (the all-to-all overlaps the next micro-step's GEMM) summed over the $L$ MoE layers. The code below evaluates this model across decode batch sizes and reports, for each, the mean tokens per expert, the GEMM tile efficiency, the all-to-all's share of the combined all-to-all and GEMM cost per layer, and the resulting TPOT.
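Written out in one place, the model described above is the following. The symbols $\gamma_0$ (weight-load floor) and $\gamma_\tau$ (time for a full tile of $\tau$ tokens) are shorthand introduced here for convenience, not notation from earlier chapters:

$$t_{\text{gemm}} = \gamma_0 + \frac{\gamma_\tau}{\tau} \, t_{\text{expert}}, \qquad \text{TPOT} = L \cdot \max\left(t_{\text{a2a}}, \; t_{\text{gemm}}\right).$$

Note the simplification this makes: every token is charged the peak per-token rate $\gamma_\tau / \tau$, so the penalty for a part-filled tile appears only through the fixed floor $\gamma_0$.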

```python
E, k, L = 64, 2, 58          # experts, top-k, MoE layers (DeepSeek-V3-scale)
TILE, HIDDEN, BYTES = 128, 7168, 2          # GEMM tile (tokens), hidden size, bytes per bf16 value
A2A_ALPHA_US, A2A_GBPS = 8.0, 400.0         # fixed all-to-all latency (us), link bandwidth (GB/s)
GEMM_PEAK_US, GEMM_FIXED_US = 8.0, 1.0      # full-tile GEMM time (us), weight-load floor (us)

def model(B):
    """Return (tokens/expert, GEMM efficiency, all-to-all share, TPOT in ms) for batch B."""
    tpe = B * k / E                                   # mean tokens per expert
    a2a_us = A2A_ALPHA_US + (tpe * HIDDEN * BYTES) / (A2A_GBPS * 1e3)   # latency + bytes/bw (1 GB/s = 1e3 bytes/us)
    per_tok_us = GEMM_PEAK_US / TILE                  # peak per-token GEMM cost
    gemm_us = GEMM_FIXED_US + per_tok_us * tpe        # weight load + arithmetic
    gemm_eff = (per_tok_us * tpe) / gemm_us if gemm_us else 0.0   # fraction of GEMM time doing arithmetic
    layer_us = max(a2a_us, gemm_us)                   # perfect overlap: the slower term sets layer time
    a2a_share = a2a_us / (a2a_us + gemm_us)           # all-to-all share of the two costs combined
    return tpe, gemm_eff, a2a_share, layer_us * L / 1000.0        # TPOT in ms

print(f"{'batch':>6} {'tok/expert':>11} {'gemm_eff':>9} {'a2a_share':>10} {'TPOT_ms':>9}  regime")
for B in [1, 4, 16, 64, 256, 1024, 4096]:
    tpe, eff, a2a, tpot = model(B)
    regime = "all-to-all-bound" if a2a >= 0.5 else "compute-bound"   # a2a >= 0.5 means a2a_us >= gemm_us
    print(f"{B:>6} {tpe:>11.2f} {eff:>9.2f} {a2a:>10.2f} {tpot:>9.2f}  {regime}")

print()
# Tile-fill target: the batch at which mean tokens/expert equals 0.9 * TILE.
# This is not the batch where gemm_eff reaches 0.90, because gemm_eff also charges GEMM_FIXED_US.
print(f"batch for >=90% expert-GEMM tile efficiency : {TILE * 0.9 * E / k:.0f} concurrent decodes")
print(f"E/k memory-vs-compute provisioning ratio     : {E // k}x")
```
Code 24.8.1: A pure-Python cost model of one MoE layer at decode, sweeping the concurrent-request batch size. It separates the latency-bound all-to-all from the tile-starved expert GEMM and reports which one sets the pace, with no dependency beyond the standard library.

```
 batch  tok/expert  gemm_eff  a2a_share   TPOT_ms  regime
     1        0.03      0.00       0.89      0.46  all-to-all-bound
     4        0.12      0.01       0.89      0.46  all-to-all-bound
    16        0.50      0.03       0.89      0.47  all-to-all-bound
    64        2.00      0.11       0.88      0.47  all-to-all-bound
   256        8.00      0.33       0.85      0.48  all-to-all-bound
  1024       32.00      0.67       0.75      0.53  all-to-all-bound
  4096      128.00      0.89       0.58      0.73  all-to-all-bound

batch for >=90% expert-GEMM tile efficiency : 3686 concurrent decodes
E/k memory-vs-compute provisioning ratio     : 32x
```
Output 24.8.1: At a realistic decode batch of 1 to 64 requests, each expert sees at most two tokens on average (0.03 to 2.0), its GEMM spends 0 to 11 percent of its time on arithmetic, and almost 90 percent of the combined per-layer cost is the all-to-all, nearly all of it fixed latency. Only as the batch climbs into the thousands does GEMM efficiency reach 0.89 and the all-to-all share fall toward 0.58. The second-to-last line is a tile-fill target: the mean expert receives 90 percent of a 128-token tile at roughly 3,700 concurrent decodes, whereas the gemm_eff column, which also charges the weight-load floor, reaches 0.90 only near $B = 4608$.

The numbers tell the whole story. At the batch sizes that interactive decode actually runs at, the experts are starved and the all-to-all latency, repeated across all $L = 58$ MoE layers, owns the step. Efficiency does not arrive until the batch is enormous, far larger than any single decode request stream produces, and even at $B = 4096$ the all-to-all still accounts for more than half of the combined per-layer cost. Two levers follow directly from this model: make the effective batch per expert larger (so tiles fill and the latency amortizes), or make the all-to-all cheaper and more hidden (so its fixed cost stops dominating). The rest of the section is those two levers in practice. Notice also the final two lines of the output: the $32\times$ memory provisioning ratio from Section 1 and the batch needed to fill expert tiles are the two walls, memory and batching, that define an MoE serving deployment.
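The sweep never prints a compute-bound row, and the model's two linear terms say why. The sketch below solves for the batch at which the expert GEMM overtakes the all-to-all, and for the batch at which the all-to-all's bandwidth term first equals its fixed latency. It reuses the same illustrative constants as the cost model; the function names are introduced here.

```python
import math

# Same illustrative constants as the decode cost model in this section.
E, K = 64, 2
TILE, HIDDEN, BYTES = 128, 7168, 2
A2A_ALPHA_US, A2A_GBPS = 8.0, 400.0
GEMM_PEAK_US, GEMM_FIXED_US = 8.0, 1.0

a2a_per_tok_us = HIDDEN * BYTES / (A2A_GBPS * 1e3)   # bandwidth cost of one token (us)
gemm_per_tok_us = GEMM_PEAK_US / TILE                # arithmetic cost of one token (us)

def compute_bound_batch():
    """Smallest B with gemm_us >= a2a_us, or None if the GEMM never catches up."""
    slope_gap = gemm_per_tok_us - a2a_per_tok_us
    if slope_gap <= 0:
        return None
    tokens_per_expert = (A2A_ALPHA_US - GEMM_FIXED_US) / slope_gap
    return math.ceil(tokens_per_expert * E / K)

def bandwidth_equals_latency_batch():
    """Smallest B at which the all-to-all bandwidth term reaches the fixed latency alpha."""
    return math.ceil(A2A_ALPHA_US / a2a_per_tok_us * E / K)

print("compute-bound from B =", compute_bound_batch())
print("a2a bandwidth term matches alpha at B =", bandwidth_equals_latency_batch())
```

Under these constants both thresholds land in the high thousands of concurrent decodes, beyond the largest batch in the sweep, so below them the fixed $\alpha$ is the larger part of the all-to-all and the all-to-all is the larger part of the layer.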

Fun Note: The Expert That Got Routed To Once a Minute

In a lightly loaded MoE deployment with a few concurrent users, an individual expert can go entire decode steps without a single token, then receive one token, run a full kernel launch and weight read to multiply a single vector, and go quiet again. It is the GPU equivalent of opening a warehouse, sending one forklift to fetch one box, and closing up. The fix is never "make the forklift faster"; it is "wait until there are enough boxes to fill a truck", which is precisely what batching does.

4. Lever One: Batching and Hot-Expert Replication (Intermediate)

The first lever is to grow the tokens each expert sees. Continuous batching, the cross-node request scheduling that Chapter 23 built for dense serving, helps here too: by packing many in-flight decode requests into one step, it raises $B$ and therefore $t_{\text{expert}}$ for free. But there is a ceiling. Output 24.8.1 shows you need thousands of concurrent decodes to fill tiles, which a single deployment may never reach, and pushing batch size up trades latency for throughput, which a strict TPOT budget forbids. Batching alone closes part of the gap, not all of it.

Hot-expert replication attacks the imbalance directly. Real routing is not uniform: a handful of experts attract a disproportionate share of tokens (the load-imbalance problem of Chapter 17), and at small batch a single overloaded expert stretches the whole layer while its peers idle. Replicating the busiest experts onto additional GPUs lets the router spread a hot expert's tokens across several replicas, so no one expert becomes the layer's straggler. The memory cost is real, you pay for the duplicated experts, but it is far smaller than replicating the whole model, and it directly cuts the tail latency that load imbalance creates at decode. Placement matters too: co-locating experts that are frequently co-activated on the same GPU or NVLink island shortens the all-to-all paths, the topology-aware placement idea from Chapter 4 applied to experts instead of gradients.
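One simple way to decide which experts to replicate, given measured routing counts and a budget of extra replicas, is a greedy loop: repeatedly give the next replica to whichever expert currently has the highest load per replica. This is a hypothetical sketch for illustration, not the load-balancing algorithm of any particular engine, and it assumes a replicated expert's tokens split evenly across its copies.

```python
import heapq

def allocate_replicas(token_counts: dict[str, int], extra_replicas: int) -> dict[str, int]:
    """Greedily assign each extra replica to the expert with the highest per-replica load."""
    replicas = {name: 1 for name in token_counts}
    # Max-heap via negated load; the expert name breaks ties deterministically.
    heap = [(-count, name) for name, count in token_counts.items()]
    heapq.heapify(heap)
    for _ in range(extra_replicas):
        _, name = heapq.heappop(heap)
        replicas[name] += 1
        heapq.heappush(heap, (-token_counts[name] / replicas[name], name))
    return replicas

def straggler_load(token_counts: dict[str, int], replicas: dict[str, int]) -> float:
    """Tokens handled by the busiest single replica, which sets the layer time."""
    return max(token_counts[name] / replicas[name] for name in token_counts)

# Made-up routing counts for one layer over a measurement window.
counts = {"e0": 40, "e1": 12, "e2": 6, "e3": 6}
plan = allocate_replicas(counts, extra_replicas=2)
print(plan, round(straggler_load(counts, plan), 2))
```

In this toy case both extra replicas go to the single hot expert, and the straggler load falls from 40 tokens to about 13 at the memory cost of two extra expert copies rather than a second copy of the whole layer.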

Thesis Thread: The All-to-All Returns, Now as a Latency Tax

The all-to-all you first met as a hand-written sum in Chapter 1 and as expert routing in Chapter 17 arrives at serving time wearing a new face. In training it was a bandwidth cost over a huge batch, an overhead you amortized. In decode it is a latency cost over a tiny batch, the dominant term in TPOT that Output 24.8.1 puts at almost 90 percent of every layer at small batch sizes. The same collective, the same data movement, but the binding constraint flips from bandwidth to latency because the batch shrank by about three orders of magnitude, from tens of thousands of tokens to a few dozen. Whenever a primitive from this book reappears in a new setting, ask which of its costs now binds; for the inference all-to-all the answer is latency, and every MoE serving optimization is in some way a response to that fact.

5. Lever Two: Overlapping and Shrinking the All-to-All (Advanced)

The second lever accepts that the all-to-all will move data on every layer and works to make that movement nearly free. The first technique is overlap: while the experts on a GPU compute the current micro-batch, the network simultaneously ships the next micro-batch's tokens, so the all-to-all latency hides behind the attention and expert compute rather than adding to it. This is the same computation-communication overlap that Chapter 4 introduced for collectives, now scheduled tightly enough that the all-to-all nearly disappears behind the work it serves. Dedicated MoE communication libraries push this furthest, fusing the dispatch and combine all-to-alls with the expert GEMMs and using high-throughput GPU-initiated transfers so the network is busy whenever compute is.
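The effect of overlap can be sketched as a two-stage pipeline: the network ships a micro-batch, then the GPU computes on it, and with overlap the network is already shipping the next micro-batch while the GPU works. The model below is a deliberate simplification (one dispatch stage and one compute stage per micro-batch, no combine all-to-all, constant stage times), and the function name and timings are illustrative.

```python
def pipeline_time_us(a2a_us: float, compute_us: float,
                     micro_batches: int, overlap: bool) -> float:
    """Total time to push micro_batches through dispatch then compute."""
    if not overlap:
        return micro_batches * (a2a_us + compute_us)    # strictly serialized
    net_free = 0.0   # time at which the network finishes its current transfer
    gpu_free = 0.0   # time at which the GPU finishes its current compute
    for _ in range(micro_batches):
        net_free += a2a_us                              # ship this micro-batch
        gpu_free = max(gpu_free, net_free) + compute_us # start once data arrived and GPU is idle
    return gpu_free

for a2a_us, compute_us in [(8.0, 1.5), (8.0, 8.0), (2.0, 8.0)]:
    serial = pipeline_time_us(a2a_us, compute_us, 4, overlap=False)
    hidden = pipeline_time_us(a2a_us, compute_us, 4, overlap=True)
    print(f"a2a={a2a_us} compute={compute_us}: serial {serial} us, overlapped {hidden} us")
```

In this sketch overlap hides whichever stage is shorter. The all-to-all disappears almost entirely only when the compute scheduled alongside it (attention plus expert GEMMs) takes at least as long as the transfer, which is why overlap is paired with techniques that lower the all-to-all's own cost.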

The second technique is to shrink the bytes the all-to-all moves: dispatch tokens in a low-precision format (sending the hidden vectors across the network in FP8 rather than bf16 halves the message), and fuse the two all-to-alls so routing metadata is exchanged once. When even full residency is impossible, expert offloading and caching trade latency for memory: cold experts live in CPU memory or are streamed from a fast tier, and only the experts a batch actually routes to are pulled onto the GPU, with a cache keeping recently hot experts resident. This makes a model whose full expert set exceeds GPU memory servable on a smaller fleet, at the cost of a fetch latency that prefetching and good placement try to hide. The frontier deployments combine all of these, which is why a model at the scale of DeepSeek-V3 is served on a large expert-parallel cluster rather than a handful of GPUs.
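The caching half of expert offloading can be sketched with a least-recently-used policy. The class below is a minimal illustration, assuming a capacity of at least one expert; the string stored per expert is a stand-in for a real host-to-device weight copy, and `ExpertCache` is a name invented here rather than an API from any serving engine.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of GPU-resident experts; a miss models a fetch from a slower tier."""

    def __init__(self, capacity: int):
        self.capacity = capacity            # number of experts that fit on the GPU
        self.resident = OrderedDict()       # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)       # evict the least recently used expert
            self.resident[expert_id] = f"weights[{expert_id}]"  # stand-in for a weight fetch
        return self.resident[expert_id]

cache = ExpertCache(capacity=2)
for expert_id in [0, 1, 0, 2, 0, 1]:        # experts requested by successive decode steps
    cache.get(expert_id)
print(cache.hits, cache.misses, list(cache.resident))
```

Each miss corresponds to the fetch latency the paragraph above describes, so the hit rate under real routing traces is the number that decides whether offloading is viable for a given TPOT budget.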

Research Frontier: Expert-Parallel MoE Serving at Frontier Scale (2024 to 2026)

The 2024-to-2026 wave of open frontier MoE models made large expert-parallel serving a first-class engineering target. DeepSeek-V3 and DeepSeek-R1 (671B total parameters, roughly 37B active per token, 256 routed experts) are served with wide expert parallelism that spreads experts across dozens of GPUs, precisely the $E/k$ memory provisioning the model in Output 24.8.1 quantifies. DeepEP (Zhao et al., 2025), the open-source communication library from the DeepSeek team, provides high-throughput and low-latency all-to-all kernels specialized for MoE dispatch and combine, including FP8 dispatch and a latency-optimized path tuned for the small-message decode regime this section is about. Prefill/decode disaggregation, paired with these kernels, lets the latency-sensitive decode phase run its own expert-parallel layout. On the engine side, vLLM and SGLang shipped expert-parallel MoE execution with hot-expert replication and expert-parallel load balancing, and SGLang has reported reproducing large-scale DeepSeek-V3 serving on multi-node GPU clusters. The common thread is treating the decode all-to-all as the quantity to engineer down, by overlap, low precision, and replication, rather than a fixed tax to pay.

Library Shortcut: vLLM and SGLang Serve Expert-Parallel MoE in a Flag

Everything in Sections 4 and 5, sharding experts across the fleet, the dispatch and combine all-to-alls, hot-expert replication, and continuous batching over the routed tokens, is what a modern inference engine does for you. Standing up an expert-parallel MoE deployment by hand means writing the router, the two all-to-all collectives, the expert GEMM scheduling, and the load balancer; vLLM and SGLang collapse that to a launch configuration:

```bash
# Flag names vary between engine releases; check the documentation for the installed version.

# vLLM: serve a large MoE with expert parallelism across 8 GPUs
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --data-parallel-size 8           # experts sharded across the 8-way EP group

# SGLang: the same model with expert-parallel MoE execution
# (expert-parallel load balancing is configured separately)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --enable-ep-moe           # expert-parallel MoE execution
```
Code 24.8.2: Hundreds of lines of router, dual all-to-all, expert-GEMM scheduling, and load balancing reduce to a short launch configuration. The engine handles process-group setup, the dispatch and combine collectives (calling DeepEP-style kernels where available), and continuous batching across the routed tokens; the hot-expert replication that Section 4 motivated is available where the engine release supports it.

Practical Example: The MoE Deployment That Was Fast but Idle

Who: An inference platform engineer rolling out a 256-expert MoE chat model on an eight-GPU node.

Situation: Single-request latency looked great in testing, but under real traffic the GPUs sat at low utilization while TPOT crept up as soon as a few users connected.

Problem: Each decode step carried only a dozen or so concurrent tokens, so after top-$k$ routing almost every expert ran a sub-tile GEMM and the per-layer all-to-all latency dominated, exactly the small-batch regime of Output 24.8.1.

Dilemma: Raise the batch by holding requests longer, which fills tiles and amortizes the all-to-all but inflates latency, or replicate experts and overlap the collective, which costs memory and engineering but keeps latency low.

Decision: They did both in proportion: continuous batching to lift the per-step token count, hot-expert replication for the few experts the traffic hammered, and FP8 dispatch with overlap to shrink and hide the all-to-all.

How: They switched the engine to expert-parallel mode with EP load balancing (Code 24.8.2), enabled the low-latency all-to-all path, and replicated the top eight experts by measured routing frequency.

Result: Per-step tokens per expert rose enough to lift GEMM tile efficiency several-fold, the all-to-all share of each layer fell once it overlapped compute, and aggregate throughput roughly tripled at the same TPOT budget.

Lesson: An MoE model that is fast for one request can be badly underused in production; the win is batching tokens onto experts and hiding the all-to-all, not chasing single-request speed.

6. Why Frontier MoE Needs a Large Serving Fleet (Intermediate)

Putting the two walls together explains the deployment shape of frontier MoE. The memory wall sets a floor on fleet size: a model with hundreds of billions of total parameters across hundreds of experts cannot fit on a few GPUs, so the experts must be sharded across many, which is wide expert parallelism by necessity, not choice. The batching wall sets a floor on traffic: to keep those many experts fed at decode, the fleet needs enough aggregate request volume that, after the all-to-all spreads tokens across the expert shards, each expert still sees a worthwhile batch. A large expert-parallel deployment satisfies both at once, many GPUs to hold the experts and enough multiplexed traffic to keep tiles full, which is why a DeepSeek-V3-scale model is served on a sizable cluster with specialized all-to-all kernels rather than squeezed onto a single node. The communication that Chapter 4 taught and Chapter 17 wired into training is, at this scale, the thing the whole serving design is organized around.
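The two floors can be put side by side with back-of-envelope arithmetic. The sketch below takes the 671B total parameters and 256 routed experts quoted in this section; the FP8 weight format, the 80 GB per GPU, the 60 percent of memory reserved for weights, the top-8 routing, and the target of 32 tokens per expert are all assumptions chosen for illustration and should be replaced with a deployment's real values.

```python
import math

def min_gpus_for_weights(total_params: float, bytes_per_param: float,
                         gpu_mem_gb: float, weight_fraction: float) -> int:
    """Memory wall: GPUs needed just to hold the weights, leaving the rest for KV cache."""
    usable_bytes = gpu_mem_gb * 1e9 * weight_fraction
    return math.ceil(total_params * bytes_per_param / usable_bytes)

def min_batch_for_tokens_per_expert(target_tokens: float, num_experts: int, top_k: int) -> int:
    """Batching wall: invert t_expert = B * k / E for the concurrent decodes B."""
    return math.ceil(target_tokens * num_experts / top_k)

# Assumed: FP8 weights (1 byte), 80 GB GPUs, 60% of memory given to weights.
gpus = min_gpus_for_weights(671e9, 1.0, 80.0, 0.6)
# Assumed: top-8 routing over 256 routed experts, target of 32 tokens per expert.
batch = min_batch_for_tokens_per_expert(32, 256, 8)
print(f"memory floor  : at least {gpus} GPUs to hold the weights")
print(f"batching floor: at least {batch} concurrent decodes for 32 tokens per expert")
```

Neither number alone sizes the deployment: the first says how wide the expert-parallel group must be, and the second says how much multiplexed traffic that group needs before its experts stop idling.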

The next section steps back from MoE specifically to the inference engines that implement all of this, vLLM, TensorRT-LLM, and SGLang, and how the techniques of this chapter (tensor and pipeline parallelism, paged and disaggregated KV cache, continuous batching, and expert-parallel MoE) compose inside a production serving stack. That tour, the practical close of the chapter, begins in Section 24.9.

Exercise 24.8.1: Reading the Two Walls (Conceptual)

Using only the $E/k$ ratio from Section 1 and Output 24.8.1, answer for a model with $E = 128$ experts and $k = 4$: (a) how many times more parameter memory must the fleet provision than a single token's compute would suggest; (b) why raising $k$ from 2 to 4 changes both the memory ratio and the tokens-per-expert at a fixed batch, and in which direction each moves; (c) why a deployment serving a handful of users will be all-to-all-bound regardless of how fast its GPUs are. State which of the two walls (memory, batching) each part exposes.

Exercise 24.8.2: Hot-Expert Replication in the Model (Coding)

Extend Code 24.8.1 to model load imbalance. Instead of assuming tokens spread uniformly, give the busiest expert a share $f$ (for example $f = 0.20$) of all routed tokens while the rest split evenly, and define layer time by the slowest expert's GEMM rather than the mean. First show how the hot expert inflates TPOT at small batch. Then add a replication factor $R$ for that expert (its tokens split across $R$ replicas) and find the smallest $R$ that brings the layer time within 10 percent of the balanced case. Discuss the memory cost of your chosen $R$ relative to replicating the whole layer.

Exercise 24.8.3: When Does Overlap Beat Batching? (Analysis)

In Code 24.8.1 the all-to-all and the GEMM are combined with a $\max$, which assumes perfect overlap of the all-to-all with the next micro-step. Re-derive TPOT under the opposite assumption, no overlap, where layer time is the sum $t_{\text{a2a}} + t_{\text{gemm}}$, and compare the two TPOT curves across the batch sweep. At what batch size is the no-overlap penalty largest, in absolute terms and relative to TPOT? Observe that in the small-batch decode regime the penalty is only the small GEMM term, while the fixed all-to-all latency stays exposed even under perfect overlap with the expert GEMM alone. Use this to argue why frontier MoE serving invests in dedicated all-to-all kernels that both lower the fixed latency and overlap the collective with additional work (attention, other micro-batches) rather than relying on larger batches alone, and connect your answer to the $\alpha$ (fixed latency) term in the cost model of Chapter 3.