Part V: Distributed Inference and Serving
Chapter 24: Distributed LLM Serving

Why Large-Model Serving Spans Many Machines

"I was asked to hold the model and remember the conversation. I could do one of those. They sent for more of me."

A GPU That Ran Out of Memory Mid-Sentence
Big Picture

A large language model cannot be served from one machine, and the reason is not throughput but a single request: the model's weights plus the memory it must keep to continue a conversation already exceed what one accelerator can hold. The remedy is to split the model itself across several GPUs so that one forward pass runs on many devices at once, and then, at fleet scale, to notice that answering a prompt has two phases with opposite resource profiles and to put those phases on different machines. This chapter is the large-language-model deepening of distributed inference: where Chapter 23 served many independent copies of a model that fits, here one copy does not fit, so the serving plan starts from memory arithmetic and ends with two latency targets that pull in different directions. This first section establishes why a single machine is impossible, and names the two service-level objectives that organize everything that follows.

The single-node mechanisms that make a transformer cheap to run, the key-value cache that turns quadratic decoding into linear decoding, paged attention that stops that cache from fragmenting memory, continuous batching that keeps the accelerator busy, and speculative decoding that spends a small model to save the big one, were the subject of Chapter 22. We treated them there as the per-node prerequisite, the scale-up baseline that distribution multiplies. This chapter assumes them and asks the next question. Once you have wrung out one GPU, what do you do when the model and its working memory still will not fit, or when one GPU cannot meet the latency a user expects? The answer is to span machines, and the shape of that span follows directly from where the memory goes.

1. One Request Already Overflows One GPU Beginner

Start with a single user sending a single prompt. Two pools of memory must coexist on the accelerator. The first is the model weights, fixed in size: a model with $P$ parameters stored at $b$ bytes each occupies $P b$ bytes regardless of how many requests are in flight. The second is the key-value cache, the per-request memory that holds the attention keys and values for every token already seen, so that each new token attends to the past without recomputing it. The cache grows with the conversation. For a model with $L$ layers and hidden width $H$, every token stores keys and values across all layers, costing

$$\text{KV bytes per token} = 2 \cdot L \cdot H \cdot b_{\text{kv}},$$

where the leading $2$ counts keys and values separately and $b_{\text{kv}}$ is the bytes per stored number. The total memory a single GPU of capacity $M$ must hold for a batch of $R$ concurrent requests each of context length $C$ is therefore

$$\text{Mem} = \underbrace{P\,b}_{\text{weights}} \; + \; \underbrace{2\,L\,H\,b_{\text{kv}} \cdot C \cdot R}_{\text{KV cache}} \;\le\; M.$$

The weights term is a one-time cost; the cache term scales with both how long the conversations are and how many of them run at once. The crucial observation, the one that forces this entire chapter, is that for a large model the weights alone can exceed $M$, and even when they do not, the cache term overtakes them as context lengths grow into the tens or hundreds of thousands of tokens. Neither term is optional: you cannot drop the weights, and you cannot answer a long prompt without remembering it.

Key Insight: LLM Serving Is Memory-Bound Before It Is Compute-Bound

For classical model serving, you add machines when request volume outgrows one server's throughput, and a single replica always fits. For large-model serving the binding ceiling appears one request in: the weights plus the key-value cache exceed one GPU's memory. The model must be split across devices not to go faster but to exist at all. Every distributed-serving decision in this chapter starts from the memory inequality above, not from a requests-per-second target.

The fix for the weights term is to shard the model across GPUs so that each device holds only a slice of every layer and the devices cooperate on each forward pass. Splitting a single matrix multiply across GPUs and summing their partial results is tensor parallelism, the subject of Section 24.2, and the summation it relies on is the same all-reduce collective introduced for training in Chapter 4. The fix for the cache term, when the cache outgrows the GPUs that hold the weights, is to distribute the key-value cache itself, which Section 24.4 develops. Before either fix, it pays to see the overflow in numbers.

2. The Fit Calculator: Counting GPUs From Memory Intermediate

The inequality above is small enough to evaluate directly, and doing so turns "the model is too big" from a slogan into a count of machines. Given the weight bytes and the per-token cache cost, the minimum number of GPUs is the total memory divided by one GPU's capacity, rounded up,

$$N_{\text{GPU}} = \left\lceil \frac{P\,b + 2\,L\,H\,b_{\text{kv}}\,C\,R}{M} \right\rceil.$$

This is a lower bound: it assumes weights and cache can be packed perfectly across devices with no replication or overhead, which real tensor-parallel layouts only approximate. Even as a lower bound it is decisive. The code below evaluates it for three model sizes at two operating points, a short prompt with modest concurrency and a long-context workload with high concurrency, using bf16 weights and an 80 GB accelerator.

import math

# (name, params_billions, hidden H, layers L)
models = [("8B", 8, 4096, 32), ("70B", 70, 8192, 80), ("405B", 405, 16384, 126)]
B_WEIGHT, B_KV, GPU_GB = 2, 2, 80          # bf16 weights, fp16 KV, one 80 GB GPU

def kv_bytes_per_token(H, L):              # keys + values, all layers
    return 2 * L * H * B_KV

# (context length C, concurrent requests R)
points = [(8_000, 32), (128_000, 256)]

print(f"{'Model':>6} {'ctx':>8} {'reqs':>5} {'weights GB':>11} {'KV GB':>9} {'total GB':>9} {'GPUs':>6}")
for name, pb, H, L in models:
    w = pb * 1e9 * B_WEIGHT
    for C, R in points:
        kv = kv_bytes_per_token(H, L) * C * R
        total = w + kv
        gpus = math.ceil(total / (GPU_GB * 1e9))     # the memory lower bound on devices
        print(f"{name:>6} {C:>8} {R:>5} {w/1e9:>11.1f} {kv/1e9:>9.1f} {total/1e9:>9.1f} {gpus:>6}")
Code 24.1.1: A pure-Python fit calculator. It computes weight memory, key-value cache memory, and the minimum tensor-parallel GPU count from the memory inequality of Section 1, with no model or framework required.
 Model      ctx  reqs  weights GB     KV GB  total GB   GPUs
    8B     8000    32        16.0     134.2     150.2      2
    8B   128000   256        16.0   17179.9   17195.9    215
   70B     8000    32       140.0     671.1     811.1     11
   70B   128000   256       140.0   85899.3   86039.3   1076
  405B     8000    32       810.0    2113.9    2923.9     37
  405B   128000   256       810.0  270582.9  271392.9   3393
Output 24.1.1: Even the smallest model needs two GPUs at modest concurrency, and the cache, not the weights, dominates as context grows. The 405B model at long context demands thousands of GPUs, almost all of it key-value cache.

Read the first column of numbers and the thesis of the chapter falls out. The 8B model has only 16 GB of weights, comfortably inside one 80 GB GPU, yet at a 32-request batch of 8,000-token contexts it already needs two GPUs because the cache adds 134 GB. The 70B model's weights alone, at 140 GB, overflow a single device before any user connects. And the cache term is explosive: at 128,000-token contexts with 256 concurrent requests, the 70B model's cache reaches roughly 86 terabytes, swamping its weights by a factor of six hundred and demanding over a thousand GPUs. The lesson the table teaches, and the reason large-model serving is a distributed-systems problem rather than a single-box optimization, is that the memory you must reserve to remember conversations grows without bound while the weights stay fixed. Section 24.4 returns to this table when it distributes the cache itself.

Fun Note: The Model Is the Cheap Part

It is a strange inversion. We spend months and millions training the weights, then in production those weights are the smallest line in the memory budget. For a long-context, high-concurrency workload the model is rounding error and the cache is the bill. The accelerators in a serving fleet spend most of their silicon not on the model you trained but on its short-term memory of what each user just said.

3. Sharding the Model, and the Two Phases of a Request Intermediate

Because one GPU cannot hold the model, the forward pass is spread across several. Each device holds a slice of every weight matrix, performs its part of each layer's matrix multiplications, and the devices exchange partial results so that the layer's output is correct as if one machine had computed it. Figure 24.1.1 shows the progression: a model and its growing cache overflowing one GPU on the left, the same model sharded across four cooperating GPUs in the middle, and on the right the two phases of generation that, at fleet scale, motivate splitting the work even further across machines.

One GPU: does not fit GPU memory M weights P·b KV cache (growing) overflow Tensor-parallel: one model, four GPUs GPU 0 GPU 1 GPU 2 GPU 3 slice slice slice slice all-reduce partial sums combined per layer Two phases of one request PREFILL read whole prompt at once compute-bound · sets TTFT DECODE one token at a time memory-bound · sets TPOT reads the whole KV cache each step Weights are sharded so the model fits (24.2); the cache is distributed when it overflows (24.4); prefill and decode have opposite profiles, so at fleet scale they run on different machines (24.5).
Figure 24.1.1: From impossible to distributed. Left: weights plus a growing key-value cache overflow one GPU. Middle: tensor parallelism shards every layer across four GPUs that combine partial results with an all-reduce, so one model runs as one forward pass on many devices. Right: a single request has a compute-bound prefill phase that sets time-to-first-token and a memory-bound decode phase that sets time-per-output-token; their opposite profiles motivate the disaggregation of Section 24.5.

The right panel of Figure 24.1.1 introduces the structural fact that the rest of the chapter exploits. Generating an answer is not one uniform computation; it is two phases. In the prefill phase the model reads the entire prompt at once, computing keys and values for every prompt token in parallel and producing the first output token. Prefill is compute-bound: it performs large matrix multiplications over many tokens simultaneously and saturates the accelerator's arithmetic units. In the decode phase the model generates the rest of the answer one token at a time, and each step must read the entire key-value cache to attend to the past while doing comparatively little arithmetic. Decode is memory-bound: it is limited by how fast the device can stream the cache out of memory, not by how fast it can multiply. One request, two opposite bottlenecks.

Thesis Thread: Per-Node Economics, Multiplied Across the Fleet

Chapter 22 derived the key-value cache and paged attention as single-GPU optimizations, the scale-up baseline. This section takes those same per-node quantities, the bytes per token and the prefill-versus-decode split, and multiplies them across a serving fleet until they force a multi-machine architecture. The cache that paged attention packed efficiently into one GPU's memory now overflows the GPUs that hold the model, so it must be distributed (Section 24.4). The prefill and decode phases that shared one accelerator now run on separate pools of machines (Section 24.5). This is the book's recurring move: a primitive that lived on one node returns, scaled out, as the organizing principle of a distributed system. The arithmetic did not change; the number of machines it implies did.

4. Two SLOs, Not One: TTFT and TPOT Intermediate

Because a request has two phases, its quality of service has two numbers, and collapsing them into a single latency figure hides the trade-off that distributed LLM serving exists to manage. The first is time-to-first-token (TTFT), the delay from when the user submits a prompt to when the first output token appears. TTFT is set almost entirely by prefill: a long prompt or a busy prefill queue makes the user wait before anything streams. The second is time-per-output-token (TPOT), sometimes called inter-token latency, the average gap between successive tokens once generation has begun. TPOT is set by decode: it determines how fast the answer streams out, and a user reads comfortably when it stays below roughly the speed of reading aloud.

These two objectives pull in different directions. Batching more requests together raises decode throughput, which is good for serving many users cheaply, but a larger batch lengthens each decode step and inflates TPOT for everyone in it. Prioritizing a new request's prefill to lower its TTFT steals accelerator time from the decode steps of requests already streaming, raising their TPOT. A serving system that optimizes one number in isolation will quietly wreck the other. This is why the right success metric is not latency but goodput, the rate of requests served while meeting both SLOs at once, a notion Section 5.3 defined for distributed systems in general and that Chapter 24 specializes to the TTFT-and-TPOT pair.

Research Frontier: Disaggregating Prefill From Decode (2024 to 2026)

The clearest recent answer to the two-SLO tension is to stop running prefill and decode on the same machines. DistServe (Zhong et al., 2024) assigns prefill and decode to separate GPU pools, each tuned for its own bottleneck and SLO, and reports large goodput gains over co-located serving because a slow prefill no longer stalls another user's decode. Splitwise (Patel et al., 2024, Microsoft) splits the two phases across different hardware generations, sending compute-heavy prefill to GPUs with high arithmetic throughput and memory-bound decode to cheaper memory-rich GPUs, then transfers the key-value cache between them over fast interconnect. Mooncake (Moonshot AI, 2024) takes the idea further with a key-value-cache-centric, disaggregated architecture built around a distributed cache pool that lets prefill results be reused and decode be scheduled independently. Section 24.5 develops this prefill/decode disaggregation in full; for now, note that the frontier has accepted the premise of this section, that the two phases are different enough to deserve different machines.

Practical Example: The Chatbot That Streamed Fast but Started Slow

Who: An inference platform engineer running a 70B assistant for a customer-support product.

Situation: Users pasted long support transcripts, often 10,000 tokens, then asked a short question and expected an immediate reply.

Problem: The dashboard showed a healthy median latency, but complaints spiked: the assistant felt sluggish to start even though it typed quickly once it began.

Dilemma: The single latency metric averaged a slow prefill into a fast decode and looked fine; raising the batch size to cut cost would have made the start even slower, while shrinking it to speed the start would have raised the per-token cost across the fleet.

Decision: They split the metric into TTFT and TPOT, discovered TTFT at the 95th percentile was over four seconds while TPOT was excellent, and treated prefill as the binding constraint.

How: They moved to a prefill/decode-disaggregated layout (the approach of Section 24.5), giving long prompts a dedicated prefill pool so a new transcript no longer queued behind other users' decode steps.

Result: The 95th-percentile TTFT fell below one second with TPOT unchanged, and goodput under the dual SLO rose because the two phases stopped contending for the same accelerators.

Lesson: One latency number is a trap for LLM serving. Measure TTFT and TPOT separately, find which phase binds, and place the phases on machines that can each meet its own target.

5. The Road Through This Chapter Beginner

The memory inequality and the two SLOs together set the agenda for everything that follows, and it is worth seeing the whole route before taking the first step. Section 24.2 shards a single forward pass across GPUs with tensor parallelism, the direct fix for weights that overflow one device. Section 24.3 extends the split across nodes with pipeline parallelism when even a tensor-parallel group spans more machines than one server holds. Section 24.4 distributes the key-value cache itself, the term that dominated Output 24.1.1. Section 24.5 disaggregates prefill from decode, the architectural answer to the two-SLO tension. Section 24.6 schedules requests and runs continuous batching across nodes so the fleet stays busy without violating either SLO. Section 24.7 reuses shared prompt prefixes and serves many fine-tuned adapters from one fleet. Section 24.8 serves mixture-of-experts models whose experts live on different machines, the inference companion to the expert parallelism of Chapter 17. Section 24.9 grounds all of it in the production engines that implement these ideas.

Library Shortcut: vLLM Shards the Model From One Argument

Code 24.1.1 told you how many GPUs a model needs; making it run across them is, in a modern engine, a single argument. vLLM, TensorRT-LLM, and SGLang each take a tensor-parallel degree and handle the weight sharding, the per-layer all-reduce, the paged key-value cache, and continuous batching that this chapter unpacks over the following sections. With vLLM the entire setup is:

from vllm import LLM, SamplingParams

# tensor_parallel_size shards every layer across 4 GPUs; the engine wires up the
# per-layer all-reduce, paged KV cache, and continuous batching for you.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)

out = llm.generate(["Summarize this support transcript: ..."],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
Code 24.1.2: The 70B model that overflowed one GPU in Output 24.1.1, served across four with one argument. The hundreds of lines of weight sharding, collective scheduling, and cache paging that Sections 24.2 through 24.4 build by hand collapse to tensor_parallel_size=4, which Section 24.9 opens up.

We have the reason a single machine is impossible, the calculator that turns it into a GPU count, and the two service-level objectives that every later design must honor. The most direct response to weights that do not fit is to split one matrix multiply across devices and combine their answers, and that is exactly where the next section begins, with tensor-parallel inference in Section 24.2.

Exercise 24.1.1: Where Does the Memory Go? Conceptual

Using the two terms of the memory inequality in Section 1, explain in words why a 7B model can serve a single short prompt on one GPU but needs several GPUs to serve hundreds of concurrent 100,000-token conversations, even though the weights never change size. Identify which term grows, what it scales with, and why the weights becoming "rounding error" (the Fun Note) is the expected outcome rather than a surprise. State which later section addresses the term that grows.

Exercise 24.1.2: Extend the Fit Calculator Coding

Modify Code 24.1.1 in two ways. First, add a column for the fraction of total memory spent on the key-value cache, and confirm it approaches 1 as context length grows. Second, add a grouped-query-attention factor: real models share key-value heads across query heads, dividing the cache term by a ratio $g$ (try $g = 8$). Recompute the GPU counts for the 70B long-context row with and without grouped-query attention, and report how many GPUs the optimization saves. Explain why grouped-query attention attacks the cache term specifically and leaves the weight term untouched.

Exercise 24.1.3: Two SLOs in Tension Analysis

A serving system processes a batch of $R$ requests. Argue from the prefill-versus-decode distinction of Section 3 why increasing $R$ tends to lower the per-request cost (good for throughput) while raising TPOT (bad for the streaming SLO), and why admitting a new long prompt lowers nothing but the new user's TTFT while raising the TPOT of everyone already decoding. Sketch why a single latency number cannot capture this trade-off and why goodput under both SLOs, as defined in Section 5.3, is the metric that can. Predict qualitatively how prefill/decode disaggregation (Section 24.5) changes the trade-off.