"I spent my whole life answering one token at a time, then a four-thousand-token prompt walked in and froze the entire room."
A Decode Step Stuck Behind a Prefill
An LLM request has two phases with opposite appetites: prefill reads the whole prompt in one compute-bound burst, while decode emits tokens one at a time, bound by memory bandwidth and lasting far longer. Forcing both onto the same GPU makes them fight, so a long prefill freezes everyone else's decoding. Prefill/decode disaggregation answers this the way a distributed system always does: split the workload across specialized machines. One pool of GPUs does nothing but prefill, another does nothing but decode, and the only thing crossing between them is the KV cache from the prompt. Each pool is then tuned and scaled for its own phase, so a serving fleet can meet its time-to-first-token and time-per-output-token targets independently instead of compromising on a single shared schedule. This section shows why the interference is real, what disaggregation costs, and when it wins.
In Section 24.4 we built the distributed and paged KV cache, the per-request memory that an LLM accumulates as it reads a prompt and writes a response. That section made the cache a movable object: blocks that can live on one machine and be addressed from another. This section spends that capability. Once the KV cache can travel, the two phases of generation no longer have to run on the same GPU, and pulling them apart turns out to fix one of the most stubborn latency problems in LLM serving. To see why, we first have to look closely at what those two phases actually demand from a GPU, because they could hardly be more different.
1. Two Phases, Opposite Resource Profiles Beginner
Every token an LLM generates is produced by a forward pass over a transformer, but the first forward pass is unlike all the others. Prefill processes the entire prompt at once. If the prompt is $S$ tokens long, prefill runs the attention and feed-forward layers over all $S$ positions in one forward pass made of large matrix multiplies, building the key and value vectors (the KV cache) for every prompt token. This is a dense, highly parallel computation that saturates the GPU's arithmetic units; it is compute-bound. It is also bursty and short: one big pass, then it is done. The longer the prompt, the longer this single burst, and a 4,000-token prompt can occupy a GPU for tens of milliseconds doing nothing but prefill.
Decode is the opposite in every respect. After prefill, the model generates the response one token at a time. Each decode step is a forward pass over a single new token, which reads the entire KV cache accumulated so far to compute attention. The arithmetic per step is tiny, but it must stream the whole KV cache and all model weights through the GPU's memory system, so the step is limited by how fast memory can be read, not by how fast the GPU can multiply; it is memory-bandwidth-bound. Decode is also long-running: a request generating 500 tokens performs 500 of these steps, spread over seconds. The contrast is summarized in Table 24.5.1.
| Property | Prefill | Decode |
|---|---|---|
| Work per step | Whole prompt ($S$ tokens) at once | One new token at a time |
| Bottleneck | Compute (arithmetic units) | Memory bandwidth (KV + weights) |
| Duration | Short, bursty | Long, many steps |
| SLO it dominates | Time to first token (TTFT) | Time per output token (TPOT) |
| Batching benefit | Limited (already compute-saturated) | Large (amortizes weight reads) |
The SLO row of Table 24.5.1 names the two latency targets that an LLM serving system lives or dies by. Time to first token (TTFT) is how long a user waits after sending a prompt before any output appears; it is dominated by prefill. Time per output token (TPOT), sometimes called inter-token latency, is the gap between successive generated tokens once output starts; it is dominated by decode and governs how fluid the stream feels. A serving fleet typically commits to service-level objectives (SLOs) on both, for example "p99 TTFT under 350 ms and p99 TPOT under 20 ms." The trouble is that on a shared GPU these two goals are in direct conflict.
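The compute-bound versus bandwidth-bound split can be sanity-checked with a back-of-envelope roofline calculation. The sketch below is illustrative only: the model size, peak throughput, and memory bandwidth are assumed round numbers rather than the specification of any particular GPU, and the helper names are hypothetical. It uses the common approximation of about 2 FLOPs per parameter per token and counts only weight reads, ignoring KV-cache and activation traffic.

```python
# Back-of-envelope roofline check. Every number below is an illustrative
# assumption, not a measurement of a specific GPU or model.
N_PARAMS = 8e9           # assumed model size (parameters)
BYTES_PER_PARAM = 2      # assumed 16-bit weights
PEAK_FLOPS = 1.0e15      # assumed peak dense throughput, FLOP/s
MEM_BW = 3.0e12          # assumed memory bandwidth, bytes/s


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Approximates compute as 2 FLOPs per parameter per token and counts only
    weight traffic, so the result is optimistic for decode (which also has to
    stream the KV cache).
    """
    flops = 2.0 * N_PARAMS * tokens_per_pass
    bytes_read = N_PARAMS * BYTES_PER_PARAM   # weights streamed once per pass
    return flops / bytes_read


def bottleneck(tokens_per_pass: int) -> str:
    """Compare intensity against the hardware ridge point (FLOPs per byte)."""
    ridge = PEAK_FLOPS / MEM_BW
    return "compute" if arithmetic_intensity(tokens_per_pass) >= ridge else "memory"


ridge = PEAK_FLOPS / MEM_BW
for label, tokens in [("prefill, S=4000", 4000),
                      ("decode, batch=1", 1),
                      ("decode, batch=64", 64)]:
    print(f"{label:18s} intensity={arithmetic_intensity(tokens):7.1f} "
          f"ridge={ridge:6.1f} -> {bottleneck(tokens)}-bound")
```

Under these assumptions a 4,000-token prefill sits far above the ridge point, while decode stays below it even at a batch of 64, which is why batching helps decode much more than it helps prefill.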
2. Why Co-Location Makes Them Interfere Intermediate
A standard inference engine runs prefill and decode on the same GPUs, interleaving them in each scheduling round. When a new request arrives, the engine must prefill its prompt before decoding can begin; that prefill is a large compute burst that occupies the GPU. Every other request that was happily decoding is now stalled until the burst finishes, because the engine runs one forward pass at a time on a given GPU and the prefill pass monopolizes it. The freshly arrived prompt got its first token reasonably quickly, but it did so by inflating the TPOT of every request already in flight. This is classic head-of-line blocking: a short, latency-sensitive operation (a decode step) waits behind a long one (a prefill) that happened to land in front of it.
Continuous batching, which we develop in Section 24.6, mixes many requests into one running batch, and that helps throughput enormously, but it does not remove this conflict; it only forces a choice. Batch a big prefill together with ongoing decodes and the whole batch runs at the slower, prefill-bound pace, hurting TPOT. Hold prefills back to protect decode latency and new requests wait longer for their first token, hurting TTFT. The chunked-prefill technique from Section 22.7 softens the blow by slicing a long prefill into smaller pieces that interleave with decode steps, so no single prefill burst is large enough to freeze the GPU. That is the leading non-disaggregated remedy, and on one node it is often enough. But it is still one set of GPUs trying to satisfy two contradictory schedules at once, and at high load the compromise shows.
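To see how chunking bounds the stall, here is a minimal sketch that splits a prompt into fixed-size chunks and reports the longest uninterrupted prefill burst a decode step could wait behind. It assumes a flat 0.2 ms of prefill per prompt token (an illustrative constant) and ignores the extra attention cost that later chunks pay for attending to earlier ones; the function names are hypothetical, not any engine's API.

```python
import math

PREFILL_MS_PER_TOK = 0.20   # assumed flat prefill cost per prompt token


def chunk_sizes(prompt_len: int, chunk_tokens: int) -> list[int]:
    """Split a prompt into consecutive chunks of at most chunk_tokens tokens."""
    if prompt_len <= 0 or chunk_tokens <= 0:
        raise ValueError("prompt_len and chunk_tokens must be positive")
    n_chunks = math.ceil(prompt_len / chunk_tokens)
    return [min(chunk_tokens, prompt_len - i * chunk_tokens)
            for i in range(n_chunks)]


def worst_decode_stall_ms(prompt_len: int, chunk_tokens: int) -> float:
    """Longest single prefill burst that a waiting decode step must sit behind."""
    return max(chunk_sizes(prompt_len, chunk_tokens)) * PREFILL_MS_PER_TOK


if __name__ == "__main__":
    for chunk in (700, 256, 128):
        sizes = chunk_sizes(700, chunk)
        print(f"chunk={chunk:4d} -> {len(sizes)} chunk(s), "
              f"worst stall {worst_decode_stall_ms(700, chunk):.1f} ms")
```

With a 700-token prompt, the unchunked burst is 140 ms, while 128-token chunks cap any single stall at about 25.6 ms. The total prefill work is unchanged; chunking only spreads it out, which is why it softens the conflict rather than removing it.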
Prefill wants to run as a large compute burst; decode wants steady, low-latency, memory-bound steps. On a shared GPU, any schedule that is good for one is bad for the other, so co-located serving always trades TTFT against TPOT along a single dial. Disaggregation removes the dial: it gives each phase its own machines, each batched, tuned, and scaled for exactly one resource profile, so both SLOs can be met at the same time instead of being traded off.
3. Disaggregation: A Pool per Phase Intermediate
The distributed-systems instinct, the one this whole book trains, is to stop asking one machine to do contradictory jobs and instead split the work across specialized machines. Prefill/decode disaggregation does exactly that. A prefill pool of GPUs accepts incoming prompts, runs the prefill forward pass, and produces the KV cache. The KV cache is then transferred over a fast interconnect to a decode pool of GPUs, which holds the cache, runs all the one-token decode steps, and streams the output back to the user. Each pool runs one phase only, so each can be batched and scheduled for its own resource profile, and each can be scaled to its own SLO: add prefill GPUs when TTFT is at risk, add decode GPUs when TPOT is at risk, independently. This is the same heterogeneous-placement idea introduced for the cluster as a whole in Section 1.2, now applied inside a single model's serving path. Figure 24.5.1 contrasts the two designs.
This is a genuinely distributed design, not a single-node optimization. The two pools are different sets of machines, possibly with different GPU types, different batch sizes, and different replica counts, coordinated by a scheduler that routes each request from prefill to decode. It is the serving-time sibling of the heterogeneous parallelism we used in training: where Chapter 22 studied the per-node cost of one request, disaggregation carries those economics across a fleet by giving each phase the hardware it actually wants.
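The routing role of that scheduler can be sketched in a few lines. This is a toy under stated assumptions, not how any production router works: the class names, the earliest-free and least-loaded policies, and the fixed 4 ms KV-transfer cost are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefillWorker:
    name: str
    free_at_ms: float = 0.0      # time at which this worker can start a new prefill


@dataclass
class DecodeWorker:
    name: str
    active_requests: int = 0     # requests currently decoding on this worker


@dataclass
class Router:
    """Toy two-pool router: earliest-free prefill worker, least-loaded decode worker."""
    prefill_pool: list[PrefillWorker]
    decode_pool: list[DecodeWorker]
    kv_transfer_ms: float = 4.0  # assumed fixed cost to ship one KV cache

    def route(self, arrival_ms: float, prefill_ms: float) -> tuple[str, str, float]:
        """Return (prefill worker, decode worker, time the KV cache is ready to decode)."""
        p = min(self.prefill_pool, key=lambda w: w.free_at_ms)
        start = max(arrival_ms, p.free_at_ms)
        p.free_at_ms = start + prefill_ms
        kv_ready_ms = p.free_at_ms + self.kv_transfer_ms

        d = min(self.decode_pool, key=lambda w: w.active_requests)
        d.active_requests += 1
        return p.name, d.name, kv_ready_ms

    def finish(self, decode_name: str) -> None:
        """Release a decode slot once a request has emitted its last token."""
        for d in self.decode_pool:
            if d.name == decode_name:
                d.active_requests -= 1
                return
        raise KeyError(decode_name)


if __name__ == "__main__":
    router = Router([PrefillWorker("p0"), PrefillWorker("p1")],
                    [DecodeWorker("d0"), DecodeWorker("d1")])
    for i in range(3):
        print(router.route(arrival_ms=i * 8.0, prefill_ms=140.0))
```

The point of the sketch is the separation of concerns: the prefill side is scheduled by when a worker becomes free, the decode side by how many streams it is already carrying, and the two pool sizes can be changed independently without touching the other policy.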
The book's spine is that AI at scale is the engineering of work spread across many machines. Disaggregation is that spine applied to a single inference request: rather than scale up one GPU to juggle two incompatible phases, we scale out across two pools, each doing one thing well, and pay a communication cost (the KV transfer from Section 24.4) to connect them. The decision is the same one we made for data-parallel training in Section 1.1: a little communication buys non-interference and independent scaling, and at high load that trade is overwhelmingly worth it.
4. Modeling the Cost: TTFT, TPOT, and the KV Transfer Intermediate
Disaggregation is not free. The KV cache produced by prefill must be physically moved to the decode pool before decoding can start, and that transfer adds to TTFT. We can write the two metrics as simple sums. For a request whose prompt has $S$ tokens, let $t_{\text{pf}}(S)$ be the prefill time, $t_{\text{q}}$ the time it waits in the prefill queue, and $t_{\text{kv}}$ the KV-cache transfer time. Then
$$\text{TTFT} = t_{\text{q}} + t_{\text{pf}}(S) + t_{\text{kv}}, \qquad \text{TPOT} = \frac{1}{G}\sum_{j=1}^{G} \big(t_{\text{step}} + t_{\text{wait},j}\big),$$

where $G$ is the number of generated tokens, $t_{\text{step}}$ is the raw cost of one decode step, and $t_{\text{wait},j}$ is the queueing delay the $j$-th step suffers from contention. In co-located serving, $t_{\text{wait},j}$ is large and spiky because decode steps wait behind prefill bursts; that is the interference. Disaggregation adds the $t_{\text{kv}}$ term to TTFT but drives the $t_{\text{wait},j}$ terms in TPOT down to near zero, because the decode pool never runs a prefill. The whole bet is that the constant $t_{\text{kv}}$ we add is small compared to the spiky waiting we remove, and that the prefill queue $t_{\text{q}}$ stays short because we scaled the prefill pool for it. The next code makes this concrete by simulating both designs and checking each SLO.
```python
import heapq
from dataclasses import dataclass

# A tiny discrete-event model of LLM serving. Each request has a prompt to
# prefill (compute-bound, one burst) and tokens to decode (bandwidth-bound,
# many small steps). We measure two SLO metrics:
#   TTFT = time to first token (arrival -> prefill done + KV transfer)
#   TPOT = time per output token (mean gap between decode steps)
PREFILL_MS_PER_TOK = 0.20   # prefill is fast per token but prompts are long
DECODE_MS_PER_STEP = 2.0    # one decode step for one request on a free GPU
KV_TRANSFER_MS = 4.0        # ship one request's KV cache prefill -> decode
N_PREFILL_WORKERS = 4       # disagg: prefill pool sized for the TTFT SLO
N_DECODE_WORKERS = 2        # disagg: decode pool sized for the TPOT SLO


@dataclass
class Req:
    rid: int
    arrival: float
    prompt_len: int
    gen_len: int


# Eight long-prompt requests arriving in a burst, each generating 60 tokens.
REQS = [Req(i, arrival=i * 8.0, prompt_len=700, gen_len=60) for i in range(8)]


def colocated(reqs):
    """One GPU does prefill AND decode, so a prefill burst blocks every decode
    step queued behind it (head-of-line interference)."""
    gpu_free = 0.0
    ttft, gaps = {}, {r.rid: [] for r in reqs}
    last_dec, rem = {}, {r.rid: r.gen_len for r in reqs}
    arr = {r.rid: r.arrival for r in reqs}
    plen = {r.rid: r.prompt_len for r in reqs}
    events = [(r.arrival, 0, r.rid) for r in reqs]   # 0 = prefill task
    heapq.heapify(events)
    while events:
        ready, kind, rid = heapq.heappop(events)
        start = max(ready, gpu_free)                 # contend for the one GPU
        if kind == 0:                                # blocking prefill burst
            gpu_free = start + plen[rid] * PREFILL_MS_PER_TOK
            ttft[rid] = gpu_free - arr[rid]
            last_dec[rid] = gpu_free
            heapq.heappush(events, (gpu_free, 1, rid))
        else:                                        # one decode step
            gpu_free = start + DECODE_MS_PER_STEP
            gaps[rid].append(gpu_free - last_dec[rid])
            last_dec[rid] = gpu_free
            rem[rid] -= 1
            if rem[rid] > 0:
                heapq.heappush(events, (gpu_free, 1, rid))
    return summarize(ttft, gaps)


def disaggregated(reqs):
    """Two pools. The prefill pool (parallel workers) produces the KV cache;
    the decode pool runs token steps and is never blocked by a prefill."""
    ttft, gaps, last_dec = {}, {r.rid: [] for r in reqs}, {}
    prefill_workers = [0.0] * N_PREFILL_WORKERS
    prefill_done = {}
    for r in sorted(reqs, key=lambda x: x.arrival):  # earliest-free worker
        w = min(range(N_PREFILL_WORKERS), key=lambda i: prefill_workers[i])
        start = max(r.arrival, prefill_workers[w])
        prefill_workers[w] = start + r.prompt_len * PREFILL_MS_PER_TOK
        prefill_done[r.rid] = prefill_workers[w] + KV_TRANSFER_MS  # + KV ship
        ttft[r.rid] = prefill_done[r.rid] - r.arrival

    decode_workers = [0.0] * N_DECODE_WORKERS
    rem = {r.rid: r.gen_len for r in reqs}
    events = []
    for r in reqs:
        heapq.heappush(events, (prefill_done[r.rid], r.rid))
        last_dec[r.rid] = prefill_done[r.rid]
    while events:                                    # round-robin decode
        ready, rid = heapq.heappop(events)
        w = min(range(N_DECODE_WORKERS), key=lambda i: decode_workers[i])
        start = max(ready, decode_workers[w])
        decode_workers[w] = start + DECODE_MS_PER_STEP
        gaps[rid].append(decode_workers[w] - last_dec[rid])
        last_dec[rid] = decode_workers[w]
        rem[rid] -= 1
        if rem[rid] > 0:
            heapq.heappush(events, (decode_workers[w], rid))
    return summarize(ttft, gaps)


def summarize(ttft, gaps):
    # Returns (mean TTFT, max TTFT, mean gap, p99 gap). With only eight
    # requests, the maximum TTFT stands in for the p99 TTFT.
    n = len(ttft)
    all_gaps = [g for gs in gaps.values() for g in gs]
    return (sum(ttft.values()) / n, max(ttft.values()),
            sum(all_gaps) / len(all_gaps),
            sorted(all_gaps)[int(0.99 * (len(all_gaps) - 1))])


TTFT_SLO, TPOT_SLO = 350.0, 20.0   # ms, p99 targets
for name, fn in [("Co-located (one pool)", colocated),
                 ("Disaggregated (two pools)", disaggregated)]:
    mt, pt, mtp, ptp = fn(REQS)
    print(name)
    print(f" TTFT mean={mt:7.1f} p99={pt:7.1f} SLO<={TTFT_SLO:.0f} "
          f"[{'PASS' if pt <= TTFT_SLO else 'FAIL'}]")
    print(f" TPOT mean={mtp:7.1f} p99={ptp:7.1f} SLO<={TPOT_SLO:.0f} "
          f"[{'PASS' if ptp <= TPOT_SLO else 'FAIL'}]\n")
print(f"KV-cache transfer charged per request (disagg only): {KV_TRANSFER_MS:.1f} ms")
```
Code 24.5.1: colocated serves prefill and decode on a single GPU, so prefill bursts inflate the decode gaps; disaggregated runs a four-worker prefill pool and a two-worker decode pool connected by a fixed KV-transfer cost. Both are scored against the same p99 TTFT and TPOT SLOs.

Output 24.5.1:

```
Co-located (one pool)
 TTFT mean=  602.0 p99= 1064.0 SLO<=350 [FAIL]
 TPOT mean=   24.1 p99=  292.0 SLO<=20 [FAIL]

Disaggregated (two pools)
 TTFT mean=  198.0 p99=  252.0 SLO<=350 [PASS]
 TPOT mean=    5.3 p99=    8.0 SLO<=20 [PASS]

KV-cache transfer charged per request (disagg only): 4.0 ms
```
The numbers tell the story the model predicted. In the co-located run, the p99 TPOT of 292 ms is the head-of-line blocking made visible: a decode step waited behind a 140 ms prefill burst, then another, and the inter-token gap exploded far past the 20 ms target. Disaggregation removes that prefill-induced waiting (its p99 TPOT is 8 ms, which is simply eight requests sharing two decode workers at 2 ms per step) at the price of a constant 4 ms added to every TTFT, a price so small it is invisible next to the 350 ms budget. The decode pool's tokens flow because nothing on those machines ever does a prefill, and the prefill pool clears the burst quickly because four workers absorb it in parallel. That is the whole argument for disaggregation, reduced to one runnable experiment.
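The TTFT rows of the output can also be reproduced by hand, which is a useful check that the simulation measures what the decomposition says it should. The sketch below reuses the simulation's constants and assumes only what the model itself assumes: co-located prefills serialize on one GPU, and the disaggregated pool hands request $i$ to prefill worker $i \bmod 4$ (which is what earliest-free assignment produces for this arrival pattern).

```python
# Closed-form check of the TTFT figures, using the simulation's constants.
PREFILL_MS = 700 * 0.20        # 140 ms of prefill per 700-token prompt
ARRIVAL_GAP_MS = 8.0           # request i arrives at i * 8 ms
KV_TRANSFER_MS = 4.0
N_REQS, N_PREFILL_WORKERS = 8, 4

# Co-located: prefills run back to back on one GPU, so request i finishes its
# prefill at (i + 1) * 140 ms, having arrived at i * 8 ms. Here t_kv = 0 and
# the whole excess over 140 ms is the queueing term t_q.
colo_ttft = [(i + 1) * PREFILL_MS - i * ARRIVAL_GAP_MS for i in range(N_REQS)]

# Disaggregated: four workers run in parallel; each request pays
# t_q (wait for its worker) + t_pf (140 ms) + t_kv (4 ms).
free_at = [0.0] * N_PREFILL_WORKERS
disagg_ttft = []
for i in range(N_REQS):
    arrival = i * ARRIVAL_GAP_MS
    w = i % N_PREFILL_WORKERS
    start = max(arrival, free_at[w])
    free_at[w] = start + PREFILL_MS
    disagg_ttft.append(free_at[w] + KV_TRANSFER_MS - arrival)


def mean(xs):
    return sum(xs) / len(xs)


print(f"co-located    mean={mean(colo_ttft):.1f}  max={max(colo_ttft):.1f}")
print(f"disaggregated mean={mean(disagg_ttft):.1f}  max={max(disagg_ttft):.1f}")
```

The script reproduces the 602.0 and 1064.0 ms co-located figures and the 198.0 and 252.0 ms disaggregated figures from the output. The first four disaggregated requests see a TTFT of 144 ms (no queueing), and the second four see 252 ms because each waits for a worker from the first wave to finish.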
The hundred-odd lines of scheduling and KV-transfer logic in Code 24.5.1 are exactly what production inference engines have absorbed. vLLM ships a disaggregated-prefill mode where a prefill instance and a decode instance are connected by a KV-transfer connector, and you opt in through configuration rather than code:
```python
# Prefill instance: produces KV cache and pushes it to the connector.
# (illustrative kv_transfer_config; see the vLLM disaggregated-prefill docs)
from vllm import LLM

prefill = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              kv_transfer_config={"kv_role": "kv_producer",
                                  "kv_connector": "PyNcclConnector"})

# Decode instance: pulls the KV cache and runs the token steps.
decode = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
             kv_transfer_config={"kv_role": "kv_consumer",
                                 "kv_connector": "PyNcclConnector"})
```
Both instances are configured through kv_transfer_config. The engine handles pool routing, the NCCL-based KV transfer, paged-cache layout, and continuous batching internally; the roughly 110 lines of simulation collapse to two constructor calls plus a launcher that runs the instances on separate GPUs.

5. When Disaggregation Wins, and When It Does Not Advanced
Disaggregation is a trade, and like every trade in this book it has a regime where it pays and a regime where it does not. It wins when the interference it removes costs more than the KV transfer it adds. Concretely, it pays off under high load, where prefills arrive often enough that they are constantly stepping on decodes, and under tight TPOT SLOs, where even occasional inter-token stalls violate the objective. It also wins when prompts are long, because long prompts make prefill bursts large and therefore more disruptive to co-located decode. In these regimes the spiky $t_{\text{wait},j}$ term dominates, and moving it out is worth a constant KV-transfer penalty.
It does not win when load is light, because an idle GPU has no decode steps to stall, so there is nothing to protect and the KV transfer is pure overhead. It struggles when the interconnect between pools is slow, because then $t_{\text{kv}}$ stops being a small constant and starts dominating TTFT; disaggregation assumes a fast link of the kind built in Chapter 4, and over a thin network it can lose to a co-located engine. And it adds operational complexity: two pools to size, a router to keep them balanced, and a failure mode where a decode replica dies holding KV caches that the prefill pool already discarded. On a single node serving moderate traffic, the chunked-prefill approach from Section 22.7 is simpler and often sufficient; disaggregation earns its keep at fleet scale, which is exactly where this chapter operates.
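How large $t_{\text{kv}}$ gets on a given link can be estimated from the cache size alone. The sketch below is a rough calculation under stated assumptions: the model shape (32 layers, 8 key/value heads, head dimension 128, 16-bit values) is an assumed configuration typical of an 8B-class model with grouped-query attention, and the two link speeds are illustrative round numbers, not measurements. It also ignores latency, protocol overhead, and any overlap of transfer with compute.

```python
def kv_cache_bytes(prompt_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of one request's KV cache after prefill.

    The leading factor of 2 counts one key vector and one value vector
    per layer, per KV head, per prompt token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * prompt_tokens


def transfer_ms(n_bytes: int, link_gbytes_per_s: float) -> float:
    """Pure bandwidth time to move n_bytes over a link, in milliseconds."""
    return n_bytes / (link_gbytes_per_s * 1e9) * 1e3


if __name__ == "__main__":
    # Assumed model shape and prompt length (illustrative).
    size = kv_cache_bytes(prompt_tokens=4000, n_layers=32,
                          n_kv_heads=8, head_dim=128)
    print(f"KV cache for a 4000-token prompt: {size / 1e6:.0f} MB")
    # Assumed link speeds in GB/s (illustrative): a fast fabric vs a thin network.
    for label, gbps in [("fast link, 50 GB/s", 50.0),
                        ("thin link, 1.25 GB/s", 1.25)]:
        print(f"  {label:22s} t_kv ~ {transfer_ms(size, gbps):6.1f} ms")
```

Under these assumptions the cache is about 524 MB, which costs roughly 10 ms on the fast link and over 400 ms on the thin one. The second figure alone would exceed a 350 ms TTFT budget, which is the sense in which a slow interconnect turns $t_{\text{kv}}$ from a small constant into the dominant term.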
Who: An inference platform engineer running a customer-facing LLM chat assistant on a cluster of GPUs.
Situation: Traffic was bursty, with long retrieval-augmented prompts (several thousand tokens) arriving alongside many in-flight conversations that were still streaming their answers.
Problem: The TTFT SLO was comfortably met, but p99 TPOT kept spiking past the 25 ms target whenever a long prompt arrived, and users saw the answer stream stutter and freeze.
Dilemma: Enable chunked prefill on the existing co-located pool, simple but still one set of GPUs juggling both phases, or disaggregate into separate prefill and decode pools, which meets both SLOs cleanly but adds a router, a KV-transfer path, and a second pool to operate.
Decision: They disaggregated, because the binding problem was decode-side interference at high load with long prompts, the exact regime where the KV-transfer cost is dwarfed by the stalls it removes.
How: They split the fleet into a prefill pool and a larger decode pool over a fast intra-rack interconnect, transferred the paged KV cache from Section 24.4 between them, and sized each pool to its own SLO, scaling the decode pool out as concurrency grew.
Result: p99 TPOT settled under 10 ms and stopped spiking on long prompts, TTFT rose by only the few-millisecond transfer cost, and each pool could now be scaled on its own metric, just as Output 24.5.1 predicts.
Lesson: Disaggregate when decode interference under load is the binding constraint and the interconnect is fast; below that load, or over a slow link, the simpler chunked-prefill remedy is the right tool.
The epigraph is not much of an exaggeration. A single 8,000-token system prompt landing on a co-located GPU really can freeze every other user's token stream for tens of milliseconds while it prefills, which is why support channels for early LLM products filled with reports of answers that "typed smoothly, then hung, then resumed." Disaggregation is, in effect, giving the long-winded newcomers their own room so the rest of the conversation can keep flowing.
6. The State of the Art Advanced
Disaggregation moved from idea to standard practice over a very short window, and the research record names the milestones. We summarize them here because the design choices they explored (how to size pools, where to put the KV transfer, when to fall back to co-location) are exactly the levers the previous sections described.
Two 2024 systems established the idea. DistServe (Zhong et al., 2024) showed that disaggregating prefill and decode onto separate GPU pools, each tuned for its phase, lets a serving system meet TTFT and TPOT SLOs that a co-located engine cannot, and it formalized how to assign GPUs to each pool given a target SLO. Splitwise (Patel et al., 2024) reached the same conclusion from a cost-and-power angle, splitting the phases across machines (even across heterogeneous GPU types, cheaper hardware for memory-bound decode) to raise throughput per watt. Building on these, Mooncake (Qin et al., 2024), the serving architecture behind the Kimi assistant, made the KV cache the center of gravity: a disaggregated, KV-cache-centric design with a large pooled cache store that turns the transfer of Section 24.4 into a first-class, prefix-shareable resource. MemServe (Hu et al., 2024) generalized this with an elastic memory pool (MemPool) that unifies disaggregated serving with prefix caching across instances. The open-source frontier has since absorbed the pattern: vLLM, SGLang, and TensorRT-LLM (Section 24.9) all ship disaggregated-serving modes, and the active questions are now about smarter KV routing, cache-aware pool balancing, and when adaptive engines should collapse back to co-location under light load.
The throughline from DistServe to MemServe is the one this section has argued from first principles: the two phases are different workloads, so give them different machines and connect them with the KV cache. With disaggregation in hand, the remaining question is how requests flow into and between these pools without idling GPUs, which is the scheduling problem we take up next.
Using the TTFT and TPOT decompositions in Section 4, explain in your own words which term blows up in the co-located run of Output 24.5.1 and why, and which term disaggregation adds in exchange. Then argue what happens to the comparison if every prompt is short (say 20 tokens): does disaggregation still help, and what does that tell you about the role of prompt length in choosing the design? Tie your answer to the "when it wins" conditions in Section 5.
Modify Code 24.5.1 in two ways. First, sweep the request inter-arrival gap (the arrival=i * 8.0 spacing) from very sparse to very dense and plot p99 TPOT for both designs; identify the load at which co-location first violates the TPOT SLO. Second, sweep KV_TRANSFER_MS from 4 ms up to 200 ms (a slow interconnect) and find the transfer cost at which disaggregation's TTFT advantage disappears. State the two break-even points you found and relate them to the win/lose regimes in Section 5.
Suppose prefill costs $0.2$ ms per prompt token, decode costs $2$ ms per step, prompts average $S = 1000$ tokens, responses average $G = 200$ tokens, and requests arrive at $\lambda = 40$ per second. Estimate the aggregate prefill GPU-seconds and decode GPU-seconds demanded per wall-clock second, and from those numbers propose a ratio of prefill workers to decode workers for the pools. Explain why this ratio is generally not one to one, and how it would shift if average prompt length doubled while response length stayed fixed. Connect your reasoning to the independent-scaling argument of Section 3.