Part V: Distributed Inference and Serving
Chapter 24: Distributed LLM Serving

Distributed and Paged KV Cache

"I do not run out of math to do. I run out of places to remember. Give me one more block of memory and I will gladly hold one more conversation."

A KV Cache Manager Counting Free Blocks
Big Picture

At serving time the throughput lever is not how fast a GPU multiplies matrices; it is how the fleet manages the KV cache, because the cache, not the FLOPs, decides how many sequences can run at once. A single GPU can issue far more arithmetic than its memory can keep sequences resident for, so the binding question is always "how many KV-cache blocks are free?" This section scales the single-node KV cache and PagedAttention of Chapter 22 up to a fleet. We treat GPU memory blocks as a managed, sometimes shared resource: we page cold blocks down to CPU memory to fit more sequences, we build a tiered store (GPU HBM, CPU DRAM, remote or SSD) for very long contexts and prefix reuse, we shard each cache across the tensor-parallel GPUs that already hold the model, and we transfer a cache between machines when prefill and decode live on different nodes. Throughput is won and lost in this memory accounting.

In the previous sections of this chapter we split one model across machines: tensor-parallel inference (Section 24.2) cut each weight matrix across GPUs, and pipeline-parallel inference (Section 24.3) cut the layer stack across nodes. Those splits answer "how do we hold a model too large for one device?" This section answers a different and, at serving time, more pressing question: "how do we hold the state of every in-flight conversation when that state grows with every token and every concurrent user?" The model weights are a fixed cost paid once; the KV cache is a variable cost that scales with load, and on a busy server it is what actually runs out. Recall from Section 22.5 that PagedAttention manages that cache like virtual memory on one GPU. We now multiply that idea across the fleet.

1. The KV Cache Is the Binding Capacity Constraint Beginner

To size a serving fleet you first need the unit economics of one node, exactly the discipline Chapter 23 built into fleet sizing. The unit that matters is not peak FLOPs; it is concurrency, the number of sequences a node can keep decoding at the same time. Each active sequence holds a KV cache whose size grows linearly with its context length, and the sum of all those caches must fit in the memory left over after the weights. When that memory fills, the server cannot admit another sequence no matter how idle its arithmetic units are. The cache is the throughput valve.

The footprint follows directly from the attention structure of Section 22.5. For a model with $L$ layers, $H_{kv}$ key/value heads (fewer than the query heads under grouped-query attention), head dimension $d_\text{head}$, and $b$ bytes per element, the cache for a single sequence of $t$ tokens is

$$M_\text{seq}(t) \;=\; 2 \, L \, H_{kv} \, d_\text{head} \, b \, t,$$

where the leading $2$ counts keys and values. The number of sequences a node can hold concurrently is then the memory budget divided by the per-sequence cost,

$$N_\text{concurrent} \;=\; \left\lfloor \frac{M_\text{HBM} - M_\text{weights}}{M_\text{seq}(t_\text{ctx})} \right\rfloor ,$$

with $t_\text{ctx}$ the context each sequence is allowed to grow to. Two facts make this the central equation of serving. First, $M_\text{seq}$ scales with context length, so a few very long conversations can crowd out many short ones. Second, $M_\text{weights}$ is fixed, so every gigabyte not spent on weights is a gigabyte of concurrency. Everything in this section is a strategy for making the numerator larger or the denominator effectively smaller.

Key Insight: KV-Cache Management, Not FLOPs, Is the Serving Throughput Lever

Decode is memory-bound, not compute-bound: each step reads the entire cache and does a small amount of arithmetic. A serving node therefore exhausts KV memory long before it exhausts compute, so the quantity to engineer is bytes of cache per concurrent sequence, not floating-point operations per second. Paging cold blocks to CPU memory, tiering the cache across HBM, DRAM, and remote storage, and sharing blocks across requests are all moves that raise concurrency without touching the model. Treat GPU KV blocks the way an operating system treats physical page frames: a scarce, managed, reclaimable resource.

2. A Tiered, Shared KV Store Across the Fleet Intermediate

Once the cache is the constraint, the natural response is the same one operating systems reached for decades ago: a memory hierarchy. The hottest blocks, those of sequences decoding right now, stay in GPU HBM. Blocks of paused or backgrounded sequences page down to CPU DRAM, which is an order of magnitude larger and one bus away (KV offloading). Blocks worth keeping for reuse across requests, but too cold for DRAM, spill to a remote store on SSD or a dedicated cache service. The tiers trade capacity against latency, and a sequence's blocks move up the hierarchy on demand, exactly as virtual-memory pages fault back into RAM. Figure 24.4.1 shows the hierarchy together with the cross-machine transfer that disaggregated serving forces.

Tiered KV cache on one node GPU HBM · hot blocks sequences decoding now · fastest, smallest CPU DRAM · warm offloaded blocks paused sequences · ~10x larger, one bus away Remote / SSD · cold & shared blocks long-context tails · reusable prefixes page in page out fetch spill hotter ↑ faster, smaller    colder ↓ larger, cheaper KV transfer in prefill / decode disaggregation Prefill node ingest prompt, build KV cache Decode node receive cache, emit tokens KV move full cache over NVLink / RDMA / IB a real cross-machine data movement (Section 24.5)
Figure 24.4.1: Left, the tiered KV cache on a single node: hot blocks live in GPU HBM, warm blocks page out to CPU DRAM (KV offloading), and cold or reusable blocks spill to a remote SSD store, with page-in and page-out arrows mirroring virtual memory. Right, the cross-machine case: when prefill and decode run on different nodes (Section 24.5), the prefill node's KV cache must be transferred in full to the decode node, a significant data movement whose cost we measure in Code 24.4.1.

Two of these tiers do more than enlarge capacity. CPU offload lets a node admit more active sequences than HBM alone would allow, because paused turns of a multi-turn chat can wait in DRAM and page back in only when the user replies. The remote tier enables sharing: a prompt prefix common to thousands of requests, a long system prompt, a shared document, a few-shot preamble, can be stored once and reused, which we develop next and carry into the prefix-caching fleets of Section 24.7.

3. Sharding the Cache, and Sharing It Intermediate

The cache is partitioned along two independent axes, and it helps to keep them separate. The first is within a model replica. When tensor-parallel inference (Section 24.2) splits the attention heads across $P$ GPUs, each GPU naturally holds the KV cache only for its own heads. The cache is sharded for free along the same axis as the weights, so no head's keys and values are duplicated and the per-GPU footprint of $M_\text{seq}$ drops by roughly $P$. This is partitioning and sharding (Chapter 16) reappearing on the inference side: the same tensor cut that distributes computation distributes the cache.

The second axis is across requests. Distinct sequences that begin with identical tokens produce identical keys and values for that shared span, because attention over a prefix depends only on the prefix. A paged cache can therefore point several sequences at the same physical blocks for their common prefix and allocate private blocks only where the sequences diverge, the same copy-on-write trick a process fork uses. With thousands of requests sharing a long system prompt, this turns a per-request cost into a paid-once cost and can multiply effective concurrency well beyond what either offload or sharding alone delivers.

Thesis Thread: Per-Node Memory Economics, Multiplied Across the Fleet

Chapter 22 taught the KV cache and PagedAttention as a scale-up technique: one GPU, managed like virtual memory, serving more sequences. This section is where that scale-up substrate becomes a scale-out resource. The same blocks are now sharded across tensor-parallel GPUs, tiered down to CPU and remote memory, shared across requests, and transferred between machines. The book's recurring move, a primitive returns scaled out, applies even to a data structure: the cache that one node managed for itself becomes a fleet-wide, sometimes shared, sometimes shipped resource, and the throughput of the whole serving system rests on how well it is managed.

4. Measuring Capacity, Sharing, and Transfer Cost Intermediate

The capacity model and the transfer cost are concrete enough to compute. The program below sizes one serving node for a representative grouped-query-attention model, then reports how many concurrent sequences fit under three policies: GPU memory only, GPU plus CPU offload, and paged sharing of a common prefix on top of offload. It then estimates what it costs to move one prompt's KV cache between two machines over three classes of interconnect, the data movement that prefill/decode disaggregation (Section 24.5) pays on every request.

import math

# --- Per-sequence KV-cache footprint (bytes) ---------------------------------
# bytes = 2 (K and V) * L layers * H_kv heads * d_head * dtype_bytes * tokens
L, H_kv, d_head, dtype = 32, 8, 128, 2          # ~ a 7-8B model, GQA, fp16
per_token = 2 * L * H_kv * d_head * dtype       # bytes per token, one sequence
ctx = 8192                                      # tokens held per active sequence
per_seq = per_token * ctx                        # bytes per full-context sequence

GB = 1024 ** 3
print("KV bytes / token / sequence :", per_token, "B")
print("KV bytes / sequence (8k ctx):", f"{per_seq / GB:.3f} GiB")

# --- Capacity model: how many concurrent sequences fit? ----------------------
# A node has HBM; weights take a fixed slice; the rest is the KV-cache budget.
hbm = 80 * GB                                    # one 80 GB accelerator
weights = 16 * GB                                # fp16 weights resident on device
gpu_budget = hbm - weights                       # HBM left for KV blocks

def fits(budget):
    return int(budget // per_seq)

gpu_only = fits(gpu_budget)

# GPU + CPU offload: cold blocks page out to host DRAM, freeing HBM for more
# *active* sequences. Effective budget = HBM-KV + a slice of CPU DRAM.
cpu_dram = 256 * GB
offload_budget = gpu_budget + 0.5 * cpu_dram     # half the host RAM lent to KV
gpu_plus_offload = fits(offload_budget)

# Paged + prefix sharing: many sequences share a common system-prompt prefix,
# so the shared blocks are stored once instead of per sequence.
shared_prefix = 2048                             # tokens of shared system prompt
unique_tokens = ctx - shared_prefix
shared_cost = per_token * shared_prefix          # paid once for the whole group
paged_budget = offload_budget
# solve: shared_cost + n * per_token * unique_tokens <= budget
paged_share = int((paged_budget - shared_cost) // (per_token * unique_tokens))

print()
print("concurrent sequences that fit on one node")
print("  GPU-only (HBM)            :", gpu_only)
print("  GPU + CPU offload         :", gpu_plus_offload)
print("  + paged prefix sharing    :", paged_share)
print("  offload gain              :", f"{gpu_plus_offload / gpu_only:.2f}x")
print("  sharing gain over offload :", f"{paged_share / gpu_plus_offload:.2f}x")

# --- Cross-node KV transfer cost (prefill -> decode disaggregation) -----------
# A prefill node fills the cache for a prompt, then ships it to a decode node.
prompt = 4096
kv_to_move = per_token * prompt                  # bytes of KV for this prompt
for name, gbps in [("PCIe/TCP 16 GB/s", 16), ("NVLink/RDMA 100 GB/s", 100),
                   ("InfiniBand 400 Gb/s", 400 / 8)]:
    secs = kv_to_move / (gbps * 1e9)
    print(f"  move {kv_to_move/GB:.3f} GiB over {name:<22}: {secs*1e3:7.2f} ms")
Code 24.4.1: A pure-Python KV-cache capacity and transfer model. It computes per-sequence footprint, then concurrency under three policies (GPU-only, GPU+offload, paged prefix sharing), and finally the wall-clock cost of shipping one prompt's cache across three interconnect classes. No GPU or model is loaded; the arithmetic is the serving capacity equation of Section 1.
KV bytes / token / sequence : 131072 B
KV bytes / sequence (8k ctx): 1.000 GiB

concurrent sequences that fit on one node
  GPU-only (HBM)            : 64
  GPU + CPU offload         : 192
  + paged prefix sharing    : 255
  offload gain              : 3.00x
  sharing gain over offload : 1.33x
  move 0.500 GiB over PCIe/TCP 16 GB/s      :   33.55 ms
  move 0.500 GiB over NVLink/RDMA 100 GB/s  :    5.37 ms
  move 0.500 GiB over InfiniBand 400 Gb/s   :   10.74 ms
Output 24.4.1: One 80 GiB node holds 64 full-context sequences on HBM alone; lending half of a 256 GiB host to offloaded blocks triples that to 192, and sharing a 2048-token prefix lifts it to 255. The same node's cache for a 4096-token prompt is half a gibibyte, costing 34 ms to ship over a commodity link but only about 5 ms over NVLink-class RDMA, which is why disaggregation lives or dies on interconnect speed.

Three readings matter. Offload raised concurrency threefold by spending host DRAM the node already had. Prefix sharing added a further third on top, and it grows with how many requests share the prefix and how long that prefix is. The transfer numbers explain the design tension of the next section: half a gibibyte is cheap on a fast fabric and painful on a slow one, so where prefill and decode sit relative to each other, and what links them, is a first-order serving decision, not a deployment afterthought.

Practical Example: The Chat Service That Doubled Users Without a New GPU

Who: An inference platform engineer running a customer-support chat assistant on a fixed pool of 80 GB accelerators.

Situation: Every conversation began with the same 1800-token system prompt and policy preamble, and multi-turn sessions sat idle between user replies while still holding full KV caches.

Problem: The fleet hit an admission ceiling at roughly 60 concurrent sessions per GPU and started queueing during business hours, even though GPU utilization sat near 40 percent.

Dilemma: Buy more accelerators to add HBM, the obvious scale-up move, or change how the existing KV memory was managed before spending capital.

Decision: They changed management first: enable prefix sharing for the common preamble and offload idle-turn caches to host DRAM, reserving HBM for sequences actively decoding.

How: They turned on automatic prefix caching and CPU KV offload in their inference engine, paged idle sessions out on a short timeout, and paged them back in on the next user message, no model or weight change.

Result: Sustained concurrency per GPU rose past 150 sessions, the daytime queue vanished, and the planned hardware purchase was deferred, matching the threefold and further gains in Output 24.4.1.

Lesson: When the cache is the constraint, the cheapest capacity is the memory you already own, reclaimed by tiering and sharing rather than bought as more HBM.

Library Shortcut: vLLM and LMCache Do the Tiering and Sharing for You

The capacity accounting in Code 24.4.1 is something production engines manage automatically. In vLLM, the paged KV cache, automatic prefix sharing, and CPU offload are configuration, not code you write: the engine allocates blocks, deduplicates shared prefixes by hashing them, and pages cold blocks to host memory on its own.

# pip install vllm lmcache
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,        # share identical prefix blocks across requests
    swap_space=64,                     # GiB of host DRAM for paged-out (offloaded) KV
    gpu_memory_utilization=0.92,       # fraction of HBM the KV-block pool may use
)
# Thousands of requests that share a system prompt now reuse its KV blocks,
# and idle sequences page to host memory, exactly the policies modeled above.
out = llm.generate([" ... user turn ..."], SamplingParams(max_tokens=256))
Code 24.4.2: The roughly forty lines of capacity logic in Code 24.4.1 collapse to three constructor flags. The engine handles block allocation, prefix deduplication, offload paging, and admission control; a tiered KV layer such as LMCache extends the same idea to a shared cross-node store. You supply the policy, not the bookkeeping.

5. Why the Cache Decides Fleet Throughput Advanced

Stepping back to the fleet, the per-node capacity equation propagates straight into the sizing arithmetic of Chapter 23. If each node holds $N_\text{concurrent}$ sequences and a replica sustains some tokens per second per sequence, total fleet throughput is the per-node concurrency times the number of replicas, minus whatever the cache movements cost. Tiering and sharing raise the multiplicand on every node at once; a poorly managed cache lowers it on every node at once. This is why two clusters with identical GPUs and identical models can differ by more than twofold in sustained throughput: the difference is cache management, not silicon.

The transfer cost is the term that distinguishes a single-node optimization from a distributed one. Inside a node the hierarchy moves blocks over PCIe and the memory bus; across nodes it moves them over the network, and that movement competes with the very requests it serves. The disaggregated design of Section 24.5 accepts this cost deliberately, separating the compute-heavy prefill phase from the memory-heavy decode phase so each can scale independently, and pays for it with the KV transfer in Figure 24.4.1. Whether that trade is worth it is exactly the kind of interconnect-versus-recompute calculation Output 24.4.1 makes concrete.

Fun Note: The Cache Ate the Weights

For a long-context model serving many users, the keys and values quietly outweigh the parameters that produced them. A 7-billion-parameter model is about 14 GiB in fp16, yet 64 sequences at 8k context, as in Output 24.4.1, are also about 64 GiB of cache. The server spends more memory remembering the conversation than it does on the model that holds the conversation. The model is the smaller resident; the memory of what everyone said is the larger one.

Research Frontier: Tiered and Disaggregated KV Stores (2024 to 2026)

Treating the KV cache as a first-class distributed store is one of the most active areas in serving systems. Mooncake (Qin et al., 2024), the architecture behind the Kimi service, is built around a disaggregated, KV-cache-centric design: a pooled KV store spanning GPU, CPU, and SSD across the cluster, with prefill and decode separated and the cache treated as the scheduling unit. LMCache (2024 to 2025) is an open-source KV layer that caches and reuses blocks across a tiered hierarchy and shares them across vLLM instances, reporting large reductions in time-to-first-token for reused contexts. The broader KV-offloading and prefix-aware-routing literature pushes blocks down to host and remote memory and routes requests to the node that already holds the matching prefix, so cache locality, not load alone, drives placement. The throughline is the thesis of this section taken to its conclusion: the KV cache is not a per-GPU implementation detail but a managed, shared, fleet-wide resource, and the next wave of serving gains comes from managing it better rather than from faster matrix units.

Exercise 24.4.1: Read the Capacity Equation Conceptual

Using the per-sequence footprint formula $M_\text{seq}(t) = 2 L H_{kv} d_\text{head} b\, t$ from Section 1, explain in words why a model that switches from multi-head attention ($H_{kv}$ equal to the number of query heads) to grouped-query attention with eight key/value heads can quadruple or more its concurrent-sequence capacity without changing context length or HBM. Then state which two terms in $N_\text{concurrent}$ a serving operator can influence at deployment time and which are fixed by the model architecture, and connect each lever to one policy from Section 2 (tiering) or Section 3 (sharding/sharing).

Exercise 24.4.2: Extend the Capacity Model Coding

Modify Code 24.4.1 so that context length is not fixed at 8192 but drawn from a mix: 80 percent of sequences hold 2k tokens and 20 percent hold 32k tokens. Compute the expected per-sequence footprint and the resulting GPU-only concurrency, and compare it to the uniform-8k case. Then add a fourth policy, "paged sharing plus offload with a 4096-token shared prefix," and report its concurrency. Explain why a heavy tail of long contexts hurts concurrency more than the average context length alone would suggest, and how prefix sharing partially offsets it.

Exercise 24.4.3: Is Disaggregation Worth the Transfer? Analysis

From the transfer figures in Output 24.4.1, a 4096-token prompt's cache is half a gibibyte and costs about 5 ms over a 100 GB/s link and about 34 ms over a 16 GB/s link. Suppose the decode phase that follows produces 256 tokens at 20 ms per token, and that disaggregation (Section 24.5) lets the decode node run at 1.5 times the throughput it would reach if it also had to do prefill. Estimate the end-to-end latency with and without the transfer on each link, and determine the interconnect bandwidth below which moving the KV cache erases the disaggregation benefit. State the general rule your numbers imply for when prefill/decode separation pays off.