Section 40.7: Distributed Model Serving with vLLM

"A KV cache, paged like virtual memory, serving a hundred conversations at once and pretending each one is the only one that matters."
A Decode Step That Never Stops to Wait

Big Picture

Every reasoning step the agents of Section 40.6 take ends in the same place: a token request sent to a model-serving backend, and the whole application is only as fast, as cheap, and as available as that backend. A single agent run fans out into dozens of short generations, all sharing a long system prompt, arriving in bursts. The serving tier must turn that ragged, concurrent traffic into a steady stream of tokens at a tight latency budget, and no single GPU can do it: the model is sharded across several accelerators to fit in memory, and that shard group is then replicated across nodes to meet demand. This section assembles the production serving fleet that powers the agent application, built on vLLM, applying the per-node efficiency of Chapter 22 and the distributed LLM serving theory of Chapter 24. We do not re-derive those mechanisms here; we deploy them, and we size them for the agentic workload.

The agents of Section 40.6 are clients. They plan, call tools, and reflect, but the substance of every step is a call to a large language model that returns the next action as text. In a toy demo that model is a hosted API behind a key. In the system this chapter builds, the application owns its serving tier, because the agent fleet issues far too many calls, with too much shared structure and too strict a cost target, to leave on a metered public endpoint. Owning the backend means owning the engineering that Chapter 24 teaches: continuous batching, paged key-value (KV) caches, prefill and decode scheduling, and tensor parallelism, all wrapped in a replicated, autoscaling fleet. vLLM is the open engine that packages those mechanisms, and it is the backend behind a large share of self-hosted agent deployments today.

The shape of the agentic workload makes the choice of engine consequential rather than incidental. A web-serving stack treats requests as independent and roughly uniform. Agent traffic is neither: a single user task spawns a tree of model calls (plan, then several tool-formatting calls, then a reflection), most of them short, a few of them long, and almost all of them prefixed by an identical multi-thousand-token system prompt that names the tools and the rules. A serving engine that exploits that structure, by batching across requests continuously and by storing the shared prefix once, serves several times more agents per GPU than one that does not. Figure 40.7.1 shows the fleet we are about to build.

Figure 40.7.1: The serving fleet behind the agent application. Bursty, prefix-sharing calls from the agent fleet enter a router that balances them across $R$ replicas. Each replica is one tensor-parallel shard group: the model is split across GPU 0 and GPU 1, which exchange activations with an all-reduce per layer (the model-parallel axis of Chapter 16). Inside the replica, a continuous-batching scheduler feeds a PagedAttention KV cache whose shared system-prompt block is stored once and referenced by every sequence. Replicas are the data-parallel axis: the same shard group, copied to meet demand.

1. The Serving Engine: Continuous Batching and PagedAttention Intermediate

A language model generates one token at a time. Each step runs the full network over the current context and emits the next token, so a request of $T$ output tokens needs $T$ sequential forward passes through the model. Per-node efficiency, the subject of Chapter 22, is about making each of those passes cheap; serving throughput is about running as many requests through each pass as possible. The two combine multiplicatively, which is why the fleet inherits Chapter 22 as a labeled prerequisite rather than re-deriving it.

The naive way to batch is static: collect a fixed group of requests, run them together until the last one finishes, then start the next group. Agent traffic punishes this badly, because the generation lengths vary by an order of magnitude. A batch holding fifteen twenty-token tool calls and one five-hundred-token plan runs for five hundred steps, and for most of those steps the short sequences have already finished and their slots sit idle. Continuous batching, also called in-flight batching, fixes this by scheduling at the granularity of a single decode step: the moment a sequence emits its end-of-sequence token and frees a slot, a waiting request is admitted into that slot on the very next step. The engine never stalls on the slowest member of a batch. If we write $g_i$ for the output length of request $i$ in a window and $d$ for the per-step decode time of the batch, static batching keeps a batch of size $B$ busy for $\max_i g_i$ steps while only $\sum_i g_i$ are useful, so its slot utilization is

$$U_{\text{static}} = \frac{\sum_{i=1}^{B} g_i}{B \cdot \max_i g_i} \le 1, \qquad U_{\text{continuous}} \to 1,$$

and the achieved throughput in tokens per second rises in direct proportion to that utilization. Under the heavy length variance of agent traffic, $U_{\text{static}}$ can fall well below one half, and continuous batching recovers the lost fraction. The demo in Code 40.7.1 measures exactly this gap.

The second mechanism is memory management for the KV cache. During generation, each layer caches the key and value vectors of every past token so they are not recomputed; this cache grows with the context length and dominates the memory budget once the weights are resident. The bytes a single sequence consumes are

$$M_{\text{KV}} = 2 \cdot L \cdot H_{\text{kv}} \cdot d_{\text{head}} \cdot T \cdot b,$$

where $L$ is the number of layers, $H_{\text{kv}}$ the number of key-value heads, $d_{\text{head}}$ the head dimension, $T$ the context length, $b$ the bytes per element (the factor $2$ counts keys and values). Reserving the maximum possible $T$ for every sequence up front, as early servers did, wastes most of the cache on contexts that never grow that long. PagedAttention, the mechanism vLLM is built around (Kwon et al., 2023), borrows the idea of paged virtual memory from operating systems: it splits the KV cache into fixed-size blocks and allocates them on demand as a sequence grows, so physical memory holds only the tokens that actually exist. Fragmentation drops from roughly sixty percent in the reserve-the-maximum scheme to a few percent, which is the same as multiplying the number of sequences that fit by a factor of two or more. Quantizing the cache to a smaller $b$, the scale-up lever of Chapter 22, shrinks $M_{\text{KV}}$ further.

Key Insight: Serving Throughput Is a Memory Problem Wearing a Compute Costume

It is tempting to think a faster GPU serves more agents. Past a point it does not, because the binding constraint during decode is not arithmetic but the KV-cache memory that caps how many sequences can be in flight at once. Continuous batching keeps the compute saturated; PagedAttention and quantization decide how many sequences that saturated compute is allowed to serve. The two levers are orthogonal, and a serving tier that pulls only one of them leaves most of its capacity unclaimed.

2. Sizing the Cache: How Many Agents Fit on One GPU? Intermediate

The formulas above turn into a capacity number the moment we plug in a real model and a real card. The demo below does this for a 13-billion-parameter decoder served in half precision on one 80 GB accelerator. It computes the KV bytes per token, subtracts the resident weights and a working-memory reserve to find the budget left for the cache, and divides to get the number of concurrent sequences. It then models the prefix-sharing win that is specific to agents, and finally compares continuous against static batching on a length distribution drawn from agentic traffic. The numbers it prints are the ones you would use to decide how many replicas the fleet needs.

import math, random

# ---- Model / hardware configuration (a 13B-class decoder served in fp16) ----
n_layers, n_kv_heads, head_dim, bytes_per = 40, 40, 128, 2   # fp16 -> 2 bytes
gpu_mem_gb, weights_gb, overhead_gb = 80, 26, 6              # one A100-80GB

# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
kv_budget_bytes = (gpu_mem_gb - weights_gb - overhead_gb) * (1024 ** 3)

avg_ctx = 2048                                  # tokens per sequence
kv_per_seq = kv_per_token * avg_ctx
max_seqs = int(kv_budget_bytes // kv_per_seq)
print("=== KV-cache sizing (13B fp16, A100-80GB) ===")
print(f"KV bytes per token        : {kv_per_token:,} B  ({kv_per_token/1024:.1f} KB)")
print(f"Memory free for KV cache  : {gpu_mem_gb-weights_gb-overhead_gb} GB")
print(f"KV bytes per sequence     : {kv_per_seq/1e6:.1f} MB  (ctx={avg_ctx})")
print(f"Max concurrent sequences  : {max_seqs}")

# ---- Prefix sharing: agents share a long system prompt, stored ONCE ----
sys_prompt_tokens, unique_tokens = 1500, 548
shared_once = kv_per_token * sys_prompt_tokens
unique_each = kv_per_token * unique_tokens
seqs_with_caching = int((kv_budget_bytes - shared_once) // unique_each)
print("\n=== Prefix caching for an agent fleet ===")
print(f"Concurrent seqs, no cache : {max_seqs}")
print(f"Concurrent seqs, prefix-cached : {seqs_with_caching}  "
      f"({seqs_with_caching/max_seqs:.2f}x)")

# ---- Continuous (in-flight) batching vs static batching ----
def static(batch, lens, d_ms):
    steps = max(lens)                            # wait for the slowest sequence
    return sum(lens) / (steps*d_ms/1000), sum(lens) / (batch*steps)
def continuous(batch, lens, d_ms):
    return sum(lens) / ((sum(lens)/batch)*d_ms/1000), 1.0   # refill freed slots

random.seed(7)
batch, d_ms = 16, 12
lens = [random.choice([20, 30, 40, 350, 512]) for _ in range(batch)]
s_tps, s_u = static(batch, lens, d_ms)
c_tps, c_u = continuous(batch, lens, d_ms)
print("\n=== Continuous vs static batching (variable agent outputs) ===")
print(f"Static  batching: {s_tps:6.1f} tok/s   slot utilization {s_u:.2f}")
print(f"Continuous batch: {c_tps:6.1f} tok/s   slot utilization {c_u:.2f}")
print(f"Speedup from continuous batching : {c_tps/s_tps:.2f}x")

Code 40.7.1: Capacity sizing for the agent serving tier. The first block applies $M_{\text{KV}}$ to count concurrent sequences; the second prices the shared-system-prompt win of prefix caching; the third measures continuous against static batching on a length mix typical of agent traffic.

=== KV-cache sizing (13B fp16, A100-80GB) ===
KV bytes per token        : 819,200 B  (800.0 KB)
Memory free for KV cache  : 48 GB
KV bytes per sequence     : 1677.7 MB  (ctx=2048)
Max concurrent sequences  : 30

=== Prefix caching for an agent fleet ===
Concurrent seqs, no cache : 30
Concurrent seqs, prefix-cached : 112  (3.73x)

=== Continuous vs static batching (variable agent outputs) ===
Static  batching:  463.2 tok/s   slot utilization 0.35
Continuous batch: 1333.3 tok/s   slot utilization 1.00
Speedup from continuous batching : 2.88x

Output 40.7.1: One 80 GB GPU holding a 13B model in fp16 has 48 GB left for the cache, enough for thirty full 2048-token sequences. Storing the shared 1500-token system prompt once instead of per sequence raises that to 112 concurrent agents, a $3.7\times$ gain, and continuous batching lifts decode throughput $2.9\times$ over static batching on the variable-length agent mix.

The three numbers tell a coherent story for the fleet planner. Bare capacity is thirty sequences. Because the agents share a long system prompt, prefix caching nearly quadruples that to over a hundred, which is the single largest lever the agentic structure offers. And whatever the concurrency, continuous batching roughly triples the token rate the GPU actually delivers under the bursty, ragged traffic the agents produce. A replica is therefore not "one GPU equals one agent" but "one shard group equals a hundred-odd agents," and the replica count follows from dividing the fleet's peak concurrency by that figure.

Practical Example: The Agent Platform That Stopped Renting Tokens

Who: A platform engineer running a customer-support agent product on a metered hosted LLM API.

Situation: Each support ticket spawned eight to twelve model calls, all carrying the same 1800-token system prompt and tool catalog, and the monthly API bill scaled linearly with ticket volume.

Problem: The hosted endpoint billed the full shared prompt on every call and offered no control over batching, so most of the spend bought re-processing of identical prefix tokens.

Dilemma: Stay on the simple metered API and watch the bill grow with usage, or stand up a self-hosted vLLM fleet that demanded GPU operations but priced the shared prefix once.

Decision: They self-hosted, because the prefix-sharing and continuous-batching gains of Output 40.7.1 applied directly to their traffic shape, where the same prompt led every call.

How: They deployed a 13B model on two GPUs per replica with tensor parallelism, enabled automatic prefix caching, and placed three replicas behind a queue-depth autoscaler on their cluster (Chapter 33).

Result: Effective per-agent serving cost fell by roughly four-fifths at peak, because the GPUs that had been re-encoding the prompt now spent their cycles generating distinct tokens, and tail latency improved because requests no longer waited behind a metered queue.

Lesson: When traffic has heavy shared structure, owning the serving engine that exploits that structure beats renting one that bills as if every request were unique.

3. Prefill, Decode, and the Agentic Traffic Shape Advanced

A generation request has two phases with opposite resource profiles, and the agentic workload makes the split unusually sharp. Prefill processes the entire input prompt in one parallel pass to populate the KV cache; it is compute-bound and its cost grows with the prompt length. Decode then emits output tokens one at a time, each pass reading the whole cache to produce a single token; it is memory-bandwidth-bound and its cost grows with the output length. Agent calls are prefill-heavy: the prompt is a multi-thousand-token system message plus tool schemas plus history, while the output is often a short tool-call object of a few dozen tokens. A scheduler that lets a giant prefill monopolize the engine stalls every in-flight decode behind it, spiking the latency of the short calls that dominate agent traffic.

Two refinements address this. Chunked prefill splits a long prompt into pieces and interleaves them with ongoing decode steps, so a large prompt no longer freezes the batch. Prefill-decode disaggregation, the more aggressive design that Chapter 24 develops, runs the two phases on separate pools of GPUs: a prefill pool optimized for throughput populates the cache, ships the KV blocks to a decode pool optimized for low-latency token streaming, and each pool is sized to its own bottleneck. For an agent fleet, where every call pays a heavy prefill but most calls decode briefly, disaggregation lets the two pools scale independently rather than forcing one GPU profile to straddle both regimes. Prefix caching compounds the benefit: the shared system prompt is prefilled once and its KV blocks are reused across every agent that carries it, so the expensive phase is largely amortized away.

Library Shortcut: From a Python Object to an OpenAI-Compatible Fleet

vLLM packages continuous batching, PagedAttention, prefix caching, and tensor parallelism behind two interfaces. Offline, the LLM object runs a batched generation in a handful of lines; the engine schedules the batch continuously without any orchestration code from you:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          tensor_parallel_size=2,            # shard the model across 2 GPUs
          enable_prefix_caching=True,        # store the shared prompt once
          gpu_memory_utilization=0.92)       # leave headroom for the KV cache
params = SamplingParams(max_tokens=128, temperature=0.0)

system = "You are an agent. Tools: search(q), calc(expr). Rules: ..."  # shared prefix
prompts = [system + f"\nUser task {i}: ..." for i in range(256)]       # one fan-out
for out in llm.generate(prompts, params):     # continuous batching, automatic
    handle(out.outputs[0].text)

Online, one command exposes the same engine as a drop-in OpenAI-compatible server, so the agent code of Section 40.6 points its client at a local URL and changes nothing else:

vllm serve meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 --enable-prefix-caching \
    --max-num-seqs 128 --port 8000
# agents call http://localhost:8000/v1/chat/completions exactly as a hosted API

Code 40.7.2: The capacity model of Code 40.7.1 realized in production. The manual KV accounting, batch scheduling, prefix deduplication, and cross-GPU sharding all collapse into a constructor and a serve command; vLLM owns the scheduler, the paged allocator, and the tensor-parallel all-reduce internally.

4. A Model Too Big for One GPU: Tensor and Pipeline Parallelism Advanced

The capacity arithmetic of Section 2 assumed the weights fit on one card. Many serving models do not. A 70-billion-parameter model in half precision needs roughly 140 GB for weights alone, well beyond a single 80 GB GPU before any KV cache is allocated. The model must then be split across devices, and this is the model-parallel axis that Chapter 16 develops for training and that serving reuses almost unchanged. Tensor parallelism partitions each layer's weight matrices across $N$ GPUs, so each device holds and computes on a $1/N$ slice; the partial results are combined with an all-reduce within every layer, the same collective introduced in Chapter 4 and used for gradients in Chapter 15. If the per-GPU compute time of a layer is $c$ and the all-reduce of its activations costs $a$, the tensor-parallel time per layer is approximately

$$t_{\text{TP}}(N) \approx \frac{c}{N} + a(N),$$

so adding GPUs shrinks the compute term but inflates the communication term; tensor parallelism therefore stays inside a single node where the interconnect (NVLink) is fast, and the practical degree is small (two, four, or eight). Pipeline parallelism is the complementary split: it assigns whole contiguous layers to different GPUs and streams micro-batches through the stages, which moves far less data per step and so tolerates slower cross-node links, at the cost of pipeline fill-and-drain bubbles. A model too large even for one node combines them, tensor-parallel within each node and pipeline-parallel across nodes, the two-axis layout of Chapter 16. In Figure 40.7.1 each replica is a two-way tensor-parallel group; scaling the model to 70B would deepen each replica to eight-way tensor parallelism, or add a pipeline axis, without changing the fleet structure around it.

Thesis Thread: Distributed Twice Over, to Meet One Demand Curve

This serving tier is the book's thesis in its most concrete form. The model is distributed across GPUs because it does not fit on one (model parallelism, Chapter 16), and that whole shard group is then distributed across nodes because one copy cannot answer enough requests (data-parallel serving, Chapter 23). Two orthogonal axes of distribution, model and replica, are composed to satisfy a single agent fleet's demand. Neither alone suffices: model parallelism makes the model runnable, replication makes it scalable, and only together do they serve the workload at its latency budget and its cost target.

5. The Serving Fleet: Replicas, Router, and Autoscaling Intermediate

One replica, however well tuned, has a ceiling: the hundred-odd concurrent agents of Output 40.7.1, at some token rate. The agent fleet at peak wants more, so the replica is copied. Each replica is an identical shard group, and a router in front of them spreads incoming requests across the pool, the data-parallel serving pattern of Chapter 23. The router is not a plain round-robin balancer; because prefix caching makes a replica far cheaper for a request whose prefix it has already cached, a prefix-aware router sends agents sharing a system prompt to the same replica, raising the cache-hit rate across the fleet rather than scattering the shared prefix across every node.

Demand is not constant. Agent traffic is bursty: a batch job or a wave of users fans out into thousands of concurrent calls, then subsides. Autoscaling matches replica count to load, and the right signal is not CPU but the serving queue, the number of requests waiting and the time they wait. The cluster layer of Chapter 33 places each new replica on GPUs that satisfy its tensor-parallel group's topology (all $N$ shards on one node, on the fast interconnect), and scales the pool down when the queue drains. The metric that ties capacity to the service-level objective (SLO) is goodput: the rate of requests served within the latency target, not the raw throughput. Writing $\lambda$ for the offered request rate and $\Pr[\text{latency} \le \text{SLO}]$ for the fraction meeting the budget,

$$\text{goodput} = \lambda \cdot \Pr[\text{latency} \le \text{SLO}],$$

and the fleet is sized so that goodput tracks $\lambda$ across the bursts. Push concurrency too high on a replica and per-token latency rises until requests miss the SLO; goodput then falls even as raw throughput climbs, which is the quantitative signal to add a replica rather than pack the existing ones tighter. This goodput-under-SLO framing is the same one Chapter 24 uses to size LLM fleets, applied here to the agent workload, and it feeds directly into the cost model that Section 40.8 builds next.

Research Frontier: Disaggregated and Prefix-Aware Serving (2024 to 2026)

The serving engine is a fast-moving research target precisely because it gates the cost of every agent product. Prefill-decode disaggregation, proven at scale by DistServe (Zhong et al., 2024) and Mooncake (Qin et al., 2024), splits the two phases onto separate GPU pools and reports large goodput gains under tight SLOs by sizing each pool to its own bottleneck. KV-cache offloading and reuse push the shared cache into CPU memory or a distributed store so prefixes survive across requests and even across replicas, turning prefix caching into a fleet-wide rather than per-replica asset. Speculative decoding, where a small draft model proposes several tokens that the large model verifies in one pass, cuts decode latency for the short agent calls that dominate the traffic. vLLM has absorbed continuous batching, PagedAttention, chunked prefill, and speculative decoding into one engine; the open question the field is now working is how to schedule a heterogeneous fleet, mixing prefill nodes, decode nodes, and draft models, so that agentic goodput per dollar is maximized. We turn that per-dollar question into an explicit budget in Section 40.8.

Fun Note: The Prompt Everyone Carries

The shared system prompt that PagedAttention stores once is the serving tier's version of a group chat where everyone has read the same pinned message. Output 40.7.1 quadrupled capacity not by making any single agent faster but by noticing that a hundred agents were all dragging the identical 1500-token rulebook into the cache and agreeing to keep just one copy. The cheapest token to generate is the one you already generated for somebody else.

Exercise 40.7.1: Read the Capacity Off the Formula Conceptual

Using $M_{\text{KV}} = 2 L H_{\text{kv}} d_{\text{head}} T b$, explain in words why a model that uses grouped-query attention (which sets $H_{\text{kv}}$ to, say, one eighth of the number of query heads) can serve many more concurrent sequences than one with full multi-head attention, even though both have the same number of parameters. Then state which of the three levers in Section 1 (continuous batching, paged allocation, cache quantization) attacks $M_{\text{KV}}$ directly and which attacks utilization instead, and why a fleet planner needs both.

Exercise 40.7.2: Size the Fleet for the Agent Burst Coding

Extend Code 40.7.1 into a fleet sizer. Take a peak offered load of agent calls per second, an average output length, and a target per-call latency, and compute (a) the concurrent sequences each replica supports with prefix caching enabled, (b) the token rate per replica from the continuous-batching model, and (c) the number of replicas needed so that goodput keeps up with the offered load. Then plot goodput against concurrency per replica and mark the point where pushing concurrency higher starts to violate the SLO. Discuss how enabling prefix caching shifts that point.

Exercise 40.7.3: Tensor Parallel or More Replicas? Analysis

A 70B model in fp16 needs about 140 GB of weights and must be sharded. Compare two fleet layouts that use the same eight GPUs: four replicas of two-way tensor parallelism versus two replicas of four-way tensor parallelism (assume the model fits in both, ignoring KV cache for the weight question, then revisit). Using $t_{\text{TP}}(N) \approx c/N + a(N)$, argue which layout gives lower per-token latency and which gives higher aggregate goodput, and explain why the answer depends on whether the binding constraint is single-call latency or fleet throughput. State why tensor parallelism is kept within a node, citing the all-reduce cost.