Part V: Distributed Inference and Serving
Chapter 24: Distributed LLM Serving

Prefix Caching and Multi-LoRA Fleets

"Every request handed me the same thousand-token system prompt, and every request asked me to read it again from scratch. The day someone let me remember it, I cut my morning workload by ninety percent and finally had time for the actual questions."

A KV Cache That Learned to Share
Big Picture

Real LLM platforms are not a stream of unrelated requests; they are a stream of requests that share enormous amounts of structure, and a serving fleet that recognizes the sharing serves far more traffic on the same hardware. Two kinds of sharing dominate production. Requests share prompt prefixes: the same long system prompt, the same few-shot template, the same retrieved document appears at the front of thousands of requests, and computing its key-value cache once instead of once per request collapses the prefill cost. Requests also share a base model: a multi-tenant platform may host thousands of fine-tuned variants that differ only by a small low-rank adapter, and holding one base in GPU memory while swapping tiny adapters per request gives every tenant a custom model at almost no marginal memory. This section turns both forms of sharing into fleet-level mechanisms: a distributed prefix-KV store with prefix-affinity routing, and multi-LoRA batching that serves different adapters in one batch over one base.

The previous section built request scheduling and continuous batching that keep a multi-node serving fleet busy at the token level. Those mechanisms treat each request as independent work to be packed efficiently. This section exploits the opposite fact: across a real workload, requests are deeply not independent. They overlap in their inputs (a shared prefix) and in the model they want (a shared base with a per-tenant tweak). Both overlaps are pure waste when ignored and pure savings when exploited, and at fleet scale the savings are the difference between a platform that is economical and one that is not. We take the two optimizations in turn, prefix caching first, then multi-LoRA serving, because they rest on the same idea applied to two different shared resources: the KV cache and the model weights. Figure 24.7.1 places the two side by side: a shared prefix's KV cache computed once for many requests on the left, and one base model serving a batch of different LoRA adapters on the right.

Prefix caching: one KV cache, many requests Shared prefix KV system prompt + few-shot (cached once) Request A unique suffix Request B unique suffix Request C unique suffix prefill skipped for the shared part; only suffixes computed Multi-LoRA: one base, many adapters Shared base model (in GPU memory) 14 GB, loaded once req 1 adapter A 17 MB req 2 adapter B 17 MB req 3 adapter C 17 MB One batched forward pass base GEMM + per-request adapter delta
Figure 24.7.1: The two forms of sharing this section exploits. Left: three requests carry the same long prefix, so the fleet computes that prefix's KV cache once and each request prefills only its short unique suffix. Right: a batch mixes requests for three different tenants, each with its own small LoRA adapter, served by a single base model in GPU memory with the adapter applied per request inside one forward pass.

1. Why Prefixes Repeat, and What Caching Them Saves Beginner

An autoregressive transformer answers a request in two phases. Prefill reads the whole prompt and builds a key-value (KV) cache for every token, a cost that grows with the prompt length; decode then generates the answer one token at a time, reading that cache. Chapter 22 established the per-node economics of that KV cache, and Section 24.4 distributed it across machines as a paged store. The observation that drives prefix caching is simple: in production, the front of the prompt is very often identical across many requests. A chatbot prepends the same multi-thousand-token system prompt to every conversation. A classification service prepends the same few-shot template to every input. A retrieval-augmented generation (RAG) pipeline places the same retrieved document at the front of every question about that document. In each case the KV cache of the shared prefix is bit-for-bit the same no matter which request computed it.

If the prefix is shared, computing its KV cache once and reusing it skips the prefill for that segment entirely on every later request. Let a prompt have $L$ tokens of which a fraction $f$ belong to a cached shared prefix. Prefill compute scales with the number of tokens actually run through the network, so the work drops from $L$ to $(1-f)L$ tokens, a saved fraction of exactly $f$. Time to first token (TTFT), the latency a user feels before any output appears, is dominated by prefill, so it shrinks in the same proportion. The savings are not marginal: when a 4000-token RAG context is shared and only a 200-token question is unique, $f \approx 0.95$ and the fleet does one twentieth of the prefill work it would do without caching.

Key Insight: Shared Prefixes Turn Prefill From Per-Request Into Per-Prefix Work

Without prefix caching, prefill cost is paid once per request and scales with the full prompt length. With it, the cost of a shared segment is paid once per distinct prefix and amortized over every request that reuses it. A workload of $R$ requests that all share one long prefix moves from $O(R \cdot L)$ prefill tokens to $O(L + R \cdot L_{\text{unique}})$. The economic consequence is that the marginal cost of an extra request collapses to the cost of its unique suffix, which is why RAG and agent platforms, where the shared context dwarfs the unique query, gain the most.

2. From One Cache to a Fleet: Prefix-Affinity Routing Intermediate

On a single replica, reuse is automatic: the cache is right there. At fleet scale the problem is placement. A cached prefix lives in the GPU memory of whichever replica computed it, and the savings only materialize if later requests carrying that same prefix are routed to that same replica. Sending them anywhere else forces a recomputation and wastes the cache. The fleet therefore needs two cooperating pieces: a shared or distributed prefix-KV store that records which prefixes are cached where (the paged KV store of Section 24.4 is exactly this substrate), and a router that performs prefix-affinity routing, hashing the incoming prompt's prefix and steering the request to the replica that already holds its cache.

This is the same load-balancing-versus-locality tension that Section 23.2 introduced for request routing, now sharpened by the cache. A purely load-balancing router spreads work evenly but destroys cache locality; a purely affinity-based router maximizes reuse but can pile all traffic for a hot prefix onto one replica. Production routers blend the two: prefer the cache-holding replica, but spill to a less-loaded peer (replicating the hot prefix there) when the affinity target saturates. Because building and matching prefixes by hand is tedious and error-prone, the most influential recent system makes prefix sharing automatic.

Research Frontier: RadixAttention and Automatic Prefix Sharing (2024 to 2026)

RadixAttention, introduced with SGLang (Zheng et al., 2024), organizes all cached KV state in a radix tree keyed by token sequences, so any new request automatically reuses the longest already-cached prefix without the programmer marking shared segments at all. A least-recently-used eviction policy over the tree keeps hot prefixes resident, and the structure naturally captures branching workloads (an agent exploring several continuations of one context, or a few-shot template shared by many distinct queries). vLLM shipped automatic prefix caching on the same principle, and the idea has since extended to cross-request KV reuse across nodes and to disk-and-CPU offload tiers that hold prefixes too large for GPU memory. The frontier now studies KV cache sharing across a whole datacenter, including content-addressed prefix stores that let any replica fetch a remote prefix's cache rather than recompute it, tying prefix caching directly to the distributed KV store of Section 24.4.

Fun Note: The Cache That Pays Rent

A long system prompt used to be pure overhead, dead weight prepended to every call and recomputed every time. Prefix caching inverts its economics: the longer and more widely shared the prompt, the more a single cached copy earns back across the requests that reuse it. The thousand-token persona that once looked like a tax becomes the cheapest part of the request, paid for once and then free.

3. Multi-LoRA: Thousands of Models Over One Base Intermediate

The second form of sharing is in the weights. Section 19.7 introduced low-rank adaptation (LoRA): instead of fine-tuning all of a base model's weights, you freeze the base and learn a small low-rank update $\Delta W = BA$ for selected weight matrices, where $A \in \mathbb{R}^{r \times h}$ and $B \in \mathbb{R}^{h \times r}$ with rank $r \ll h$. The adapted weight is $W + BA$. The adapter is tiny: for a 7-billion-parameter base, a rank-16 adapter on the attention projections is roughly seventeen megabytes against fourteen gigabytes for the base, a thousandfold smaller.

A multi-tenant platform exploits this directly. Suppose a company serves one fine-tuned variant per customer, or a writing tool offers thousands of style adapters. Loading a separate full deployment per adapter would replicate the entire base model $N$ times, an impossible memory bill at $N$ in the thousands. Multi-LoRA serving keeps one base in GPU memory and stores only the small adapters, applying the right one per request. The total memory is the base plus $N$ adapters, $M_{\text{multi}} = M_{\text{base}} + N \cdot M_{\text{adapter}}$, against $M_{\text{sep}} = N \cdot M_{\text{base}}$ for separate deployments. The saved fraction is

$$1 - \frac{M_{\text{base}} + N \cdot M_{\text{adapter}}}{N \cdot M_{\text{base}}} = 1 - \frac{1}{N} - \frac{M_{\text{adapter}}}{M_{\text{base}}},$$

which approaches $1$ as $N$ grows because $M_{\text{adapter}} / M_{\text{base}} \approx 10^{-3}$. Each new tenant costs almost nothing. This is the multi-model serving problem of Section 23.5 in its most favorable form: the models are not arbitrary and unrelated but share a common base, so they cohabit one GPU instead of contending for many.

The mechanism that makes this fast is batching requests for different adapters together. A naive implementation would group requests by adapter and run one batch per adapter, wrecking the batching efficiency that Section 24.6 worked to build. The advance of S-LoRA (Sheng et al., 2023) and Punica (Chen et al., 2023) is a batched kernel that computes the shared base matrix multiply once for the whole heterogeneous batch and then adds each request's low-rank delta $B_j A_j x_j$ using its own adapter, so a single forward pass serves a batch in which every request belongs to a different tenant. The base GEMM is amortized across all requests; only the small per-request adapter math differs.

Thesis Thread: Sharing Is the Fleet-Scale Form of Efficiency

Scale-up made one model fast on one node (the quantization and KV-cache work of Chapter 22). Scale-out multiplies a different lever: it lets one cached prefix and one base model serve an entire fleet of requests and tenants. Prefix-affinity routing and multi-LoRA batching are not single-node tricks; they are fleet-level reorganizations that only pay off when a router, a distributed KV store, and a batching scheduler cooperate across machines. The recurring pattern of this book reappears here, that the biggest wins come from recognizing what is shared across the cluster and computing it once, the same instinct that made the all-reduce of Section 1.1 exact rather than redundant.

4. Modeling the Two Savings Intermediate

The code below makes both effects concrete with arithmetic that mirrors the formulas above. Part A sweeps the shared-prefix fraction $f$ and reports the unique tokens that must still be prefilled, the resulting TTFT, and the fraction of prefill compute saved. Part B sweeps the number of tenants $N$ and contrasts the GPU memory of $N$ separate base deployments against one shared base plus $N$ small adapters. Both parts use only the per-token and per-byte costs introduced above, so the numbers are traceable to the model rather than to any framework.

import numpy as np

# ---------- Part A: prefix caching ----------
# A request has a prompt of L_total tokens; a fraction f is a SHARED prefix
# (system prompt / few-shot template / RAG document) whose KV cache is already
# in the fleet, so only the unique (1 - f) * L_total tokens must be prefilled.
L_total = 4096                      # tokens in the full prompt
prefill_tok_per_ms = 8.0           # tokens prefilled per millisecond on one replica
queue_ms = 12.0                    # fixed scheduling/queue overhead per request

print("Part A: prefix caching as shared-prefix fraction grows")
print(f"{'shared f':>9} {'unique tok':>11} {'TTFT ms':>9} {'compute saved':>14}")
base_tokens = L_total              # f = 0 baseline: prefill everything
base_ttft = queue_ms + base_tokens / prefill_tok_per_ms
for f in [0.0, 0.25, 0.50, 0.75, 0.90, 0.95]:
    unique = int(round((1.0 - f) * L_total))
    ttft = queue_ms + unique / prefill_tok_per_ms
    saved = 1.0 - (unique / base_tokens)        # fraction of prefill FLOPs skipped
    print(f"{f:>9.2f} {unique:>11d} {ttft:>9.1f} {saved:>13.0%}")
speedup = base_ttft / (queue_ms + (1 - 0.90) * L_total / prefill_tok_per_ms)
print(f"TTFT at f=0.90 is {speedup:.1f}x faster than the no-cache baseline")

# ---------- Part B: multi-LoRA memory ----------
# Separate deployments load the FULL base model N times. Multi-LoRA keeps ONE
# base in GPU memory and adds only the small per-adapter low-rank matrices.
print("\nPart B: serving N LoRA adapters, separate vs multi-LoRA over one base")
base_gb = 14.0                     # 7B base model in fp16
h, L_layers, r = 4096, 32, 16      # hidden size, layers, LoRA rank
# LoRA on attention q,v: two projections, each two matrices (A,B) per layer, fp16
adapter_params = L_layers * 2 * (2 * r * h)
adapter_gb = adapter_params * 2 / 1e9          # 2 bytes per fp16 param
print(f"base model: {base_gb:.1f} GB   one adapter: {adapter_gb*1000:.1f} MB (rank r={r})")
print(f"{'N adapters':>11} {'separate GB':>12} {'multi-LoRA GB':>14} {'memory saved':>13}")
for N in [1, 10, 100, 1000]:
    separate = N * base_gb                      # N full copies of the base
    multi = base_gb + N * adapter_gb            # one base + N tiny adapters
    saved = 1.0 - multi / separate
    print(f"{N:>11d} {separate:>12.1f} {multi:>14.2f} {saved:>12.1%}")
Code 24.7.1: A pure-Python model of both fleet savings: prefill compute and TTFT as the shared-prefix fraction $f$ grows (Part A), and GPU memory for $N$ separate base deployments versus one shared base plus $N$ LoRA adapters (Part B). The adapter size follows directly from the rank-$r$ low-rank factorization of Section 19.7.
Part A: prefix caching as shared-prefix fraction grows
 shared f  unique tok   TTFT ms  compute saved
     0.00        4096     524.0            0%
     0.25        3072     396.0           25%
     0.50        2048     268.0           50%
     0.75        1024     140.0           75%
     0.90         410      63.2           90%
     0.95         205      37.6           95%
TTFT at f=0.90 is 8.3x faster than the no-cache baseline

Part B: serving N LoRA adapters, separate vs multi-LoRA over one base
base model: 14.0 GB   one adapter: 16.8 MB (rank r=16)
 N adapters  separate GB  multi-LoRA GB  memory saved
          1         14.0          14.02        -0.1%
         10        140.0          14.17        89.9%
        100       1400.0          15.68        98.9%
       1000      14000.0          30.78        99.8%
Output 24.7.1: A 90 percent shared prefix cuts TTFT by 8.3 times (Part A), and serving a thousand tenants over one shared base needs 30.8 GB instead of 14 TB, a 99.8 percent memory saving (Part B). At $N = 1$ multi-LoRA is marginally worse because the adapter is pure overhead with nothing to amortize; the saving turns decisively positive the moment a second tenant arrives.

Two features of Output 24.7.1 are worth dwelling on. In Part A the saving equals $f$ exactly, confirming the model from Section 1, and TTFT falls from 524 ms to 63 ms once a 4000-token context is mostly shared, the kind of latency drop a user feels immediately. In Part B the separate-deployment column reaches fourteen terabytes of GPU memory for a thousand tenants, which no realistic fleet can hold, while the multi-LoRA column stays under thirty-one gigabytes, comfortably inside two or three GPUs. The negative saving at $N = 1$ is the honest edge case: when there is nothing to share, sharing machinery is overhead. Every cell of the table is reproduced by the arithmetic in Code 24.7.1, which is why a from-scratch model is enough to expose the scaling that makes these optimizations mandatory rather than optional on real platforms.

Practical Example: The RAG Platform That Stopped Re-Reading Its Documents

Who: An ML platform engineer running a document-question-answering service for an enterprise customer base.

Situation: Each request stuffed a retrieved 3500-token document plus a 600-token instruction block in front of a short user question, and traffic was bursty: many users asked different questions about the same handful of popular documents within seconds of each other.

Problem: Prefill dominated cost and latency. The fleet prefilled the same documents thousands of times an hour, TTFT sat above half a second, and GPU hours were budgeted almost entirely for re-reading text the system had already read.

Dilemma: Buy more GPUs to brute-force the prefill, simple but linear in cost, or restructure the fleet to recognize that most of each prompt was shared and route for cache reuse, cheaper but requiring an affinity-aware router and a shared prefix store.

Decision: They enabled automatic prefix caching in the engine and added prefix-affinity routing on top of the existing load balancer, hashing the document-plus-instruction prefix and steering same-document requests to the replica already holding its KV cache.

How: They switched to an engine with RadixAttention-style automatic prefix caching (see the Library Shortcut), set the router to prefer the cache-holding replica and spill to the least-loaded peer only past a saturation threshold, and kept the paged KV store of Section 24.4 as the substrate so hot prefixes stayed resident.

Result: With a measured shared-prefix fraction near 0.9, prefill compute fell by roughly the predicted ninety percent, TTFT dropped from above 500 ms to under 80 ms on cache hits, and the same hardware absorbed several times the request rate, mirroring Part A of Output 24.7.1.

Lesson: When the shared part of the prompt dwarfs the unique part, caching the prefix and routing for affinity is the highest-leverage change available, and it costs router logic rather than GPUs.

Library Shortcut: SGLang and vLLM Do Prefix Caching and Multi-LoRA for You

Code 24.7.1 modeled the savings; production engines deliver them with a flag. SGLang turns on RadixAttention automatic prefix sharing by default, and both SGLang and vLLM serve many LoRA adapters over one base with per-request adapter selection, batching different adapters together:

# vLLM: one base model, automatic prefix caching + multi-LoRA, a few lines.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,        # reuse shared-prefix KV across requests
    enable_lora=True, max_loras=64,    # hold many adapters over one base in GPU
)

# Each request picks its tenant's adapter; the engine batches them together
# and reuses any cached prefix automatically. No manual cache or routing code.
out = llm.generate(
    prompts=[long_shared_system_prompt + user_question],
    sampling_params=SamplingParams(max_tokens=128),
    lora_request=LoRARequest("tenant_42", 42, "/adapters/tenant_42"),
)
Code 24.7.2: The hundreds of lines of radix-tree cache management, prefix hashing, adapter paging, and batched LoRA kernels collapse to two constructor flags and a per-request adapter handle. The engine handles cache eviction, prefix matching, and the S-LoRA-style batched delta internally; SGLang exposes the same capabilities with RadixAttention on by default.

5. When Sharing Does Not Pay Advanced

Both optimizations have a regime where the machinery costs more than it returns, and a fleet that applies them blindly can lose. Prefix caching only helps when prefixes actually repeat. A workload of all-unique prompts (every user pastes a different document, no template is shared) has $f \approx 0$, and the cache then consumes GPU memory and bookkeeping for prefixes that are never reused, evicting useful KV blocks in the process. The router's affinity logic adds latency to every request for a reuse that never comes. The remedy is to measure the real prefix-hit rate and enable caching where it is high, which is exactly why the engines expose it as a toggle rather than forcing it on.

Multi-LoRA has its own limits. The $N = 1$ row of Output 24.7.1 already showed that a single tenant gains nothing; the saving needs scale. More subtly, very high adapter ranks erode the assumption that adapters are tiny: a rank-256 adapter is sixteen times larger than the rank-16 one modeled here, and at large $N$ the $N \cdot M_{\text{adapter}}$ term stops being negligible, so the per-tenant cost is no longer near zero. There is also a throughput tax: the batched adapter kernel does extra low-rank work per request on top of the shared base GEMM, so a batch of many distinct adapters runs somewhat slower than a batch of identical ones, the price of the per-request flexibility. The MoE serving of Section 24.8, which we turn to next, faces a related batching question from a different direction: not many small adapters over one dense base, but a few large experts selected per token across the fleet.

Exercise 24.7.1: Where the Prefix Pays Off Conceptual

For each workload, estimate the shared-prefix fraction $f$ and state whether prefix-affinity routing is worth its complexity, justifying with the saved-fraction model from Section 1: (a) a customer-support bot with a 1500-token system persona and 50-token user turns; (b) a code-search tool where every query is a different, unique code snippet with no shared template; (c) a RAG service answering many questions about a small library of frequently accessed documents. For the case where caching does not pay, explain what the cache costs the fleet even when it is never reused.

Exercise 24.7.2: The Crossover Tenant Count Coding

Extend Part B of Code 24.7.1 to find, for a given adapter rank $r$, the smallest number of tenants $N$ at which multi-LoRA saves at least 95 percent of memory versus separate deployments. Then sweep $r \in \{8, 16, 64, 256\}$ and plot or tabulate how that crossover $N$ moves as the adapter grows. Using the saved-fraction formula $1 - 1/N - M_{\text{adapter}}/M_{\text{base}}$, explain analytically why larger $r$ pushes the crossover higher and what adapter rank would make multi-LoRA pointless even at $N = 1000$.

Exercise 24.7.3: Affinity Versus Balance Analysis

A fleet of 4 replicas receives requests for 3 prefixes whose traffic shares are 0.70, 0.20, and 0.10. A pure prefix-affinity router pins each prefix to one replica. Compute the per-replica load and identify the hot spot. Now design a blended policy that replicates the 0.70 prefix onto a second replica past a saturation threshold, and estimate the new load balance and the extra KV memory that replication costs. Argue, with reference to the routing tension of Section 23.2, why neither pure affinity nor pure load balancing is optimal and what signal the router should use to choose between them per request.