"They handed me a load balancer that was perfect for web traffic. It spread my requests so evenly that not one of them ever met a friend, and my GPU sat there at four percent, lonely and underfed."
A Serving Replica That Just Wanted a Full Batch
A model server looks like a web server (it takes requests over the network and returns responses), but three properties of the accelerator underneath it break the assumptions that make web serving simple: the GPU is fastest when it processes many requests together, each replica is enormous and slow to start, and a generation request is stateful, pinned to the replica that holds its key-value cache. A stateless web load balancer wants to spread requests instantly across replicas and spin up new replicas on demand. Do exactly that to a model fleet and you starve the GPU of batches, pay seconds-to-minutes of cold start on every scale event, and shatter the state that multi-step generation depends on. This chapter builds the fleet that respects those three properties. This first section names them precisely, shows in one runnable experiment why instant spreading is the wrong reflex, and previews the AI-specific machinery (batch-aware routing, GPU autoscaling, warm pools) that the rest of the chapter develops.
Chapter 22 took a single accelerator and made it as efficient as it could be: quantized weights, a paged key-value cache, fused attention kernels, the per-node craft of getting the most tokens per second out of one box. That work is the foundation, and this part assumes it. But one optimized box is still one box, with a finite memory ceiling, a finite throughput ceiling, and a finite probability of failing at the worst moment. The moment a deployed model must answer more requests than one accelerator can clear, or must stay up while individual machines crash, the problem becomes distributed: a fleet of replicas behind a router, the genuinely distributed serving problem this chapter is about. Chapter 22 built the node; this chapter builds the fleet around it.
The tempting shortcut is to treat that fleet as a solved problem. We already know how to put many identical servers behind a load balancer and serve millions of stateless HTTP requests; surely a model is just another stateless service with a heavier handle_request function. That intuition is wrong in three specific, consequential ways, and naming them is the work of this section. Each one invalidates a design choice that web serving makes almost without thinking, and each one forces a replacement that the rest of the chapter develops in full.
1. The Three Properties That Break Web-Serving Assumptions Beginner
Standard web serving rests on a small set of assumptions that hold so reliably for HTTP traffic that they fade into the background. Requests are independent and stateless, so any replica can serve any request. Replicas are cheap and start in milliseconds, so the autoscaler can add or remove them freely in response to load. Latency is dominated by the work of one request, so spreading requests as thinly as possible across replicas minimizes the time each one waits. A model fleet violates all three. Table 23.1.1 lays the contrast out directly; the three subsections that follow take each row in turn.
| Web-serving assumption | Why model serving breaks it | Replacement design (this chapter) |
|---|---|---|
| Spread each request instantly to minimize its wait | The GPU is fastest when requests are batched together, so the system wants to group, not spread | Batch-aware routing (23.2), online vs batch (23.3) |
| Replicas are cheap and start in milliseconds | A replica holds a multi-GB model and takes seconds to minutes to load | GPU autoscaling on utilization (23.4), warm pools and cold-start control (23.6) |
| Requests are stateless, so any replica serves any request | Generation keeps a per-request KV cache, pinning a multi-step request to one replica | Session-aware routing and failover (23.2, 23.7) |
1.1 Property one: the GPU wants requests grouped, not spread
An accelerator runs a neural network as a sequence of large matrix operations. Those operations are bandwidth-bound and latency-bound at small sizes: launching a forward pass for a single request pays a large fixed cost (kernel launch, weight reads, pipeline fill) and then does very little useful work. Feed the same pass a batch of requests and the fixed cost is amortized across all of them, while the per-request marginal cost stays tiny. The result is that throughput climbs steeply with batch size before saturating, so the system maximizes throughput by accumulating requests and running them together, the exact opposite of the web instinct to dispatch each request the instant it arrives. This is the latency-throughput tension in its sharpest form: a small, deliberate wait to fill a batch buys a large gain in throughput. We met that tension as a general scaling phenomenon in Section 3.4; here it stops being a curve on a slide and becomes the central routing decision of the whole serving system.
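The amortization argument can be made concrete with a few lines of arithmetic. The sketch below assumes a deliberately simple cost model, a fixed overhead per forward pass plus a small cost per request in the batch. The values 8 ms and 0.5 ms are illustrative assumptions, not measurements of any real accelerator (they match the numbers the experiment later in this section uses), and the function names are hypothetical.

```python
# Illustrative cost model (assumed numbers, not measurements):
# one forward pass over b requests costs BASE_MS + PER_MS * b milliseconds.
BASE_MS = 8.0   # assumed fixed overhead per pass
PER_MS = 0.5    # assumed marginal cost per request in the batch


def pass_time_ms(batch_size: int) -> float:
    """Time for one forward pass over `batch_size` requests."""
    return BASE_MS + PER_MS * batch_size


def throughput_rps(batch_size: int) -> float:
    """Requests cleared per second if every pass runs at this batch size."""
    return 1000.0 * batch_size / pass_time_ms(batch_size)


if __name__ == "__main__":
    print(f"{'batch':>5} {'pass ms':>8} {'req/s':>8} {'ms/request':>11}")
    for b in (1, 2, 4, 8, 16, 32):
        t = pass_time_ms(b)
        print(f"{b:>5} {t:>8.1f} {throughput_rps(b):>8.1f} {t / b:>11.2f}")
```

Under these assumptions the per-request cost falls from 8.5 ms at a batch of one toward the 0.5 ms marginal cost as the batch grows, so throughput rises steeply at first and then flattens. That is the saturation the paragraph above describes.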
1.2 Property two: replicas are huge and slow to start
A web replica is a few megabytes of code that forks in milliseconds, which is why autoscalers add and remove them freely. A model replica must load weights that are gigabytes to hundreds of gigabytes, pull them across the network or off disk, move them onto the accelerator, and warm up the kernels and the allocator before it can serve a single token. That startup is measured in seconds for a small model and minutes for a large one. You therefore cannot spin up a replica per request, and you cannot let an autoscaler react to a traffic spike by booting cold replicas that arrive after the spike is over. The cost and slowness of a replica forces autoscaling to be predictive and conservative, and forces a pool of pre-warmed replicas to absorb bursts, the subjects of Section 23.4 and Section 23.6.
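A back-of-the-envelope estimate shows why startup time grows with model size and why a cold replica can miss the spike it was booted for. The sketch assumes that loading is dominated by moving the weights over the slowest link, plus a fixed warm-up cost. The model sizes, the 2 GB/s effective bandwidth, the 5 s warm-up, and the 30 s spike are hypothetical round numbers chosen for illustration, not measurements.

```python
def cold_start_s(weights_gb: float, bandwidth_gb_per_s: float,
                 warmup_s: float = 5.0) -> float:
    """Rough cold-start estimate: weight transfer time plus a fixed warm-up."""
    return weights_gb / bandwidth_gb_per_s + warmup_s


def replica_arrives_in_time(spike_duration_s: float, start_s: float) -> bool:
    """A cold replica only helps if it is ready before the spike ends."""
    return start_s < spike_duration_s


if __name__ == "__main__":
    # Hypothetical model sizes and an assumed 2 GB/s effective load bandwidth.
    for name, size_gb in (("small", 2.0), ("medium", 14.0), ("large", 140.0)):
        start = cold_start_s(size_gb, bandwidth_gb_per_s=2.0)
        helps = replica_arrives_in_time(spike_duration_s=30.0, start_s=start)
        print(f"{name:<6} {size_gb:>6.0f} GB  cold start ~{start:>5.1f} s  "
              f"ready before a 30 s spike ends: {helps}")
```

With these assumed figures the largest model needs over a minute to come up, so a reactive autoscaler would deliver it after a 30-second spike has already ended.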
1.3 Property three: generation is stateful through the KV cache
A classification model returns one answer per request and keeps nothing between requests; it is genuinely stateless, and web-style routing handles it well. Autoregressive generation is different. To produce a sequence token by token, the model keeps a key-value cache, the stored attention keys and values for every token generated so far, on the accelerator, and each new token reads and extends it. That cache, whose per-node economics Chapter 22 dissected in Section 22.5, ties a multi-step generation request to the specific replica that holds it. A stateless load balancer that re-routes the next token of an in-flight generation to a different replica finds no cache there and either fails or silently recomputes the entire history. Statefulness means the router must keep a session on its replica, and means failover has to reckon with lost cache, which Section 23.2 and Section 23.7 address directly.
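The cost of ignoring that state can be counted in a toy model. The sketch below compares a stateless round-robin router with a session-affine one that pins each session to a replica by a stable hash. The hashing scheme, the fleet size, and the workload are assumptions made for illustration; they are not the routing design this chapter develops in Section 23.2.

```python
import zlib
from itertools import cycle

REPLICAS = 4  # hypothetical fleet size


def round_robin_router():
    """Stateless routing: each step goes to the next replica in turn."""
    order = cycle(range(REPLICAS))
    return lambda session_id: next(order)


def affine_router():
    """Session-affine routing: a stable hash pins a session to one replica."""
    return lambda session_id: zlib.crc32(session_id.encode()) % REPLICAS


def count_recomputes(route, steps):
    """Count steps that land on a replica lacking the session's KV cache.

    cache_home[s] is the replica currently holding session s's cache.
    The first step of a session is its prefill, so it is not counted.
    """
    cache_home, recomputes = {}, 0
    for session_id in steps:
        replica = route(session_id)
        if session_id in cache_home and cache_home[session_id] != replica:
            recomputes += 1  # history must be rebuilt on this replica
        cache_home[session_id] = replica
    return recomputes


if __name__ == "__main__":
    # Toy workload: three interleaved sessions, five generation steps each.
    steps = [f"session-{s}" for _ in range(5) for s in range(3)]
    print("round-robin recomputes:   ", count_recomputes(round_robin_router(), steps))
    print("session-affine recomputes:", count_recomputes(affine_router(), steps))
```

In this toy workload every follow-up step under round-robin lands on a replica without the cache, while the affine router never moves a session. A real router also has to handle replica failure and load imbalance, which a fixed hash alone does not address.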
Web serving spreads requests instantly across stateless, cheap, interchangeable replicas. Model serving must do the reverse on all three counts: group requests so the GPU runs full batches, treat replicas as expensive and slow-to-start so scaling is predictive and warm-pooled, and respect per-request state so a session stays on the replica that holds its KV cache. Every later design in this chapter is a consequence of one of these three inversions. When a serving system surprises you, ask which of the three it is fighting.
2. Watching the Wrong Reflex Fail: Spread vs Batch Intermediate
The clearest way to feel the first property is to simulate two routers against the same stream of requests and a simple model of one accelerator. Our accelerator runs a batch of requests in a single forward pass whose time is a fixed overhead plus a small marginal cost per request: a pass over $b$ requests costs $\text{BASE} + \text{PER}\cdot b$ milliseconds but serves all $b$ at once. With $\text{BASE}=8$ ms and $\text{PER}=0.5$ ms, one request alone costs $8.5$ ms (a throughput ceiling near $118$ requests per second), while a full batch of $32$ costs $24$ ms yet clears $32$ requests: thirty-two times the work for less than three times the time. The web-style router dispatches each request the instant it arrives, alone, so every pass is a batch of one. The batch-aware router holds arrivals for a short window (here $10$ ms, or until the batch is full at $32$) and then fires one pass for the whole group. The diagram in Figure 23.1.1 contrasts the two before we run them.
The code below implements both routers and the accelerator model, then runs them against two workloads drawn from the same generator: a short overload burst whose offered rate far exceeds one replica's instant-dispatch ceiling, and a light stream whose rate sits well below it. Reporting throughput together with the median (p50) and tail (p99) latency for each makes the tradeoff visible from both sides.
```python
import random

# A GPU runs a BATCH in one forward pass: fixed overhead BASE plus a small
# marginal cost PER per request. A pass over b requests costs BASE + PER*b ms
# but serves all b at once, so 32-in-one-pass beats 32 single-request passes.
BASE, PER, MAX_BATCH = 8.0, 0.5, 32

def pass_time(b): return BASE + PER * b

def web_style(arr):                      # dispatch each request instantly, alone
    end, lat = 0.0, []
    for t in arr:
        start = max(t, end)              # queue behind the previous pass
        end = start + pass_time(1)       # one request per forward pass
        lat.append(end - t)
    return end, lat

WINDOW = 10.0                            # ms the router waits to fill a batch

def batch_aware(arr):                    # accumulate a window, then one pass
    end, lat, i = 0.0, [], 0
    while i < len(arr):
        open_t, j = arr[i], i
        while j < len(arr) and arr[j] <= open_t + WINDOW and (j - i) < MAX_BATCH:
            j += 1                       # admit arrivals within the window
        batch = arr[i:j]
        # Simplification: fire once the last admitted request has arrived and
        # the GPU is free, rather than waiting out the full window.
        start = max(batch[-1], end)
        end = start + pass_time(len(batch))
        lat.extend(end - t for t in batch)
        i = j
    return end, lat

def row(name, mk, lat):
    lat = sorted(lat); n = len(lat)
    print(f" {name:<12} throughput={1000.0*n/mk:8.1f} req/s "
          f"p50={lat[n//2]:8.2f} ms p99={lat[int(0.99*n)]:8.2f} ms")

def scenario(label, span_ms, n=1500, seed=0):
    random.seed(seed)
    arr = sorted(random.uniform(0, span_ms) for _ in range(n))
    print(f"{label} ({n} requests over {span_ms/1000:.0f} s, "
          f"offered {1000.0*n/span_ms:.0f} req/s):")
    mk_w, lat_w = web_style(arr); mk_b, lat_b = batch_aware(arr)
    row("web-style", mk_w, lat_w); row("batch-aware", mk_b, lat_b)
    print(f" -> batch-aware clears {(n/mk_b)/(n/mk_w):.1f}x the throughput\n")

scenario("OVERLOAD", span_ms=3000)       # burst above one replica's ceiling
scenario("LIGHT LOAD", span_ms=20000)    # stream well below the ceiling
```
Code 23.1.1: web_style dispatches each request the instant it arrives (a batch of one); batch_aware holds a short window and runs the accumulated group in a single pass. The same arrival generator feeds an overload burst and a light stream so the tradeoff is visible from both regimes.

```text
OVERLOAD (1500 requests over 3 s, offered 500 req/s):
 web-style    throughput=   117.6 req/s p50= 4876.94 ms p99= 9666.34 ms
 batch-aware  throughput=   497.4 req/s p50=   17.10 ms p99=   27.64 ms
 -> batch-aware clears 4.2x the throughput

LIGHT LOAD (1500 requests over 20 s, offered 75 req/s):
 web-style    throughput=    74.9 req/s p50=   12.18 ms p99=   67.27 ms
 batch-aware  throughput=    75.0 req/s p50=    9.50 ms p99=   19.12 ms
 -> batch-aware clears 1.0x the throughput
```
Read the two regimes together, because each tells half the truth. Under the overload burst the web-style reflex is catastrophic: capped at one request per pass, the single replica clears only $118$ requests per second against an offered $500$, so its queue keeps growing for as long as the burst lasts and median latency climbs to nearly five seconds. The batch-aware router, running the same hardware, folds each window's arrivals into a single pass (roughly half a dozen requests at this rate, far short of the cap of $32$), keeps pace with the offered load, and clears more than four times the throughput at a median latency of seventeen milliseconds. Under light load, where the accelerator is rarely busy, the two routers are indistinguishable in throughput (both are limited by how fast requests arrive), and the ten-millisecond accumulation window adds no visible cost; the batch-aware policy even tightens the tail by sweeping up the occasional cluster of near-simultaneous arrivals. The lesson is not that batching is always faster but that it is the only policy that does not fall apart under load, and that the small wait it asks for is nearly free when load is light. That asymmetry is exactly why production model servers batch by default and why Section 23.2 makes batch-aware routing the core of the fleet rather than an optimization bolted on later.
The most expensive way to run an accelerator is to feed it one request at a time. A web load balancer doing its textbook best, spreading traffic so evenly that no two requests ever share a forward pass, will hold a six-figure GPU at single-digit utilization and present a beautiful, flat, perfectly balanced dashboard while doing so. The first time an inference team sees that dashboard they congratulate the load balancer. The second time they realize the balancer was optimizing for the wrong thing entirely, and they go looking for the batch button.
3. What This Chapter Builds on Top of the Optimized Node Beginner
The three properties and the experiment above set the agenda for the rest of the chapter, each section repairing one of the broken assumptions for the distributed fleet. Section 23.2 turns the spread-versus-batch finding into real batch-aware routing across many replicas, with load balancing that accounts for batch occupancy and session affinity that keeps a stateful generation on its replica. Section 23.3 separates online serving (latency-critical, one request at a time) from batch serving (throughput-critical, run offline over a corpus), since the right batching window differs by orders of magnitude between them. Section 23.4 builds autoscaling on GPU utilization and queue depth rather than on request count, because a replica running full batches at high utilization should not be scaled the way a busy web server is. Section 23.5 packs multiple models and tenants onto shared accelerators so expensive hardware is not stranded behind one lightly used model. Section 23.6 attacks the cold-start tax head on with warm pools and fast model loading. Section 23.7 makes the fleet survive failures despite the lost-KV-cache problem, and Section 23.8 shows how production frameworks (Triton, Ray Serve, KServe) package all of this so you assemble a fleet instead of building one from scratch.
This part is where the book's distribution-first thesis meets inference. Chapter 22 was the labeled scale-up prerequisite: everything that makes one accelerator fast, quantization, paged KV cache, fused attention, all per-node craft. From here on the subject is scale-out. The KV-cache economics of Section 22.5 do not disappear; they return multiplied across a fleet, where the cache now decides which replica a request must return to and what failover costs when that replica dies. Watch how each per-node quantity from Chapter 22 reappears in this chapter as a fleet-level constraint: batch size becomes a routing decision, model size becomes a cold-start budget, cache state becomes a session-affinity requirement. The single optimized node is the seed; the chapter grows the fleet around it.
Who: A platform team launching a customer-support assistant backed by an open-weights language model.
Situation: They reused the company's mature web-serving stack, putting eight GPU replicas behind a standard round-robin load balancer with autoscaling on requests per second.
Problem: At launch the GPUs idled near eight percent utilization while p99 latency was terrible and the cost per conversation was four times the projection; under a traffic spike the autoscaler booted new replicas that took ninety seconds to load weights and arrived after the spike had passed.
Dilemma: Buy more GPUs to brute-force the latency (expensive, and it would not fix the idle utilization), or keep the hardware and replace the web-style serving assumptions with model-aware ones.
Decision: They kept the hardware. The diagnosis matched the three properties exactly: round-robin spreading starved the batches, request-count autoscaling fought multi-second cold starts, and re-routing mid-conversation kept losing the KV cache.
How: They moved to a serving framework with continuous batching, switched autoscaling to trigger on queue depth and GPU utilization, added a small warm pool of pre-loaded replicas to absorb bursts, and made the router session-affine so each conversation stayed on the replica holding its cache.
Result: GPU utilization rose from eight percent to over sixty, p99 latency fell by more than half, cost per conversation dropped near the original projection, and spikes were absorbed by the warm pool instead of by cold boots.
Lesson: A model fleet is not a web fleet with a heavier handler. Each of the three properties in Table 23.1.1, ignored, becomes a separate production fire; respected, each becomes a design the rest of this chapter hands you.
Code 23.1.1 hand-rolled a fixed-window batcher to make the idea visible. Production serving stacks ship a far more capable version (dynamic batching that adapts the window to load, and, for generation, continuous batching that admits new requests into a running batch between token steps) behind a few lines of configuration. NVIDIA Triton Inference Server turns on dynamic batching with a snippet of model config; Ray Serve and KServe wrap the same idea at the fleet level with autoscaling and routing built in. A representative Triton model configuration:
```
# config.pbtxt for one model served by Triton
name: "support_llm"
max_batch_size: 32
dynamic_batching {
  max_queue_delay_microseconds: 10000  # wait up to 10 ms to fill a batch
  preferred_batch_size: [ 16, 32 ]     # aim for these efficient batch sizes
}
instance_group [ { count: 2, kind: KIND_GPU } ]  # 2 model instances on each available GPU
```
The batching knobs of Code 23.1.1 (WINDOW and MAX_BATCH) expressed as a few lines of Triton configuration. The framework handles queue management, batch assembly, preferred-size padding, and GPU placement; you declare the policy instead of coding the scheduler. Section 23.8 compares Triton, Ray Serve, and KServe in depth.

4. The Frontier: Batching Smarter and Disaggregating the Request Advanced
The fixed-window batcher of Code 23.1.1 is the simplest member of a fast-moving family. Its crudest limitation is that, for generation, a single forward pass is not the whole request: a request runs for many token steps, and a static batch formed once at admission either makes early-finishing requests wait for slow ones or wastes slots as requests complete. The research and systems community has pushed well past it, and the moves below are the ones that Section 23.2 and Chapter 24 build on.
Two ideas now dominate high-throughput model serving. Continuous batching (also called in-flight or iteration-level batching), introduced by the Orca system (Yu et al., OSDI 2022) and made mainstream by vLLM's PagedAttention (Kwon et al., SOSP 2023), admits and evicts requests between token steps rather than freezing a batch at admission, keeping the accelerator densely packed even as individual generations finish at different times; it is now the default in vLLM, TensorRT-LLM, and SGLang. Prefill/decode disaggregation, developed in DistServe (Zhong et al., OSDI 2024) and Splitwise (Patel et al., ISCA 2024) and shipped in production stacks from 2024 through 2026, separates the compute-bound prompt-processing phase from the memory-bandwidth-bound token-generation phase onto different replicas so each can batch on its own terms, since the two phases have opposite hardware appetites. Both ideas take the lesson of Output 23.1.1 (group work to feed the GPU) and apply it at finer grain than a fixed window allows. We develop continuous batching and disaggregation across a fleet in Chapter 24's distributed LLM serving; for now, read Code 23.1.1 as the first rung of a ladder these systems climb.
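The gap between a batch frozen at admission and one refilled between token steps can be seen in a toy step counter. The sketch below is a simplified illustration, not the scheduler of Orca, vLLM, or any other system. It assumes all requests are waiting at time zero, every decode step costs the same regardless of occupancy, and prefill is ignored. The slot count and output lengths are hypothetical.

```python
from collections import deque

SLOTS = 4  # hypothetical number of sequences the accelerator decodes per step


def static_batching_steps(lengths):
    """Freeze each batch at admission: it runs until its longest member ends."""
    steps = 0
    for i in range(0, len(lengths), SLOTS):
        steps += max(lengths[i:i + SLOTS])
    return steps


def continuous_batching_steps(lengths):
    """Refill freed slots between token steps (iteration-level scheduling)."""
    waiting, active, steps = deque(lengths), [], 0
    while waiting or active:
        while waiting and len(active) < SLOTS:     # admit into free slots
            active.append(waiting.popleft())
        steps += 1                                 # one decode step for the batch
        active = [r - 1 for r in active if r > 1]  # drop finished sequences
    return steps


if __name__ == "__main__":
    # Toy workload: output lengths in token steps (hypothetical values).
    lengths = [2, 30, 3, 4, 28, 2, 3, 5]
    tokens = sum(lengths)
    for name, fn in (("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)):
        steps = fn(lengths)
        print(f"{name:<10} {steps:>3} steps, "
              f"slot utilization {tokens / (steps * SLOTS):.0%}")
```

On this workload the static policy needs 58 decode steps because each batch waits for its longest generation, while the continuous policy finishes in 30 by handing freed slots to waiting requests. The exact ratio depends entirely on the assumed length distribution.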
With the three properties named, the wrong reflex measured, and the chapter's agenda set, we are ready to build the fleet for real. The next section takes the spread-versus-batch finding and turns it into routing across many replicas: how to balance load when the unit of work is a batch rather than a request, and how to keep a stateful generation on the replica that holds its cache. That work begins in Section 23.2.
For each symptom, name which of the three properties from Section 1 (batching, replica cost and startup, KV-cache statefulness) is being mishandled, and which section of this chapter supplies the fix: (a) GPUs sit at five percent utilization while latency is fine, on a model that returns one classification per request; (b) a traffic spike triggers new replicas that come online after the spike has ended; (c) a multi-turn chat occasionally produces garbled continuations whenever a particular replica is busy and the request is re-routed. Explain why applying the fix for the wrong property would not help.
Extend Code 23.1.1 to sweep the accumulation WINDOW over the values $\{0, 2, 5, 10, 20, 50\}$ ms under a fixed moderate load (for example $1500$ requests over $6$ seconds, offered $250$ requests per second). Plot or tabulate throughput and p99 latency against the window. Identify the smallest window that already captures most of the throughput gain, and explain why pushing the window much larger buys little extra throughput while steadily raising latency. Relate the shape you find to the latency-throughput curve of Section 3.4.
A replica serves $120$ requests per second when warm and takes $90$ seconds to load its weights from cold. Traffic doubles instantaneously from $240$ to $480$ requests per second and stays there. Using only these numbers, estimate how long the fleet is under-provisioned if the autoscaler reacts to the spike by booting cold replicas, and how many requests queue up during that window. Then estimate the size of a warm pool (in pre-loaded idle replicas) needed to absorb the same doubling with no cold boots. Explain in one sentence why a web-serving autoscaler, which assumes millisecond startup, would size this pool at zero and be wrong. We make this analysis precise in Section 23.6.