Part V: Distributed Inference and Serving
Chapter 23: Distributed Inference Systems

Replicas, Load Balancing, and Batch-Aware Routing

"They told me to balance the load evenly. So I gave every replica exactly one request, and now every GPU is running a batch of one and hating me equally."

A Round-Robin Router With Good Intentions
Big Picture

You scale inference throughput by replicating one optimized model node across many GPUs behind a router, but the router cannot be a plain web load balancer: it must keep each replica's GPU batch efficiently full and send stateful requests back to the replica that already holds their KV cache. The previous section established why model serving differs from web serving: a GPU is most efficient when it runs a large batch, and generation is stateful because each conversation carries a growing KV cache. Those two facts turn load balancing from a solved problem into the central design question of a serving fleet. Spread requests too evenly and every replica runs half-empty batches, wasting most of the hardware you paid for. Ignore the KV cache and every routed request recomputes a prefix that already exists somewhere else. This section builds the router that gets both right: replicate the node, fill the batches, respect the cache, and protect the tail.

The single-node work is done. In Chapter 22 we made one model run as fast as one GPU allows: quantized weights, a paged KV cache, fused attention kernels, and continuous batching that keeps the accelerator busy. That gives us a node with a known throughput, a known latency curve, and a known memory budget. It is not enough. A node that answers 300 requests per second cannot serve a product that receives 3,000. The remedy for a throughput ceiling is the one named in Section 1.1: replicate the service across more machines. This section is about doing that replication well for GPU inference specifically, where the naive answer (round-robin across identical replicas) quietly wastes most of the fleet.

The architecture is simple to draw and subtle to operate. A router (sometimes called a gateway or a serving frontend) sits in front of $R$ identical replicas, each a full copy of the optimized model node. Every incoming request is assigned to one replica, joins that replica's queue, and is eventually folded into a batch that the GPU executes. Total fleet throughput is, in the best case, $R$ times one replica's throughput. The router's job is to make the common case the best case, and Figure 23.2.1 shows the two ideas that get it there: batch-aware fill and prefix affinity.

Request stream req · req · req(S7) · req · req(S7) Router batch-aware + prefix affinity Replica 1 full batch (efficient) Replica 2 nearly full Replica 3 (holds S7 KV cache) session S7 stays here Replica R S7 routed by affinity
Figure 23.2.1: The distributed-serving architecture. One router fronts $R$ replicas, each a full copy of the Chapter 22 optimized node. Stateless requests are placed to keep batches full (batch-aware routing, blue arrows), so a replica runs eight-wide batches rather than batches of one. Session-tagged requests (here session S7) are routed by prefix affinity (green dashed arrow) to the replica that already holds their KV cache, so the prefix is not recomputed. The empty slots on the lower replicas show the waste that round-robin balancing would create everywhere.

1. Replicate the Node, Multiply the Throughput Beginner

The replicated-service pattern is the oldest move in distributed systems and the one we already named as an axis of distribution. To serve more requests than one server can handle, you run many identical servers and put a load balancer in front. We met this pattern abstractly in Section 2.9, where a stateless replica plus a balancer gives throughput that grows linearly with the replica count and availability that survives any single replica's failure. For ordinary web services, that is essentially the whole story: requests are independent and cheap, so any replica will do, and round-robin or random placement is close to optimal.

GPU inference inherits the pattern and breaks the assumption that makes round-robin optimal. A replica here is not a cheap stateless web worker; it is a full copy of the optimized model node from Chapter 22, holding billions of parameters in GPU memory and most efficient only when it processes many requests together in a batch. Two properties follow, and they are the entire reason this section exists. First, a replica's throughput depends on how it is fed: feed it one request at a time and it runs far below its rated capacity. Second, a replica accumulates state, the KV cache of every active generation, so the replicas are not interchangeable for a request that is mid-conversation. A web balancer assumes replicas are stateless and requests are uniform. Neither holds here, which is why a serving router needs ideas that a web balancer does not have.

Key Insight: A Serving Router Optimizes Batch Occupancy, Not Request Spread

A web load balancer minimizes per-replica request count, spreading load as evenly as possible. A GPU serving router maximizes per-replica batch occupancy, because the cost of a GPU step is paid per step regardless of how full the batch is. The two objectives conflict: perfectly even spreading produces many small batches, which is the worst case for GPU efficiency. The router's real job is to consolidate enough requests onto each replica to fill its batch, while staying inside the latency budget and respecting which replica holds each session's cache. Spread is a constraint; occupancy is the objective.

2. Why Round-Robin Is the Wrong Default Intermediate

Consider $R$ replicas, each running a GPU step that handles up to $B$ requests and takes time $\tau$ regardless of how many requests are actually in the batch. The fleet's peak throughput is

$$\Lambda_{\max} = \frac{R \cdot B}{\tau},$$

achieved only when every step runs a full batch of $B$. The cost the hardware bills you, on the other hand, is one GPU-step of $\tau$ per fired batch, full or not. So the quantity that decides whether you are wasting money is the mean batch occupancy $\bar{b}/B$, where $\bar{b}$ is the average number of requests per fired step. A fleet running at 45 percent occupancy is paying for more than twice the hardware its workload needs.

Round-robin maximizes the wrong thing. By spreading each arriving request to the next replica in rotation, it minimizes the number of requests waiting at any one replica, which sounds desirable until you remember that a replica fires a step as soon as it has any backlog at all. Thin, evenly-spread queues mean every replica frequently fires a batch of one or two, the GPU pays the full $\tau$ for a nearly empty batch, and effective throughput collapses toward $R/\tau$ instead of $R \cdot B / \tau$. The fix is batch-aware routing: deliberately let a small queue accumulate at a replica, and let the replica wait a bounded time for its batch to fill before firing. This is the fleet-level generalization of the continuous batching from Section 22.7, where a single node forms batches from its own queue; here the router shapes which requests land in which queue so that batches can fill at all.

For the placement decision itself, the serving world borrows two ideas from classical load balancing and adapts them to batched work. Least-outstanding-requests routes each request to the replica with the shortest current backlog, which keeps queues balanced without the blindness of round-robin. Power-of-two-choices is the cheap approximation: sample two replicas at random and send the request to the less-loaded of the two. This avoids the herd behavior where every request rushes to the single least-loaded replica (a thundering herd that the global least-loaded rule can cause), and it provably keeps the maximum queue length close to balanced with only two probes per request. Both are adapted for GPU work by measuring "load" as queued-plus-in-flight requests against the batch capacity $B$, not raw connection count.

Fun Note: The Tyranny of the Empty Batch

A GPU running a batch of one is the most expensive way to be idle ever invented. It is fully powered, fully clocked, drawing its entire thermal budget, and doing one customer's worth of work in a slot sized for eight. Round-robin under light load turns an entire fleet into this state simultaneously, which is how a team can watch their dashboard show every GPU at 100 percent utilization while their throughput-per-dollar quietly falls through the floor. Utilization is not occupancy, and the bill knows the difference.

3. Prefix Affinity: Routing to the Cache Advanced

Batch occupancy is only half of the routing problem. The other half comes from statefulness. When a model generates text token by token, it builds a KV cache holding the attention keys and values for every token seen so far, the structure we sized and paged in Section 22.5. That cache lives in the GPU memory of one specific replica. If a follow-up turn in the same conversation, or a new request that shares a long system prompt, is routed to a different replica, that replica has no cache for the shared prefix and must recompute it from scratch, paying the full prefill cost again. Route the request back to the replica that already holds the prefix, and the prefill is nearly free.

This is prefix affinity (also called session affinity or sticky routing). The router keeps a small table mapping each session id or prompt-prefix hash to the replica currently holding its cache, and routes matching requests there. The tension with batch-aware balancing is real: affinity pins a request to a possibly-busy replica, which can hurt load balance, so production routers blend the two, preferring the cache-holding replica unless its backlog exceeds a threshold, in which case they evict and re-route. The payoff is large for workloads with shared prefixes (multi-turn chat, retrieval-augmented prompts with a common context, agent loops that re-send a long tool-description preamble), because the saved prefill is often the dominant cost of the request.

Thesis Thread: A Single-Node Optimization Becomes a Fleet-Level Routing Decision

The KV cache began in Chapter 22 as a within-the-node memory optimization: store the attention state so you never recompute it. Scaled out across a fleet, that same cache becomes a routing constraint, because the saved computation only exists on the one replica that built it. The per-node economics of the cache (its memory cost, its hit value) do not disappear when you replicate; they multiply across the fleet and turn into the affinity policy of the router. This is the recurring shape of the whole book: a primitive perfected on one machine returns, transformed, as a coordination decision among many. It returns again, multiplied further, when we shard the cache itself across machines in Chapter 24.

4. Queueing and Admission Control for the Tail Advanced

Filling batches has a cost, and that cost is latency. A request that waits for its batch to fill, or that lands behind a long backlog, experiences a delay that the mean throughput number hides entirely. The metric that matters for a user-facing service is the tail: the p99 latency we made precise in Section 5.3, the time below which 99 percent of requests complete. A serving fleet can have excellent mean latency and a catastrophic tail if a few requests sit behind a saturated replica's queue.

Two mechanisms protect the tail. The first is a bounded fill wait: a replica waits for its batch to fill only up to a deadline, then fires whatever it has, so no request is held hostage to an arrival that never comes. This caps the latency cost of batch-aware routing directly. The second is admission control. When the whole fleet is saturated, queues grow without bound and every request's latency climbs together; it is better to reject or shed a fraction of requests immediately (returning a fast error or a degraded response) than to accept all of them and blow the SLO for everyone. Admission control trades a small drop in completion rate for a bounded tail, and the threshold is usually set on queue depth: if the shortest replica queue already exceeds the point where the SLO is at risk, the router stops accepting new work rather than queueing it. We make the saturation point a control signal for autoscaling in Section 23.4; here it is a control signal for protecting the requests already in flight.

5. Simulating the Three Policies Intermediate

The argument so far is qualitative; the simulation below makes it quantitative. It models a fleet of $R = 4$ replicas, each running GPU steps of fixed cost $\tau$ that process up to $B = 8$ requests. Requests arrive as a Poisson stream, each tagged with a session prefix, and we hold the offered load fixed while swapping the routing-and-batching policy. The three policies are exactly the ones developed above: plain round-robin with greedy firing, batch-aware power-of-two routing with a bounded fill wait, and prefix-affinity on top of batch-aware. We report sustained throughput, mean batch fill (the occupancy that decides GPU efficiency), total GPU busy-seconds (what the hardware actually bills, including the KV-recompute penalty on cache misses), and the p99 queue wait.

import random, statistics

# Discrete-time fleet simulator (1 ms ticks). R replicas each run one GPU "step"
# at a time. A step processes up to B queued requests and costs STEP_MS regardless
# of how full the batch is, so a half-empty batch burns a full GPU slot for half
# the useful work. Offered load is held fixed; the three policies differ only in
# how the ROUTER assigns each arrival and WHEN each replica fires its batch.
#
#   round_robin : even spread; a replica fires the instant it is idle and holds
#                 >=1 request -> many tiny batches -> poor GPU efficiency.
#   batch_aware : route to the shorter of two sampled backlogs (power-of-two) AND
#                 let a replica wait up to FILL_WAIT_MS for the batch to fill.
#   prefix_affinity : batch_aware PLUS pin each session to the replica already
#                 holding its KV cache, avoiding a per-step recompute penalty.

R, B          = 4, 8         # replicas, max batch size per step
STEP_MS       = 40           # one GPU step, fixed cost regardless of fill
HORIZON_MS    = 60000        # simulated wall-clock
ARRIVAL_RATE  = 0.30         # requests per ms (~300 req/s); capacity = R*B/STEP = 800
N_PREFIX      = 50
FILL_WAIT_MS  = 25           # how long a batch-aware replica waits to fill
KV_MISS_MS    = 3            # extra step cost PER uncached prefix in a batch
CACHE_CAP     = 12           # prefixes a replica can keep KV cache for
random.seed(7)

def simulate(policy):
    queues   = [[] for _ in range(R)]     # lists of (arrive_ms, prefix)
    busy_to  = [0] * R                     # tick until which replica is busy
    wait_since = [None] * R                # when current batch-aware wait started
    cached   = [set() for _ in range(R)]
    affinity = {}
    rr = [0]
    fills, waits = [], []
    gpu_ms = 0
    batch_aware  = policy in ("batch_aware", "prefix_affinity")
    use_affinity = policy == "prefix_affinity"

    def route(pfx):
        if policy == "round_robin":
            r = rr[0] % R; rr[0] += 1; return r
        if use_affinity and pfx in affinity:
            return affinity[pfx]                  # session's KV cache lives here
        a, b = random.randrange(R), random.randrange(R)   # power-of-two-choices
        r = a if len(queues[a]) <= len(queues[b]) else b
        if use_affinity:
            affinity[pfx] = r
        return r

    def fire(r, now):
        nonlocal gpu_ms
        q = queues[r]
        batch = q[:B]; del q[:len(batch)]
        pset = {p for (_, p) in batch}
        # KV-cache misses cost extra per uncached prefix (recompute the prefix).
        # Each replica keeps a bounded cache of recently seen prefixes.
        misses = len(pset - cached[r])
        cost = STEP_MS + KV_MISS_MS * misses
        cached[r] |= pset
        if len(cached[r]) > CACHE_CAP:                 # bounded cache, drop overflow
            cached[r] = set(list(cached[r])[-CACHE_CAP:])
        for (arr, _) in batch:
            waits.append(now - arr)
        fills.append(len(batch))
        gpu_ms += cost
        busy_to[r] = now + cost
        wait_since[r] = None

    for now in range(HORIZON_MS):
        # Poisson arrivals: at most one per tick at this rate (rate < 1/ms)
        if random.random() < ARRIVAL_RATE:
            pfx = random.randrange(N_PREFIX)
            queues[route(pfx)].append((now, pfx))
        # each idle replica decides whether to fire
        for r in range(R):
            if busy_to[r] > now or not queues[r]:
                wait_since[r] = None
                continue
            if not batch_aware:
                fire(r, now)                      # greedy: fire on any backlog
            elif len(queues[r]) >= B:
                fire(r, now)                      # batch full
            else:
                if wait_since[r] is None:
                    wait_since[r] = now
                if now - wait_since[r] >= FILL_WAIT_MS:
                    fire(r, now)                  # waited long enough, fire partial

    served = len(waits)
    thr = served / (HORIZON_MS / 1000)
    util = statistics.mean(fills) / B
    waits.sort()
    p99 = waits[int(0.99 * len(waits)) - 1]
    gpu_busy_s = gpu_ms / 1000
    return thr, util, gpu_busy_s, p99

print(f"offered load ~{ARRIVAL_RATE*1000:.0f} req/s, fleet capacity {R*B*1000//STEP_MS} req/s\n")
print(f"{'policy':<18}{'thrupt(req/s)':>14}{'gpu_fill':>10}{'gpu_busy(s)':>12}{'p99_wait(ms)':>14}")
print("-" * 68)
for pol in ("round_robin", "batch_aware", "prefix_affinity"):
    thr, util, steps, p99 = simulate(pol)
    print(f"{pol:<18}{thr:>14.0f}{util*100:>9.0f}%{steps:>12.1f}{p99:>14.0f}")
Code 23.2.1: A pure-Python fleet simulator that swaps only the routing-and-batching policy while holding the offered load, replica count, batch size, and step cost fixed. The route function carries the placement policy (round-robin, power-of-two, or affinity lookup); the firing logic inside the tick loop carries the batching policy (greedy versus bounded fill wait).
offered load ~300 req/s, fleet capacity 800 req/s

policy             thrupt(req/s)  gpu_fill gpu_busy(s)  p99_wait(ms)
--------------------------------------------------------------------
round_robin                  303       45%       240.1            49
batch_aware                  299       71%       164.9            80
prefix_affinity              299       61%       150.4            64
Output 23.2.1: All three policies serve the same ~300 req/s, but at very different cost. Round-robin runs 45 percent-full batches and burns 240 GPU-seconds; batch-aware lifts occupancy to 71 percent and cuts the GPU bill by a third to 165 seconds; prefix-affinity lowers it further to 150 seconds by avoiding KV recompute, and recovers tail latency (p99 64 ms versus 80 ms) by keeping each session on one warm replica.

Read the table by column. Throughput is the same across all three because the offered load is well inside fleet capacity; what differs is the price paid to serve it. Round-robin has the lowest p99 (49 ms) precisely because it never waits to fill a batch, but it pays for that with 45 percent occupancy and the highest GPU bill (240 seconds), more than half the hardware doing nothing useful. Batch-aware routing trades a slower tail (80 ms, from the fill wait) for a 31 percent cut in GPU-seconds. Prefix-affinity then improves on batch-aware along two axes at once: it cuts the GPU bill further (150 versus 165 seconds) by turning cache misses into hits, and it pulls the tail back down (64 versus 80 ms) because affinity-routed sessions skip the expensive recompute that was inflating their step cost. The lesson is not that one policy dominates; it is that the router's policy is the knob that sets your throughput-per-dollar and your tail at the same time.

Practical Example: The Chatbot Fleet That Was Paying for Empty Batches

Who: An inference platform engineer running a customer-support chat assistant on a fleet of GPU replicas.

Situation: Traffic was comfortably within capacity, every GPU dashboard read near 100 percent utilization, yet the cloud bill was roughly double what the request volume seemed to justify.

Problem: The serving frontend used the cloud load balancer's default round-robin policy, inherited from the team's web services, and the replicas were firing GPU steps on batches of one or two most of the time.

Dilemma: Add more replicas to chase the latency they thought they had a capacity problem with (more cost), or change the router to fill the batches they already had (more engineering, and a small latency cost from waiting to fill).

Decision: They changed the router, because the simulation in Output 23.2.1 matched their symptom exactly: high utilization, low occupancy, a doubled bill. The problem was occupancy, not capacity, so more replicas would only have spread the waste thinner.

How: They switched placement to power-of-two-choices on queue depth, gave each replica a 20 ms bounded fill wait, and added session affinity keyed on the conversation id so multi-turn chats stuck to one warm replica.

Result: Mean batch occupancy rose from the low 40s to the low 70s percent, the fleet served the same traffic on 40 percent fewer GPUs, and p99 rose by under 30 ms, inside the product's SLO. The KV-cache hit rate on follow-up turns went from near zero to most turns, removing repeated prefill.

Lesson: When utilization is high but the bill is wrong, the router, not the replica count, is almost always the cause. Fill the batches and route to the cache before you buy more hardware.

Library Shortcut: Prefix-Aware Routing in One Config Block

Code 23.2.1 builds the router by hand to expose the mechanism. In production you do not write a tick-level simulator; modern serving frameworks expose batch-aware and prefix-aware routing as configuration. A Ray Serve or SGLang router fronting replicas of a vLLM engine reduces the whole policy to a few declarative lines, and the framework handles queue measurement, the fill wait, the affinity table, and the eviction-on-overload that our simulator only sketches:

# A prefix-aware router fronting several model replicas (SGLang-style).
# The framework tracks each replica's KV-cache contents and routes by longest
# matching prefix, falling back to load balance when no replica holds the prefix.
from sglang import Runtime, Router

replicas = [Runtime(model_path="my-model", base_gpu_id=i) for i in range(4)]
router = Router(
    replicas,
    policy="cache_aware",     # longest-prefix-match affinity, else load balance
    max_wait_ms=20,           # bounded fill wait that caps the tail (Section 4)
    balance_threshold=1.5,    # evict affinity when a replica is 1.5x over the mean
)
router.serve(port=8000)       # one endpoint; clients never see the fleet
Code 23.2.2: The same three ideas as Output 23.2.1 (fill the batch, cap the wait, route to the cache) expressed as a handful of router options. Roughly eighty lines of hand-rolled simulator collapse to one Router construction; the framework owns the load measurement, the affinity table, and the overload eviction.

6. What the Router Buys You Beginner

Stepping back, the router converts a pile of identical replicas into a single coherent service whose throughput scales with the replica count and whose efficiency depends almost entirely on three decisions: how full it keeps the batches, whether it respects the KV cache, and how aggressively it protects the tail. Round-robin gets the first two wrong by default, which is why a serving fleet is one of the few places where the obvious load-balancing answer is actively harmful. The replicated-service pattern from Section 2.9 still gives you the linear throughput scaling and the failure tolerance; the GPU-specific routing is what keeps that scaling from costing twice what it should.

Research Frontier: Prefix-Cache-Aware Routing (2024 to 2026)

The idea that a fleet router should route by KV-cache contents, not just by load, has become an active research and systems line. SGLang's RadixAttention (Zheng et al., 2024) organizes cached prefixes in a radix tree and routes each request to the replica whose tree holds the longest matching prefix, turning prefix affinity from a coarse session id into fine-grained longest-prefix-match across the whole fleet. The vLLM project and its production router add prefix-cache-aware scheduling and a routing layer (the vLLM production stack and "router" components) that track per-replica cache state and balance affinity against load. Parallel work studies the scheduling theory underneath: how to weigh cache-hit value against queue imbalance, how to evict affinity under bursty load, and how prefill/decode disaggregation (which we take up in Chapter 24) changes what "route to the cache" even means when prefill and decode live on different machines. The throughline is that the router is no longer a dumb balancer; it is a cache-aware scheduler, and getting its policy right is now a measurable fraction of total serving cost.

We have replicated the node, fronted it with a router that fills batches and respects the cache, and protected the tail with bounded waits and admission control. That gives us a fleet that serves a steady stream of low-latency requests efficiently. But not all inference is a steady stream of small requests: some workloads are enormous offline jobs that score billions of records overnight, where latency does not matter and throughput is everything. Serving those well on the same fleet, and knowing when to separate them, is a different optimization with its own routing and batching rules. Section 23.3 takes up online versus batch inference across a fleet.

Exercise 23.2.1: Occupancy Versus Utilization Conceptual

A teammate points at a dashboard showing every GPU in the fleet at 98 percent utilization and concludes the fleet is at capacity and needs more replicas. Using the distinction between utilization and batch occupancy from Section 2, explain how this conclusion can be wrong, and describe one additional metric you would put on the dashboard to tell "genuinely at capacity" apart from "running tiny batches at full clock". Relate your answer to the gap between the gpu_fill and the implied utilization in Output 23.2.1's round-robin row.

Exercise 23.2.2: Tune the Fill Wait Coding

Starting from Code 23.2.1, sweep FILL_WAIT_MS over the values 0, 5, 10, 20, 40, and 80 for the batch_aware policy, and plot (or tabulate) mean batch fill, GPU busy-seconds, and p99 wait against the fill wait. Identify the knee where increasing the wait stops improving occupancy but keeps raising the tail. Then raise ARRIVAL_RATE to 0.6 req/ms (near capacity) and repeat: explain why the best fill wait is smaller under heavy load, using the fact that batches fill on their own when the queue is already deep.

Exercise 23.2.3: When Does Affinity Stop Helping? Analysis

Prefix-affinity wins in Output 23.2.1 because requests share a small pool of 50 prefixes and a cache miss costs KV_MISS_MS per uncached prefix. Reason about (and then verify in code) the two regimes where affinity stops paying: first, when N_PREFIX is very large so almost every request is a unique prefix (no reuse to exploit); second, when the offered load is so high that pinning sessions to specific replicas causes severe queue imbalance. For the second case, modify the affinity router to fall back to load balancing when the affinity target's backlog exceeds a threshold, and measure how much tail latency you recover and how much GPU efficiency you give back. Connect the trade-off to the admission-control discussion in Section 4.