Part V: Distributed Inference and Serving
Chapter 22: Per-Node Inference Efficiency: A Prerequisite

Continuous Batching and Speculative Decoding

"They told me to wait for the slowest request in my batch. I waited. The sun set. The slowest request was generating a sonnet."

A GPU Slot Held Hostage by the Longest Sequence
Big Picture

An autoregressive model emits one token per forward pass, and different requests want different numbers of tokens, so the naive way of serving them leaves the accelerator idle most of the time. Two single-node techniques recover that lost throughput without touching the model's output. Continuous batching refuses to let a finished request hold its slot hostage: it swaps sequences in and out of the running batch at the granularity of a single token, keeping the GPU full. Speculative decoding refuses to let the large model spend a full forward pass on every token: a small draft model guesses several tokens ahead and the large model verifies the whole guess in one parallel pass, accepting the longest correct prefix. Both are per-node mechanisms; this section builds them so that the distributed serving scheduler of Chapter 24 can multiply their per-GPU gains across an entire fleet.

The previous sections of this chapter made each token cheaper: Section 22.5 paged the KV cache so more sequences fit in memory, and Section 22.6 made the attention kernel itself read fewer bytes. Those wins are real, but they assume the GPU is actually busy. For large-language-model serving that assumption is usually false, because generation is sequential and requests are uneven. This section attacks the two largest sources of idle silicon directly. The first is scheduling waste: a batch that must wait for its slowest member. The second is the sequential nature of decoding itself: a forward pass that produces only one token while the hardware could have produced several. Neither technique changes what the model outputs; both change how fast it gets there.

1. Prefill and Decode Are Two Different Workloads Beginner

Serving one request to a transformer splits into two phases with opposite hardware characters. The prefill phase processes the entire prompt at once: every prompt token attends to every earlier token in a single large matrix multiplication, so prefill is compute-bound, the kind of dense arithmetic a GPU loves. The decode phase then generates the answer one token at a time; each new token requires a forward pass that reads the whole model's weights and the growing KV cache just to produce a single vector. That per-token pass moves a great deal of memory for very little arithmetic, so decode is memory-bandwidth-bound and leaves the compute units mostly idle. A request is therefore a short compute-heavy burst followed by a long memory-heavy trickle, and the trickle is where serving systems spend most of their time.

This asymmetry is the root cause of both techniques in this section. Continuous batching exists because the memory-bound decode phase has spare compute to absorb other requests, so you should keep as many sequences decoding together as memory allows. Speculative decoding exists because a single decode pass is so memory-bound that verifying several proposed tokens costs almost the same as generating one, so you should make each expensive pass do more work. A third optimization, chunked prefill, follows from the same observation: a long prompt's compute-bound prefill can starve the latency-sensitive decode steps of other requests, so serving engines split a large prefill into smaller chunks and interleave them with ongoing decode, smoothing the latency of every request sharing the GPU.

Key Insight: Decode Is Memory-Bound, So the GPU Has Compute to Spare

One decode step reads the entire model and KV cache to emit a single token, so it is starved for bytes, not for arithmetic. That single fact drives the whole section. Because decode wastes compute, you can pack more sequences into the batch almost for free (continuous batching), and you can verify several speculative tokens in one pass for almost the cost of one (speculative decoding). When you see an inference optimization for autoregressive models, ask what it does with the idle compute that memory-bound decoding leaves on the table.

2. Static Batching Wastes the GPU on the Longest Sequence Beginner

Generic model servers batch requests to amortize the cost of a forward pass, a practice Chapter 23 treats as dynamic batching: the server waits a few milliseconds, collects whatever requests arrive, runs them as one batch, and returns all the results together. That works beautifully when every request does the same amount of work, as in image classification, where each input takes exactly one forward pass. It fails for text generation, because the requests in a batch finish at wildly different times. One user asks for a yes-or-no answer (eight tokens) while another asks for an essay (two hundred tokens), yet a static batch cannot return the short answer until the whole batch is done. Until then the short request's slot sits occupied but idle, a bubble of wasted GPU, and no new request can take its place.

Continuous batching, also called in-flight batching, removes the bubble by operating at the token level rather than the request level. After every decode step the scheduler checks which sequences have emitted their end-of-sequence token, evicts them from the running batch, frees their KV-cache pages, and admits waiting requests into the freed slots, all before the next step. The batch is thus never the same from one step to the next; it is a continuously refreshed set of whatever sequences are currently live. The short request leaves the instant it finishes and a new one begins immediately, so the GPU stays full instead of waiting for the essay. Figure 22.7.1 contrasts the two schedules, and the simulation that follows measures the difference.

Static batching: every slot waits for the longest request slot 1 slot 2 slot 3 slot 4 time → (red dashed = idle bubble, batch blocked until longest finishes) Continuous batching: finished slots refill immediately slot 1 slot 2 slot 3 slot 4 time → (blue = a freshly admitted request; almost no idle gap) Speculative decoding: draft proposes, target verifies in one pass Draft model cheap, sequential propose k tokens Target model one parallel pass accept longest correct prefix 3 accepted, 1 rejected → resample loop: target's accepted tokens seed the next draft round
Figure 22.7.1: The two single-node throughput techniques. Top, static batching holds all four slots until the longest request finishes, so finished slots (red dashed) sit idle. Middle, continuous batching evicts a finished sequence and admits a queued one (blue) at the next token step, keeping the GPU full. Bottom, speculative decoding: a cheap draft model proposes $k$ tokens, the target model verifies them in one parallel forward pass, accepts the longest correct prefix (here three of four), and feeds the result back into the next draft round. Section 3 quantifies the batching gain and Section 4 the speculative speedup.

The simulation in Code 22.7.1 models both schedules in pure Python. It draws two hundred requests with decode lengths between eight and two hundred tokens, runs them through eight GPU slots under each policy, and reports utilization (the fraction of slot-time that did real work) and throughput (inverse of the total ticks taken). The static scheme pads each batch to its longest member; the continuous scheme refills a slot the moment its request finishes.

import random
random.seed(7)

B = 8                  # number of parallel slots on the GPU
N_REQUESTS = 200       # total requests to serve
out_lens = [random.randint(8, 200) for _ in range(N_REQUESTS)]  # decode lengths

def simulate_static(out_lens, B):
    useful, total, i = 0, 0, 0
    while i < len(out_lens):
        batch = out_lens[i:i + B]
        longest = max(batch)        # every slot is held until the longest finishes
        total += B * longest        # all B slots occupied for `longest` ticks
        useful += sum(batch)        # but each slot only worked for its own length
        i += B
    return useful, total

def simulate_continuous(out_lens, B):
    useful, total = 0, 0
    queue = list(out_lens)
    slots = [queue.pop(0) if queue else 0 for _ in range(B)]
    while any(s > 0 for s in slots) or queue:
        active = sum(1 for s in slots if s > 0)
        total += B                  # all B slots are available this tick
        useful += active            # only the active ones did work this tick
        for k in range(B):
            if slots[k] > 0:
                slots[k] -= 1
            if slots[k] == 0 and queue:     # finished -> refill immediately
                slots[k] = queue.pop(0)
    return useful, total

us_u, us_t = simulate_static(out_lens, B)
co_u, co_t = simulate_continuous(out_lens, B)
print(f"static   util : {100*us_u/us_t:5.1f}%   makespan ticks: {us_t//B}")
print(f"continuous util: {100*co_u/co_t:5.1f}%   makespan ticks: {co_t//B}")
print(f"throughput gain: {us_t/co_t:4.2f}x  (continuous vs static)")
Code 22.7.1: A pure-Python schedule simulator. Both policies do the identical total decode work; they differ only in how much slot-time is wasted waiting for the longest sequence in each batch.
static   util :  60.0%   makespan ticks: 4334
continuous util:  96.6%   makespan ticks: 2691
throughput gain: 1.61x  (continuous vs static)
Output 22.7.1: Static batching wastes forty percent of the GPU on bubbles and finishes in 4,334 ticks; continuous batching runs at 96.6 percent utilization and finishes the same work in 2,691 ticks, a 1.61x throughput gain with no change to any output.

The numbers in Output 22.7.1 are the whole argument for continuous batching. Identical work, identical model, identical outputs, yet the in-flight scheduler delivers about 1.6 times the throughput on this length distribution simply by never letting a slot idle behind a longer sequence. Real engines see larger gains when the length spread is wider, which is the common case for chat and code workloads. Continuous batching is the default in every modern serving engine for exactly this reason.

Library Shortcut: vLLM and TGI Schedule Continuously by Default

The scheduler in Code 22.7.1 is a teaching model; in production you do not write it. vLLM's engine performs token-level continuous batching coupled to its paged KV cache (the Section 22.5 mechanism), so eviction and admission cost only a page-table update, and Hugging Face Text Generation Inference (TGI) ships the same in-flight batching out of the box. Turning a single-request generate loop into a continuously batched server is one configuration object rather than the dozens of lines of slot bookkeeping above:

# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          max_num_seqs=256)        # up to 256 sequences batched in flight at once
params = SamplingParams(max_tokens=200)

# Hand the engine a whole queue; it admits, evicts, and refills slots per token,
# overlapping short and long requests automatically. No manual scheduling.
outputs = llm.generate(prompts, params)
Code 22.7.2: The same continuous-batching policy as Output 22.7.1, now a single max_num_seqs setting. The engine handles eviction, KV-page recycling, and admission that the hand-written simulator spelled out.

3. Speculative Decoding Turns Sequential Work Into Parallel Work Intermediate

Continuous batching fills the GPU across requests; speculative decoding attacks the sequential cost within a single request. Decoding is slow because it is inherently serial: token $t+1$ cannot be computed until token $t$ exists, so a length-$L$ answer needs $L$ memory-bound forward passes of the large model. Speculative decoding breaks that chain with a division of labor. A small, cheap draft model generates a guess of the next $k$ tokens quickly. The large target model then runs a single forward pass over the prompt plus all $k$ proposed tokens at once, which a transformer can do in parallel because it scores every position simultaneously. Comparing the target's own distribution at each position to the draft's proposal, the system accepts the longest prefix the target agrees with and resamples the first disagreement from a corrected distribution.

The decisive property is that this procedure is exact: with the standard rejection-sampling acceptance rule, the accepted tokens are distributed exactly as if they had been sampled from the target model alone. Speculative decoding changes the speed of generation and nothing about its output distribution. The win comes from arithmetic intensity. One target forward pass is so memory-bound that processing $k+1$ positions costs almost the same wall-clock time as processing one, so every accepted draft token is a target pass avoided. If the draft and target agree on a fraction $a$ of tokens, a single verification pass yields, in expectation,

$$\mathbb{E}[\text{tokens accepted}] = \frac{1 - a^{k+1}}{1 - a},$$

the mean of a truncated geometric run of agreements plus one bonus token that the target supplies for free at the end of the accepted prefix. Writing $T$ for the latency of one target forward pass and $D \ll T$ for one draft pass, the expected speedup over plain decoding is

$$\text{speedup}(a, k) = \frac{T \cdot \mathbb{E}[\text{tokens accepted}]}{T + k\,D} = \frac{1 - a^{k+1}}{1 - a} \cdot \frac{T}{T + k\,D}.$$

The numerator rewards a high acceptance rate; the denominator charges for the $k$ draft passes that may be wasted if the target rejects early. Code 22.7.3 evaluates this expression across acceptance rates for a draft that is about seven times cheaper than the target.

def expected_speedup(a, k, T, D):
    accepted = (1 - a ** (k + 1)) / (1 - a)   # mean tokens per verification pass
    cost = T + k * D                          # 1 target pass + k draft passes
    return accepted * T / cost                # baseline cost per token is T

T, D, k = 1.0, 0.15, 4                         # target=1.0, draft=0.15, propose 4
for a in (0.5, 0.7, 0.8, 0.9):
    acc = (1 - a ** (k + 1)) / (1 - a)
    print(f"accept rate a={a:.1f} : mean accepted/pass = {acc:4.2f}"
          f"  ->  {expected_speedup(a, k, T, D):4.2f}x speedup")
Code 22.7.3: The speculative-decoding speedup model. Acceptance rate a is the single most important quantity; it is governed by how well the draft model imitates the target on the live token stream.
accept rate a=0.5 : mean accepted/pass = 1.94  ->  1.21x speedup
accept rate a=0.7 : mean accepted/pass = 2.77  ->  1.73x speedup
accept rate a=0.8 : mean accepted/pass = 3.36  ->  2.10x speedup
accept rate a=0.9 : mean accepted/pass = 4.10  ->  2.56x speedup
Output 22.7.3: Speedup climbs sharply with the acceptance rate: a draft that matches the target ninety percent of the time more than doubles decoding throughput, while a poorly aligned draft (fifty percent) barely helps after paying for its own forward passes.

Output 22.7.3 explains why speculative decoding lives or dies on the acceptance rate. A draft model that mirrors the target closely turns four memory-bound target passes into one and yields better than a 2.5x speedup; a weak draft spends its budget on guesses the target throws away and the verification overhead eats most of the gain. The engineering problem is therefore to make the draft both cheap and well-aligned, which is exactly where the recent variants compete.

Fun Note: Autocomplete With a Receipt

Speculative decoding is the model letting an eager intern (the draft) finish its sentences, then proofreading the whole paragraph in one glance and crossing out the first wrong word. The intern is fast and often right; the senior author still signs off on every token, so the published text reads exactly as if the senior had written it alone. You get the intern's speed with the author's distribution, and the only cost is the occasional crossed-out guess.

4. The Variants: Better Drafts, Fewer Passes Advanced

The original formulation uses a separate small model as the draft, which works but requires shipping and serving a second model whose distribution must track the target. A family of variants removes that cost by drawing the draft from the target model itself. Medusa attaches a handful of extra prediction heads to the target so it proposes several future tokens in one pass with no separate model. EAGLE drafts at the feature level rather than the token level, predicting the target's own hidden states to raise the acceptance rate, and its successors push that idea further. Self-speculation uses a subset of the target's own layers as the draft, so a single set of weights both proposes and verifies. The simplest variant of all, n-gram or prompt-lookup decoding, skips a neural draft entirely and proposes tokens by copying repeated spans from the prompt, which is remarkably effective for summarization and code editing where the output echoes the input.

All of these are single-GPU mechanisms: the draft, the verification, and the acceptance test happen inside one accelerator's forward passes. They compose cleanly with continuous batching, since a verified multi-token step is just a different kind of step in the in-flight schedule, and with the paged KV cache of Section 22.5, since accepted tokens extend the cache while rejected ones are simply not committed. What none of them do by themselves is coordinate across machines. Deciding which replica drafts, how a speculative step interacts with cross-node scheduling, and when to disaggregate prefill from decode are fleet-level questions answered by the distributed serving scheduler of Chapter 24, which treats the per-node throughput this section creates as the unit it replicates and load-balances.

Thesis Thread: A Per-Node Multiplier the Fleet Inherits

This chapter is the labeled scale-up exception in a scale-out book, and continuous batching with speculative decoding is where that exception pays off most directly. Every factor of throughput won here, the 1.6x from never idling a slot and the 2x from verifying multiple tokens per pass, multiplies the effective capacity of each GPU before a single request is routed across the network. Chapter 24 takes these per-node numbers as given and asks the distributed question: how many such nodes, arranged how, with prefill and decode possibly split across different machines, meet a fleet-wide latency and throughput target. The per-node mechanism and the fleet scheduler are two layers of the same throughput stack; this section builds the lower one.

Research Frontier: Self-Drafting and Chunked Prefill (2024 to 2026)

The active frontier is making the draft nearly free and the schedule smoother. The EAGLE line (EAGLE-2 and EAGLE-3, 2024 to 2025) drafts at the feature level with a dynamic draft tree and reports acceptance rates and end-to-end speedups well beyond the original separate-model scheme, and Medusa-style multi-head drafting has been folded into production engines. On the scheduling side, chunked prefill (popularized through Sarathi-Serve and now standard in vLLM and TensorRT-LLM) slices a long prompt's compute-bound prefill into smaller pieces and interleaves them with ongoing decode, which stabilizes inter-token latency under mixed traffic and pairs naturally with continuous batching. A parallel thread, prefill/decode disaggregation, runs the two phases on separate hardware pools so each is provisioned for its own bottleneck; Chapter 24 develops that fleet-level idea. The common theme is treating the idle compute of memory-bound decode as a resource to be spent, by drafting, by chunking, or by disaggregating.

Practical Example: The Chat Endpoint That Tripled Its QPS Without New GPUs

Who: An inference platform engineer running a customer-support chat model on a fixed pool of eight GPUs.

Situation: Traffic mixed one-line confirmations with long troubleshooting answers, and the server used static batching with a fixed batch of sixteen.

Problem: Measured GPU utilization sat near sixty percent because short requests waited on long ones, and tail latency spiked whenever a long prompt landed mid-batch.

Dilemma: Buy more GPUs to hit the throughput target, the expensive scale-out reflex, or first extract the throughput already being wasted inside each existing GPU.

Decision: They exhausted the per-node ceiling first, enabling continuous batching, then layering speculative decoding with a small distilled draft and chunked prefill to tame the long prompts.

How: They switched to a vLLM engine with max_num_seqs=256, attached an EAGLE-style draft achieving roughly eighty percent acceptance, and enabled chunked prefill, all configuration changes rather than new code.

Result: Utilization rose from sixty to above ninety percent (the continuous-batching gain of Output 22.7.1) and per-stream decoding roughly doubled (the speculative gain of Output 22.7.3), tripling sustained queries per second on the same eight GPUs and removing the tail spikes.

Lesson: Spend the idle compute inside each node before buying more nodes; the per-node multiplier is the cheapest throughput in the system, and the fleet of Chapter 24 inherits every bit of it.

5. When These Techniques Earn Their Keep Intermediate

Neither technique is universally free. Continuous batching needs a length spread to exploit; a workload where every request emits the same number of tokens (fixed-length classification or scoring) gets little from it, because there are no early finishers to evict. It also raises memory pressure, since more sequences decode concurrently and each holds a KV cache, which is precisely why it is built on top of the paged cache of Section 22.5 rather than alongside it. Speculative decoding helps most when decoding is the bottleneck and the batch is small, the latency-sensitive regime; at very large batch sizes the GPU is already compute-saturated by the batch itself, the spare compute that speculation feeds on is gone, and the extra draft passes can even slow things down. A negative acceptance rate scenario, a draft that disagrees with the target often, can leave you worse off than plain decoding once verification overhead is counted, as the $a = 0.5$ row of Output 22.7.3 hints.

The practical rule is to layer them in the order this section presents them. First make each token cheap (Section 22.5 and Section 22.6), then keep the GPU full across requests (continuous batching), then make each expensive pass produce more tokens (speculative decoding), and finally compile the whole stack for the target hardware, which Section 22.8 takes up next. Each layer assumes the one below it is in place, and together they set the per-node throughput that the distributed serving chapters multiply across a fleet.

Exercise 22.7.1: Where Does the Bubble Go? Conceptual

Static batching in Output 22.7.1 wasted forty percent of the GPU. Explain, in terms of the decode-length distribution, why the wasted fraction grows as the spread between the shortest and longest request in a batch grows, and why it shrinks to nearly zero when all requests have equal length. Then argue why an image-classification server (one forward pass per input, identical work) gains essentially nothing from continuous batching, and name one autoregressive workload where the spread, and therefore the gain, would be unusually large.

Exercise 22.7.2: Find the Break-Even Acceptance Rate Coding

Using the expected_speedup function from Code 22.7.3, hold $T = 1.0$ and $D = 0.15$ fixed and find, for each draft length $k \in \{2, 4, 8\}$, the acceptance rate $a$ at which the speedup drops to exactly $1.0$ (speculation stops helping). Plot or tabulate the break-even $a$ against $k$ and explain the trend: why does a longer proposed draft demand a higher acceptance rate just to break even? Relate your answer to the $k\,D$ term in the denominator of the speedup formula.

Exercise 22.7.3: Continuous Batching Under a Memory Cap Analysis

Extend the continuous-batching simulator in Code 22.7.1 so that admitting a request requires free KV-cache pages, with a fixed total page budget and each sequence consuming pages proportional to its current length. When the budget is full, new requests must wait in the queue even if a slot is logically free. Analyze how utilization and makespan change as you tighten the page budget, and explain why this couples continuous batching to the paged KV cache of Section 22.5: what is the failure mode if the scheduler admits sequences it cannot find pages for?