Section 24.9: Inference Engines and Practice

"I have read every paper in this chapter. I just call them by their command-line flags now."
An Inference Engine With Too Many Backends

Big Picture

Every mechanism in this chapter (tensor and pipeline parallelism, paged and distributed KV cache, prefill/decode disaggregation, cross-node continuous batching, prefix caching, multi-LoRA, and expert-parallel MoE serving) ships today inside a small number of open inference engines, so the practical question is not how to build them but which engine to choose and how to configure it. vLLM, TensorRT-LLM, and SGLang each implement the same core ideas but make different bets on kernel compilation, prefix sharing, disaggregation support, and target hardware. This closing section maps the engine landscape onto the mechanisms you now understand, gives a decision guide for picking one, and folds the whole of Chapter 24 into a single takeaway. The runnable demo is a configurator that turns model size, context, service-level objectives (SLOs), and hardware into a recommended parallelism, disaggregation, and caching plan with an estimated node count.

The previous eight sections built distributed LLM serving from its parts. Section 24.2 split one model across devices with tensor parallelism; Section 24.3 stretched it across nodes with pipeline parallelism; Sections 24.4 and 24.5 made the KV cache paged, distributed, and disaggregated from prefill; Section 24.6 scheduled continuous batches across the fleet; Section 24.7 added prefix caching and multi-LoRA; and Section 24.8 served distributed mixture-of-experts (MoE) models with expert parallelism. None of those mechanisms is something you implement from scratch in production. They live inside inference engines, the serving runtimes that turn a model checkpoint and a request stream into tokens at a target latency and throughput. This section is the practitioner's map of that software layer: what the engines share, where they diverge, and how to choose. Figure 24.9.1 assembles the full stack the chapter has built, labeling each layer with the section that produced it, so the engine can be read as the place where those mechanisms are packaged behind a request API.

Figure 24.9.1: The distributed LLM serving stack the chapter assembles. Applications hit an OpenAI-compatible API; a router and continuous-batching scheduler admit and place requests using prefix-cache locality; the inference engine holds the parallelism, the paged and prefix-aware KV cache, and the MoE expert routing; and separate prefill and decode pools run over GPU nodes whose per-node efficiency is the Chapter 22 prerequisite. Each box names the section that built it.

1. What Every Engine Shares Beginner

The engines converged on a common core because the mechanisms in this chapter are not optional once you serve at scale; they are the only known way to hit competitive throughput and latency together. Every serious engine today does continuous (in-flight) batching rather than static batching, so a finished sequence frees its slot mid-step instead of stalling the batch. Every one pages the KV cache after the PagedAttention design of Chapter 22, so memory fragmentation no longer caps the batch size. Every one supports tensor parallelism within a node and most support pipeline parallelism across nodes, the splits of Chapter 16 applied to inference. And every one now exposes an OpenAI-compatible HTTP endpoint, so the application code above the engine does not change when you swap engines underneath.

This shared core is exactly the set of ideas the chapter built. An engine, in this framing, is a packaging of Chapter 24's mechanisms behind a request API, tuned and kernel-optimized for a particular hardware target. The differences that remain, and they are the ones that decide your choice, sit in four places: how kernels are produced (interpreted versus ahead-of-time compiled), how aggressively prefixes are shared, whether prefill/decode disaggregation is a first-class feature, and which accelerators are supported. We take the three leading open engines in turn.

Key Insight: The Engine Is the Chapter, Compiled

You will almost never write a custom KV-cache pager, a cross-node continuous-batching scheduler, or an expert-parallel all-to-all by hand in production. Those mechanisms are commodities now, implemented once inside vLLM, TensorRT-LLM, and SGLang and reused by everyone. The value of having built them by hand across this chapter is not that you will reimplement them; it is that the engine's configuration flags (tensor-parallel-size, enable-prefix-caching, kv-transfer-config, enable-expert-parallel) now read as the named mechanisms you understand, and you can predict how each one moves the latency-throughput frontier.

2. vLLM: The Open Default Beginner

vLLM is the engine most teams reach for first, and for good reason: it introduced PagedAttention, it ships continuous batching, tensor and pipeline parallelism, automatic prefix caching, and multi-LoRA serving (the LoRA fleets of Section 24.7) in one package, and it runs on NVIDIA, AMD, and other backends. Its design philosophy is broad hardware reach and fast iteration in Python with high-performance custom kernels underneath, rather than a heavyweight compile step. For most workloads it delivers strong throughput out of the box with a one-line launch, which is why it has become the de facto open baseline that other engines benchmark against.

Library Shortcut: One Command Turns On the Whole Chapter

The mechanisms that took eight sections to build are configuration flags on a single launch. vLLM serves a tensor-parallel, prefix-cached, multi-LoRA endpoint with:

# vLLM: tensor parallelism + automatic prefix caching + LoRA adapters
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \        # split the model across 8 GPUs (24.2)
    --enable-prefix-caching \         # share system-prompt prefixes (24.7)
    --enable-lora --max-loras 16 \    # serve 16 LoRA adapters on one base (24.7)
    --max-num-seqs 256                # continuous-batching width (24.6)

Code 24.9.1: A production-grade distributed serving endpoint in one command. The four flags map one-to-one onto Sections 24.2, 24.6, and 24.7; vLLM handles the paged KV cache, the cross-request scheduler, and the OpenAI-compatible API internally, collapsing what would be thousands of lines into a launch line.

The same hardware reach that makes vLLM the safe default also makes it the engine where new research lands first, because contributors can iterate without a compile gate. Prefill/decode disaggregation (Section 24.5) and expert-parallel MoE serving (Section 24.8) both arrived in vLLM as configurable features, which is the practical reason this book's mechanisms are reachable from a single tool.

3. TensorRT-LLM: Compiled Throughput on NVIDIA Intermediate

TensorRT-LLM is NVIDIA's engine, and its bet is compilation. Instead of dispatching kernels at runtime, it builds an optimized engine plan ahead of time: it fuses operations, selects kernels tuned for the exact GPU architecture, and bakes in the chosen precision (including FP8 on Hopper and later). On NVIDIA hardware this regularly yields the highest throughput and lowest latency of the three, because the compiled plan removes per-step dispatch overhead and exploits hardware features that a general runtime cannot assume. The cost is build complexity and rigidity: you compile an engine for a specific model, precision, parallelism degree, and batch profile, and changing any of those means rebuilding. It is the choice when you serve a fixed model at very high volume on NVIDIA GPUs and every percent of throughput is money.

Practical Example: Choosing an Engine for a 70B Assistant

Who: A platform engineer at a software company standing up an internal coding assistant on a reserved H100 cluster.

Situation: A single 70B model would serve every team behind one OpenAI-compatible endpoint, with a fixed checkpoint that changes roughly monthly.

Problem: The first vLLM deployment met latency targets but the cluster was billed flat, so cost per token, set by raw throughput, was the metric leadership cared about.

Dilemma: Stay on vLLM for its simplicity and fast adapter iteration, or move to TensorRT-LLM for higher compiled throughput at the price of a build pipeline and per-change recompiles.

Decision: They moved the steady-state base model to TensorRT-LLM with FP8 because the checkpoint was stable and the volume justified the build pipeline, and kept a vLLM pool for experimental LoRA variants that changed weekly.

How: They compiled a TensorRT-LLM engine for tensor-parallel degree 8 at FP8, served it behind the same router, and routed prefix-heavy traffic with caching enabled on both pools.

Result: Tokens-per-second per GPU on the stable model rose enough to retire several nodes, while the vLLM pool absorbed the churn the compiled engine could not cheaply follow.

Lesson: Compilation buys throughput when the model is stable and the volume is high; pair it with a flexible engine for the parts of the workload that change faster than you can recompile.

4. SGLang: Prefix Sharing and Structured Generation Intermediate

SGLang's distinctive contribution is RadixAttention, a prefix cache organized as a radix tree so that any shared prefix across requests, not only a fixed system prompt, is detected and reused automatically. This is the Section 24.7 idea pushed to its general form: branching conversations, few-shot templates, and agent traces that share long common stems get near-perfect cache reuse without the application managing it. SGLang pairs this with a frontend language for structured and programmatic generation (constrained decoding, control flow, parallel calls), which makes it especially fast for the templated, agentic, and tool-calling workloads of Part IV's large models when they are driven by the orchestration patterns of Part VI. For traffic dominated by shared prefixes and structured outputs it often leads on throughput; for unstructured, low-overlap traffic the advantage narrows.

Fun Note: The Tree That Remembers Everyone's Opening Line

A radix-tree prefix cache is a little like a theater prompter who has heard the same monologue ten thousand times. When the eleven-thousandth actor walks on and starts "To be, or not to be," the prompter does not re-read the script; it picks up exactly where the words stop matching. RadixAttention does the same to your prompts: the shared stem is computed once and the cache only diverges where the requests actually differ.

5. The Rest of the Field Beginner

Three engines do not exhaust the landscape. Hugging Face's Text Generation Inference (TGI) is a production server tightly integrated with the Hugging Face ecosystem and a common choice when you already live there. LMDeploy, from the OpenMMLab community, is known for strong quantized inference and its TurboMind kernels. DeepSpeed-Inference and the older FasterTransformer contributed kernels and ideas that the newer engines absorbed. And above all of these sit managed endpoints: hosted serving from cloud and model providers that run one of these engines for you behind an API, trading control and cost transparency for the elimination of every operational concern in this chapter. The managed option is the right one whenever your team's comparative advantage is the application, not the serving fleet.

These engines do not replace the generic serving frameworks of Chapter 23; they sit inside them. Ray Serve, KServe, or Triton handle autoscaling, multi-model routing, and rollout, and call an LLM engine as the per-replica runtime. And the engines feed upward into applications: the agentic systems of Part VI and the retrieval-augmented systems of Part VIII are clients of exactly the endpoints this section configures. The engine is one well-defined layer, bounded below by the per-node efficiency of Chapter 22 and above by the orchestration of Chapter 23.

6. A Decision Guide Intermediate

The choice reduces to four questions, ordered by how often they decide the outcome. First, hardware: if you are not on NVIDIA, TensorRT-LLM is out and vLLM's broad backend support leads. Second, stability versus flexibility: a fixed, high-volume model rewards TensorRT-LLM's compiled throughput, while churning models and frequent LoRA swaps reward vLLM's no-compile iteration. Third, traffic shape: heavy shared-prefix and structured/agentic traffic favors SGLang's RadixAttention; unstructured traffic narrows that edge. Fourth, team capacity: when serving is not your differentiator, a managed endpoint or TGI inside an existing stack beats running your own. Table 24.9.1 condenses the comparison; the configurator below turns these axes into a concrete recommendation.

Table 24.9.1: Leading LLM inference engines along the axes that decide a choice. All share continuous batching, paged KV cache, tensor parallelism, and an OpenAI-compatible API; the columns capture where they differ.

Engine	Kernel strategy	Prefix sharing	Hardware	Best fit
vLLM	Runtime kernels, no compile gate	Automatic prefix caching	NVIDIA, AMD, others	Open default; churning models, multi-LoRA, broad hardware
TensorRT-LLM	Ahead-of-time compiled engine plan	Supported	NVIDIA only	Fixed model, very high volume, top throughput on NVIDIA
SGLang	Runtime kernels	RadixAttention (general prefix tree)	NVIDIA, AMD	Shared-prefix, structured, agentic and templated traffic
TGI / LMDeploy	Runtime kernels (LMDeploy: TurboMind)	Supported	NVIDIA (LMDeploy strong on quantized)	Hugging Face stack (TGI); quantized serving (LMDeploy)
Managed endpoint	Provider's choice	Provider's choice	Provider's choice	When the application, not the fleet, is your differentiator

7. A Serving Configurator Intermediate

The decision guide picks an engine; you still have to size the deployment. The throughput of a decode step is set by memory bandwidth, because every generated token reads the model weights once across the shards that hold them. For a model of $P$ parameters in fp16 (2 bytes each) served over $\text{TP}\times\text{PP}$ GPUs of aggregate bandwidth $B$ bytes per second, the per-replica decode rate is approximately

$$\text{tok/s} \approx \frac{B}{2P}, \qquad \text{TPOT} \approx \frac{2P}{B},$$

and the tensor-parallel degree is the smallest split that lets the weights plus a context's worth of KV cache fit under a target fraction of each card's memory. The configurator below encodes exactly this reasoning: given model size, context length, SLOs (TTFT and time-per-output-token, TPOT), and a GPU, it recommends a parallelism, disaggregation, and caching plan and estimates throughput and node count. It is the same arithmetic the per-node models of Chapter 22 fed into the fleet-sizing of Chapter 23, now closed into a recommender.

GPU_DB = {  # name: (memory_GB, fp16_TFLOPs, hbm_GB_per_s)
    "A100-80GB": (80, 312, 2039),
    "H100-80GB": (80, 989, 3350),
}

def recommend(name, params_B, ctx_len, ttft_ms, tpot_ms, gpu,
              n_gpu_per_node, qps, kv_bytes_per_tok):
    mem, _tflops, hbm = GPU_DB[gpu]
    w_gb = params_B * 2 * 1.15                       # fp16 weights + overhead
    kv_gb_seq = ctx_len * kv_bytes_per_tok / 1e9     # KV cache for one sequence
    usable = mem * 0.70                              # leave head-room per card
    tp = 1                                           # tensor-parallel degree
    while w_gb / tp + kv_gb_seq > usable and tp < n_gpu_per_node:
        tp *= 2
    pp = 1                                           # pipeline-parallel degree
    while tp * pp < w_gb / usable:
        pp += 1
    disagg = (ctx_len >= 8000) and (ttft_ms <= 500)  # split prefill/decode (24.5)
    prefix_cache = ctx_len >= 2000                   # share long prefixes (24.7)
    tok_s = (hbm * 1e9 * tp * pp) / (params_B * 2 * 1e9)   # decode rate, B/(2P)
    tpot_pred = 1000.0 / tok_s
    req_s_replica = max(tok_s / 200.0, 0.1)          # ~200 gen tok/s per request
    replicas = max(1, -(-int(qps * 100) // int(req_s_replica * 100)))
    per_replica = tp * pp * (2 if disagg else 1)
    total = replicas * per_replica
    nodes = -(-total // n_gpu_per_node)
    print(f"{name}: TP={tp} PP={pp} disagg={'Y' if disagg else 'n'} "
          f"prefix={'Y' if prefix_cache else 'n'} "
          f"TPOT~{tpot_pred:.1f}ms({'ok' if tpot_pred<=tpot_ms else 'MISS'}) "
          f"{tok_s:.0f}tok/s x{replicas}rep -> {total} GPUs / {nodes} nodes")

recommend("13B chat",  13,  2048, 300, 40, "A100-80GB", 8, 20,   800_000)
recommend("70B RAG",   70, 16000, 400, 50, "H100-80GB", 8,  8, 2_500_000)
recommend("405B long", 405,32000, 800, 60, "H100-80GB", 8,  4, 4_000_000)

Code 24.9.2: A pure-Python serving configurator. It picks the tensor-parallel degree so weights plus KV cache fit per card, adds pipeline stages when one node cannot hold the weights, flags disaggregation for long-context tight-TTFT traffic and prefix caching for long prompts, then estimates the bandwidth-bound decode rate and the replica and node count to meet the target QPS.

13B chat: TP=1 PP=1 disagg=n prefix=Y TPOT~12.8ms(ok) 78tok/s x52rep -> 52 GPUs / 7 nodes
70B RAG: TP=8 PP=1 disagg=Y prefix=Y TPOT~5.2ms(ok) 191tok/s x9rep -> 144 GPUs / 18 nodes
405B long: TP=8 PP=3 disagg=n prefix=Y TPOT~10.1ms(ok) 99tok/s x9rep -> 216 GPUs / 27 nodes

Output 24.9.2: The recommender at work. The 13B chatbot fits on one GPU and scales by replication; the 70B RAG model needs full-node tensor parallelism and earns disaggregation from its long context and tight TTFT; the 405B model crosses node boundaries with three pipeline stages, and its looser 800ms TTFT budget leaves prefill and decode fused rather than split.

The three rows of Output 24.9.2 trace the whole chapter in one table. Replication alone handles a small model; tensor parallelism within a node handles a medium one; pipeline parallelism across nodes plus disaggregation and prefix caching handle the large ones, and every recommendation is just the mechanisms of Sections 24.2 through 24.8 selected by the binding constraint. A real engine refines these numbers with quantization, speculative decoding, and measured kernel timings, but the structure of the decision, which parallelism, whether to disaggregate, where caching pays, is exactly what the configurator captures.

Research Frontier: Where Engines Are Heading (2024 to 2026)

The engines are racing on the mechanisms this chapter named. Prefill/decode disaggregation, introduced by DistServe (Zhong et al., 2024) and Splitwise (Patel et al., 2024), moved from research into configurable vLLM and SGLang features, with KV transfer over RDMA the active engineering frontier. RadixAttention (Zheng et al., SGLang, 2024) generalized prefix caching to arbitrary shared stems and is now widely adopted. Speculative and multi-token decoding (Medusa, EAGLE, and the Llama-style draft-and-verify lineage) cut decode latency and are landing as engine flags. On the MoE side of Section 24.8, expert-parallel serving with all-to-all dispatch and DeepSeek-style large-expert deployment is the current throughput frontier on NVIDIA and AMD fleets. The through-line is that the open engines are converging on a shared mechanism set while differentiating on kernels, hardware, and how early each feature ships.

Key Takeaway: Chapter 24 in One Breath

Serving a large language model spans machines. When one model is too big for one device, tensor parallelism splits it within a node and pipeline parallelism stretches it across nodes (Sections 24.2, 24.3). Its memory, the KV cache, is paged and distributed across the fleet so batch size is not capped by fragmentation (Section 24.4). Because prefill is compute-bound and decode is bandwidth-bound, the two phases are disaggregated into separate pools that exchange the KV cache (Section 24.5). A cross-node continuous-batching scheduler keeps every GPU full by admitting and routing requests with prefix-cache locality (Sections 24.6, 24.7), and one base model serves many fine-tunes through multi-LoRA (Section 24.7). When the model is a mixture of experts, expert parallelism shards the experts across nodes and routes tokens with an all-to-all (Section 24.8). All of these ship today inside vLLM, TensorRT-LLM, and SGLang (Section 24.9), so the engineering task is to choose and configure the engine that matches your hardware, traffic, and rate of change. The per-node economics of Chapter 22, multiplied across this fleet, are what these mechanisms exist to manage.

That closes the distributed serving of a single large model. The next chapter turns to the other half of a production LLM system: the retrieval layer that feeds it context. Vector search and approximate nearest-neighbor indices are themselves distributed systems, sharded and replicated like everything else in this book, and Chapter 25 builds them.

Exercise 24.9.1: Read the Flags Conceptual

For each vLLM flag in Code 24.9.1 (--tensor-parallel-size, --enable-prefix-caching, --enable-lora, --max-num-seqs), name the chapter mechanism it switches on and the section that built it, then state in one sentence which serving metric (TTFT, TPOT, throughput, or memory ceiling) it most directly moves and in which direction. Which flag would you change first if profiling showed the GPUs idle between decode steps?

Exercise 24.9.2: Extend the Configurator Coding

Add quantization to Code 24.9.2. Introduce a weight_bytes argument (2 for fp16, 1 for fp8, 0.5 for int4) and use it in both the weight-memory and the decode-rate formulas, since fewer bytes per weight both relax the memory ceiling and raise the bandwidth-bound token rate. Re-run the three workloads at fp8 and report how TP, node count, and predicted TPOT change. Then explain why quantization can lower the tensor-parallel degree even when it leaves the SLOs unchanged.

Exercise 24.9.3: Pick the Engine Analysis

Using Table 24.9.1 and the decision guide of Section 6, choose an engine for each scenario and justify it in two sentences: (a) an agent platform whose requests share a 4,000-token system prompt and emit JSON-constrained tool calls; (b) a fixed 70B model billed at flat-rate on a reserved H100 cluster serving 50,000 requests per minute; (c) a research group on AMD GPUs swapping LoRA adapters daily; (d) a three-person startup whose product is the application, not the fleet. State which of the four decision axes (hardware, stability, traffic shape, team capacity) is decisive in each.

Project Ideas

1. Benchmark the latency-throughput frontier under caching and disaggregation. Serve one open model (for example an 8B instruct model) on vLLM and measure TTFT, TPOT, and tokens-per-second as you sweep batch width, with prefix caching off then on, and with prefill/decode fused then disaggregated. Plot the latency-throughput Pareto frontier for each configuration and quantify how much each mechanism of this chapter moves it. Reuse the load generator and metric definitions from Chapter 23.

2. Validate the configurator against reality. Take the recommender of Code 24.9.2, run its three recommended configurations on real hardware (or a cloud rental), and compare predicted TPOT and throughput against measured values. Identify where the bandwidth-bound model is optimistic (kernel overhead, attention cost growing with context) and add correction terms until predictions land within twenty percent.

3. A multi-LoRA, prefix-shared agent backend. Stand up an SGLang or vLLM endpoint that serves one base model with several LoRA adapters and a long shared system prompt, then drive it with branching agent traces that share stems. Measure the cache hit rate that RadixAttention or automatic prefix caching achieves and the throughput gain over a no-cache baseline, and report how it degrades as the traces diverge earlier.