Section 22.8: Compilation and Kernel Optimization

"I launched ten thousand tiny kernels last night, one per elementwise op, and each one waited politely for Python to call its name. The GPU spent the evening idle and bored."
A Worker That Lost Its Coordinator

Big Picture

An eager-mode model leaves most of its accelerator unused: it dispatches a long parade of small kernels one Python call at a time, and each kernel makes its own round trip to high-bandwidth memory instead of sharing data with its neighbors. Compilation closes that gap. You trace the model into a graph, fuse chains of operations into single kernels that read and write memory once, fold constants, auto-tune the resulting kernels for the exact shapes you serve, and replay the whole sequence with the launch overhead removed. This is a scale-up technique: it makes one node faster without adding a single machine. But because the same node is replicated across the serving fleet, a 1.5x or 2x per-node speedup multiplies into 1.5x or 2x fewer GPUs for the whole deployment, which is why a chapter about distribution spends a section on a compiler. This section shows where the time goes, why fusion attacks the very same memory-bound bottleneck that Section 22.6 attacked with FlashAttention, and which tool to reach for.

The previous section made each decoding step do less wasted work through continuous batching and speculative decoding. This section attacks a different waste, the gap between the arithmetic the GPU could perform and the arithmetic it actually performs, and the cause is mostly mechanical. A model written in eager mode runs one operator at a time: the Python interpreter calls into the framework, the framework launches a kernel on the GPU, the kernel runs, control returns to Python, and the next operator begins. For a transformer block with dozens of small elementwise operations between the big matrix multiplies, this means dozens of separate kernel launches, each with its own fixed dispatch cost, and dozens of separate trips to memory to read an input and write an output that the very next operation will immediately read back. The GPU spends much of its time waiting: for Python, for the launch queue, for memory. Compilation removes those waits by turning the model from a sequence of interpreter-driven calls into an optimized, precompiled graph.

Figure 22.8.1: Why fusion is a memory-traffic optimization. In the unfused chain (top) each of the four elementwise operations is a separate kernel that reads its input from HBM and writes its output back, and the final residual add reads one extra side input, so tensor-sized data crosses the memory bus nine times. The fused kernel (bottom) reads the two inputs once, keeps every intermediate in on-chip registers, and writes the result once, three crossings in total. For a memory-bound chain, runtime is set by these crossings, so cutting them from nine to three is a 3x reduction in the time this chain costs. The from-scratch model in Code 22.8.1 makes the same count concrete.

1. Where Eager Mode Wastes the GPU Beginner

Two distinct costs hide inside eager execution, and compilation removes them by different mechanisms, so it pays to separate them. The first is launch overhead. Every kernel dispatch carries a fixed cost, on the order of microseconds, to enqueue the work, and that cost is paid whether the kernel touches a million elements or a thousand. When a model is dominated by large matrix multiplies the overhead is negligible, but a transformer block is full of cheap elementwise operations (bias additions, activation functions, normalizations, residual additions) where the launch can cost as much as the computation. The second cost is memory traffic. Each elementwise kernel reads its input tensor from high-bandwidth memory and writes its output back, and the next kernel immediately reads that output again. The data ping-pongs to HBM and back across the whole chain, even though it never needed to leave the chip.

That second cost is the one that ties this section to the roofline model of Section 3.7. An elementwise chain has almost no arithmetic per byte moved, so it sits deep in the memory-bound region of the roofline: its runtime is governed entirely by how many bytes cross the memory bus, not by how many floating-point operations the GPU can do. The GPU's arithmetic units sit nearly idle while the memory system does all the work. This is exactly the diagnosis that Section 22.6 gave for naive attention, and it has exactly the same cure: stop moving the data so many times.
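The roofline argument can be made concrete with a few lines of arithmetic. The sketch below computes the arithmetic intensity (FLOPs per byte moved) of one elementwise operation and of a square matrix multiply, then compares each with the machine's balance point. The 2 TB/s bandwidth is the same round figure used in this section's traffic model; the 1000 TFLOP/s peak is an illustrative assumption, not the specification of any particular GPU, and the helper names are hypothetical.

```python
# Roofline sketch: is an operation memory-bound or compute-bound?
# Hardware numbers are illustrative assumptions, not a specific GPU's datasheet.
PEAK_FLOPS = 1000e12            # assumed peak low-precision throughput, FLOP/s
HBM_BW = 2e12                   # ~2 TB/s memory bandwidth, bytes/s
BALANCE = PEAK_FLOPS / HBM_BW   # FLOP per byte needed to keep the ALUs busy


def attainable_flops(intensity):
    """Roofline: performance is capped by compute or by bandwidth * intensity."""
    return min(PEAK_FLOPS, HBM_BW * intensity)


def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    # One read and one write per element, e.g. a bias add in fp16.
    return flops_per_elem / (2 * bytes_per_elem)


def matmul_intensity(m, bytes_per_elem=2):
    # Square m x m x m matmul: 2*m^3 FLOPs, three m x m matrices each moved once.
    return (2 * m**3) / (3 * m * m * bytes_per_elem)


for name, ai in [("bias_add", elementwise_intensity()),
                 ("matmul 4096", matmul_intensity(4096))]:
    regime = "memory-bound" if ai < BALANCE else "compute-bound"
    frac = attainable_flops(ai) / PEAK_FLOPS
    print(f"{name:12s} intensity={ai:8.2f} FLOP/B  {regime:13s} "
          f"fraction of peak={frac:.2%}")
```

Under these assumed numbers the bias add reaches 0.25 FLOP per byte against a balance point of 500, so it can use only a tiny fraction of the arithmetic peak no matter how well the kernel is written; the large matmul clears the balance point comfortably. That asymmetry is why the elementwise chain, not the matmul, is the fusion target.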

Key Insight: Fusion and FlashAttention Solve the Same Problem

FlashAttention (Section 22.6) and operator fusion are the same idea applied to different chains. Both notice that an operation sequence is memory-bound, that its intermediate results are being written to HBM only to be read straight back, and both keep those intermediates in fast on-chip memory so the bus is crossed a handful of times instead of dozens. FlashAttention is a hand-written fused kernel for the specific softmax-attention chain; a compiler such as TorchInductor or TensorRT is a machine that discovers and emits such fused kernels automatically for whatever chains your model happens to contain. When you understand fusion, FlashAttention stops being a special trick and becomes the canonical example of a general principle.

2. A Memory-Traffic Model for Fusion Intermediate

The benefit of fusion is countable, and the count is simple enough to do in closed form before running anything. Consider a chain of $n$ elementwise operations applied to a tensor of $B$ bytes. In eager mode each operation is a kernel that reads $B$ bytes and writes $B$ bytes, so the chain moves

$$T_{\text{unfused}} = 2 n B \;\text{bytes} \quad (\text{plus any extra input tensors a stage reads}).$$

A single fused kernel reads each distinct input once and writes the single final output once. For a chain that consumes one streaming tensor plus a constant number $c$ of side inputs (a residual tensor, a bias vector), the fused traffic is the one streaming read, the $c$ side reads, and the single output write,

$$T_{\text{fused}} = (1 + c + 1)\,B = (2 + c)\,B \;\text{bytes},$$

independent of $n$, because every intermediate now lives in registers and never reaches HBM. In the memory-bound regime where runtime equals traffic divided by memory bandwidth $\beta$, the speedup is the ratio of the two traffic figures,

$$\text{speedup} = \frac{T_{\text{unfused}}}{T_{\text{fused}}} = \frac{2 n B + cB}{(2 + c)B} = \frac{2n + c}{2 + c},$$

which grows with the length $n$ of the fused chain. A longer chain of cheap operations is a better fusion target, because more redundant round trips collapse into the one pass. Code 22.8.1 evaluates this model for a four-op chain with a single residual side input ($n = 4$, $c = 1$), counts the traffic each version would incur, and runs the same operations in NumPy to check that the result is unchanged.

import numpy as np

N = 8_000_000          # elements in the activation tensor
bytes_per_elem = 2     # modeled fp16/bf16 activations (the arrays below use float32 only to run the math)
ops = ["bias_add", "scale", "gelu", "residual_add"]  # 4 elementwise stages
n_ops = len(ops)

x = np.ones(N, dtype=np.float32)
res = np.full(N, 0.5, dtype=np.float32)

# Unfused (eager): every op is its own kernel, each a full read + write.
def unfused(x, res):
    a = x + 0.1                 # bias_add : read x, write a
    b = a * 1.2                 # scale    : read a, write b
    c = 0.5 * b * (1.0 + np.tanh(0.7978845608 * (b + 0.044715 * b**3)))  # gelu (tanh approximation)
    d = c + res                 # residual_add : read c + res, write d
    return d

# Fused: a real fused kernel computes the same values but keeps a, b, c in
# registers and never writes them to HBM. NumPy cannot truly fuse, so the
# "fused" result reuses the same function (the arithmetic is identical by
# construction) and we model the traffic each version WOULD incur on a GPU.
out_u = unfused(x, res)
out_f = unfused(x, res)

elem_bytes = N * bytes_per_elem
unfused_traffic = (2 * n_ops + 1) * elem_bytes   # 2 per op (R+W) + 1 extra residual read
fused_traffic   = 3 * elem_bytes                 # read x, read res, write out

hbm_GBs = 2000.0                                  # ~2 TB/s, a modern data-center GPU
t_unfused_us = unfused_traffic / (hbm_GBs * 1e9) * 1e6
t_fused_us   = fused_traffic   / (hbm_GBs * 1e9) * 1e6

print("elements per tensor      :", f"{N:,}")
print("elementwise ops in chain :", n_ops)
print("max abs result diff      :", f"{np.max(np.abs(out_u - out_f)):.2e}")
print("unfused HBM traffic (MB) :", f"{unfused_traffic/1e6:.1f}")
print("fused   HBM traffic (MB) :", f"{fused_traffic/1e6:.1f}")
print("traffic reduction        :", f"{unfused_traffic/fused_traffic:.2f}x")
print("unfused time @2TB/s (us) :", f"{t_unfused_us:.1f}")
print("fused   time @2TB/s (us) :", f"{t_fused_us:.1f}")
print("memory-bound speedup     :", f"{t_unfused_us/t_fused_us:.2f}x")

Code 22.8.1: A pure-NumPy model of the fusion payoff. It runs the four-op chain to show the numbers are identical whether you fuse or not, then counts the HBM bytes each version moves and turns that traffic into a memory-bound runtime estimate. No GPU and no compiler are required to see why fusion wins.

elements per tensor      : 8,000,000
elementwise ops in chain : 4
max abs result diff      : 0.00e+00
unfused HBM traffic (MB) : 144.0
fused   HBM traffic (MB) : 48.0
traffic reduction        : 3.00x
unfused time @2TB/s (us) : 72.0
fused   time @2TB/s (us) : 24.0
memory-bound speedup     : 3.00x

Output 22.8.1: The fused result is bit-for-bit identical to the unfused one (difference $0$), yet it moves 48 MB instead of 144 MB and would finish in 24 microseconds instead of 72, a 3x memory-bound speedup. Fusion buys a real wall-clock win while changing nothing about the answer, the same bargain that data parallelism offered for gradients in Section 1.1.

The closed-form prediction $(2n + c)/(2 + c) = (8 + 1)/(2 + 1) = 3.0$ matches the counted traffic ratio exactly, which is the point: fusion is not a mysterious compiler benefit but an accounting identity about how many times a tensor crosses the memory bus. A real compiler does this for hundreds of chains across a whole model and stacks the savings.

3. Graph Capture and the Compilation Pipeline Intermediate

To fuse operations a compiler first needs to see them all at once, which eager execution never allows because it only ever knows about the single operation in front of it. The first stage of every compilation pipeline is therefore graph capture: trace the model with a representative input and record the operations and their data dependencies as a graph rather than executing them immediately. Once the graph exists, a sequence of standard optimizations runs over it. Operator fusion combines adjacent elementwise operations into single kernels, the saving we just modeled. Constant folding precomputes any subgraph whose inputs are all known at compile time, so the work never happens at run time. Layout and precision passes choose memory formats and cast to lower precision where it is safe. Finally kernel auto-tuning searches over implementation choices (tile sizes, thread-block shapes, loop orders) for each kernel and the exact tensor shapes you will serve, keeping the fastest variant. The output is a compiled artifact, an engine, that replays the optimized graph with none of the per-operation interpreter overhead.
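The capture-then-optimize flow can be sketched on a toy graph. The listing below is a minimal illustration, not any real compiler's intermediate representation: the node format, the `fold_constants` and `fuse_elementwise` helpers, and the op names are all hypothetical. It also assumes the captured graph is a simple linear chain in list order, whereas a real fusion pass checks data dependencies before grouping nodes.

```python
# Toy graph IR: each node is (name, op, inputs). Inputs are node names (str)
# or Python floats (compile-time constants). Hypothetical, for illustration.
ELEMENTWISE = {"add", "mul", "gelu"}

graph = [
    ("scale", "mul", [1.2, 0.5]),       # both inputs constant, so foldable
    ("h0", "matmul", ["x", "w"]),
    ("h1", "add", ["h0", "bias"]),
    ("h2", "mul", ["h1", "scale"]),
    ("h3", "gelu", ["h2"]),
    ("h4", "add", ["h3", "res"]),
]


def fold_constants(nodes):
    """Evaluate add/mul nodes whose inputs are all known at compile time."""
    consts, out = {}, []
    for name, op, inputs in nodes:
        # Substitute any input that an earlier fold turned into a constant.
        inputs = [consts.get(i, i) if isinstance(i, str) else i for i in inputs]
        if op in ("add", "mul") and all(isinstance(i, float) for i in inputs):
            a, b = inputs
            consts[name] = a + b if op == "add" else a * b
        else:
            out.append((name, op, inputs))
    return out


def fuse_elementwise(nodes):
    """Group each run of adjacent elementwise nodes into one fused kernel."""
    kernels, run = [], []
    for name, op, _ in nodes:
        if op in ELEMENTWISE:
            run.append(name)
            continue
        if run:
            kernels.append(("fused", run))
            run = []
        kernels.append((op, [name]))
    if run:
        kernels.append(("fused", run))
    return kernels


folded = fold_constants(graph)
kernels = fuse_elementwise(folded)
print("nodes before:", len(graph), "| kernels after:", len(kernels))
for kind, members in kernels:
    print(f"  {kind:6s} <- {members}")
```

Six captured nodes become two kernels: the constant multiply disappears at compile time, the matmul stays on its own, and the four elementwise stages that follow it collapse into one fused kernel, which is the situation the traffic model above priced.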

Two further tools round out the pipeline. CUDA graphs attack the launch overhead directly rather than the memory traffic: they record an entire sequence of kernel launches once and then replay the whole sequence with a single submission, so the thousands of tiny dispatches in a decoding step cost one launch instead of thousands. This matters most for the small-batch, latency-sensitive decoding that dominates LLM serving, where launch overhead is a large fraction of step time. ONNX, the Open Neural Network Exchange format, addresses portability rather than speed. It is the neutral interchange graph: you export a traced model to ONNX once, then any ONNX-aware runtime or compiler can ingest it, which is how a model trained in one framework reaches an optimized engine built by another.
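A rough cost model shows why replaying a recorded launch sequence helps most when kernels are tiny. Every number below is an assumption chosen for illustration (a 5 microsecond dispatch cost, a 2 microsecond floor on kernel execution, 2 TB/s of bandwidth), and the function names are hypothetical; a profiler on your own hardware is the only reliable source for the real values.

```python
# Launch-overhead model for one decoding step. All constants are assumptions
# for illustration; measure your own with a profiler.
LAUNCH_US = 5.0            # assumed fixed host cost per kernel dispatch
KERNEL_FLOOR_US = 2.0      # assumed minimum GPU time for any kernel
HBM_BYTES_PER_US = 2e6     # ~2 TB/s expressed in bytes per microsecond


def kernel_time_us(bytes_moved):
    """Memory-bound kernel: traffic over bandwidth, but never below the floor."""
    return max(KERNEL_FLOOR_US, bytes_moved / HBM_BYTES_PER_US)


def eager_step_us(kernel_bytes):
    # Each dispatch costs at least LAUNCH_US on the host; the GPU idles
    # whenever a kernel finishes before the next one has been enqueued.
    return sum(max(LAUNCH_US, kernel_time_us(b)) for b in kernel_bytes)


def graph_replay_step_us(kernel_bytes):
    # One submission for the whole recorded sequence, then back-to-back kernels.
    return LAUNCH_US + sum(kernel_time_us(b) for b in kernel_bytes)


# Below this size, moving the data takes less time than launching the kernel.
break_even_bytes = LAUNCH_US * HBM_BYTES_PER_US

# A small-batch decode step modeled as 1,000 kernels moving 64 KB each.
step = [64 * 1024] * 1000
eager = eager_step_us(step)
replay = graph_replay_step_us(step)

print(f"break-even kernel size : {break_even_bytes / 1e6:.1f} MB")
print(f"eager step             : {eager:.1f} us")
print(f"graph replay step      : {replay:.1f} us")
print(f"speedup                : {eager / replay:.2f}x")
```

With these assumptions a kernel has to move about 10 MB before its data movement outweighs its launch, so a step made of 64 KB kernels is paced entirely by the host. Replaying the recorded sequence removes that pacing, and the modeled step shrinks by roughly 2.5x. The exact ratio depends on the assumed constants; the shape of the result does not.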

Library Shortcut: torch.compile Wraps the Whole Pipeline in One Line

Code 22.8.1 modeled fusion by hand. In PyTorch you do not write fused kernels or manage graph capture yourself; torch.compile traces the model, hands the graph to the TorchInductor backend, and TorchInductor performs the fusion, layout selection, and kernel generation (emitting Triton kernels on GPU) automatically:

import torch

# build_model() and example_input are placeholders for your own model
# constructor and a representative input batch already on the GPU.
model = build_model().eval().cuda()

# One wrapper. Tracing + fusion + autotuning happen on the first call(s);
# afterwards the compiled graph replaces eager dispatch.
fast_model = torch.compile(model, mode="max-autotune")

with torch.inference_mode():
    y = fast_model(example_input)   # first call compiles; later calls are fast

# Alternative mode aimed at decode-step launch overhead: "reduce-overhead"
# replays the compiled kernels through CUDA graphs.
low_latency_model = torch.compile(model, mode="reduce-overhead")

Code 22.8.2: The same fusion that Code 22.8.1 modeled by hand, now obtained from a single torch.compile wrapper. The hundreds of lines a hand-written fused kernel plus its autotuner would take collapse to one call; torch.compile's TorchDynamo front end captures the graph, TorchInductor handles fusion grouping, Triton code generation, and the autotuning search, and mode="reduce-overhead" adds CUDA-graph capture to remove launch overhead.

4. Choosing a Tool, and What It Costs You Advanced

The compilation tools form a spectrum from flexible to rigid, and the right choice is governed by how much you are willing to freeze. At the flexible end, torch.compile stays inside PyTorch, recompiles transparently when input shapes change, and falls back to eager execution for anything it cannot trace, so it costs you almost nothing in workflow and typically buys a moderate speedup. At the rigid end, TensorRT (and TensorRT-LLM, its LLM-specialized sibling) compiles an ahead-of-time engine tuned for a specific GPU, a specific precision, and a specific range of input shapes, and squeezes out the most performance precisely because it can assume so much. The price of that assumption is flexibility: a TensorRT engine is a hardware-specific binary. Build it for an H100 and, by default, it will not run on an A100; change your sequence-length range and you rebuild. ONNX Runtime sits in between, ingesting the neutral ONNX graph and applying its own graph optimizations and execution providers across more hardware backends.

This is the build-versus-flexibility trade-off, and it is the same shape as every ahead-of-time compilation decision in computing: the more you specialize for one target, the faster you run on that target and the worse you port to any other. Hardware-specific compilation is where the biggest per-node wins live, but it is also where portability goes to die, which is exactly why ONNX exists as a neutral waypoint. A common production pattern is to keep the model in ONNX as the durable, portable source of truth and to compile a disposable, hardware-specific engine from it for each accelerator generation you actually deploy on, rebuilding the engine rather than the model when the fleet's hardware changes.
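One way to picture the "portable master, disposable engine" pattern is as a cache keyed by everything the engine is frozen against. The sketch below is a hypothetical bookkeeping scheme, not part of ONNX or TensorRT: the `EngineTarget` fields, the `engine_cache_key` helper, and the architecture labels are illustrative names, and the model bytes are a stand-in for a real serialized file.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineTarget:
    """Everything a hardware-specific engine is frozen against (hypothetical)."""
    gpu_arch: str      # label for the accelerator generation, e.g. "sm90"
    precision: str     # e.g. "fp16"
    min_shape: tuple   # smallest input shape the engine must accept
    max_shape: tuple   # largest input shape the engine must accept


def engine_cache_key(onnx_bytes: bytes, target: EngineTarget) -> str:
    """Key a compiled engine by the portable model plus its build target."""
    h = hashlib.sha256()
    h.update(onnx_bytes)
    h.update(repr(target).encode("utf-8"))
    return h.hexdigest()[:16]


# Stand-in for the bytes of the ONNX file kept in source control.
model_bytes = b"stand-in for the serialized ONNX model"

h100 = EngineTarget("sm90", "fp16", (1, 3, 224, 224), (32, 3, 224, 224))
a100 = EngineTarget("sm80", "fp16", (1, 3, 224, 224), (32, 3, 224, 224))

key_h100 = engine_cache_key(model_bytes, h100)
key_a100 = engine_cache_key(model_bytes, a100)
print("H100 engine key:", key_h100)
print("A100 engine key:", key_a100)
print("rebuild needed for new hardware:", key_h100 != key_a100)
```

Because the key covers the model, the architecture, the precision, and the shape range, changing any one of them yields a new key and therefore a rebuild, while the ONNX file itself stays untouched.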

Practical Example: The Engine That Paid for Itself in Fleet Size

Who: An inference platform engineer running a fixed-shape image embedding model behind a recommendation service.

Situation: The model served a steady stream of 224x224 images at a fixed batch size on a fleet of 40 GPUs, and the bill was the team's largest line item.

Problem: Profiling showed the GPUs at roughly 45 percent utilization; the model was a chain of convolutions and elementwise operations dispatched eagerly, drowning in launch overhead and HBM round trips.

Dilemma: Stay in eager PyTorch for maximum flexibility and accept the waste, wrap it in torch.compile for an easy moderate win, or export to ONNX and build a rigid TensorRT engine for the largest win at the cost of a per-hardware rebuild.

Decision: Because the shapes were genuinely fixed and the hardware fleet was homogeneous, they exported to ONNX and built a TensorRT engine; the rigidity that would hurt a research workflow cost them nothing here.

How: They traced the model to ONNX, built a fp16 TensorRT engine for their exact shape and GPU, kept the ONNX file in source control as the portable master, and wired the build step into CI so a hardware change triggers an engine rebuild.

Result: Per-node throughput rose about 1.8x with identical embeddings; the same request volume now fit on 23 GPUs instead of 40, and the per-node win multiplied directly into 17 fewer machines across the fleet.

Lesson: Specialize only what is genuinely stable. Fixed shapes and homogeneous hardware are exactly the conditions under which the rigid, hardware-specific engine is free money; under churn, the flexible compiler is the safer default.
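The fleet arithmetic in this example is a single rounded-up division. The sketch below reproduces it; `gpus_needed` is a hypothetical helper, and it assumes the request volume and the utilization headroom per GPU stay the same before and after the speedup.

```python
import math


def gpus_needed(current_gpus: int, per_node_speedup: float) -> int:
    """GPUs required for the same request volume after a per-node speedup."""
    # Round up: a fractional GPU still means provisioning a whole one.
    return math.ceil(current_gpus / per_node_speedup)


before = 40
after = gpus_needed(before, 1.8)
print(f"GPUs after a 1.8x speedup: {after} (saves {before - after})")
```

Dividing 40 by 1.8 gives about 22.2, which rounds up to 23 GPUs and a saving of 17, matching the result reported above.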

Fun Note: The Compiler That Warms Up Slowly

The first call to a compiled model is often dramatically slower than eager, sometimes by seconds, because that is when tracing, fusion, and the autotuning search actually happen. Engineers new to torch.compile regularly benchmark the very first inference, conclude the compiler made things worse, and quietly delete the wrapper. The fix is to discard the warm-up calls and measure steady state, the same discipline you would apply to a JIT-compiled language. Compilation pays off across millions of requests, not across the first one.
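The warm-up discipline is easy to encode in a small timing harness. The sketch below is one possible shape for it, not a standard API: `benchmark` and `FakeCompiled` are hypothetical names, and the fake model simply sleeps on its first call to imitate compilation. When timing real GPU work you would also need to synchronize the device (for example with torch.cuda.synchronize()) before reading the clock, because kernel launches are asynchronous.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20):
    """Return (first_call_seconds, steady_state_median_seconds) for fn(*args)."""
    t0 = time.perf_counter()
    fn(*args)                        # first call: may include compilation
    first = time.perf_counter() - t0

    for _ in range(warmup - 1):      # remaining warm-up calls, discarded
        fn(*args)

    samples = []
    for _ in range(iters):           # steady state: these are what we report
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return first, statistics.median(samples)


class FakeCompiled:
    """Stand-in for a compiled model: slow on the first call, fast afterwards."""

    def __init__(self):
        self.compiled = False

    def __call__(self, x):
        if not self.compiled:
            time.sleep(0.05)         # pretend tracing + autotuning happen here
            self.compiled = True
        return x * 2


first, steady = benchmark(FakeCompiled(), 21)
print(f"first call   : {first * 1e3:.2f} ms")
print(f"steady state : {steady * 1e6:.2f} us (median)")
```

Reporting the first call and the steady-state median separately keeps the compile cost visible without letting it contaminate the number that matters for serving.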

5. Composing Compilation with Quantization Advanced

Compilation does not compete with the other per-node techniques in this chapter; it composes with them, and the composition with quantization (Section 22.2) is the most important to get right. A quantized model stores its weights and often its activations in low precision, which already cuts the bytes moved per operation. Compiling that quantized graph stacks a second, independent saving on top: the compiler fuses the dequantize, compute, and requantize steps into single kernels so the low-precision data is never expanded to full precision in HBM between operations, and it emits integer or low-precision kernels tuned for the quantized shapes. The order that works in practice is to quantize first and compile second, handing the compiler a graph that already carries the low-precision operators so its fusion and autotuning passes optimize the actual arithmetic you will run. TensorRT-LLM and the int8 and fp8 paths in modern serving stacks are built precisely around this quantize-then-compile composition, which is why production LLM engines reach throughput that neither technique delivers alone.
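The same byte-counting used earlier in this section shows why the compile step matters for a quantized graph. The sketch below models per-element HBM traffic for one int8 elementwise stage; the three-kernel decomposition and the byte widths are modeling assumptions, not measurements of any particular serving stack.

```python
# Per-element HBM traffic for one quantized elementwise stage (modeled).
INT8, FP16 = 1, 2   # bytes per element

# Unfused: dequantize, compute, and requantize run as three separate kernels,
# each reading its input from HBM and writing its output back.
unfused_stages = [
    ("dequantize", INT8, FP16),   # read int8, write fp16
    ("compute",    FP16, FP16),   # read fp16, write fp16
    ("requantize", FP16, INT8),   # read fp16, write int8
]
unfused_bytes = sum(read + write for _, read, write in unfused_stages)

# Fused: one kernel reads int8 and writes int8; the fp16 values stay on chip.
fused_bytes = INT8 + INT8

# Reference point: the same single compute op on unquantized fp16 data.
fp16_bytes = FP16 + FP16

print(f"quantized, unfused : {unfused_bytes} bytes/element")
print(f"unquantized fp16   : {fp16_bytes} bytes/element")
print(f"quantized, fused   : {fused_bytes} bytes/element")
print(f"fusion gain on the quantized graph: {unfused_bytes / fused_bytes:.1f}x")
```

In this model the quantized but unfused stage moves 10 bytes per element, more than the 4 bytes of the plain fp16 op, because the expanded intermediates make round trips through HBM. Only after fusion does the stage reach the 2 bytes per element that quantization promised, which is the byte-level reason to quantize first and compile second.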

Research Frontier: Compilers Catch Up to Hand-Written Kernels (2024 to 2026)

Two threads define the current frontier. The first is the maturation of torch.compile and TorchInductor: over the PyTorch 2.x releases it has moved from an experiment to a widely used fast path, while TensorRT-LLM and vendor inference stacks routinely report large fp8 and int4 serving gains from quantize-then-compile pipelines on Hopper and Blackwell GPUs. The second is the rise of high-level kernel languages that blur the line between writing a kernel and compiling one. Triton lets researchers express fused kernels in Python and have them lowered to fast GPU code, and it is the language in which TorchInductor itself emits its GPU kernels; Mojo and the broader MLIR-based compiler ecosystem push toward a single language that spans portable model code and hardware-tuned kernels. The open question the field is chasing is whether an automatic compiler can consistently match a hand-written FlashAttention-class kernel; for an increasing share of memory-bound chains the answer is now yes, which is steadily shrinking the set of operations that still demand hand tuning. We will see these compiled engines multiplied across machines through the fleet-sizing arithmetic of Section 22.9, which turns one node's compiled throughput into a machine count for the whole deployment.

With compilation, this chapter's per-node toolkit is complete: quantization shrinks the bytes, FlashAttention and continuous batching reorganize the heavy operations, and compilation removes the interpreter and the redundant memory traffic that sit between them. Each technique makes one node do more, and every one of those single-node gains is about to be multiplied. Section 22.9 takes the per-node throughput and latency numbers these sections produced and turns them into the central economic question of distributed serving: given a request volume and a latency budget, how many of these optimized nodes does the fleet actually need?

Exercise 22.8.1: Read the Fusion Model Conceptual

Using the traffic model of Section 2, explain why fusing a chain of two elementwise operations buys far less than fusing a chain of ten, even though both remove the same per-op launch overhead. Then argue why fusion gives almost no benefit when the chain is a single large matrix multiply with no neighbors: identify which term in $T_{\text{unfused}} = 2nB$ collapses, and connect your answer to where that operation sits on the roofline of Section 3.7.

Exercise 22.8.2: Extend the Traffic Counter Coding

Modify Code 22.8.1 so the chain length $n$ is a parameter and the side-input count $c$ is configurable. Sweep $n$ from 1 to 16 with $c = 1$, print the predicted memory-bound speedup $(2n + c)/(2 + c)$ for each, and confirm it matches the traffic-ratio computation. Then add a second regime where each op also performs heavy arithmetic (raise the FLOPs per element until the chain becomes compute-bound) and show that once the operation leaves the memory-bound region, the fusion speedup flattens out. Explain what changed.

Exercise 22.8.3: Build versus Flexibility Analysis

You serve one model on a fleet that mixes two GPU generations and whose request shapes vary widely by hour. Compare three strategies (eager PyTorch, torch.compile, and an ahead-of-time TensorRT engine per GPU generation) along four axes: peak per-node throughput, engineering and rebuild cost when hardware or shapes change, portability across the mixed fleet, and risk of a shape the engine was not built for. Recommend one strategy and state the single fleet condition that, if it changed, would flip your recommendation.