"They gave me a thousand teraflops and then fed me one byte at a time. I have never been so idle while looking so busy."
A GPU Idling on a Communication Barrier
Before you spread a workload across a hundred machines, you should know how much performance one machine can actually deliver, because distribution multiplies the per-node number and never repairs it. A single accelerator advertises an enormous peak compute rate, but that rate is only reachable when the operation brings enough arithmetic for every byte it reads from memory. The roofline model captures this with one ratio and one diagram: plot how fast an operation can possibly run against how much arithmetic it does per byte moved, and a sloped memory roof and a flat compute roof tell you at a glance whether the operation is starved for data or limited by raw math. This section builds that model, derives the ridge point that separates the two regimes, and shows why matrix multiply sits comfortably under the compute roof while normalization, elementwise ops, and attention's memory traffic crash into the memory roof, which is exactly the gap that fused kernels and FlashAttention exist to close.
The previous sections of this chapter measured scaling in the large: Amdahl's and Gustafson's laws (Section 3.5) bounded the speedup from many machines, and the work-depth model (Section 3.6) bounded the parallelism available in an algorithm. Both took the speed of a single processor as a given constant. This section opens that constant up. A modern accelerator is not one number; it is two ceilings, a compute rate and a memory bandwidth, and which one binds depends entirely on the operation you ask it to run. Getting this per-node picture right matters for the whole book, because every parallel method we study later replicates or shards a per-node kernel, and a kernel running at five percent of peak stays at five percent of peak no matter how many nodes you buy.
The tool for the per-node picture is the roofline model, introduced by Williams, Waterman, and Patterson in 2009 and now the standard lens for accelerator performance. It rests on a single quantity that decides an operation's fate: how much arithmetic it performs for each byte it moves to or from memory.
1. Arithmetic Intensity: Work Per Byte Beginner
Every operation an accelerator runs does two things: it performs floating-point arithmetic, and it moves data between off-chip memory and the compute units. The decisive ratio between them is the arithmetic intensity, the number of floating-point operations performed for each byte of memory traffic,
$$I = \frac{\text{FLOPs performed}}{\text{bytes moved to and from memory}}, \qquad [I] = \frac{\text{FLOP}}{\text{byte}}.$$Arithmetic intensity is a property of the operation and its data layout, not of the hardware. A dense matrix multiply reuses each loaded value many times and so does a great deal of arithmetic per byte; an elementwise addition touches each value once and does almost no arithmetic per byte. The hardware enters through two peak rates that the vendor publishes: the peak compute rate $\pi$ in FLOP/s, and the peak memory bandwidth $\beta$ in bytes/s. The roofline model combines the operation's $I$ with the hardware's $\pi$ and $\beta$ into a single attainable performance.
An operation's arithmetic intensity $I$ is the lever that determines its fate on a given accelerator. If $I$ is high, the operation feeds the compute units faster than it drains memory, and raw FLOP/s is the limit. If $I$ is low, the operation runs out of data to chew on and the memory system is the limit, leaving the expensive compute units idle. You do not change $I$ by buying a faster chip; you change it by reorganizing the computation so each loaded byte is reused more, which is precisely what kernel fusion and FlashAttention do.
2. The Roofline: Two Ceilings and a Ridge Intermediate
How fast can an operation with intensity $I$ actually run? Two limits apply at once, and the slower one wins. The compute units cannot exceed the peak rate $\pi$. The memory system cannot deliver data faster than $\beta$ bytes per second, and since the operation does $I$ FLOPs per byte, the memory system caps performance at $I \cdot \beta$ FLOP/s. The attainable performance is therefore the smaller of the two,
$$P_{\text{attainable}}(I) = \min\bigl(\pi,\; I \cdot \beta\bigr).$$Plotting $P_{\text{attainable}}$ against $I$ on log-log axes produces the shape that gives the model its name, shown in Figure 3.7.1. For small $I$, the term $I \cdot \beta$ dominates and performance rises linearly with intensity: this is the sloped memory roof, where the slope equals the bandwidth $\beta$. For large $I$, performance flattens at $\pi$: this is the horizontal compute roof. The two roofs meet at the ridge point, the intensity at which a memory-bound operation just becomes compute-bound,
$$I_{\text{ridge}} = \frac{\pi}{\beta} \quad \text{FLOP/byte}.$$The ridge point is the single most useful number on the diagram. An operation whose intensity exceeds $I_{\text{ridge}}$ can in principle reach peak compute; an operation below it cannot, no matter how well written, because the memory system simply cannot supply data fast enough. On current accelerators the ratio $\pi/\beta$ has grown large, because compute throughput has climbed faster than memory bandwidth for many hardware generations. That widening ratio pushes the ridge point to the right and pulls more and more operations into the memory-bound regime, which is the deep reason memory traffic, not arithmetic, dominates so many AI kernels today.
3. Why It Matters for AI: Matmul Versus Everything Else Intermediate
Deep learning kernels fall sharply on either side of the ridge, and knowing which side tells you where the time goes. A dense matrix multiply of two $n \times n$ matrices performs about $2n^3$ FLOPs while moving on the order of $n^2$ elements, so its arithmetic intensity grows with $n$ and quickly clears any realistic ridge point: matrix multiply is the canonical compute-bound operation, and it is why accelerators are built around dense linear algebra in the first place. Convolutions and the large projection layers of a transformer behave the same way. These operations use the hardware well; on them, the advertised peak is nearly honest.
The trouble is everything around the matrix multiplies. An elementwise activation, a bias add, a residual connection, a layer normalization, a softmax: each reads its inputs, does a handful of FLOPs per element, and writes its outputs, giving an arithmetic intensity near one. These operations sit far down the memory roof and run at a tiny fraction of peak. Attention compounds the problem, because a naive implementation materializes a large intermediate score matrix in memory and reads it back, so its bottleneck is memory traffic rather than the modest arithmetic it performs. A model can spend a surprising share of its wall-clock time in these memory-bound steps even though they account for a trivial share of its FLOPs. The roofline diagnoses this immediately: low intensity, low attainable performance, idle compute units.
The code below computes arithmetic intensity and roofline-attainable performance for two operations on a representative accelerator, a large matrix multiply and an elementwise add over the same number of elements, and labels each by regime.
import numpy as np
# A representative modern accelerator (round numbers in the spirit of an H100-class card).
PEAK_FLOPS = 9.89e14 # 989 TFLOP/s of dense BF16 matrix-multiply throughput
BANDWIDTH = 3.35e12 # 3.35 TB/s of HBM memory bandwidth
RIDGE_I = PEAK_FLOPS / BANDWIDTH # arithmetic intensity at the ridge point
def analyze(name, flops, bytes_moved):
intensity = flops / bytes_moved # FLOPs per byte
attainable = min(PEAK_FLOPS, intensity * BANDWIDTH) # the roofline ceiling
bound = "compute-bound" if intensity >= RIDGE_I else "memory-bound"
frac = attainable / PEAK_FLOPS # fraction of peak this op can reach
return name, intensity, attainable, bound, frac
# Operation 1: a large square matrix multiply C = A @ B, all in BF16 (2 bytes/element).
n = 8192
mm_flops = 2.0 * n**3 # 2*n^3 multiply-adds
mm_bytes = 3.0 * (n * n) * 2.0 # read A, read B, write C
# Operation 2: an elementwise add y = a + b over the same number of elements as C.
ew_elems = n * n
ew_flops = ew_elems * 1.0 # one add per element
ew_bytes = 3.0 * ew_elems * 2.0 # read a, read b, write y
print(f"peak compute : {PEAK_FLOPS/1e12:8.1f} TFLOP/s")
print(f"peak bandwidth : {BANDWIDTH/1e12:8.2f} TB/s")
print(f"ridge point I : {RIDGE_I:8.1f} FLOP/byte")
print("-" * 78)
hdr = f"{'operation':<22}{'I (FLOP/byte)':>16}{'attainable':>16}{'regime':>16}{'% peak':>8}"
print(hdr)
print("-" * 78)
for name, flops, bytes_moved in [
("matmul 8192^3 bf16", mm_flops, mm_bytes),
("elementwise add", ew_flops, ew_bytes),
]:
nm, I, att, bound, frac = analyze(name, flops, bytes_moved)
print(f"{nm:<22}{I:>16.1f}{att/1e12:>13.1f} TF{bound:>16}{100*frac:>7.1f}%")
min(PEAK_FLOPS, I * BANDWIDTH) line is the roofline formula $P_{\text{attainable}} = \min(\pi, I\beta)$ applied directly.peak compute : 989.0 TFLOP/s
peak bandwidth : 3.35 TB/s
ridge point I : 295.2 FLOP/byte
------------------------------------------------------------------------------
operation I (FLOP/byte) attainable regime % peak
------------------------------------------------------------------------------
matmul 8192^3 bf16 2730.7 989.0 TF compute-bound 100.0%
elementwise add 0.2 0.6 TF memory-bound 0.1%
The numbers are stark and they are real. On this accelerator the ridge point is about 295 FLOP/byte; the matrix multiply clears it by nearly a factor of ten and saturates the compute roof, while the elementwise add reaches one thousandth of peak. The lesson is not that elementwise operations are badly written but that they are intrinsically memory-bound, and the only way to make them cheaper is to stop paying their memory traffic, which means not writing their intermediate results out to memory at all.
Output 3.7.1 shows that a chain of elementwise ops each pays a full round trip to memory. The fix is kernel fusion: run the whole chain in one pass over the data so intermediates stay in fast on-chip registers, which raises the effective arithmetic intensity by shrinking the bytes-moved denominator. You do not write the fused kernel by hand. PyTorch's compiler generates it from ordinary code:
import torch
@torch.compile # fuses the elementwise chain into one kernel
def block(x, w, b):
h = x @ w # compute-bound matmul, left as is
h = h + b # bias add ......... }
h = torch.nn.functional.gelu(h) # activation ....... } fused: one memory pass
return h * 0.5 # scale ............ }
torch.compile keeps the matmul as is and fuses the bias add, activation, and scale into a single pass, removing two of the three memory round trips that Output 3.7.1 would otherwise charge for the tail.Who: A deep learning performance engineer optimizing inference for a transformer serving stack.
Situation: A model hit only about 30 percent of the accelerator's advertised FLOP/s during generation, far below what the matrix-multiply-heavy architecture seemed to promise.
Problem: The team's first instinct was that the big projection matmuls were inefficient and needed a better library or a larger batch.
Dilemma: Spend the week tuning matmul tiling and batch sizes, which targets operations that were already near the compute roof, or step back and measure arithmetic intensity per operation to find what was actually starving the chip.
Decision: They plotted each operation on a roofline. The matmuls sat on the compute roof exactly as expected; the wasted time lived in the unfused chain of normalizations, residual adds, and the attention softmax, all far down the memory roof.
How: They fused the elementwise and normalization chains into single kernels and replaced the naive attention with a memory-efficient fused attention kernel, raising the effective intensity of the memory-bound steps without touching the matmuls.
Result: Utilization rose from roughly 30 percent to over 60 percent of peak with no change to the math or the model weights; the speedup came entirely from cutting memory traffic the roofline had flagged.
Lesson: Profile by arithmetic intensity, not by FLOP count. The operations that waste the most time are usually the ones doing the least arithmetic, because they are pinned to the memory roof while the compute units wait.
4. From One Roofline to a Cluster Advanced
The roofline is a single-accelerator model, and that is exactly why it belongs at the foundation of a book about distribution. Distribution multiplies the per-node performance ceiling; it does not raise it. If a kernel runs at one tenth of peak on one device because it is memory-bound, then a thousand devices running that kernel run at one tenth of peak a thousand times over, and you have bought a thousandfold more idle compute. The disciplined path is to push each node toward its roofline first, then scale out, so that the per-node baseline you replicate is an efficient one. Per-node inference efficiency, the art of living near the roofline through quantization, fused attention, and cache-aware scheduling, gets its own treatment as the prerequisite that the serving fleet multiplies in Chapter 22.
There is also a second roofline lurking in any distributed system, and it has the same shape. Replace the memory roof's bandwidth $\beta$ with the bandwidth of the interconnect between nodes, and the bytes-moved with the bytes sent over the network, and you get a communication roofline that bounds how fast a distributed step can run. The next section makes that idea precise with the alpha-beta model, where latency and bandwidth play the roles that the two roofs play here. The collective-communication primitives that ride on that interconnect, and the topology that sets its bandwidth, are the subject of Chapter 4, and the matrix-multiply intensity we exploited above is what data-parallel training replicates across workers in Chapter 15.
Because the ridge point keeps moving right, the most active line of accelerator-kernel work is raising the arithmetic intensity of memory-bound operations rather than chasing more FLOP/s. FlashAttention-3 (Shah et al., 2024) restructures attention so the score matrix never touches main memory and overlaps the remaining transfers with computation, pushing attention from the memory roof toward the compute roof on Hopper-class hardware. Low-precision formats extend the same idea by shrinking the bytes-moved denominator: FP8 training and inference, now standard on recent accelerators, roughly halve memory traffic relative to BF16 and lift effective intensity, and microscaling block formats such as MXFP8 and MXFP6 push toward even narrower types. Compiler-driven kernel fusion in stacks like Triton, torch.compile, and Mojo generates the fused kernels automatically, so the memory-bound tail that Output 3.7.1 exposed is increasingly handled by the toolchain rather than by hand. The unifying goal across all of these is the same: move operations rightward on the roofline so the expensive compute units stop waiting on memory.
An accelerator running a memory-bound kernel at one percent of peak is like a Formula One car stuck in city traffic: the engine is magnificent, and it spends the whole trip waiting at red lights. The roofline is the traffic report that tells you the engine was never the problem.
An accelerator advertises a peak of 312 TFLOP/s and a memory bandwidth of 2.0 TB/s. (a) Compute its ridge point $I_{\text{ridge}} = \pi/\beta$ in FLOP/byte. (b) A layer normalization over a tensor has an arithmetic intensity of about 3 FLOP/byte; state whether it is compute-bound or memory-bound on this device and compute its attainable performance from $P = \min(\pi, I\beta)$. (c) Explain in one sentence why upgrading to a chip with twice the FLOP/s but the same bandwidth would not speed this operation up at all.
Extend Code 3.7.1 with a third operation: a chain of three elementwise ops (add, then activation, then scale) over $n \times n$ BF16 elements, computed two ways. First, unfused, where each op reads its input and writes its output to memory separately (so the chain pays three full round trips). Second, fused, where the chain reads the inputs once and writes the result once (one round trip). Report the arithmetic intensity and attainable performance of both, and the speedup factor fusion buys. Explain why fusion raises $I$ even though the FLOP count is unchanged.
Consider scaled dot-product attention for one head with sequence length $L$ and head dimension $d$. A naive implementation computes the score matrix $QK^\top$ (about $2L^2 d$ FLOPs), writes the $L \times L$ scores and the softmax over them to memory, then multiplies by $V$. (a) Write an order-of-magnitude expression for the bytes moved when the $L \times L$ intermediate is materialized, and show that the arithmetic intensity falls toward a constant that does not grow with $L$. (b) Argue, using the roofline, why a fused kernel that never writes the $L \times L$ matrix to memory raises the intensity, and connect this to why FlashAttention matters for long-context models. (c) State which roof, memory or compute, you expect naive attention to hit on the accelerator from Exercise 3.7.1.
With the per-node ceiling now expressed as two roofs and a ridge, the natural next step is the second roofline of any distributed system: the cost of moving bytes between machines rather than between memory and compute. Section 3.8 develops that cost as the alpha-beta model, where a fixed latency and a per-byte bandwidth term play the roles that the compute and memory roofs played here, and turns "how expensive is this communication?" into a number you can put in a budget.