Section 33.2: Compute: CPUs, GPUs, TPUs, and Accelerator Instances

"They gave me sixteen thousand cores and two terabytes a second, and still I sit idle, holding a matrix I cannot use until the host finally decides to feed me the next batch."
A GPU, Waiting for the Host to Feed It

Big Picture

A cluster is not an abstract pool of computation; it is a fleet of physical accelerators, each with a fixed peak compute rate and a fixed memory bandwidth, wired together so that thousands of them act as one machine. Every distributed-training and distributed-serving method in this book ultimately runs on one of these chips, and the efficiency of the whole fleet is the per-node efficiency multiplied across it. This section surveys the accelerators that fill a modern AI cluster, the CPU host that feeds them, the GPU and TPU that do the arithmetic, and the multi-accelerator instances that vendors package with fast intra-node links, and it gives you the one reasoning tool, the roofline, that predicts whether a given operation will be limited by compute or by memory. Understanding the single node is the prerequisite for understanding why the cluster is wired the way it is.

The earlier sections of this chapter treated the cluster as a resource to be scheduled and the workload as a graph of tasks to be placed. That view is correct, but it abstracts away the object that every task eventually lands on: a single accelerator with hard physical ceilings. Before a scheduler can place work well, before an interconnect can be sized, and before a parallelism strategy can be chosen, the per-node compute device must be understood on its own terms. A cluster of ten thousand accelerators that each run at twenty percent of their peak is, in compute delivered, a cluster of two thousand. The fleet only ever multiplies what one node achieves, so the unit of analysis in this section is deliberately small: one chip, its arithmetic units, and its memory.

This is the per-node baseline that the rest of the book distributes. The detailed treatment of squeezing a single accelerator, quantization, KV-cache paging, and attention kernels, is the explicit subject of Chapter 22, which this section assumes as background; here we care only about the device's shape, because that shape dictates how many of them you need and how they must be wired. The wiring itself, the interconnect that turns a rack of accelerators into a single training substrate, is the subject of Chapter 4, and this section ends by showing where the per-node story hands off to it.

1. The Host CPU: Coordinator, Not Calculator Beginner

Every accelerator in a cluster sits behind a host CPU, and the relationship between them is asymmetric in a way that newcomers often misread. The CPU does very little of the arithmetic in a deep-learning workload; its job is to keep the accelerator fed. It reads training shards from storage, decodes and augments images or tokenizes text, assembles batches, copies them across the PCIe or NVLink boundary into accelerator memory, launches the compute kernels, and runs the control logic, the optimizer bookkeeping, the checkpointing, and the communication scheduling, that the accelerator itself does not. A modern host has dozens of cores precisely because data loading and preprocessing for a fast accelerator is itself a parallel problem, and a host that cannot decode images quickly enough leaves the accelerator stalled, exactly the idle state the epigraph complains about.

The CPU also owns coordination. It is the process that joins the communication group, that talks to the scheduler, that decides when a checkpoint is written and when a failed peer should be replaced. In the data-parallel training loop of Chapter 15, it is the host that issues the collective call even though the accelerator moves the bytes. This division of labor, CPU coordinates and feeds, accelerator computes, is the mental model to carry into every later section: when a cluster underperforms, the bottleneck is as often a starved input pipeline on the host as it is the accelerator itself.

Key Insight: The Accelerator Is the Unit, the Fleet Is the Multiplier

The entire book is about wiring many accelerators together, but the wiring can never recover efficiency that the single node throws away. If one device runs an operation at five percent of its peak, ten thousand devices run it at five percent of ten thousand peaks. This is why per-node efficiency (Chapter 22) is a prerequisite for scale-out and not a competing concern: the fleet multiplies the per-node number, and a small per-node number multiplied by a large fleet is still a small number times a large cost.

2. The GPU: Streaming Multiprocessors Fed by HBM Intermediate

A GPU is best understood as a large array of simple arithmetic units organized into streaming multiprocessors, paired with a pool of high-bandwidth memory. Each streaming multiprocessor contains many scalar lanes for general arithmetic and, on accelerators built for deep learning, dedicated tensor cores that perform a small matrix multiply-accumulate as a single instruction. Tensor cores are why a modern GPU reports hundreds of teraFLOPs for the low-precision matrix multiplications that dominate neural-network training and inference: the chip is built to do one thing, dense matrix arithmetic, at enormous rates. The parameters, activations, and optimizer state that the multiply consumes live in high-bandwidth memory, or HBM, a stack of DRAM bonded close to the compute die to deliver bandwidth measured in terabytes per second.

The decisive fact about this arrangement is that compute has grown far faster than memory bandwidth. A contemporary data-center GPU can perform thousands of floating-point operations in the time it takes to read a single byte from HBM. The arithmetic units are therefore frequently idle, not because the chip is slow, but because the data cannot arrive fast enough. Memory bandwidth, not raw compute, is the binding ceiling for a large fraction of deep-learning operations, and recognizing which operations are limited by which resource is the single most useful per-node skill. Figure 33.2.1 shows the hierarchy that every byte must traverse, from slow remote storage up to the registers feeding the tensor cores.

Figure 33.2.1: The accelerator memory hierarchy. Capacity is largest at the base and smallest at the apex; bandwidth runs the other way. A training step streams shards from networked storage through host DRAM into HBM, and the tensor cores can only run at full rate when the bytes they need are already resident in HBM and on-chip SRAM.

The notation we need to make this precise is small. An operation performs some number of floating-point operations, its FLOPs, and moves some number of bytes between HBM and the compute units. Their ratio is the operation's arithmetic intensity,

$$I \;=\; \frac{\text{FLOPs}}{\text{bytes moved}} \qquad [\text{FLOP/byte}],$$

and it is the quantity that decides whether compute or memory is the binding constraint. A device is characterized by two peaks: its peak compute rate $\pi$ in FLOP/s and its peak memory bandwidth $\beta$ in byte/s. The roofline model states that the attainable rate on an operation of intensity $I$ is

$$P(I) \;=\; \min\bigl(\pi,\; \beta \cdot I\bigr).$$

When $I$ is small the term $\beta I$ dominates and the operation is memory-bound, running far below the chip's compute peak no matter how many tensor cores it has. When $I$ is large the operation saturates $\pi$ and is compute-bound. The crossover, the ridge point, sits at $I^{\star} = \pi / \beta$; operations with intensity below it are starved for bandwidth, and operations above it are limited only by the arithmetic units. This one inequality is the per-node efficiency frame that the whole fleet multiplies.

3. Roofline Reasoning, Computed Intermediate

The roofline is most convincing as a calculation. The program in Code 33.2.1 models a single accelerator by its two peaks, computes the ridge point, and then evaluates two operations of identical shape but very different reuse: a large square matrix multiply, which reuses each loaded value across a whole row and column, and a single-token matrix-vector product, which reuses almost nothing. The contrast is the difference between a training step that saturates the chip and a latency-bound decode step that wastes it.

# Roofline reasoning: arithmetic intensity decides compute-bound vs memory-bound.
# Model an accelerator by its peak compute and peak memory bandwidth.
peak_flops = 312e12      # 312 TFLOP/s (representative dense bf16 throughput)
peak_bw    = 2.0e12      # 2.0 TB/s HBM bandwidth

ridge = peak_flops / peak_bw   # ridge-point intensity in FLOP/byte
print(f"peak compute        : {peak_flops/1e12:.0f} TFLOP/s")
print(f"peak HBM bandwidth  : {peak_bw/1e12:.1f} TB/s")
print(f"ridge intensity     : {ridge:.0f} FLOP/byte")
print()

def report(name, flops, bytes_moved):
    I = flops / bytes_moved
    attainable = min(peak_flops, I * peak_bw)   # the roofline P(I) = min(pi, beta*I)
    regime = "compute-bound" if I >= ridge else "memory-bound"
    print(f"{name}")
    print(f"  intensity I       : {I:7.1f} FLOP/byte  ({regime})")
    print(f"  attainable        : {attainable/1e12:7.1f} TFLOP/s")
    print(f"  fraction of peak  : {100*attainable/peak_flops:6.1f}%")

# GEMM C = A B, A:[m,k] B:[k,n], bf16 (2 bytes). FLOPs = 2*m*n*k.
m = n = k = 8192
report("large square GEMM 8192^3", 2*m*n*k, 2*(m*k + k*n + m*n))

# GEMV: one token through a weight matrix, almost no reuse.
m2, k2 = 8192, 8192
report("GEMV 8192x8192 (batch 1)", 2*m2*k2, 2*(m2*k2 + k2 + m2))

Code 33.2.1: A roofline evaluator. The same arithmetic shape, scaled to a full matrix versus a single vector, lands on opposite sides of the ridge point, which is why batching is the central lever for accelerator efficiency.

peak compute        : 312 TFLOP/s
peak HBM bandwidth  : 2.0 TB/s
ridge intensity     : 156 FLOP/byte

large square GEMM 8192^3
  intensity I       :  2730.7 FLOP/byte  (compute-bound)
  attainable        :   312.0 TFLOP/s
  fraction of peak  :  100.0%
GEMV 8192x8192 (batch 1)
  intensity I       :     1.0 FLOP/byte  (memory-bound)
  attainable        :     2.0 TFLOP/s
  fraction of peak  :    0.6%

Output 33.2.1: The large matrix multiply reaches the full 312 TFLOP/s because its intensity of 2731 FLOP/byte sits far above the ridge of 156; the batch-one matrix-vector product reaches 0.6 percent of peak because its intensity of 1 FLOP/byte leaves it starved for bandwidth.

The numbers are stark. The same chip, on two operations built from the same kind of arithmetic, delivers its full compute on one and less than one percent on the other, and the only difference is reuse. This is why batching, packing many tokens or examples into one matrix multiply, is the dominant lever for accelerator efficiency, and why latency-bound single-request inference is so expensive per token. It is also why the roofline travels with us into the fleet: a parallelism strategy that lowers per-device intensity below the ridge can leave a ten-thousand-GPU cluster running at a fraction of its paper peak. Figure 33.2.2 plots the roofline and places the two operations on it.

Figure 33.2.2: The roofline for the device in Code 33.2.1. The rising line has slope equal to memory bandwidth $\beta$ and governs memory-bound operations; the flat ceiling equals peak compute $\pi$. The batch-one matrix-vector product sits far down the slope at intensity 1, while the large matrix multiply sits on the ceiling. Moving an operation rightward, by batching to raise its intensity past the ridge, is the per-node efficiency move that scale-out then multiplies.

Library Shortcut: Query the Devices Before You Reason About Them

Before applying a roofline you need the actual peaks and the device inventory, and you should read them from the hardware rather than trust a datasheet. Each framework exposes a one-line query: PyTorch reports the visible GPUs and their properties, and JAX reports every local accelerator including TPU cores. The command-line companion, nvidia-smi, prints live memory and utilization for the host's GPUs.

import torch
n = torch.cuda.device_count()                       # GPUs visible to this process
for i in range(n):
    p = torch.cuda.get_device_properties(i)
    print(i, p.name, f"{p.total_memory/1e9:.0f} GB", f"{p.multi_processor_count} SMs")

import jax                                            # works for GPU and TPU backends
print(jax.devices())                                 # e.g. [TpuDevice(0), TpuDevice(1), ...]
# shell: nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv

Code 33.2.2: Device discovery in three frameworks. What a scheduler tracks as an abstract resource, these calls expose as concrete chips with named memory and compute counts, the numbers a roofline needs.

4. The TPU: A Systolic Array in a Pod Advanced

The tensor processing unit takes the same observation, that deep learning is dominated by dense matrix multiplication, and specializes even harder than a GPU does. Its core is a systolic array: a two-dimensional grid of multiply-accumulate cells through which the operands flow rhythmically, each cell passing its partial result to its neighbor so that a large matrix multiply streams through the grid with very little control overhead and very few trips back to memory. Where a GPU is a flexible array of programmable multiprocessors, a TPU is a near-fixed-function matrix engine; it gives up generality to spend almost all of its silicon and its energy budget on the one operation that matters, and it keeps data moving between cells rather than to and from memory, which raises the effective arithmetic intensity of the operations it is built for.

The TPU's relevance to a book on scale-out is its packaging. Individual TPU chips are wired into a pod through a dedicated high-speed interconnect arranged as a multi-dimensional torus, so that thousands of chips share a single communication fabric purpose-built for the collective operations of distributed training. The torus topology means each chip talks directly to its neighbors with uniform, predictable latency, which is exactly what the all-reduce of Chapter 4 wants. The TPU is therefore a clean illustration of the book's central pattern: the vendor has already done the scale-out wiring at the node-and-pod level, exposing not one accelerator but a coherent fabric of them. Table 33.2.1 places the three compute types side by side on the axes that matter for a cluster.

Table 33.2.1: The three compute types in an AI cluster, compared on the role each plays and the per-node property that most constrains the work placed on it.

Device	Role in the cluster	Compute style	Binding ceiling for AI work
CPU (host)	Feed data, coordinate, run control logic	Many general cores	Input-pipeline throughput, PCIe transfer
GPU	Train and serve dense models	SMs plus tensor cores	HBM bandwidth on low-intensity ops
TPU	Train and serve at pod scale	Systolic matrix array	Matmul shape fit, pod interconnect

5. Accelerator Instances: NVLink Inside the Node Intermediate

Cloud vendors do not rent bare accelerators; they rent instances, and a high-end AI instance packs four, eight, or sixteen GPUs into a single physical node. The decisive feature of such an instance is not the count of GPUs but how they are connected to each other inside the box. Ordinary host buses such as PCIe move data between GPUs at tens of gigabytes per second, which is far too slow for the constant exchange that sharded training demands. To close that gap, vendors wire the GPUs within a node with a dedicated high-bandwidth link, NVLink, and route it through a switch fabric, NVSwitch, so that every GPU in the node can talk to every other at hundreds of gigabytes per second with uniform bandwidth. The instance is, in effect, a small all-to-all cluster sold as one machine.

This intra-node fabric is what makes the most communication-intensive parallelism strategies viable. Sharded and tensor parallelism, the subject of Chapter 16, split a single layer's weights and activations across several GPUs and must exchange partial results on every forward and backward pass; that exchange is only affordable when the GPUs sit on the same NVSwitch fabric. The standard design pattern that emerges, and that the scheduler of this chapter must respect, is a memory and bandwidth hierarchy: tensor-parallel shards are kept inside one NVLink-connected node, while the slower, less frequent data-parallel all-reduce crosses the network between nodes. Figure 33.2.3 shows this two-level structure, the per-node fabric that the fleet-level interconnect then ties together.

Figure 33.2.3: The two-level bandwidth hierarchy of a multi-GPU fleet. Inside each node, an NVSwitch connects every GPU to every other at hundreds of gigabytes per second (thick orange links), fast enough for tensor-parallel shards. Between nodes, a slower network link (dashed) carries the less frequent data-parallel all-reduce. Scheduling and parallelism strategy both follow this hierarchy: keep the chattiest communication inside the box.

Thesis Thread: The Vendor Already Scaled Out the Node

This section's quiet lesson is that scale-out does not begin at the cluster; it begins inside the accelerator instance. A TPU pod and an NVSwitch node are themselves miniature distributed systems, with their own interconnect, their own collectives, and their own placement decisions. The book's central question, how do we split work across machines and recombine it cheaply, is answered first by the hardware vendor at the node level and then again by you at the cluster level. Every parallelism chapter that follows reads this two-level hierarchy off the hardware and maps the chattiest communication onto the fastest link, a discipline that begins with the collectives of Chapter 4.

Practical Example: The Eight-GPU Node That Ran Like One

Who: A platform engineer standing up training infrastructure for a thirteen-billion-parameter language model.

Situation: The model did not fit in a single GPU's memory, so the layers were split across the eight GPUs of one instance with tensor parallelism.

Problem: The first configuration placed the eight tensor-parallel shards across two four-GPU nodes, and the step time was four times worse than a back-of-envelope estimate predicted.

Dilemma: Buy faster network hardware between nodes, an expensive fleet-wide change, or rearrange the placement so the chatty shards stayed on one NVSwitch fabric.

Decision: They pinned the entire tensor-parallel group inside one eight-GPU node and reserved the inter-node network for the infrequent data-parallel all-reduce only.

How: The scheduler was told to treat the tensor-parallel group as a gang that must land on a single instance, the placement constraint this chapter's scheduler exists to honor.

Result: Step time fell back to the predicted range, because the per-pass activation exchange now crossed NVLink at hundreds of gigabytes per second instead of the network at tens.

Lesson: The roofline tells you a node's compute ceiling; the intra-node fabric tells you which communications are cheap. Place the chattiest parallelism where the bandwidth is, and the eight-GPU node behaves like one fast machine.

6. From One Node to the Fleet Beginner

The accelerator landscape resolves into a simple layered picture that the rest of Part VII builds on. At the base is the host CPU, feeding and coordinating. Above it is the accelerator, a GPU or TPU, whose tensor cores or systolic array deliver compute that only the roofline can predict will be used. Around several accelerators is the instance, stitched together by an NVLink or torus fabric into a small all-to-all cluster. And around many instances is the fleet, tied by the slower network that the scheduler and the interconnect of Chapter 4 must use sparingly. Each layer multiplies the one below it, and at every layer the binding constraint may shift from compute to memory bandwidth to communication.

This is why per-node understanding is not a detour from a scale-out book but its foundation. A scheduler that ignores the roofline will pack work that runs at one percent of peak; a parallelism strategy that ignores the intra-node fabric will drown the network in traffic that NVLink could have absorbed. The accelerator is the unit, and everything this book does is decide how to wire many units together so that the fleet delivers what the chips promise. The next section turns from the compute that fills a cluster to the memory and storage tiers that hold its data, the other half of what a node contributes to the whole.

Research Frontier: Disaggregation and Bandwidth-Centric Accelerators (2024 to 2026)

Two active lines push directly on the ceilings this section named. The first is memory and resource disaggregation: rather than fixing a node's HBM at manufacture, emerging fabrics such as CXL-attached memory and pooled-memory prototypes let an accelerator borrow capacity from a shared pool, loosening the hard per-device memory ceiling that forces model sharding today. The second is the response to the memory-bandwidth wall the roofline exposes: vendors now ship accelerators differentiated chiefly by HBM capacity and bandwidth rather than peak FLOPs, and disaggregated LLM serving, splitting the compute-bound prefill phase and the memory-bound decode phase onto separate, differently-provisioned hardware pools, has moved from research into production stacks. Both trends treat the per-node roofline of Output 33.2.1 as a quantity to be engineered around at the fleet level, which is exactly the move that Chapter 24 develops for serving.

Fun Note

The most expensive silicon in the building spends much of its life waiting on the cheapest. A data-center GPU costing as much as a car routinely idles because a few host threads cannot decode JPEGs fast enough to fill its input queue. The accelerator in the epigraph is not exaggerating; it really is sitting on a finished matrix, holding terabytes per second of unused bandwidth, while the host catches up.

Exercise 33.2.1: Find the Ridge, Place the Op Conceptual

An accelerator advertises a peak of 990 teraFLOP/s in bf16 and 3.35 terabytes per second of HBM bandwidth. (a) Compute its ridge-point arithmetic intensity $I^{\star} = \pi/\beta$ in FLOP/byte. (b) A fused attention kernel achieves an intensity of 220 FLOP/byte and a key-value-cache read during decoding achieves 2 FLOP/byte; state which is compute-bound and which is memory-bound, and give each one's attainable rate from $P(I) = \min(\pi, \beta I)$. (c) Explain in two sentences why raising the decode batch size moves the second operation toward the ridge, and why it cannot move past the attention kernel's regime.

Exercise 33.2.2: Extend the Roofline Evaluator Coding

Starting from Code 33.2.1, add a third workload: a transformer feed-forward block that multiplies a batch of $B$ tokens (each of dimension $d=8192$) by a weight matrix of shape $[d, 4d]$ in bf16. Parameterize the function by $B$ and sweep $B$ over $1, 8, 64, 512, 4096$. For each $B$ print the arithmetic intensity and the fraction of peak attained, and report the smallest $B$ at which the block first becomes compute-bound. Explain how this batch threshold relates to the difference between the two operations already in the code.

Exercise 33.2.3: Place the Parallel Groups on the Hierarchy Analysis

You must train a model with a tensor-parallel group of size 8 and a data-parallel group of size 4, on nodes that each hold 8 GPUs connected by an NVSwitch fabric at 400 gigabytes per second, with nodes joined by a network at 25 gigabytes per second. (a) Using the hierarchy in Figure 33.2.3, decide which group should be confined to a single node and which should span nodes, and justify the choice from the relative communication frequency of tensor and data parallelism. (b) If a scheduler instead spread each tensor-parallel group across two nodes, estimate qualitatively how the per-pass activation exchange time changes, citing the bandwidth ratio. (c) State the scheduling constraint, in the gang-scheduling language of this chapter, that prevents the bad placement, and connect it to the sharded-parallelism requirements of Chapter 16.