"They added eight more of me before anyone asked what was actually slow. I am very fast now at the thing that was never the problem."
A Bottleneck, Asking to Be Named Before It Is Cut
A capstone that distributes the wrong axis is slower, more expensive, and more fragile than the single machine it replaced; the entire design hinges on naming, before you write a line of distributed code, which one resource the baseline actually runs out of first. A system can be limited by how fast it reads bytes, by how fast it does arithmetic, by how much it must hold in memory at once, by how quickly it must answer, or by how much its parts must talk to one another. Each of those ceilings has a different remedy and a different part of this book, and they do not substitute for each other: replicating a model that already does not fit in memory just buys you more copies of the same failure. This section turns the six axes of Chapter 1 into a diagnosis. You profile the single-machine baseline, classify the bottleneck with the same roofline reasoning as Chapter 3, and let the measurement, not your intuition, pick the axis.
The previous section framed the capstone as a problem statement: a workload that one machine can no longer serve, and a goal stated in measurable terms. This section makes the first irreversible design decision. Of the many things a system does, you will distribute one of them, and the rest of the project, the baseline you build in Section 41.3 and the architecture you commit to in Section 41.4, follows from that choice. Choosing well is not a matter of taste or of which technique you find most interesting. It is a matter of finding the one place where the single machine breaks and pointing your effort precisely there. Distribution is forced by a ceiling, never chosen for elegance, and the discipline of this section is refusing to commit until you can name the ceiling.
1. The Six Axes, Recalled as a Menu of Remedies Beginner
Chapter 1 introduced six axes along which an AI system can be distributed, and Section 1.2 walked each with a worked example. Here we recall them not as a taxonomy but as a menu of remedies, because every axis answers a different question and therefore cures a different ailment. Distributing data spreads the dataset and its processing across machines, so it cures a system drowning in bytes it cannot read or store. Distributing training spreads the gradient computation across workers, so it cures a system whose compute is the limit while the model still fits on one device. Distributing the model splits the parameters and activations across devices, so it cures a system whose model is too large to hold at all. Distributing inference replicates and partitions the serving path, so it cures a system that cannot answer requests fast enough. Coordinating the cluster and distributing intelligence handle the scheduling, recovery, and multi-agent reasoning that hold the whole thing together once several machines are involved.
The trap is treating these as interchangeable knobs. They are not. Each axis removes exactly one kind of ceiling, and applying the wrong one leaves the real ceiling untouched while adding the communication tax that Chapter 3 taught you to price. Table 41.2.1 lays the menu out as a lookup: the symptom you measure, the axis that cures it, the part of the book that develops the technique, and the production tool you would reach for. The columns after the symptom are useless until you have measured the symptom, which is why the rest of this section is about measurement.
| Measured symptom (the binding ceiling) | Axis to distribute | Where the book develops it | Representative tools |
|---|---|---|---|
| Dataset too large to read or store on one node (I/O bound) | Distribute data | Part II (Ch 6, Ch 8) | Spark, Ray Data, sharded loaders |
| Compute is the limit but the model fits on one device | Distribute training | Parts III and IV (Ch 11, Ch 15) | PyTorch DDP, Horovod |
| Model plus optimizer state exceeds one device's memory | Distribute the model | Ch 16 (sharded), Ch 11 (embeddings) | FSDP, DeepSpeed ZeRO, Megatron |
| Cannot answer requests within the latency or throughput budget | Distribute inference | Part V (Ch 23) | vLLM, Triton, Ray Serve |
| Many agents or stages must agree, schedule, and recover (coordination bound) | Coordinate / distribute intelligence | Parts VI and VII (Ch 32, Ch 33) | Ray, Kubernetes, agent frameworks |
Every axis of distribution is a remedy for one specific ceiling, and the ceilings do not substitute. Data parallelism multiplies throughput but does nothing for a model that does not fit; model parallelism shrinks the per-device footprint but adds communication that a compute-bound, already-fitting model never needed. The capstone's first design decision is therefore an empirical one: profile the single-machine baseline, find the resource it exhausts first, and read the matching axis off Table 41.2.1. Pick the axis before the measurement and you will, sooner or later, build the right cure for the wrong disease.
2. Diagnosing the Binding Bottleneck Intermediate
To choose an axis you must answer one question about the single-machine baseline: which resource does it exhaust first? There are five candidates, and they map cleanly onto the menu. A workload is data I/O bound when it spends its time reading or writing bytes, the processor idling while the disk or network catches up. It is compute bound when the arithmetic units are saturated and feeding them faster would not help. It is memory bound in two distinct senses worth separating: memory-bandwidth bound when the cores starve waiting for data to stream from RAM, and memory-capacity bound when the working set simply does not fit and the program cannot run at all without paging or crashing. It is latency bound when the clock that matters is the time to first response, not total throughput. And it is coordination bound when the time goes to synchronization, queuing, and agreement between parts rather than to any single part's work.
The cleanest way to separate compute from memory bandwidth is the roofline reasoning of Chapter 3. Characterize a step by its arithmetic intensity, the ratio of floating-point operations to bytes moved,
$$I \;=\; \frac{W_{\text{flops}}}{Q_{\text{bytes}}} \qquad \text{(FLOP per byte).}$$A machine has two ceilings: a peak compute rate $\pi$ (FLOP per second) and a peak memory bandwidth $\beta$ (bytes per second). The most arithmetic a step can sustain is the lower of what compute allows and what the memory feed allows,
$$P_{\max}(I) \;=\; \min\!\big(\pi,\; \beta \cdot I\big).$$The crossover sits at the ridge intensity $I^\star = \pi / \beta$. A step with $I < I^\star$ lives on the slanted part of the roofline, starved by bandwidth: it is memory-bandwidth bound, and on one machine the fix is to shrink the footprint, which on a cluster means distributing the model. A step with $I > I^\star$ lives under the flat part, limited by the arithmetic units: it is compute bound, and the fix is to add compute, which means distributing training. The two simplest binding tests fall straight out:
$$\underbrace{I < \tfrac{\pi}{\beta}}_{\text{memory-bandwidth bound}} \qquad\text{vs.}\qquad \underbrace{I > \tfrac{\pi}{\beta}}_{\text{compute bound}}.$$Communication adds a third roof. When a step also moves $Q_{\text{net}}$ bytes between machines over a link of bandwidth $\beta_{\text{net}}$, with the per-message latency $\alpha$ of the alpha-beta model from Chapter 3, a single round costs about $\alpha + Q_{\text{net}}/\beta_{\text{net}}$, and the step is communication bound when that time dominates the local work,
$$\alpha + \frac{Q_{\text{net}}}{\beta_{\text{net}}} \;>\; \frac{W_{\text{flops}}}{\pi}.$$When the communication roof is the binding one, no amount of extra compute or memory helps; the remedy lives on the coordination axis, and adding workers can make matters worse by raising $Q_{\text{net}}$. Figure 41.2.1 collapses these tests into the diagnostic flow you actually run, and the demo that follows it puts numbers under each branch.
3. A Profiler That Picks the Axis for You Intermediate
The diagnosis in Figure 41.2.1 is only as good as the profile underneath it, so the right first move on any capstone is to time the candidate steps of the single-machine baseline and compute their arithmetic intensity. The demonstration below builds three deliberately different steps, a compute-heavy dense matrix multiply, a memory-heavy streaming reduction over a large array, and an I/O-heavy write-then-read against disk, then times each and reports its arithmetic intensity and sustained throughput. The classification falls out of those numbers: a high FLOP-per-byte step that nonetheless sustains few gigabytes per second is compute bound, a low FLOP-per-byte step that saturates memory bandwidth is bandwidth bound, and a step dominated by bytes crossing the disk boundary is I/O bound. Each verdict maps to a row of Table 41.2.1.
import time, numpy as np
def timed(fn, reps=3):
best = float("inf")
for _ in range(reps):
t0 = time.perf_counter(); fn(); best = min(best, time.perf_counter() - t0)
return best
rng = np.random.default_rng(0)
# Workload A: compute-bound. Dense matmul: high FLOP per byte.
A = rng.standard_normal((1200, 1200)); B = rng.standard_normal((1200, 1200))
compute_step = lambda: A @ B
flops_A, bytes_A = 2.0 * 1200**3, 3 * A.nbytes # ~2 n^3 flops; read A, read B, write C
# Workload B: memory-bandwidth-bound. Stream a 305 MB array, do almost no arithmetic.
big = rng.standard_normal(40_000_000)
memory_step = lambda: big.sum()
flops_B, bytes_B = 2.0 * big.size, big.nbytes # ~2 flops/elem; one streaming read
# Workload C: I/O-bound. Write then read a 61 MB file from disk.
import os, tempfile
payload = rng.standard_normal(8_000_000)
path = os.path.join(tempfile.gettempdir(), "_bn_io_probe.bin")
def io_step():
payload.tofile(path)
with open(path, "rb") as f: f.read()
flops_C, bytes_C = 0.0, 2 * payload.nbytes # one write + one read
def classify(name, t, flops, byts, axis):
I = flops / byts if byts else 0.0 # arithmetic intensity, FLOP per byte
print(f"{name:14s} t={t*1e3:7.1f} ms I={I:8.2f} F/B "
f"{flops/t/1e9:7.1f} GFLOP/s {byts/t/1e9:6.2f} GB/s -> {axis}")
print("workload time arith.intensity throughput recommended axis")
print("-" * 92)
classify("compute-heavy", timed(compute_step), flops_A, bytes_A, "distribute training (split compute)")
classify("memory-heavy ", timed(memory_step), flops_B, bytes_B, "distribute the model (split footprint)")
classify("io-heavy ", timed(io_step), flops_C, bytes_C, "distribute data (shard the dataset)")
os.remove(path)
workload time arith.intensity throughput recommended axis
--------------------------------------------------------------------------------------------
compute-heavy t= 2495.4 ms I= 100.00 F/B 1.4 GFLOP/s 0.01 GB/s -> distribute training (split compute)
memory-heavy t= 47.6 ms I= 0.25 F/B 1.7 GFLOP/s 6.73 GB/s -> distribute the model (split footprint)
io-heavy t= 72.4 ms I= 0.00 F/B 0.0 GFLOP/s 1.77 GB/s -> distribute data (shard the dataset)
Code 41.2.1 hand-derives FLOP and byte counts to make the roofline visible, but on a real baseline you do not annotate arithmetic by hand. A sampling profiler such as py-spy attaches to a running process and produces a flame graph in one command (py-spy record -o profile.svg --pid <pid>), showing instantly whether time pools in a data loader (I/O bound), a kernel (compute or bandwidth bound), or a synchronization barrier (coordination bound). For deep learning specifically, torch.profiler records per-operator CUDA time, memory, and FLOP, and emits a Chrome trace that separates compute from communication from data movement; the NVIDIA stack adds nsys and ncu for the same split at the hardware level. The reasoning of this section stays identical; the tool just measures $W_{\text{flops}}$, $Q_{\text{bytes}}$, and the waiting time for you.
4. Distributing the Wrong Axis Advanced
The failure mode this section exists to prevent is mechanical and common: a team feels their system is slow, reaches for the most familiar form of distribution, and applies it to a ceiling it cannot lift. The canonical example is data parallelism aimed at a memory-capacity problem. Data-parallel training, the exact-gradient method of Chapter 15, replicates the full model on every worker and splits the batch among them. If the model already does not fit on one device, replicating it puts the same too-large model on each of $K$ workers, every one of which fails to allocate exactly as the single machine did. You have multiplied the cost by $K$ and the capacity by nothing. The memory-capacity ceiling is lifted only by splitting the parameters and activations themselves, the model-parallel and sharded methods of Chapter 16.
The symmetric error runs the other way. A model that fits comfortably and is purely compute bound gains nothing from model parallelism; splitting it across devices adds the activation-passing communication that the roofline test above would have flagged, slowing a step that needed only more arithmetic units thrown at it in parallel. The same logic disqualifies inference replication for a training-compute problem, and data sharding for a system whose dataset already fits in memory and whose pain is request latency. Each mismatch shares one shape: a remedy whose ceiling was never the binding one, paying the communication tax of Chapter 3 for no return. The diagnosis in Figure 41.2.1 is precisely the guard against this, which is why it precedes any commitment to an architecture.
Who: A graduate team building a capstone around fine-tuning a 13-billion-parameter language model on a domain corpus.
Situation: Their single-GPU baseline crashed with an out-of-memory error before the first step completed, so they concluded the system did not scale and needed distribution.
Problem: Reasoning that more machines meant more capacity, they wrapped the model in data-parallel training across four GPUs and launched the job.
Dilemma: The job crashed again, now four times over, each worker hitting the identical allocation failure; the team faced spending their remaining time debugging a distribution layer that had not addressed the actual ceiling.
Decision: They went back and profiled, finding the working set never fit (a memory-capacity ceiling, the first diamond in Figure 41.2.1), not a throughput ceiling that data parallelism could touch.
How: They switched to fully sharded data parallel (FSDP) from Chapter 16, which partitions parameters, gradients, and optimizer state across the four GPUs so no device ever holds the whole model.
Result: The model fit, training ran, and the project shipped; the wasted days came entirely from choosing the axis before measuring the ceiling.
Lesson: Data parallelism multiplies throughput, never capacity. A model that does not fit needs the model axis, and only a profile, not an instinct, tells you which ceiling you are against.
5. From Verdict to Project Intermediate
Once the profile has named the binding ceiling and Table 41.2.1 has named the axis, the capstone has a spine. The axis fixes which part of the book you mine for technique, which production tool you adopt, and which metric defines success, so the rest of the design becomes a matter of following one column down rather than surveying the whole field. A capstone is rarely a single axis forever; real systems eventually stack several, as a large-model project distributes data, training, the model, and cluster coordination at once. But a capstone earns the right to a second axis only by saturating the first. You distribute the binding ceiling, re-profile, and let the next-tightest ceiling, now revealed, justify the next axis. Choosing one axis at a time, each one forced by a fresh measurement, is what keeps the project honest and finishable.
The spine of this book is that scale-out beats scale-up when, and only when, a single machine has run out of a specific resource. The capstone is where that thesis becomes a personal decision: you do not get to pick the axis you find elegant, you measure the baseline and let the ceiling pick for you. Every chapter taught the remedy for one ceiling; this section is where the reader inverts the book, starting from a measured symptom and reading backward to the chapter that cures it. The whole of Chapter 1's six-axis map exists to be used exactly here, in anger, on your own system.
Picking the distribution axis by hand is giving way to systems that search the parallelism space automatically. Alpa (Zheng et al., 2022) framed the choice of data, operator, and pipeline parallelism as a joint compilation problem and auto-derived near-optimal plans; the line continues through later auto-parallel compilers and through scheduler-integrated planners that treat the roofline and the alpha-beta communication cost as the objective. Recent work on training and serving large mixtures-of-experts pushes further, jointly placing experts, choosing the collective, and balancing load under measured bandwidth, so the axis decision is made per layer rather than per project. The reasoning is still the reasoning of this section, arithmetic intensity against the roofline ridge and communication against local work; what is changing is that a cost model, not a graduate student, now reads the profile and selects the axis. The capstone value of doing it by hand once is that you can tell when the automated planner has chosen wrong.
There is a reliable reflex in any team with a budget: the moment something feels slow, someone proposes adding GPUs. It is the computing equivalent of turning a stuck jar lid harder. Sometimes the lid is genuinely stuck and more torque helps; just as often the jar is upside down. The profile is the half-second of looking before you grip, and it has saved more capstones than any framework.
With the axis named and defended, the project can move from diagnosis to construction. The next step is to build and measure the single-machine baseline properly, the artifact you profiled informally here becomes the rigorous reference point that every distributed speedup is measured against, in Section 41.3. The architecture that distributes the chosen axis, with its concrete partitioning, communication pattern, and fault-tolerance plan, is committed in Section 41.4.
For each capstone idea, state the most likely binding ceiling and the axis you would reach for first from Table 41.2.1, and explain why one plausible alternative axis would not help: (a) a real-time speech transcriber that must return a caption within 200 milliseconds per request, serving thousands of users; (b) a one-time pretraining run for a model whose parameters plus optimizer state are three times the memory of the largest available accelerator; (c) a nightly job that must deduplicate and tokenize a 50-terabyte crawl that does not fit on any one node's disk. For each, identify the failure you would see if you distributed the wrong axis instead.
Adapt Code 41.2.1 to the actual steps of a capstone baseline you care about. Replace the three synthetic steps with the real candidate steps of your workload (for example, the data-loading step, the forward-plus-backward step, and the optimizer step of a training loop), instrument each with the same timing and arithmetic-intensity calculation, and produce the same table. Then attach py-spy record or torch.profiler to a full run and confirm that the step the profiler says dominates is the same one your hand-counted table flags. Where they disagree, explain which measurement you trust and why.
Take a machine with peak compute $\pi = 20$ TFLOP/s and peak memory bandwidth $\beta = 800$ GB/s. Compute the ridge intensity $I^\star = \pi/\beta$ in FLOP per byte. A candidate training step performs $W_{\text{flops}} = 6 \times 10^{12}$ FLOP and moves $Q_{\text{bytes}} = 1.2 \times 10^{11}$ bytes from memory. Compute its arithmetic intensity $I$, decide whether it is compute bound or memory-bandwidth bound, and name the axis. Then suppose distributing it would require an all-reduce of $Q_{\text{net}} = 4 \times 10^{9}$ bytes over a $\beta_{\text{net}} = 25$ GB/s link with per-message latency $\alpha = 10\,\mu\text{s}$; using the communication-bound test of Section 2, determine whether the communication roof would dominate the local compute time $W_{\text{flops}}/\pi$, and state what that implies for how many workers you should add.