"They plotted my speedup against a baseline so slow I could have beaten it asleep. The curve looked magnificent. The cluster was on fire."
An All-Reduce That Has Seen Some Gradients
A scaling curve is only as honest as the measurement behind each point, and most of the work in evaluating a distributed system is making those points trustworthy. Section 3.1 defined speedup $S(K) = T(1)/T(K)$, efficiency $E(K) = S(K)/K$, and scalability as the shape of $E(K)$ as $K$ grows. Those definitions are arithmetic; the difficulty is empirical. To get $T(K)$ right you must fix the workload, warm up the hardware, exclude one-time setup, repeat the run, and report the spread, because a single cold run on a poorly chosen baseline can manufacture any speedup a slide deck wants. This section is the measurement discipline that turns the three numbers into a curve you can defend: how to collect the timings, how to plot them against the ideal line, how to read where the curve bends and name the cause, and how to avoid the two mistakes that quietly inflate almost every scaling claim you will ever see.
In the previous chapter we argued that distributed systems need their own evaluation discipline because adding machines can make a system slower, not faster. This section operationalizes the most basic of those evaluations: the scaling study. By the end you will be able to take a system, produce a table of $T(K)$, $S(K)$, and $E(K)$ that another engineer would trust, plot it against the ideal-linear reference, and diagnose the bend where efficiency starts leaking away. We treat the definitions from Chapter 3 as settled and focus entirely on the part Chapter 3 deferred: getting the numbers off a real cluster without fooling yourself.
1. Measuring T(K) So the Curve Means Something Beginner
Every scaling number descends from one measurement: the wall-clock time $T(K)$ to finish a fixed job on $K$ workers. If that measurement is sloppy, no amount of careful plotting downstream can repair it. Five rules turn a raw timer into a defensible $T(K)$. First, fix the workload: the dataset, the model, the global batch size, the number of steps, and the stopping criterion must be identical at every $K$. The instant you change what the job is between points, you are no longer measuring scaling; you are comparing two different jobs and calling the ratio a speedup. Second, warm up: the first iterations on a GPU pay for kernel autotuning, memory-pool allocation, and just-in-time compilation, and on a cluster they also pay for connection setup and cache population. Time only the steady state, after those one-time costs have been paid, or your $T(1)$ and $T(K)$ will each carry a different amount of startup noise.
Third, exclude setup: process-group initialization, the first NCCL handshake, dataset download, and checkpoint loading are real costs, but they are amortized over a long run and do not belong in a per-step or per-epoch timing meant to characterize the steady-state scaling behavior. Measure them separately if they matter; do not smuggle them into $T(K)$. Fourth, average over runs: a single run is a sample from a noisy distribution, because schedulers, neighbors on shared nodes, thermal throttling, and network contention all jitter the time. Report the mean over several repeats. Fifth, report variance: a mean with no spread hides whether your "1.9x speedup" is solid or a coin flip away from 1.6x. Carry the standard deviation (or a min/max band) alongside every point so the reader can see the noise floor.
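The five rules above can be sketched as a small timing harness. This is a minimal illustration, not a production tool: `measure_step_time` and `toy_step` are hypothetical names, the step is a CPU-only stand-in for one training step, and setup is assumed to have finished before the harness is called.

```python
import statistics
import time


def measure_step_time(step_fn, warmup_steps=3, timed_steps=10, repeats=5):
    """Return (mean, std) of per-step wall-clock time over `repeats` runs.

    step_fn is a hypothetical zero-argument callable that runs ONE step of the
    fixed workload. Setup (data download, process-group init, checkpoint load)
    is assumed to be done before this function is called, so it is not timed.
    """
    per_run = []
    for _ in range(repeats):
        for _ in range(warmup_steps):   # rule 2: pay one-time costs untimed
            step_fn()
        start = time.perf_counter()
        for _ in range(timed_steps):    # rule 1: same step count at every K
            step_fn()
        elapsed = time.perf_counter() - start
        per_run.append(elapsed / timed_steps)
    # rules 4 and 5: mean over repeats, plus the sample standard deviation
    return statistics.mean(per_run), statistics.stdev(per_run)


def toy_step():
    # Stand-in workload; replace with one real training step.
    sum(i * i for i in range(20_000))


mean_t, std_t = measure_step_time(toy_step)
print(f"T = {mean_t:.6f} s +/- {std_t:.6f} s over 5 runs")
```

On a GPU, kernels launch asynchronously, so a real harness would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the timer at the start and end of the timed window; otherwise the timer measures launch time rather than execution time.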
Speedup is a ratio, and a ratio has two ends. Reviewers obsess over $T(K)$, the denominator, and forget that $T(1)$, the numerator, sets the entire scale. An honest $T(1)$ is the best single-machine run you can produce: the same optimized code, the same libraries, the same precision, running on one well-fed worker. If instead you measure $T(1)$ with distributed-training overhead still switched on, or on a machine starved of memory, or with debugging assertions left in, every $S(K)$ inherits that handicap and looks better than the system deserves. The first question to ask of any scaling plot is not "how high does it go?" but "what exactly is the point at $K = 1$?"
2. From Timings to a Scaling Table Intermediate
With the five rules in hand, a scaling study is mechanical: collect repeated timings at each $K$, take the baseline as the mean of the warm $K = 1$ runs, and derive $S(K)$ and $E(K)$ point by point. The code below stands in for that pipeline. Rather than occupy a real cluster, it generates synthetic per-$K$ timings from a model with a fixed serial part, a parallel part that shrinks as $1/K$, and a communication part that grows with $K$ (the all-reduce tax from Chapter 4), plus measurement noise so that variance is real. The analysis half, which is the part you would reuse verbatim on real logs, recovers $S(K)$ and $E(K)$ from nothing but the timings and flags the first $K$ where efficiency drops below a chosen threshold.
```python
import numpy as np

# Synthetic but realistic per-K measurements of a fixed (strong-scaling) job.
# We pretend to have run the SAME training step on K GPUs, R repeated runs each,
# after a warm-up that excludes one-time setup. Each run time is modelled as a
# fixed serial part f, a parallel part (1-f)/K, and a communication part that
# grows with K, plus a little measurement noise. We then recover S(K), E(K)
# purely from the timings, exactly as you would from a real log.
rng = np.random.default_rng(7)
Ks = [1, 2, 4, 8, 16, 32, 64]
R = 5               # repeated runs per K
f = 0.04            # irreducible serial fraction of one step
T1_compute = 1.000  # baseline single-GPU step time (seconds)


def true_mean_time(K):
    serial = f * T1_compute
    parallel = (1.0 - f) * T1_compute / K
    # all-reduce-style cost: grows slowly with K (log term dominates real rings)
    comm = 0.010 * np.log2(K + 1)
    return serial + parallel + comm


# Generate R noisy timing samples per K (2% relative Gaussian noise).
samples = {K: true_mean_time(K) * (1.0 + 0.02 * rng.standard_normal(R)) for K in Ks}
T1 = float(np.mean(samples[1]))  # baseline = mean over repeated K=1 runs

print(f"{'K':>4} {'mean T(K)':>11} {'std':>8} {'S(K)':>8} {'E(K)':>8} {'verdict':>10}")
print("-" * 56)
threshold = 0.70  # flag the first K where efficiency < 0.70
bend_K = None
for K in Ks:
    t = samples[K]
    mean_t = float(np.mean(t))
    std_t = float(np.std(t, ddof=1))
    S = T1 / mean_t
    E = S / K
    verdict = "ok" if E >= threshold else "below thr"
    if E < threshold and bend_K is None:
        bend_K = K
    print(f"{K:>4} {mean_t:>11.4f} {std_t:>8.4f} {S:>8.2f} {E:>8.3f} {verdict:>10}")
print("-" * 56)
print(f"baseline T(1) = {T1:.4f} s (mean of {R} warm runs)")
print(f"efficiency threshold = {threshold:.2f}")
print(f"first K with E(K) < {threshold:.2f}: K = {bend_K}")
print(f"interpretation: the curve bends near K = {bend_K}; beyond it, added")
print("GPUs return less than 70% of their cost because the per-step all-reduce")
print("has grown to rival the shrinking per-step compute.")
```
```text
   K   mean T(K)      std     S(K)     E(K)    verdict
--------------------------------------------------------
   1      1.0047   0.0091     1.00    1.000         ok
   2      0.5343   0.0098     1.88    0.940         ok
   4      0.3032   0.0034     3.31    0.828         ok
   8      0.1884   0.0039     5.33    0.667  below thr
  16      0.1392   0.0026     7.22    0.451  below thr
  32      0.1189   0.0026     8.45    0.264  below thr
  64      0.1140   0.0022     8.82    0.138  below thr
--------------------------------------------------------
baseline T(1) = 1.0047 s (mean of 5 warm runs)
efficiency threshold = 0.70
first K with E(K) < 0.70: K = 8
interpretation: the curve bends near K = 8; beyond it, added
GPUs return less than 70% of their cost because the per-step all-reduce
has grown to rival the shrinking per-step compute.
```
Output 5.2.1: Efficiency holds above $0.82$ through $K = 4$, then crosses below the $0.70$ threshold at $K = 8$ and collapses toward $0.14$ by $K = 64$, where speedup has all but flat-lined at $8.8\times$. The std column shows the noise is small relative to the trend, so the bend is real and not a sampling artifact.
Read the table as a budget. Through $K = 4$, efficiency above $0.82$ means each added GPU still does most of its potential work, and buying more is close to free. At $K = 8$ the threshold crossing is the warning light: efficiency $0.667$ says a third of the fleet's capacity is now lost to serial work and waiting on the network. By $K = 64$ the system has spent sixty-four times the hardware to buy $8.8\times$ the speed, an efficiency of $0.14$ in which six of every seven GPUs are, on average, waiting rather than computing. The variance column matters here too: because each std is only about $1\%$ to $2\%$ of its mean, the gaps between points are far larger than the noise, so we can state the bend location with confidence rather than hand-waving.
3. Reading the Curve: The Ideal Line and the Bend Intermediate
A table is precise but hard to feel; a plot makes the loss visible at a glance. The standard scaling plot puts $K$ on the horizontal axis and $S(K)$ on the vertical, and draws two things: the measured speedup curve and the ideal-linear reference line $S(K) = K$, the diagonal that a perfect system with $E(K) = 1$ would trace. Real curves hug that diagonal for small $K$, then peel away and flatten toward a ceiling. The point where the curve visibly departs from the line is the bend, and its location is the single most useful thing a scaling plot tells you: it marks the cluster size beyond which adding machines stops paying for itself. Figure 5.2.1 plots the measured speedup from Output 5.2.1 against the ideal line and annotates the bend at $K = 8$, the first point below the efficiency threshold.
Once you can see the bend, the diagnostic question is what causes it, and there are three usual suspects. Communication is the most common: the per-step all-reduce cost grows with $K$ while the per-worker compute shrinks as $1/K$, so at some cluster size the network time overtakes the compute time and further workers mostly wait, which is precisely the mechanism baked into Code 5.2.1. Stragglers bend the curve a different way: because a synchronous step finishes only when the slowest worker finishes, one slow GPU (thermal throttling, a noisy neighbor, a bad link) drags every step to its pace, and the probability that at least one worker is slow rises with $K$. Load imbalance is the third: if the work does not divide evenly (uneven shard sizes, variable-length sequences, a skewed graph partition), some workers idle at the barrier while others grind, and the wasted fraction grows with the imbalance. The shapes differ in informative ways: a communication bend is smooth and predictable, a straggler bend is noisy with a fat variance band, and a load-imbalance bend often appears even at small $K$ and does not improve with repeated runs.
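The straggler argument can be made concrete with one line of probability. The sketch below assumes each worker is slow on a given step independently with probability `p_slow`; both the independence and the value $0.01$ are illustrative assumptions, since real stragglers are often correlated (a shared switch, a hot rack). Under that assumption the chance that at least one of $K$ workers is slow is $1 - (1 - p)^K$.

```python
def prob_any_straggler(p_slow, K):
    """P(at least one of K independent workers is slow on a given step)."""
    return 1.0 - (1.0 - p_slow) ** K


p_slow = 0.01  # assumed per-worker, per-step slowdown probability
for K in [1, 8, 64, 512]:
    p_any = prob_any_straggler(p_slow, K)
    print(f"K = {K:>3}: P(step is gated by a straggler) = {p_any:.3f}")
```

With a one-percent per-worker rate, a single worker is almost never the problem, but at $K = 64$ nearly half of all synchronous steps are gated by at least one slow worker, which is why a straggler bend shows up as a widening variance band rather than a smooth curve.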
Who: A platform engineer validating a vendor's claim before signing for a 64-GPU training cluster.
Situation: The vendor's benchmark showed near-linear speedup to 64 GPUs, efficiency reported at $0.9$, on the engineer's own model architecture.
Problem: The reported single-GPU baseline ran with gradient checkpointing left on, a memory-saving option that exists to fit large models and that slows down a job that already fits in one GPU's memory.
Dilemma: Accept the headline $0.9$ efficiency and the linear curve, or rebuild the baseline fairly and risk a far less flattering number that complicates the purchase justification.
Decision: They re-measured $T(1)$ with the fastest correct single-GPU configuration, warmed up, and averaged over five runs, exactly the recipe in Section 1.
How: They reran the full sweep with the honest baseline, plotted $S(K)$ against the ideal line, and found the curve bent at $K = 16$, not $K = 64$.
Result: True efficiency at 64 GPUs was near $0.5$, not $0.9$; the cluster still made sense, but only at 32 GPUs, where efficiency stayed above $0.7$, saving the budget for the half of the fleet that would have idled.
Lesson: A speedup is only as honest as its baseline. Fixing $T(1)$ moved the bend by a factor of four and changed the purchase decision.
4. Strong-Scaling and Weak-Scaling Plots Intermediate
The study in Code 5.2.1 is a strong-scaling plot: the total problem is fixed and we ask how much faster $K$ workers finish it, so the natural vertical axis is speedup against the ideal-linear line, and the bend marks the cluster size beyond which the fixed job no longer benefits. A strong-scaling plot always bends eventually, because Amdahl's law (Section 3.5) caps speedup at $1/f$ for any serial fraction $f$. A weak-scaling plot asks a different question and therefore looks different. In weak scaling, defined in Section 3.3, the per-worker work is held constant and the total problem grows with $K$ (double the GPUs, double the global batch or the dataset), so the right vertical axis is not speedup but weak-scaling efficiency, the ratio $T(1)/T(K)$ of the single-worker time to the $K$-worker time on the proportionally larger problem. The ideal there is a flat line at $1.0$: a perfectly weak-scaling system keeps per-step time constant as it grows, so its efficiency curve stays horizontal rather than climbing a diagonal.
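To see how far below the Amdahl ceiling a communicating system can sit, the sketch below compares pure Amdahl speedup with the noise-free timing model from Code 5.2.1. It reuses that model's serial fraction $f = 0.04$ and its $0.010 \cdot \log_2(K + 1)$ communication term; the function names are illustrative, and the communication model is the same simplifying assumption as in Code 5.2.1.

```python
import math

f = 0.04          # serial fraction, as in Code 5.2.1
T1_compute = 1.0  # single-worker compute time (seconds), as in Code 5.2.1


def amdahl_speedup(K):
    """Amdahl's law: serial part fixed, parallel part shrinks as 1/K."""
    return 1.0 / (f + (1.0 - f) / K)


def model_time(K):
    """Noise-free step time from Code 5.2.1, including the communication term."""
    return f * T1_compute + (1.0 - f) * T1_compute / K + 0.010 * math.log2(K + 1)


for K in [1, 8, 64, 1024]:
    s_amdahl = amdahl_speedup(K)
    s_model = model_time(1) / model_time(K)
    print(f"K = {K:>4}: Amdahl S = {s_amdahl:6.2f}   with comm S = {s_model:6.2f}")
print(f"Amdahl ceiling 1/f = {1.0 / f:.1f}")
```

Amdahl's law alone would allow a speedup of about $18$ at $K = 64$ and never more than $25$; once the communication term is added, the modelled speedup stalls below $9$ at $K = 64$ and then declines as $K$ keeps growing, because the communication cost keeps rising after the parallel part has nearly vanished.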
Plotting the two correctly is what keeps an argument honest. A strong-scaling plot with a flat ideal line, or a weak-scaling plot drawn against the $S(K) = K$ diagonal, mislabels the experiment and invites the reader to expect the wrong shape. The convention to internalize: strong scaling plots speedup against the rising ideal line $S(K) = K$ and the curve bends; weak scaling plots efficiency against the flat ideal line at $1.0$ and the curve sags. Foundation-model training is overwhelmingly a weak-scaling story, which is why a thousand-GPU run that would look like a catastrophic strong-scaling failure (efficiency near $0.05$) can be an excellent weak-scaling success (efficiency near $0.9$) on the much larger model it was actually built to train. Reporting one when you measured the other is one of the most common ways scaling claims mislead.
The same cluster produces a flattening speedup curve (and falling efficiency) under strong scaling and a nearly flat efficiency curve under weak scaling, and neither number is wrong, but they answer different questions. Before interpreting any scaling plot, establish which experiment produced it: was the problem held fixed (strong) or grown with the machines (weak)? A curve has no meaning until you know what was held constant, and the most persuasive misleading plots are the ones that quietly switch regimes between the baseline and the largest point.
5. The Two Mistakes That Inflate Almost Every Curve Advanced
Two errors account for the majority of overstated scaling claims, and both are subtle enough to survive peer review. The first is the weak single-node baseline: measuring $T(1)$ on a deliberately or accidentally crippled single-machine run so that every $S(K) = T(1)/T(K)$ is computed from an inflated numerator. The crippling can be innocent (distributed-training code paths left active at $K = 1$, a debug build, an unoptimized data loader) or strategic (gradient checkpointing forced on when it only slows a job that fits in memory, as in the Practical Example). The fix is a rule: the baseline must be the fastest correct single-worker configuration, not the distributed code run with one worker. A useful sanity check is to ask whether $T(1)$ would embarrass you as a standalone single-GPU benchmark; if it would, your speedups are borrowed against it.
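The arithmetic of a weak baseline is simple enough to check directly. The sketch below takes three mean step times from Output 5.2.1 and recomputes efficiency against a baseline that is slower by a hypothetical factor of $1.4$; the `handicap` value is an assumption chosen for illustration, not a measured quantity.

```python
# Mean step times from Output 5.2.1 (seconds).
mean_time = {1: 1.0047, 8: 0.1884, 64: 0.1140}

honest_T1 = mean_time[1]
handicap = 1.4                  # hypothetical slowdown of a crippled K=1 run
weak_T1 = handicap * honest_T1  # what the handicapped baseline would report

print(f"{'K':>3} {'honest E(K)':>12} {'inflated E(K)':>14}")
for K in [8, 64]:
    honest_E = honest_T1 / mean_time[K] / K
    inflated_E = weak_T1 / mean_time[K] / K
    print(f"{K:>3} {honest_E:>12.3f} {inflated_E:>14.3f}")
```

Because $T(1)$ appears in every $S(K)$, the handicap multiplies every efficiency by the same factor: the $K = 8$ point moves from about $0.67$ to about $0.93$ and sails past the $0.70$ threshold without a single distributed run getting any faster.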
The second is changing the problem between points: shrinking the per-step work, lowering the iteration count, relaxing the convergence target, or growing the batch size as $K$ grows, then reporting the result as strong scaling. Increasing the global batch size with $K$ is the most frequent offender because it feels natural in data-parallel training, but it converts a strong-scaling experiment into a weak-scaling one midway through, so the curve looks far more linear than the fixed job would. The remedy is the first rule of Section 1, enforced ruthlessly: the workload at $K = 64$ must be byte-for-byte the same job as at $K = 1$, and if you intend to grow the batch, declare it and plot weak-scaling efficiency instead. Both mistakes share a signature: they make the curve hug the ideal line longer than the physics allows, so a curve with no visible bend out to large $K$ should prompt suspicion, not applause.
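One way to enforce the fixed-workload rule mechanically is to record the workload alongside each timing and refuse to compute a strong-scaling speedup when any point differs from the baseline. The sketch below is a hypothetical guard: `WorkloadSpec`, `check_strong_scaling`, and the field values are illustrative names and placeholders, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    """Everything that must be identical at every K in a strong-scaling study."""
    dataset: str
    model: str
    global_batch_size: int
    num_steps: int
    target_metric: float


def check_strong_scaling(points):
    """points maps K -> WorkloadSpec; raise if any point changes the job."""
    reference = points[min(points)]  # the smallest K is the baseline
    for K, spec in sorted(points.items()):
        if spec != reference:
            raise ValueError(f"workload at K={K} differs from baseline: {spec}")


base = WorkloadSpec("toy-corpus", "toy-model", 512, 1000, 0.75)
grown = WorkloadSpec("toy-corpus", "toy-model", 4096, 1000, 0.75)  # batch grew with K
points = {1: base, 8: base, 64: grown}
try:
    check_strong_scaling(points)
except ValueError as err:
    print("rejected:", err)
```

A sweep that fails this check is not necessarily wrong; it is a weak-scaling experiment and should be plotted as one.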
A reliable way to report an enormous speedup is to make the baseline slow enough. Run $T(1)$ in pure Python with assertions on, logging every tensor, on a machine swapping to disk, and a competently configured 64-GPU run will beat it by four orders of magnitude. The number is arithmetically true and completely meaningless. Whenever a speedup is too good to believe, the baseline $T(1)$ is usually where the magic was performed.
Code 5.2.1 took its timings as given; when you time steps by hand on a real cluster, you are responsible for skipping warm-up iterations and excluding setup. In practice the framework profiler can do much of this for you. The PyTorch profiler schedules explicit wait, warmup, and active phases so that only steady-state steps are recorded, and its per-operator breakdown lets you separate compute time from communication and data-loading time:
```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# `loader` and `train_step` are your own data loader and single training step.
sched = schedule(wait=1, warmup=3, active=5, repeat=1)  # skip 1, warm up 3, record 5, once

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=sched) as prof:
    for step, batch in enumerate(loader):
        if step >= 1 + 3 + 5:   # stop once the single profiling cycle is complete
            break
        train_step(batch)       # your one training step
        prof.step()             # advances the wait/warmup/active phases

# Per-operator totals over the 5 recorded steps, warm-up excluded;
# divide by 5 to get a per-step figure for T(K).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```
With the profiler's `schedule`, the roughly ten lines of manual timing-loop bookkeeping collapse to one scheduler call, and the profiler additionally attributes time to compute versus communication, which is what Section 3 needs to diagnose the bend.
The measurement loopholes this section closes by hand are being closed by community standards. MLCommons continued to tighten the MLPerf Training rules through its 2024 rounds (v4.0 and v4.1), fixing the workload, the convergence target, and the timing window precisely so that a vendor's $T(K)$ is comparable across submissions rather than a private definition, which amounts to the weak-baseline and changing-problem defenses turned into an auditable benchmark. In parallel, the empirical-rigor literature has pushed back on uninstrumented scaling claims: work in the lineage of "reproducibility checklists" for ML systems now asks papers to report variance, the exact baseline configuration, and the strong-versus-weak regime as a condition of acceptance, and large-scale training reports such as the Llama 3 herd paper (Dubey et al., 2024) document model-FLOPs utilization and step-time stability rather than a bare speedup number. The direction of travel is clear: a scaling curve without its provenance is increasingly treated as unpublishable, and the six-row checklist below is converging with what reviewers and benchmarks now demand.
6. The Reporting Checklist Beginner
A scaling result that another engineer can trust carries its provenance with it. Before a curve leaves your notebook, confirm every item in Table 5.2.1; each row closes one of the loopholes that this section opened. The table doubles as the rubric Chapter 5 applies to scaling claims in system papers, and as the spec for the reproducible measurement harness that Chapter 2's coordination concepts make possible on a real cluster.
| Report | Why it matters | Failure it prevents |
|---|---|---|
| The exact baseline $T(1)$ | Sets the scale of every $S(K)$ | Weak single-node baseline |
| Fixed workload spec at all $K$ | Guarantees one job, not many | Changing the problem between points |
| Warm-up and setup exclusion | Times steady state only | Startup noise inflating $T(1)$ |
| Number of repeats and variance | Shows the noise floor | A lucky single run |
| Strong vs weak regime | Tells the reader the question | Regime switching mid-curve |
| Efficiency $E(K)$, not just $S(K)$ | Exposes idle hardware | Large speedup hiding low efficiency |
With these six in place, the scaling curve stops being a marketing artifact and becomes a measurement. The next section keeps the same measurement discipline but changes the quantity: speedup and efficiency describe how fast a fixed job finishes, while a serving system cares about how many useful requests per second it sustains and how long the slowest of them takes. That shift, from throughput and goodput to tail latency and service-level objectives, is the subject of Section 5.3.
A paper reports speedups of $1.0$, $2.0$, $4.0$, and $8.0$ at $K = 1, 2, 4, 8$, a perfectly linear curve with no bend out to eight workers, and credits it to a "communication-free" design. Using the reasoning in Sections 3 and 5, give two distinct measurement mistakes that could produce a perfectly linear curve even on a system that genuinely pays communication cost, and describe the single re-measurement you would demand to tell an honest linear result from an inflated one.
Extend Code 5.2.1 in two ways. First, compute and print a $95\%$ confidence interval for each mean $T(K)$ (use $\pm 1.96 \cdot \text{std}/\sqrt{R}$) so each point carries an error bar. Second, inject a straggler: at each $K$, multiply the slowest of the $R$ samples by a factor drawn from $1.0$ to $1.5$ before taking the mean, modeling a synchronous step gated by its slowest worker. Re-run and report how the bend location and the variance band change, and explain why a straggler-driven bend is distinguishable from a communication-driven one by its variance.
Take the timing model in Code 5.2.1 and use it to produce two plots from the same underlying system. For the strong-scaling plot, hold the problem fixed and plot $S(K)$ against the ideal line $S(K) = K$. For the weak-scaling plot, let the per-worker problem stay constant (so the parallel part of the time no longer shrinks with $K$, only the communication part grows) and plot weak-scaling efficiency $T(1)/T(K)$ against the flat ideal at $1.0$. Report the efficiency at $K = 64$ under each regime, and explain in two or three sentences how the same hardware can look like a failure under one plot and a success under the other, referencing the foundation-model argument from Section 3.3.