"They added a hundred more of me and were shocked that I went only twice as fast. Nobody asked the one part of the job that refuses to be cloned."
A Worker That Read Amdahl's Fine Print
"Scaling" is not a vibe; it is a measurable relationship between the resources you add and the speed you get back. Chapter 1 argued that AI workloads are forced off a single machine by ceilings on data, model size, and throughput, and Chapter 2 gave us the distributed-systems vocabulary for the machines we add. This chapter asks the harder follow-up: when you spend more machines, what do you actually buy? The answer lives in three numbers computed from one quantity, the time a job takes on $K$ machines. Speedup says how much faster you went, efficiency says how much of each added machine you wasted, and scalability describes how that efficiency erodes as $K$ grows. Get these definitions exact and the rest of the chapter, from Amdahl's law to the roofline model, becomes arithmetic rather than folklore. This first section pins the definitions down and shows, in a table you can run yourself, why doubling the machines almost never doubles the speed.
Every engineer has heard the phrase "it scales" and every engineer has watched a system that supposedly scaled grind to a near-standstill once a few dozen machines were involved. The word carries too much hope and too little precision. Before this chapter can model where speedups come from and where they leak away, we need a definition of scaling sharp enough to put on an axis and plot. That definition starts from a single measurable thing: how long the job takes when you run it on $K$ machines instead of one. From that one quantity we derive everything else, and we ground each derivation in the workload this book cares about most, a training step distributed across $K$ GPUs.
1. Speedup: How Much Faster Did We Go? Beginner
Let $T(K)$ be the wall-clock time to finish a fixed job using $K$ workers, so $T(1)$ is the single-machine baseline. The speedup on $K$ workers is the ratio of the baseline time to the parallel time,
$$S(K) = \frac{T(1)}{T(K)}.$$
Speedup is a pure number that answers the only question a practitioner truly cares about: did adding machines make the job finish sooner, and by what factor? If a training epoch takes 80 minutes on one GPU and 20 minutes on four, then $S(4) = 80/20 = 4$. Concretely, picture the job as processing a fixed number of training steps, say one epoch over a fixed dataset. On one GPU the epoch streams through every minibatch in sequence; on $K$ GPUs running data-parallel training, each GPU owns a shard of the minibatches, so the per-step throughput (minibatches finished per second across the whole cluster) is what $S(K)$ ultimately measures. The baseline $T(1)$ must be the best honest single-machine time, not a deliberately crippled one, or every later number is inflated.
The aspirational case is linear or ideal speedup, $S(K) = K$: four GPUs finish in a quarter of the time, a thousand GPUs in a thousandth. Real systems almost always fall short, giving sublinear speedup, $S(K) < K$, because some part of the work does not parallelize and because the workers must spend time communicating. The gap between the line $S(K) = K$ and the curve a real system traces is the entire subject of this chapter, and the all-reduce of Chapter 4 is the operation that usually opens it.
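The definition is simple enough to apply directly to measured times. The following is a minimal sketch: the helper names `speedup` and `classify` and the tolerance `tol` are illustrative choices rather than anything the chapter prescribes, and the 25-minute run is a hypothetical number added for contrast.

```python
def speedup(t1: float, tk: float) -> float:
    """S(K) = T(1) / T(K) for the same fixed job; both times in the same unit."""
    if t1 <= 0 or tk <= 0:
        raise ValueError("times must be positive")
    return t1 / tk


def classify(s: float, k: int, tol: float = 1e-9) -> str:
    """Label a speedup relative to the ideal line S(K) = K (tol is an assumption)."""
    if s > k + tol:
        return "superlinear"
    if s < k - tol:
        return "sublinear"
    return "linear"


# The epoch example from the text: 80 minutes on one GPU, 20 minutes on four.
s4 = speedup(80.0, 20.0)
print(s4, classify(s4, 4))

# A hypothetical slower run on the same four GPUs: 25 minutes instead of 20.
s4_slow = speedup(80.0, 25.0)
print(s4_slow, classify(s4_slow, 4))
```

The first call reproduces the ideal case $S(4) = 4$; the second shows how a modest increase in $T(4)$ already drops the run below the ideal line.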
You measure exactly one thing, the time $T(K)$ to finish a fixed job on $K$ workers. Speedup $S(K) = T(1)/T(K)$ is that time relative to the baseline, efficiency $E(K) = S(K)/K$ is that speedup relative to the workers you paid for, and scalability is the shape of $E(K)$ as $K$ climbs. Nail down what counts as "the job" and what counts as $T(1)$, and the three numbers are unambiguous. Leave either vague and a vendor can report a "10x speedup" that quietly compares a tuned cluster against a sabotaged laptop.
2. Efficiency: How Much of Each Machine Did We Waste? Beginner
Speedup alone flatters large clusters: a speedup of 50 sounds triumphant until you learn it took 200 GPUs to get there. Efficiency normalizes speedup by the resources spent,
$$E(K) = \frac{S(K)}{K} = \frac{T(1)}{K \, T(K)}.$$
Efficiency lives in $(0, 1]$ (occasionally above, as Section 5 notes) and answers a budgeting question: of the $K$ machines you are paying for, what fraction is doing useful work rather than waiting? Ideal speedup $S(K) = K$ corresponds to $E(K) = 1$, every machine fully earning its keep. A speedup of 50 on 200 GPUs is $E = 0.25$: three quarters of the fleet is, in effect, idling on barriers and shuffling gradients. For a data-parallel training step, low efficiency usually means the per-step all-reduce has grown to rival the per-step compute, so the GPUs spend their time waiting on the network rather than on matrix multiplies. Efficiency becomes a reporting standard for system papers in Chapter 5; here it is simply the honest companion that keeps speedup from lying.
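As a quick check on the arithmetic, the sketch below turns a reported speedup into an efficiency and into a count of "idle-equivalent" workers. The name `idle_equivalent` and the quantity $K(1 - E(K))$ are illustrative conveniences introduced here, not standard terminology.

```python
def efficiency(s: float, k: int) -> float:
    """E(K) = S(K) / K: the fraction of the K workers doing useful work."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return s / k


def idle_equivalent(s: float, k: int) -> float:
    """Workers' worth of capacity not converted into speedup: K * (1 - E(K))."""
    return k * (1.0 - efficiency(s, k))


# The example from the text: a speedup of 50 bought with 200 GPUs.
e = efficiency(50.0, 200)
idle = idle_equivalent(50.0, 200)
print(f"E = {e:.2f}")
print(f"idle-equivalent GPUs = {idle:.0f}")
```

Reading efficiency as a head count makes the budgeting question concrete: a speedup of 50 on 200 GPUs leaves the equivalent of 150 GPUs contributing nothing to the result.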
3. Scalability: How Does Efficiency Behave as K Grows? Intermediate
A single $(S, E)$ pair at one value of $K$ tells you about one cluster size. Scalability is the trend: how $E(K)$ behaves as $K$ increases. A system is said to scale well if efficiency stays high as you add machines, and to scale poorly if efficiency collapses. Because some part of almost every job is inherently serial and communication grows with $K$, the typical curve is monotonically decreasing: $E(1) = 1$, then a gentle decline, then a cliff. Plotting $S(K)$ against $K$ on the same axes as the ideal line $S(K) = K$ makes the erosion visible at a glance, which is exactly what Figure 3.1.1 does.
The flattening in Figure 3.1.1 is not an accident of one bad system; it is the generic fate of a fixed job spread ever more thinly. Some fraction of the work, loading the first batch, the optimizer step that touches the whole model, a logging barrier, simply will not run faster no matter how many GPUs wait on it, and that fraction sets the horizontal ceiling the real curve approaches. The next section makes the curve numeric so you can see the ceiling arrive.
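One way to read scalability off a set of measurements is to compute $E(K)$ for each cluster size and find where it first drops below a threshold you care about. The sketch below assumes you already have wall-clock times for a fixed job; the timings in `times` are made up for illustration and are not measurements, and the helper names and the $0.75$ threshold are likewise assumptions.

```python
def efficiency_curve(times: dict[int, float]) -> dict[int, float]:
    """Map each K to E(K) = T(1) / (K * T(K)); requires a K = 1 baseline."""
    t1 = times[1]
    return {k: t1 / (k * times[k]) for k in sorted(times)}


def last_k_at_or_above(curve: dict[int, float], threshold: float) -> int:
    """Largest measured K before efficiency first drops below the threshold."""
    best = 1
    for k, e in curve.items():  # insertion order is ascending K
        if e < threshold:
            break
        best = k
    return best


# Made-up timings in seconds, for illustration only (not measurements).
times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0, 16: 105.0, 32: 80.0}

curve = efficiency_curve(times)
for k, e in curve.items():
    print(f"K={k:>3}  E={e:.3f}")
print("last K with E >= 0.75:", last_k_at_or_above(curve, 0.75))
```

With these illustrative numbers, efficiency declines gently through $K = 8$ and then drops sharply, which is the "gentle decline, then a cliff" shape described above.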
4. A Runnable Speedup and Efficiency Table Intermediate
To see scalability turn into numbers, model a training job with a small serial fraction $f$: the portion of the per-step work that cannot be parallelized across GPUs. If the parallel portion $1 - f$ shrinks by a factor of $K$ while the serial portion $f$ stays fixed, then $T(K) = f + (1 - f)/K$ with $T(1) = 1$, and the speedup follows directly. The code below sweeps $K$ from $1$ to $512$ for $f = 0.05$ (a realistic 5% serial step on a well-tuned data-parallel job) and prints $T(K)$, $S(K)$, and $E(K)$ in one pass.
serial_fraction = 0.05  # 5% of the work cannot be parallelized across GPUs

def speedup(K, f):
    # T(1)=1; parallel part 1-f shrinks by K, serial part f stays fixed
    return 1.0 / (f + (1.0 - f) / K)

print(f"serial fraction f = {serial_fraction}")
print(f"{'K':>6} {'T(K)':>10} {'S(K)':>10} {'E(K)':>10}")
for K in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    S = speedup(K, serial_fraction)
    T = 1.0 / S
    E = S / K
    print(f"{K:>6} {T:>10.4f} {S:>10.4f} {E:>10.4f}")
serial fraction f = 0.05
     K       T(K)       S(K)       E(K)
     1     1.0000     1.0000     1.0000
     2     0.5250     1.9048     0.9524
     4     0.2875     3.4783     0.8696
     8     0.1688     5.9259     0.7407
    16     0.1094     9.1429     0.5714
    32     0.0797    12.5490     0.3922
    64     0.0648    15.4217     0.2410
   128     0.0574    17.4150     0.1361
   256     0.0537    18.6182     0.0727
   512     0.0519    19.2844     0.0377
Output 3.1.1. Speedup climbs steeply at first, then stalls: from $K = 256$ to $K = 512$ it rises only from $18.6$ to $19.3$, while efficiency falls below $4\%$. A mere $5\%$ serial fraction caps the entire job at roughly $20\times$ no matter how many GPUs you buy.
The table is Figure 3.1.1 made arithmetic. For $K \le 4$ the speedup is nearly linear and efficiency stays above $0.85$; this is the regime where adding GPUs is almost free. By $K = 32$ efficiency has fallen below $0.4$, meaning more than half the fleet is idle, and by $K = 512$ each added GPU returns almost nothing while still drawing power and rent. The horizontal ceiling that Figure 3.1.1 drew is the limit $S(\infty) = 1/f = 20$, approached here from below. This is the central lesson of scalability stated in numbers: a small unparallelizable fraction sets a hard ceiling on speedup, and efficiency is the early-warning gauge that tells you the ceiling is near long before the speedup curve visibly flattens. Where that $1/f$ ceiling comes from formally is the business of Section 3.5.
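The claim that each added GPU eventually returns almost nothing can be made explicit by looking at the marginal gain per doubling. The sketch below reuses the same model, $T(K) = f + (1 - f)/K$ with $f = 0.05$; the "per added GPU" column is an illustrative quantity introduced here (speedup gained divided by the number of GPUs added), not something the chapter defines.

```python
def speedup(k: int, f: float) -> float:
    """Modeled speedup with T(1) = 1 and T(K) = f + (1 - f) / K."""
    return 1.0 / (f + (1.0 - f) / k)


f = 0.05
ks = [1, 2, 4, 8, 16, 32, 64, 128, 256]

print(f"{'K -> 2K':>13} {'S gained':>10} {'per added GPU':>14}")
rows = []
for k in ks:
    gained = speedup(2 * k, f) - speedup(k, f)
    per_gpu = gained / k  # doubling from K adds K more GPUs
    rows.append((k, gained, per_gpu))
    print(f"{k:>5} -> {2 * k:<4} {gained:>10.4f} {per_gpu:>14.4f}")
```

In this model the return per added GPU shrinks at every doubling: the first extra GPU buys most of a full unit of speedup, while each of the 256 GPUs added in the last step buys only a tiny fraction of one.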
The classic way to feel a serial fraction in your bones: one mother produces a baby in nine months, but nine mothers do not produce a baby in one month. The gestation is irreducibly serial, so its speedup ceiling is $1\times$ regardless of how many mothers you assign. A training job with $f = 0.05$ is the same story with a friendlier constant: most of the work parallelizes, but the stubborn $5\%$ quietly caps you at $20\times$.
5. Superlinear Speedup: When the Curve Beats the Line Advanced
Occasionally a measured system reports $S(K) > K$, efficiency above $1$, a curve that rises above the ideal diagonal of Figure 3.1.1. This superlinear speedup looks like a free lunch and is almost never one. The usual cause is the memory hierarchy: a dataset or working set that overflows one machine's cache (or spills from GPU memory to host memory) fits comfortably once it is sharded across $K$ machines, so each worker suddenly runs from a faster level of the hierarchy. The extra speed is real, but it comes from a change in where the data lives, not from the parallelism itself, so it is a cache or memory-pressure artifact rather than a law you can extrapolate. Treat a superlinear report as a prompt to ask which resource the single-machine baseline was starved of; an honest $T(1)$ on adequate memory usually removes the effect.
Efficiency above $1$ almost always means the single-machine baseline $T(1)$ paid a penalty the cluster avoided, typically cache or GPU-memory thrashing on a working set too big for one device. The parallel version is genuinely faster, but the credit belongs to the memory hierarchy, not to the algorithm scaling past ideal. When you see $E(K) > 1$, audit the baseline before you celebrate, because the bonus is a one-time gain: once $K$ is large enough that each shard fits in fast memory, adding workers buys no more of it, and the usual serial and communication costs take over.
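A toy model shows how a starved baseline manufactures superlinear numbers. The sketch below extends the serial-fraction model of Section 4 with one assumption of its own: whenever a worker's shard exceeds its fast memory, the parallel part of the step runs `SPILL_PENALTY` times slower. The working-set size, memory capacity, and penalty are hypothetical values chosen only to make the effect visible.

```python
def step_time(k: int, working_set_gb: float, fast_mem_gb: float,
              spill_penalty: float, f: float = 0.05) -> float:
    """Toy T(K): serial part f plus parallel part (1 - f) / K, with the
    parallel part slowed by spill_penalty when a shard overflows fast memory."""
    shard_gb = working_set_gb / k
    slowdown = spill_penalty if shard_gb > fast_mem_gb else 1.0
    return f + slowdown * (1.0 - f) / k


# Hypothetical configuration: the full working set overflows one device.
WORKING_SET_GB = 60.0
FAST_MEM_GB = 40.0
SPILL_PENALTY = 3.0

# Starved baseline: one device, working set spills out of fast memory.
t1_starved = step_time(1, WORKING_SET_GB, FAST_MEM_GB, SPILL_PENALTY)
# Adequate baseline: one device with enough fast memory for the whole set.
t1_adequate = step_time(1, WORKING_SET_GB, WORKING_SET_GB, SPILL_PENALTY)

results = {}
print(f"{'K':>4} {'E vs starved T(1)':>18} {'E vs adequate T(1)':>19}")
for k in [2, 4, 8, 16, 32, 64]:
    tk = step_time(k, WORKING_SET_GB, FAST_MEM_GB, SPILL_PENALTY)
    results[k] = (t1_starved / (k * tk), t1_adequate / (k * tk))
    print(f"{k:>4} {results[k][0]:>18.3f} {results[k][1]:>19.3f}")
```

Against the starved baseline, efficiency sits well above $1$ at small $K$ and then decays as the serial fraction takes over; against the adequate baseline, the same cluster times give the ordinary sublinear curve.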
6. Scaling a Fixed Problem vs Scaling the Problem With Resources Intermediate
Everything above measured speedup on a fixed job: the dataset, the model, and the work were held constant while $K$ grew. This is strong scaling, and the table in Output 3.1.1 is a strong-scaling study. Strong scaling answers "can I finish the same job faster by adding machines?" and, as the serial fraction shows, it has a hard ceiling. But it is not the only thing people mean by "scaling", and conflating the two is a common source of argument.
The other regime grows the problem along with the resources. If you double the GPUs and also double the global batch size or the dataset, you are asking a different question: "can I do a proportionally bigger job in the same time?" This is weak scaling, and because the per-worker work stays constant, it often holds efficiency far higher than strong scaling does. Training foundation models is overwhelmingly a weak-scaling story: nobody buys a thousand GPUs to train the same tiny model faster; they buy them to train a vastly larger model on vastly more data within a fixed wall-clock budget. The distinction is important enough that Section 3.3 is devoted to it, and the optimism of Gustafson's law in Section 3.5 rests entirely on the weak-scaling view. For now, fix the vocabulary: strong scaling holds the problem fixed and chases speed, weak scaling grows the problem and chases size, and the same cluster can look like a failure under one lens and a triumph under the other.
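The two regimes can be compared side by side in a small model. The sketch below is built on stated assumptions, not on the chapter's later derivations: one unit of work costs a serial part $f$ plus a parallel part $1 - f$, each step pays a communication cost of $c \log_2 K$ (a made-up stand-in for an all-reduce), strong-scaling efficiency is $T(1) / (K \, T(K))$ on the fixed job, and weak-scaling efficiency is $T(1) / T(K)$ when the job grows $K$-fold. The values $f = 0.05$ and $c = 0.01$ are illustrative.

```python
import math


def strong_efficiency(k: int, f: float, c: float) -> float:
    """Fixed job: the parallel part is divided across K workers."""
    t1 = 1.0
    tk = f + (1.0 - f) / k + c * math.log2(k)
    return t1 / (k * tk)


def weak_efficiency(k: int, f: float, c: float) -> float:
    """Job grows K-fold: each worker keeps one unit of parallel work."""
    t1 = 1.0
    tk = f + (1.0 - f) + c * math.log2(k)
    return t1 / tk


f, c = 0.05, 0.01  # assumed serial fraction and communication coefficient
table = {}
print(f"{'K':>5} {'strong E':>10} {'weak E':>10}")
for k in [1, 4, 16, 64, 256]:
    table[k] = (strong_efficiency(k, f, c), weak_efficiency(k, f, c))
    print(f"{k:>5} {table[k][0]:>10.4f} {table[k][1]:>10.4f}")
```

Under these assumptions the same cluster and the same communication cost produce a collapsing strong-scaling curve and a nearly flat weak-scaling curve, which is the "failure under one lens, triumph under the other" point in miniature.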
Who: An ML platform engineer auditing a vendor benchmark before signing a multi-year GPU-cluster contract.
Situation: The vendor's slide showed a data-parallel training job going from 1 to 256 GPUs with a headline "near-linear 180x speedup", presented as proof the cluster scaled.
Problem: A 180x speedup on 256 GPUs is an efficiency of $E = 180/256 = 0.70$, so 30% of a very expensive fleet was idle, a fact the speedup-only framing hid.
Dilemma: Accept the impressive speedup number and buy 256 GPUs, or compute efficiency, find the knee in the curve, and size the purchase to where each GPU still earned its rent.
Decision: They reran the vendor's own numbers as an efficiency table in the style of Output 3.1.1 and found efficiency stayed above $0.85$ only through $K = 64$, then fell off a cliff.
How: They asked the vendor to rerun the benchmark as a weak-scaling study (grow the global batch with $K$) since their real workload was foundation-model training, and efficiency held above $0.9$ to $K = 256$.
Result: They bought 256 GPUs but committed to weak-scaling workloads on them, and reserved a 64-GPU partition for the strong-scaling retraining jobs where larger $K$ was pure waste.
Lesson: Speedup sells; efficiency sizes. Always demand the efficiency curve and always ask whether the benchmark was a strong-scaling or weak-scaling study, because the right $K$ depends entirely on which question the workload is really asking.
The word "scaling" now carries two distinct meanings that the field is actively reconciling. Systems scaling is the $S(K)$ and $E(K)$ of this section: more machines, faster jobs. Statistical scaling laws, in the Chinchilla lineage and its compute-optimal successors, predict how model loss falls as parameters, data, and total compute grow. Recent work ties the two together by asking how to spend a fixed cluster so that the compute-optimal model is also trained at high hardware efficiency: Megatron-style and FSDP-based studies report data on how efficiency $E(K)$ holds (or does not) at thousands of GPUs, and 2024-2026 work on overlap-aware and communication-avoiding parallelism (building on the local-update and compression lines from Chapter 10) pushes the knee of the efficiency curve to larger $K$. The frontier question is no longer "does it scale?" but "at what $K$ does each added GPU stop buying either speed or a better model?", a question that needs both kinds of scaling on the same axes. We give the data-parallel half of this story its full treatment in Chapter 15.
Code 3.1.1 modeled speedup from an assumed serial fraction to build intuition. For a real job you measure $T(K)$ directly and let a benchmarking tool compute the ratios. PyTorch ships torch.utils.benchmark for exactly this, turning warmup, repeated timing, and statistics into a few lines:
import torch.utils.benchmark as bench

# Time the same training step on 1 GPU and on K GPUs (run each config separately).
# train_one_step, model, and batch are assumed to be defined by the surrounding script.
t = bench.Timer(
    stmt="train_one_step(model, batch)",
    globals={"train_one_step": train_one_step, "model": model, "batch": batch},
)
median_seconds = t.blocked_autorange().median  # robust per-step time, warmup handled

# Then S(K) = t1.median / tK.median and E(K) = S(K) / K, computed from measured times.
Measuring $T(K)$ with torch.utils.benchmark instead of assuming it. The timer handles warmup, GPU synchronization, and outlier-robust medians, collapsing a dozen lines of manual timing into one call; you supply the two medians and the speedup and efficiency formulas of this section do the rest.
7. From Definitions to Models Beginner
We now have the three quantities the rest of the chapter manipulates. Speedup $S(K) = T(1)/T(K)$ says how much faster a fixed job ran; efficiency $E(K) = S(K)/K$ says how much of each machine was wasted; and scalability is the shape of $E(K)$ as $K$ climbs, a shape that bends and flattens whenever a serial fraction or communication cost refuses to shrink. We distinguished strong scaling (fixed problem, chasing speed, hard ceiling) from weak scaling (problem grows with resources, chasing size, far gentler erosion), and we learned to read efficiency above $1$ as a baseline smell rather than a victory. Every later model in this chapter is a way to predict the curve in Figure 3.1.1 from the structure of a workload rather than measuring it after the fact. The most direct of those models, the one that turns the serial fraction $f$ into the ceiling $1/f$ we watched the table approach, is Amdahl's law, and its optimistic weak-scaling counterpart is Gustafson's law; both wait in Section 3.5. First, Section 3.2 stays closer to the hardware and separates the two ways to add capacity: scaling out across more machines and scaling up within one.
Using only the numbers in Output 3.1.1, answer without rerunning the code: (a) Between which two consecutive values of $K$ does efficiency first fall below $0.5$? (b) What is the theoretical speedup ceiling $S(\infty)$ for $f = 0.05$, and how close is $K = 512$ to it as a percentage? (c) A colleague proposes jumping from 256 to 512 GPUs to "go faster". State the speedup gained, the efficiency paid, and whether the move is defensible if GPU-hours are billed linearly. Frame your answer in terms of a data-parallel training step whose serial $5\%$ is the optimizer update and the first-batch load.
Extend Code 3.1.1 to sweep three serial fractions, $f \in \{0.01, 0.05, 0.20\}$, printing one speedup-and-efficiency table per fraction over the same range of $K$. For each $f$, report the ceiling $1/f$ and the smallest $K$ at which efficiency drops below $0.5$. Then add a further column reporting, for each $K$, the largest $f$ that would still keep efficiency at or above $0.8$ at that $K$. Summarize in one sentence how sensitive the usable cluster size is to the serial fraction, and connect it to why shaving the optimizer step or the data-load barrier matters so much in practice.
A benchmark reports $T(1) = 600$ s and $T(8) = 60$ s for the same fixed training job, giving $S(8) = 10$ and $E(8) = 1.25$. (a) Explain why an efficiency above $1$ should make you suspicious rather than pleased. (b) Propose the single most likely cause from Section 5 and the one measurement you would take on the single-machine run to confirm it. (c) Predict what happens to the apparent superlinear effect as you continue to $K = 16, 32, 64$, assuming the cause is a working set that overflows one device's memory, and explain why a properly-sized $T(1)$ would have produced a sublinear curve consistent with the rest of this section.