"They added thirty-one more of me and called it a thirty-two times speedup. I did the arithmetic on the way to the barrier. It was fourteen, and I was the one waiting."
A Speedup Curve, Refusing to Be Linear Forever
A scale-out capstone is judged by four numbers reported together: how much faster the work finished (speedup), how well the added machines were actually used (efficiency), whether the model quality stayed put (the held-constant gate), and what the run cost (dollars per unit of work). Any one number alone can be made to look good while the project is failing, so the rubric demands all four, co-computed on one configuration, one seed, and one data split, in a single pass. This section consolidates the performance models of Chapter 3 and the evaluation discipline of Chapter 5 into one rubric you apply to your own project. It turns "my system scales" from a claim into a measurement that a reader can audit row by row.
By the time you reach this section you have a baseline, the single-machine time $T_1$ established in Section 41.3, and a working distributed implementation. The remaining question is whether the distribution earned its complexity, and that question only has a defensible answer if you measure it correctly. A capstone that reports "we used a cluster and it was fast" defends nothing. A capstone that reports a speedup curve against the ideal line, a parallel efficiency that you can read off at every worker count, a held-out quality column that did not move, and a cost per unit of work, defends a thesis. The four metrics are not independent decorations; they form a single verdict, and the central skill of this section is reporting them as one coherent artifact rather than four convenient fragments collected from four different runs.
1. Strong Scaling: Speedup and the Ideal Line (Beginner)
The most basic question is how much faster the same fixed job finished when you added machines. Hold the total work constant, time it on one worker to get the baseline $T_1$, then time the identical job on $p$ workers to get $T_p$. The strong-scaling speedup is the ratio
$$S(p) = \frac{T_1}{T_p}.$$
This is the same definition introduced in Chapter 3, and the word "strong" matters: the problem size is fixed, so each worker does a smaller slice as $p$ grows. The ideal is $S(p) = p$, the diagonal line where doubling the machines halves the time. Real curves bend away from that line, and how they bend is the whole story. A curve that hugs the diagonal up to eight workers and then flattens is telling you exactly where the combine cost (the all-reduce of Chapter 4, the shuffle of Chapter 6) began to dominate the useful computation. Figure 41.6.1 shows the shape every capstone speedup plot should be compared against.
2. Parallel Efficiency: Were the Machines Actually Used? (Beginner)
Speedup alone flatters large clusters. A speedup of $14$ sounds impressive until you learn it came from $32$ workers, at which point more than half of every machine sat idle. Parallel efficiency normalizes speedup by the resources spent on it:
$$E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}.$$
Efficiency lives in $[0, 1]$ (ignoring rare super-linear cache effects), and it is the number that tells you whether adding the next machine is still worth it. $E = 1$ is perfect linear scaling; $E = 0.5$ means you are paying for two machines to get the work of one. The reason efficiency decays is the same reason the speedup curve in Figure 41.6.1 bends: the fraction of time spent communicating and synchronizing grows with $p$, and that fraction is dead weight against useful computation. A capstone should report the efficiency at every measured $p$ and state a threshold (a common engineering choice is $E \geq 0.70$) below which it declines to scale further, because past that point dollars buy idle silicon rather than throughput. Figure 41.6.2 shows the characteristic decay and where that threshold cuts.
3. Weak Scaling and the Throughput-Latency Pair (Intermediate)
Strong scaling fixes the problem and asks it to finish sooner. Many real AI systems ask the opposite question: as the workload grows, can I grow the cluster to keep the time-per-unit constant? That is weak scaling, where the work per node is held fixed and both the total work and the worker count scale together. Perfect weak scaling means $T_p \approx T_1$ as you grow $p$ in step with the data; the runtime stays flat while the throughput climbs linearly. Weak scaling is the right lens for a training capstone that wants to consume a bigger corpus in the same wall-clock budget, or a data-pipeline capstone that ingests a growing stream. It is governed by Gustafson's law rather than Amdahl's, and Section 5 makes that distinction precise.
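A minimal sketch of how a weak-scaling sweep might be scored, assuming the work per worker is held fixed across rows. The timings and the per-worker work size are made-up placeholders rather than measurements from this section, and the name `weak_scaling_report` is hypothetical. Weak-scaling efficiency is taken here as $T_1 / T_p$, which equals 1 when the runtime stays flat.

```python
import numpy as np


def weak_scaling_report(p, t_p, work_per_worker):
    """Score a weak-scaling sweep where each worker keeps the same amount of work.

    p               -- worker counts; the first entry is the 1-worker baseline
    t_p             -- wall-clock seconds at each worker count
    work_per_worker -- units of work each worker processes (held fixed)
    """
    p = np.asarray(p, dtype=float)
    t_p = np.asarray(t_p, dtype=float)
    total_work = p * work_per_worker           # total work grows with p
    throughput = total_work / t_p              # units per second
    weak_eff = t_p[0] / t_p                    # 1.0 means the runtime stayed flat
    scaled_speedup = throughput / throughput[0]
    return throughput, weak_eff, scaled_speedup


# Illustrative placeholder numbers only (NOT measured results).
p = [1, 2, 4, 8]
t_p = [100.0, 104.0, 110.0, 125.0]
thr, eff, sp = weak_scaling_report(p, t_p, work_per_worker=50_000)
for pi, th, e, s in zip(p, thr, eff, sp):
    print(f"p={pi:<2d} throughput={th:9.1f}/s  weak_eff={e:.3f}  scaled_speedup={s:.2f}")
```

In this formulation the scaled speedup is simply $p$ times the weak-scaling efficiency, so a flat runtime shows up as throughput that climbs linearly with the worker count.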
Alongside scaling sit the two operational metrics every serving or pipeline capstone must report: throughput, the number of units processed per second (tokens, requests, items, gradients), and latency, the time from input to result for a single unit, usually quoted at a tail percentile such as $p99$ rather than the mean. These two trade against each other: batching more requests together lifts throughput but lengthens the latency of the requests caught in the batch, a tension developed for the serving fleet in Chapter 23 and Chapter 24. A capstone whose goal is interactive serving reports latency under a throughput floor; a capstone whose goal is batch processing reports throughput under a latency cap. Choosing which one is the constraint and which is the objective is part of the project design, and it determines which axis of Figure 41.6.1 you optimize against.
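As a rough illustration of reporting the pair together, the sketch below computes throughput and a tail latency from the same window of requests. The function name `throughput_and_tail` is hypothetical and the request times are synthetic, drawn from a seeded random generator purely so the example runs on its own.

```python
import numpy as np


def throughput_and_tail(start_s, end_s, percentile=99.0):
    """Return (units per second, tail latency in seconds) for one measured window.

    start_s, end_s -- per-request start and completion times, in seconds
    """
    start_s = np.asarray(start_s, dtype=float)
    end_s = np.asarray(end_s, dtype=float)
    latency = end_s - start_s
    window = end_s.max() - start_s.min()       # wall-clock span of the window
    throughput = len(latency) / window
    tail = float(np.percentile(latency, percentile))
    return throughput, tail


# Synthetic requests for illustration only: arrivals spread over 10 s,
# service times drawn from a seeded log-normal so the tail is visible.
rng = np.random.default_rng(0)
starts = np.sort(rng.uniform(0.0, 10.0, size=2000))
ends = starts + rng.lognormal(mean=-3.0, sigma=0.6, size=2000)

thr, p99 = throughput_and_tail(starts, ends)
mean_latency = float(np.mean(ends - starts))
print(f"throughput={thr:.1f} req/s  mean={mean_latency * 1e3:.1f} ms  p99={p99 * 1e3:.1f} ms")
```

Computing both numbers from one window is the point: a throughput taken from one load test and a p99 taken from a lighter one would not describe the same system.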
Each metric in isolation can be gamed. Speedup looks best when you pour in machines, exactly when efficiency looks worst. Throughput looks best with huge batches, exactly when latency looks worst. Cost per work looks best on one machine, exactly when wall-clock looks worst. Quality can always be traded for speed if you are willing to let accuracy slip. The verdict is the joint reading: a configuration that holds quality constant, keeps efficiency above your threshold, meets the latency or throughput constraint, and minimizes cost per unit of work, all at once. Report the tuple, never a single hero number.
4. The Quality-Held-Constant Gate (Intermediate)
This is the rule that separates a credible scale-out result from a meaningless one, and it is where Chapter 5 enters the rubric. A speedup is only a speedup if the distributed system produces the same answer quality as the baseline. If your sixteen-worker run finished four times faster but the held-out accuracy dropped two points, you did not speed up the task; you changed the task to an easier, worse one and timed that instead. The exact-gradient identity of Chapter 1 showed that data-parallel training can hold quality exactly, but many scale-out moves do not come with that guarantee: a larger global batch shifts the optimization trajectory, asynchronous updates inject staleness, aggressive gradient compression perturbs the result, and a sharded retrieval index can change which neighbors are returned. Each of those can quietly cost quality, and a speedup measured without checking quality hides the cost.
The discipline is construct-matched, single-pass evaluation: the quality metric must be co-computed in the same run that produced the timings, on the same held-out split, with the same seed, and saved as part of the same artifact. Reading a speedup from one run and an accuracy from a different, more favorable run is the canonical way to publish a number that is not real. The case-study evaluation sections enforce exactly this: the web-scale RAG study reports retrieval quality beside throughput in Section 36.8, and the agentic-applications study reports task success beside latency and cost in Section 40.9. Your capstone holds itself to the same gate.
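One way to make the single-pass requirement checkable is to store the timing and the quality in the same record and refuse to compare rows whose seed or held-out split differ. The sketch below is an assumption about how such a record could look; `RunRecord`, `construct_mismatches`, and the field names are hypothetical, and the two example rows are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One row of the rubric: timing and quality captured by the same run."""
    run_id: str      # identifier of the run that produced BOTH numbers
    workers: int
    seed: int
    split_id: str    # e.g. a name or hash of the held-out split
    seconds: float
    quality: float


def construct_mismatches(records):
    """Return the reasons the rows cannot be compared; an empty list if none."""
    problems = []
    if len({r.seed for r in records}) > 1:
        problems.append("rows use different seeds")
    if len({r.split_id for r in records}) > 1:
        problems.append("rows use different held-out splits")
    run_ids = [r.run_id for r in records]
    if len(set(run_ids)) != len(run_ids):
        problems.append("a run_id appears twice; one row per run expected")
    return problems


# Hypothetical records; the second row was evaluated on a different split.
rows = [
    RunRecord("run-a", 1, seed=7, split_id="val-v1", seconds=3600.0, quality=0.871),
    RunRecord("run-b", 8, seed=7, split_id="val-v2", seconds=560.0, quality=0.871),
]
print(construct_mismatches(rows))
```

The check does not judge whether quality moved; it only confirms that the rows are comparable in the first place, which is the precondition for the gate.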
The book's spine is that distributing the essential work across machines is the way forward, but the entire argument collapses if distribution silently degrades the result. That is why the capstone rubric refuses to score speedup without quality beside it. A defensible scale-out claim is a conjunction, not a single clause: the system is faster and uses its machines efficiently and holds model quality constant and costs less per unit of work. Drop any conjunct and the thesis is unproven. Every parallel method in Parts III through V earns its place by clearing this same joint bar; the capstone simply asks you to clear it for a system you built yourself.
5. Scalability Limits: Amdahl, Gustafson, and the Comm Wall (Advanced)
The bend in every speedup curve is not bad luck; it is a law. Amdahl's law says that if a fraction $s$ of the work is inherently serial (it cannot be parallelized: parameter updates, a final reduction, a coordination barrier), then no matter how many workers you add, the speedup is bounded:
$$S(p) \leq \frac{1}{\,s + \dfrac{1 - s}{p}\,}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{s}.$$
A mere $5\%$ serial fraction caps speedup at $20\times$ regardless of cluster size. This is the horizontal ceiling in Figure 41.6.1, and fitting $s$ from your own measured timings (the demo below does exactly this) tells you the absolute best your project could ever do and how close the current run is to that wall. Gustafson's law reframes the same physics for weak scaling: if the problem grows with the machines, the achievable scaled speedup is
$$S_{\text{weak}}(p) = p - s\,(p - 1) = s + p\,(1 - s),$$
which grows without bound in $p$ because the serial part is a shrinking fraction of an expanding job. The two laws are not in conflict; they answer different questions. Amdahl governs "finish this fixed job faster" (strong scaling), Gustafson governs "do a proportionally bigger job in the same time" (weak scaling). Beyond both sits the communication wall: even the perfectly parallel fraction does not stay free, because exchanging data between $p$ workers costs time that grows with $p$, so real curves can flatten below the Amdahl ceiling once the all-reduce or shuffle cost overtakes the per-worker computation. The $\alpha$-$\beta$ cost model of Chapter 3 is the tool that predicts where that wall stands for your topology.
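To see the three regimes side by side, the sketch below evaluates Amdahl's bound, Gustafson's scaled speedup, and a toy curve with an added communication term. The communication term $c \log_2 p$ and the value of `c` are assumptions chosen only to illustrate a wall; they are not the $\alpha$-$\beta$ model of Chapter 3 and are not fitted to any measurement.

```python
import numpy as np


def amdahl_speedup(p, s):
    """Upper bound on strong-scaling speedup with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / p)


def gustafson_speedup(p, s):
    """Scaled (weak-scaling) speedup with serial fraction s."""
    return p - s * (p - 1.0)


def speedup_with_comm(p, s, c):
    """Amdahl plus an ASSUMED communication term c*log2(p), in units of T1.

    The log2(p) shape is an illustrative stand-in (roughly a tree-style
    combine), not a model taken from this section.
    """
    return 1.0 / (s + (1.0 - s) / p + c * np.log2(p))


p = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256], dtype=float)
s = 0.05    # the 5% serial fraction used in the text
c = 0.01    # hypothetical communication cost per doubling, as a fraction of T1

a = amdahl_speedup(p, s)
g = gustafson_speedup(p, s)
w = speedup_with_comm(p, s, c)
for pi, ai, gi, wi in zip(p, a, g, w):
    print(f"p={int(pi):<4d} amdahl={ai:6.2f}  gustafson={gi:7.2f}  with_comm={wi:6.2f}")
```

With these assumed values the Amdahl column creeps toward 20 without reaching it, the Gustafson column keeps growing, and the communication curve stays below the Amdahl bound and turns down at the largest worker counts, which is the shape the text describes as flattening below the ceiling.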
6. Cost Metrics and the Quality-Throughput-Cost Triangle (Intermediate)
Wall-clock speed is only half of why anyone scales out; the other half is money. Two workers that finish in half the time at double the hourly rate cost the same dollars as one, so a faster run is not automatically a cheaper one. The cost metrics make this explicit. Let $W$ be the total useful work (items scored, tokens generated, examples trained on) and let $C$ be the dollars the run cost (worker-hours times rate, plus storage and egress). Then
$$\text{cost per unit work} = \frac{C}{W}, \qquad \text{cost efficiency} = \frac{W}{C}.$$
Cost per work is the number a capstone optimizes when the budget binds; cost efficiency (work per dollar) is its reciprocal, convenient when you want bigger-is-better. Crucially, the cheapest configuration is rarely the fastest one. Strong scaling buys wall-clock by spending efficiency, and lost efficiency is wasted dollars, so the cost-per-work curve usually rises as you add workers even while the runtime falls. The capstone therefore optimizes a point, not an extreme: the configuration that meets the deadline at the lowest cost, or delivers the most work per dollar within the latency budget. This is the quality-throughput-cost triangle in Figure 41.6.3, where you may push hard on any two corners only by relaxing the third, and where the quality corner is non-negotiable because of the gate in Section 4.
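The "lowest cost that still meets the deadline" rule can be sketched in a few lines. The timings and rental rates are the ones used in the strong-scaling table of Section 7; the 600-second deadline and the function name `cheapest_within_deadline` are hypothetical, and storage and egress charges are left out for brevity.

```python
import numpy as np

# Timings and rates copied from this section's strong-scaling table.
p = np.array([1, 2, 4, 8, 16, 32])
t_p = np.array([3600.0, 1880.0, 1010.0, 560.0, 340.0, 250.0])   # seconds
rate = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])               # $/hour


def cheapest_within_deadline(p, t_p, rate, deadline_s):
    """Return (workers, dollars) for the lowest-cost row with t_p <= deadline_s.

    Returns None when no row meets the deadline.
    """
    dollars = rate * t_p / 3600.0              # $/hour * hours
    feasible = t_p <= deadline_s
    if not feasible.any():
        return None
    idx = np.flatnonzero(feasible)
    best = idx[np.argmin(dollars[idx])]
    return int(p[best]), float(dollars[best])


# The 600-second deadline is a hypothetical constraint chosen for illustration.
print(cheapest_within_deadline(p, t_p, rate, deadline_s=600.0))
print(cheapest_within_deadline(p, t_p, rate, deadline_s=100.0))
```

Under the assumed 600-second deadline the sketch selects eight workers at roughly 1.24 dollars for the run, and it returns `None` for a 100-second deadline that no measured configuration meets.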
7. The Rubric in One Pass (Intermediate)
Everything above becomes a single report. From a table of measured $(p, T_p)$ timings (with the rental rate and the co-computed quality for each row), one pass produces every rubric number: speedup $S = T_1/T_p$, efficiency $E = S/p$, a fitted Amdahl serial fraction $s$ and its ceiling $1/s$, the quality-held-constant check, and the cost columns $C/W$ and $W/C$. Computing them together, from one set of measurements, is the construct-matched discipline made concrete: every number in the report comes from the same configuration, so the comparison between rows is valid. The code below takes a strong-scaling table and prints the full rubric.
import numpy as np
# A measured strong-scaling run: same total work, same model, same quality,
# timed at p = 1, 2, 4, 8, 16, 32 workers. T1 is the baseline from Section 41.3.
p = np.array([1, 2, 4, 8, 16, 32], dtype=float)
Tp = np.array([3600.0, 1880.0, 1010.0, 560.0, 340.0, 250.0]) # seconds
rate = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0]) # $/hour, ~linear in p
# Quality co-computed in the SAME pass, on the SAME held-out split and seed.
# A scale-out win is only a win if this column does not move.
quality = np.array([0.871, 0.871, 0.871, 0.871, 0.870, 0.871]) # held-out accuracy
T1 = Tp[0]
S = T1 / Tp # speedup S = T1 / Tp
E = S / p # parallel efficiency E = S / p
W = 1_200_000.0 # fixed total work (items), strong scaling
cost_dollars = rate * (Tp / 3600.0) # $/hour * hours = dollars for the run
cost_per_work = cost_dollars / W # C / W
cost_efficiency = W / cost_dollars # W / C
# Fit Amdahl's serial fraction s by least squares on (1/S - 1/p) = s*(1 - 1/p).
x = 1.0 - 1.0 / p
yv = 1.0 / S - 1.0 / p
s_hat = float(np.sum(x * yv) / np.sum(x * x))
S_amdahl = 1.0 / (s_hat + (1.0 - s_hat) / p)
S_max = 1.0 / s_hat # asymptotic ceiling as p -> infinity
q0 = quality[0]
quality_held = bool(np.max(np.abs(quality - q0)) <= 0.005) # within tolerance
print("p       Tp(s) speedup   eff quality $/run    $/work    work/$")
for i in range(len(p)):
    print(f"{int(p[i]):<5d} {Tp[i]:7.0f} {S[i]:7.2f} {E[i]:5.3f} {quality[i]:7.3f} "
          f"{cost_dollars[i]:5.2f} {cost_per_work[i]:.3e} {cost_efficiency[i]:9.0f}")
print()
print(f"Amdahl serial fraction s : {s_hat:.4f}")
print(f"Amdahl speedup ceiling 1/s : {S_max:.1f}x")
print(f"speedup at p=32 vs Amdahl : measured {S[-1]:.2f}x, model {S_amdahl[-1]:.2f}x")
print(f"quality held constant : {quality_held} (max drift {np.max(np.abs(quality - q0)):.3f})")
knee = int(p[np.argmin(cost_per_work)])
print(f"cheapest config ($/work) : p = {knee}")
print(f"efficiency>=0.70 holds up to: p = {int(p[E >= 0.70][-1])}")
p       Tp(s) speedup   eff quality $/run    $/work    work/$
1        3600    1.00 1.000   0.871  1.00 8.333e-07   1200000
2        1880    1.91 0.957   0.871  1.04 8.704e-07   1148936
4        1010    3.56 0.891   0.871  1.12 9.352e-07   1069307
8         560    6.43 0.804   0.871  1.24 1.037e-06    964286
16        340   10.59 0.662   0.870  1.51 1.259e-06    794118
32        250   14.40 0.450   0.871  2.22 1.852e-06    540000
Amdahl serial fraction s : 0.0376
Amdahl speedup ceiling 1/s : 26.6x
speedup at p=32 vs Amdahl : measured 14.40x, model 14.77x
quality held constant : True (max drift 0.001)
cheapest config ($/work) : p = 1
efficiency>=0.70 holds up to: p = 8
Read the verdict the way an examiner will. The bare speedup of $14.4\times$ at thirty-two workers is the tempting hero number, and Output 41.6.1 shows precisely why it is misleading: efficiency at that point is $0.45$, meaning more than half the cluster is idle, and cost per work is more than double the single-machine figure. The honest recommendation is eight workers, where the project is $6.4\times$ faster, still using $80\%$ of every machine, with quality untouched. That is a defensible thesis, and it came from reading all four metrics together rather than quoting the one that flattered the system.
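The serial-fraction fit in the code rests on a linearization that appears there only as a comment. A short derivation, assuming Amdahl's bound holds with equality at each measured point:

$$
\begin{aligned}
\frac{1}{S(p)} &= s + \frac{1 - s}{p}, \\
\frac{1}{S(p)} - \frac{1}{p} &= s\left(1 - \frac{1}{p}\right).
\end{aligned}
$$

Writing $x_i = 1 - 1/p_i$ and $y_i = 1/S_i - 1/p_i$ for each measured row, the model is the line $y = s\,x$ through the origin, and minimizing $\sum_i (y_i - s\,x_i)^2$ gives

$$\hat{s} = \frac{\sum_i x_i\, y_i}{\sum_i x_i^2}.$$

Because this model has no separate communication term, any communication overhead in the timings is absorbed into $\hat{s}$, so the fitted value is best read as an effective serial fraction rather than a pure one.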
Code 41.6.1 assumes you already have a clean $(p, T_p)$ table. Collecting it by hand with time.perf_counter() is error-prone because you must exclude warm-up, isolate the work region, and match every timing to its quality measurement. PyTorch's profiler records the time spent in each labeled region, so wrapping the step in a context manager replaces most of the manual stopwatch code, and evaluating inside the same script keeps the timing and the quality measurement in the same run:
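import torch
from torch.profiler import ProfilerActivity, profile, record_function

# step, batch, evaluate, model, and val_split come from the capstone's own code.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("train_step"):
        loss = step(batch)            # the timed work region
        torch.cuda.synchronize()      # wait for queued kernels so the region covers the GPU work
    acc = evaluate(model, val_split)  # quality, SAME run, SAME split, SAME seed

# Read only the labeled region; the profiler reports times in microseconds.
region = next(e for e in prof.key_averages() if e.key == "train_step")
t_p = region.cpu_time_total / 1e6     # seconds for this p
# pair (p, t_p, acc) and feed the table to Code 41.6.1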
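For comparison, the manual stopwatch route mentioned above can be sketched with the standard library alone. The helper name `time_region` and the stand-in workload `toy_work` are hypothetical; a GPU workload would also need a device synchronization inside the timed call, which this CPU-only sketch omits.

```python
import statistics
import time


def time_region(fn, warmup=2, repeats=5):
    """Time fn() with time.perf_counter, discarding warm-up calls.

    Returns (median_seconds, all_measured_seconds).
    """
    for _ in range(warmup):
        fn()                                   # warm caches and allocators; not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


def toy_work():
    """Hypothetical stand-in for the real work region."""
    return sum(i * i for i in range(50_000))


median_s, samples = time_region(toy_work)
print(f"median={median_s:.6f}s over {len(samples)} timed calls")
```

Reporting the median of several repeats, rather than a single reading, is one common way to damp scheduling noise; whichever statistic is chosen should be applied identically at every worker count.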
Who: A graduate student presenting a distributed image-classification training capstone.
Situation: The slide deck led with "32x cluster, 14x faster than one GPU" in bold on the title slide.
Problem: The reviewer asked two questions: what was the efficiency, and did the validation accuracy hold. The student had measured neither in the same run as the timings.
Dilemma: Re-run everything under the rubric and risk the number shrinking, or defend the $14\times$ figure as-is and hope the questions stopped.
Decision: They re-ran the full sweep once, co-computing speedup, efficiency, and held-out accuracy per worker count in a single pass, exactly as Code 41.6.1 does.
How: The single pass revealed efficiency falling to $0.45$ at thirty-two workers and a half-point accuracy dip caused by the oversized global batch, which a learning-rate warmup then repaired.
Result: The revised thesis recommended eight workers at $6.4\times$ speedup, $0.80$ efficiency, and unchanged accuracy, a quieter headline that survived every follow-up question.
Lesson: The hero number is the one that holds up under all four metrics jointly. A speedup quoted without efficiency and a quality gate is a number waiting to be retracted in the question period.
8. From Metrics to the Written Analysis (Beginner)
The rubric produces numbers; the capstone must turn them into an argument. Output 41.6.1 is not the deliverable, it is the evidence for the deliverable, which is a written claim of the form "this system, on this workload, scales to $p$ workers at efficiency $E$ with quality held at $q$ and cost $C/W$, and beyond that point the communication wall makes further scaling uneconomic." Section 41.7 takes exactly these rubric outputs and develops the analysis, the speedup and efficiency plots, the Amdahl fit, and the cost curve, into the results narrative your capstone defends. The metrics here are the input to that section; the discipline of co-computing them in one pass is what makes the analysis in Section 41.7 trustworthy rather than a collage of favorable fragments.
The cost corner of the triangle is broadening beyond dollars. A growing body of work argues that a scale-out result should report energy and carbon alongside wall-clock and cost, because a faster run on more accelerators can emit more even when it costs less. Tools in the lineage of CodeCarbon and the experiment-impact tracker, and reporting frameworks following the energy-and-policy analyses of Strubell et al. and the systematic measurements of Patterson et al. (2021 to 2022), make joules-per-token and grams-of-CO2-per-training-run first-class metrics co-computed with the timings. The MLPerf benchmark suites have added power measurement to their reporting, pushing energy efficiency toward the same construct-matched, same-run discipline this section demands of speedup and quality. A forward-looking capstone adds an energy column to the rubric of Code 41.6.1 and treats it as a fifth conjunct in the verdict.
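A sketch of what an energy column could look like when added to the table from Code 41.6.1. The timings and the work total are copied from that listing, but the power figure is an outright assumption: a flat 300 W average draw per worker, standing in for whatever a real capstone would measure in each run.

```python
import numpy as np

p = np.array([1, 2, 4, 8, 16, 32], dtype=float)
t_p = np.array([3600.0, 1880.0, 1010.0, 560.0, 340.0, 250.0])  # seconds, from Code 41.6.1
work = 1_200_000.0                                              # items, from Code 41.6.1

# ASSUMPTION: a flat 300 W average draw per worker. A real capstone would
# replace this with power measured in the same run as the timings.
avg_watts_per_worker = 300.0

joules = p * avg_watts_per_worker * t_p        # W * s = J for the whole run
joules_per_item = joules / work
kwh = joules / 3.6e6                           # 1 kWh = 3.6e6 J

for pi, j, e in zip(p, joules_per_item, kwh):
    print(f"p={int(pi):<3d} J/item={j:.3f}  kWh/run={e:.3f}")
```

Under a flat per-worker power assumption the energy per item is proportional to the cost per work, so this column adds information only once the power is actually measured, for example when idle workers draw less than busy ones.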
A perennial pattern in capstone reviews: the bigger the cluster on the title slide, the more certain the reviewer is to ask for the efficiency number. A "1024-GPU" headline with no efficiency column is read by experienced examiners not as a boast but as a confession. The safest flex is a small cluster at high efficiency with quality nailed down. Nobody has ever lost points for reporting plainly that eight machines were enough.
Using the numbers in Output 41.6.1, explain in your own words why a capstone should recommend the eight-worker configuration rather than the thirty-two-worker one, even though the latter has the larger speedup. State precisely what the efficiency of $0.45$ means in dollars, and connect it to the rising cost-per-work column. Then argue why the quality column being flat is a precondition for the comparison to be meaningful at all, citing the gate in Section 4.
Extend Code 41.6.1 so that any row whose quality drifts more than a tolerance (say $0.005$) from the $p=1$ baseline is flagged and excluded from the cost and speedup ranking, because a faster-but-worse configuration is not a valid scale-out point. Then re-run with a deliberately degraded row (for example set the $p=32$ accuracy to $0.860$) and show that the rubric now refuses to recommend it regardless of its speedup. Print a one-line verdict that names the largest worker count which both clears the quality gate and keeps $E \geq 0.70$.
From the fitted serial fraction $s = 0.0376$ in Output 41.6.1, compute the Amdahl speedup ceiling $1/s$ and the worker count at which the model predicts efficiency would fall to $0.50$. Compare the measured speedup at $p = 32$ to the Amdahl prediction and explain the gap in terms of the communication wall from Section 5: which effect, the serial fraction or the growing communication cost, accounts for more of the shortfall from the ideal $32\times$? State what additional measurement you would need to separate the two, and reference the $\alpha$-$\beta$ cost model of Chapter 3.