Part I: Foundations of Distributed AI
Chapter 3: Scalability and Performance Models

Amdahl's Law and Gustafson's Law

"They added a thousand of me to the cluster, and I sped up by a factor of nine. The other nine hundred and ninety-one of us spent the day waiting on a barrier."

A Worker That Hit the Serial Fraction
Big Picture

Two short formulas decide whether adding machines to an AI workload will help, and they disagree on purpose because they answer two different questions. Amdahl's law fixes the problem size and asks how fast a fixed job can be made by adding workers; its answer is a hard ceiling set by the part of the work that cannot be parallelized. Gustafson's law lets the problem grow with the machine count and asks how much more work the same time budget can absorb; its answer is an almost straight line that keeps rising. For distributed training the non-parallelizable part has a concrete identity: it is the communication and synchronization that every worker must perform no matter how the data is split. Amdahl explains why throwing more GPUs at a fixed model eventually stops helping, and Gustafson explains why training a bigger model on more data still scales. Holding both laws in mind at once is what lets you size a cluster instead of guessing at it.

The previous section measured speedup and efficiency empirically: you run a job on $K$ workers, divide the single-worker time by the $K$-worker time, and read off how close you came to a factor of $K$. That gives you a number after the fact. This section gives you the two models that predict the number in advance, from a single property of the workload, the fraction of it that refuses to run in parallel. Both models start from the same decomposition of the work and reach opposite-sounding conclusions, and the apparent contradiction dissolves once you see that they hold different things fixed. We derive each in full, then read both through the lens that matters for this book: in distributed AI, the serial fraction is rarely idle code waiting its turn; it is the network traffic and the synchronization barriers that scale-out introduces and that Section 3.8 turns into a cost model you can compute.

worker count K (problem size grows with K for Gustafson) speedup S(K) Amdahl ceiling: S -> 1/s ideal linear S = K Gustafson: S = K - s(K-1) Amdahl: S = 1 / (s + (1-s)/K) 1
Figure 3.5.1: The two laws on one pair of axes. Amdahl's law (orange) holds the problem fixed; its speedup bends over and presses against the horizontal ceiling $1/s$, so beyond some worker count more machines buy almost nothing. Gustafson's law (green) lets the problem grow with $K$; its speedup stays close to the ideal diagonal $S=K$ (gray dashed) because the added workers are filled with new parallel work rather than competing for a fixed amount. The serial fraction $s$ is the same quantity in both; the laws differ only in what they hold constant.

1. The Fixed-Size World: Amdahl's Law Intermediate

Take a single fixed job and split its total work into two parts by how it responds to parallelism. A fraction $s$ of the work is inherently serial: it must run on one worker, in sequence, and adding machines cannot speed it up. The remaining fraction $1-s$ is perfectly parallelizable: given $K$ workers it runs in $1/K$ of the time it took on one. Normalize the single-worker runtime to $1$, so the serial part costs $s$ and the parallel part costs $1-s$. On $K$ workers the total time is the serial part, unchanged, plus the parallel part divided across the workers,

$$T(K) = s + \frac{1 - s}{K}.$$

Speedup is the single-worker time over the $K$-worker time, and with the single-worker time normalized to $1$ that is simply $1/T(K)$,

$$S(K) = \frac{T(1)}{T(K)} = \frac{1}{\,s + \dfrac{1 - s}{K}\,}.$$

This is Amdahl's law. Its decisive feature appears when you let the worker count run away to infinity. The parallel term $\tfrac{1-s}{K}$ vanishes, the serial term $s$ does not, and the speedup runs into a wall,

$$\lim_{K \to \infty} S(K) = \frac{1}{s}.$$

No matter how many machines you add, a job that is two percent serial cannot run more than fifty times faster, and a job that is ten percent serial cannot beat a factor of ten. The serial fraction, however small, sets a ceiling that all the hardware in the world cannot lift. This is the most sobering arithmetic in parallel computing: the cost of the part you cannot parallelize comes to dominate, and it does so quickly.

Key Insight: A Small Serial Fraction Caps a Large Cluster

Amdahl's ceiling $1/s$ depends only on the serial fraction, never on the worker count. A workload that is just five percent serial is capped at a twentyfold speedup, so the thousandth GPU you add to it is doing the work of a rounding error. The practical reading is brutal and useful: before you buy more machines for a fixed job, estimate $s$, because $1/s$ is the most speedup those machines could ever deliver, and you are usually closer to it than you think.

2. The Scaled-Size World: Gustafson's Law Intermediate

Amdahl's law quietly assumes the problem stays the same size as you add workers. In practice that is rarely why people build big clusters. You do not buy a thousand machines to run yesterday's small job a thousand times faster; you buy them to run a job a thousand times bigger in the same wall-clock time. Gustafson's insight is to fix the runtime on the parallel machine and ask how much work fits inside it, letting the parallel portion grow with the worker count while the serial portion stays put.

Set the runtime on $K$ workers to $1$, split as before into a serial fraction $s$ and a parallel fraction $1-s$ of that observed time. The serial part is $s$ and does not change with $K$. The parallel part is $1-s$ as it ran on $K$ workers, so doing the same parallel work on a single worker would take $K$ times as long, namely $K(1-s)$. The hypothetical single-worker time for this scaled problem is therefore the serial part plus the inflated parallel part, and the speedup is that single-worker time divided by the $K$-worker time of $1$,

$$S(K) = \frac{s + K(1 - s)}{1} = s + K(1-s) = K - s(K - 1).$$

This is Gustafson's law. It is a straight line in $K$ with slope $1-s$, offset by the constant serial cost. Where Amdahl curved over and flattened, Gustafson keeps climbing: every worker you add is handed a fresh slice of parallel work, so the speedup grows almost linearly and never hits a ceiling. The same serial fraction that doomed the fixed job becomes a fixed, shrinking-in-relative-terms overhead once the parallel work is allowed to expand around it.

Key Insight: The Two Laws Are the Same Algebra Seen From Two Chairs

Amdahl fixes the work and watches time shrink toward a floor; Gustafson fixes the time and watches the work grow without bound. Neither is wrong and neither overrides the other. The question you actually face decides which applies: "make this exact job faster" is Amdahl, and you will meet a ceiling; "do more in the same time" is Gustafson, and you will scale. Most AI scaling stories are Gustafson stories wearing Amdahl's clothes, which is why bigger models on more data keep paying off while re-running one fixed benchmark on more GPUs stops paying off early.

3. Reading the Serial Fraction as Communication Advanced

In classical Amdahl examples the serial fraction is some bookkeeping that genuinely runs on one core: parsing a header, a sequential dependency, a critical section. In distributed AI the serial fraction has a sharper and more important identity. The compute, the forward and backward passes over a data shard, parallelizes beautifully; we proved its exactness in Section 1.1, where eight workers reproduced the single-machine gradient to floating-point rounding. What does not parallelize away is the step that follows: every worker must exchange and synchronize its partial result before the next iteration. That all-reduce, and the barrier where fast workers wait for slow ones, is work that does not shrink when you add machines; if anything it grows. It is the serial fraction of a data-parallel training step.

This reframing turns Amdahl's law from a dusty result into the central warning of distributed training. Let the per-step compute time be $T_{\text{comp}}$ and the communication time be $T_{\text{comm}}$. The serial fraction of one step is

$$s = \frac{T_{\text{comm}}}{T_{\text{comp}} + T_{\text{comm}}},$$

and Amdahl's ceiling on the achievable speedup is $1/s = 1 + T_{\text{comp}}/T_{\text{comm}}$. As you add workers, two things happen at once: each worker's compute time $T_{\text{comp}}$ shrinks because its shard is smaller, while the communication time $T_{\text{comm}}$ to combine a gradient of fixed size across more participants holds steady or rises. The serial fraction $s$ therefore climbs with the worker count, the ceiling $1/s$ drops, and you reach the regime where another GPU spends more time talking than computing. That is the precise, quantitative reason that throwing GPUs at a fixed model eventually stops helping, and it is why the communication-cost model of Section 3.8 and the communication-avoiding methods of Chapter 10 exist: they attack $T_{\text{comm}}$ directly to push the ceiling back up.

Gustafson's law is the reason scale-out has not collapsed under this warning. The way out of Amdahl's ceiling is not to run a fixed model on ever more workers but to grow the model and the data alongside the cluster. A larger model and larger batches increase $T_{\text{comp}}$, which drives the serial fraction $s$ back down for a fixed communication cost, and the bigger problem keeps the added workers usefully busy. This is the Gustafson regime, and it is exactly how foundation-model training scales: the problem grows to match the machine. The two laws, taken together, give the operating rule of distributed deep learning. Do not push a fixed problem past its Amdahl ceiling; instead grow the problem so you stay in Gustafson's near-linear region.

Thesis Thread: The Serial Fraction Is the Communication Tax

This book's recurring claim is that scale-out is paid for in communication, and Amdahl's law is where that claim becomes a number. The serial fraction $s$ of a distributed training step is the all-reduce introduced in Chapter 4 and made exact in Section 1.1; the ceiling $1/s$ it imposes is the same tax, now expressed as a hard limit on speedup. Every later technique that overlaps communication with computation, compresses gradients, or lets workers take local steps is, in the language of this section, an effort to lower $s$ and lift the ceiling. When you meet those methods in Chapter 10, read each as a move in this one equation.

Research Frontier: Lowering the Serial Fraction (2024 to 2026)

Because the Amdahl ceiling on a fixed model is set entirely by the communication share $s$, an active research line attacks that share directly. Local-update schemes let workers run several optimizer steps between synchronizations, cutting how often the all-reduce fires; the DiLoCo line (Douillard et al., 2024) and its streaming successor (Douillard et al., 2025) push this far enough to train across data centers and over the open internet, where the latency that inflates the communication time is brutal. In parallel, communication-and-computation overlap has been driven into the compiler and kernel layer: production stacks decompose collectives so that all-reduce hides almost entirely behind the backward pass, and fused-collective work reported through 2024 and 2025 shrinks the exposed communication toward zero on a fat interconnect. Gradient compression in the PowerSGD and low-bit lineage reduces the bytes per step instead. Each of these is, in the vocabulary of this section, a way to drive $s$ down and lift the $1/s$ ceiling; Chapter 10 gives the machinery to compare them on equal footing.

4. Both Laws on One Plot Intermediate

The contrast is easiest to trust when you compute it. The program below evaluates both laws over a range of worker counts for three serial fractions and prints the speedup tables side by side, then reports each Amdahl ceiling and how much of it a thousand-worker run has already consumed. The serial fractions are chosen to span the regimes a distributed-training engineer actually meets: a heavily communication-bound step ($s=0.5$), a moderate one ($s=0.1$), and a well-overlapped one ($s=0.02$).

def amdahl(s, K):
    # fixed problem: time = serial part + parallel part spread over K workers
    return 1.0 / (s + (1.0 - s) / K)

def gustafson(s, K):
    # scaled problem: parallel work grows with K, serial part stays fixed
    return K - s * (K - 1)

serials = [0.50, 0.10, 0.02]                 # serial (communication) fractions
Ks = [1, 2, 4, 8, 16, 64, 256, 1024]         # worker counts

print(f"{'K':>6} | " + " | ".join(f"Amd s={s:<4}" for s in serials)
      + " || " + " | ".join(f"Gus s={s:<4}" for s in serials))
print("-" * 96)
for K in Ks:
    amd = " | ".join(f"{amdahl(s, K):8.2f}" for s in serials)
    gus = " | ".join(f"{gustafson(s, K):8.2f}" for s in serials)
    print(f"{K:>6} | {amd} || {gus}")

print()
print("Amdahl ceilings (K -> infinity), speedup bounded by 1/s:")
for s in serials:
    print(f"  s = {s:<5}  ->  max speedup = {1.0/s:7.2f}x")

print()
print("At K = 1024, fraction of Amdahl's ceiling already reached:")
for s in serials:
    print(f"  s = {s:<5}  ->  {amdahl(s,1024)/(1.0/s)*100:5.1f}% of the {1.0/s:.0f}x ceiling")
Code 3.5.1: Amdahl and Gustafson speedup tabulated over a worker range for three serial fractions, with the Amdahl ceiling $1/s$ and the fraction of it consumed at $K=1024$. The two functions are one-liners; the lesson is entirely in how their columns diverge as $K$ grows.
     K | Amd s=0.5  | Amd s=0.1  | Amd s=0.02 || Gus s=0.5  | Gus s=0.1  | Gus s=0.02
------------------------------------------------------------------------------------------------
     1 |     1.00 |     1.00 |     1.00 ||     1.00 |     1.00 |     1.00
     2 |     1.33 |     1.82 |     1.96 ||     1.50 |     1.90 |     1.98
     4 |     1.60 |     3.08 |     3.77 ||     2.50 |     3.70 |     3.94
     8 |     1.78 |     4.71 |     7.02 ||     4.50 |     7.30 |     7.86
    16 |     1.88 |     6.40 |    12.31 ||     8.50 |    14.50 |    15.70
    64 |     1.97 |     8.77 |    28.32 ||    32.50 |    57.70 |    62.74
   256 |     1.99 |     9.66 |    41.97 ||   128.50 |   230.50 |   250.90
  1024 |     2.00 |     9.91 |    47.72 ||   512.50 |   921.70 |  1003.54

Amdahl ceilings (K -> infinity), speedup bounded by 1/s:
  s = 0.5    ->  max speedup =    2.00x
  s = 0.1    ->  max speedup =   10.00x
  s = 0.02   ->  max speedup =   50.00x

At K = 1024, fraction of Amdahl's ceiling already reached:
  s = 0.5    ->   99.9% of the 2x ceiling
  s = 0.1    ->   99.1% of the 10x ceiling
  s = 0.02   ->   95.4% of the 50x ceiling
Output 3.5.1: Real output from the program above. The Amdahl columns saturate: at $s=0.1$, going from 64 to 1024 workers (a sixteenfold increase) lifts speedup from $8.77$ to only $9.91$, almost all of the $10\times$ ceiling spent. The Gustafson columns keep rising nearly linearly, reaching $921.70$ at $K=1024$ for the same $s=0.1$, because the workload grew with the cluster.

The two halves of the table tell the whole story of this section in numbers. Read down an Amdahl column and the speedup crawls toward $1/s$ and stalls; the last bottom row confirms that a thousand workers have already burned 95 to 99 percent of every ceiling, so the next thousand are nearly wasted. Read down the matching Gustafson column and the speedup tracks the worker count almost exactly. Same serial fraction, same algebra, opposite outcome, decided entirely by whether the problem was allowed to grow.

Fun Note: Amdahl Lost the Argument He Was Right About

Gene Amdahl presented his law in 1967 as an argument against parallel computers: if real workloads carry a stubborn serial fraction, massively parallel machines were a dead end. The algebra was impeccable and the conclusion did not age well, because John Gustafson pointed out in 1988 that nobody runs a fixed problem on a bigger machine; they run a bigger problem. Both men were right about their own question. The lasting joke is that the law most often cited to explain why parallelism works is the one its author wrote to argue that it would not.

5. Using the Laws to Size a Cluster Intermediate

The laws are not just descriptive; they are a sizing tool. If you can measure the serial fraction of one step, by timing the compute and the communication separately as Section 3.4 describes, you can predict the speedup at any worker count before renting the workers. Suppose a profiling run on a single node shows that a training step spends 180 milliseconds in compute and 20 milliseconds in the gradient all-reduce. The serial fraction is $s = 20/200 = 0.10$, so Amdahl caps this fixed model at a tenfold speedup. Output 3.5.1 then tells you that 16 workers already deliver $6.4\times$, two thirds of the ceiling, and that scaling to 1024 workers would deliver only $9.9\times$, barely better, at sixty-four times the cost. The decision is no longer a matter of taste: for this fixed model, somewhere around 16 to 32 workers is the sensible stopping point, and beyond it you are paying for idle silicon.

What if tenfold is not enough? Amdahl says you cannot get there by adding workers to this model, so you change the problem instead. Either lower $s$ by attacking the 20 milliseconds of communication, the route of Chapter 10, or follow Gustafson and grow the model and batch so that compute rises and the same 20 milliseconds becomes a smaller slice. Both moves push you back into the near-linear region. This is the discipline the chapter has been building toward: measure $s$, compute the ceiling, and let the two laws tell you whether to buy machines, cut communication, or grow the problem.

Library Shortcut: Let the Profiler Measure the Serial Fraction

You do not estimate $s$ by hand on a real model; you read it off a profiler that already separates compute from communication. PyTorch's profiler tags NCCL collective time distinctly from kernel compute time, so the serial fraction falls out of one trace:

# Run inside a normal DDP training step to time compute vs. communication.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(batch).loss
    loss.backward()          # DDP fires the gradient all-reduce here
    optimizer.step()

# Group events; NCCL collectives (nccl:all_reduce) are the communication time,
# the rest is compute. Their ratio is the serial fraction s of Amdahl's law.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=12))
Code 3.5.2: One profiler block replaces the manual timing of Section 3.4: the NCCL collective rows are $T_{\text{comm}}$, everything else is $T_{\text{comp}}$, and their ratio is the $s$ that Code 3.5.1 turns into an Amdahl ceiling. The framework handles the instrumentation, kernel attribution, and CUDA-stream timing you would otherwise hand-roll.
Practical Example: The 256-GPU Run That Should Have Been 32

Who: An ML platform engineer supporting a research team training a mid-size vision transformer.

Situation: The team requested a 256-GPU reservation to "train faster," having seen a near-perfect speedup on an earlier 8-GPU run and assuming it would continue.

Problem: At 256 GPUs the per-step throughput was only about three times the 8-GPU number, not the thirty-two times the team expected, and the reservation was the most expensive on the cluster.

Dilemma: Grant the reservation and let an Amdahl ceiling waste most of it, push back and risk looking obstructive, or change what the run actually does.

Decision: The engineer profiled one step (the Code 3.5.2 pattern), found communication was 22 percent of step time at 256 GPUs, computed a serial fraction near $s=0.22$ and a ceiling of about $4.5\times$, and brought the numbers to the team.

How: Two changes followed. The fixed-model job was cut to 32 GPUs, where it already reached most of its ceiling; and for the runs that genuinely needed scale, the batch size and model width were grown (the Gustafson move) so the 256 GPUs stayed busy with real work.

Result: The fixed run finished at the same speed on 32 GPUs as on 256, freeing 224 GPUs, and the scaled-up run used the full reservation at near-linear efficiency, exactly as Output 3.5.1 predicts for a problem grown to match the cluster.

Lesson: A perfect small-scale speedup says nothing about the ceiling. Measure $s$, compute $1/s$, and decide whether the right answer is fewer machines or a bigger problem.

6. What the Laws Leave Out Advanced

Both laws are deliberately simple, and their simplicity hides assumptions worth naming so you do not over-trust the numbers. They model the parallel part as perfectly divisible and the workers as identical, so they ignore load imbalance and stragglers, the slow worker that holds up the barrier; Chapter 2 takes those up as coordination problems. They treat communication as either fully serial or fully parallel, when in practice good systems overlap it with computation so part of $T_{\text{comm}}$ hides behind $T_{\text{comp}}$; that overlap effectively lowers $s$ and is a major theme of later parallel-training chapters. And they say nothing about whether more workers change the statistical quality of the result, which for SGD with very large batches absolutely happens. The laws bound the speedup of the computation; they do not promise that a faster computation reaches the same accuracy.

Those caveats do not weaken the laws; they tell you where to point your engineering. Amdahl says find and shrink the serial fraction. Gustafson says grow the problem to keep the workers fed. The richer models that account for overlap, imbalance, and the full alpha-beta structure of communication are the subject of the rest of this chapter, and they all reduce, in the limit, to these two formulas. With the speedup ceiling now expressed in terms of a serial fraction, the next section refines the accounting by separating the parallel work itself into how much there is and how long its longest dependency chain runs, the work-and-depth view that Section 3.6 develops.

Exercise 3.5.1: Where the Two Laws Cross Conceptual

Using $S_{\text{Amdahl}}(K) = 1/(s + (1-s)/K)$ and $S_{\text{Gustafson}}(K) = K - s(K-1)$, explain in words why the two speedups are equal at $K=1$ for any $s$, and why Gustafson is at or above Amdahl for every $K \ge 1$. Then argue, without algebra, what it would mean physically if a real benchmark reported a speedup above its Gustafson line: which assumption of the model must have been violated, and name one mechanism (hint: think about caches or memory per worker) that produces such super-linear speedup in practice.

Exercise 3.5.2: Profile, Predict, Decide Analysis

A profiled training step spends 240 ms in compute and 60 ms in the gradient all-reduce on a single 8-GPU node, and you observe that communication time stays roughly constant as you add nodes while compute time scales down with the number of GPUs. Compute the serial fraction $s$ and the Amdahl ceiling $1/s$ for this fixed model. Estimate the speedup at 8, 64, and 512 GPUs, and state the smallest GPU count that already reaches 90 percent of the ceiling. Then describe the single change to either compute or communication that would most raise the ceiling, and justify it from the formula for $s$ in Section 3.

Exercise 3.5.3: Make the Plot Coding

Extend Code 3.5.1 so that, instead of a fixed serial fraction, $s$ grows with the worker count to model a fixed communication cost over a shrinking compute share: set $T_{\text{comm}} = 20$ ms constant and $T_{\text{comp}} = 1800/K$ ms, then compute $s(K) = T_{\text{comm}}/(T_{\text{comp}} + T_{\text{comm}})$ and feed it into Amdahl's formula. Tabulate the resulting speedup for $K \in \{1, 8, 64, 256, 1024\}$ and identify the worker count at which adding more GPUs first makes the step slower in wall-clock terms rather than faster. Explain why this "effective" Amdahl curve can actually turn over and decline, something the fixed-$s$ curve never does.