"They handed me sixty-three new colleagues and asked me to finish the same job sixty-four times faster. I finished it eight times faster and called that a triumph."
A Worker Measuring Its Own Efficiency
There are two completely different questions hiding inside "does it scale?", and confusing them is the most common way to misjudge a distributed system. The first question fixes the total amount of work and asks whether adding machines makes that fixed job finish faster; this is strong scaling, and it is hard, because the serial and communication parts of the job do not shrink when you add workers. The second question fixes the work per machine and asks whether a proportionally larger job still finishes in about the same time; this is weak scaling, and it is far more forgiving, because each worker keeps a constant slice of useful work. Almost every scaling decision in AI is secretly a choice between these two regimes: training with a fixed global batch is strong scaling, while growing the global batch as you add GPUs is weak scaling. This section defines both precisely, gives the efficiency metric for each, and shows with a runnable model why one curve collapses while the other stays flat.
In Section 3.2 we separated scaling up (a bigger machine) from scaling out (more machines) and agreed that this book leads with scaling out. That distinction is about how you add resources. The present section is about a sharper and more practical question: once you have decided to add machines, what exactly are you holding fixed as you add them? The answer splits the world cleanly in two, and the two halves behave so differently that a system which looks like a failure under one definition can be a complete success under the other.
Suppose you have a job that takes one hour on a single machine and you grow your cluster to sixty-four machines. If the job is the same job, you are asking it to finish in well under a minute, and you will almost certainly be disappointed. If instead you let the job grow sixty-four times larger while keeping each machine as busy as it was before, you are asking only that the larger job still finish in about an hour, and that is a request the cluster can often honor. Same hardware, same software, opposite expectations. The vocabulary that keeps these straight is strong scaling and weak scaling, and we take them one at a time.
1. Strong Scaling: Fix the Total Problem, Add Machines Beginner
Strong scaling holds the total problem size constant and increases the number of workers $K$, hoping the wall-clock time drops in proportion. This is the intuitive meaning of "make it faster": the dataset, the model, the number of training steps are all fixed, and you throw hardware at the existing job to shorten it. Let $T(1)$ be the time on one worker and $T(K)$ the time on $K$ workers. The two quantities that matter are the speedup and the strong-scaling efficiency,
$$S(K) = \frac{T(1)}{T(K)}, \qquad E_{\text{strong}}(K) = \frac{S(K)}{K} = \frac{T(1)}{K \, T(K)}.$$Perfect strong scaling means $S(K) = K$ and $E_{\text{strong}}(K) = 1$: sixteen machines finish the fixed job sixteen times faster. Reality falls short because not all of the work parallelizes. Write the single-worker time as a serial part that every worker must redo (or that only one worker can do) plus a parallel part that splits cleanly, and add a communication term that, far from shrinking, grows with the number of workers because more participants means more coordination:
$$T(K) = \underbrace{s}_{\text{serial}} + \underbrace{\frac{1 - s}{K}}_{\text{parallel, shrinks}} + \underbrace{c \, \log_2 K}_{\text{communication, grows}}.$$Here $s$ is the fraction of one unit of work that cannot be parallelized, and $c$ scales a communication cost that rises with $K$ (a $\log_2 K$ term is a good stand-in for tree-structured collectives, which we cost precisely in Chapter 4). The parallel part is the only term that benefits from more workers. The serial part $s$ is a floor the speedup can never break through, and the communication term eventually turns increasing $K$ into a net loss. This floor is exactly Amdahl's law, which Section 3.5 treats as a law in its own right; here it is simply the reason strong scaling is the harder of the two regimes.
When you add workers to a fixed problem, only the parallel fraction is divided among them. The serial fraction stays the same size, and the communication cost actually grows. So the useful work per worker keeps falling while the overhead per worker keeps rising, and efficiency drops monotonically. Beyond some point, the next machine you add removes less time from the parallel part than it adds to communication, and the job gets slower. Strong-scaling efficiency tells you where that point is, and it is almost always at fewer machines than you expected.
2. Weak Scaling: Fix the Work Per Machine, Grow the Problem Beginner
Weak scaling holds the work per worker constant and grows the total problem in step with $K$, hoping the wall-clock time stays roughly flat. You are no longer trying to finish a fixed job faster; you are trying to solve a $K$-times-larger job in the same time you used to solve the original. This is the right model whenever the ambition grows with the resources: a bigger corpus, a finer simulation, a larger global batch. The relevant metric is the weak-scaling efficiency, the ratio of the one-worker time to the $K$-worker time on the proportionally larger problem,
$$E_{\text{weak}}(K) = \frac{T(1)}{T(K)},$$where now $T(K)$ is measured on a problem $K$ times the size of the one used for $T(1)$. Perfect weak scaling means $E_{\text{weak}}(K) = 1$: the time does not move as the cluster and the problem grow together. In the same decomposition as before, the parallel work per worker is now constant (each worker always holds one worker's share), the serial part is replicated rather than divided, and only the communication term still grows:
$$T(K) = \underbrace{s}_{\text{serial, replicated}} + \underbrace{(1 - s)}_{\text{parallel, constant per worker}} + \underbrace{c \, \log_2 K}_{\text{communication, grows}}.$$The decisive difference from strong scaling is that the parallel term no longer carries a $1/K$ that you are fighting to exploit; it is fixed. The only thing eroding weak-scaling efficiency is the communication term, which grows slowly (logarithmically here). That is why weak scaling so often stays near $100\%$ out to large $K$ while strong scaling has long since collapsed. The cost of weak scaling is not in the timing but in the meaning: you have changed the problem, and you must be sure the larger problem is one you actually wanted to solve.
Amdahl's law, the gloom behind strong scaling, says a small serial fraction caps your speedup forever. John Gustafson's 1988 observation was that nobody buys a thousand-machine cluster to run their old small problem; they run a thousand-times-bigger one. Under that assumption the serial fraction becomes a vanishing slice of a growing job, and the pessimism evaporates. Both men were right; they were just answering the two different questions in this section. We let them debate properly in Section 3.5.
3. Seeing Both Curves at Once Intermediate
The two efficiency definitions are easiest to trust when you watch them diverge on the same cost model. The code below implements the two time formulas above with one shared serial fraction $s = 0.05$ and one shared communication coefficient $c = 0.01$, then sweeps $K$ from $1$ to $128$ and prints a strong-scaling table and a weak-scaling table side by side. Nothing here is parallel; it is a deterministic model of the timing, which is exactly what you want when the goal is to see the structure rather than to benchmark a particular machine.
import math
# A single workload step has a serial fraction s that cannot be parallelized,
# plus a communication term that GROWS with the number of workers K.
# Strong scaling fixes the TOTAL problem (size 1.0): the parallel part is divided.
# T_strong(K) = s + (1 - s) / K + c * log2(K)
# Weak scaling fixes the PER-WORKER problem: the parallel part stays constant.
# T_weak(K) = s + (1 - s) + c * log2(K)
s = 0.05 # serial / non-parallelizable fraction of one unit of work
c = 0.010 # per-step communication coefficient (grows with log2 K)
Ks = [1, 2, 4, 8, 16, 32, 64, 128]
def t_strong(K):
return s + (1.0 - s) / K + c * math.log2(K)
def t_weak(K):
return s + (1.0 - s) + c * math.log2(K)
T1_strong, T1_weak = t_strong(1), t_weak(1)
print("STRONG SCALING (fixed TOTAL problem; hope time drops as K grows)")
print(f"{'K':>5} {'time':>9} {'speedup':>9} {'efficiency':>11}")
for K in Ks:
tK = t_strong(K)
speedup = T1_strong / tK # S(K) = T(1) / T(K)
print(f"{K:>5} {tK:>9.4f} {speedup:>9.3f} {speedup / K:>10.1%}") # E = S/K
print()
print("WEAK SCALING (fixed PER-WORKER problem; hope time stays constant)")
print(f"{'K':>5} {'time':>9} {'work':>9} {'efficiency':>11}")
for K in Ks:
tK = t_weak(K)
print(f"{K:>5} {tK:>9.4f} {K:>9} {T1_weak / tK:>10.1%}") # E = T(1)/T(K)
t_strong and t_weak is whether the parallel term carries a / K; everything downstream follows from that single change.STRONG SCALING (fixed TOTAL problem; hope time drops as K grows)
K time speedup efficiency
1 1.0000 1.000 100.0%
2 0.5350 1.869 93.5%
4 0.3075 3.252 81.3%
8 0.1988 5.031 62.9%
16 0.1494 6.695 41.8%
32 0.1297 7.711 24.1%
64 0.1248 8.010 12.5%
128 0.1274 7.848 6.1%
WEAK SCALING (fixed PER-WORKER problem; hope time stays constant)
K time work efficiency
1 1.0000 1 100.0%
2 1.0100 2 99.0%
4 1.0200 4 98.0%
8 1.0300 8 97.1%
16 1.0400 16 96.2%
32 1.0500 32 95.2%
64 1.0600 64 94.3%
128 1.0700 128 93.5%
Read the two tables together and the lesson is stark. In the strong table the same fixed job is being divided ever more finely; the speedup stalls near $8\times$ and then reverses, because past $K = 64$ the communication term $c \log_2 K$ is adding more time than the parallel term $(1 - s)/K$ is removing. In the weak table the job grows with the cluster, each worker keeps a full unit of parallel work, and the only erosion comes from the slowly growing communication term, so $128\times$ the work costs just $7\%$ more time. The figure below traces both efficiency curves so the divergence is visible at a glance.
4. Why This Is the Central Choice in Data-Parallel Training Intermediate
The strong-versus-weak distinction is not a borrowed piece of high-performance-computing folklore that we are visiting out of duty; it is the exact shape of the most important decision in distributed deep learning. Recall from Section 1.1 that data-parallel training splits a batch of examples across $K$ GPUs, each GPU computes a gradient on its shard, and an all-reduce averages the gradients. The question that decides everything about how this scales is what happens to the global batch size as you add GPUs.
If you hold the global batch fixed and split it across more GPUs, you are doing strong scaling. The total work of one training step (one fixed batch, one model update) is constant, and each added GPU gets a thinner slice of that batch. The per-GPU computation shrinks as $1/K$ while the all-reduce cost grows with $K$, so you slide straight down the red curve in Figure 3.3.1: the per-step time drops at first, then flattens as communication dominates, and the local batch eventually becomes too small to keep the GPU busy. This is why a fixed-batch training job stops getting faster long before you run out of GPUs to buy.
If instead you grow the global batch in proportion to the number of GPUs, keeping the per-GPU batch fixed, you are doing weak scaling. Each GPU always processes a full local batch, its computation stays constant, and only the all-reduce grows, so you ride the flat green curve. This is the regime that makes thousand-GPU training jobs possible, and it is precisely why large-batch training is a field of its own. The catch is the one that weak scaling always carries: you have changed the problem. A larger global batch means fewer gradient updates per epoch and a different optimization trajectory, so you must adjust the learning rate (the linear-scaling rule and warmup of Goyal et al.) and often the optimizer itself (LARS, LAMB) to keep the larger-batch run converging to the same quality. We develop these large-batch optimization techniques in Chapter 10 and assemble them into a working data-parallel training loop in Chapter 15.
The thesis of this book is that intelligence at scale is built by distributing work across machines, and this section explains which way of distributing actually pays. Strong scaling, the instinct to make a fixed job finish faster, runs into a wall set by the serial fraction and the communication tax, and that wall is close. Weak scaling, the discipline of growing the ambition with the hardware, is how every record-setting model was trained: more GPUs were used to train on more tokens with a larger batch, not to grind a fixed batch ever finer. When a later chapter shows a method scaling to thousands of devices, look for the weak-scaling assumption underneath; it is almost always there.
You do not implement strong or weak scaling by hand; you choose it through how you set the batch size when launching a data-parallel job. The cost model of Code 3.3.1 becomes a one-line decision in PyTorch's DistributedDataParallel, where the same training script is strong-scaling or weak-scaling depending on whether per_gpu_batch is held constant as the world size grows:
# Run with: torchrun --nproc_per_node=K train.py
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler
dist.init_process_group("nccl")
K = dist.get_world_size()
# WEAK scaling: fix the PER-GPU batch, so the global batch grows with K.
per_gpu_batch = 256
global_batch = per_gpu_batch * K # grows: 256, 512, 1024, ...
# (remember to scale the learning rate with global_batch; see Chapter 10)
# STRONG scaling would instead fix global_batch and shrink the per-GPU slice:
# per_gpu_batch = global_batch // K # shrinks: 256, 128, 64, ...
loader = DataLoader(dataset, batch_size=per_gpu_batch,
sampler=DistributedSampler(dataset)) # shards data across K ranks
per_gpu_batch or global_batch is the constant. The library handles the gradient all-reduce and data sharding; you only decide which quantity stays fixed as $K$ grows.Who: An ML platform engineer onboarding a research team onto a new sixty-four-GPU training cluster.
Situation: The team benchmarked their existing image model by holding the global batch at $1{,}024$ and increasing GPUs from $8$ to $64$, expecting an $8\times$ speedup.
Problem: Wall-clock per epoch improved by only about $2.5\times$ going from $8$ to $64$ GPUs, and the platform was declared "not scaling" in a status report.
Dilemma: Buy faster interconnect to rescue the strong-scaling curve (expensive, with diminishing returns), or accept that a fixed batch of $1{,}024$ simply cannot keep $64$ GPUs busy and reframe the goal.
Decision: Measure weak scaling instead: hold the per-GPU batch at $128$ so the global batch grew to $8{,}192$ at $64$ GPUs, then tune the learning rate with linear scaling and warmup.
How: They flipped a single config flag from a fixed global batch to a fixed per-GPU batch, applied the linear learning-rate rule, and re-ran the sweep.
Result: Weak-scaling efficiency held above $90\%$ to $64$ GPUs; the larger-batch run reached the same validation accuracy in far less wall-clock per token, and the "broken" cluster was reclassified as healthy.
Lesson: A system that fails strong scaling can ace weak scaling on the same hardware. Decide which question you are actually asking before you spend money answering the wrong one.
5. Choosing the Right Metric for the Job Intermediate
Neither regime is "correct"; each answers a different real question, and Table 3.3.1 lines them up so you can pick deliberately. Use strong scaling when the problem is genuinely fixed and you need the answer sooner: a nightly retraining job on a frozen dataset, a hyperparameter trial that must finish before the next one starts, an inference batch of a fixed size. Use weak scaling when the resources are what is fixed and the ambition is free to grow: pretraining a foundation model where more compute simply means more tokens and a bigger batch, or a simulation whose fidelity you would happily increase.
| Dimension | Strong scaling | Weak scaling |
|---|---|---|
| Held constant | Total problem size | Work per worker |
| Grows with $K$ | Nothing (job is fixed) | Total problem size |
| Hoped-for outcome | Time drops as $1/K$ | Time stays constant |
| Efficiency metric | $E_{\text{strong}} = T(1) / (K\,T(K))$ | $E_{\text{weak}} = T(1) / T(K)$ |
| Limited mainly by | Serial fraction and communication | Communication only |
| Difficulty | Hard; efficiency falls fast | Easier; efficiency stays high |
| AI example | Fixed global batch over more GPUs | Global batch grows with GPUs |
| The hidden catch | Hits a hard speedup ceiling | You solved a different problem |
One discipline ties the two together: always report which regime you measured. A speedup number with no statement of what was held fixed is uninterpretable, because a glowing weak-scaling result and a dismal strong-scaling result can come from the identical job on the identical cluster, as the platform engineer above discovered. When a vendor or a paper quotes "near-linear scaling to a thousand GPUs", the first question is whether the batch grew with the GPUs; if it did, you are reading a weak-scaling claim, and it tells you nothing about how fast your own fixed job will run. We return to honest scaling reports, including cost-per-result rather than raw speedup, in Section 3.9, and the scheduling decisions that follow from these curves appear in Chapter 33.
Weak scaling in training is ultimately limited by how large a global batch still trains well, which is why batch-size and learning-rate research is the live edge of this regime. The linear-scaling rule with warmup (Goyal et al., 2017) and the layer-wise adaptive optimizers LARS and LAMB (You et al., 2017; You et al., 2020) first pushed useful batches into the tens of thousands. More recent work characterizes a "critical batch size" beyond which adding examples per step stops buying faster convergence, and ties it to the data and model scale: McCandlish et al.'s gradient-noise-scale analysis, and 2024 to 2025 studies measuring how the critical batch size shifts with token budget and model size for large language models. The practical upshot is a moving ceiling on weak scaling: more GPUs help only up to the batch the optimizer can still exploit, after which you need additional parallelism axes (the model, pipeline, and sharded parallelism of Chapter 16) rather than a bigger batch. We quantify the critical batch size and these optimizers in Chapter 10.
We now have two precise definitions, two efficiency metrics, and a runnable model showing why one curve survives to large $K$ while the other does not. Both, though, have so far been about throughput: how much work per unit time. They say nothing about how long a single request waits, which is the other half of performance and the subject of the next section. We turn to latency, throughput, and the long tail of slow responses in Section 3.4.
For each scenario, state whether it is strong scaling or weak scaling, name what is held constant, and predict whether efficiency will stay high or fall: (a) a fixed $50{,}000$-image validation set is scored on $4$, then $16$, then $64$ GPUs; (b) a language model is pretrained on $8$, then $64$, then $512$ GPUs, with the global batch and token budget growing in proportion; (c) a fixed Monte Carlo simulation of exactly $10^9$ samples is divided across a growing cluster; (d) a weather model adds more grid cells as more nodes become available, keeping cells-per-node constant. Explain why mislabeling (a) as weak scaling would make a slow result look like a success.
Extend Code 3.3.1 to find, for the strong-scaling model, the value of $K$ that minimizes t_strong(K) (the point past which adding workers makes the fixed job slower). Then sweep the serial fraction $s$ over $\{0.01, 0.05, 0.10\}$ and the communication coefficient $c$ over $\{0.005, 0.02\}$, and tabulate the optimal $K$ for each combination. Describe how the best worker count responds to a larger serial fraction and to a more expensive network, and connect your finding to the claim that strong scaling "tops out at fewer machines than you expected".
Using the weak-scaling formula $T(K) = s + (1 - s) + c \log_2 K$, derive the per-worker work $w$ (currently fixed at $1 - s$) that you would have to assign each worker to hold the weak-scaling efficiency exactly constant as $K$ grows, given that communication grows as $c \log_2 K$. This is the iso-efficiency function. Argue from your result how the per-worker problem must grow with $K$ to keep efficiency pinned, and explain why a workload whose communication grew linearly in $K$ rather than logarithmically would be far harder to weak-scale. We make the communication terms behind $c$ precise in Chapter 4.