Part I: Foundations of Distributed AI
Chapter 3: Scalability and Performance Models

Strong and Weak Scaling

"They handed me sixty-three new colleagues and asked me to finish the same job sixty-four times faster. I finished it eight times faster and called that a triumph."

A Worker Measuring Its Own Efficiency
Big Picture

There are two completely different questions hiding inside "does it scale?", and confusing them is the most common way to misjudge a distributed system. The first question fixes the total amount of work and asks whether adding machines makes that fixed job finish faster; this is strong scaling, and it is hard, because the serial and communication parts of the job do not shrink when you add workers. The second question fixes the work per machine and asks whether a proportionally larger job still finishes in about the same time; this is weak scaling, and it is far more forgiving, because each worker keeps a constant slice of useful work. Almost every scaling decision in AI is secretly a choice between these two regimes: training with a fixed global batch is strong scaling, while growing the global batch as you add GPUs is weak scaling. This section defines both precisely, gives the efficiency metric for each, and shows with a runnable model why one curve collapses while the other stays flat.

In Section 3.2 we separated scaling up (a bigger machine) from scaling out (more machines) and agreed that this book leads with scaling out. That distinction is about how you add resources. The present section is about a sharper and more practical question: once you have decided to add machines, what exactly are you holding fixed as you add them? The answer splits the world cleanly in two, and the two halves behave so differently that a system which looks like a failure under one definition can be a complete success under the other.

Suppose you have a job that takes one hour on a single machine and you grow your cluster to sixty-four machines. If the job is the same job, you are asking it to finish in well under a minute, and you will almost certainly be disappointed. If instead you let the job grow sixty-four times larger while keeping each machine as busy as it was before, you are asking only that the larger job still finish in about an hour, and that is a request the cluster can often honor. Same hardware, same software, opposite expectations. The vocabulary that keeps these straight is strong scaling and weak scaling, and we take them one at a time.

1. Strong Scaling: Fix the Total Problem, Add Machines Beginner

Strong scaling holds the total problem size constant and increases the number of workers $K$, hoping the wall-clock time drops in proportion. This is the intuitive meaning of "make it faster": the dataset, the model, the number of training steps are all fixed, and you throw hardware at the existing job to shorten it. Let $T(1)$ be the time on one worker and $T(K)$ the time on $K$ workers. The two quantities that matter are the speedup and the strong-scaling efficiency,

$$S(K) = \frac{T(1)}{T(K)}, \qquad E_{\text{strong}}(K) = \frac{S(K)}{K} = \frac{T(1)}{K \, T(K)}.$$

Perfect strong scaling means $S(K) = K$ and $E_{\text{strong}}(K) = 1$: sixteen machines finish the fixed job sixteen times faster. Reality falls short because not all of the work parallelizes. Write the single-worker time as a serial part that every worker must redo (or that only one worker can do) plus a parallel part that splits cleanly, and add a communication term that, far from shrinking, grows with the number of workers because more participants means more coordination:

$$T(K) = \underbrace{s}_{\text{serial}} + \underbrace{\frac{1 - s}{K}}_{\text{parallel, shrinks}} + \underbrace{c \, \log_2 K}_{\text{communication, grows}}.$$

Here $s$ is the fraction of one unit of work that cannot be parallelized, and $c$ scales a communication cost that rises with $K$ (a $\log_2 K$ term is a good stand-in for tree-structured collectives, which we cost precisely in Chapter 4). The parallel part is the only term that benefits from more workers. The serial part $s$ is a floor the speedup can never break through, and the communication term eventually turns increasing $K$ into a net loss. This floor is exactly Amdahl's law, which Section 3.5 treats as a law in its own right; here it is simply the reason strong scaling is the harder of the two regimes.

Key Insight: Strong Scaling Is Hard Because the Stubborn Parts Do Not Shrink

When you add workers to a fixed problem, only the parallel fraction is divided among them. The serial fraction stays the same size, and the communication cost actually grows. So the useful work per worker keeps falling while the overhead per worker keeps rising, and efficiency drops monotonically. Beyond some point, the next machine you add removes less time from the parallel part than it adds to communication, and the job gets slower. Strong-scaling efficiency tells you where that point is, and it is almost always at fewer machines than you expected.

2. Weak Scaling: Fix the Work Per Machine, Grow the Problem Beginner

Weak scaling holds the work per worker constant and grows the total problem in step with $K$, hoping the wall-clock time stays roughly flat. You are no longer trying to finish a fixed job faster; you are trying to solve a $K$-times-larger job in the same time you used to solve the original. This is the right model whenever the ambition grows with the resources: a bigger corpus, a finer simulation, a larger global batch. The relevant metric is the weak-scaling efficiency, the ratio of the one-worker time to the $K$-worker time on the proportionally larger problem,

$$E_{\text{weak}}(K) = \frac{T(1)}{T(K)},$$

where now $T(K)$ is measured on a problem $K$ times the size of the one used for $T(1)$. Perfect weak scaling means $E_{\text{weak}}(K) = 1$: the time does not move as the cluster and the problem grow together. In the same decomposition as before, the parallel work per worker is now constant (each worker always holds one worker's share), the serial part is replicated rather than divided, and only the communication term still grows:

$$T(K) = \underbrace{s}_{\text{serial, replicated}} + \underbrace{(1 - s)}_{\text{parallel, constant per worker}} + \underbrace{c \, \log_2 K}_{\text{communication, grows}}.$$

The decisive difference from strong scaling is that the parallel term no longer carries a $1/K$ that you are fighting to exploit; it is fixed. The only thing eroding weak-scaling efficiency is the communication term, which grows slowly (logarithmically here). That is why weak scaling so often stays near $100\%$ out to large $K$ while strong scaling has long since collapsed. The cost of weak scaling is not in the timing but in the meaning: you have changed the problem, and you must be sure the larger problem is one you actually wanted to solve.

Fun Note: Gustafson's Quiet Rebuttal

Amdahl's law, the gloom behind strong scaling, says a small serial fraction caps your speedup forever. John Gustafson's 1988 observation was that nobody buys a thousand-machine cluster to run their old small problem; they run a thousand-times-bigger one. Under that assumption the serial fraction becomes a vanishing slice of a growing job, and the pessimism evaporates. Both men were right; they were just answering the two different questions in this section. We let them debate properly in Section 3.5.

3. Seeing Both Curves at Once Intermediate

The two efficiency definitions are easiest to trust when you watch them diverge on the same cost model. The code below implements the two time formulas above with one shared serial fraction $s = 0.05$ and one shared communication coefficient $c = 0.01$, then sweeps $K$ from $1$ to $128$ and prints a strong-scaling table and a weak-scaling table side by side. Nothing here is parallel; it is a deterministic model of the timing, which is exactly what you want when the goal is to see the structure rather than to benchmark a particular machine.

import math

# A single workload step has a serial fraction s that cannot be parallelized,
# plus a communication term that GROWS with the number of workers K.
# Strong scaling fixes the TOTAL problem (size 1.0): the parallel part is divided.
#   T_strong(K) = s + (1 - s) / K + c * log2(K)
# Weak scaling fixes the PER-WORKER problem: the parallel part stays constant.
#   T_weak(K)   = s + (1 - s)     + c * log2(K)

s = 0.05          # serial / non-parallelizable fraction of one unit of work
c = 0.010         # per-step communication coefficient (grows with log2 K)
Ks = [1, 2, 4, 8, 16, 32, 64, 128]

def t_strong(K):
    return s + (1.0 - s) / K + c * math.log2(K)

def t_weak(K):
    return s + (1.0 - s) + c * math.log2(K)

T1_strong, T1_weak = t_strong(1), t_weak(1)

print("STRONG SCALING  (fixed TOTAL problem; hope time drops as K grows)")
print(f"{'K':>5} {'time':>9} {'speedup':>9} {'efficiency':>11}")
for K in Ks:
    tK = t_strong(K)
    speedup = T1_strong / tK            # S(K) = T(1) / T(K)
    print(f"{K:>5} {tK:>9.4f} {speedup:>9.3f} {speedup / K:>10.1%}")  # E = S/K

print()
print("WEAK SCALING    (fixed PER-WORKER problem; hope time stays constant)")
print(f"{'K':>5} {'time':>9} {'work':>9} {'efficiency':>11}")
for K in Ks:
    tK = t_weak(K)
    print(f"{K:>5} {tK:>9.4f} {K:>9} {T1_weak / tK:>10.1%}")  # E = T(1)/T(K)
Code 3.3.1: One cost model, two scaling regimes. The only structural difference between t_strong and t_weak is whether the parallel term carries a / K; everything downstream follows from that single change.
STRONG SCALING  (fixed TOTAL problem; hope time drops as K grows)
    K      time   speedup  efficiency
    1    1.0000     1.000     100.0%
    2    0.5350     1.869      93.5%
    4    0.3075     3.252      81.3%
    8    0.1988     5.031      62.9%
   16    0.1494     6.695      41.8%
   32    0.1297     7.711      24.1%
   64    0.1248     8.010      12.5%
  128    0.1274     7.848       6.1%

WEAK SCALING    (fixed PER-WORKER problem; hope time stays constant)
    K      time      work  efficiency
    1    1.0000         1     100.0%
    2    1.0100         2      99.0%
    4    1.0200         4      98.0%
    8    1.0300         8      97.1%
   16    1.0400        16      96.2%
   32    1.0500        32      95.2%
   64    1.0600        64      94.3%
  128    1.0700       128      93.5%
Output 3.3.1: Strong-scaling efficiency falls from $100\%$ to $6.1\%$ by $K = 128$, and the time even ticks back up after $K = 64$ as communication overtakes the shrinking parallel work. Weak-scaling efficiency stays above $93\%$ across the same range, with $128$ times the work done in $1.07$ times the time.

Read the two tables together and the lesson is stark. In the strong table the same fixed job is being divided ever more finely; the speedup stalls near $8\times$ and then reverses, because past $K = 64$ the communication term $c \log_2 K$ is adding more time than the parallel term $(1 - s)/K$ is removing. In the weak table the job grows with the cluster, each worker keeps a full unit of parallel work, and the only erosion comes from the slowly growing communication term, so $128\times$ the work costs just $7\%$ more time. The figure below traces both efficiency curves so the divergence is visible at a glance.

number of workers K (log scale) efficiency 100% 75% 50% 25% 1 2 4 8 16 32 64 128 weak scaling (stays flat) strong scaling (collapses)
Figure 3.3.1: The two efficiency curves from Output 3.3.1 on one set of axes, with workers $K$ on a logarithmic horizontal axis. Weak-scaling efficiency (green) clings to the top of the chart because each worker keeps a constant slice of useful work. Strong-scaling efficiency (red) slides toward the floor because the fixed parallel work is split ever thinner while the serial and communication costs refuse to shrink.

4. Why This Is the Central Choice in Data-Parallel Training Intermediate

The strong-versus-weak distinction is not a borrowed piece of high-performance-computing folklore that we are visiting out of duty; it is the exact shape of the most important decision in distributed deep learning. Recall from Section 1.1 that data-parallel training splits a batch of examples across $K$ GPUs, each GPU computes a gradient on its shard, and an all-reduce averages the gradients. The question that decides everything about how this scales is what happens to the global batch size as you add GPUs.

If you hold the global batch fixed and split it across more GPUs, you are doing strong scaling. The total work of one training step (one fixed batch, one model update) is constant, and each added GPU gets a thinner slice of that batch. The per-GPU computation shrinks as $1/K$ while the all-reduce cost grows with $K$, so you slide straight down the red curve in Figure 3.3.1: the per-step time drops at first, then flattens as communication dominates, and the local batch eventually becomes too small to keep the GPU busy. This is why a fixed-batch training job stops getting faster long before you run out of GPUs to buy.

If instead you grow the global batch in proportion to the number of GPUs, keeping the per-GPU batch fixed, you are doing weak scaling. Each GPU always processes a full local batch, its computation stays constant, and only the all-reduce grows, so you ride the flat green curve. This is the regime that makes thousand-GPU training jobs possible, and it is precisely why large-batch training is a field of its own. The catch is the one that weak scaling always carries: you have changed the problem. A larger global batch means fewer gradient updates per epoch and a different optimization trajectory, so you must adjust the learning rate (the linear-scaling rule and warmup of Goyal et al.) and often the optimizer itself (LARS, LAMB) to keep the larger-batch run converging to the same quality. We develop these large-batch optimization techniques in Chapter 10 and assemble them into a working data-parallel training loop in Chapter 15.

Thesis Thread: Scale-Out Rewards the Weak-Scaling Mindset

The thesis of this book is that intelligence at scale is built by distributing work across machines, and this section explains which way of distributing actually pays. Strong scaling, the instinct to make a fixed job finish faster, runs into a wall set by the serial fraction and the communication tax, and that wall is close. Weak scaling, the discipline of growing the ambition with the hardware, is how every record-setting model was trained: more GPUs were used to train on more tokens with a larger batch, not to grind a fixed batch ever finer. When a later chapter shows a method scaling to thousands of devices, look for the weak-scaling assumption underneath; it is almost always there.

Library Shortcut: PyTorch Picks Your Scaling Regime by One Argument

You do not implement strong or weak scaling by hand; you choose it through how you set the batch size when launching a data-parallel job. The cost model of Code 3.3.1 becomes a one-line decision in PyTorch's DistributedDataParallel, where the same training script is strong-scaling or weak-scaling depending on whether per_gpu_batch is held constant as the world size grows:

# Run with: torchrun --nproc_per_node=K train.py
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group("nccl")
K = dist.get_world_size()

# WEAK scaling: fix the PER-GPU batch, so the global batch grows with K.
per_gpu_batch = 256
global_batch  = per_gpu_batch * K          # grows: 256, 512, 1024, ...
# (remember to scale the learning rate with global_batch; see Chapter 10)

# STRONG scaling would instead fix global_batch and shrink the per-GPU slice:
#   per_gpu_batch = global_batch // K       # shrinks: 256, 128, 64, ...

loader = DataLoader(dataset, batch_size=per_gpu_batch,
                    sampler=DistributedSampler(dataset))   # shards data across K ranks
Code 3.3.2: The strong-versus-weak choice from this section, expressed as whether per_gpu_batch or global_batch is the constant. The library handles the gradient all-reduce and data sharding; you only decide which quantity stays fixed as $K$ grows.
Practical Example: The Cluster That Looked Broken Until We Changed the Question

Who: An ML platform engineer onboarding a research team onto a new sixty-four-GPU training cluster.

Situation: The team benchmarked their existing image model by holding the global batch at $1{,}024$ and increasing GPUs from $8$ to $64$, expecting an $8\times$ speedup.

Problem: Wall-clock per epoch improved by only about $2.5\times$ going from $8$ to $64$ GPUs, and the platform was declared "not scaling" in a status report.

Dilemma: Buy faster interconnect to rescue the strong-scaling curve (expensive, with diminishing returns), or accept that a fixed batch of $1{,}024$ simply cannot keep $64$ GPUs busy and reframe the goal.

Decision: Measure weak scaling instead: hold the per-GPU batch at $128$ so the global batch grew to $8{,}192$ at $64$ GPUs, then tune the learning rate with linear scaling and warmup.

How: They flipped a single config flag from a fixed global batch to a fixed per-GPU batch, applied the linear learning-rate rule, and re-ran the sweep.

Result: Weak-scaling efficiency held above $90\%$ to $64$ GPUs; the larger-batch run reached the same validation accuracy in far less wall-clock per token, and the "broken" cluster was reclassified as healthy.

Lesson: A system that fails strong scaling can ace weak scaling on the same hardware. Decide which question you are actually asking before you spend money answering the wrong one.

5. Choosing the Right Metric for the Job Intermediate

Neither regime is "correct"; each answers a different real question, and Table 3.3.1 lines them up so you can pick deliberately. Use strong scaling when the problem is genuinely fixed and you need the answer sooner: a nightly retraining job on a frozen dataset, a hyperparameter trial that must finish before the next one starts, an inference batch of a fixed size. Use weak scaling when the resources are what is fixed and the ambition is free to grow: pretraining a foundation model where more compute simply means more tokens and a bigger batch, or a simulation whose fidelity you would happily increase.

Table 3.3.1: Strong and weak scaling compared along the dimensions that decide which one governs a given system.
DimensionStrong scalingWeak scaling
Held constantTotal problem sizeWork per worker
Grows with $K$Nothing (job is fixed)Total problem size
Hoped-for outcomeTime drops as $1/K$Time stays constant
Efficiency metric$E_{\text{strong}} = T(1) / (K\,T(K))$$E_{\text{weak}} = T(1) / T(K)$
Limited mainly bySerial fraction and communicationCommunication only
DifficultyHard; efficiency falls fastEasier; efficiency stays high
AI exampleFixed global batch over more GPUsGlobal batch grows with GPUs
The hidden catchHits a hard speedup ceilingYou solved a different problem

One discipline ties the two together: always report which regime you measured. A speedup number with no statement of what was held fixed is uninterpretable, because a glowing weak-scaling result and a dismal strong-scaling result can come from the identical job on the identical cluster, as the platform engineer above discovered. When a vendor or a paper quotes "near-linear scaling to a thousand GPUs", the first question is whether the batch grew with the GPUs; if it did, you are reading a weak-scaling claim, and it tells you nothing about how fast your own fixed job will run. We return to honest scaling reports, including cost-per-result rather than raw speedup, in Section 3.9, and the scheduling decisions that follow from these curves appear in Chapter 33.

Research Frontier: How Far Can Weak Scaling Push the Batch? (2024 to 2026)

Weak scaling in training is ultimately limited by how large a global batch still trains well, which is why batch-size and learning-rate research is the live edge of this regime. The linear-scaling rule with warmup (Goyal et al., 2017) and the layer-wise adaptive optimizers LARS and LAMB (You et al., 2017; You et al., 2020) first pushed useful batches into the tens of thousands. More recent work characterizes a "critical batch size" beyond which adding examples per step stops buying faster convergence, and ties it to the data and model scale: McCandlish et al.'s gradient-noise-scale analysis, and 2024 to 2025 studies measuring how the critical batch size shifts with token budget and model size for large language models. The practical upshot is a moving ceiling on weak scaling: more GPUs help only up to the batch the optimizer can still exploit, after which you need additional parallelism axes (the model, pipeline, and sharded parallelism of Chapter 16) rather than a bigger batch. We quantify the critical batch size and these optimizers in Chapter 10.

We now have two precise definitions, two efficiency metrics, and a runnable model showing why one curve survives to large $K$ while the other does not. Both, though, have so far been about throughput: how much work per unit time. They say nothing about how long a single request waits, which is the other half of performance and the subject of the next section. We turn to latency, throughput, and the long tail of slow responses in Section 3.4.

Exercise 3.3.1: Name the Regime Conceptual

For each scenario, state whether it is strong scaling or weak scaling, name what is held constant, and predict whether efficiency will stay high or fall: (a) a fixed $50{,}000$-image validation set is scored on $4$, then $16$, then $64$ GPUs; (b) a language model is pretrained on $8$, then $64$, then $512$ GPUs, with the global batch and token budget growing in proportion; (c) a fixed Monte Carlo simulation of exactly $10^9$ samples is divided across a growing cluster; (d) a weather model adds more grid cells as more nodes become available, keeping cells-per-node constant. Explain why mislabeling (a) as weak scaling would make a slow result look like a success.

Exercise 3.3.2: Find the Crossover Coding

Extend Code 3.3.1 to find, for the strong-scaling model, the value of $K$ that minimizes t_strong(K) (the point past which adding workers makes the fixed job slower). Then sweep the serial fraction $s$ over $\{0.01, 0.05, 0.10\}$ and the communication coefficient $c$ over $\{0.005, 0.02\}$, and tabulate the optimal $K$ for each combination. Describe how the best worker count responds to a larger serial fraction and to a more expensive network, and connect your finding to the claim that strong scaling "tops out at fewer machines than you expected".

Exercise 3.3.3: The Iso-Efficiency Question Analysis

Using the weak-scaling formula $T(K) = s + (1 - s) + c \log_2 K$, derive the per-worker work $w$ (currently fixed at $1 - s$) that you would have to assign each worker to hold the weak-scaling efficiency exactly constant as $K$ grows, given that communication grows as $c \log_2 K$. This is the iso-efficiency function. Argue from your result how the per-worker problem must grow with $K$ to keep efficiency pinned, and explain why a workload whose communication grew linearly in $K$ rather than logarithmically would be far harder to weak-scale. We make the communication terms behind $c$ precise in Chapter 4.