Part I: Foundations of Distributed AI
Chapter 3: Scalability and Performance Models

Work, Depth, and Parallelism

"You can hire a thousand workers, but you cannot make the ninth month of pregnancy finish in three days. Some chains just have to happen in order, and I am one of them."

A Critical Path That Will Not Be Hurried
Big Picture

How fast a computation can possibly run on infinitely many machines is not set by how much work it contains, but by the length of its longest chain of steps that must happen in order. Amdahl's law in the previous section told you that a fixed serial fraction caps your speedup; this section gives you the structural reason behind that fraction and a way to read it straight off the algorithm. We separate two numbers: the total operations a computation performs (its work) and the length of its longest dependency chain (its depth or span). Their ratio is the parallelism the algorithm actually offers, and Brent's theorem turns those two numbers into a usable bound on running time for any processor count. The payoff for distributed AI is direct: the reason a tree all-reduce over thousands of workers stays cheap is that its depth grows only as the logarithm of the worker count, so synchronization never becomes the wall.

The previous section, Section 3.5, measured the limit on speedup through a single number, the serial fraction in Amdahl's law. That number is honest but opaque: it tells you a ceiling exists without telling you where it comes from or how to lower it. The work-depth model opens the box. It describes a parallel computation as a directed acyclic graph of operations, where an edge means "this operation needs that one's result first," and it reads two quantities off that graph. The first is how big the graph is; the second is how long it is. The first bounds the work you must pay for; the second bounds the time you cannot escape no matter how many machines you bring. Keeping these two apart is the cleanest way to reason about whether an algorithm will scale before you write a line of distributed code.

A reduction tree over eight values x1 x2 x3 x4 x5 x6 x7 x8 + + + + + + sum round 1 round 2 round 3 depth D = ceil(log2 8) = 3 rounds, while the work W = 7 additions matches a sequential sum
Figure 3.6.1: Summing eight values as a balanced reduction tree. Each round halves the number of live operands, so the critical path is three rounds even though seven additions are performed; a sequential loop would do the same seven additions but stack all seven on one chain. The same shape, read as a communication pattern, is the tree all-reduce of Chapter 4, whose depth over $K$ workers is $O(\log K)$.

1. Two Numbers: Work and Depth Beginner

Model a parallel computation as a directed acyclic graph. Each node is one elementary operation; a directed edge from $u$ to $v$ means $v$ consumes a result that $u$ produces, so $u$ must finish before $v$ starts. Two quantities summarize this graph. The work $W$ is the total number of operations, the count of all nodes, which is exactly the time the computation would take on a single processor running one operation per step. The depth $D$, also called the span, is the number of operations on the longest directed path through the graph, the critical path. Depth is the time the computation would take on unlimited processors, because every operation on that longest chain must wait for its predecessor and nothing you add can shorten a chain of strict dependencies.

These two numbers are independent in a precise sense. Two algorithms that compute the same answer with the same work can have wildly different depths, and the one with smaller depth is the one that scales. Summing $N$ numbers is the cleanest illustration. Done as a left-to-right loop, each partial sum depends on the one before it, so the work is $N-1$ additions and the depth is also $N-1$: the entire computation is one long chain, and no number of helpers speeds it up. Done as a balanced tree, you add disjoint pairs in parallel, then pairs of pair-sums, and so on; the work is still $N-1$ additions, but the depth collapses to $\lceil \log_2 N \rceil$, the number of halving rounds, which is what Figure 3.6.1 draws for $N = 8$.

Key Insight: Same Work, Different Depth, Different Destiny

Work tells you the total bill; depth tells you the soonest you could possibly be done. A sequential sum and a tree sum perform the identical $N-1$ additions, so they cost the same work, yet the loop has depth $N-1$ and the tree has depth $\log_2 N$. When you go looking for parallelism in an AI workload, you are not hunting for less work, you are hunting for a way to arrange the same work into a shorter critical path. Reorganizing a chain into a tree is the single most common move, and it is why reductions and prefix scans, the low-depth primitives, sit underneath almost every scalable algorithm in this book.

2. Available Parallelism and Brent's Theorem Intermediate

The ratio of the two numbers has a name and a meaning. The available parallelism of a computation is

$$\Pi = \frac{W}{D},$$

the average number of operations that are ready to run at each step along the critical path. It is the largest processor count that the algorithm can keep usefully busy: beyond $W/D$ machines, some must idle because there are not enough independent operations to feed them. A sequential sum has $\Pi = (N-1)/(N-1) = 1$, no parallelism at all, while a tree sum has $\Pi = (N-1)/\log_2 N$, which for a million elements is roughly fifty thousand. That ratio, not the raw size of the problem, is what tells you how many workers are worth deploying.

Knowing $W$ and $D$ lets you bound the running time on any finite number of processors, which is what Brent's theorem delivers. If a computation has work $W$ and depth $D$, then a greedy schedule on $K$ processors finishes in time

$$T(K) \;\le\; \frac{W}{K} + D.$$

The two terms are exactly the two regimes you care about. The $W/K$ term is the work spread perfectly across $K$ machines, the speedup you hope for; the $D$ term is the irreducible chain you pay no matter how many machines you have. When $K$ is small relative to $\Pi = W/D$, the first term dominates and you get near-linear speedup. When $K$ grows past $\Pi$, the $D$ term takes over and adding machines stops helping, which is Amdahl's law of Section 3.5 re-derived from structure rather than from a hand-set serial fraction. The bound also reads as a lower limit, $T(K) \ge \max(W/K, D)$, so depth is a floor you can never break.

Fun Note: Nine Women, One Month

The old project-management quip that nine women cannot make a baby in one month is a statement about depth. Gestation is a dependency chain roughly nine months long: month seven cannot begin until month six finishes. Its work might be enormous, but its depth is fixed, so $W/K$ shrinks toward zero while the nine-month $D$ term sits there unmoved. Brent's bound says the same thing with less biology: when the depth term dominates, more workers buy you nothing.

3. Why Depth Is the Ultimate Limit Intermediate

Work can always be diluted by adding machines; depth cannot. This asymmetry is the whole reason depth, not work, is the quantity that decides scalability. Take the speedup of a computation on $K$ processors against one processor, $S(K) = T(1)/T(K)$. Since $T(1) = W$ and $T(K) \ge D$, the speedup obeys

$$S(K) \;\le\; \frac{W}{D} \;=\; \Pi,$$

no matter how large $K$ becomes. The available parallelism is a hard ceiling on speedup that depends only on the algorithm's structure, set entirely by its depth once the work is fixed. You can pour processors at a low-depth computation and ride the $W/K$ term for a long time; you can pour the same processors at a high-depth computation and stall almost immediately, because its critical path refuses to shrink. Designing for scale therefore means designing for low depth: every time you turn an $O(N)$-depth chain into an $O(\log N)$-depth tree, you raise the ceiling on how many machines can ever help you.

The demo below makes the contrast concrete. It computes the work, depth, and parallelism $W/D$ for summing $N$ numbers two ways, as a sequential loop and as a balanced tree, and then prints the Brent bound $W/K + D$ for several processor counts so you can watch the depth term take over as $K$ grows.

import math

def parallel_sum_cost(values):
    """Return (work, depth) for summing a list two ways.

    Sequential loop: W = N-1 additions, all on the critical path, so D = N-1.
    Balanced tree:    W = N-1 additions still, but the critical path is the
                      height of the tree, D = ceil(log2 N).
    """
    n = len(values)

    # --- Sequential loop: a chain of N-1 dependent additions ---
    acc = 0
    for v in values:
        acc = acc + v        # each add depends on the previous one
    seq_work  = n - 1        # number of real combine operations
    seq_depth = n - 1        # every add sits on the one long chain

    # --- Balanced reduction tree: combine neighbours in parallel rounds ---
    level = list(values)
    tree_work, tree_depth = 0, 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])   # these adds are independent
            tree_work += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])                  # carry the odd one up
        level = nxt
        tree_depth += 1                            # one parallel round = +1 depth

    assert acc == level[0]                         # both compute the same sum
    return seq_work, seq_depth, tree_work, tree_depth, acc

for N in [16, 1024, 1_000_000]:
    vals = list(range(1, N + 1))
    sw, sd, tw, td, total = parallel_sum_cost(vals)
    print(f"N = {N:>9}  sum = {total}")
    print(f"  sequential loop : work W = {sw:>9}   depth D = {sd:>9}   parallelism W/D = {sw/sd:>12.4f}")
    print(f"  balanced tree   : work W = {tw:>9}   depth D = {td:>9}   parallelism W/D = {tw/td:>12.4f}")
    print(f"  log2(N) = {math.log2(N):.4f}   (tree depth tracks ceil(log2 N))")
    for K in [1, 8, 64]:                            # Brent bound T(K) <= W/K + D, tree version
        print(f"    Brent bound on K={K:>3}: T <= W/K + D = {tw/K:.1f} + {td} = {tw/K + td:.1f} time units")
    print()
Code 3.6.1: The work-depth model on a parallel sum. The same $N-1$ additions are arranged first as a dependent chain and then as a balanced tree; the function returns both work and depth for each, and the loop prints the Brent bound across processor counts.
N =        16  sum = 136
  sequential loop : work W =        15   depth D =        15   parallelism W/D =       1.0000
  balanced tree   : work W =        15   depth D =         4   parallelism W/D =       3.7500
  log2(N) = 4.0000   (tree depth tracks ceil(log2 N))
    Brent bound on K=  1: T <= W/K + D = 15.0 + 4 = 19.0 time units
    Brent bound on K=  8: T <= W/K + D = 1.9 + 4 = 5.9 time units
    Brent bound on K= 64: T <= W/K + D = 0.2 + 4 = 4.2 time units

N =      1024  sum = 524800
  sequential loop : work W =      1023   depth D =      1023   parallelism W/D =       1.0000
  balanced tree   : work W =      1023   depth D =        10   parallelism W/D =     102.3000
  log2(N) = 10.0000   (tree depth tracks ceil(log2 N))
    Brent bound on K=  1: T <= W/K + D = 1023.0 + 10 = 1033.0 time units
    Brent bound on K=  8: T <= W/K + D = 127.9 + 10 = 137.9 time units
    Brent bound on K= 64: T <= W/K + D = 16.0 + 10 = 26.0 time units

N =   1000000  sum = 500000500000
  sequential loop : work W =    999999   depth D =    999999   parallelism W/D =       1.0000
  balanced tree   : work W =    999999   depth D =        20   parallelism W/D =   49999.9500
  log2(N) = 19.9316   (tree depth tracks ceil(log2 N))
    Brent bound on K=  1: T <= W/K + D = 999999.0 + 20 = 1000019.0 time units
    Brent bound on K=  8: T <= W/K + D = 124999.9 + 20 = 125019.9 time units
    Brent bound on K= 64: T <= W/K + D = 15625.0 + 20 = 15645.0 time units
Output 3.6.1: Real output. The loop's parallelism is pinned at 1.0 for every $N$, so no processor count helps it; the tree's parallelism reaches nearly 50,000 at a million elements because its depth is only 20. In the tree's Brent bounds the $W/K$ term shrinks with $K$ while the depth term ($+20$ at the largest $N$) stays put, which is the depth floor made visible.

Read the million-element block from the bottom up. The sequential sum has depth 999,999, so its parallelism is exactly 1 and Brent's bound never improves past the work itself: a thousand machines would leave 999 of them idle. The tree has depth 20, so up to about fifty thousand machines can stay busy, and only past that point does the modest $+20$ floor begin to matter. The structural change from chain to tree, at no extra work, is what moved the ceiling from "one machine is enough" to "tens of thousands of machines can help."

4. The AI Connection: Low-Depth Collectives Advanced

This is not an abstraction that stays in the textbook; it is the reason distributed training scales at all. A data-parallel training step, built in Section 3.5 and proved exact back in Section 1.1, ends every iteration by summing one gradient vector per worker and sharing the result, the collective called all-reduce. If that sum were done as a sequential chain, each of $K$ workers adding to a running total in turn, its depth would be $O(K)$ and synchronization would slow down linearly as you added workers, defeating the entire point of scaling out. Done as a tree, exactly the shape of Figure 3.6.1 but with workers at the leaves, the reduction has depth $O(\log K)$. Doubling the worker count adds one level, not one worker's worth of latency.

That logarithmic depth is why the synchronization cost of data parallelism grows so slowly that thousands of workers remain practical. Reductions (combine many values into one) and scans (compute all running prefixes) are the canonical low-depth primitives, both achievable in $O(\log K)$ depth, and they are the building blocks the communication library reaches for. Chapter 4 develops the all-reduce collective and its tree, ring, and hierarchical variants in full, each a different way of keeping the depth low while also respecting bandwidth; Chapter 15 wires that collective into the training loop, and Chapter 16 splits it into the reduce-scatter and all-gather pair that sharded training depends on. Every one of those methods is, at heart, an effort to keep the depth of the combine step logarithmic in the number of machines.

Library Shortcut: The Collective Library Already Picks a Low-Depth Tree

You never hand-build the reduction tree of Code 3.6.1 for a real cluster. When you call dist.all_reduce, the NCCL backend underneath inspects the interconnect topology and selects a tree, ring, or hybrid algorithm whose depth is logarithmic in the worker count, so the synchronization you saw collapse from $O(K)$ to $O(\log K)$ happens automatically:

# Run with: torchrun --nproc_per_node=8 thisfile.py
import torch, torch.distributed as dist

dist.init_process_group("nccl")            # NCCL chooses a low-depth all-reduce algorithm
grad = compute_local_gradient()            # this worker's gradient, a tensor

dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # O(log K) depth tree/ring under the hood
grad /= dist.get_world_size()                 # finish the mean
Code 3.6.2: A dozen lines of manual tree construction become one all_reduce call. The library's choice of a logarithmic-depth algorithm is exactly the $O(K) \to O(\log K)$ improvement the work-depth model predicts, applied to the gradient sum rather than a list of integers.
Practical Example: The All-Reduce That Stopped Scaling

Who: A systems engineer on a platform team running large-batch image-model training across many GPU nodes.

Situation: A custom gradient-aggregation routine worked fine on 8 GPUs but each doubling of the cluster made the per-step synchronization noticeably slower, not faster.

Problem: Profiling showed the aggregation time growing linearly with the worker count, swamping the compute the extra workers were supposed to accelerate.

Dilemma: Keep the simple aggregation and accept that the cluster could not grow past a few dozen nodes, or rework the combine step whose depth was the actual bottleneck.

Decision: They reworked the combine, because the work-depth reading was unambiguous: the routine summed gradients into a single accumulator in worker order, a chain of depth $O(K)$.

How: They replaced the hand-rolled accumulation with the framework's tree-based all_reduce, turning the depth-$O(K)$ chain into a depth-$O(\log K)$ reduction with no change to the math.

Result: Per-step synchronization time grew only logarithmically with the cluster, so scaling from 8 to 256 GPUs added roughly five levels of latency instead of a 32-fold increase, and large-cluster training became practical.

Lesson: When more machines make a step slower, suspect the depth of the combine, not the work. A chain that should have been a tree is the most common cause, and the fix changes the asymptotics, not just the constant.

Research Frontier: Lowering the Depth of Synchronization (2024 to 2026)

Because depth is the floor on how much parallelism helps, current systems research keeps pushing the depth of the combine step down or hiding it entirely. Topology-aware and hierarchical collectives in recent NCCL releases compose intra-node and inter-node trees so the effective depth tracks the network's own hierarchy rather than a flat $O(\log K)$. The communication-avoiding line of work, including local-update schemes such as DiLoCo (Douillard et al., 2024), lowers depth a different way: by taking several local steps between synchronizations, it cuts how often the logarithmic-depth all-reduce runs at all, trading a little statistical efficiency for far fewer dependency-chained collectives. A parallel thread on overlapping communication with the backward pass aims to slide the depth of the reduction entirely behind computation, so the critical path of a training step is set by the compute, not the combine. We return to these with the cost machinery to evaluate them in Section 3.8 and Chapter 10; the common thread is that the field treats the depth of synchronization as a quantity to be engineered down, exactly as the work-depth model says it must be.

Exercise 3.6.1: Read the Depth Off the Algorithm Conceptual

For each computation on $N$ inputs, state the work $W$, the depth $D$, and the available parallelism $W/D$ in big-O terms, and say whether it scales well: (a) a prefix sum (every running total $s_k = x_1 + \cdots + x_k$) computed by the naive left-to-right loop; (b) the same prefix sum computed by a Blelloch-style scan; (c) the dot product of two length-$N$ vectors; (d) Newton's method run for a fixed number of iterations where each iteration depends on the last. For the case that scales worst, explain in one sentence why no processor count rescues it.

Exercise 3.6.2: Brent's Bound Against a Measured Curve Coding

Extend Code 3.6.1 so that, for the balanced tree on $N = 10^6$, it sweeps $K$ over $\{1, 2, 4, 8, \ldots, 2^{20}\}$ and prints both the Brent upper bound $W/K + D$ and the ideal $W/K$. Plot or tabulate the ratio of the two and identify the $K$ at which the depth term $D$ first contributes more than half the bound. Confirm that this crossover sits near the available parallelism $W/D$, and explain why that is the processor count past which adding machines stops paying off.

Exercise 3.6.3: From Depth to a Worker-Count Decision Analysis

A data-parallel training step does $W$ units of gradient work and ends with a tree all-reduce of depth proportional to $\log_2 K$ over $K$ workers, where one all-reduce level costs $c$ time units. Write the per-step time as the compute term $W/K$ plus the synchronization term $c \log_2 K$, then find the $K$ that minimizes it by differentiating. Show that the optimal worker count grows like $W/c$ up to a logarithmic factor, and interpret what a larger per-level cost $c$ (a slower interconnect) does to the number of workers worth deploying. Relate your answer to the speedup ceiling $W/D$ from Section 3.

Work and depth give you the structural reading of scalability that Amdahl's law only summarized: depth is the serial fraction made concrete, and lowering it is what every scalable algorithm in this book is really doing. The next model trades this view of dependency structure for a view of the hardware itself, asking when a computation is limited by arithmetic throughput and when it is limited by the rate of moving data to and from memory. That is the roofline model, and it begins in Section 3.7.