"They gave me sixteen nodes and asked why it was slower than eight. I pointed at the one stage that ate the afternoon, and the room went quiet."
A Profiler, Pointing at the One Stage That Ate the Afternoon
A capstone is not finished when the distributed system runs; it is finished when you can explain, with numbers, why it runs the way it does and whether the machines it consumes are worth what they buy. Throwing more nodes at a problem produces a speedup curve that bends, flattens, and eventually turns back down, and the dollar cost climbs the whole time. The work of this section is to turn raw per-stage timings into an argument: measure where the wall-clock goes, attribute the loss to a named cause (communication, load imbalance, the serial fraction Amdahl's law forbids you to parallelize), and then locate the operating point where one more node stops paying for itself. The deliverable is not a single big number but a defensible choice of scale, reported in full, including the scales at which scaling broke.
Section 41.6 told you which metrics to collect: wall-clock per run, throughput, speedup, parallel efficiency, and dollar cost. This section is about reading those metrics as evidence. A distributed AI system that achieves a 3.9-fold speedup on eight nodes has not earned the number until you can say where the missing 4.1-fold went, and a system that costs two dollars per run has not earned a deployment decision until you can say what a one-dollar run would have cost you in latency. Performance analysis is the discipline of attributing the gap between the speedup you hoped for and the speedup you measured, and cost analysis is the discipline of pricing that gap. Both rest on the performance models of Chapter 3 and the communication-cost accounting of Chapter 4; here we put them to work on a system you built.
1. Where the Time Actually Goes Beginner
The first question of any performance analysis is not "how fast is it?" but "what is it doing while the clock runs?" A single-node program spends its wall-clock in compute and in I/O, and a profiler that attributes time to functions is enough to find the slow part. A distributed program adds two categories that do not exist on one machine, and they are usually where the surprises hide. The wall-clock of a run on $p$ nodes decomposes, to a first approximation, into time spent computing, time spent moving data between machines, and time spent waiting,
$$T_p = T_{\text{comp}} + T_{\text{comm}} + T_{\text{idle}},$$where $T_{\text{idle}}$ collects every form of doing nothing useful: a fast worker blocked at a barrier for a slow one (the straggler tax of Chapter 3), a node stalled on a disk or object-store read (the I/O of Chapter 8), or a pipeline bubble while one stage waits for another. The decomposition matters because each term has a different remedy. Compute-bound time wants a faster kernel or a bigger batch; communication-bound time wants a better collective schedule or gradient compression; idle time wants load balancing or overlap. Attributing your wall-clock to the wrong term sends you optimizing the part that was never the bottleneck.
Figure 41.7.1 shows why the average is a misleading summary. Every worker finishes its own compute at a different moment, but the synchronous all-reduce of Chapter 15 forces them to a common barrier, so the step takes as long as the slowest worker plus the collective. The grey idle bands are pure waste: capacity you rented and did not use. A profiler is the instrument that exposes those bands. PyTorch's profiler, Nsight Systems, and Ray's timeline view all record per-stage spans you can sum into the three buckets of the decomposition; the library shortcut below shows the smallest version that already separates compute from communication.
A speedup that falls short of linear is a quantity of lost time, and that time has a single dominant owner: communication, idle waiting, or an unparallelizable serial fraction. The entire value of a performance analysis is naming that owner before you touch the code. Optimizing communication on an idle-bound system, or balancing shards on a communication-bound one, spends your effort where the time was not. Measure the breakdown first; the breakdown chooses the fix.
You do not instrument the three buckets by hand. The PyTorch profiler records every CUDA and NCCL kernel as a span and tags collective operations, so a few lines separate compute from communication for a real training step and export a trace you can open in a timeline viewer:
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
schedule=schedule(wait=1, warmup=1, active=3), # skip startup, profile 3 steps
record_shapes=True,
) as prof:
for step in range(5):
train_one_step() # forward + backward + DDP all-reduce
prof.step()
# Communication shows up as nccl:all_reduce / nccl:all_gather rows; compute as
# the matmul/attention kernels. Sort by total CUDA time to find the dominant bucket.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=12))
prof.export_chrome_trace("trace.json") # open in chrome://tracing or Perfetto
2. Reading the Scaling Curve and Finding the Knee Intermediate
Once the per-node breakdown is in hand, the across-node view is the scaling curve: speedup $S_p = T_1 / T_p$ plotted against node count $p$, and its companion, parallel efficiency $E_p = S_p / p$. Ideal scaling is the line $S_p = p$, which no real system reaches. What you actually measure is a curve that tracks the ideal line at small $p$, peels away from it as communication and idle time grow, flattens at a maximum, and on many systems bends back down once added nodes cost more in coordination than they contribute in compute. The shape is not noise; it is the signature of the three terms in the decomposition fighting the one term that shrinks. Compute time falls roughly as $T_{\text{comp}}(p) \approx T_{\text{comp}}(1)/p$, but communication and idle time grow with $p$, so the total has a minimum.
The most decision-relevant feature of the curve is its knee: the node count past which efficiency drops sharply, where each new node buys far less than the one before it. Amdahl's law, from Chapter 3, gives the ceiling that creates the knee. If a fraction $f$ of the work is irreducibly serial, the best speedup any number of nodes can reach is
$$S_p = \frac{1}{f + \dfrac{1 - f}{p}}, \qquad \lim_{p \to \infty} S_p = \frac{1}{f},$$so a system with even five percent serial work cannot exceed a twenty-fold speedup no matter how many nodes you rent. The knee is where you collide with that asymptote, or, on a synchronous system, where communication and straggler idle time overtake the shrinking compute. Attributing the knee is the heart of the analysis: read it off the curve, then look back at the per-node breakdown of Figure 41.7.1 to say whether it is Amdahl serial fraction, a communication wall, or load imbalance that bent the line. Honest reporting names the cause, not just the node count.
Who: A graduate student presenting a distributed image-segmentation capstone.
Situation: The training pipeline scaled cleanly from one to eight nodes, and the student assumed sixteen would be faster still and booked the larger allocation for the final run.
Problem: At sixteen nodes the wall-clock per epoch went up, not down, and the efficiency reading fell below a quarter.
Dilemma: Report the embarrassing sixteen-node number and look like the scaling failed, or quietly present only the clean eight-node result and hope no reviewer asked about larger scales.
Decision: They reported the full curve, knee included, and used the per-node breakdown to attribute the regression.
How: The profiler trace showed the all-reduce time per step had grown to exceed the now-tiny per-node compute, and the smallest shards finished so fast that idle-at-barrier time dominated; doubling nodes had doubled communication while halving an already-small compute term.
Result: The analysis turned a failed run into the strongest slide in the talk: a measured demonstration of the communication wall, with eight nodes identified as the efficient operating point and a clear, quantified reason not to go further.
Lesson: The scale at which scaling breaks is a result, not a failure to hide. A curve that bends down, explained, is more convincing than a curve that stops conveniently at its best point.
3. Pricing the Speedup: Cost and Diminishing Returns Intermediate
Speedup is bought with money, and the conversion is direct. If a run on $p$ nodes takes wall-clock $T_p$ at a price of $c$ dollars per node-hour, the dollar cost of that run is
$$\text{Cost}(p) = c \cdot p \cdot \frac{T_p}{3600}.$$Because $T_p$ falls more slowly than $1/p$ once you pass the knee, the product $p \cdot T_p$, and therefore the cost, rises with scale even while the run gets faster. You are paying a strictly increasing amount for a speedup that grows sublinearly and eventually not at all. The decision-grade quantity is therefore not speedup alone but marginal speedup per dollar: how much additional speedup the next increment of nodes buys for its additional cost,
$$\frac{\Delta S}{\Delta \text{Cost}} = \frac{S_{p_2} - S_{p_1}}{\text{Cost}(p_2) - \text{Cost}(p_1)}.$$This ratio collapses toward zero, and can go negative, exactly at the knee. A capstone that reports it has converted raw timings into an economic argument: there is a scale below which more nodes are an obvious bargain, a scale above which they are an obvious waste, and a narrow band in between where the right answer depends on whether you are buying speed or buying frugality. The cost-optimal node count $p^{\star}$ minimizes $\text{Cost}(p)$; the latency-optimal count maximizes $S_p$; and these are rarely the same number, which is the whole tension a deployment decision must resolve.
$$p^{\star} = \arg\min_{p} \; c \cdot p \cdot \frac{T_p}{3600}.$$The runnable demo below carries out exactly this analysis end to end. It takes a measured time breakdown per scale and a node price, computes the wall-clock, speedup, efficiency, and dollar cost at each scale, then reports the marginal speedup per dollar across the ladder and names the cost-optimal node count.
import numpy as np
# A measured time breakdown per scale: for each node count p, the wall-clock
# seconds spent in compute, communication, and idle/straggler waiting for ONE run.
# These come from a profiler (Ch 3/4 cost model in practice). T_p = comp + comm + idle.
breakdown = {
1: {"comp": 800.0, "comm": 0.0, "idle": 0.0},
2: {"comp": 400.0, "comm": 18.0, "idle": 12.0},
4: {"comp": 200.0, "comm": 34.0, "idle": 26.0},
8: {"comp": 100.0, "comm": 58.0, "idle": 48.0},
16: {"comp": 50.0, "comm": 95.0, "idle": 90.0},
}
price_per_node_hour = 3.20 # on-demand dollars per node-hour
T1 = sum(breakdown[1].values()) # single-node wall-clock, the speedup baseline
rows = []
for p, b in breakdown.items():
Tp = b["comp"] + b["comm"] + b["idle"]
speedup = T1 / Tp
efficiency = speedup / p
cost = price_per_node_hour * p * (Tp / 3600.0) # $ = price * nodes * hours
rows.append({"p": p, "Tp": Tp, "speedup": speedup, "eff": efficiency, "cost": cost})
print(f"baseline T1 = {T1:.1f} s price = ${price_per_node_hour:.2f}/node-hour\n")
hdr = f"{'p':>3} {'T_p(s)':>8} {'speedup':>8} {'eff':>6} {'cost($)':>8} {'comm%':>6} {'idle%':>6}"
print(hdr)
print("-" * len(hdr))
for r, (p, b) in zip(rows, breakdown.items()):
commp = 100.0 * b["comm"] / r["Tp"]
idlep = 100.0 * b["idle"] / r["Tp"]
print(f"{r['p']:>3} {r['Tp']:>8.1f} {r['speedup']:>8.2f} {r['eff']:>6.2f} "
f"{r['cost']:>8.3f} {commp:>6.1f} {idlep:>6.1f}")
# Marginal speedup-per-dollar: how much extra speedup each extra dollar buys as we
# step up the scale ladder. Diminishing returns show up as this number collapsing.
print("\nmarginal analysis (each step up the ladder):")
print(f"{'p_lo->p_hi':>12} {'d_speedup':>10} {'d_cost($)':>10} {'speedup/$':>10}")
for lo, hi in zip(rows[:-1], rows[1:]):
d_speed = hi["speedup"] - lo["speedup"]
d_cost = hi["cost"] - lo["cost"]
ratio = d_speed / d_cost if d_cost > 0 else float("inf")
print(f"{str(lo['p'])+'->'+str(hi['p']):>12} {d_speed:>10.2f} "
f"{d_cost:>10.3f} {ratio:>10.2f}")
# Cost-optimal node count: the scale with the lowest dollar cost per run.
best = min(rows, key=lambda r: r["cost"])
fastest = max(rows, key=lambda r: r["speedup"])
print(f"\ncost-optimal scale : p = {best['p']} (${best['cost']:.3f}/run, "
f"speedup {best['speedup']:.2f}x)")
print(f"fastest scale : p = {fastest['p']} (${fastest['cost']:.3f}/run, "
f"speedup {fastest['speedup']:.2f}x)")
print(f"paying for speed : {fastest['cost']/best['cost']:.2f}x the cost "
f"to go {fastest['speedup']/best['speedup']:.2f}x faster")
baseline T1 = 800.0 s price = $3.20/node-hour
p T_p(s) speedup eff cost($) comm% idle%
---------------------------------------------------
1 800.0 1.00 1.00 0.711 0.0 0.0
2 430.0 1.86 0.93 0.764 4.2 2.8
4 260.0 3.08 0.77 0.924 13.1 10.0
8 206.0 3.88 0.49 1.465 28.2 23.3
16 235.0 3.40 0.21 3.342 40.4 38.3
marginal analysis (each step up the ladder):
p_lo->p_hi d_speedup d_cost($) speedup/$
1->2 0.86 0.053 16.13
2->4 1.22 0.160 7.60
4->8 0.81 0.540 1.49
8->16 -0.48 1.877 -0.26
cost-optimal scale : p = 1 ($0.711/run, speedup 1.00x)
fastest scale : p = 8 ($1.465/run, speedup 3.88x)
paying for speed : 2.06x the cost to go 3.88x faster
Because cost rises with $p$ while speedup saturates, the node count that minimizes dollars per run sits well to the left of the node count that minimizes wall-clock. Reporting only the fastest configuration hides this entirely. A complete capstone analysis states both endpoints and the exchange rate between them: in Output 41.7.2, twice the money buys you under four times the speed, and beyond eight nodes it buys negative speed. That exchange rate, not a single speedup figure, is the deliverable.
4. Total Cost of Ownership: Cloud, Local, and Spot Advanced
The per-run cost of Code 41.7.2 is the right unit for a single job, but a deployed system runs many jobs over months, and the deployment decision turns on total cost of ownership rather than the price of one run. The cluster-economics treatment of Chapter 33 draws the three regimes a capstone should weigh. On-demand cloud charges the full $c$ per node-hour with zero commitment, which is correct for bursty or one-off work and is exactly what Code 41.7.2 prices. Owned local hardware front-loads a capital cost and then runs near the marginal cost of electricity, so it wins only above a utilization threshold: a cluster busy enough that the amortized purchase price per node-hour drops below the cloud rate. Spot and preemptible instances cut the hourly price by often two-thirds to four-fifths in exchange for the risk of preemption, which is acceptable precisely when the elastic and checkpointing machinery of Chapter 18 lets a job survive losing a node mid-run.
A capstone cost analysis is honest only when it states which regime its numbers assume and re-prices the cost-optimal scale under the alternatives. The same eight-node run that costs one dollar forty-seven on-demand might cost forty cents on spot, shifting $p^{\star}$ outward because cheaper nodes change where the marginal-dollar argument crosses zero, or it might cost effectively nothing on an already-owned cluster you are sharing, in which case wall-clock alone decides. The discipline is to make the assumption explicit, because a speedup curve priced under the wrong regime recommends the wrong scale. The breakdown of Figure 41.7.1 is invariant across regimes; the dollar figures are not.
The whole book leads with scale-out over scale-up, and this section is where that thesis is finally tested with your own measurements. The capstone asked you to pick and defend a distribution axis; cost and performance analysis is how you defend it. A scaling curve with its knee attributed to communication, a marginal-speedup-per-dollar table that locates the efficient operating point, and a total-cost-of-ownership statement under the regime you will actually deploy in, together turn "we distributed the work" into "we distributed the work, and here is the operating point at which that choice pays, and here is where it stops paying." That argument, not the raw speedup, is what makes the capstone a defense rather than a demonstration.
5. An Analysis Template: Measure, Attribute, Decide Beginner
The preceding sections compose into a fixed procedure you can apply to any distributed AI system, and reusing the same three steps keeps an analysis honest and comparable. Table 41.7.1 lays out the template: a measure phase that produces the per-node breakdown and the scaling curve, an attribute phase that names the cause of every gap from ideal, and a decide phase that converts the attributed curve into a recommended operating point with its cost stated under a named regime. The output of this template is what Section 41.9 folds into the capstone report.
| Phase | What you do | Artifact it produces |
|---|---|---|
| 1. Measure | Profile per stage at several scales; record compute, communication, idle, and I/O time per node, plus wall-clock per run. | Per-node breakdown (Figure 41.7.1) and the $S_p$, $E_p$ scaling curve. |
| 2. Attribute | Find the knee; decompose the gap from ideal into Amdahl serial fraction, communication wall, and load imbalance using the breakdown. | A named cause for every lost fold of speedup, and the scale at which scaling broke. |
| 3. Decide | Price each scale; compute marginal speedup per dollar; pick $p^{\star}$; re-price under cloud, spot, and local regimes. | A recommended operating point with its cost and its exchange rate between speed and dollars. |
The template's discipline is that the decide phase may never run ahead of the attribute phase. A recommended node count without a named reason for the curve's shape is a guess dressed as a result, and a reviewer will ask the question the analysis skipped. Run the three phases in order, report the breakdown that motivated each conclusion, and the recommendation defends itself. Section 41.6 supplied the metrics that feed phase one; Section 41.9 consumes the phase-three artifact into the written report.
Measuring a scaling curve means renting every scale on the curve, which is itself expensive. A active research line learns to predict the curve from a few cheap measurements. Performance models in the lineage of Calculon and the analytical cost models behind Megatron-style 3D-parallelism planners (from the trade-off accounting of Chapter 16) estimate compute and communication time per configuration from hardware specs and model shape, so the knee can be located on paper before any large run. In parallel, the carbon-and-cost accounting community has pushed beyond dollars to joules and grams of CO2 per run, with tools such as CodeCarbon and the energy-aware schedulers studied alongside the cluster work of Chapter 33, so a 2026 capstone is increasingly expected to report an environmental cost beside the financial one. The frontier is an analysis that recommends the operating point without first buying every point on the curve.
The grey idle bands in Figure 41.7.1 are the most expensive empty space in computing: you rented the node, the node is on, the node is doing nothing, and the meter is running. An engineer once described a poorly balanced cluster as "a very fast way to pay for waiting," which is the entire content of this section compressed into six words.
Three capstone systems all show a knee at eight nodes. System A's per-node breakdown shows communication time growing linearly with $p$ while idle stays near zero; System B shows idle time dominating because one shard is twice the size of the others; System C shows a fixed serial preprocessing stage that never shrinks. For each, name which mechanism from Section 2 (communication wall, load imbalance, Amdahl serial fraction) created the knee, and state the single change that would move the knee rightward. Explain why applying System A's fix to System C would accomplish nothing.
Extend Code 41.7.2 to compute the cost of each scale under three regimes: on-demand at the given price, spot at thirty percent of that price, and an owned local cluster whose amortized price is sixty percent of the on-demand rate but only when utilization exceeds a threshold you fix. Report the cost-optimal node count $p^{\star}$ under each regime. Confirm with your numbers that cheaper per-node pricing can move $p^{\star}$ outward, and explain in one sentence why, referring to where the marginal-dollar argument crosses zero.
Using Output 41.7.2, you must recommend a single node count for two different deployments: a nightly batch job with a loose deadline where dollars dominate, and an interactive service with a hard latency budget where speed dominates. State the node count you would recommend for each, justify it from the marginal-speedup-per-dollar column and the speedup column respectively, and quantify exactly what each deployment gives up by not choosing the other's operating point. Then state, in one sentence, what additional measurement would change your recommendation for the latency-bound case.