Section 41.9: Final Report | Building Scalable AI

"They ran me on sixteen machines and got an answer. The report is where I get to say whether that answer was actually faster, actually correct, and actually worth the bill."
A Result, Learning to Tell Its Own Story

Big Picture

A capstone is not finished when the cluster prints a number; it is finished when a skeptical reader can follow your report from a stated problem to a measured scale-out result and find every claim backed by a co-computed number they could reproduce. The report is the artifact that survives the project. It must state which ceiling forced distribution, which axis you chose, what the single-machine baseline did, how the distributed implementation worked, and then deliver the one claim everything else serves: a speedup at a known efficiency with quality held constant and cost accounted for. This section gives you the report's structure, its evidence standard, the three figures it cannot omit, an honest treatment of where scaling broke, and a template you can fill in line by line. The discipline here is the same one Chapter 5 set for the whole book: one configuration, metrics co-computed in one pass, no unbacked speedup.

By this point in the chapter you have a problem worth distributing (Sections 41.1 through 41.3), a chosen axis and design, a working distributed implementation, and a body of measurements produced under the metric definitions of Section 41.6 and analyzed in Section 41.7. You have also assembled the reproducibility package of Section 41.8: the pinned environment, the launch scripts, the seeds, and the raw result files. What remains is to write the document that turns that work into a claim a reader can trust. A report is credible not because it reports good numbers but because it reports backed numbers, and because it is honest about the configurations where the numbers stopped being good. This section is about writing that document.

Figure 41.9.1: The report skeleton. Blocks 1 through 5 flow top to bottom; the evaluation block (4) feeds the single headline claim that the whole document exists to support, and block 6 makes that claim reproducible and situates it against the book's chapters and case studies. Every other sentence in the report is in service of the orange box.

1. The Report's Spine: Problem, Axis, Baseline, Result (Beginner)

A technical report on a distributed system has a fixed spine, and following it is not a stylistic preference; it is what lets a reader audit your claim. The spine has six load-bearing parts, shown in Figure 41.9.1. It opens with the problem and the justification for distributing it at all: name the resource ceiling that bound (data too large, model too large, throughput too low), exactly as the diagnosis discipline of Section 1.1 teaches, because a reader who is not convinced distribution was necessary will discount everything that follows. Then state the distribution axis you chose and the design that realized it: which of the six axes from Section 1.2, which collective from Chapter 4, which partition of the work.

Next comes the baseline, and it deserves a paragraph of its own because reports most often fail here. A speedup is a ratio, and a ratio with no denominator is not a measurement. The baseline is the single-machine (or smallest-configuration) reference that the distributed runs are compared against, measured on the same task, the same data, and the same quality target. Only after the baseline is pinned can you present the distributed implementation and then the evaluation: the headline result, its supporting table, and its three figures. The report closes with analysis of where scaling broke and a reproducibility section that cites the package from Section 41.8. Keep the spine in this order; a reader should never have to hunt backward for the denominator of a number you already quoted.

Key Insight: The Report Exists to Support One Backed Claim

Everything in a capstone report converges on a single sentence: this axis of distribution achieved a speedup of $S(p)$ at efficiency $E(p)$ with quality held constant, at a known cost. The problem statement justifies attempting it, the baseline gives the speedup a denominator, the implementation explains how, the evaluation measures it, and the limitations bound it. If a paragraph does not help a reader trust or reproduce that one sentence, it does not belong in the report. Strong reports are not longer than weak ones; they are more disciplined about what serves the claim.

2. The Evidence Standard: Every Number Backed, Co-Computed on One Config (Intermediate)

The evidence standard is the rule that separates a credible report from a hopeful one, and it is inherited directly from Section 5.5: every comparative number you report must be co-computed in one pass on one configuration. Concretely, the speedup, the efficiency, the quality delta, and the cost that appear in your headline must come from the same panel of runs, the same model, the same data split, and the same random seed. The trap the standard guards against is the invalid comparison: quoting a wall-clock from one run, a quality number from a different run that used a larger batch, and a cost estimate from a third run on a cheaper instance, then assembling them into a "result" that no single experiment ever produced.
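One way to make the one-configuration rule checkable is to compare the recorded settings of every run in the panel before any ratio is computed. The sketch below assumes each run record is a dict and that the only setting allowed to vary is the worker count; the field names (`model`, `data_split`, `seed`, `instance`), their values, and the helper `assert_one_config` are illustrative assumptions, not part of any tracker's API.

```python
def assert_one_config(runs, vary=("workers",), ignore=("wall_s", "quality")):
    """Raise ValueError if any setting other than `vary` differs across runs.

    `runs` is a list of dicts, one per run. Measured outputs listed in
    `ignore` are expected to differ between runs and are skipped.
    """
    skip = set(vary) | set(ignore)
    reference = {k: v for k, v in runs[0].items() if k not in skip}
    for run in runs[1:]:
        settings = {k: v for k, v in run.items() if k not in skip}
        if settings != reference:
            changed = sorted(k for k in set(settings) | set(reference)
                             if settings.get(k) != reference.get(k))
            raise ValueError(f"run with workers={run.get('workers')} differs in {changed}")
    return reference


# Hypothetical run records: field names and values are assumptions for illustration.
shared = {"model": "ranker-v1", "data_split": "2024-01", "seed": 0, "instance": "gpu-a"}
panel = [dict(shared, workers=p, wall_s=None, quality=None) for p in (1, 2, 4, 8)]
config = assert_one_config(panel)          # passes: only `workers` varies

# A sixteen-worker run on a different instance type breaks the rule.
mixed = panel + [dict(shared, workers=16, instance="gpu-b", wall_s=None, quality=None)]
try:
    assert_one_config(mixed)
    mixed_error = None
except ValueError as err:
    mixed_error = str(err)                 # names the field that broke the rule
```

Running a check like this before building the table turns the invalid comparison from a silent mistake into a visible error that names the offending field.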

Define the quantities precisely so the report can state them without ambiguity. For $p$ workers with single-worker baseline time $T(1)$ and parallel time $T(p)$, the speedup and parallel efficiency are

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}.$$

The quality-held-constant condition is a measured near-zero delta against a stated gate $\tau$, and the cost is the price ratio against the baseline run,

$$|\Delta Q| = \bigl|\, Q(p) - Q(1) \,\bigr| \le \tau, \qquad R_{\text{cost}} = \frac{C(p)}{C(1)} = \frac{p\,T(p)\,c}{T(1)\,c} = \frac{p\,T(p)}{T(1)} = \frac{1}{E(p)},$$

where $c$ is the per-worker hourly price (identical across runs) and the last identity makes a useful point explicit: when every worker costs the same per hour, the cost ratio is exactly the reciprocal of efficiency. A report that quotes $S(p)$ without also quoting $E(p)$ is hiding the bill. The headline claim is admissible only when $|\Delta Q| \le \tau$ holds for the same $p$ that produced $S(p)$; a speedup obtained at the price of a quality drop outside the gate is not a scale-out result, it is a different model. This is the rule the runnable demo in Section 6 enforces mechanically.
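The definitions above translate directly into a few lines of arithmetic. The following sketch uses the single-worker and eight-worker values that appear in the panel of Code 41.9.2, together with a hypothetical helper `scaleout_metrics`; it also checks numerically that the cost ratio equals $1/E(p)$ when the per-worker price is uniform.

```python
import math

def scaleout_metrics(p, t1, tp, q1, qp, price_per_worker_hour):
    """Return S(p), E(p), |dQ| and the cost ratio C(p)/C(1) for one panel row."""
    speedup = t1 / tp                                   # S(p) = T(1) / T(p)
    efficiency = speedup / p                            # E(p) = S(p) / p
    abs_dq = abs(qp - q1)                               # |Q(p) - Q(1)|
    cost_1 = 1 * t1 / 3600.0 * price_per_worker_hour    # C(1), one worker
    cost_p = p * tp / 3600.0 * price_per_worker_hour    # C(p), p workers
    return speedup, efficiency, abs_dq, cost_p / cost_1

s, e, abs_dq, r_cost = scaleout_metrics(
    p=8, t1=39600.0, tp=6300.0, q1=0.5824, qp=0.5822, price_per_worker_hour=2.50)

tau = 1.0e-3                                            # stated quality gate
admissible = abs_dq <= tau                              # headline allowed only if True
assert math.isclose(r_cost, 1.0 / e)                    # R_cost = 1 / E(p)
print(f"S(8)={s:.2f}  E(8)={e:.2f}  |dQ|={abs_dq:.4f}  "
      f"R_cost={r_cost:.2f}  admissible={admissible}")
```

Because the price cancels in the ratio, changing `price_per_worker_hour` changes the dollar figures but not `r_cost`, which is why quoting $E(p)$ is enough to reveal the bill.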

Thesis Thread: The Capstone Claim Is the Book's Claim, Measured

The whole book argues that scale-out, distributing the essential work across machines, can preserve correctness while breaking a single-machine ceiling. The capstone report is where you stop asserting that thesis and start backing it with your own numbers. The headline sentence, "$S(p)$ at $E(p)$ with quality held constant," is the thesis instantiated on your problem, your axis, your cluster. The exactness that Section 1.1 proved for the data-parallel gradient is what makes "quality held constant" a defensible claim rather than a hope: the math says the answer need not change, and your $\Delta Q$ near zero is the measurement that confirms it did not.

3. Figures That Communicate: Scaling Curve, Time Breakdown, Cost Curve (Intermediate)

Three figures carry the quantitative story, and a report that omits any of them leaves a question a reader will ask. The first is the scaling curve: $S(p)$ against $p$, plotted with the ideal line $S(p) = p$ for reference, so the gap between your curve and the diagonal shows efficiency loss at a glance. The second is the time breakdown: a stacked bar per configuration splitting wall-clock into computation, communication, and idle or straggler time, so a reader can see why efficiency fell as $p$ grew. The third is the cost curve: total dollars against $p$, which rises as efficiency falls and usually turns upward most sharply where the scaling curve flattens, making the economic stopping point visible. Figure 41.9.2 sketches the three side by side.

Figure 41.9.2: The three figures every scale-out report must contain. Left: the scaling curve bends below the ideal diagonal, and the gap is the efficiency you lost. Center: the time breakdown shows communication and idle time growing with $p$, explaining the gap. Right: the cost curve turns upward where efficiency collapses, marking the economic stopping point. The three together answer "how much faster," "why not faster still," and "at what price."

These three are the report-grade descendants of the diagnostic plots in Section 5.4 and the analysis you produced in Section 41.7. Two rules keep them honest. First, every figure must be co-computed from the same panel as the headline table, never assembled from runs at different settings. Second, label the axes with units and the configuration in the caption (workers, instance type, seed), so a figure can be audited without reading the surrounding text. The case studies in this part model exactly this: see the evaluation figures in Section 38.7 and the reporting in Section 36.7, which present speedup, breakdown, and cost as a co-computed set rather than three unrelated charts.
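To keep the figures tied to the headline table, the plotted series can be derived from the same panel rather than re-entered by hand. The sketch below assumes the (workers, wall-clock seconds) pairs and the per-worker-hour price of 2.50 from Code 41.9.2 and produces the points for the scaling curve and the cost curve, plus a caption string carrying the configuration. The helper `figure_series` and the instance name in the caption are hypothetical. The time-breakdown figure would need per-phase timers that this panel does not record, so it is left out here.

```python
def figure_series(panel, price_per_worker_hour):
    """Derive scaling-curve and cost-curve points from ONE panel of (p, T(p))."""
    t1 = dict(panel)[1]                                  # baseline wall-clock T(1)
    scaling = [(p, t1 / t, float(p)) for p, t in panel]  # (p, S(p), ideal S = p)
    cost = [(p, p * t / 3600.0 * price_per_worker_hour) for p, t in panel]
    return scaling, cost

panel = [(1, 39600.0), (2, 20520.0), (4, 10980.0), (8, 6300.0), (16, 4140.0)]
scaling, cost = figure_series(panel, price_per_worker_hour=2.50)

# The caption carries the configuration so the figure can be audited on its own.
# "gpu-a" is a placeholder instance name, not taken from the source.
caption = "workers: 1-16, instance: gpu-a, seed: 0; x = workers, y = speedup S(p)"

for (p, s, ideal), (_, dollars) in zip(scaling, cost):
    print(f"p={p:>2}  S(p)={s:5.2f}  ideal={ideal:5.1f}  "
          f"gap={ideal - s:5.2f}  cost_usd={dollars:6.2f}")
```

Feeding both plots from one function call is a simple way to honor the first rule above: neither series can drift to a different set of runs than the other.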

4. Honest Limitations: Where It Did Not Scale, and Why (Advanced)

A report that claims everything scaled perfectly is not impressive; it is suspicious. Every real scale-out result has a configuration where efficiency falls below a useful threshold, where communication overtakes computation, where a straggler dominates, or where the quality gate finally breaks. Reporting that boundary is not an admission of failure; it is the most useful thing in the document, because it tells the reader how far your result generalizes. State the limitation as a measured fact tied to a number from your own table: "efficiency falls to $E(16) = 0.60$ and the quality delta at sixteen workers exceeds the gate, so the result is reported only through eight workers." That sentence is stronger than any unbounded claim, because it is exactly the kind of statement a reader can reproduce and trust.
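The limitation sentence can be generated from the results table itself, so the boundary quoted in prose always matches a row the reader can see. A minimal sketch, assuming rows of (workers, efficiency, quality delta) taken from the panel of Code 41.9.2 and a hypothetical helper `limitation_sentence`:

```python
def limitation_sentence(rows, gate):
    """Describe the first row that fails the quality gate and the last row that holds it.

    `rows` is a list of (workers, efficiency, quality_delta), sorted by workers.
    Returns None when every row holds the gate (no measured boundary to report).
    """
    last_ok = None
    for p, eff, dq in rows:
        if abs(dq) > gate:
            if last_ok is None:
                return f"the quality gate already fails at {p} workers; no result is reported"
            return (f"efficiency falls to E({p}) = {eff:.2f} and the quality delta at "
                    f"{p} workers ({dq:+.4f}) exceeds the gate ({gate:.0e}), so the "
                    f"result is reported only through {last_ok} workers")
        last_ok = p
    return None

rows = [(1, 1.00, 0.0000), (2, 0.96, -0.0001), (4, 0.90, 0.0001),
        (8, 0.79, -0.0002), (16, 0.60, -0.0017)]
sentence = limitation_sentence(rows, gate=1.0e-3)
print(sentence)
```

If the panel is later extended or rerun, the sentence updates with it, which removes one common way for prose and table to disagree.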

Diagnose the cause with the cost models you already have. If efficiency falls while the time breakdown shows communication rising, the binding limit is the collective, and the $\alpha$-$\beta$ analysis of Chapter 3 predicts where it will dominate. If a single worker's bar is taller than the rest, you have a straggler, and the mitigation discussion belongs with Chapter 18. If the quality delta grows with $p$ because the effective batch size grew past what the optimizer tolerates, that is a statistical-efficiency limit, not a systems one, and naming it correctly protects the reader from drawing the wrong lesson. The honest report does not hide the configuration that broke; it explains which term in the cost model broke it.
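The three diagnoses above can be written as a small decision rule over quantities the report already contains. The sketch below is only an illustration: the helper `diagnose`, its thresholds (`comm_limit`, `straggler_ratio`), and the example inputs are all assumptions, and real thresholds would have to come from your own cost model and measurements.

```python
def diagnose(comm_fraction, worker_times, dq, gate,
             comm_limit=0.30, straggler_ratio=1.5):
    """Name the likely binding limit(s) for one configuration.

    comm_fraction : share of wall-clock spent in communication (0..1)
    worker_times  : per-worker busy time in seconds
    dq, gate      : measured quality delta and the stated gate
    Thresholds are illustrative defaults, not recommended values.
    """
    findings = []
    if abs(dq) > gate:
        findings.append("statistical efficiency: quality gate broken")
    median = sorted(worker_times)[len(worker_times) // 2]
    if max(worker_times) > straggler_ratio * median:
        findings.append("straggler: one worker far slower than the median")
    if comm_fraction > comm_limit:
        findings.append("communication: collective dominates wall-clock")
    return findings or ["no limit flagged at these thresholds"]

# Synthetic inputs chosen only to exercise each branch.
print(diagnose(0.10, [100, 101, 99, 102], dq=-0.0017, gate=1e-3))
print(diagnose(0.10, [100, 101, 99, 240], dq=0.0001, gate=1e-3))
print(diagnose(0.45, [100, 101, 99, 102], dq=0.0001, gate=1e-3))
```

The function returns a list because more than one limit can bind at once; the report should then say which one appears first as $p$ grows.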

Practical Example: The Sixteen-Worker Row That Made the Report Credible

Who: A graduate student presenting a capstone on data-parallel retraining of a ranking model.

Situation: The scaling panel ran from one to sixteen workers, and the sixteen-worker run was the fastest by wall-clock.

Problem: At sixteen workers the validation quality dropped past the stated gate, because the effective batch had grown too large for the fixed learning-rate schedule.

Dilemma: Headline the eye-catching $9.6\times$ wall-clock number at sixteen workers, or headline the eight-worker number where quality still held.

Decision: They headlined eight workers at $6.3\times$ with the quality delta inside the gate, and kept the sixteen-worker row in the table marked as failing the gate.

How: The report stated the rule up front ("speedup is reported only where $|\Delta Q| \le \tau$") and let the failing row stand as the measured boundary of the result.

Result: Reviewers trusted the eight-worker claim precisely because the report showed the configuration where it stopped working; the visible failure row was the credibility, not a blemish.

Lesson: The configuration where your result breaks is evidence of a careful measurement. Report it inside the same table; do not quietly truncate the panel.

5. Related Work: Positioning Against the Book and the Case Studies (Beginner)

A capstone is not invented in a vacuum, and a short related-work paragraph that situates it against the book earns trust by showing you know where your design sits. Position along three lines. First, the axis: name the chapter that owns the distribution axis you used, so a reader can see your design is a known technique applied, not a novelty asserted (data parallelism from Chapter 15, sharded parallelism from Chapter 16, distributed serving from Chapter 23, and so on). Second, the closest case study: each case study in this part is a full worked report, and naming the one nearest to your problem lets the reader calibrate scope and method.

Third, the difference: state in one sentence what your capstone does that the cited work does not, even if the difference is only "applied to a new dataset under a tighter quality gate." Modest, precise positioning is more credible than grand claims of originality. The federated medical case study in Section 37.7 and the recommendation case study in Section 38.7 are good models for how much related-work text a system report actually needs: a paragraph, not a survey. The reproducibility section that follows then cites your Section 41.8 package directly, so the reader can move from "this is where it sits in the literature" to "and here is how to run it" without leaving the document.

Library Shortcut: Let the Experiment Tracker Assemble the Table

Transcribing numbers from logs into a results table by hand is exactly how the one-config rule gets broken, and you do not have to do it. An experiment tracker such as Weights & Biases or MLflow, or simply a pandas frame over your run records, can join speedup, efficiency, quality, and cost on the run id, so that every cell of a row comes from the same run:

import pandas as pd

# one row per run; columns logged by the SAME run, so a row is internally consistent
runs = pd.read_parquet("capstone_runs.parquet")     # workers, wall_s, ndcg, price_h, seed
base = runs.loc[runs.workers == 1].iloc[0]

runs["speedup"]    = base.wall_s / runs.wall_s
runs["efficiency"] = runs.speedup / runs.workers
runs["quality_dq"] = runs.ndcg - base.ndcg
runs["cost_usd"]   = runs.workers * runs.wall_s / 3600 * runs.price_h
table = runs[["workers", "speedup", "efficiency", "quality_dq", "cost_usd"]]

Code 41.9.1: The whole results table derived from one run record. Because every metric is computed from columns logged by the same run, a row cannot mix configurations; the join key is the run, which is what the evidence standard of Section 2 demands. A hand-built table offers no such guarantee.

6. A Report Template and a Backed-Headline Generator (Intermediate)

The checklist in Table 41.9.1 is the template: work down it, and the report has its spine, its evidence standard, its three figures, and its honesty clause. Treat the right column as a gate; a row you cannot satisfy is a hole a reader will find.

Table 41.9.1: The capstone final-report checklist. Each section of the report maps to a required ingredient and the evidence that makes it credible.

Report section	Must contain	Evidence that backs it
Problem & justification	The ceiling that bound; the axis chosen	A measured single-machine limit (time, memory, or throughput)
Baseline	Single-machine reference on the same task	$T(1)$, $Q(1)$, $C(1)$ from one run
Distributed implementation	Design, collective, partition	Launch script and config in the Section 41.8 package
Evaluation	Headline claim + results table	$S(p)$, $E(p)$, $\Delta Q$, cost co-computed on one config
Figures	Scaling curve, time breakdown, cost curve	All three from the same panel as the table
Limitations	The configuration where scaling broke	The failing row, with the cost-model term that broke it
Related work	Axis chapter + nearest case study + the difference	Hyperlinks to the cited book sections
Reproducibility	How to rerun and get the same numbers	Pinned env, seeds, raw results (Section 41.8)

The script in Code 41.9.2 turns a measured panel into the report's headline line and results table. It computes every number from one config, applies the quality gate, and selects as the headline the fastest configuration that holds quality, so an out-of-gate row can never become the claim. This is the evidence standard of Section 2 made executable: there is no path through the script by which an unbacked speedup reaches the headline.

config = {"task": "ranking-model nightly retrain", "axis": "data parallelism",
          "seed": 0, "quality_metric": "validation NDCG@10",
          "quality_gate": 1.0e-3, "price_per_worker_hour": 2.50}

# ONE measured panel: (workers, wall_clock_seconds, validation_NDCG@10), same model/data/seed.
panel = [(1, 39600.0, 0.5824), (2, 20520.0, 0.5823), (4, 10980.0, 0.5825),
         (8, 6300.0, 0.5822), (16, 4140.0, 0.5807)]

base_p, base_t, base_q = panel[0]
rows = []
for p, t, q in panel:
    speedup = base_t / t                                    # S(p) = T(1)/T(p)
    efficiency = speedup / p                                # E(p) = S(p)/p
    quality_delta = q - base_q                              # measured, not assumed
    cost = (p * t / 3600.0) * config["price_per_worker_hour"]
    rows.append((p, t, speedup, efficiency, quality_delta, cost))

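base_cost = rows[0][5]
# Headline = fastest config whose quality delta stays inside the gate (baseline excluded).
eligible = [r for r in rows if abs(r[4]) <= config["quality_gate"] and r[0] > 1]
if not eligible:                                            # no backed claim, so no headline
    raise SystemExit("No distributed configuration holds the quality gate; nothing to headline.")
p, t, s, e, dq, cost = max(eligible, key=lambda r: r[2])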

print("HEADLINE")
print(f"  On {config['task']} via {config['axis']}, {p} workers deliver")
print(f"  S(p)={s:.1f}x speedup at E(p)={e:.2f} efficiency with "
      f"quality delta={dq:+.4f} {config['quality_metric']}")
print(f"  (gate {config['quality_gate']:.0e}, HELD), cost ratio "
      f"{cost/base_cost:.2f}x baseline. seed={config['seed']}.")
print("\nRESULTS TABLE (one config, co-computed)")
hdr = f"{'p':>3} {'wall_s':>8} {'S(p)':>6} {'E(p)':>6} {'dQ':>9} {'cost$':>9} {'gate':>5}"
print(hdr); print("-" * len(hdr))
for p, t, s, e, dq, cost in rows:
    held = "ok" if abs(dq) <= config["quality_gate"] else "FAIL"
    print(f"{p:>3} {t:>8.0f} {s:>5.1f}x {e:>6.2f} {dq:>+9.4f} {cost:>9.2f} {held:>5}")

HEADLINE
  On ranking-model nightly retrain via data parallelism, 8 workers deliver
  S(p)=6.3x speedup at E(p)=0.79 efficiency with quality delta=-0.0002 validation NDCG@10
  (gate 1e-03, HELD), cost ratio 1.27x baseline. seed=0.

RESULTS TABLE (one config, co-computed)
  p   wall_s   S(p)   E(p)        dQ     cost$  gate
----------------------------------------------------
  1    39600   1.0x   1.00   +0.0000     27.50    ok
  2    20520   1.9x   0.96   -0.0001     28.50    ok
  4    10980   3.6x   0.90   +0.0001     30.50    ok
  8     6300   6.3x   0.79   -0.0002     35.00    ok
 16     4140   9.6x   0.60   -0.0017     46.00  FAIL

Code 41.9.2: A backed-headline generator. The fastest raw run is sixteen workers at $9.6\times$, but its quality delta of $-0.0017$ exceeds the $10^{-3}$ gate, so the script refuses it as the headline and reports eight workers at $6.3\times$, $E(8)=0.79$, with quality held. The failing sixteen-worker row stays in the table as the measured boundary, exactly the honesty Section 4 requires. The cost ratio $1.27\times$ equals $1/E(8)$ rounded, the identity derived in Section 2.

Research Frontier: Reproducibility Standards for Systems Reports (2024 to 2026)

The discipline this section teaches is being formalized by the community you are writing for. The ML reproducibility checklists adopted by NeurIPS and the artifact-evaluation tracks at systems venues such as MLSys, OSDI, and SOSP increasingly expect reported speedups to ship with the launch scripts, environment manifests, and raw logs needed to recompute them, the same package Section 41.8 assembles. Recent work on benchmarking distributed training (for example the MLPerf Training reporting rules and the 2024 to 2025 push for cost-normalized leaderboards) makes the cost curve a first-class reported quantity rather than an afterthought, and tightens exactly the failure this section guards against: a speedup quoted without the efficiency, the quality, and the price that produced it. Writing your capstone to this standard is not gold-plating; it is the current bar for a credible systems result.

With the report written to this standard, the capstone is a complete, defensible artifact: a stated problem, a justified axis, a measured scale-out result with quality held constant, the three figures that explain it, an honest boundary, and a package that reproduces every number. The chapter's closing section steps back from any single project to the meta-question of how to evaluate a capstone and what makes one excellent rather than merely complete, which is where Section 41.10 takes us.

Exercise 41.9.1: Find the Unbacked Claim (Conceptual)

A draft report states: "Our distributed pipeline achieves a $12\times$ speedup with no loss in accuracy at a cost comparable to the baseline." Identify every place this sentence fails the evidence standard of Section 2. What denominator, what configuration, what gate, and what cost number must the report supply before this sentence is admissible? Rewrite the sentence so that each of its three claims (speed, quality, cost) is backed by a co-computed quantity, using the headline format from Code 41.9.2 as your model.

Exercise 41.9.2: Extend the Generator With the Figures' Data (Coding)

Extend Code 41.9.2 so each panel entry also carries a time breakdown $(t_{\text{comp}}, t_{\text{comm}}, t_{\text{idle}})$ summing to the wall-clock. Have the script (a) verify that the three parts sum to $T(p)$ for every row and refuse to emit a headline if any row violates this, and (b) print, for the headline configuration, the fraction of wall-clock spent in communication. Then state how that fraction relates to the efficiency $E(p)$ you already report, and which figure from Section 3 it feeds.

Exercise 41.9.3: Locate the Economic Stopping Point (Analysis)

Using the table from Code 41.9.2, the cost ratio at each $p$ is $1/E(p)$. Define the "economic stopping point" as the largest $p$ at which marginal speedup per added dollar is still positive, and the "quality stopping point" as the largest $p$ at which $|\Delta Q| \le \tau$. For this panel, compute both. Which one binds first, and what does that tell you about whether your capstone is limited by communication cost or by statistical efficiency? Tie your answer to the cost-model terms of Chapter 3.