Part VIII: Case Studies and Capstone Projects
Chapter 41: Capstone Project Design

Reproducibility Package

"I am a beautiful speedup. I exist only in a screenshot, on a cluster nobody can find, behind a seed nobody wrote down."

A Result, Hoping Someone Can Run It Twice and Believe It
Big Picture

A distributed-AI result is not the number you printed; it is the number plus everything required to obtain it again, and in a distributed system that "everything" includes the cluster itself. A single-machine experiment is reproducible when you pin the code, the environment, the data, and the seed. A scale-out experiment adds a fifth dependency that single-machine work never has to declare: the hardware. Your speedup, your throughput, your tokens per second, and even your final loss can shift when the node type, the worker count, or the interconnect changes, because nondeterministic reductions and asynchronous communication are baked into how the result was computed. The reproducibility package is the artifact that makes your capstone's headline numbers (the ones earned in Sections 41.6 and 41.7) defensible: a reader clones it, runs one command on the stated cluster, and recovers your numbers within their reported variance. This section specifies what that package contains and why each part is load-bearing.

By this point in the capstone you have a system that works and numbers that say how well it works. The remaining task is to make those numbers survive contact with a skeptical reader, which is the same standard a credible distributed-systems result must meet anywhere: a published speedup that cannot be reproduced is an anecdote, not a measurement. Reproducibility is not a courtesy you extend to others; it is the property that converts your private observation into a public claim. The reproducibility package is the deliverable that carries that property. It travels with the capstone report (Section 41.9 cites it directly), and it is the difference between "our distributed pipeline reached a 6.2x speedup" and "our distributed pipeline reached a 6.2x speedup, and here is the directory that proves it on the cluster we ran it on."

[Figure 41.8.1 diagram: The reproducibility package (six inputs). 1. Code: pinned commit, no uncommitted diff. 2. Pinned environment: container image + lockfile. 3. Config + experiment tracking: every run's config and metrics logged. 4. Versioned data: dataset hash / snapshot id. 5. Cluster spec: node type, count, interconnect. 6. README: one command reproduces the headline numbers. The inputs feed "make reproduce" (a deterministic harness), which yields a verifiable result: baseline vs distributed, co-computed in one pass, with the reproduced number matching the reported mean within std.]
Figure 41.8.1: The reproducibility package as a pipeline. Six inputs (code at a pinned commit, a pinned environment, config plus experiment tracking, versioned data, the cluster spec, and a README whose one command runs a deterministic harness) feed a single reproduced result. The result is credible only when the recovered number matches the reported headline number from Sections 41.6 and 41.7 within its stated variance. In a distributed setting the cluster spec is an input, not an afterthought, because the same code on a different cluster can produce a different number.

1. Why Reproducibility Is Harder for a Distributed Result (Intermediate)

The reproducibility checklist that suffices for a notebook on a laptop is necessary but not sufficient for a scale-out system. On one machine, fixing the code, the library versions, the data, and the random seed pins the computation completely; rerun it and you get the same bytes. A distributed run breaks that guarantee in two distinct ways, both of which the capstone must declare openly rather than hide. The first is the cluster: throughput, latency, and speedup are properties of the hardware as much as the code, so the same program that scales to 6x on eight nodes with a fast interconnect may scale to 3x on eight nodes connected by ordinary Ethernet. A speedup reported without the cluster it was measured on is unfalsifiable, because no reader can set up the conditions under which it could be checked.
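To make the interconnect dependence concrete, here is a minimal sketch of a toy speedup model. It is an illustrative assumption, not the cost model of Chapter 3: time on K nodes is taken to be the single-node time divided by K plus a communication term equal to the bytes each node exchanges divided by the link bandwidth. The function name `modeled_speedup` and every number below are hypothetical.

```python
def modeled_speedup(t_single_s, nodes, bytes_per_node, bandwidth_bytes_per_s):
    """Toy speedup: perfectly parallel compute plus a bandwidth-bound exchange.

    Assumes compute divides evenly across nodes and that communication does
    not overlap with compute. Both are simplifications for illustration.
    """
    compute_s = t_single_s / nodes
    comm_s = bytes_per_node / bandwidth_bytes_per_s
    return t_single_s / (compute_s + comm_s)


# Hypothetical job: 100 s on one node, 40 GB exchanged per node.
T_SINGLE_S = 100.0
BYTES_PER_NODE = 40e9

fast = modeled_speedup(T_SINGLE_S, 8, BYTES_PER_NODE, 12.5e9)  # ~100 Gb/s link
slow = modeled_speedup(T_SINGLE_S, 8, BYTES_PER_NODE, 1.25e9)  # ~10 Gb/s link

print(f"8 nodes, 100 Gb/s link: {fast:.2f}x")
print(f"8 nodes,  10 Gb/s link: {slow:.2f}x")
```

Under these made-up inputs the same code on the same node count lands near 6.4x on the fast link and near 2.2x on the slow one, which is why a speedup quoted without its interconnect cannot be checked.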

The second is nondeterminism in the math itself. Floating-point addition is not associative, so when an all-reduce sums one partial gradient per worker, the order in which the network happens to combine them can change the last bits of the result; across many steps those last bits diverge into visibly different trajectories. Asynchronous updates (Chapter 10) make this worse, because the order in which stale gradients arrive is a function of timing that no seed controls. Elastic and fault-tolerant training (Chapter 18) adds another source: a job that loses and regains workers mid-run takes a different path than one that does not. A distributed result is therefore reproducible in distribution, not byte for byte, and the package must say which it is claiming.
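The non-associativity claim can be seen in a few lines. The sketch below sums the same three values in every possible order; the values are chosen (as an assumption for illustration, not as realistic gradients) so that the rounding is large enough to see in float64. The helper name `reduce_in_order` is hypothetical.

```python
from itertools import permutations

# Near 1e16 the spacing between adjacent float64 values is 2, so adding 1.0
# to 1e16 is rounded away. Whether the 1.0 survives depends on the order.
partials = [1e16, 1.0, -1e16]


def reduce_in_order(values):
    """Sum left to right, the way a chain of pairwise reductions would."""
    total = 0.0
    for v in values:
        total += v
    return total


results = {order: reduce_in_order(order) for order in permutations(partials)}
for order, total in results.items():
    print(order, "->", total)

distinct = sorted(set(results.values()))
print("distinct sums:", distinct)
```

The six orderings produce two different sums from identical inputs. Real gradients differ in the last bits rather than by a whole unit, but the mechanism is the same, and repeated steps can amplify it.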

Key Insight: In a Distributed System, the Cluster Spec Is Part of the Result

Single-machine reproducibility pins code, environment, data, and seed. Scale-out reproducibility pins a fifth thing the laptop never had to: the hardware. Node type, node count, and interconnect determine throughput, latency, and speedup directly, and nondeterministic reductions plus asynchronous communication mean two runs on different clusters can differ even when code, data, and seed are identical. Report the cluster spec with the same prominence as the headline number, because without it the number cannot be reproduced or refuted. A speedup is a measurement on a machine, not a property of the code alone.

2. Pinned Environment: Containers and Lockfiles (Beginner)

The first input is an environment so completely specified that a stranger can rebuild it. "Install the requirements" is not a specification, because numpy without a version is a moving target and a CUDA or driver mismatch can silently change numerics. Two artifacts pin it. A lockfile records the exact resolved version of every direct and transitive dependency, so the dependency graph is frozen rather than re-solved. A container image goes further and freezes the user-space stack: the operating-system userland, the system libraries, the CUDA toolkit and NCCL versions, and the Python interpreter, so everything inside the image is identical regardless of the host. What the image cannot freeze is the host kernel and the GPU driver, which a container borrows from the machine it runs on; those versions belong in the cluster spec (Section 5). For a distributed-AI capstone you want both artifacts: the lockfile is human-readable and reviewable, and the image is what actually runs on every node so that all workers are provably running the same software. Pinning the container by digest, not by a mutable tag like :latest, is what makes "the same image" a claim a reader can verify.

Library Shortcut: A Dockerfile and a Lockfile That Pin Everything

You do not need a bespoke build system. A short Dockerfile pinned to a base image digest plus a generated lockfile freezes the entire stack, and the same image is launched on every node so the cluster is software-homogeneous by construction:

# Pin the base by DIGEST, never by a mutable tag like :latest.
FROM nvcr.io/nvidia/pytorch:24.10-py3@sha256:8f1c...e2a9
WORKDIR /capstone
COPY requirements.lock .
RUN pip install --no-deps --require-hashes -r requirements.lock
COPY . .
# One command reproduces the headline numbers (see Section 7).
ENTRYPOINT ["bash", "reproduce.sh"]
Code 41.8.1: A Dockerfile that pins the base image by digest and installs from a hash-checked lockfile. --require-hashes makes pip refuse any package whose contents do not match the recorded hash, so a tampered or re-released wheel cannot slip in. The requirements.lock is generated once with a resolver such as pip-compile --generate-hashes or uv pip compile --generate-hashes and committed alongside the code.

The reduction is the point: roughly ten lines replace a paragraph of "first install CUDA 12.x, then the matching PyTorch, then ..." that no reader will execute correctly. The image fixes the library paths, the CUDA toolkit, and the interpreter that an ad-hoc setup gets wrong on the second machine; the host still has to supply a GPU driver compatible with that toolkit, which is one more reason the driver version is recorded with the cluster.
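A package can also check its own pins before anything runs. The following is a minimal sketch, assuming a pip-style lockfile in which each requirement uses `==` and carries at least one `--hash=sha256:` option. The function names, the registry name, and the lockfile excerpt are hypothetical, and the hash shown is a placeholder rather than a real digest.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def image_is_digest_pinned(image_ref):
    """True when the image reference names an immutable digest, not a tag."""
    return DIGEST_RE.search(image_ref) is not None


def unpinned_requirements(lock_text):
    """Return requirement entries that lack an exact version or a hash."""
    # Join backslash-continued lines so each requirement is one logical entry.
    logical = lock_text.replace("\\\n", " ")
    problems = []
    for line in logical.splitlines():
        entry = line.strip()
        # Skip blanks, comments, and global options such as --index-url.
        if not entry or entry.startswith(("#", "-")):
            continue
        name = entry.split()[0]
        if "==" not in name or "--hash=sha256:" not in entry:
            problems.append(name)
    return problems


# Hypothetical lockfile excerpt; the hash is a placeholder, not a real digest.
SAMPLE_LOCK = """\
# generated by a resolver with hashes enabled
numpy==1.26.4 \\
    --hash=sha256:aaaa
requests>=2.0
"""

print(image_is_digest_pinned("registry.example/capstone@sha256:" + "0" * 64))
print(image_is_digest_pinned("registry.example/capstone:latest"))
print(unpinned_requirements(SAMPLE_LOCK))
```

Running a check like this at the top of the reproduce script turns "we pinned everything" from an assertion into something the package verifies for itself.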

3. Seeds and the Limits of Determinism (Intermediate)

A seed pins the pseudo-random number stream: weight initialization, data shuffling, dropout masks, and augmentation all become a deterministic function of one integer. Recording the seed for every run is mandatory, and on a single device a fixed seed plus deterministic library flags (for example, deterministic algorithm selection and disabled autotuning) usually buys you bit-for-bit reproducibility. The honest scale-out story is that the seed controls less than it does on a laptop. The seed governs the random draws; it does not govern the order in which an all-reduce combines floating-point partials across workers, nor the arrival order of asynchronous updates, nor which worker a fault evicted. These are the residual nondeterminism that no seed reaches, and a credible package names them rather than pretending the run is bit-exact when it is not.

The right response is not to claim false determinism but to budget the nondeterminism and report through it. If a run is exactly reproducible, you report one number. If it is reproducible only in distribution, you run it under several seeds and report the headline number as a mean with a standard deviation, so the reader knows the spread your reproduction should fall within. For $R$ runs producing metric values $m_1, \dots, m_R$, report

$$\bar{m} = \frac{1}{R} \sum_{r=1}^{R} m_r, \qquad s = \sqrt{\frac{1}{R-1} \sum_{r=1}^{R} (m_r - \bar{m})^2},$$

and present the result as $\bar{m} \pm s$. A reproduction succeeds when the reader's recovered value lands inside that interval; a deterministic run is simply the special case $s = 0$. This turns "did I get the same number?" into a falsifiable test instead of a hope. The determinism budget is the explicit statement of which sources you eliminated (seed, deterministic kernels) and which you only bounded (reduction order, async arrival), so a reader knows whether to expect $s = 0$ or $s > 0$ before they run anything.
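The reporting rule above can be sketched with the standard library alone. The speedup values are hypothetical, the function names are illustrative, and the multiplier `k` is an added assumption: `k = 1` matches the one-standard-deviation interval described in the text, and a team that wants a looser acceptance band would have to state a larger value explicitly.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) using the R-1 denominator."""
    return statistics.mean(values), statistics.stdev(values)


def reproduces(reported_runs, recovered, k=1.0):
    """True when the recovered value lies within mean +/- k * s."""
    mean, s = summarize(reported_runs)
    return abs(recovered - mean) <= k * s


# Hypothetical speedups measured under five seeds on the stated cluster.
speedups = [6.1, 6.3, 6.2, 6.0, 6.4]
mean, s = summarize(speedups)

print(f"reported headline number: {mean:.2f} +/- {s:.2f}")
print("recovered 6.15 reproduces:", reproduces(speedups, 6.15))
print("recovered 3.10 reproduces:", reproduces(speedups, 3.10))
```

With these made-up runs the interval is about 6.2 plus or minus 0.16, so a recovered 6.15 passes and a recovered 3.10 fails. When every run is identical, `s` is zero and only an exact match passes, which is the deterministic special case.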

Practical Example: The Speedup That Could Not Be Reproduced

Who: A capstone team submitting a distributed data-pipeline project, reviewed by a second team tasked with reproducing it.

Situation: The report claimed a 6.2x throughput speedup from scaling a deduplication job from one node to eight.

Problem: The reproducing team ran the committed code on eight cloud nodes and measured 3.1x, half the claimed figure, and the report had no explanation for the gap.

Dilemma: Was the original number wrong, or was the reproduction running under different conditions that the package failed to pin?

Decision: They treated the cluster as a suspect input and compared specifications instead of arguing about the code.

How: The original run used nodes with a high-bandwidth interconnect; the reproduction used general-purpose nodes on standard networking, so the communication-bound shuffle was far slower, exactly the cost the models of Chapter 3 predict.

Result: Once the package recorded node type, count, and interconnect, both teams reproduced 6.2x on the stated cluster and 3.1x on the cheaper one, and both numbers became defensible because each named its hardware.

Lesson: The original number was not wrong; the package was incomplete. A distributed speedup without a cluster spec is irreproducible by construction, because the reader cannot recreate the conditions that produced it.

4. Config and Experiment Tracking (Intermediate)

Every run in the capstone must log its full configuration and its resulting metrics to a tracking system, so that the relationship between "what I set" and "what I got" is recorded rather than remembered. This is the experiment-tracking and lineage discipline of MLOps (Chapter 26), applied to your own project: the config (hyperparameters, seed, dataset version, worker count, code commit) and the metrics (loss, throughput, latency, speedup) are written together for each run, keyed by a run id. The headline numbers in the report are then not loose claims but pointers into this log; a reader can ask "which run produced the 6.2x figure?" and get a row with every input that produced it. The same lineage that makes a production model auditable (Section 35.7) makes a capstone result traceable: you can show provenance from the reported number back to the exact configuration and code that generated it.

Library Shortcut: Logging a Run's Config and Metrics with MLflow

A tracking library turns "I think I used learning rate 0.05" into a recorded fact. The pattern is the same across MLflow, Weights and Biases, and similar tools: open a run, log the config, log the metrics, and let the system store the lineage:

import mlflow

with mlflow.start_run(run_name="distributed-8node") as run:
    mlflow.log_params({          # the config: every knob that affects the result
        "seed": 20260616, "workers": 8, "lr": 0.05,
        "node_type": "a10g", "interconnect": "100gbe",
        "data_version": "dedup-2026-06-10",
        "code_commit": "a1b2c3d",
    })
    mlflow.log_metrics({         # the metrics: the headline numbers, logged not screenshotted
        "throughput_speedup": 6.2, "final_loss_mean": 0.0101,
        "final_loss_std": 0.0001,
    })
    # The run id is what Section 41.9's report cites for each reported number.
    print("run_id:", run.info.run_id)
Code 41.8.2: An MLflow log of one run's config and metrics. Because the cluster fields (node_type, interconnect) and provenance fields (data_version, code_commit) are logged alongside the metrics, the headline number carries its full context. The report in Section 41.9 cites the run_id so each claim resolves to one logged row rather than a remembered setting.
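The code_commit field is only trustworthy if it is read from the repository rather than typed by hand, and only if the working tree has no uncommitted changes. The sketch below is one way to enforce that, assuming the run is launched from inside a git checkout. The helper names are hypothetical; the two git commands used (`git rev-parse HEAD` and `git status --porcelain`) are standard.

```python
import subprocess


def dirty_paths(porcelain_output):
    """Parse `git status --porcelain` text into the list of changed paths.

    Each line is two status characters, a space, then the path.
    """
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def provenance(commit, porcelain_output):
    """Return a loggable dict, refusing to proceed on uncommitted changes."""
    changed = dirty_paths(porcelain_output)
    if changed:
        raise RuntimeError(f"uncommitted changes in: {changed}")
    return {"code_commit": commit.strip()}


def read_git():
    """Ask git for the current commit and the working-tree status."""
    def git(*args):
        done = subprocess.run(["git", *args], check=True,
                              capture_output=True, text=True)
        return done.stdout
    return git("rev-parse", "HEAD"), git("status", "--porcelain")


# Usage inside a tracked run (needs a git checkout, so it is not executed here):
#   commit, status = read_git()
#   mlflow.log_params(provenance(commit, status))
```

Refusing to log from a dirty tree is a deliberate design choice: a commit hash recorded next to results produced by uncommitted edits would point a reader at code that never generated the number.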

5. The Cluster Spec as Part of the Result (Beginner)

Because the cluster determines the numbers, the package records it as a first-class artifact, not a sentence in the methods section. A complete cluster spec states the node type (accelerator model and memory, host CPU and memory), the node count, and the interconnect (the network fabric and its bandwidth, since the communication-bound parts of the pipeline live or die on it). It also records the topology where it matters: whether workers share a rack or span availability zones changes collective-communication cost, as Chapter 4 makes precise. The spec belongs in the package as machine-readable metadata so a reader can either match it or, when they cannot, state exactly how their cluster differed. Table 41.8.1 contrasts what a single-machine package pins with what a distributed package must add.
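One possible shape for that machine-readable metadata is sketched below. The field names and example values are hypothetical, chosen to mirror the items listed above rather than taken from any standard schema; the point is that a reader who cannot match the cluster can still report exactly which fields differed.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Illustrative cluster record; extend with whatever your results depend on."""
    node_type: str
    accelerator: str
    accelerator_memory_gb: int
    node_count: int
    interconnect: str
    interconnect_gbps: float
    topology: str


def spec_diff(reported, actual):
    """Map each differing field to its (reported, actual) pair of values."""
    a, b = asdict(reported), asdict(actual)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


# Hypothetical values for the original run and for a reader's cluster.
reported = ClusterSpec("gpu-node-large", "a10g", 24, 8,
                       "ethernet", 100.0, "single rack")
mine = ClusterSpec("gpu-node-large", "a10g", 24, 8,
                   "ethernet", 10.0, "two availability zones")

print(json.dumps(asdict(reported), indent=2))  # what the package would commit
print(spec_diff(reported, mine))               # what a reader would report
```

A reproduction report that includes the output of `spec_diff` turns "we got a different number" into "we got a different number on a cluster that differed in these two fields."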

Table 41.8.1: What scale-out reproducibility adds on top of single-machine reproducibility. The left two columns are necessary for any result; the right column is what a distributed capstone must additionally pin because the result depends on it.

| Dependency | Pinned by | Distributed-specific concern |
|---|---|---|
| Code | Git commit, no uncommitted diff | Same commit launched on every node |
| Environment | Container digest + lockfile | Software-homogeneous across all workers |
| Data | Dataset version / content hash | Same shards, same partitioning across workers |
| Randomness | Seed + deterministic flags | Does not control reduction or async arrival order |
| Hardware | Cluster spec (the result depends on it) | Node type, count, interconnect, topology |

Thesis Thread: The Cluster Is Not a Detail, It Is the Subject

The whole book argues that scale-out behavior is a property of how work is distributed across machines, not of the code in isolation. The reproducibility package is where that thesis becomes a concrete obligation: because the speedup, the throughput, and even the loss trajectory depend on the cluster, the cluster must be recorded with the same care as the algorithm. A capstone that pins the code but waves at the hardware has reproduced the single-machine habit on a distributed result, and the number it reports cannot be checked. Reproducibility in distributed AI captures the cluster, not just the code.

6. Data Versioning and the One-Pass Comparison Artifact (Advanced)

The data must be versioned so that "the dataset" names one specific, retrievable object. A content hash or a snapshot id pins it; recording that the run used dedup-2026-06-10 with a known hash means a reader can confirm they have the same bytes before they trust any number derived from them. Data versioning also makes the baseline-versus-distributed comparison honest, which is the single most scrutinized artifact in the package. The headline claim of a scale-out capstone is almost always a comparison: the distributed system is faster than, or as accurate as, a baseline. That comparison is credible only when both sides are construct-matched and co-computed in one pass on one configuration: the same data, the same metric definition, the same evaluation harness, the same seed regime, run together and saved as one artifact. A speedup computed by pairing today's distributed throughput against a single-node number measured last month on different data is not a measurement; it is two unrelated numbers placed next to each other.

Concretely, the comparison artifact is produced by one script that runs the baseline and the distributed system back to back and emits both numbers and their ratio into a single file, so that a number-by-number audit of the report passes because every compared pair came from one run. If the baseline and the distributed result live in separate files generated by separate invocations under conditions that drifted, the comparison is invalid even when each number is individually correct. The one-pass artifact is the defense against that failure mode, and it is what Section 41.9's report should quote directly.
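Before any comparison runs, the harness can confirm that the data on disk is the data the package names. The sketch below shows the content-hash check described in this section, using a throwaway file in place of a real dataset. The function names, file name, and contents are hypothetical; in a real package the expected hash would be read from the committed metadata.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256):
    """Raise if the file's content hash differs from the recorded one."""
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
    return actual


# Demonstration on a temporary stand-in for one data shard.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-000.txt")
    with open(shard, "wb") as fh:
        fh.write(b"record-1\nrecord-2\n")
    recorded = file_sha256(shard)        # what the package would commit
    verify_dataset(shard, recorded)      # passes: same bytes

    with open(shard, "ab") as fh:        # simulate silent drift in the data
        fh.write(b"record-3\n")
    try:
        verify_dataset(shard, recorded)
        drift_detected = False
    except ValueError:
        drift_detected = True

print("drift detected:", drift_detected)
```

Running the check for every shard before the baseline and the distributed system start is what lets the one-pass artifact claim that both sides saw the same bytes.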

7. A README That Reproduces the Numbers with One Command (Beginner)

The final input ties the others together: a README whose single command rebuilds the result. The standard to meet is that a reader clones the repository, provisions the cluster the spec describes, and runs one command, after which the package recomputes the headline numbers and reports them. There must be no manual sequence of undocumented steps, because every undocumented step is a place reproduction fails silently. The command launches the pinned image on the stated cluster, pulls the versioned data, runs the one-pass comparison under the recorded seed regime, and prints the baseline-versus-distributed numbers with their variance. The runnable demonstration below shows the heart of that harness at miniature scale: it runs a tiny experiment twice under one fixed seed to prove the headline number is exactly reproducible, then varies the seed to show how a result that is only reproducible in distribution is reported as a mean with a standard deviation.

import hashlib
import numpy as np

def run_experiment(seed):
    """A tiny 'distributed' training proxy: each of K workers does SGD on a
    shard, results are all-reduced (averaged). Returns the final loss."""
    rng = np.random.default_rng(seed)
    N, d, K, steps, lr = 4096, 16, 8, 200, 0.05
    X = rng.standard_normal((N, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(N)
    shards = np.array_split(np.arange(N), K)
    w = np.zeros(d)
    for _ in range(steps):
        grads = [2.0 / len(s) * (X[s].T @ (X[s] @ w - y[s])) for s in shards]
        w = w - lr * (np.sum(grads, axis=0) / K)        # all-reduce mean
    return float(np.mean((X @ w - y) ** 2))

def digest(value):
    return hashlib.sha256(f"{value!r}".encode()).hexdigest()[:16]

# 1. Same seed, run twice: the headline number must be bit-for-bit identical.
a = run_experiment(seed=20260616)
b = run_experiment(seed=20260616)
print("seed=20260616  run A loss :", repr(a))
print("seed=20260616  run B loss :", repr(b))
print("bitwise identical         :", a == b)
print("sha256(A)[:16]            :", digest(a))
print("sha256(B)[:16]            :", digest(b))

# 2. Vary the seed: report the headline number as mean +/- std over runs.
losses = np.array([run_experiment(seed=s) for s in range(10)])
print()
print("seeds 0..9 final losses   :", np.round(losses, 6).tolist())
print("mean over 10 seeds        :", f"{losses.mean():.6f}")
print("std  over 10 seeds        :", f"{losses.std(ddof=1):.6f}")
print("reported headline number  :", f"{losses.mean():.4f} +/- {losses.std(ddof=1):.4f}")
Code 41.8.3: The reproducibility harness in miniature. The same seed run twice must yield a bit-for-bit identical loss (the deterministic case, $s = 0$); varying the seed exposes the run-to-run spread that a distributed result reports as $\bar{m} \pm s$. A real capstone harness wraps this in the pinned image and runs it on the stated cluster, but the reporting logic is exactly this.
seed=20260616  run A loss : 0.010111798421239302
seed=20260616  run B loss : 0.010111798421239302
bitwise identical         : True
sha256(A)[:16]            : 9645e4eeddac0477
sha256(B)[:16]            : 9645e4eeddac0477

seeds 0..9 final losses   : [0.010046, 0.010129, 0.010032, 0.010348, 0.009866, 0.010117, 0.009905, 0.009922, 0.01012, 0.010111]
mean over 10 seeds        : 0.010060
std  over 10 seeds        : 0.000141
reported headline number  : 0.0101 +/- 0.0001
Output 41.8.3: Real output. Under one fixed seed the loss is identical to the last digit and the two SHA-256 digests match, proving exact reproducibility. Across ten seeds the same headline number is reported as $0.0101 \pm 0.0001$, the form a reproduction must land inside. This is the reporting standard the README's one command should produce for every claimed number.
Research Frontier: Reproducibility Standards for Distributed AI (2024 to 2026)

Reproducibility has moved from a virtue to an enforced norm. Major venues now run reproducibility checklists and artifact-evaluation tracks (for example, the NeurIPS reproducibility checklist and the ML Reproducibility Challenge, MLRC) that ask for much of the package this section describes: pinned environments, logged configs, and a one-command path to the headline numbers. The harder open problem is bitwise determinism under distribution: libraries expose deterministic-algorithm modes and deterministic collective implementations, but guaranteeing identical results across different worker counts and interconnects remains an active engineering frontier, because the nondeterminism lives in the reduction order and the network, not only in the code. Work on reproducible distributed training and on content-addressed data and environment provenance (lockfiles, container digests, dataset hashing in tools like DVC and lakeFS) is converging on the practice that a result is the tuple of code, environment, data, seed, and cluster, recorded together. The frontier is making that tuple verifiable automatically rather than by trust.

Exercise 41.8.1: Name the Residual Nondeterminism (Conceptual)

A teammate pins the code commit, the container digest, the dataset hash, and a single random seed, then claims the distributed training run is "fully reproducible, bit for bit." For a synchronous data-parallel run, name two sources of nondeterminism the seed does not control and explain why each can change the final loss across otherwise-identical runs. Then state what changes if the run is asynchronous (tie your answer to Chapter 10) or elastic (tie it to Chapter 18). Finally, say which reporting form, a single number or $\bar{m} \pm s$, each case warrants.

Exercise 41.8.2: Build the One-Pass Comparison Artifact (Coding)

Extend Code 41.8.3 into a single script that runs a "baseline" (one worker, $K = 1$) and the "distributed" version ($K = 8$) back to back on the same generated data and the same seed regime, and writes one JSON file containing both final losses, both wall-clock times, the speedup ratio, and the seed list. The artifact must be produced by one invocation so that the baseline and distributed numbers are co-computed. Then write a second script that reads the JSON and prints the headline comparison line. Explain why generating the two numbers in separate runs, even with the same code, would make the comparison invalid.

Exercise 41.8.3: Audit a Cluster-Dependent Speedup (Analysis)

A capstone reports a 5.0x speedup at eight workers but records no interconnect in its cluster spec. Using the communication-cost reasoning of Chapter 3, argue why the missing interconnect makes the number irreproducible, and estimate qualitatively how the speedup would shift if the original run used a high-bandwidth fabric and the reproduction used commodity networking on a communication-bound workload. Specify the minimum set of cluster fields you would require the package to record so that the 5.0x figure becomes a falsifiable claim, and relate this to the lineage and provenance practice of Chapter 26 and Section 35.7.