"My benchmark was perfectly reproducible. It reproduced a different number every single time, with admirable consistency."
An All-Reduce That Has Seen Some Gradients
A measurement you cannot reproduce is a story, not a result; on a cluster, reproducibility is something you engineer, because the same job run twice on the same code will differ unless you pin, record, and report the things that vary. A single-machine experiment is reproducible almost by accident: fix the seed, rerun, get the same loss curve. The cluster takes that comfort away. Reductions sum partial results in whatever order the network delivers them, so the arithmetic itself shifts run to run. Neighbors on a shared cluster steal bandwidth at unpredictable moments. Autoscalers change the worker count mid-job. The hardware under your job is heterogeneous in ways you did not choose. This closing section of Chapter 5, and of Part I, turns the evaluation discipline of the previous six sections into a practice that survives being handed to someone else: name every source of cluster nonreproducibility, pin or record what you can, report configuration alongside every number, and ship a package that lets a stranger recover your result.
The previous sections of this chapter built the vocabulary of evaluation: throughput and latency, scaling efficiency, cost per unit of useful work, and the statistics needed to tell a real improvement from noise. All of it assumes the underlying measurement can be trusted, which in turn assumes it can be reproduced. On one machine that assumption is cheap. On a cluster it is the single hardest property to maintain, because distribution introduces sources of variation that have no single-machine analogue. Before we can certify any of Chapter 5's numbers, we have to confront why two runs of the same distributed job disagree, and what it takes to make them agree, or at least to make their disagreement small, bounded, and documented.
1. Why Two Runs of the Same Cluster Job Disagree (Beginner)
Single-machine nonreproducibility has familiar, controllable causes: an unset random seed, an uncontrolled thread count, a dependency that silently upgraded. The cluster adds five sources that are structural, not accidental, and each one resists the seed-and-rerun reflex that works on a laptop. Naming them is the first half of the cure, because a source of variation you have not named is a source you cannot pin or report.
The first is nondeterministic reductions. A distributed sum, the all-reduce at the heart of data-parallel training (Chapter 4), combines one partial result per worker. Floating-point addition is not associative, so the order in which those partials are combined changes the last bits of the answer, and that order depends on which worker's bytes arrive first, which is a property of the network on that run, not of your code. The second is variable network conditions: bandwidth and latency between nodes fluctuate, so the wall-clock time of every communication step, and therefore your throughput number, moves between runs even when the computed answer is fixed. The third is multi-tenant interference: on a shared cluster, other jobs contend for the same network links, memory bandwidth, and storage, and their load is invisible to you and uncorrelated with your schedule. The fourth is autoscaling: an elastic job may run on eight workers one day and twelve the next, changing both the timing and, through the reduction order, the arithmetic. The fifth is hardware heterogeneity: a pool labeled with one accelerator name may contain several silicon revisions, firmware versions, and thermal states, each with slightly different numerics and markedly different speed.
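The autoscaling effect on arithmetic can be sketched without a cluster. The snippet below is a minimal illustration under stated assumptions, not a model of any real collective: the helper `partitioned_sum`, the worker counts, and the random array are all invented for the example. It splits the same float32 values across 8 and then 12 simulated workers, lets each compute a partial sum, and combines the partials in rank order.

```python
import numpy as np

def partitioned_sum(x: np.ndarray, n_workers: int) -> np.float32:
    """Sum x as n_workers partial sums combined left to right, all in float32."""
    partials = [chunk.sum(dtype=np.float32) for chunk in np.array_split(x, n_workers)]
    total = np.float32(0.0)
    for p in partials:  # combine partials in rank order
        total = np.float32(total + p)
    return total

rng = np.random.default_rng(0)
grads = rng.random(1_000_000, dtype=np.float32)  # stand-in for gradient values

sum_8 = partitioned_sum(grads, 8)        # job running on 8 workers
sum_12 = partitioned_sum(grads, 12)      # same data after scaling to 12 workers
reference = grads.sum(dtype=np.float64)  # higher-precision reference

print("8 workers :", repr(sum_8))
print("12 workers:", repr(sum_12))
print("difference:", float(sum_8) - float(sum_12))
```

No value changed between the two calls; only the partitioning did. The two totals will typically differ in their last digits, which is the sense in which a changed worker count changes the arithmetic.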
Two different things are called "reproducible," and conflating them wastes effort. Numeric reproducibility asks whether you get the same answer (the same loss, the same accuracy, the same output bits). Performance reproducibility asks whether you get the same speed (the same throughput, the same latency distribution). Seeds and determinism flags target the first; they do nothing for the second. Network variance, interference, and heterogeneity attack performance reproducibility and are largely beyond a seed's reach, so you control them by measurement design (repeat runs, report distributions, pin the hardware pool) rather than by pinning a number. Decide which kind of reproducibility a given claim needs before you spend a day chasing the wrong one.
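Controlling performance reproducibility by measurement design can look like the sketch below. It is an assumption-laden illustration: `measure_throughput` and `toy_step` are hypothetical names, the toy workload stands in for a real training step, and the repeat counts are arbitrary. The point is only that the function returns a distribution summary instead of a single number.

```python
import statistics
import time

def measure_throughput(step, items_per_step: int, repeats: int = 5, steps: int = 20) -> dict:
    """Run `step` in `repeats` timed batches and summarize items/second."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(steps):
            step()
        elapsed = time.perf_counter() - start
        rates.append(items_per_step * steps / elapsed)
    median = statistics.median(rates)
    return {
        "repeats": repeats,
        "median": median,
        "min": min(rates),
        "max": max(rates),
        "relative_spread": (max(rates) - min(rates)) / median,
    }

def toy_step():
    # Placeholder workload; a real study would run one training or inference step.
    sum(i * i for i in range(10_000))

report = measure_throughput(toy_step, items_per_step=32)
print(report)
```

Reporting the median together with the min, max, and relative spread lets a reader judge whether a later rerun falls inside the variation already observed.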
2. Collective Nondeterminism, Demonstrated (Intermediate)
The most surprising of the five sources, because it touches the answer and not merely the timing, is the nondeterministic reduction. It is worth seeing in its smallest possible form, with no cluster and no network, because the mechanism is pure arithmetic: summing the same numbers in a different order gives a different result. On a cluster, the order is chosen for you by message-arrival timing, so the effect appears whether or not you asked for it. The code below sums one million floating-point values, deliberately mixing large and tiny magnitudes so that rounding bites, in four different orders that stand in for four different worker-arrival orderings.
import numpy as np

# Simulate a floating-point all-reduce: the SAME partial gradients, summed in
# different orders, as when workers arrive in a different order on each run.
rng = np.random.default_rng(7)
P = 1_000_000
vals = rng.standard_normal(P).astype(np.float64)
vals[::3] *= 1e8  # mix magnitudes so rounding bites

def ordered_sum(x):
    s = 0.0
    for v in x:  # left-to-right, like a real reducer
        s += v
    return s

ascending = ordered_sum(np.sort(vals))  # four worker-arrival orderings
descending = ordered_sum(np.sort(vals)[::-1])
shuffled1 = ordered_sum(rng.permutation(vals))
shuffled2 = ordered_sum(rng.permutation(vals))
exact = float(np.sum(vals, dtype=np.longdouble))  # higher-precision reference

orders = [ascending, descending, shuffled1, shuffled2]
print("sum, ascending order :", repr(ascending))
print("sum, descending order :", repr(descending))
print("sum, worker order A :", repr(shuffled1))
print("sum, worker order B :", repr(shuffled2))
print("max spread across orders:", f"{max(orders) - min(orders):.6e}")
print("relative spread :", f"{(max(orders) - min(orders)) / abs(exact):.2e}")
sum, ascending order : np.float64(-20189323766.140797)
sum, descending order : np.float64(-20189323766.132816)
sum, worker order A : np.float64(-20189323766.296597)
sum, worker order B : np.float64(-20189323766.29571)
max spread across orders: 1.637802e-01
relative spread : 8.11e-12
The disagreement is tiny, around eight parts in a trillion here, but it is not zero, and it is not noise you can seed away: each ordering is perfectly deterministic on its own, yet the cluster picks a different ordering each run. The right response is rarely to chase bit-exactness, which on many accelerators costs real speed and is sometimes impossible across hardware revisions. The right response is to report the effect: state that the reduction is nondeterministic, quantify the run-to-run spread by repeating the run, and let that spread set the resolution below which two results are indistinguishable. A measured difference smaller than the reduction-order spread is not an improvement; it is the arithmetic breathing. Relating this $O(10^{-12})$ floating-point spread to the statistical-significance machinery of Section 5.6 is exactly how you decide whether a reported gain clears the noise floor.
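The noise-floor rule above can be written down as a crude screen. This is a sketch under assumptions, not a replacement for the significance tests of Section 5.6: the function name `clears_noise_floor`, the default of two multiples, and the loss values are all hypothetical and chosen only to illustrate the comparison.

```python
import statistics

def clears_noise_floor(baseline_runs, candidate_runs, multiples: float = 2.0) -> bool:
    """Crude screen: is the gap between group means larger than `multiples`
    times the widest run-to-run spread (max - min) seen in either group?"""
    floor = max(
        max(baseline_runs) - min(baseline_runs),
        max(candidate_runs) - min(candidate_runs),
    )
    gap = abs(statistics.mean(candidate_runs) - statistics.mean(baseline_runs))
    return gap > multiples * floor

# Hypothetical validation losses from three repeats of each configuration.
baseline = [0.41250, 0.41251, 0.41249]
tiny_change = [0.41250, 0.41252, 0.41250]
real_change = [0.40110, 0.40112, 0.40109]

print(clears_noise_floor(baseline, tiny_change))  # gap sits inside the spread
print(clears_noise_floor(baseline, real_change))  # gap is far outside it
```

The first comparison fails the screen because the means differ by less than the runs differ among themselves; the second passes. The threshold of two multiples is a placeholder, and a proper test still decides borderline cases.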
A predictable rite of passage on a new cluster is the panicked bug report titled "training is nondeterministic" filed against perfectly correct code. The loss curve wobbles in its eleventh decimal place, someone notices, and a day disappears into a hunt for the phantom. The culprit is almost always the all-reduce summing gradients in network-arrival order. The fix is not a fix; it is a sentence in the methods section admitting that floating-point addition has opinions about order.
3. Practices That Make a Cluster Measurement Reproducible (Intermediate)
Reproducibility on a cluster is the product of a handful of disciplined habits, each aimed at one of the five sources from Section 1. None is exotic; the discipline is in doing all of them, every run, and recording the result where a stranger can find it.
- Pin software and container versions. The single largest source of "it worked last month" is a dependency that moved: a new CUDA driver, a patched collective-communication library, a framework minor release that changed a default. Ship the job as a container image referenced by digest, not by a mutable tag, so the exact bytes of every library are frozen.
- Fix seeds where possible, and acknowledge what they cannot fix. Set the seeds for every random number generator (the framework's, the data loader's, the language runtime's) and enable the framework's deterministic-algorithm mode where the workload tolerates its speed cost. Then state plainly that collective reductions remain nondeterministic, as Output 5.7.1 showed, so a reader does not expect bit-exactness you cannot deliver.
- Record the full environment. Capture the hardware (accelerator model and count, silicon revision), the interconnect topology, every library version, and the collective-library settings (for example the NCCL environment variables that select algorithm and protocol), because these change both numerics and speed.
- Report configuration alongside every number. A throughput figure without its worker count, batch size, precision, and the flags in force is uninterpretable; the configuration is part of the measurement, not metadata about it.
- Publish a reproducibility package. Bundle the code commit, the data version, the environment description, the configuration, and the seeds into one artifact, as Figure 5.7.1 lays out, so reproduction is a download and a single command rather than an archaeology project.
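One way to make "report configuration alongside every number" hard to skip is to give the number no way to exist without its configuration. The sketch below assumes a hypothetical `ThroughputReport` record; the field names, the registry URL, and the all-zero digest are placeholders, and the regular expression checks only the general `@sha256:` plus 64 hex characters shape of a digest-pinned image reference.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# A digest-pinned image reference ends in "@sha256:" plus 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

@dataclass(frozen=True)
class ThroughputReport:
    samples_per_second: float
    world_size: int
    global_batch_size: int
    precision: str
    image: str
    flags: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject mutable tags such as "train:latest".
        if not DIGEST_RE.search(self.image):
            raise ValueError(f"image must be pinned by digest, got {self.image!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

report = ThroughputReport(
    samples_per_second=4100.0,
    world_size=8,
    global_batch_size=512,
    precision="bf16",
    image="registry.example.com/train@sha256:" + "0" * 64,  # placeholder digest
    flags={"NCCL_ALGO": "Ring"},
)
print(report.to_json())
```

Because every field except `flags` is required, a throughput figure cannot be constructed without its worker count, batch size, precision, and a digest-pinned image.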
Who: A research engineer on an ML platform team publishing an internal scaling study of a training job.
Situation: Their report showed near-linear speedup from 8 to 64 GPUs and was used to justify a hardware purchase.
Problem: Three months later a colleague reran the job and got a visibly worse curve, and nobody could explain the gap or say which number was right.
Dilemma: Trust the original glossy curve and buy the hardware, or trust the worse rerun and delay; with no recorded environment there was no way to tell whether the code, the cluster, or a library had changed.
Decision: They stopped trusting either number and rebuilt the study around a reproducibility package: a pinned container digest, recorded NCCL settings and topology, the exact configuration per point, and three repeated runs per worker count with the spread reported.
How: The original run had used a faster, since-changed collective-library default and a less contended time of day; the rerun had neither. Both numbers were "correct" for their unrecorded conditions, which is precisely the failure.
Result: The repeatable curve was slightly below the original but came with error bars and a one-command reproduction, and it survived a second team's audit. The purchase was re-justified on numbers that held up.
Lesson: An unrecorded environment does not make a fast number; it makes an unfalsifiable one. The package is what converts a curve from a claim into evidence.
The pinning and recording described above is mostly one helper function, called once at job start, plus a one-line environment dump. The framework provides the determinism switch; you provide the discipline of calling it and logging what you set.
import json
import os
import random
import subprocess

import numpy as np
import torch

def make_reproducible(seed: int):
    # Seed every RNG the job touches: Python, NumPy, and PyTorch (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True, warn_only=True)  # numeric repeatability
    # Workspace setting for deterministic cuBLAS; set it before the first CUDA call.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

def capture_environment(path="env.json"):
    env = {
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
        "nccl_env": {k: v for k, v in os.environ.items() if k.startswith("NCCL_")},
    }
    with open(path, "w") as f:
        json.dump(env, f, indent=2)  # ships inside the reproducibility package
    return env

make_reproducible(20260616)
env = capture_environment()
make_reproducible pins every seed and the determinism mode, and capture_environment records the commit, library versions, world size, accelerator, and collective-library settings into the env.json that travels in the package. The collective nondeterminism of Output 5.7.1 still applies and belongs in the report text.
Reproducibility has moved from a checklist to tooling. Major venues now attach a formal reproducibility track and badge: NeurIPS, ICML, and MLSys ask for a completed reproducibility checklist and increasingly an artifact evaluation, and community efforts such as the ML Reproducibility Challenge (2024 and 2025 editions) systematically re-run published distributed-training results. On the determinism side, deterministic collective and attention kernels have matured, and work on bitwise-reproducible training across runs and across hardware (including reproducible mixed-precision reductions) is active, trading throughput for repeatability where the science demands it. Provenance frameworks track code, data, and environment as a single signed lineage so that a number can be traced to the exact bytes that produced it. The consistent finding across these efforts is the one in this section: pin and record everything cheap to pin, and report, rather than pretend to eliminate, the residual nondeterminism that collectives impose.
4. From a Single Number to a Reproducible Package (Advanced)
The practices of Section 3 converge on one deliverable, drawn in Figure 5.7.1: a reproducibility package that bundles code version, data version, environment, configuration, and seeds so a second party can recover the result. Two of those five inputs reach beyond this chapter and tie reproducible measurement to the rest of the book. The data version is only meaningful if the dataset itself is versioned and addressable, which is the job of distributed storage and data loading; pinning "the training set" to an immutable snapshot or content hash, rather than a moving directory, is what we build in Chapter 8. The configuration and environment records are only useful if they are tracked, searchable, and attached to every run automatically rather than by hand, which is the province of experiment tracking and the broader operational discipline developed in Chapter 26. A reproducibility package, in other words, is not a one-off zip file; it is the point where the data-versioning of Part II and the experiment-tracking of Part V meet the evaluation discipline of Part I.
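The five inputs of the package can be gathered into one manifest, with the data pinned by content hash instead of by a moving path. What follows is a minimal sketch under assumptions: the manifest layout, the helper names `sha256_of`, `build_manifest`, and `missing_inputs`, and the placeholder commit and version strings are invented for illustration and are not the book's own package format.

```python
import hashlib
import json
import tempfile
from pathlib import Path

REQUIRED = ("code", "data", "environment", "configuration", "seeds")

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Content hash of a file, read in chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(commit, data_path, environment, configuration, seeds) -> dict:
    return {
        "code": {"git_commit": commit},
        "data": {"path": str(data_path), "sha256": sha256_of(Path(data_path))},
        "environment": environment,
        "configuration": configuration,
        "seeds": seeds,
    }

def missing_inputs(manifest: dict) -> list:
    """Names of the five package inputs that are absent or empty."""
    return [key for key in REQUIRED if not manifest.get(key)]

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.bin"
    data_file.write_bytes(b"toy dataset bytes")  # stand-in for a real dataset
    manifest = build_manifest(
        commit="<git commit hash>",              # placeholder, filled in by the job
        data_path=data_file,
        environment={"torch": "<version>", "world_size": 8},
        configuration={"global_batch_size": 512, "precision": "bf16"},
        seeds={"global": 20260616},
    )
    (Path(tmp) / "manifest.json").write_text(json.dumps(manifest, indent=2))

print(missing_inputs(manifest))
```

A single file hash is the simplest form of data pinning; a dataset spread over many files would need a hash over a sorted file listing or a snapshot identifier from the storage layer.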
With the package in hand, the loop of Chapter 5 closes. You measure throughput, latency, scaling efficiency, and cost; you establish with the statistics of Section 5.6 that a difference is real; and you seal the whole apparatus into an artifact that lets a skeptic, including your future self, recover the number and the conditions that produced it. That is the difference between a system you can report on and a system you merely ran.
Part I has argued one thesis from five angles. Chapter 1 set it: modern AI is a distributed system, and scale-out is the discipline of splitting essential work across machines, exactly because data parallelism can be exact rather than approximate. Chapter 2 gave the distributed-systems concepts (partitioning, replication, coordination, failure) that any such split must respect. Chapter 3 made the cost of the split quantitative with performance models, so "should this be distributed?" became a calculation. Chapter 4 built the communication primitives, the collectives, that are the engine of every parallel method to come. Chapter 5 taught how to evaluate the result and, in this section, how to make that evaluation reproducible. The spine is in place: a distributed AI system splits work, pays a communication tax, tolerates failure, and must be measured with care. Every later part now picks one axis of distribution and builds on this foundation.
Evaluating a distributed AI system means measuring the right quantities (throughput, latency, scaling efficiency, cost), proving differences are real rather than noise, and making the whole measurement reproducible. On a cluster, reproducibility is engineered against five structural sources of variation: nondeterministic reductions, variable network conditions, multi-tenant interference, autoscaling, and hardware heterogeneity. You tame them by pinning software and container versions, fixing seeds while acknowledging that collectives stay nondeterministic, recording the full environment, reporting configuration with every number, and publishing a reproducibility package. This closes Part I, which laid the foundation for the rest of the book: the thesis that AI at scale is distributed (Chapter 1), the systems concepts it rests on (Chapter 2), the performance models that price it (Chapter 3), the communication primitives that power it (Chapter 4), and the evaluation and reproducibility discipline that keeps it honest (Chapter 5).
For each observation, state which of the five sources of cluster nonreproducibility from Section 1 is the most likely cause, and whether it threatens numeric reproducibility, performance reproducibility, or both: (a) the final validation accuracy differs in its sixth decimal place between two runs of identical code; (b) the same job reports 4,100 samples per second at 2am and 3,400 at 2pm; (c) a job that took 90 minutes last week takes 70 minutes today, and the worker count in the logs changed from 10 to 14; (d) one run's loss is reproducible to the bit on a development node but not on the shared training pool. For each, name the one practice from Section 3 that most directly addresses it.
Extend Code 5.7.1 into a reusable measurement. Sum the same array in 50 random orders, collect the 50 results, and report the mean, the standard deviation, and the full spread (max minus min) as both an absolute number and a fraction of the mean. Then change the magnitude mixing (try all values near 1.0, then a wider spread than the 1e8 used here) and show how the relative spread grows with the dynamic range of the summands. Conclude with the rule this gives you: for a measured difference to count as real, by how many multiples of this spread should it exceed it, and how does that connect to the significance testing of Section 5.6?
Take a published scaling or throughput claim (from a paper, a vendor benchmark, or one of your own past runs) and audit it against the five inputs in Figure 5.7.1: is the code pinned to a commit and container digest, the data pinned to a version or hash, the environment (hardware, topology, library and NCCL settings) recorded, the configuration reported with the number, and the seeds and determinism mode stated? List which inputs are present, which are missing, and for each missing one, give the specific way the reported number could be wrong or irreproducible because of the gap. Decide whether you could redraw the curve from what is provided, and if not, exactly what you would have to request.
Part I gives you everything needed to study a real distributed job end to end. Each idea below combines the performance models of Chapter 3, the collectives of Chapter 4, and the evaluation and reproducibility discipline of Chapter 5.
- A reproducible scaling study. Pick a real training or data-processing job you can run on 2, 4, and 8 workers. Produce a scaling-efficiency curve with three repeated runs per point and error bars, then ship a one-command reproducibility package (pinned container, recorded environment, configuration per point, seeds) so a classmate can redraw your exact curve. Write one paragraph naming which of the five nonreproducibility sources you controlled and which you could only report.
- Measure your cluster's reduction-order spread. On real hardware, run the same gradient all-reduce many times and record the run-to-run variation in the result, the way Code 5.7.1 does in simulation. Report the spread, relate it to your job's accuracy resolution, and state the smallest accuracy difference that would be meaningful given that floor.
- A reproducibility audit tool. Build a small script that, given a run, checks for the five package inputs of Figure 5.7.1 and emits a pass or fail per input with the missing items named. Run it over several of your team's past experiments and summarize how reproducible they actually were.
Where Part II begins. Part I established that AI at scale is a distributed system and gave the foundation to reason about it: the axes of distribution, the systems concepts, the performance models, the communication primitives, and the evaluation discipline you just sealed into a reproducible package. The recurring pressure that opened the book was that the data itself outgrew one machine first, before models and before request volume. Part II takes up that pressure directly. It builds the engines of distributed data processing for AI, starting with the model that made web-scale computation routine, MapReduce, whose shuffle is the distant ancestor of the all-reduce you have been studying. We begin in Chapter 6.