"On one machine a bug is a stack trace. On a thousand machines a bug is a thing that happened to one of us, last Tuesday, and now we are all standing at a barrier waiting to find out which."
An All-Reduce That Has Seen Some Gradients
A thousand-GPU training run is a distributed system that must be kept observable, because the failures that matter most are invisible from any single rank: a quiet straggler, a desynchronized collective, a corrupted shard, a loss curve about to diverge into wasted weeks of compute. Everything earlier in this chapter, checkpointing, clean restart, elasticity, straggler mitigation, spot-instance survival, and memory offload, assumes you can see what the run is doing. This closing section builds that sight. It names the small set of signals worth watching, shows how to compress telemetry from thousands of ranks into something a human and an alerting system can act on, and lays out the forensic moves for the bugs that exist only because the work is distributed. Then it folds the whole chapter into one operating doctrine for keeping large runs alive.
A single-machine training job is debugged the way any program is: you read the stack trace, attach a debugger, print a tensor. None of that survives contact with scale. When one rank out of a thousand stalls, the other 999 do not crash; they block at the next collective and wait, producing no error at all, just a run that has silently stopped making progress while still burning every GPU-hour. When a gradient quietly goes wrong on one shard, the loss does not announce it; it drifts, then spikes, then diverges, often hours after the root cause. The defining property of distributed-training failures is that the symptom and the cause are separated in both space (which rank) and time (which step), and the only bridge across that separation is telemetry you collected before you knew you would need it. Observability is therefore not an operational afterthought here; it is the precondition for every fault-tolerance mechanism in this chapter doing its job.
1. What to Monitor on a Large Run Beginner
The instinct at scale is to log everything, and it is the wrong instinct: a thousand ranks each emitting a hundred metrics per step is millions of points per minute, a firehose no human reads and no dashboard renders. The skill is choosing the small set of signals that, between them, answer the only three questions that matter during a run: is the cluster doing useful work, is the work actually learning, and is anything physically breaking. A compact, well-chosen panel beats an exhaustive one, because alerts fire on the signals you watch, not the ones you buried.
The first signal is throughput and its efficiency-normalized cousin, Model FLOPs Utilization (MFU): the fraction of the hardware's peak floating-point throughput that the run actually converts into useful model compute. MFU is the single number that says whether the expensive cluster is earning its keep, and it folds together GPU idle time, communication stalls, and pipeline bubbles into one ratio you can trend over hours. We defined MFU and its measurement carefully in Chapter 5; here it becomes the headline of the dashboard, because a slow drift downward in MFU is the earliest, cheapest warning that something in the cluster has degraded.
The second signal is the loss curve, watched not for its absolute value but for its shape. A healthy curve descends smoothly; the pathologies are a sudden spike (often a bad batch, an overflow, or a numerical instability) and a slow divergence (a learning-rate or data problem). A spike that recovers within a few steps is usually harmless; a spike that does not, or a curve that turns upward, is the signal to stop the run rather than checkpoint over good weights with bad ones. The third signal is per-rank step time and collective-wait, the straggler detector from Section 18.5: if one rank is consistently slower, every other rank pays for it at the next barrier, and the gap between the slowest rank and the median is the wasted GPU time, made visible.
The remaining signals are diagnostic rather than headline. Gradient norms and activation statistics per layer are the early-warning system for numerical health: a gradient norm climbing toward an overflow, or an activation distribution drifting, precedes a loss spike often by many steps and explains it after the fact. Hardware health, GPU temperature, ECC memory errors, NIC and link errors, is where silent failures originate; a GPU throttling on temperature or accumulating correctable ECC errors is a straggler-in-waiting, and an uncorrectable ECC error is a likely source of silent corruption. Finally the data pipeline: if the input pipeline cannot keep the GPUs fed, the symptom (low MFU, GPUs idle) looks identical to a compute problem but the fix is entirely different, so input-queue depth and loader stall time earn their place on the panel. Figure 18.8.1 arranges exactly these into the four-panel view an operator keeps open for the life of a run.
At scale, the actionable information is almost always in the change of a signal, not its level. MFU of 38% is neither good nor bad in the abstract; MFU that was 42% an hour ago and is 38% now is a degradation worth investigating. A gradient norm of 4.0 is fine; a gradient norm that doubled in twenty steps is a spike forming. Absolute thresholds generate false alarms across heterogeneous hardware and model phases; trends and sudden discontinuities are what separate a healthy thousand-GPU run from one quietly wasting a week of compute. Build your alerts on slopes and step-changes, anchored to each run's own moving baseline.
2. Aggregating Telemetry Across Thousands of Ranks Intermediate
A signal you choose to watch must still survive the trip from a thousand ranks to one dashboard, and naively gathering every rank's every metric to rank 0 recreates, in the monitoring plane, exactly the all-to-one bottleneck that good distributed design avoids in the compute plane. The escape is to aggregate metrics the same way we aggregate gradients: with operations that combine partial results cheaply and associatively, so the cost grows with the logarithm of the rank count rather than linearly. A metric is mergeable when two partial summaries combine into a summary of the union without revisiting the underlying data, and most signals that matter are mergeable by construction.
Sums and counts merge trivially (add them), which already gives means; minima and maxima merge by comparison, which catches the slowest rank and the hottest GPU. The harder cases are distributions: you want the median and the 99th percentile of step time across a thousand ranks, but exact percentiles are not mergeable without holding every value. The standard answer is a sketch, a small fixed-size data structure that approximates a quantile to a controlled error and merges associatively, so each rank summarizes its own stream, a tree of all-reduce-style merges folds the summaries together, and the final structure answers "what is the p99 step time" within a guaranteed tolerance from a few kilobytes instead of gigabytes. These mergeable aggregates and sketches are precisely the tools built for petabyte analytics in Chapter 6; monitoring a training run is a streaming-aggregation problem, and the distributed-monitoring patterns of Chapter 9 apply directly to it.
The architecture that follows is hierarchical. Each rank emits to a node-local agent; node agents pre-aggregate across the few ranks on their host; a tree of collectors merges node summaries into rack and cluster summaries; only the merged result, a handful of numbers and a few sketches, reaches the dashboard and the alerting system. The same topology-aware thinking that makes a collective fast (Chapter 15) makes telemetry aggregation scalable, and for the same reason: a balanced tree of associative merges turns a $O(N)$ gather into an $O(\log N)$ one. The cost of monitoring then stays a rounding error against the cost of training, even at a thousand ranks.
You do not build the telemetry plane from scratch. Weights & Biases and TensorBoard handle per-rank logging, server-side aggregation, and the dashboard; you log from rank 0 plus a sample of other ranks and they render the loss curve, throughput, and per-layer gradient-norm histograms automatically. For the distributed-only failures, PyTorch ships the NCCL flight recorder, a ring buffer that records every collective each rank started and finished, so a hang can be diagnosed after the fact rather than reproduced:
# Logging side: one rank drives the dashboard, sampled ranks add coverage.
import wandb
if rank == 0 or rank % 64 == 0: # rank-0 + a 1-in-64 sample
wandb.log({"loss": loss.item(),
"mfu": mfu, "step_time_s": step_time,
"grad_norm": total_grad_norm}, step=global_step)
# Flight recorder: set BEFORE launch so the buffer is armed when a hang occurs.
# TORCH_NCCL_TRACE_BUFFER_SIZE=2000 # keep the last 2000 collective records
# TORCH_NCCL_DUMP_ON_TIMEOUT=1 # dump every rank's buffer on a timeout
# On a hang, the dump shows which rank never reached the collective the
# others are blocked on, turning a silent stall into a named culprit.
wandb.log from rank 0 plus a sampled subset feeds the dashboard cheaply, while two environment variables arm the NCCL flight recorder so a desynchronized collective leaves a forensic trail instead of an unexplained barrier stall.3. Debugging the Bugs That Only Exist at Scale Advanced
Some bugs cannot occur on one machine, because they are bugs of coordination. The first and most common is the hang: the run stops making progress, no rank has crashed, and every GPU is busy doing nothing. The usual cause is a desynchronized collective, where the ranks disagree about what they are collectively doing. If a conditional makes rank 3 skip an all-reduce that the others execute, the others block forever waiting for a participant that will never arrive. A close relative is a shape mismatch: ranks enter the same all-reduce with tensors of different sizes, which either errors or, worse, silently produces garbage. The diagnostic move is to find the one rank whose recent history differs from the rest, which is exactly what the flight recorder of Code 18.8.1 makes possible: dump every rank's last collectives and look for the rank that is one call behind.
When a desynchronized collective is not caught by the flight recorder, it eventually surfaces as a NCCL timeout: a collective that does not complete within a configured window aborts the process group. The timeout is a symptom, not a diagnosis; the real questions are which rank failed to arrive and why, and the answer is almost always either a genuine hang upstream (a deadlocked data loader, a CPU-side stall) or a straggler so severe it crossed the timeout threshold. The third and most insidious class is silent data corruption: a flaky GPU or a bad NIC that does not fault but returns subtly wrong numbers. Nothing crashes; the loss merely behaves strangely. This is why uncorrectable ECC errors and NIC error counters earn dashboard space, and why some large runs periodically recompute a known quantity on all ranks and compare, treating a single disagreement as a corrupt-hardware signal to drain that node.
The practical doctrine that makes all of this tractable is rank-0-plus-sampled logging combined with flight recorders. You cannot afford full logs from every rank, but you can afford complete logs from rank 0, sampled logs from a fraction of the others, and a continuously-overwritten flight-recorder buffer on every rank that costs almost nothing until the moment it saves the investigation. The flight recorder is the black box of a training run: you hope never to read it, and when a hang strikes at hour forty of a forty-eight-hour job, it is the difference between a named root cause and a blind restart.
Aviation flight recorders are built to survive the very accident they document. A training flight recorder has the same job and an easier life: it just has to outlive a barrier stall. The cruel irony of a distributed hang is that the process holding the answer is the one that is stuck, so the recorder writes continuously to a buffer that a watchdog can dump even while the main thread blocks. You arm it before takeoff and forget it exists, right up until the run goes quiet and it tells you, calmly, that rank 5 never boarded the all-reduce.
4. Loss-Spike Forensics and Automated Babysitting Intermediate
The loss spike deserves its own forensic playbook, because at scale it is the most common reason a run dies expensively. When the loss jumps, the questions in order are: did the gradient norm spike with it (pointing at a numerical overflow or a pathological batch), did it correlate with a specific data shard (pointing at corrupt or anomalous input), and did it recover on its own within a few steps. A spike with a synchronized gradient-norm explosion that does not recover is the textbook case for the standard remedy: skip the offending batch, restore from the last good checkpoint, and lower the learning rate or tighten gradient clipping before resuming. The gradient-norm and activation statistics from Section 1 are the evidence that turns "the loss spiked" into "an overflow on a long-sequence batch spiked the gradient", and that distinction decides whether you resume or stop.
Because spikes arrive at unpredictable hours and a stalled run wastes money every second, large training operations run automated babysitting: a supervisor process that watches the live telemetry and acts without a human in the loop. It detects a spike against the run's own moving baseline, rolls back to the last good checkpoint, optionally skips the suspect batch, and restarts the workers, exactly the checkpoint-and-restart machinery built earlier in this chapter, now triggered by the monitor rather than by a crash. It also watches for the silent hang, declaring a run dead when throughput hits zero for longer than any healthy stall, and invoking the elastic restart of Section 18.4. The monitor, in other words, is what closes the loop between observing a failure and recovering from it; the demo below builds the detection core of exactly such a supervisor.
The monitor in Code 18.8.2 ingests per-rank step times and gradient norms each step, aggregates them with the cheap mergeable statistics of Section 2 (median, max, and a running mean and standard deviation), and raises two alerts an automated babysitter would act on: a straggler whose step time exceeds a multiple of the median, and a loss spike that breaks out of the run's own moving baseline. It injects a slow rank from step 7 and a spike at step 9 so both detectors fire on real, reproducible output.
import statistics, math
WORLD = 8
def rank_step_time(rank, step):
base = 0.50 # healthy per-step time (s)
if rank == 5 and step >= 7:
return base * 2.7 # rank 5 becomes a straggler
return base + 0.01 * ((rank * 7 + step) % 5) # mild healthy jitter
def loss_at(step):
curve = 3.0 * math.exp(-0.18 * step) + 0.20 # smooth descent
return curve + 1.9 if step == 9 else curve # injected spike at step 9
def grad_norm_at(step):
g = 0.9 + 0.05 * math.sin(step)
return g * 6.5 if step == 9 else g # spike inflates grad norm
STRAGGLER_FACTOR = 1.5 # flag a rank slower than 1.5x the median step time
SPIKE_SIGMA = 3.0 # flag loss above the window mean + 3 standard deviations
SPIKE_WINDOW = 4 # judge each loss against a short TRAILING window, not all
# history, so a smoothly descending curve does not inflate
# the baseline variance and hide the spike
hist_loss = []
print(f"{'step':>4} {'p50_t':>6} {'max_t':>6} {'slow_rank':>9} "
f"{'loss':>6} {'gnorm':>6} alerts")
for step in range(12):
times = [rank_step_time(r, step) for r in range(WORLD)]
p50 = statistics.median(times) # mergeable
slowest = max(range(WORLD), key=lambda r: times[r]) # mergeable max
max_t = times[slowest]
loss, gnorm = loss_at(step), grad_norm_at(step)
alerts = []
if max_t > STRAGGLER_FACTOR * p50: # straggler = slow rank vs median
wait = max_t - p50 # the collective-wait it imposes
alerts.append(f"STRAGGLER rank{slowest} ({max_t/p50:.1f}x p50, "
f"{wait:.2f}s collective-wait)")
if len(hist_loss) >= SPIKE_WINDOW: # need a baseline before judging
recent = hist_loss[-SPIKE_WINDOW:] # only the last few steps
mu, sd = statistics.mean(recent), statistics.pstdev(recent) or 1e-9
if loss > mu + SPIKE_SIGMA * sd: # spike vs the run's recent trend
alerts.append(f"LOSS_SPIKE {loss:.2f} > mu+{SPIKE_SIGMA:.0f}sigma "
f"({mu + SPIKE_SIGMA*sd:.2f}); gnorm {gnorm:.1f}")
hist_loss.append(loss)
slow_label = f"rank{slowest}" if any("STRAGGLER" in a for a in alerts) else "-"
print(f"{step:>4} {p50:>6.2f} {max_t:>6.2f} {slow_label:>9} "
f"{loss:>6.2f} {gnorm:>6.2f} {'; '.join(alerts) if alerts else 'ok'}")
step p50_t max_t slow_rank loss gnorm alerts
0 0.52 0.54 - 3.20 0.90 ok
1 0.52 0.54 - 2.71 0.94 ok
2 0.52 0.54 - 2.29 0.95 ok
3 0.52 0.54 - 1.95 0.91 ok
4 0.53 0.54 - 1.66 0.86 ok
5 0.52 0.54 - 1.42 0.85 ok
6 0.52 0.54 - 1.22 0.89 ok
7 0.53 1.35 rank5 1.05 0.93 STRAGGLER rank5 (2.6x p50, 0.83s collective-wait)
8 0.52 1.35 rank5 0.91 0.95 STRAGGLER rank5 (2.6x p50, 0.83s collective-wait)
9 0.53 1.35 rank5 2.69 5.98 STRAGGLER rank5 (2.6x p50, 0.83s collective-wait); LOSS_SPIKE 2.69 > mu+3sigma (1.72); gnorm 6.0
10 0.53 1.35 rank5 0.70 0.87 STRAGGLER rank5 (2.6x p50, 0.83s collective-wait)
11 0.53 1.35 rank5 0.61 0.85 STRAGGLER rank5 (2.6x p50, 0.83s collective-wait)
Who: An infrastructure engineer on a team pretraining a large language model on 1,024 GPUs.
Situation: A scheduled 48-hour run went silent at hour 40: throughput dropped to zero, no process crashed, every GPU showed 100% utilization.
Problem: With every rank apparently busy and no stack trace anywhere, there was nothing to read; restarting blindly risked losing forty hours of progress to a bug that would simply recur.
Dilemma: Restart immediately and hope the hang was transient, or spend time diagnosing a system with 1,024 identical-looking processes, none of which had reported an error.
Decision: They diagnosed first, because the flight recorder had been armed at launch and a blind restart of a deterministic bug would only waste another forty hours.
How: The NCCL timeout dumped every rank's trace buffer. 1,023 ranks were blocked entering the same all-reduce; one rank was one collective behind, stuck in a data-loader call that had deadlocked on a corrupt input file.
Result: They removed the bad shard, restored the last good checkpoint, and resumed; total lost time was under an hour instead of the run. The flight-recorder dump named the culprit rank in minutes.
Lesson: A hang with no error is not undebuggable; it is the case the flight recorder exists for. Arm it before takeoff, because you cannot add it after the run goes quiet.
The largest published runs have made observability a first-class subsystem. Meta's reporting on Llama 3 training (Dubey et al., 2024) documented that the dominant disruptions on a 16,000-GPU cluster were hardware faults, and that automated detection plus fast restart, not heroics, kept effective training time high. A parallel research thread targets silent data corruption: Google and others have published on detecting the flaky accelerators that return wrong numbers without faulting, since at frontier scale even a one-in-a-million per-operation corruption rate becomes a near-certain per-run event. PyTorch's NCCL flight recorder and the broader push toward standardized collective-trace tooling reflect the same shift: at tens of thousands of ranks, the bottleneck on training reliability is no longer the recovery mechanism but the detection of which rank, of which fault, at which step, and the field now treats fast, automatic fault localization as the frontier problem of fault-tolerant training.
5. Chapter Summary: The Operating Doctrine for Large Runs Beginner
This chapter began from a single, uncomfortable premise and built an engineering response to it. At a thousand GPUs and a multi-day run, failure is not an exception to plan around; it is the steady state, the expected weather. Every mechanism in the chapter is a response to that premise. Distributed checkpointing makes progress durable so a crash costs minutes, not days, by writing sharded snapshots asynchronously off the critical path. Clean restart and replay make recovery deterministic, so a resumed run is the run that crashed and not a subtly different one. Elastic training lets the job shrink and grow with available hardware instead of dying when a node leaves. Straggler detection and mitigation attack the slow rank that taxes everyone at the barrier. Preemption and spot-instance training turn cheap, interruptible hardware into a viable substrate by making interruption survivable. Memory offload across the hierarchy lets a model that does not fit in GPU memory train anyway by staging state through host memory and storage. And this section, monitoring, is what lets you see all of it working, or failing, in time to act.
Failure is the norm at thousand-GPU scale, so engineer for it deliberately: (1) checkpoint often and asynchronously so progress is durable; (2) restart cleanly and deterministically so recovery resumes the same run; (3) go elastic so the job survives nodes leaving and joining; (4) hunt stragglers because the slowest rank sets the pace for all the rest; (5) ride spot and preemptible capacity by making interruption survivable; (6) offload to fit when the model exceeds device memory; and underneath every one of these, (7) observe everything that matters, throughput and MFU, loss shape, per-rank step time, gradient norms, and hardware health, aggregated cheaply with mergeable sketches; so that (8) you can act, closing the loop from a flight-recorder dump or a loss-spike alert to an automatic rollback and restart. A large run does not stay alive because nothing breaks; it stays alive because everything that breaks is seen, localized, and recovered before it costs a week.
That doctrine is the foundation the next chapter stands on. Chapter 18 taught how to keep any large run alive; Chapter 19 turns to the hardest large run of all, training a foundation model from scratch, where the data construction, the orchestration of months of compute, and the economics all assume the elasticity, fault tolerance, and observability built here. Everything that follows in Chapter 19 presumes a run that can checkpoint, restart, scale, and be watched; this chapter is what makes that presumption safe.
Section 1 separates headline signals (throughput/MFU, loss shape, per-rank step time) from diagnostic ones (gradient norms, activation statistics, hardware health, data-pipeline depth). For each of the following observed symptoms, name which headline signal would alert first and which diagnostic signal would most likely explain the root cause: (a) MFU drifts from 41% to 34% over an hour with no loss change; (b) the loss spikes sharply and does not recover; (c) throughput is healthy but the loss curve plateaus far above expectations. Explain why mixing up the headline and the diagnostic role would send you debugging the wrong subsystem.
Starting from Code 18.8.2, add two capabilities a real babysitter needs. First, add a hang detector that raises an alert when the median step time is zero (or exceeds a hard ceiling) for more than two consecutive steps, simulating a stalled run. Second, replace the exact median with a small mergeable quantile sketch: maintain a fixed-size reservoir or bucketed histogram per step instead of all eight values, and report an approximate p50 and p99 step time from it. Verify both the straggler and the spike still fire, and discuss how the sketch's memory stays constant as you scale the simulated world size from 8 to 8,000 ranks.
Suppose every one of $N = 1024$ ranks emits $M = 100$ metrics per step, each a 4-byte float plus a small key, and the run does 2 steps per second. Estimate the points-per-second and the bytes-per-second if all ranks log everything to a central collector, then re-estimate under the rank-0-plus-1-in-64-sampled policy of Code 18.8.1 with node-local pre-aggregation reducing the per-step payload to a few merged summaries. Argue from the two numbers why naive full logging recreates the all-to-one bottleneck the rest of the chapter works to avoid, and relate the $O(N)$ versus $O(\log N)$ aggregation cost to the collective-communication trees of Chapter 15.
These projects turn the chapter's doctrine into something that runs. Each one is sized for a small team or a determined individual, and each exercises several of the eight moves from the Key Takeaway together.
1. A fault-tolerant training loop with checkpoint, resume, and straggler detection. Build a small multi-process data-parallel training loop (eight processes on one machine is enough) that checkpoints asynchronously every $K$ steps, detects and recovers from a killed worker by restoring the last checkpoint and restarting the group, and runs the monitor of Code 18.8.2 as a side thread that flags a straggler you inject by sleeping one rank. Measure the wasted GPU time the straggler causes and the recovery time after a kill, and report both as the metrics that justify each mechanism.
2. A loss-spike babysitter with automatic rollback. Wrap a real training run (a small transformer on a public dataset) in a supervisor that watches loss and gradient norm against a moving baseline, and on a spike automatically skips the suspect batch, restores the last good checkpoint, lowers the learning rate, and resumes, logging every intervention. Inject reproducible spikes (a deliberately bad batch) and report how much training the babysitter saves versus an unsupervised run that diverges.
3. A mergeable-telemetry aggregator at simulated scale. Implement the hierarchical aggregation of Section 2 with a quantile sketch (t-digest or a bucketed histogram), simulate 1,000 to 10,000 ranks each producing a step-time stream, and show that p50/p99 estimates stay within a target error while the per-step network payload and memory stay constant as the rank count grows. Benchmark it against a naive gather-everything baseline and plot the $O(N)$ versus $O(\log N)$ crossover.