Section 2.4: Fault Tolerance and Recovery

"On one machine, failure was an event. On ten thousand machines, failure became the weather, and I had to learn to forecast it."
A Coordinator That Stopped Counting on Luck

Big Picture

At the scale where distributed AI lives, individual components do not fail occasionally; something is broken almost all of the time, so correctness can no longer assume that every machine completes its work. A job on one server either runs to the end or it does not, and you simply rerun it. A job on a thousand servers runs inside a population where, at any given moment, a disk is full, a link is congested, or a process has just died. The arithmetic of independent failure makes this unavoidable: even very reliable nodes, multiplied by enough of them, guarantee that some node is down. This section establishes the failure models worth designing against, derives the probability that pushes failure from a rare event to the expected case, and surveys the four recovery mechanisms (checkpoint/restart, deterministic re-execution, lineage, and replication) that every later chapter reuses to keep large jobs alive.

In the previous section we partitioned data and parameters across machines so that no single node had to hold everything. That very partitioning is what now exposes us to failure. The moment a computation depends on $K$ machines instead of one, its success depends on all $K$ of them staying up long enough, and the more machines we recruit to go faster, the more ways the whole has to break. This is the central tension of scale-out: the same multiplicity that buys throughput also multiplies the failure surface. Fault tolerance is the discipline of buying back reliability that distribution spent, and it is not optional infrastructure bolted on at the end; it is a property you design for from the first line, because a training run that cannot survive a single preempted worker will not survive a week on a real cluster.

We proceed in three moves. First we name the ways a component can fail, from the merciful (it simply stops) to the malicious (it lies). Then we do the short calculation that turns "failures are rare" into "failures are certain," which is the reason this section exists at all. Finally we walk the recovery mechanisms, each one a different answer to the question "how do we make progress survive a death," and we trace each forward to the chapter where distributed AI puts it to work.

1. The Failure Models Worth Designing Against Beginner

Not all failures are the same, and the cost of tolerating one depends sharply on how badly it can misbehave. Distributed systems theory orders failure models from weakest (easiest to tolerate) to strongest (hardest), and choosing the right model for your setting is the difference between a design that is robust and one that is needlessly expensive. The four models that matter for distributed AI are crash-stop, omission, partition, and Byzantine, summarized in Table 2.4.1.

In the crash-stop (or fail-stop) model, a node works correctly until some instant and then halts permanently, producing no further output. A worker whose machine loses power, whose process is killed by the out-of-memory reaper, or whose spot instance is reclaimed by the cloud provider all behave this way: they were correct, then they were gone. This is the friendliest model, because a survivor can safely assume that anything it did not hear from the dead node simply never happened, and almost all production training systems are designed against exactly this assumption.

The omission model is slightly harder: a node or link drops some messages but not others, so a request or a gradient update can silently vanish without the sender crashing. Closely related, and the dominant headache in practice, is the network partition, in which a working group of nodes is split into subgroups that can each talk internally but cannot reach each other. The cruelty of a partition is that, from inside each subgroup, the nodes on the other side are indistinguishable from crashed: you cannot tell "it died" from "I cannot reach it." This ambiguity is the seed of the consistency dilemmas that Section 2.5 formalizes as the CAP trade-off.

The Byzantine model is the strongest and the rarest: a node may behave arbitrarily, sending different and even maliciously crafted messages to different peers, whether from a bug, corrupted memory, or an actual adversary. Tolerating Byzantine faults is expensive (classical agreement needs more than two-thirds of nodes honest) and most data-center training quite reasonably assumes it away. It returns as a first-class concern only where participants are untrusted, most prominently in federated learning, where a malicious client can poison the global model by reporting a corrupted gradient; the Byzantine-robust aggregation rules that defend against this are developed in Chapter 35.

Table 2.4.1: The four failure models, ordered from easiest to hardest to tolerate, with the distributed-AI setting where each one binds. A system designed against a stronger model also tolerates all weaker ones, at higher cost.

Model	What the faulty node does	Where it binds in distributed AI
Crash-stop	Works correctly, then halts forever	Spot-instance preemption, OOM kills, power loss (the default assumption)
Omission	Silently drops some messages	Lossy networks, overloaded parameter servers dropping updates
Partition	Splits into groups that cannot reach each other	Rack or switch failure; "is it dead or unreachable?" ambiguity
Byzantine	Behaves arbitrarily, may lie to different peers	Federated learning with untrusted clients; gradient poisoning

Key Insight: Match the Failure Model to the Threat, Not to Paranoia

The strength of failure model you tolerate is a cost decision, not a measure of caution. Designing a data-center training job against the Byzantine model would multiply its cost for protection it does not need, because the machines are yours and the threat is a dead worker, not a lying one. Designing a federated system against only crash-stop would leave it open to a single poisoned client. The skill is to identify the weakest model that still covers the faults your environment actually produces, then design exactly to it: crash-stop for the cluster you own, Byzantine-robust only where you do not trust the participants.

2. Why Failure Becomes Certain: The Independent-Failure Math Intermediate

The reason fault tolerance is unavoidable at scale is not engineering pessimism; it is arithmetic. Suppose each node, over some window of operation, fails independently with a small probability $p$. A cluster has $K$ such nodes. The probability that a particular node survives the window is $1 - p$, and because the failures are independent, the probability that all $K$ survive is the product of the individual survival probabilities,

$$P(\text{all } K \text{ survive}) = (1 - p)^K.$$

The quantity we actually care about, the probability that the job is disrupted by at least one failure, is the complement,

$$P(\text{at least one of } K \text{ fails}) = 1 - (1 - p)^K.$$

This expression is small for small $K$ but climbs toward $1$ as $K$ grows, and it climbs fast. For tiny $p$ we can read off the rate of climb with the approximation $(1-p)^K \approx e^{-pK}$, so the failure probability is roughly $1 - e^{-pK}$; once the product $pK$ reaches the order of $1$, disruption is more likely than not. The expected number of failed nodes is even simpler, $\mathbb{E}[\text{failures}] = pK$ by linearity of expectation, which makes the threshold concrete: when you have enough nodes that you expect on average one of them to fail in a window, you must plan for failures within that window. The code below tabulates both quantities for a fixed per-node failure probability across realistic cluster sizes.

Figure 2.4.1: The failure-probability curve $1 - (1-p)^K$ for a per-node failure probability $p = 0.01$, plotted against cluster size $K$ on a logarithmic axis. At $K = 1$ the disruption probability equals $p$; by $K \approx 100$ it has crossed one half (the marked crossover), and by $K = 1000$ it is essentially one. The numbers behind this curve are produced by Code 2.4.1 and printed in Output 2.4.1.

import math

# Each node fails independently in a time window with probability p.
# With K independent nodes, P(no node fails) = (1-p)^K, so
# P(at least one fails) = 1 - (1-p)^K, and E[failures] = p*K.
p = 0.01  # 1% chance a single node fails during the window
print("Per-node failure probability p =", p)
print(f"{'K':>7} | {'P(>=1 fails)':>14} | {'E[failures]':>12}")
print("-" * 40)
for K in [1, 10, 100, 1000, 10000]:
    p_any  = 1.0 - (1.0 - p) ** K        # complement of all-survive
    e_fail = K * p                        # expected number of dead nodes
    print(f"{K:>7} | {p_any:>14.6f} | {e_fail:>12.2f}")

Code 2.4.1: Computing the disruption probability and the expected failure count as a function of cluster size, directly from the independent-failure identity. No simulation is needed; the closed form is exact.

Per-node failure probability p = 0.01
      K |   P(>=1 fails) |  E[failures]
----------------------------------------
      1 |       0.010000 |         0.01
     10 |       0.095618 |         0.10
    100 |       0.633968 |         1.00
   1000 |       0.999957 |        10.00
  10000 |       1.000000 |       100.00

Output 2.4.1: A per-node failure chance of just one percent gives a single machine excellent odds, but at one thousand machines the probability that some machine fails is $0.999957$, and at ten thousand it rounds to one. This is the curve of Figure 2.4.1 in numbers.

Read the table from top to bottom and the thesis of this entire section appears as a column. At one node, you can ignore failure and rerun on the rare bad day. At one hundred nodes, a job is disrupted more often than not, and you expect to lose one machine per window on average. At one thousand nodes, disruption is a near certainty on every run, and the question is no longer whether a worker dies during your job but how many and how often. A training run that takes hours on thousands of accelerators will, with overwhelming probability, outlive at least one of its workers, so the recovery machinery of the next section is not insurance against an unlikely event; it is the mechanism by which the job finishes at all.

Fun Note: The Birthday Problem Wearing a Hard Hat

The shape of $1 - (1-p)^K$ is the same shape that makes the birthday paradox feel surprising: a small per-item probability, compounded over enough items, becomes a near certainty far sooner than intuition expects. Twenty-three people suffice for a shared birthday to be more likely than not; about seventy nodes at one percent each suffice for a failure to be more likely than not. Operators who have run large clusters stop being surprised and start pre-allocating spare nodes, the cluster-engineering equivalent of not betting against a shared birthday in a crowded room.

3. Four Ways to Make Progress Survive a Death Intermediate

Given that failures are certain, recovery is the art of ensuring that a node's death does not erase the work the cluster has already done. Four mechanisms dominate distributed AI, and they are best understood as four different answers to one question: what do we save, and how do we rebuild what was lost? They are checkpoint/restart, deterministic re-execution, lineage-based recovery, and replication. Most real systems combine several of them.

Checkpoint/restart is the most general and the most widely used in deep learning. At intervals the job writes its entire recoverable state (model parameters, optimizer moments, the data-loader position, the random-number-generator seeds) to durable storage. When any worker dies, the job restarts every worker from the most recent checkpoint and replays forward. Nothing about the computation needs to be deterministic or re-derivable; you simply paid, in advance, to be able to rewind. The cost is the wasted work between the last checkpoint and the crash, plus the time spent writing checkpoints that are never used, which sets up the optimization problem we solve in Section 4. This mechanism is the backbone of fault-tolerant training, and the elastic variant that resizes the worker set on the fly without losing the checkpoint is the subject of Chapter 18.

Deterministic re-execution is cheaper when it applies. If a unit of work is a pure function of inputs that still exist somewhere, you do not need to have saved its output; you can simply run it again. This is the recovery model of MapReduce: each map and reduce task is deterministic over its input split, so when a worker dies, the scheduler reassigns its tasks to a survivor and reruns them from the durable input, with no global checkpoint at all. The completed work of other tasks is untouched. We develop this re-execution model and its shuffle in Chapter 6; its key limitation is that it demands determinism and re-readable inputs, which a long stateful training loop does not naturally provide.

Lineage-based recovery generalizes re-execution to a graph of transformations. Instead of remembering the data, the system remembers the recipe: the sequence of deterministic operations that produced each partition from durable source data. When a partition is lost, the system recomputes only that partition by replaying its slice of the recipe, not the whole job. This is exactly how Spark's Resilient Distributed Datasets recover; the lineage graph is the fault-tolerance mechanism, letting Spark avoid replicating intermediate data while still surviving the loss of any partition, as Chapter 7 details. Lineage is elegant precisely when the computation is a deep chain of deterministic transforms; it degrades when the chain is long and recomputation gets expensive, which is why Spark also offers optional checkpointing to truncate lineage.

Replication takes the opposite stance: rather than rebuild lost state, keep more than one copy of it so that a failure of one copy is invisible. A parameter shard held on three nodes survives the loss of two; a storage block written to three disks survives two disk failures. Replication trades storage and write bandwidth for instant recovery and continued availability during a failure, which is why it dominates the storage and serving tiers (the replicated parameter servers of Chapter 11 are a clean example) even as checkpointing and lineage dominate the compute tier. Table 2.4.2 places the four mechanisms side by side.

Table 2.4.2: The four recovery mechanisms, what each one stores in order to recover, and the distributed-AI system that exemplifies it. Real systems mix them: training checkpoints its compute state while replicating its storage.

Mechanism	What it saves to recover	Recovers by	Exemplar
Checkpoint/restart	Full periodic snapshot of job state	Rewinding all workers to the last snapshot	Deep-learning training (Ch 18)
Re-execution	Nothing; relies on durable inputs	Rerunning the deterministic failed task	MapReduce (Ch 6)
Lineage	The recipe (graph of transforms)	Replaying only the lost partition's slice	Spark RDDs (Ch 7)
Replication	Multiple live copies of the state	Reading from a surviving copy	Storage tiers, parameter servers (Ch 11)

Practical Example: The Pretraining Run That Lost a Node Every Few Hours

Who: A systems engineer on a team pretraining a large language model on a 512-GPU cluster of preemptible cloud instances.

Situation: The full run was budgeted at three weeks of wall-clock time across all 512 workers, all data-parallel replicas of one model.

Problem: Preemptions and hardware faults killed a worker every few hours on average, and the original loop restarted the entire job from scratch on each death, so the run never progressed past two days before a failure reset it.

Dilemma: Checkpoint very often (paying a heavy write tax on every interval) to keep the lost-work small, or checkpoint rarely (cheap writes) and risk discarding many hours of compute on each crash; neither extreme finished in budget.

Decision: They adopted periodic checkpoint/restart with the interval set by the optimization in Section 4, and made restart elastic so a single dead worker no longer reset the others.

How: Every worker's parameters, optimizer state, and data-loader cursor were snapshotted to object storage; on a preemption, the launcher rebuilt the process group at the surviving size and resumed all workers from the latest snapshot rather than from step zero.

Result: The wasted-work overhead settled near five percent, each preemption cost minutes instead of days, and the run finished inside its three-week budget despite losing dozens of workers along the way.

Lesson: At a scale where failure is certain, the right question is never "how do we avoid it" but "how cheaply can we recover," and the answer is a checkpoint interval tuned to the failure rate, not the most frequent checkpointing you can afford.

4. Tuning the Checkpoint Interval Advanced

Checkpointing has a sweet spot. Checkpoint too often and you spend the run writing snapshots you will never read; checkpoint too rarely and each crash discards a large block of work. We can find the optimum with a short model, due in its first-order form to Young and refined by Daly. Let $C$ be the time to write one checkpoint, let the job's mean time to failure be $M$, and suppose we checkpoint every $T$ seconds. The overhead has two parts: a checkpoint tax of $C/T$ (the fraction of time spent writing) and an expected re-do cost of $T/(2M)$ (a failure lands on average halfway through an interval, so we lose about $T/2$ of work, scaled by the failure rate $1/M$). The total wasted-time fraction is

$$f(T) = \frac{C}{T} + \frac{T}{2M}.$$

Minimizing over $T$ by setting $f'(T) = -C/T^2 + 1/(2M) = 0$ gives the classic result that the optimal interval is the geometric balance of the two costs,

$$T^{\star} = \sqrt{2\,C\,M}, \qquad f(T^{\star}) = \sqrt{\frac{2C}{M}}.$$

The optimum is intuitive: checkpoint more often when writes are cheap or failures are frequent (small $M$), and less often when writes are expensive or the system is reliable. The code below evaluates this for a job that writes a checkpoint in 30 seconds and has a 6-hour mean time to failure.

# Optimal checkpoint interval (Young/Daly first-order model).
# Overhead f(T) = C/T + T/(2M): checkpoint tax + expected re-done work.
# Minimized at T* = sqrt(2*C*M), giving min overhead sqrt(2C/M).
C = 30.0          # seconds to write one checkpoint
M = 6 * 3600.0    # job mean time to failure = 6 hours, in seconds
T_opt = math.sqrt(2.0 * C * M)
print(f"Checkpoint write cost C = {C:.0f} s, job MTTF M = {M/3600:.0f} h")
print(f"Optimal interval T* = sqrt(2*C*M) = {T_opt:.0f} s (~{T_opt/60:.1f} min)")

def overhead(T):
    return C / T + T / (2.0 * M)

print(f"\n{'T (s)':>8} | {'overhead %':>10}")
print("-" * 24)
for T in [120, 300, T_opt, 1200, 3600]:
    print(f"{T:>8.0f} | {overhead(T)*100:>10.3f}")
print(f"\nMinimum overhead at T* = {overhead(T_opt)*100:.3f} %")

Code 2.4.2: Solving for the optimal checkpoint interval and sweeping nearby intervals to show the overhead is minimized at $T^{\star} = \sqrt{2CM}$. The sweep makes the flatness of the optimum visible: being a little off is cheap, being far off is not.

Checkpoint write cost C = 30 s, job MTTF M = 6 h
Optimal interval T* = sqrt(2*C*M) = 1138 s (~19.0 min)

   T (s) | overhead %
------------------------
     120 |     25.278
     300 |     10.694
    1138 |      5.270
    1200 |      5.278
    3600 |      9.167

Minimum overhead at T* = 5.270 %

Output 2.4.2: For a 30-second checkpoint and a 6-hour MTTF, the optimal interval is about 19 minutes and the unavoidable overhead is roughly five percent. Checkpointing every two minutes wastes five times as much time as the optimum, all of it in writes nobody reads.

Two lessons sit inside Output 2.4.2. First, the cost of over-checkpointing is real and large: dropping from the optimal 19-minute interval to a nervous 2-minute interval quintuples the overhead, because the write tax $C/T$ explodes as $T$ shrinks. Second, the optimum is broad and forgiving; intervals of 19 and 20 minutes differ by a hundredth of a percent, so you do not need $M$ precisely, only its order of magnitude. As clusters grow, $M$ falls (more nodes, more failures, exactly the math of Section 2), which pulls $T^{\star}$ down and pushes the floor overhead $\sqrt{2C/M}$ up. That rising floor is precisely the pressure behind the research on cheaper checkpoints below.

Research Frontier: Asynchronous Checkpointing and Large-Job Resilience (2024 to 2026)

Because the checkpoint tax $C/T$ grows as jobs scale and $M$ shrinks, an active research line attacks the cost $C$ itself. Asynchronous and in-memory checkpointing systems such as PyTorch's Distributed Checkpoint with asynchronous staging, CheckFreq, and Gemini (Wang et al., 2023) snapshot to a fast tier (host memory or a peer's memory) and flush to durable storage in the background, so the worker barely pauses; this drives the effective $C$ toward zero and lets jobs checkpoint far more often almost for free. A complementary thread targets large-job resilience directly: Meta's reporting on training Llama 3 on 16,000 GPUs documented a mean time between failures of only hours and motivated fast-restart and redundancy tooling, while Oobleck and Bamboo pursue redundant pipeline replicas that let a job continue through a failure without any restart at all, blending replication into the training tier. The common goal is to keep the floor overhead $\sqrt{2C/M}$ small even as $M$ collapses at extreme scale; we return to these elastic and redundant training methods with the machinery to evaluate them in Chapter 18.

Library Shortcut: torch.distributed.checkpoint Does the Snapshot for You

In Code 2.4.2 we reasoned about the checkpoint interval but said nothing about the considerable work of actually snapshotting a sharded model spread across hundreds of workers: gathering each shard, writing it durably, and reloading it onto a possibly different number of workers after a restart. PyTorch's Distributed Checkpoint (DCP) handles all of that, including the asynchronous staging from the research frontier above, behind two calls:

import torch.distributed.checkpoint as dcp

# Save: each rank writes only its own shards in parallel to durable storage.
dcp.save(state_dict, checkpoint_id="ckpt/step_5000")

# Restart (possibly at a DIFFERENT world size): load reshards automatically.
dcp.load(state_dict, checkpoint_id="ckpt/step_5000")

Code 2.4.3: Sharded, restart-safe checkpointing in two calls. The roughly hundred lines of gather, serialize, write, and reshard-on-load that a hand-rolled checkpointer needs collapse into dcp.save and dcp.load, which also handle parallel per-rank writes and the resharding that makes elastic restart (Chapter 18) possible.

We now have the full picture: the failure models that describe how a component can break, the independent-failure math that proves at least one will, and the four recovery mechanisms (with checkpoint/restart tuned by $T^{\star} = \sqrt{2CM}$) that turn a certain death into a survivable one. What we have quietly assumed throughout is that a survivor can always tell a dead node from a slow one and always agree with its peers on what state to recover to. A network partition breaks both assumptions, forcing a choice between staying consistent and staying available. That choice, and the staleness it produces in distributed training, is the subject of the next section.

Exercise 2.4.1: Read the Crossover Off the Curve Conceptual

Using the approximation $P(\text{at least one fails}) \approx 1 - e^{-pK}$ from Section 2, derive the cluster size $K$ at which disruption becomes more likely than not (set the expression equal to $\tfrac{1}{2}$ and solve for $K$). Evaluate it for $p = 0.01$ and confirm it lands near the crossover marked in Figure 2.4.1. Then explain in one sentence why halving $p$ roughly doubles that crossover size, and why that is weak comfort for someone trying to run on ten thousand nodes.

Exercise 2.4.2: Sweep the Checkpoint Interval Against Failure Rate Coding

Extend Code 2.4.2 into a function best_interval(C, M) that returns $T^{\star}$ and the resulting overhead. Call it for a fixed $C = 30$ seconds across mean-time-to-failure values $M \in \{30\text{ min}, 1\text{ h}, 6\text{ h}, 24\text{ h}\}$, and tabulate $T^{\star}$ and the floor overhead $\sqrt{2C/M}$. Confirm numerically that as $M$ falls (more nodes, more failures), the optimal interval shrinks and the unavoidable overhead rises, and state what that implies for whether a 50,000-GPU job can afford a 30-second synchronous checkpoint at all.

Exercise 2.4.3: Match the Mechanism to the Workload Analysis

For each workload, name which of the four recovery mechanisms from Section 3 (checkpoint/restart, re-execution, lineage, replication) you would choose first, and argue why the other three fit worse: (a) a stateless batch job that tokenizes a 40-terabyte web crawl with deterministic per-shard tasks; (b) a stateful 1,000-GPU training loop whose optimizer state cannot be re-derived; (c) a chain of fifteen deterministic Spark transforms over durable source data; (d) an online parameter server that must keep serving reads during a node failure. Identify which workload could reasonably combine two mechanisms and say which two.

The next section turns the partition ambiguity we kept setting aside, the inability to tell a dead node from an unreachable one, into a precise statement about what a distributed system can and cannot guarantee, and connects it to the staleness that distributed training tolerates on purpose. Continue to Section 2.5: Consistency Models: From Parameter Staleness to the CAP Trade-off.