Part VII: Cluster, Edge, and Reliable Infrastructure
Chapter 35: Reliable and Secure Distributed AI

Fault Tolerance and Recovery

"I am a checkpoint on a disk, hoping I am never needed but glad I exist. Every hour I grow a little staler, and every hour I am a little more grateful that nothing has gone wrong yet."

A Checkpoint That Has Not Been Restored From, Yet
Big Picture

Fault tolerance is the discipline of letting a distributed AI system lose components without losing its work, by holding enough redundant state that any survivor can reconstruct what a casualty held. The previous section established that at a thousand machines something is always broken; this section is the answer to "so what do we do about it?" There are only a few primitive moves, and every production system is a combination of them: write the state down periodically so you can roll back to it (checkpoint and restart), keep a live copy running elsewhere so you can switch instantly (replication and standby), spread the bytes of stored state across machines so a few losses are recoverable (redundancy and erasure coding), and keep a deterministic recipe so you can recompute a lost piece from scratch (lineage). The distinctive twist for AI is coupling: a synchronous thousand-GPU training step is one logical computation, so a single dead worker rolls back all of them, which makes the cost and frequency of checkpointing a first-order design parameter rather than housekeeping.

In the previous section we measured the failures: across a large fleet, the question is not whether a component dies during a long job but how many die and how often. This section turns to the mechanisms that survive those deaths. They are not specific to AI; checkpointing, replication, and erasure coding are decades-old systems techniques. What is specific to AI is the shape of the workload they protect. A web server is a pool of independent request handlers, and losing one costs one request. A large-scale training run is a single tightly coupled computation whose workers advance in lockstep through a sequence of synchronized steps, so losing one worker corrupts the global state and forces every worker back to the last point where the global state was consistent. That coupling is why a chapter on reliability must treat the cost of recovery, not just its correctness, as the central concern.

1. Checkpoint and Restart: The Workhorse Beginner

The simplest durable recovery mechanism is to periodically write the system's state to stable storage and, after a failure, reload the most recent such snapshot and resume. For training this state is concrete: the model parameters, the optimizer state (momentum buffers, Adam's running moments), the data-loader position, and the step counter and learning-rate schedule. A checkpoint that omits the optimizer state restarts a different optimization trajectory and is a common silent bug. The recovery contract is that a restored job is statistically indistinguishable from one that never failed, which requires the snapshot to be a globally consistent cut: every worker's state taken at the same logical step, with no all-reduce half-finished across the boundary. In synchronous training this consistency is free, because the natural checkpoint point is between steps, when no collective is in flight.

wall-clock time ckpt ckpt (last good) ✗ failure work since ckpt is lost rollback: reload last checkpoint resume A synchronous job: all workers share one timeline one worker's death rolls back every worker
Figure 35.2.1: The checkpoint and restart cycle. Solid green is committed work; the dashed red segment between the last checkpoint and the failure is wasted and must be recomputed after the rollback. Because a synchronous training step couples all workers, the rollback applies to the entire job, not just the failed worker.

The cost of checkpointing has two parts that pull in opposite directions, and naming them is the whole design problem. Checkpoint too rarely and each failure throws away a large slab of work, the dashed segment in Figure 35.2.1. Checkpoint too often and the time spent writing snapshots, which buys nothing when no failure occurs, dominates. Let $C$ be the wall-clock cost of writing one checkpoint and $\tau$ the interval of useful work between checkpoints. If a failure strikes uniformly at random within an interval, the expected wasted work per failure is half the interval, $\tau/2$, plus the restart cost. The expected fraction of time lost to checkpointing overhead, combining the periodic write cost and the per-failure rollback, is approximately

$$\text{overhead}(\tau) \;\approx\; \underbrace{\frac{C}{\tau}}_{\text{writing}} \;+\; \underbrace{\frac{\tau/2 + R}{\text{MTBF}}}_{\text{rollback}},$$

where $R$ is the restart latency and $\text{MTBF}$ is the mean wall-clock time between job failures. The first term falls as $\tau$ grows; the second rises. Minimizing their sum gives the classic Young/Daly optimal interval $\tau^\star \approx \sqrt{2\,C\cdot\text{MTBF}}$, the same result we derived for spot-instance and preemption economics in Section 33.8. The point worth internalizing is that the optimum scales with the square root of the MTBF: a fleet that fails twice as often should checkpoint only about $1.4$ times as frequently, not twice as frequently.

Key Insight: At Scale, One Worker's Death Is Every Worker's Rollback

In a synchronous data-parallel job, the $N$ workers are not $N$ independent jobs; they are one computation with $N$ physical replicas of the control flow. The global state is consistent only at step boundaries, so when any single worker dies the only consistent state to return to is the last checkpoint, and every surviving worker must discard its progress and reload. This is why checkpoint cost is first-order at scale: the price of a failure is not "one worker's lost work" but "$N$ workers' lost work," and the frequency that price is paid grows with $N$ because more workers means a shorter MTBF. The cure (checkpoint more often) is itself an $N$-wide synchronized stall, so the optimization in the equation above is unavoidable, not optional.

Several refinements attack the cost $C$ directly. Asynchronous checkpointing snapshots the parameters into a staging buffer (often pinned host memory) and lets training continue while a background thread flushes that buffer to storage, hiding most of $C$ behind useful compute. In-memory and redundant checkpointing keeps a copy of each worker's state in the memory of a peer rather than on slow shared storage, so a single failure recovers from a neighbor at memory speed and only correlated failures fall back to disk. Sharded checkpointing is mandatory for huge models: a trillion-parameter model's state is terabytes, and no single node can write it, so each shard owner writes its own slice in parallel, turning a serial terabyte write into $N$ concurrent gigabyte writes. These are the mechanisms that make $\tau^\star$ achievable in practice; Chapter 18 is the deep dive on engineering them into a real training loop.

Library Shortcut: Distributed, Sharded Checkpoints in a Few Lines

A single-process job persists state with one call, torch.save({...}, path), restored with torch.load. That does not scale: it would gather a terabyte model to one rank and write it serially. PyTorch's distributed checkpoint API instead writes a single logical checkpoint as many parallel shards, one per rank, and re-shards transparently on load even if you resume on a different number of GPUs:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import (
    get_state_dict, set_state_dict)

# SAVE: every rank writes only its own shard, in parallel, to shared storage.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_42000")

# LOAD: re-shards automatically, even onto a different world size after a restart.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_42000")
set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)
Code 35.2.1: Sharded distributed checkpointing with torch.distributed.checkpoint. Roughly a hundred lines of manual gather, slice, and re-shard logic collapse to two calls; the library handles parallel sharded writes, the resharding needed when a restart changes the world size, and the metadata that ties the shards into one logical checkpoint.

2. Replication and Standby: No Rollback at All Intermediate

Checkpointing accepts a rollback as the price of cheap durability. For components that must not lose any work, the alternative is to run a live redundant copy that can take over with little or no lost state. The general technique is state machine replication: model the component as a deterministic state machine, feed every replica the identical sequence of inputs through a consensus protocol, and each replica independently reaches the identical state, so any survivor can answer for a casualty. The agreement on the input order is exactly the consensus problem of Chapter 2; a replicated controller or scheduler that must never forget a decision is the canonical place it appears in an AI cluster.

Replication is graded by how warm the standby is, which trades recovery time against standing cost. A hot standby processes the same inputs in real time and holds current state, so failover is near-instant but doubles the running cost of that component. A warm standby stays loaded and periodically synchronized but does not do live work, so failover takes seconds to reload the gap. A cold standby is provisioned only after the failure is detected, the cheapest option and the slowest to recover. Table 35.2.1 places these alongside the other primitives so the trade space is visible at once. The right choice depends on what the component is: a training worker is cheap to restart from a checkpoint, so cold or no standby is correct, whereas the cluster's single scheduler or parameter-server coordinator may warrant a hot replicated quorum because its loss stalls everything.

Table 35.2.1: The recovery primitives and where each fits in a distributed AI system. Cost and recovery time trade against each other; the right primitive depends on how expensive the protected state is to reconstruct and how long the system can tolerate its absence.
PrimitiveHow it recoversRecovery timeStanding costTypical AI use
Checkpoint / restartReload last snapshot, recompute sinceMinutes (restart + lost work)Storage + periodic write stallsTraining workers, long batch jobs
Hot standby (replication)Live replica takes overNear-instantDoubles the componentScheduler, coordinator, control plane
Warm standbyPromote a synchronized spareSecondsOne loaded but idle replicaInference model replicas, gateways
Cold standbyProvision on demand after failureMinutes (spin-up + load)Near zeroElastic training spares, burst capacity
Erasure-coded redundancyReconstruct lost shards from parityRead + decode latency$n/k$ storage overheadCheckpoint and dataset storage
Lineage recomputeRe-run the deterministic planTime to recompute the lost partitionNear zero (store the plan)Data pipelines, feature computation

3. Redundancy and Erasure Coding for Stored State Intermediate

Checkpoints and datasets live on storage, and that storage is itself a distributed system that loses disks and nodes. The crude protection is full replication: keep $r$ copies of every byte, survive $r-1$ losses, and pay an $r\times$ storage cost. For petabyte-scale checkpoint and dataset stores that overhead is brutal, so production systems use erasure coding instead. An $(n,k)$ code splits each object into $k$ data shards and computes $n-k$ parity shards, storing all $n$ on distinct nodes; any $k$ of the $n$ suffice to reconstruct the object, so the scheme tolerates up to $n-k$ simultaneous shard losses. The storage overhead is

$$\text{overhead} = \frac{n}{k}, \qquad \text{tolerating up to } n-k \text{ losses}.$$

A common choice such as a $(14,10)$ Reed-Solomon code tolerates four losses at only $1.4\times$ overhead, against the $3\times$ of triple replication for comparable durability. That is the same reason distributed file systems for AI storage, covered in Chapter 8, default to erasure coding for cold data and reserve replication for hot data where read latency matters (reconstructing from parity costs a decode). The demo below makes the recovery concrete on small integers: it encodes $k$ data values into $n$ shards, deletes $n-k$ of them, and rebuilds the originals from the survivors alone.

import numpy as np
from numpy.polynomial import polynomial as P

# A toy (n, k) Reed-Solomon-style code over the reals: treat the k data values as
# the coefficients of a degree-(k-1) polynomial, then store its value at n points.
# ANY k of those n samples determine the polynomial, hence recover the data.
k, n = 4, 7                                  # 4 data shards, 3 parity shards
data = np.array([12.0, -5.0, 8.0, 3.0])     # the k values we must protect
xs = np.arange(1, n + 1, dtype=float)        # n distinct evaluation points
shards = P.polyval(xs, data)                 # n encoded shards (data + parity)

# Lose n - k = 3 shards. Keep an arbitrary surviving subset of size k.
survivors = [0, 2, 5, 6]                      # indices of shards that did NOT die
sx, sy = xs[survivors], shards[survivors]
recovered = np.polynomial.polynomial.polyfit(sx, sy, k - 1)  # solve for coeffs

print("original data :", data)
print("lost shards   :", [i for i in range(n) if i not in survivors])
print("recovered     :", np.round(recovered, 6))
print("max abs error :", f"{np.max(np.abs(recovered - data)):.2e}")
print("overhead n/k  :", f"{n / k:.2f}x", "  tolerates", n - k, "losses")
Code 35.2.2: An $(n,k)$ erasure code on integers, encoding by polynomial evaluation and decoding by interpolation. Production codes work over a finite field for exact arithmetic and bounded shard size, but the algebra (any $k$ of $n$ samples determine a degree-$(k-1)$ polynomial) is identical.
original data : [12. -5.  8.  3.]
lost shards   : [1, 3, 4]
recovered     : [12. -5.  8.  3.]
max abs error : 5.45e-13
overhead n/k  : 1.75x   tolerates 3 losses
Output 35.2.2: Three of seven shards were destroyed, yet the four originals are recovered exactly to floating-point precision from the four survivors. The $1.75\times$ overhead buys tolerance of three simultaneous losses, far cheaper than the $4\times$ replication that would be needed for the same fault tolerance.

4. Lineage: Recompute Instead of Store Intermediate

There is a fourth primitive that stores almost nothing and pays at recovery time instead. If a piece of state was produced by a deterministic computation from inputs that still exist, you do not need to checkpoint that state at all: record the recipe that produced it, and on loss simply re-run the recipe. This is lineage-based recovery, and its purest expression is the resilient distributed dataset (RDD) of Spark, introduced in Chapter 7 and built on the re-execution model of MapReduce in Chapter 6. An RDD remembers the deterministic transformations (the lineage graph) that created it from durable inputs; when a partition is lost, the framework recomputes exactly that partition by replaying its slice of the graph, with no checkpoint and no replica. The cost is bounded because only the lost partition is recomputed, not the whole dataset.

Lineage is the right tool precisely when the computation is cheap relative to storing its output and is deterministic. A feature-engineering pipeline, a data-cleaning stage, or a deterministic preprocessing step over a fixed corpus fits perfectly. It fits poorly when the computation is stochastic (a training step depends on random sampling and floating-point reduction order), expensive (recomputing fifty thousand gradient steps is absurd), or its inputs no longer exist. That is why training uses checkpoints while the data pipeline that feeds it uses lineage, and a real system layers both: the pipeline of Chapter 8 recovers data partitions by lineage, and the trainer that consumes them recovers by checkpoint. Knowing which half of a system can afford to recompute and which must persist is the practical content of fault-tolerance design.

Practical Example: The Checkpoint That Was Too Slow to Save the Run

Who: A platform engineer running a 1024-GPU pretraining job for a foundation model team.

Situation: The job ran on preemptible capacity with a measured MTBF of about ninety minutes; a full checkpoint of the model and optimizer state took eleven minutes to a shared filesystem.

Problem: The team checkpointed every thirty minutes "to be safe," and the run was crawling. Each save stalled all 1024 GPUs for eleven minutes, and with frequent preemptions the job spent more time writing and rolling back than training.

Dilemma: Checkpoint less often and risk losing more work per preemption, or keep checkpointing often and keep bleeding throughput to the write stalls.

Decision: They attacked both terms of the overhead equation. They moved to sharded asynchronous checkpointing so each rank wrote its own slice in parallel into pinned memory and flushed in the background, cutting the visible $C$ from eleven minutes to under one, and they set the interval near the Young/Daly $\tau^\star = \sqrt{2C\cdot\text{MTBF}}$ instead of a round number.

How: They adopted torch.distributed.checkpoint for the sharded writes and an in-memory peer copy so most preemptions recovered from a neighbor rather than the filesystem.

Result: Effective throughput rose from roughly 55% to over 90% of the failure-free rate, and a preemption now cost a few minutes of rollback instead of forty.

Lesson: Both terms of the checkpoint-overhead equation are engineerable. Shrinking $C$ with sharded asynchronous writes lets you check the optimal interval far cheaper, and the right interval is computed from $C$ and MTBF, not guessed.

5. Health Checking, Failover, and Supervised Restart Advanced

Every mechanism above presupposes that a failure was detected and a recovery was triggered, and detection is its own problem because a slow node and a dead node look alike for a while. Production systems detect liveness with health checks: periodic heartbeats or probes, with a timeout past which a component is declared dead. The timeout is a genuine trade-off, the practical face of the failure-detector impossibility from Chapter 2: too short and a momentary stall triggers a needless and expensive failover, too long and the system limps on a dead component. When a death is confirmed, automatic failover promotes a standby or, in training, triggers a coordinated rollback and restart. Deciding the failover correctly when the cluster itself may be partitioned is again a consensus question, which is why robust control planes route failover decisions through the same quorum that orders their state.

For training specifically, the recovery is wrapped in a supervisor that owns the restart policy. PyTorch's elastic launcher, torchrun, is exactly this supervisor: it runs a rendezvous so workers discover each other, monitors the worker processes, and on a failure restarts the whole worker group from the last checkpoint up to a configured retry budget. Crucially it is elastic, meaning it can re-form the group at a different size when a node is permanently lost and a spare is unavailable, which is the bridge from "tolerate failure" to "tolerate failure without idling the rest of the cluster." This elastic supervised restart is the practical payoff of the whole section, and Chapter 18 develops the rendezvous, membership change, and state-resharding it requires in full.

Library Shortcut: Supervised Elastic Restart with torchrun

Writing a supervisor that detects a dead rank, tears down the process group, reloads the checkpoint, re-runs rendezvous, and possibly reshapes the job is hundreds of lines of fragile coordination code. The elastic launcher does it from the command line:

# Detect a dead worker, restart the WHOLE group from the last checkpoint,
# up to 3 times; re-form the group at any size between 8 and 16 nodes.
torchrun \
  --nnodes=8:16 \
  --nproc_per_node=8 \
  --max-restarts=3 \
  --rdzv-backend=c10d \
  --rdzv-endpoint=$HEAD_NODE:29500 \
  train.py --resume-from=latest
Code 35.2.3: Supervised, elastic, checkpoint-aware restart in one command. The --nnodes=8:16 range makes the job elastic, --max-restarts bounds the retry budget so a permanently broken job eventually gives up, and --rdzv-backend selects the rendezvous that lets surviving workers re-discover each other after a membership change. The training script need only reload its checkpoint on start.

The simulation below ties the whole section together quantitatively. It models the 1000-GPU synchronous job whose every failure rolls back all workers, sweeps the checkpoint interval, and measures effective throughput, the fraction of wall-clock that turned into committed work. It then prints the Young/Daly optimum for comparison, so we can check whether the theory predicts the measured sweet spot.

import random

# A 1000-GPU synchronous training job. Any single worker failure forces ALL
# workers to roll back to the last consistent checkpoint, so every failure
# wastes the work done since that checkpoint plus the time to restart.

random.seed(7)
TOTAL_STEPS   = 100_000     # useful steps the job must complete
STEP_SECONDS  = 1.0         # wall-clock per step (all workers, synchronous)
CKPT_SECONDS  = 60.0        # cost to write one consistent checkpoint
RESTART_SECONDS = 120.0     # detect failure + respawn + reload checkpoint
# With 1000 GPUs each failing rarely, the JOB fails often. Per-step job-failure
# probability aggregates every worker's hazard into one number.
P_FAIL_PER_STEP = 1.0 / 8000.0

def simulate(ckpt_interval):
    completed = 0          # useful steps durably committed (survive a rollback)
    since_ckpt = 0         # useful steps done since the last checkpoint
    wall = 0.0             # total wall-clock seconds
    while completed < TOTAL_STEPS:
        wall += STEP_SECONDS                  # advance one step
        since_ckpt += 1
        if random.random() < P_FAIL_PER_STEP:
            wall += RESTART_SECONDS           # rollback: discard work since ckpt
            since_ckpt = 0
            continue
        if since_ckpt >= ckpt_interval:
            wall += CKPT_SECONDS              # pay to persist
            completed += since_ckpt           # now durable
            since_ckpt = 0
    return wall

print(f"{'interval':>9} {'wall (h)':>9} {'efficiency':>11} {'overhead %':>11}")
ideal = TOTAL_STEPS * STEP_SECONDS
for interval in (50, 200, 1000, 5000, 20000):
    wall = simulate(interval)
    eff = ideal / wall
    overhead = 100.0 * (wall - ideal) / ideal
    print(f"{interval:>9} {wall/3600:>9.2f} {eff:>11.3f} {overhead:>11.1f}")

# Young/Daly optimal interval (Ch 33.8): tau* = sqrt(2 * C * MTBF).
mtbf = STEP_SECONDS / P_FAIL_PER_STEP         # mean wall-seconds between failures
tau_star = (2 * CKPT_SECONDS * mtbf) ** 0.5   # optimal seconds between checkpoints
print(f"\nMTBF (s)            : {mtbf:.0f}")
print(f"Young/Daly tau* (s) : {tau_star:.0f}")
print(f"  -> ~{tau_star/STEP_SECONDS:.0f} steps between checkpoints")
Code 35.2.4: A discrete-event simulation of checkpoint, failure, and rollback on a coupled synchronous job. The interval sweep exposes the U-shaped overhead curve, and the closing lines compute the analytic optimum to compare against the simulated sweet spot.
 interval  wall (h)  efficiency  overhead %
       50     61.38       0.453       121.0
      200     36.91       0.753        32.9
     1000     31.34       0.886        12.8
     5000     35.16       0.790        26.6
    20000    132.40       0.210       376.6

MTBF (s)            : 8000
Young/Daly tau* (s) : 980
  -> ~980 steps between checkpoints
Output 35.2.4: Effective throughput peaks near an interval of 1000 steps (88.6% efficiency, 12.8% overhead) and degrades sharply on both sides: too-frequent checkpointing drowns in write stalls, too-rare checkpointing drowns in rollback. The analytic Young/Daly optimum of ~980 steps lands squarely on the simulated peak, confirming the theory predicts the right interval from $C$ and MTBF alone.

The agreement between the simulated peak near 1000 steps and the analytic $\tau^\star \approx 980$ is the section's payoff: the checkpoint interval is not a knob to be tuned blindly but a quantity computed from the cost of a checkpoint and the failure rate of the fleet. As the fleet grows and the MTBF shrinks, $\tau^\star$ shrinks with the square root of the MTBF, and the only way to keep overhead low is to drive down $C$ with the sharded asynchronous and in-memory techniques of Section 1.

Thesis Thread: Recovery Is a Distributed Computation Too

Every primitive in this section is itself scaled out. The checkpoint is sharded across the same workers that hold the model (Section 1); the erasure code spreads parity across the same storage fleet that holds the data (Section 3); the standby is replicated through the same consensus that orders the control plane (Section 2); the lineage is the same deterministic plan that the data engine already parallelized (Section 4). Fault tolerance does not sit beside the distributed system as a safety net bolted on afterward; it is woven from the very primitives, sharding, replication, collective combination, and deterministic re-execution, that the rest of this book builds for performance. The reliability of a scaled-out AI system is engineered with the same tools as its speed.

Research Frontier: Near-Zero-Overhead Checkpointing at Trillion Scale (2024 to 2026)

Because checkpoint cost is first-order at frontier scale, a sharp research line drives $C$ toward zero. Systems in the lineage of CheckFreq and Gemini overlap and pipeline the snapshot so the visible stall nearly vanishes, and in-memory peer checkpointing recovers most failures from a neighbor's RAM rather than disk, with periodic asynchronous flushes guarding against correlated loss. A parallel thread makes the trainer itself resilient to a changing worker set without a full rollback: partial and redundancy-based recovery reconstruct only the lost shard's state from peers, and oversubscribed-replica schemes keep a hot spare warm so a preemption costs seconds rather than a reload. The common goal is to decouple the checkpoint interval from throughput entirely, so that at ten-thousand-GPU scale the Young/Daly overhead in Output 35.2.4 falls into the low single digits. We meet the training-loop machinery these systems depend on in Chapter 18, and the Byzantine-robust aggregation that hardens recovery against malicious, not merely crashed, workers in Section 35.5.

Fun Note: Schrodinger's Checkpoint

A checkpoint you never test is in superposition: simultaneously a perfect safety net and a corrupt, unloadable blob, and you find out which only at the worst possible moment. The teams that sleep well are the ones that periodically restore from a checkpoint into a throwaway job just to confirm it loads, the storage equivalent of a fire drill. An untested backup is a hope, not a recovery mechanism.

Exercise 35.2.1: Match the Primitive to the State Conceptual

For each piece of state, name the recovery primitive from Table 35.2.1 you would use and justify it in one sentence: (a) the 4-terabyte optimizer state of a foundation-model training run; (b) the cluster scheduler's record of which job owns which GPU; (c) a 200-terabyte deduplicated text corpus on cold storage; (d) the cleaned feature table produced deterministically from that corpus each night. Explain why applying checkpointing to (d) or lineage to (a) would be the wrong choice.

Exercise 35.2.2: Find the Optimum, Then Shrink It Coding

Extend Code 35.2.4 to sweep a finer grid of intervals and report the simulated optimum, then verify it tracks $\tau^\star = \sqrt{2C\cdot\text{MTBF}}$ as you vary the checkpoint cost $C$ over $\{15, 60, 240\}$ seconds. Next, model asynchronous checkpointing by making only a fraction $f$ of $C$ block the workers (the rest overlaps with compute), and show how the achievable efficiency and the optimal interval change as $f$ drops from 1.0 to 0.1. Relate your finding to the practical example in Section 4.

Exercise 35.2.3: Replication Versus Erasure Coding Analysis

A checkpoint store must tolerate the simultaneous loss of any four storage nodes. Compare triple replication against a $(14,10)$ erasure code: give the storage overhead of each from the formulas in Sections 2 and 3, and state the qualitative cost each pays at recovery time (network and decode for erasure coding, read amplification for replication). Then argue which you would choose for the actively written latest checkpoint versus an archived one from last week, and why the answer differs.