Section 35.1: Reliability in Distributed AI

"Nobody noticed me last night, which is the highest praise a replica can receive. Node 1182 died at 3 a.m.; I had been quietly covering for it since 2:58."
A Replica, Quietly Covering for the Node That Just Died

Big Picture

At fleet scale, failure is not an exception that occasionally interrupts a healthy system; it is the steady state in which the system permanently operates. A single machine is either up or down, and "down" is rare enough to ignore most days. A fleet of thousands of machines is, at every instant, a system in which something is broken: a disk has filled, a network link is flapping, a GPU has thrown an uncorrectable memory error, a process has been preempted. Reliability engineering is the discipline of building one coherent, available intelligence on top of a substrate that is always partially failed. This chapter opens by making that claim quantitative, fixing the vocabulary (availability, MTBF, MTTR, fault classes, service objectives), and showing why two properties unique to AI workloads, long training jobs and always-on serving, turn the reliability problem from a background assumption into the central design constraint of any system at scale.

Every earlier part of this book has treated failure as a tax to be paid: data-loading retries in Chapter 8, checkpoint-and-restore in elastic training in Chapter 18, the redundant replicas of a serving fleet in Chapter 23. This chapter steps back and treats reliability itself as the subject. The reason it deserves its own chapter is arithmetic: the probability that everything works is the product of the probabilities that each component works, and a product of numbers slightly below one, taken thousands of times, collapses toward zero. Distribution buys throughput by adding machines, and every machine added is another independent chance to fail. Reliability is therefore not a fixed cost you pay once; it is a tax that grows with the size of the fleet, and the engineering in the rest of this chapter exists to keep that tax affordable.

1. Why Failure Is the Steady State Beginner

Begin with a single independent component that is up with probability $R$, its reliability over some window of interest. If a system needs $N$ such components all working at once, and their failures are independent, the system reliability is the product

$$R_{\text{sys}} = R^{N}.$$

This is the most consequential equation in the chapter, and its behavior is brutal. Suppose each node is individually excellent, working through a given job window with probability $R = 0.999$, a failure only once in a thousand jobs. With $N = 1$ the system works $99.9\%$ of the time. With $N = 1000$ identical nodes that must all survive, $R_{\text{sys}} = 0.999^{1000} \approx 0.368$: the job now fails roughly two times in three. Nothing got less reliable per node. The fleet got bigger, and the product of many near-ones is far from one. This is the same multiplicative collapse that Chapter 2 introduced when it argued that a distributed system is one where "the failure of a computer you did not even know existed can render your own computer unusable."

Figure 35.1.1: System reliability $R_{\text{sys}} = R^{N}$ as the number of nodes that must all survive grows. Three individually excellent per-node reliabilities are shown. Every curve starts near one at $N = 1$ and bends toward zero, and a smaller per-node failure rate only postpones the collapse; it does not prevent it. The lesson is that "use more reliable hardware" buys a constant factor, while the fleet's appetite for failures grows linearly with $N$.

The honest conclusion is that you cannot engineer your way out of this with better components alone. Doubling per-node reliability shifts the curve in Figure 35.1.1 to the right by a bounded amount, but the fleet keeps growing, and the product keeps collapsing. The only durable answer is redundancy and recovery: arrange that the system as a whole survives even though individual parts do not, so that the relevant reliability is no longer "all $N$ parts work" but "enough parts work, and broken ones are detected and replaced fast." That reframing, from preventing failure to surviving it, is the entire program of Section 35.2.

Key Insight: Reliability Is a Tax That Scales With the Fleet

Aggregate failure rate is additive in the number of nodes: a fleet of $N$ machines fails at roughly $N$ times the rate of one machine. Throughput is what you bought by adding those machines; the extra failures are what you pay for it. This makes reliability engineering structurally different from a one-time cost. The bigger the system that scale-out lets you build, the more failures per hour it absorbs, and the more of your engineering must go to detecting, masking, and recovering from them. You do not get to opt out: the only choice is whether the recovery is automatic and cheap or manual and catastrophic.

2. A Vocabulary for Things That Break Beginner

Reliability has a precise vocabulary, and using it precisely is the first step to engineering with it. Three quantities anchor everything. The mean time between failures (MTBF) is the average time a component runs before it fails. The mean time to repair (MTTR) is the average time to detect a failure and restore service. Availability is the fraction of time the component is usable, and for a repairable component it is

$$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}.$$

This single ratio captures a truth that beginners often miss: availability is improved as much by shrinking MTTR as by stretching MTBF. A component that fails twice as often but recovers ten times faster is more available. At fleet scale, where failures are constant and unavoidable, the leverage is almost entirely in MTTR: you will not stop nodes from dying, so the design question becomes how fast the system notices and heals. This is why Chapter 5 insists on measuring recovery time as a first-class metric, not just steady-state throughput.

Failures also come in distinct kinds, and the kind dictates the defense. Distributed-systems theory sorts faults into a hierarchy of increasing nastiness, shown in Figure 35.1.2. A crash fault is the kindest: a node simply stops, and once stopped it stays silent. An omission fault is a node that intermittently fails to send or receive messages, the flapping link and the dropped packet, harder to detect because the node is not cleanly dead. A Byzantine fault is the worst: a node that keeps running but behaves arbitrarily, sending wrong or even adversarially crafted values, whether from a silent bit-flip in memory, a software bug, or a compromised machine. The cost of tolerance rises sharply along this hierarchy, which is why most production systems assume crash faults and pay for Byzantine tolerance only where the threat (a malicious participant, a safety-critical decision) justifies it.

Figure 35.1.2: The fault hierarchy. Crash faults are a special case of omission faults (a node that omits all messages forever), which are in turn a special case of Byzantine faults (arbitrary behavior). Defenses and their cost rise from left to right. Most of this book assumes crash faults; Byzantine tolerance, which returns in Section 35.5 as robust gradient aggregation against poisoned updates, is reserved for the settings that genuinely face an adversary.

3. SLOs and SLAs: Promising Reliability Beginner

Reliability is not only an internal property; for a deployed AI service it is a promise made to users and quantified in contracts. A service-level objective (SLO) is an internal target: "$99.9\%$ of inference requests return in under 200 milliseconds, measured monthly." A service-level agreement (SLA) is the external, often contractual, commitment, usually with a financial penalty when breached. The gap between them is deliberate; teams set the internal SLO tighter than the external SLA so that a missed objective is a warning, not yet a breach. The arithmetic of availability targets is sobering: $99.9\%$ availability ("three nines") permits about 8.8 hours of downtime per year, $99.99\%$ ("four nines") about 53 minutes, and $99.999\%$ ("five nines") about five minutes. Each additional nine costs roughly an order of magnitude more in redundancy and operational rigor, which is why mature teams choose the cheapest target users will accept rather than the highest they can imagine.

For AI services specifically, the SLO must name the metric that matters, and that metric is rarely just "up or down." A serving system that returns answers but with degraded model quality, or one that stays fast for cached queries while retrieval times out, is partially failed in ways a simple uptime check misses. This is why the evaluation framework of Chapter 5 measures tail latency, goodput, and quality-under-load rather than a single binary, and why an SLO for an AI service typically bundles a latency percentile, an error-rate ceiling, and a quality floor into one objective.

4. Why AI Makes Reliability Harder Intermediate

Two structural features of AI workloads sharpen the reliability problem beyond what generic distributed systems face. The first is the long training job. A foundation-model training run can occupy thousands of accelerators for weeks. Over that horizon, the question is not whether a node will fail but how many will, and a single uncoordinated failure can corrupt the synchronous all-reduce that Chapter 15 built the training step around, stalling every other worker. Without checkpointing, one failure late in a run discards weeks of compute. The job amplifies any single failure into lost work proportional to the time since the last checkpoint, which is precisely why elastic and fault-tolerant training in Chapter 18 treats frequent checkpointing and fast restart as non-negotiable rather than optional.

The second feature is always-on serving. An inference fleet must keep answering while parts of it fail, upgrade, or get preempted, so it must stay correct and available under partial failure rather than waiting for a clean repair. The tension here is the one Chapter 2 framed with the CAP theorem: when the network partitions, a system must choose between consistency and availability, and most serving systems choose availability, returning a possibly-stale answer rather than no answer. The runnable demo below makes the core arithmetic of both regimes concrete, computing how the probability of at least one failure and the expected number of failures grow as the fleet grows.

import math

# Per-node failure probability during one long training job window.
p = 0.01  # 1% chance a given node fails during the job
print("per-node failure prob p :", p)
print()
print(f"{'nodes N':>8} | {'P(>=1 node fails)':>18} | {'E[# failures]':>14}")
print("-" * 48)
for N in (1, 10, 100, 1000, 5000, 10000):
    p_any = 1.0 - (1.0 - p) ** N      # at least one failure
    expected = N * p                  # expected number of failures
    print(f"{N:>8} | {p_any:>18.6f} | {expected:>14.2f}")

print()
# Expected time to first failure for a fleet of N nodes, each with MTBF hours.
mtbf_hours = 50000.0      # mean time between failures per node (hours)
for N in (1000, 10000):
    fleet_rate = N / mtbf_hours          # failures per hour across the fleet
    ttff = 1.0 / fleet_rate              # expected hours to first failure
    print(f"N={N:>6}: fleet failure rate = {fleet_rate:8.4f} /hr, "
          f"E[time to first failure] = {ttff*60:7.2f} min")

print()
# Availability from MTBF and MTTR.
mttr_hours = 2.0
A = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"single-node availability A = MTBF/(MTBF+MTTR) = {A:.6f}")
print(f"all-N-up availability A^N for N=1000        = {A**1000:.6f}")
print(f"all-N-up availability A^N for N=10000       = {A**10000:.6f}")

Code 35.1.1: Fleet failure arithmetic from first principles. The first block computes $P(\text{at least one failure}) = 1 - (1-p)^{N}$ and the expected failure count $Np$ as the fleet grows; the second converts a per-node MTBF into a fleet-wide rate and an expected time to the first failure; the third applies $A = \text{MTBF}/(\text{MTBF}+\text{MTTR})$ and the $A^{N}$ all-up availability from Section 1.

per-node failure prob p : 0.01

 nodes N |  P(>=1 node fails) |  E[# failures]
------------------------------------------------
       1 |           0.010000 |           0.01
      10 |           0.095618 |           0.10
     100 |           0.633968 |           1.00
    1000 |           0.999957 |          10.00
    5000 |           1.000000 |          50.00
   10000 |           1.000000 |         100.00

N=  1000: fleet failure rate =   0.0200 /hr, E[time to first failure] = 3000.00 min
N= 10000: fleet failure rate =   0.2000 /hr, E[time to first failure] =  300.00 min

single-node availability A = MTBF/(MTBF+MTTR) = 0.999960
all-N-up availability A^N for N=1000        = 0.960790
all-N-up availability A^N for N=10000       = 0.670325

Output 35.1.1: The numbers tell the chapter's whole story. At $N = 100$ a failure during the job becomes more likely than not; by $N = 1000$ it is essentially certain, and the fleet expects ten failures per job. With a generous per-node MTBF of nearly six years, a ten-thousand-node fleet still expects its first failure within five hours, and the all-up availability of that fleet has fallen to $0.67$ even though each node is available $99.996\%$ of the time. Failure is the steady state.

Thesis Thread: Scale-Out Multiplies Failures, Not Just Throughput

Every axis of this book added machines to overcome a ceiling: more workers for throughput, more shards for model size, more replicas for request volume. Output 35.1.1 names the bill that arrives with all of them. The same $N$ that multiplies your compute multiplies your aggregate failure rate, so the scale-out moment that delivers a thousandfold speedup also delivers a thousandfold failure rate. Reliability is therefore not a separate topic bolted onto distribution; it is the shadow cast by distribution itself. The larger the system the earlier parts of this book taught you to build, the more of this chapter you are obliged to apply.

5. The Shape of This Chapter Beginner

Having established that failure is the steady state, the rest of the chapter is organized as a tour of the defenses, broadening from accidental failure to deliberate harm to societal consequence. Section 35.2 covers fault tolerance proper: replication, checkpointing, consensus, and the recovery mechanisms that turn "all $N$ must survive" into "enough survive and the rest heal fast." Sections 35.3 through 35.5 turn from accidental faults to adversarial ones, where a node fails not at random but because someone made it: data poisoning, model and prompt attacks, and the Byzantine-robust aggregation that defends distributed learning against malicious participants. Section 35.6 addresses privacy, including the differential privacy and secure aggregation first met in federated learning in Chapter 14. Section 35.7 covers governance and accountability for systems no single operator fully observes, and Section 35.8 closes with bias, fairness, and the environmental cost of fleet-scale AI. The thread tying these together is that a distributed AI system has a larger attack and failure surface than any single machine, and every section widens the lens on what can go wrong with that surface.

Library Shortcut: Failure Detection You Do Not Hand-Roll

The fleet arithmetic of Code 35.1.1 says failures are constant, so detection must be automatic. In practice you never write your own heartbeat and timeout logic. Kubernetes liveness and readiness probes restart and depool failing pods on a declared cadence; the few lines below replace what would otherwise be a custom health-monitor process per service:

# Kubernetes restarts a container that fails its liveness probe,
# and stops routing traffic to one that fails its readiness probe.
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 5          # check every 5s
  failureThreshold: 3       # 3 misses (~15s) -> restart: bounds MTTR
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2       # depool fast so users never hit a dead replica

Code 35.1.2: A liveness and readiness probe spec. The orchestrator handles heartbeating, timeout, restart, and traffic depooling that you would otherwise build by hand; the declared thresholds are exactly the MTTR knob from Section 2. The scheduling layer underneath these probes is the subject of Chapter 33.

Practical Example: The Training Run That Lost a Weekend

Who: A research engineer training a large language model on a 512-GPU cluster.

Situation: A two-week pretraining run was checkpointing every six hours to keep checkpoint-write overhead low.

Problem: On day nine, a single GPU threw an uncorrectable ECC memory error at 2 a.m. Saturday. The synchronous all-reduce hung, every other GPU sat idle at the barrier, and the job watchdog killed the run.

Dilemma: Restart from the last checkpoint, losing up to six hours of progress across 512 GPUs, or invest engineering time in more frequent checkpoints and automatic restart, paying steady write overhead to bound the loss.

Decision: They measured it against Output 35.1.1: at 512 GPUs a multi-day run is near-certain to hit a failure, so designing for restart was not optional. They moved to asynchronous checkpointing every thirty minutes and an elastic launcher that re-forms the process group around the dead node.

How: They adopted the elastic-training machinery of Chapter 18: sharded checkpoints written in the background, a supervisor that detects the hung barrier, and automatic respawn on a spare node.

Result: The next memory error, eleven days later, cost eleven minutes of recompute instead of a weekend. MTBF did not change; MTTR collapsed, and availability followed.

Lesson: You cannot stop the hardware from failing at scale. You can decide how much work a failure destroys, and that decision is made in advance, in the checkpoint interval and the restart path.

Research Frontier: Reliability at the Frontier of Scale (2024 to 2026)

The largest training runs have turned reliability into a published engineering concern. Meta's Llama 3 technical report (2024) documented that over a 54-day pretraining run on 16,384 GPUs the cluster saw a hardware or systems interruption on average more than once every three hours, with GPU and HBM failures dominating, and reported the automation that kept effective training time above $90\%$ despite this. Parallel research lines attack the same problem from different angles: silent data corruption detection, where a faulty accelerator produces wrong results without crashing (a Byzantine fault from Figure 35.1.2) has become a recognized hyperscale failure mode; asynchronous and in-memory checkpointing schemes such as Gemini-style redundancy aim to cut checkpoint cost so intervals can shrink toward minutes; and elastic training systems that re-shard around lost nodes without a full restart are moving from research into production frameworks. The frontier theme is uniform: at tens of thousands of accelerators, the bottleneck on useful work is no longer raw compute but the fraction of time the fleet spends not recovering from failure.

Fun Note: The Bathtub and the Burn-In

Hardware failure rates over a component's lifetime trace a "bathtub curve": high early on as manufacturing defects surface, low and flat through the long useful-life middle, then rising again as parts wear out. Datacenter operators exploit the left wall on purpose, running new servers through a stress "burn-in" before trusting them with real work, so that the components destined to die young do so on a test rig rather than nine days into someone's pretraining run. The replica in this section's epigraph is, in a sense, the burn-in's reward: the dud already failed last week, on no one's critical path.

Exercise 35.1.1: Read the Collapse Conceptual

Using $R_{\text{sys}} = R^{N}$, explain in words why moving from a per-node reliability of $0.999$ to $0.9999$ (a tenfold reduction in per-node failure rate) does not rescue a $10{,}000$-node job, and quantify the residual system reliability in each case. Then argue why the practical response at scale is redundancy and fast recovery rather than more reliable nodes, connecting your answer to the MTTR term in the availability formula of Section 2.

Exercise 35.1.2: Choose the Checkpoint Interval Coding

Extend Code 35.1.1 into a small model of expected wasted work. Assume a fleet of $N$ nodes each with a constant failure rate, so the time to the next fleet failure is exponential with mean $\text{MTBF}/N$. For a job that checkpoints every $T$ minutes, the work lost per failure is on average $T/2$ minutes of compute across the whole fleet, but each checkpoint also costs a fixed write time $c$. Write code that sweeps $T$ for $N \in \{512, 4096\}$ and finds the interval minimizing total overhead (lost work plus checkpoint cost) per hour. Report how the optimal interval shifts as $N$ grows and relate it to the Practical Example.

Exercise 35.1.3: Price the Nines Analysis

A serving team runs an inference SLA of $99.9\%$ monthly availability and is considering an upgrade to $99.99\%$. Compute the permitted monthly downtime budget at each target. Their current MTTR (detect, depool, restart) is 12 minutes per incident and they average 3 incidents per month. Determine whether they already meet three nines, what they must do to MTTR or incident rate to reach four nines, and discuss which lever (MTBF via better hardware, or MTTR via faster automated recovery from Code 35.1.2) is cheaper to pull. Tie your reasoning to why mature teams pick the lowest acceptable target rather than the highest reachable one.