"Nobody asked me to agree on a gradient. They asked me to agree on who is allowed to write the next checkpoint, and that turned out to be the hard part."
A Consensus Round That Refuses to Commit
A distributed AI job has two planes with two completely different coordination styles, and confusing them is one of the most common conceptual errors in the field. The data plane, where gradients and activations flow between workers, does not run a consensus protocol; it runs deterministic collectives such as all-reduce, where every worker already agrees on the schedule and only the numbers move. The control plane, where the cluster decides which scheduler is in charge, who is rank 0, which nodes are alive, and which model version is current, is where genuine agreement under failure is required, and that is the home of consensus algorithms such as Raft and the systems built on them, etcd and ZooKeeper. This section explains the agreement problem, why failures make it hard, how Raft solves it in practice, and exactly which AI decisions sit on top of it.
The previous section examined consistency models, from the bounded staleness a parameter server tolerates to the CAP trade-off a storage layer must confront. Consistency asks how stale a value any one reader may see. This section asks a sharper question: when several machines must commit to a single value that all of them will treat as authoritative, such as the identity of the current leader or the contents of the configuration, how do they reach that single value while some of them are crashing, restarting, or temporarily unreachable? That is the agreement problem, and the surprising news is that AI training, the headline workload of this book, deliberately avoids it on its hot path and pays for it only in a small, slow, carefully isolated corner called the control plane.
1. The Agreement Problem, and Where AI Does Not Have It Beginner
Agreement, also called consensus, is the problem of getting a set of processes to decide on a single value such that three properties hold: every non-faulty process that decides chooses the same value (agreement), the chosen value was proposed by some process and is not invented out of thin air (validity), and every non-faulty process eventually decides (termination). Stated that abstractly it sounds like a puzzle, so it helps to see the concrete questions a real AI cluster needs answered: Which one of the three candidate schedulers currently holds the lease and is allowed to place jobs? When a worker crashes and a spare takes its place, which surviving worker is now rank 0, the coordinator that the others rendezvous with? Of the four model versions that finished training this week, which single one is tagged "production" right now? Each of those is a request for one authoritative answer that every participant will act on, and getting it wrong, two schedulers both believing they own the cluster, two workers both believing they are rank 0, produces exactly the kind of split-brain corruption that distributed systems exist to prevent.
Here is the editorial point this book insists on. The arithmetic core of distributed training does not need consensus. When eight workers run data-parallel SGD, they do not vote on the gradient; they each compute a partial sum and combine it with all-reduce, a collective operation in which the communication schedule is fixed in advance and agreed before the first byte moves. There is no proposal, no election, no possibility of two different answers, because addition is deterministic and the participants are known. That is why Chapter 4 can build the entire data plane out of collectives and never mention Paxos or Raft. Consensus is expensive and slow precisely because it tolerates uncertainty about who is alive and who proposed what, and the data plane buys its speed by refusing to tolerate that uncertainty on the hot path. The uncertainty does not vanish, though. It is pushed into the control plane, and the control plane pays for it with a consensus protocol.
All-reduce and Raft both make many machines "agree", but on entirely different things and at entirely different costs. All-reduce produces an agreed numerical result from a fixed, pre-arranged set of participants running a deterministic schedule; it is fast and runs millions of times per training job. Raft produces an agreed decision (a leader, a config value, a log entry) in the presence of crashes and uncertain membership; it is comparatively slow and runs rarely. AI training puts the high-frequency numerical agreement in the data plane (collectives) and the low-frequency decision agreement in the control plane (consensus). Mixing them up, for instance reaching for Raft to average gradients, would be a serious performance mistake.
2. Why Agreement Is Hard Under Failure Intermediate
If every machine were perfectly reliable and the network never delayed a message, agreement would be trivial: appoint machine 0 as the decider, let it broadcast the answer, done. Real clusters are not reliable, and the difficulty comes from one specific impossibility of distinguishing a crashed machine from a slow one. When a coordinator stops hearing from a worker, it cannot tell whether the worker died or whether its reply is merely stuck behind a congested link; the two look identical from the outside. This is not an engineering gap that a better library closes. It is a theorem.
The Fischer, Lynch, and Paterson result of 1985, universally called FLP, states that in an asynchronous system, one with no bound on message delay, no deterministic protocol can guarantee both safety and liveness for consensus if even a single process may crash. Informally: any protocol that always reaches a decision can be forced to run forever by an adversary who delays exactly the right message at exactly the right moment, and any protocol that always terminates can be forced into an inconsistent decision. There is no deterministic algorithm that is simultaneously always-correct and always-terminating when one crash is possible and delays are unbounded.
Practical systems do not repeal FLP; they sidestep it. Raft and its relatives keep safety unconditionally, they never elect two leaders in one term or commit two different values at one log position, and they obtain liveness only under a partial-synchrony assumption: once the network behaves well enough for long enough (messages arrive within some bound the protocol does not need to know in advance), an election completes and progress resumes. Randomized election timeouts break the symmetric stalemates that FLP exploits. The takeaway for an AI systems engineer is precise and worth internalizing: a consensus system will always refuse to do the wrong thing, but it may pause and do nothing while the network is misbehaving. That pause is a feature, and a training job that stalls because its coordination store lost quorum is behaving exactly as designed, choosing consistency over availability.
The oldest cousin of FLP is the Two Generals Problem: two allied generals on opposite hills can coordinate an attack only by messengers crossing a contested valley, and no finite sequence of confirmations ever makes both fully certain the other will attack, because the last messenger might always be the one captured. A distributed training job feels this every time a worker goes quiet. Did it crash, or is its acknowledgment captured in the valley of a congested switch? The control plane's answer is delightfully blunt: stop waiting for certainty, set a timeout, and hold an election.
3. Raft: A Leader and a Replicated Log Intermediate
Raft is the consensus algorithm most worth knowing because it was designed to be understandable, and because etcd and Consul, which run underneath a great deal of modern AI infrastructure, implement it. Raft reduces consensus to two intertwined mechanisms: electing a single leader, and replicating an append-only log of commands through that leader. Time is divided into terms, numbered $1, 2, 3, \dots$, and each term has at most one leader. A node that stops hearing heartbeats from the current leader times out, increments the term, becomes a candidate, and asks every other node for a vote. The rule that makes this safe is the majority quorum: a candidate becomes leader only if it collects votes from a strict majority of the cluster, and each node grants at most one vote per term. With $n$ nodes a majority is
$$ q = \left\lfloor \frac{n}{2} \right\rfloor + 1, $$and because any two majorities of the same set must overlap in at least one node, and that node votes only once per term, two candidates can never both reach a majority in the same term. One term, at most one leader, guaranteed, no matter how the messages are delayed or reordered. This single overlap argument is the whole safety story for elections.
Once elected, the leader is the only node that accepts client writes. It appends each command (set this config key, register this model version, grant this lease) to its log and replicates the entry to the followers; an entry is committed only after a majority have stored it, at which point the leader applies it and tells the followers to apply it too. The same majority that prevents two leaders also guarantees that a committed entry survives any minority failure, because at least one node in every future majority already holds it. The simulation below makes the election-safety property concrete: it runs a five-node cluster through three terms, including a split-vote race and a network partition, and checks that two leaders never coexist and that a minority partition cannot elect anyone.
import random
random.seed(7)
class Node:
def __init__(self, node_id):
self.id = node_id
self.voted_for = {} # term -> candidate it already voted for
def request_vote(self, term, candidate_id):
if term not in self.voted_for: # at most ONE vote per term
self.voted_for[term] = candidate_id
return True
return False
def run_election(cluster, term, candidate_id, reachable):
majority = len(cluster) // 2 + 1 # strict majority of the WHOLE cluster
votes = 1 # a candidate always votes for itself
for node in cluster:
if node.id != candidate_id and node.id in reachable \
and node.request_vote(term, candidate_id):
votes += 1
return votes, majority, votes >= majority
N = 5
cluster = [Node(i) for i in range(N)]
allnodes = set(range(N))
print("cluster size :", N)
print("majority quorum needed :", N // 2 + 1, "\n")
# Term 1: two candidates race in the SAME term; only one may win.
v0, maj, won0 = run_election(cluster, 1, 0, allnodes)
v2, maj, won2 = run_election(cluster, 1, 2, allnodes)
print("Term 1: nodes 0 and 2 both stand for election")
print(f" candidate 0: {v0}/{N} votes (needs {maj}) -> leader={won0}")
print(f" candidate 2: {v2}/{N} votes (needs {maj}) -> leader={won2}")
print(f" two leaders in one term? {won0 and won2} (safety: must be False)\n")
# Term 3: a partition isolates {3,4}; a minority cannot reach quorum.
v3, maj, won3 = run_election(cluster, 3, 3, reachable={3, 4})
print("Term 3: a partition isolates nodes 3 and 4")
print(f" candidate 3 sees only [3, 4]: {v3}/{N} votes -> leader={won3}")
C:\Python314\python.exe.cluster size : 5
majority quorum needed : 3
Term 1: nodes 0 and 2 both stand for election
candidate 0: 5/5 votes (needs 3) -> leader=True
candidate 2: 2/5 votes (needs 3) -> leader=False
two leaders in one term? False (safety: must be False)
Term 3: a partition isolates nodes 3 and 4
candidate 3 sees only [3, 4]: 2/5 votes -> leader=False
The most common five-node-cluster outage is operators "saving money" by running etcd on two nodes instead of three. With two nodes the majority is also two, so a single failure drops you below quorum and the whole control plane freezes. Odd cluster sizes (3, 5, 7) are the convention precisely because adding an even node raises the quorum without improving fault tolerance: a 4-node cluster tolerates the same single failure as a 3-node one, while paying for an extra machine and a slower commit.
4. etcd and ZooKeeper: Consensus You Rent, Not Build Beginner
Almost nobody implements Raft inside their AI platform. Instead they delegate every consensus-requiring decision to a small, battle-tested coordination service and treat it as a reliable shared notebook. The two dominant choices are etcd, a Raft-based key-value store that is the brain of Kubernetes, and ZooKeeper, an older ZAB-based service that underpins Hadoop, Kafka, and many on-premises ML clusters. Both expose the same handful of primitives that turn raw consensus into something an application can use: a consistent key-value store, ephemeral keys that vanish when a client disconnects (perfect for liveness), watches that notify clients when a key changes, and compare-and-swap writes that implement locks and leases. A distributed lock is just an agreed key that only one holder may own; a leader election is a race to create one ephemeral key, where the winner leads and the losers watch the key so they learn the instant the leader's session dies.
This is the layer where AI infrastructure actually touches consensus. The cluster scheduler stores the authoritative job and node state in etcd, so that even if the scheduler process is replaced, the new instance reads the same truth and no two schedulers place conflicting jobs; we develop that machinery in Chapter 33. A distributed training launcher uses an etcd or ZooKeeper rendezvous to agree on the worker set and assign ranks before the first all-reduce, which is how a worker learns it is rank 0; torchrun's elastic backend does exactly this. And the model registry that records which version is "production" is, at its core, a consensus-backed compare-and-swap on a version key, which is why the MLOps systems of Chapter 26 lean on these stores to make promotion atomic.
Code 2.6.1 simulated the mechanism of election for teaching. In production you never run that loop yourself; you ask etcd, which has already run Raft for you, to grant a short-lived lease and let exactly one client claim a leadership key. The python-etcd3 client makes the whole protocol a few calls:
import etcd3
client = etcd3.client(host="etcd.cluster.local", port=2379)
# A lease is a timer; if this process dies, the lease expires and the key vanishes.
lease = client.lease(ttl=10)
# Atomically claim leadership ONLY if nobody holds the key (compare-and-swap).
won, _ = client.transaction(
compare=[client.transactions.version("/scheduler/leader") == 0],
success=[client.transactions.put("/scheduler/leader", "scheduler-A", lease=lease)],
failure=[],
)
if won:
lease.refresh() # keep refreshing to stay leader; stop and you abdicate
run_as_leader()
else:
client.add_watch_callback("/scheduler/leader", on_leader_change) # wait your turn
kazoo library and its Election recipe offer the same pattern.Who: A platform engineer running the training scheduler for a 600-GPU research cluster.
Situation: For high availability they ran two identical scheduler processes, expecting the spare to take over if the primary died.
Problem: A network blip made each scheduler believe the other was dead, so both began placing jobs; two of them landed eight workers each on the same GPUs, and both training runs corrupted their shared checkpoints.
Dilemma: Add ad-hoc heartbeats between the two schedulers (fast to write, but reinvents consensus badly and will split-brain again under the next partition), or route leadership through the etcd cluster the platform already ran for Kubernetes.
Decision: They put the scheduler behind an etcd lease, exactly the pattern in Code 2.6.2, so that at most one scheduler could ever hold the leadership key.
How: Each scheduler tried to claim /scheduler/leader with a 10-second lease on startup and on every loss-of-leadership event; the loser watched the key and stayed idle until the lease expired.
Result: During the next real partition the minority-side scheduler could not refresh its lease, lost leadership within ten seconds, and stopped placing jobs; the cluster paused briefly rather than double-booking. No corrupted checkpoint recurred.
Lesson: "Run two of them for availability" without a consensus layer manufactures split-brain. One quorum-backed lease is worth more than any number of hand-rolled heartbeats, and the brief pause it causes is the correct behavior, not a bug.
5. What the Control Plane Decides for an AI Job Intermediate
It is worth listing the concrete decisions so the boundary between the planes is unmistakable. Table 2.6.1 puts the recurring AI control-plane decisions next to the consensus primitive that backs each one and the chapter that develops it. Read down the right column and notice that none of these decisions is on the gradient hot path; every one is a rare, authoritative choice that, if duplicated or lost, breaks the whole job. That rarity is what makes paying for consensus affordable: a Raft commit measured in milliseconds is invisible next to a training run measured in days, but a split-brain in any of these rows is fatal.
| Control-plane decision | Consensus primitive | Developed in |
|---|---|---|
| Which scheduler holds the lease | Leader election (ephemeral lease) | Chapter 33 |
| Which workers are alive (membership) | Ephemeral keys + watches | Chapter 18 |
| Which worker is rank 0 (rendezvous) | Agreed key-value, barrier | Chapter 15 |
| Current cluster configuration | Replicated log of writes | Chapter 33 |
| Which model version is "production" | Compare-and-swap on a version key | Chapter 26 |
| Who owns this shard / partition right now | Distributed lock | Chapter 11 |
The rendezvous row is the cleanest bridge back to the data plane. Before any all-reduce can run, the workers must agree on who is in the group and which one is rank 0, the coordinator that initializes the process group. Elastic training launchers route exactly that agreement through a consensus-backed rendezvous, and when a worker dies and a spare joins, the group re-forms through the same store; this is the membership machinery that Chapter 18 builds into fault-tolerant training. Once the control plane has answered "you are rank 0, your peers are these, the world size is eight", the data plane takes over and never consults consensus again until the membership changes. The two planes meet only at the seams: setup, recovery, and reconfiguration.
Chapter 2 keeps returning to the choice between synchronous and asynchronous coordination, and consensus is that choice raised from numbers to decisions. A synchronous all-reduce makes workers agree on a gradient value every step (Chapter 4); a Raft commit makes nodes agree on a decision only when one is needed. The same tension, agree-now-and-wait versus proceed-and-reconcile-later, reappears as synchronous versus asynchronous SGD in Chapter 10 and as bounded staleness in the parameter server of Chapter 11. Recognizing that the control plane is the strict-consistency end of one continuous spectrum, with the staleness-tolerant data plane at the other end, is a load-bearing idea for the rest of the book.
6. Research Frontier Advanced
As training jobs grew to tens of thousands of accelerators, the control plane became a measurable bottleneck and a fresh research target. Meta's report on training Llama 3 (Grattafiori et al., 2024) documents how, at 16,000-GPU scale, failure handling and membership churn dominated operational effort, motivating coordination layers that detect, evict, and re-rendezvous failed workers in seconds rather than minutes. A parallel systems line keeps consensus correct while shrinking its latency: etcd's move to a v3.5/v3.6 lineage hardened the Raft implementation and its learner (non-voting) members, and learner replicas let a cluster add capacity without enlarging the voting quorum that every commit must cross. On the algorithmic side, leaderless and flexible-quorum protocols such as the EPaxos and Flexible Paxos families continue to attract attention for letting reads and writes use differently sized quorums, which matters when a model registry is read constantly but written rarely. The throughline for AI infrastructure is consistent: keep the unconditional safety of consensus, but get its rare decisions off the critical path of jobs whose worker counts, and therefore whose failure rates, keep climbing. We return to the failure-rate arithmetic that forces this in Chapter 18.
We now hold the second half of Chapter 2's coordination story. Section 2.5 fixed how stale a reader may be; this section fixed how a cluster commits to a single authoritative value despite crashes, and drew the bright line between the consensus-backed control plane and the collective-driven data plane. What remains is the performance consequence of all this coordination: even with leadership settled and membership agreed, the slowest worker still sets the pace of every synchronous step. That problem, the straggler, is the subject of Section 2.7.
For each of the following, state whether it belongs to the data plane (deterministic collective, no consensus) or the control plane (consensus required), and justify the choice in one sentence: (a) averaging gradients across 32 data-parallel workers; (b) deciding which of three replicas of the parameter server is the current primary; (c) summing a loss scalar across a pipeline stage; (d) promoting model version v7 to the "production" tag; (e) agreeing, after a worker crash, on the new world size and rank assignment before training resumes. For any control-plane item, name which row of Table 2.6.1 it matches.
Using the majority rule $q = \lfloor n/2 \rfloor + 1$, fill in, for cluster sizes $n = 3, 4, 5, 6, 7$, both the quorum $q$ and the number of simultaneous node failures the cluster can survive while still committing (which is $n - q$). Explain from your table why production consensus clusters use odd sizes, and state precisely what an operator loses by running a 2-node etcd cluster "for redundancy". Then argue why a minority partition must refuse writes, connecting your answer to the FLP discussion in Section 2.
Start from Code 2.6.1. (a) Add a leader_for_term dictionary and an assertion, after running many randomized elections across many terms, that no term ever records two distinct leaders; confirm the assertion never fires. (b) Add a committed-log model: a leader appends an entry only if a majority of reachable nodes acknowledge it, and verify that an entry committed under full connectivity is still present in the next leader's log after you partition away a minority. (c) Briefly explain which property your code demonstrates (safety) and which it cannot (liveness under unbounded delay), tying the limitation back to FLP.