Section 39.9: Safety and Failure Modes

"One of us went silent over the river. The rest of us closed the gap, voted down the liar still shouting nonsense, and finished the survey. Nobody on the ground noticed a thing."
A Swarm, Deciding What to Do When One of Its Own Goes Silent

Big Picture

An embodied swarm is a distributed system whose failures carry kinetic energy: a crashed worker is a falling drone, a corrupted aggregate is a wrong turn into a wall, and a compromised node is a robot that lies to its neighbors while still flying. Every reliability technique from Chapter 35 (fault tolerance, Byzantine-robust aggregation, security) and the elastic reconfiguration of Chapter 18 reappears here, but with two non-negotiable constraints that the data-center versions never faced. First, safety must hold without a central monitor: there is no controller node that can see the whole swarm and call an abort, so every guarantee has to be enforced locally by each robot from local information. Second, safety must tolerate a fraction of failed or malicious agents: with $n$ robots in the air, the design target is not "all $n$ behave" but "the swarm stays safe as long as at most $f$ of them misbehave". This section assembles the physical failure modes, the self-healing response, the robust consensus that survives a lying robot, the decentralized safety invariants, and the comm-network security into one safety story for the swarm.

The earlier sections of this chapter built a swarm that works when everything works: agents that perceive, the gossip and consensus that let them agree (Section 39.4), the decentralized controllers that turn agreement into motion (Section 39.6). This section asks the harder question that decides whether such a system is ever allowed to fly: what happens when an agent dies, a sensor lies, a link jams, or a robot is taken over. In a single-machine robot the answer is simple, because the robot either works or it does not, and a watchdog can power it down. A swarm has no such luxury. It is, at any interesting scale, a system in which something is probably already broken, exactly the condition that Section 1.1 named as the price of distribution, now with rotors attached. Keeping it safe means designing so that the broken parts cannot take the healthy ones down with them.

1. The Physical Failure Modes Beginner

A data-center fault is a crashed process or a slow link. An embodied fault is all of that plus physics. It is worth enumerating the failure modes precisely, because each one calls for a different layer of the defense, and conflating them is how swarms hurt people. Four modes dominate. Agent loss is a robot that stops participating entirely: a depleted battery, a structural failure, a hard crash of its onboard computer. To its neighbors this looks exactly like the crashed worker of Section 35.2, detected by missed heartbeats, except that the lost unit is now an unpowered mass on a ballistic path. Sensor failure is subtler and more dangerous: the robot keeps flying and keeps talking, but its perception is wrong, so it broadcasts confident nonsense. Motor or actuator failure degrades the robot's ability to execute the maneuver it agreed to, so its commanded and actual trajectories diverge. Collisions are the failure mode unique to the embodied setting, where two healthy robots, or a robot and the world, occupy the same space, and they are both a cause of the other faults and a consequence of them.

These modes are not independent, and the dangerous cases are the cascades: a sensor failure causes a robot to broadcast a wrong position, its neighbors plan around the phantom, a real gap opens, and a collision closes it. The safety architecture therefore cannot treat any single mode in isolation; it must degrade gracefully under agent loss, reject the bad data from sensor failure, bound the divergence from actuator failure, and hold a collision-avoidance invariant that survives all three. Figure 39.9.1 shows the whole picture: a swarm with one silently lost agent and one Byzantine agent shouting an outlier, the robust aggregate rejecting that outlier, and the formation reconfiguring to cover the gap the lost agent left.

Figure 39.9.1: The swarm safety pipeline of this section. Honest agents (blue) report a sensor estimate near the true obstacle range; one agent has gone silent (gray, agent loss) and one is Byzantine (red), broadcasting the outlier $1000$. Robust consensus (median or trimmed mean) rejects the outlier and the swarm agrees on the honest value, $30.0$ meters. The healthy agents (green) then reconfigure to cover the gap the lost agent left. No central monitor appears anywhere in the loop; every step is computed locally by each agent.

2. Graceful Degradation and Self-Healing Intermediate

When an agent is lost, the swarm must keep doing its job with fewer units, and ideally repair its own structure rather than wait for a human to relaunch a replacement. This is the elastic reconfiguration of Chapter 18 moved from a training cluster to the air. In elastic training, when a worker is preempted the remaining workers re-form the process group and continue with a smaller world size; in a swarm, when a drone drops the remaining drones re-form the formation and continue with smaller coverage. The mechanism is the same membership protocol over missed heartbeats from Section 35.2, and the response is the same: shrink the live set, redistribute the work, carry on.
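The membership step described above can be sketched as a per-agent failure detector. This is a minimal illustration, not the protocol of Section 35.2: the class name `LocalMembership`, the `timeout_s` value, and the explicit timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LocalMembership:
    """One agent's local view of which neighbors are still alive (hypothetical helper)."""
    timeout_s: float = 1.5                      # assumed: longer silence means "lost"
    last_seen: dict = field(default_factory=dict)

    def heard_from(self, agent_id: int, now_s: float) -> None:
        # Record the time of the most recent heartbeat from this neighbor.
        self.last_seen[agent_id] = now_s

    def live_set(self, now_s: float) -> set:
        # A neighbor is live if its last heartbeat is recent enough.
        return {a for a, t in self.last_seen.items() if now_s - t <= self.timeout_s}


view = LocalMembership()
for agent_id in (1, 2, 3):
    view.heard_from(agent_id, now_s=0.0)
view.heard_from(1, now_s=1.0)
view.heard_from(3, now_s=1.2)                   # agent 2 has been silent since t = 0

live = view.live_set(now_s=2.0)
print(sorted(live))                             # agent 2 has dropped out of the live set
```

Each agent keeps its own view, so no node ever needs the global membership; a change in `live_set` is the signal that would trigger the shrink-and-redistribute response.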

What changes is the failure metric. A training job cares that progress continues; a survey swarm cares about coverage, the fraction of the target area still observed after units are lost. Suppose $n$ agents are placed so that each covers a region and together they cover the target with some redundancy, so every point is seen by at least $r$ agents. After $k$ agents fail, a point is left uncovered only if all of the agents that saw it are among the failed $k$, which requires $k \ge r$. The swarm therefore retains full coverage for any $k < r$ losses, wherever they fall. Beyond that, if failures are spread across the formation, coverage degrades gracefully (shrinking coverage) rather than catastrophically (sudden blind spots). The redundancy factor $r$ is the design knob: a swarm built to survive $k$ simultaneous losses must be flown with

$$r \;\ge\; k + 1, \qquad \text{so the live count after } k \text{ losses is } n - k \;\ge\; n - r + 1.$$
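The redundancy bound can be checked by brute force on a toy layout. The sketch below assumes a ring of $n$ cells in which cell $i$ is watched by agents $i, i+1, \dots, i+r-1$ (mod $n$), so every cell has exactly $r$ watchers; that layout and the helper `covered_fraction` are illustrative assumptions, not a model of any real formation.

```python
from itertools import combinations


def covered_fraction(n: int, r: int, failed) -> float:
    """Fraction of cells that still have at least one live watcher."""
    failed = set(failed)
    covered = 0
    for cell in range(n):
        watchers = {(cell + j) % n for j in range(r)}   # the r agents that see this cell
        if watchers - failed:                           # at least one watcher survived
            covered += 1
    return covered / n


n, r = 8, 3
# Worst-case coverage over every possible set of k failed agents.
worst = {
    k: min(covered_fraction(n, r, s) for s in combinations(range(n), k))
    for k in range(5)
}
for k, frac in worst.items():
    print(f"k = {k} losses -> worst-case coverage {frac:.3f}")
```

For $k < r$ the worst case stays at full coverage no matter which agents fail. From $k = r$ onward the worst case is a cluster of adjacent losses, which is why spreading failures (or redundancy) across the formation matters.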

Self-healing closes the loop. Rather than leaving the $n-k$ survivors in their original, now gap-ridden, positions, each survivor runs a local rule that spreads the live agents back out to restore even coverage, a behavior built directly on the decentralized control laws of Section 39.6. The swarm thus heals from the inside: no agent knows the global formation, each only nudges toward its live neighbors, and the formation re-closes the way a torn mesh re-knits. The same membership signal that triggers elastic resize in Chapter 18 here triggers a physical redistribution in space.
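A one-dimensional sketch of such a local re-spread rule follows. It is not the control law of Section 39.6: the rule used here (each agent moves part of the way toward the midpoint of its two nearest live neighbors), the `gain` value, and the treatment of the strip's two ends as fixed virtual neighbors are assumptions chosen to keep the example short.

```python
import numpy as np


def respread_step(x, lo, hi, gain=0.5):
    """One synchronous step of a hypothetical re-spread rule on a line segment.

    Each agent uses only its left and right neighbors; the ends of the strip
    (lo, hi) act as fixed virtual neighbors for the outermost agents.
    """
    x = np.sort(x)                                   # simulation convenience only
    padded = np.concatenate(([lo], x, [hi]))
    midpoints = 0.5 * (padded[:-2] + padded[2:])     # midpoint of each agent's neighbors
    return x + gain * (midpoints - x)


lo, hi = 0.0, 100.0
x = np.linspace(10.0, 90.0, 9)     # nine agents spaced 10 m apart
x = np.delete(x, 4)                # the agent at 50 m is lost, leaving a 20 m gap
gap_before = np.diff(x).max()

for _ in range(500):
    x = respread_step(x, lo, hi)

gaps = np.diff(np.concatenate(([lo], x, [hi])))
print(f"largest gap before healing: {gap_before:.1f} m")
print(f"gaps after healing: min {gaps.min():.2f} m, max {gaps.max():.2f} m")
```

No agent in this sketch knows the formation as a whole, yet the eight survivors settle into even spacing across the strip, which is the torn-mesh behavior described above.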

Key Insight: Safety Is a Property of the Survivors, Not the Whole Set

In a single robot, safety is a property you verify once for the whole machine. In a swarm, you can never verify the whole set, because the set changes every time a battery dies. The only durable guarantee is one that each surviving agent can re-establish locally from its current neighbors: full coverage while at most $r-1$ are lost, collision avoidance enforced pairwise, robust consensus correct while fewer than $n/3$ lie. Design every swarm safety property as an invariant the live subset maintains by itself, never as a global check that some monitor performs, because at scale there is no monitor and the set is never the one you started with.

3. Byzantine and Compromised Agents in the Swarm Advanced

Agent loss is the easy fault, because a silent agent is at least honest about being gone. The hard fault is the agent that keeps flying and keeps talking but feeds the swarm bad data, whether from a failed sensor or from an attacker who has taken it over. This is the Byzantine fault of Chapter 2, now embodied: a node that can behave arbitrarily, including sending different lies to different neighbors. The classical result from Chapter 2 sets the hard limit. Byzantine consensus among $n$ agents can tolerate $f$ arbitrarily faulty agents and still let the honest agents agree if and only if

$$f \;<\; \frac{n}{3}, \qquad \text{equivalently } n \;\ge\; 3f + 1.$$

Below that ratio the honest agents can outvote and outweigh any coalition of liars; at or above it the liars can split the honest agents into camps that never converge. For a swarm this is a sizing rule: to survive $f$ compromised units you must fly at least $3f+1$ of them, the embodied cost of Byzantine tolerance.
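The sizing rule is simple integer arithmetic, shown here as two small helpers. The function names are hypothetical; the only content is the bound $n \ge 3f + 1$ stated above.

```python
def min_swarm_size(f: int) -> int:
    """Smallest n satisfying n >= 3f + 1."""
    return 3 * f + 1


def max_tolerated_faults(n: int) -> int:
    """Largest f satisfying f < n/3, equivalently 3f + 1 <= n."""
    return max(0, (n - 1) // 3)


for n in (4, 7, 12, 16):
    f_max = max_tolerated_faults(n)
    print(f"n = {n:2d} tolerates up to f = {f_max} (3f + 1 = {min_swarm_size(f_max)})")
```

Note the step behavior: growing the swarm from 13 to 15 units buys no additional Byzantine tolerance, because the tolerated count only rises when $n$ crosses the next value of $3f + 1$.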

The agreement that matters in a swarm is usually not a binary commit but a numeric estimate: an obstacle's range, a target's bearing, the formation's centroid. Here the swarm inherits the Byzantine-robust aggregation of Section 35.5 directly. A plain mean of the agents' estimates is unboundedly corruptible, because one agent reporting a wild value drags the mean arbitrarily far. A robust aggregator (the coordinate-wise median, or a trimmed mean that discards the $f$ lowest and $f$ highest reports before averaging) is provably bounded: with at most $f$ liars and $n - f > f$ honest agents whose true values lie in some interval, the median and the trimmed mean stay inside the honest interval no matter what the liars broadcast. Code 39.9.1 demonstrates exactly this contrast on a swarm range estimate.

import numpy as np

# A swarm of n drones each estimates the obstacle's range (meters) from its own
# noisy sensor. They must agree on ONE value to plan a shared avoidance maneuver.
rng = np.random.default_rng(7)
n = 12                                  # swarm size
truth = 30.0                            # true obstacle range in meters
honest = truth + rng.normal(0, 0.4, n)  # honest drones: small sensor noise

# One drone is Byzantine (faulty or spoofed) and broadcasts a wild value that
# would make the planner believe the obstacle is far away and fly too close.
byz_idx = 4
reports = honest.copy()
reports[byz_idx] = 1000.0               # the compromised broadcast

f = 1                                   # number of faulty agents
print(f"swarm size n         : {n}")
print(f"faulty agents f      : {f}   (need f < n/3 = {n/3:.2f} for robust consensus)")
print(f"Byzantine report     : {reports[byz_idx]:.1f} m   (truth = {truth:.1f} m)")

# Plain (non-robust) aggregation: the arithmetic mean.
plain = float(np.mean(reports))

# Robust aggregators that tolerate up to f outliers.
median = float(np.median(reports))

# Trimmed mean: drop the f lowest and f highest before averaging (Section 35.5).
srt = np.sort(reports)
trimmed = float(np.mean(srt[f:n - f]))

print()
print(f"plain mean           : {plain:8.2f} m   (corrupted)")
print(f"robust median        : {median:8.2f} m")
print(f"trimmed mean (f={f})   : {trimmed:8.2f} m")
print()
print(f"plain-mean error     : {abs(plain - truth):8.2f} m")
print(f"median error         : {abs(median - truth):8.2f} m")
print(f"trimmed-mean error   : {abs(trimmed - truth):8.2f} m")

Code 39.9.1: Robust consensus among swarm agents. Twelve drones report a noisy obstacle range; one is Byzantine and broadcasts $1000$ meters. The plain mean and two robust aggregators (median, trimmed mean) are computed and compared against the true range so the corruption of the mean is visible directly.

swarm size n         : 12
faulty agents f      : 1   (need f < n/3 = 4.00 for robust consensus)
Byzantine report     : 1000.0 m   (truth = 30.0 m)

plain mean           :   110.81 m   (corrupted)
robust median        :    30.01 m
trimmed mean (f=1)   :    30.01 m

plain-mean error     :    80.81 m
median error         :     0.01 m
trimmed-mean error   :     0.01 m

Output 39.9.1: A single liar shifts the plain mean by more than $80$ meters, enough to send the swarm into the obstacle it was avoiding, while the median and trimmed mean stay within a centimeter of the truth. With $f=1$ well under $n/3 = 4$, robust consensus tolerates the Byzantine agent exactly as the bound predicts.

The lesson is the one Figure 39.9.1 drew: the swarm survives a lying member not by detecting and ejecting it (which an attacker can make hard) but by aggregating in a way that the lie cannot move, as long as the liars stay below the $n/3$ fraction. This is fault tolerance by construction, the embodied descendant of the Byzantine-robust gradient aggregation that protected federated training in Section 35.5.

Thesis Thread: Byzantine-Robust Aggregation, Scaled Out to the Air

The robust aggregator in Code 39.9.1 is the same primitive that appeared as a defense against poisoned gradients in distributed learning (Section 35.5), itself a transformation of the fault-tolerance arc that began with MapReduce re-execution (Chapter 6) and elastic recovery (Chapter 18). The book's claim is that a small set of distribution primitives recurs across every scale; here that primitive leaves the data center and keeps a swarm of robots safe with no monitor watching. When you meet a swarm agreement step, ask which aggregator it uses, and whether that aggregator's breakdown point is above the fraction of faults you expect to fly.

4. Safety Guarantees Under Decentralized Control Advanced

Robust consensus decides what the swarm believes; safety invariants decide what it is allowed to do regardless of that belief. The control-safety invariants of Section 39.6 become the last line of defense precisely because they hold even when consensus is wrong or unavailable. Three invariants matter for an embodied swarm, and all three are enforced locally. Geofencing caps each agent's allowed region: the agent refuses any commanded velocity that would carry it outside a polygon it stores onboard, so no consensus result, honest or malicious, can drive it into restricted airspace. Pairwise collision avoidance maintains a minimum separation between any two agents through a reciprocal rule that each agent runs against its neighbors, the embodied invariant of Section 39.6, so two robots cannot occupy the same space even if both planned to. Return-to-home is each agent's own watchdog: when an agent loses consensus contact, detects critical battery, or trips its own fault check, it abandons the mission and flies a safe recovery path, the embodied analogue of a process that fails fast rather than continuing in an unknown state.

The defining property of these invariants is that they are decentralized and adversary-independent. Geofencing and collision avoidance are checked by each agent on its own commanded motion using only local state, so they keep holding when an agent is cut off from the swarm, when consensus has been corrupted by more liars than the $n/3$ bound allows, or even when a single agent has been compromised entirely (a compromised agent still cannot pull an honest neighbor through its collision invariant, because the honest neighbor enforces separation on its own side of every pair). Safety thus does not depend on the network being up, the consensus being correct, or any monitor being present. It is layered: robust consensus tries to make the swarm believe true things, and the local invariants guarantee that even when that fails, the swarm cannot do an unsafe thing. This layering, a best-effort agreement under a hard local safety floor, is what lets a swarm be both useful and trustworthy without a central authority.
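The idea of a hard local floor under a best-effort command can be sketched as a velocity filter each agent applies to itself. This is a deliberately crude illustration under stated assumptions: the geofence is an axis-aligned box rather than a polygon, neighbors are treated as stationary for one time step, the fallback is to hover, and the function name `safety_filter` is hypothetical. A real controller would use the reciprocal avoidance rule of Section 39.6.

```python
import numpy as np


def safety_filter(pos, v_cmd, neighbors, fence_lo, fence_hi, d_min, dt):
    """Pass v_cmd through only if the next position is inside the fence and
    at least d_min from every neighbor's current position; otherwise hover."""
    nxt = pos + dt * v_cmd
    inside = bool(np.all(nxt >= fence_lo) and np.all(nxt <= fence_hi))
    separated = all(np.linalg.norm(nxt - q) >= d_min for q in neighbors)
    return v_cmd if (inside and separated) else np.zeros_like(v_cmd)


pos = np.array([9.5, 5.0])
neighbors = [np.array([9.5, 8.0])]
fence_lo, fence_hi = np.array([0.0, 0.0]), np.array([10.0, 10.0])
args = dict(neighbors=neighbors, fence_lo=fence_lo, fence_hi=fence_hi, d_min=1.5, dt=1.0)

leaves_fence = safety_filter(pos, np.array([1.0, 0.0]), **args)   # would exit the box
safe_move = safety_filter(pos, np.array([0.0, 1.0]), **args)      # ends 2.0 m from neighbor
too_close = safety_filter(pos, np.array([0.0, 2.0]), **args)      # would end 1.0 m from neighbor

print(leaves_fence, safe_move, too_close)
```

The filter never consults the consensus value or the network: whatever produced `v_cmd`, honest or corrupted, the check uses only the agent's own position, its stored fence, and the neighbors it can sense.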

5. Securing the Swarm's Communication Intermediate

Every guarantee so far assumed that an agent's messages reach its neighbors and that a message claiming to come from agent $i$ really did. An adversary attacks exactly those assumptions, turning the comm network of Section 39.4 into the swarm's softest target. Jamming floods the radio band so that messages are lost, which to the membership protocol looks like mass agent loss; the swarm must therefore treat a comms blackout as a safety event, falling back to the local invariants of Section 4 (each agent holds separation and geofence on its own, and triggers return-to-home if isolation persists) rather than continuing to plan on stale agreements. Spoofing is the network-layer face of the Byzantine fault: an attacker injects messages forged to look like a legitimate agent, or replays old ones, attempting to insert exactly the kind of outlier that Section 3's robust consensus is built to reject. The two defenses compose. Authentication (signed, sequence-numbered messages, the security machinery of Section 35.3) shrinks the attacker's reach to nodes whose keys it actually holds, and robust aggregation absorbs whatever forged or compromised reports get through, as long as their count stays below the $n/3$ fraction.

The scale-out lesson sharpens here. A central-monitor design would let an attacker win by taking out one node, the monitor; a swarm has no such single point, which is its security advantage, but it pays for that advantage by needing every agent to authenticate and robustly aggregate on its own, because there is no trusted referee to do it for them. Security in the swarm is therefore not a perimeter around a controller but a property each agent enforces against its own neighbors, which is precisely why it survives the loss or capture of any minority of the fleet.
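A minimal sketch of authenticated, sequence-numbered messages, using only the Python standard library, is given below. It is not the machinery of Section 35.3: it substitutes a shared-key HMAC for a true digital signature, and the message layout, the `seal` helper, and the `Receiver` class are assumptions made for the example.

```python
import hashlib
import hmac
import json


def seal(key: bytes, sender: int, seq: int, payload: dict) -> dict:
    """Serialize a report and attach an HMAC-SHA256 tag (stand-in for a signature)."""
    body = json.dumps({"sender": sender, "seq": seq, "payload": payload},
                      sort_keys=True).encode()
    return {"body": body, "tag": hmac.new(key, body, hashlib.sha256).hexdigest()}


class Receiver:
    """Hypothetical per-agent verifier: rejects forged and replayed messages."""

    def __init__(self, keys: dict):
        self.keys = keys        # sender id -> shared key (assumed pre-distributed)
        self.last_seq = {}      # highest sequence number accepted per sender

    def accept(self, msg: dict):
        try:
            fields = json.loads(msg["body"])
            sender, seq = fields["sender"], fields["seq"]
            key = self.keys.get(sender)
        except (ValueError, KeyError, TypeError):
            return None                                   # malformed message
        if key is None:
            return None                                   # unknown sender
        expected = hmac.new(key, msg["body"], hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, msg["tag"]):
            return None                                   # forged or corrupted
        if seq <= self.last_seq.get(sender, -1):
            return None                                   # replayed or stale
        self.last_seq[sender] = seq
        return fields["payload"]


key = b"demo-key-agent-3"
rx = Receiver({3: key})
first = seal(key, sender=3, seq=1, payload={"range_m": 30.1})
forged = seal(b"attacker-guess", sender=3, seq=2, payload={"range_m": 1000.0})

accepted = rx.accept(first)     # valid tag, fresh sequence number
replayed = rx.accept(first)     # same message again
spoofed = rx.accept(forged)     # tag computed with the wrong key
print(accepted, replayed, spoofed)
```

Authentication only narrows the attack: an agent whose key has been captured still passes every check here, which is why the accepted payloads must still go through a robust aggregator.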

Library Shortcut: NumPy and SciPy Give You the Robust Aggregators

The robust consensus in Code 39.9.1 is built from primitives you do not implement yourself. The coordinate-wise median is numpy.median(reports, axis=0), and the trimmed mean is one call:

from scipy.stats import trim_mean
import numpy as np

reports = np.array([30.1, 29.8, 30.3, 1000.0, 29.9, 30.0])  # one Byzantine value
robust = trim_mean(reports, proportiontocut=0.2)            # drop 20% each tail
coord_median = np.median(reports)                           # median (pass axis=0 for vector reports)

Code 39.9.2: The same robust aggregation as Code 39.9.1 in two library calls. scipy.stats.trim_mean handles the sort-and-trim and numpy.median the order statistic; for vector estimates such as a position, the coordinate-wise median generalizes directly by passing axis=0. You must still choose the trim fraction yourself: trim_mean drops int(proportiontocut * n) values from each tail, so proportiontocut must be at least the expected fraction $f/n$ of faulty agents (and below $0.5$) for all $f$ faulty reports on either side to be discarded.
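For vector estimates, the axis=0 form can be illustrated on a handful of position reports. The numbers below are made up for the example; the only library call is numpy.median with axis=0.

```python
import numpy as np

# Each row is one agent's (x, y, z) estimate of the same target, in meters.
positions = np.array([
    [10.0, 20.0, 5.0],
    [10.2, 19.9, 5.1],
    [9.9, 20.1, 4.9],
    [10.1, 20.0, 5.0],
    [500.0, -500.0, 900.0],     # one Byzantine report
])

plain = positions.mean(axis=0)          # dragged far off in every coordinate
robust = np.median(positions, axis=0)   # median taken per coordinate

print("plain mean       :", np.round(plain, 2))
print("coordinate median:", robust)
```

Each coordinate of the result lies within the range spanned by the four honest reports for that coordinate, although the combined vector need not equal any single agent's report.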

Practical Example: The Survey Swarm That Lost Three Drones and a Liar

Who: A robotics engineer running a sixteen-drone swarm mapping a flood zone for a disaster-response agency.

Situation: Mid-mission, two drones hit battery limits and one lost its GPS lock, while a fourth, with a failing rangefinder, kept broadcasting obstacle ranges that were wildly short.

Problem: The original aggregation was a plain average of neighbor range reports, so the one failing sensor was pulling the whole local formation into phantom avoidance maneuvers, wasting battery and opening coverage gaps.

Dilemma: Add a central ground-station monitor to detect and override bad agents (simple to reason about, but a single point of failure and useless when radio to the ground dropped), or push all safety logic onto the drones themselves (harder to design, but resilient to losing any minority of units and the ground link).

Decision: They went fully decentralized: coordinate-wise median for every neighbor aggregate, pairwise collision and geofence invariants on each drone, and elastic re-spreading on membership change, sized so the swarm tolerated $f \le 5$ faults ($16 \ge 3 \cdot 5 + 1$).

How: They replaced the averaging call with numpy.median, wired the missed-heartbeat signal from Section 35.2 to a local re-spread rule, and signed all inter-agent messages so spoofed reports could not even enter the aggregate.

Result: The failing rangefinder's reports were rejected by the median, the three lost drones triggered an automatic re-spread that held coverage above ninety percent, and the survey finished with no collision and no ground intervention.

Lesson: At swarm scale the central monitor is the liability, not the safeguard. Robust local aggregation plus local invariants tolerated four simultaneous faults that a single averaging step and a single monitor would each have turned into a mission failure.

Research Frontier: Certified Decentralized Safety for Swarms (2024 to 2026)

The active question is how to give a swarm a provable safety certificate that holds under both faults and adversaries, not just an empirical one. Control-barrier-function methods now produce decentralized collision-avoidance controllers with formal pairwise-safety guarantees, and recent work extends them to be robust to bounded sensing error and to a fraction of misbehaving agents, joining the resilient-consensus literature (median-based and trimmed-mean aggregation with proven breakdown points) to the control layer. A parallel thread studies resilient flocking and coverage under up to $f$ Byzantine robots, tightening the $n \ge 3f+1$ flying cost with topology conditions on which agents can hear which. The open frontier is composing these: a single certificate covering robust consensus, the local control invariant, and an authenticated comm layer at once, so that a swarm can carry a safety proof the way an aircraft carries an airworthiness certificate. Until that composition is routine, the layered best-effort-plus-hard-floor design of this section is the state of practice.

Fun Note: The Drone That Insisted the Ground Was a Thousand Meters Away

The wild $1000$ in Code 39.9.1 is not a contrived number. Failing ultrasonic and lidar rangefinders have a habit of returning their maximum-range sentinel value, a big round number meaning "I see nothing", which a naive averager dutifully folds in as if a wall really were a kilometer off. The median's reply is the deadpan one: with eleven honest neighbors saying "thirty meters", one drone shouting "a thousand" is simply outvoted, no argument, no detection logic, no drama. Robustness here is mostly the art of refusing to be impressed by the loudest sensor.

Exercise 39.9.1: Sizing the Swarm for Faults Conceptual

A mission requires the swarm to remain safe and in agreement while up to $f = 4$ agents may be compromised or feeding bad data, and separately to retain full sensor coverage while up to $k = 2$ agents are lost to battery or crash. Using the bounds from Sections 2 and 3, state the minimum swarm size that satisfies the Byzantine-tolerance condition and the minimum coverage redundancy $r$ each point must have. Explain why the Byzantine bound and the coverage bound are different constraints and which one dominates the unit count here. Then argue why a single central monitor would change neither bound in a way that helps.

Exercise 39.9.2: Push the Robust Aggregator Past Its Breakdown Coding

Modify Code 39.9.1 so that the number of Byzantine agents is a variable you increase from $1$ up to $n/2$, each broadcasting the same wild value. For each count, compute the plain mean, the median, and the trimmed mean (with the trim fraction matched to the true fault count), and plot all three errors against the fault count. Identify empirically the fault fraction at which the median's error suddenly jumps, and explain why that jump occurs exactly when the faulty agents stop being a strict minority of the reports, relating it to the $f < n/3$ consensus bound and to the breakdown point of an order statistic. What does the trimmed mean do once the trim fraction is set too low for the actual fault count?

Exercise 39.9.3: A Cascade Through the Layers Analysis

Trace a single sensor failure through the four-layer defense of this section. An agent's rangefinder fails and it begins broadcasting a short obstacle range. For each layer in turn (robust consensus of Section 3, the local collision and geofence invariants of Section 4, the elastic self-heal of Section 2, the authenticated comm layer of Section 5), state precisely what that layer does to the fault, what it cannot do, and which downstream layer catches what it misses. Then describe one realistic way the layers could compose to fail, for example a fault count that exceeds $n/3$ combined with a jamming event, and explain what design margin (extra units, a larger trim fraction, tighter geofence) would have prevented it.