Section 35.5: Byzantine-Robust Aggregation

"They sent me twenty numbers and swore that all twenty were honest. I sorted them, looked at the one in the middle, and politely ignored the rest of the argument."
A Median, Refusing to Be Dragged Around by Outliers

Big Picture

The averaging step at the heart of distributed training has a breakdown point of zero: a single malicious worker can move the aggregate as far as it likes, so trustless distribution needs an aggregation rule whose robustness is a tunable fraction of the fleet rather than nothing at all. The previous section showed how a Byzantine worker poisons a run by sending a crafted gradient into the all-reduce. This section is the defense. We replace the mean with estimators that tolerate up to a fraction $f$ of arbitrary, colluding workers: the coordinate-wise median, the trimmed mean, Krum and Multi-Krum, the geometric median, and the layered Bulyan. Each buys a positive breakdown point at a measurable cost in statistical efficiency and compute, and each interacts awkwardly with secure aggregation and with non-IID data. The goal is to make "how many traitors can this fleet survive?" a number you set, not a hope.

In Section 35.4 we watched one compromised worker take over a training run by submitting a gradient that the coordinator dutifully averaged into the model. The mechanism was not a software bug; it was the aggregation rule. Standard data-parallel training and federated learning both combine worker updates by averaging them, the exact identity that made distribution painless in Section 1.1 and that returns as gradient all-reduce in Chapter 15 and as FedAvg in Chapter 14. The average is exact, cheap, and catastrophically fragile: it trusts every contributor equally. Once we drop the assumption that every worker is honest, the threat model of Section 35.3 forces a different question. Not "what is the mean of these updates?" but "what is a value that the honest majority would endorse, no matter what the rest send?" That question has a classical answer in robust statistics, and porting it onto the fleet is the subject of this section.

1. The Breakdown Point of the Mean Is Zero Intermediate

The breakdown point of an estimator is the smallest fraction of the input that an adversary must corrupt to drive the output arbitrarily far from the value the clean data would have produced. It is the right lens for Byzantine robustness because it asks exactly the adversary's question: how little do I have to control to win? For the sample mean the answer is brutal. Let $N$ workers submit update vectors $g_1, \dots, g_N \in \mathbb{R}^d$, and let the aggregate be $\bar{g} = \frac{1}{N} \sum_{i=1}^{N} g_i$. Suppose a single worker $j$ is Byzantine and sends $g_j = c \cdot v$ for any direction $v$ and any scalar $c$. Then

$$\bar{g} = \frac{1}{N}\Big( \sum_{i \neq j} g_i \Big) + \frac{c}{N}\, v,$$

and as $c \to \infty$ the aggregate runs off to infinity along $v$. One worker out of a thousand, out of a million, suffices. The breakdown point of the mean is $1/N$, which tends to zero as the fleet grows; in the limit, the most widely scaled aggregation rule in this book is the least robust one. This is the formal statement of the Section 35.4 attack, and it connects directly to the Byzantine fault tolerance and consensus material of Chapter 2: just as a replicated state machine cannot agree if it counts every vote naively, a fleet cannot aggregate if it averages every update naively.

Figure 35.5.1: The geometry of the attack and the defense. The honest workers (green) form a tight cluster around the value a clean fleet would agree on; a few Byzantine workers (red) place their updates far away. The plain mean (red ring) is pulled out of the cluster along the dashed arrow, exactly the $\frac{c}{N} v$ term above. A robust rule such as the coordinate-wise median or Krum (blue ring) reports a value that remains inside the honest cluster, because its breakdown point is a positive fraction of $N$, not $1/N$.

2. Median and Trimmed Mean: Robustness Coordinate by Coordinate Intermediate

The oldest fix in robust statistics is to replace the mean with the median, and it ports directly to vectors if we apply it independently along each coordinate. The coordinate-wise median aggregate $\hat{g}$ has entries

$$\hat{g}[k] = \operatorname{median}\big( g_1[k],\, g_2[k],\, \dots,\, g_N[k] \big), \qquad k = 1, \dots, d.$$

The median of a list does not move until more than half of its entries move, so its breakdown point is $1/2$: the coordinate-wise median tolerates any $f < N/2$ Byzantine workers, because along each coordinate the honest majority still owns the middle order statistic. The coordinate-wise trimmed mean reaches the same regime with smoother behavior. Fixing a trim count $\beta$ with $f \le \beta < N/2$, it discards the $\beta$ largest and $\beta$ smallest values along each coordinate and averages what remains:

$$\hat{g}[k] = \frac{1}{N - 2\beta} \sum_{i \in \mathcal{K}_k} g_i[k], \qquad \mathcal{K}_k = \{\text{indices kept after trimming coordinate } k\}.$$

Discarding the $\beta$ extremes on each side throws away any value an adversary placed at the tail, so the surviving average is computed over honest-dominated entries. Both estimators share a weakness that matters for high-dimensional models: they decide each coordinate in isolation, so a clever adversary can stay just inside the kept band on every coordinate while still pushing a coherent direction. They also blur the honest signal, because the median and the trimmed mean are less statistically efficient than the mean when the data really are clean. That loss of efficiency is the price of the positive breakdown point, and Code 35.5.1 measures it directly.

Key Insight: Robustness Is Bought With Efficiency and Compute

There is no aggregation rule that is both as efficient as the mean on clean data and robust to a constant fraction of arbitrary workers; the two goals trade off. The median and trimmed mean pay in variance (a noisier estimate of the honest gradient, so slightly slower convergence). Krum and Bulyan pay in compute (pairwise distances cost $O(N^2 d)$ per round, against $O(N d)$ for the mean). When you choose a robust aggregator you are choosing where on this frontier to sit, and the right point depends on how adversarial your fleet actually is, not on a default.

3. Krum: Trust the Update That Looks Like Its Neighbors Advanced

Median and trimming work coordinate by coordinate; Krum instead treats each worker's update as a single point in $\mathbb{R}^d$ and asks which point sits deepest inside the crowd. For each candidate $i$, let $i \to j$ denote the set of the $N - f - 2$ other workers closest to $i$ in Euclidean distance, and define the Krum score as the sum of squared distances to exactly those closest neighbors,

$$s(i) = \sum_{j \,\in\, i \to j} \lVert g_i - g_j \rVert^2 .$$

Krum selects the single update with the smallest score, $\hat{g} = g_{i^\star}$ where $i^\star = \arg\min_i s(i)$. The intuition is that an honest update is near the $N - f$ other honest updates, so its sum of distances to its closest $N - f - 2$ neighbors is small; a Byzantine update that sits far away to do damage is also far from those same honest points and scores poorly. The neighbor count $N - f - 2$ is chosen so the sum is taken only over points that an honest worker is guaranteed to be close to even when $f$ of the workers collude. Krum's guarantee is tighter than the median's: it requires $2f + 2 < N$, that is $f < (N-2)/2$, and is usually quoted as tolerating $f < N/3$ in the regimes where its convergence proof is clean. Because it returns one actual worker's vector, Krum never invents a value in between, which helps in non-convex landscapes but throws away the averaging that makes clean training fast. Multi-Krum recovers some of that speed by selecting the $m$ lowest-scoring updates and averaging them, interpolating between single-Krum robustness and FedAvg efficiency as $m$ ranges from $1$ to $N - f$.

The demonstration below pits all three robust rules against the plain mean on a fleet of twenty workers, six of which are Byzantine and collude on a single wrong direction. The honest workers cluster around a target gradient; we measure each aggregate by its distance to the mean the honest workers alone would have produced.

import numpy as np

rng = np.random.default_rng(7)
N, d = 20, 100            # workers, gradient dimension
f = 6                     # Byzantine workers among the 20
g_true = rng.standard_normal(d)   # the direction honest workers concentrate around

# Honest updates: a tight cloud around g_true.
honest = g_true + 0.15 * rng.standard_normal((N - f, d))
# Byzantine updates: a large coordinated push in a fixed wrong direction.
attack = -8.0 * g_true + 0.15 * rng.standard_normal((f, d))
U = np.vstack([honest, attack])          # server sees all N rows, identities unknown
rng.shuffle(U)

honest_mean = honest.mean(axis=0)        # the answer a clean fleet would produce
def err(v): return float(np.linalg.norm(v - honest_mean))

# Plain mean: the non-robust baseline (FedAvg / all-reduce mean).
plain = U.mean(axis=0)

# Coordinate-wise median: per-coordinate middle order statistic.
cw_median = np.median(U, axis=0)

# Coordinate-wise trimmed mean: drop the f largest and f smallest per coordinate.
S = np.sort(U, axis=0)
trimmed = S[f:N - f].mean(axis=0)

# Krum: pick the update whose sum of squared distances to its
# closest N - f - 2 neighbors is smallest.
D2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
m = N - f - 2
scores = np.array([np.sort(D2[i])[1:m + 1].sum() for i in range(N)])  # skip self (the 0)
krum = U[int(np.argmin(scores))]

print(f"workers N            : {N}   (Byzantine f = {f}, fraction {f/N:.0%})")
print(f"plain mean   error   : {err(plain):8.3f}")
print(f"coord median error   : {err(cw_median):8.3f}")
print(f"trimmed mean error   : {err(trimmed):8.3f}")
print(f"Krum         error   : {err(krum):8.3f}")
sel = "Byzantine" if (krum @ g_true) < 0 else "honest"
print(f"Krum selected a worker that is {sel}")

Code 35.5.1: Coordinate-wise median, coordinate-wise trimmed mean, and Krum on a fleet with six colluding Byzantine workers. The plain mean is included as the non-robust baseline; every aggregate is scored by its Euclidean distance to honest_mean, the value a clean fleet of fourteen honest workers would have produced.

workers N            : 20   (Byzantine f = 6, fraction 30%)
plain mean   error   :   24.052
coord median error   :    0.815
trimmed mean error   :    0.960
Krum         error   :    1.206
Krum selected a worker that is honest

Output 35.5.1: The plain mean lands more than twenty units from the honest target, dragged out exactly as the $\frac{c}{N}v$ term predicts. The three robust rules all stay within roughly one unit of the honest mean, and Krum selects an honest worker's update, never the attackers'. Robustness held at a thirty percent Byzantine fraction, well inside the $f < N/3$ regime for Krum and $f < N/2$ for the median.

4. Geometric Median and Bulyan: Stronger Rules at Higher Cost Advanced

The coordinate-wise median has the weakness that it judges each axis separately. The geometric median fixes this by minimizing total Euclidean distance jointly across all coordinates,

$$\hat{g} = \arg\min_{z \,\in\, \mathbb{R}^d} \sum_{i=1}^{N} \lVert z - g_i \rVert_2 ,$$

which is rotation-invariant and has breakdown point close to $1/2$. It has no closed form, but the classical Weiszfeld iteration converges quickly, and approximate geometric-median aggregators such as RFA are used in production federated systems. Bulyan goes further by composing two defenses: it first runs Multi-Krum to select a robust shortlist of candidate updates, then applies a coordinate-wise trimmed mean over that shortlist. Layering catches the failure mode of single Krum, where an attacker crafts an update that is near the honest cluster in aggregate distance yet skews one coordinate hard; the trimming step at the end removes that per-coordinate skew. Bulyan inherits the tightest requirement of the family, roughly $N \ge 4f + 3$, which is the cost of its stronger guarantee. The whole family, from median through Bulyan, is summarized in Table 35.5.1.

Table 35.5.1: The robust aggregation family, with the Byzantine fraction each rule tolerates and its dominant per-round cost. $N$ workers, gradient dimension $d$; costs are for one aggregation step.

Rule	What it computes	Tolerates	Cost per round
Plain mean (FedAvg)	arithmetic average	$f = 0$ (breakdown $1/N$)	$O(Nd)$
Coordinate-wise median	per-axis median	$f < N/2$	$O(Nd)$
Coordinate-wise trimmed mean	per-axis trimmed average	$f \le \beta < N/2$	$O(Nd)$
Krum / Multi-Krum	nearest-neighbor scoring	$f < N/3$ (needs $2f+2 < N$)	$O(N^2 d)$
Geometric median	joint $L_2$ minimizer	$f < N/2$	$O(N d)$ per Weiszfeld step
Bulyan	Multi-Krum then trimmed mean	$f < N/4$ (needs $N \ge 4f+3$)	$O(N^2 d)$

Library Shortcut: Robust Aggregators Are One Strategy Object

You rarely implement the pairwise-distance loops of Code 35.5.1 in production. Federated frameworks expose robust aggregation as a swappable strategy, so switching from FedAvg to Krum or trimmed mean is a one-line change of the server's aggregation object. In Flower, for example, the server strategy carries the rule:

# pip install flwr
import flwr as fl

# Plain, non-robust baseline:  strategy = fl.server.strategy.FedAvg()
# Byzantine-robust drop-in replacements, same Strategy interface:
strategy = fl.server.strategy.Krum(num_malicious_clients=6, num_clients_to_keep=1)
# or fl.server.strategy.FedTrimmedAvg(beta=0.2)   # coordinate-wise trimmed mean
# or fl.server.strategy.Bulyan(num_malicious_clients=6)

fl.server.start_server(config=fl.server.ServerConfig(num_rounds=20), strategy=strategy)

Code 35.5.2: The roughly twenty lines of aggregation logic in Code 35.5.1 collapse to one constructor call. Flower's Strategy handles client sampling, the aggregation rule, and parameter broadcast; specialized libraries such as the Byzantine-robust toolkits built on PyTorch offer the same swap for centralized data-parallel training.

5. The Two Tensions: Secure Aggregation and Non-IID Data Advanced

Robust aggregation does not compose freely with the other guarantees a fleet wants. The first tension is with secure aggregation, the cryptographic protocol from Chapter 14 that lets the server learn only the sum of client updates, never any individual one, to protect privacy. Every robust rule in this section needs to inspect individual updates: the median compares per-coordinate values, Krum computes pairwise distances, trimming sorts each axis. A server that can only see the masked sum cannot run any of them directly. Reconciling the two is an active research problem, addressed by computing robust statistics inside multi-party computation, by verifiable distance proofs, or by having clients cluster themselves before masking. The second tension is with non-IID data, the realistic case in federated learning where each client's gradient legitimately differs because its local distribution differs. A robust rule cannot tell a genuinely heterogeneous honest update from a malicious one by distance alone, so it may discard the very clients whose data is rarest and most valuable. The more non-IID the fleet, the wider the honest cluster in Figure 35.5.1, and the harder it is to draw a boundary that keeps every honest worker inside while keeping every attacker out.

Practical Example: The Keyboard Model That Trusted Every Phone

Who: A machine learning engineer running a federated next-word prediction model across millions of mobile keyboards.

Situation: The model trained nightly by averaging on-device gradient updates with FedAvg, and quality had been climbing steadily for months.

Problem: A coordinated set of emulated clients began submitting crafted updates that injected a specific brand phrase as a high-probability suggestion, a model-poisoning attack riding on the zero breakdown point of the average.

Dilemma: Keep FedAvg and add server-side anomaly heuristics that attackers could probe and evade, or switch to a robust aggregator and accept slower convergence and the risk of dropping honest clients with unusual dialects.

Decision: They moved the server to a coordinate-wise trimmed mean with a conservative trim fraction, tuned against a held-out clean validation set so the efficiency loss stayed small.

How: The change was a one-line swap of the aggregation strategy, as in Code 35.5.2, plus a monitoring dashboard tracking how often each client's update fell into the trimmed tail.

Result: The injected phrase stopped reaching the global model, validation perplexity rose by under one percent, and the tail-frequency dashboard flagged the emulated clients within two rounds.

Lesson: On an open fleet you cannot audit, the aggregation rule is the security boundary; pick its breakdown point deliberately and watch who lands in the discarded tail.

Thesis Thread: Trustless Scale-Out Needs a Tunable Breakdown Point

The whole book has treated the all-reduce average of Section 1.1 as the exact, friendly combine step, and across Chapter 4 and the parallel-training chapters that is precisely what it is, because those fleets are trusted. The moment the fleet is open, federated across strangers' phones, or merely large enough that some node is always compromised, that same average becomes the system's softest target. This is the sharpest turn in the book's coordination thesis: scaling out across machines you do not control means the aggregation rule can no longer have breakdown point zero. The robust estimators here turn "fraction of the fleet I can survive losing to an adversary" into a dial the operator sets, the same way replication turned "fraction of crashed nodes I can survive" into a dial in Chapter 2. Reliable distribution and trustless distribution are the same problem viewed through breakdown point.

Research Frontier: Adaptive Attacks and Robust Aggregation (2024 to 2026)

The robust rules here defend against an adversary that places updates far away, but a line of work on adaptive and stealthy attacks (in the lineage of Fang et al. and the "a little is enough" attacks) shows that an adversary who knows the aggregation rule can craft updates that stay inside the kept band and still degrade the model slowly, defeating distance-based defenses such as Krum without ever triggering them. The current frontier combines several signals rather than one: bucketing or clustering clients before aggregation, momentum-aware and history-aware filtering that tracks each client across rounds (CenteredClip and its descendants), and learning-based detectors trained to separate poisoned from honest updates. A parallel thread fuses robustness with differential privacy and secure aggregation at once, since the noise added for privacy and the trimming done for robustness interact, work we connect to the distributed differential-privacy material in Section 35.6. The honest summary is that no single rule is a complete defense; robust aggregation is one layer in a stack that also watches behavior over time. We meet the optimization-theoretic side of this, robust and stale-gradient SGD, in Chapter 10.

Fun Note: The Twelve Generals Were Just Workers With Opinions

The word Byzantine comes from Lamport's parable of generals besieging a city who must agree to attack together despite traitors among them sending contradictory messages. Swap generals for GPUs and battle plans for gradient vectors and you have this section exactly. The medieval generals at least knew how many of them there were; a modern fleet aggregating updates from anonymous phones often does not even know $N$ for certain, which is why production systems quietly over-provision the assumed $f$.

Exercise 35.5.1: Why N over 3 and Not N over 2 Conceptual

The coordinate-wise median tolerates $f < N/2$ Byzantine workers, but Krum is usually quoted as tolerating only $f < N/3$. Using the neighbor count $N - f - 2$ in the Krum score, explain why Krum needs strictly more than half the fleet to be honest by a margin, while the median needs only a bare majority. Then argue informally why this stricter requirement buys Krum something the median lacks: robustness to an adversary that spreads its corruption across many coordinates at once rather than concentrating it on a few.

Exercise 35.5.2: Implement Multi-Krum and Bulyan Coding

Extend Code 35.5.1 in two steps. First, turn Krum into Multi-Krum by selecting the $m$ lowest-scoring updates and averaging them; sweep $m$ from $1$ to $N - f$ and plot the aggregate's error against honest_mean, confirming that small $m$ is most robust while large $m$ approaches the plain mean. Second, implement Bulyan by running Multi-Krum to shortlist updates and then applying the coordinate-wise trimmed mean over that shortlist. Construct a "stealthy" attack that skews a single coordinate hard while staying near the honest cluster overall, and show that single Krum is fooled while Bulyan is not.

Exercise 35.5.3: The Cost of Robustness on a Real Fleet Analysis

A fleet has $N = 1000$ workers each holding a gradient of $d = 10^9$ numbers. Using Table 35.5.1, estimate the per-round aggregation cost of the plain mean ($O(Nd)$) versus Krum ($O(N^2 d)$), and argue why naive Krum is infeasible at this scale and dimension. Then propose two mitigations: hierarchical aggregation that runs Krum within small groups before combining group results, and dimensionality reduction or per-layer aggregation that shrinks the effective $d$. For each, state which guarantee from this section you might weaken in exchange for the speedup, tying your answer back to the efficiency-versus-robustness frontier of the Key Insight.