Section 35.4: Data and Model Poisoning in Distributed and Federated Settings

"They asked me to average everyone's wisdom into one model. Nobody mentioned that one of the voices was a liar, smiling as it uploaded its poison."
A FedAvg Server That Trusted Its Clients

Big Picture

When training is distributed across machines you do not control, the data and the model updates become an attack surface: a participant can deliberately corrupt the dataset it trains on, or craft the gradient it sends, so that the shared model learns what the attacker wants rather than what the task demands. The threat model of the previous section assumed honest-but-faulty participants; this section drops that assumption and studies the actively malicious one. The danger is sharper in federated learning than in a trusted cluster, because the aggregating server sees only each client's update, never the raw data behind it, so it cannot inspect the input for poison. Worse, the averaging rule at the heart of federated training gives every client direct, quantifiable leverage on the global model: a single update of large enough magnitude can move the result anywhere. This section shows how data poisoning, model-update poisoning, backdoors, and sybil amplification exploit exactly these two facts, which is why the section that follows builds a defense around them.

In the previous section we built a threat model for distributed training and treated the participants as honest but unreliable: a worker might crash, lag, or return a stale gradient, but it was trying to compute the right answer. That assumption is comfortable inside a single trusted datacenter, where every machine runs your code on your data. It collapses the moment training reaches across organizational or device boundaries, which is exactly what federated learning (Chapter 14) and federated edge learning (Section 34.6) are designed to do. A phone, a hospital server, or a partner company that contributes to the shared model is no longer guaranteed to be on your side. This section studies what an adversary in that position can do, and why the distributed structure of the training makes the attack both easier to mount and harder to detect.

The target throughout is stochastic gradient descent (SGD), the optimizer introduced in Chapter 10 and parallelized in Chapter 15. Every attack below is ultimately a way of feeding SGD a poisoned ingredient: a poisoned example, a poisoned gradient, or a poisoned client identity. We organize them from the most familiar to the most distribution-specific, then quantify the leverage that makes them work.

1. Poisoning, and Why Distribution Sharpens It (Beginner)

A poisoning attack corrupts the training process so that the resulting model is wrong in a way the attacker chose. It is distinct from an evasion attack, which leaves the model untouched and instead crafts a malicious input at inference time. Poisoning happens earlier, during learning, and it is the natural threat for any system that learns from data it did not fully curate. Two coarse goals separate the attacks. An availability attack (sometimes called untargeted) tries to wreck the model's overall accuracy, turning a useful classifier into a coin flip. A targeted attack is surgical: it leaves accuracy on ordinary inputs almost untouched, so the damage is invisible to anyone watching aggregate metrics, while forcing a specific wrong behavior on inputs the attacker cares about. Targeted attacks are the dangerous ones precisely because they hide.

None of this requires distribution; a single mislabeled batch can poison a model trained on one machine. What distribution changes is the defender's visibility and the attacker's reach. In centralized training the operator owns the data and can audit it. In federated learning the contract is the opposite: clients keep their data private and send only model updates, so the server is structurally blind to the inputs. It receives a vector of numbers and must decide whether to trust it, with no way to look behind it at the examples that produced it. That blindness is the first reason distribution sharpens poisoning, and Figure 35.4.1 traces how a single corrupted client turns it into a corrupted global model.

Figure 35.4.1: The poisoning pipeline in a federated round. Honest clients (green) and one malicious client (red) each submit an update; the malicious client crafts a scaled poison update. The server averages all updates with no view of the underlying data and produces a backdoored global model. At inference, a clean input is still classified correctly, so aggregate accuracy looks healthy, while a trigger-stamped input is flipped to the attacker's chosen label. The scaling that makes the red path dominate the average is quantified in Section 4 and demonstrated in Output 35.4.1.

Key Insight: The Server Defends a Vector It Cannot See Behind

Centralized training lets you audit the data; federated training does not. The aggregator receives only model updates and must judge each one's trustworthiness from the vector alone, with the originating examples kept private by design. Every defense in Section 35.5 is shaped by this constraint: it can reason about the geometry of the updates (their norms, their agreement with the majority) but never about the examples that produced them. Poisoning attacks are engineered to look benign in that reduced view.

2. Data Poisoning: Label Flipping and Clean-Label Attacks (Beginner)

The most direct poison is corrupted training data. In a label-flipping attack the adversary keeps the inputs intact but changes their labels, teaching the model that stop signs are speed-limit signs or that spam is legitimate mail. It is crude and effective: flipping a modest fraction of one class can measurably degrade that class's accuracy, and in a federated setting the attacker simply flips labels in its own local dataset, then trains and reports an update that faithfully reflects the corrupted objective. Because the update is an honest computation over dishonest data, it carries none of the obvious anomalies, such as an enormous norm, that a defender might screen for.
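The label-flipping step itself is only a few lines. The following is a minimal sketch, assuming integer class labels held in a NumPy array; the function name `flip_labels`, the class indices, and the 50% flip fraction are illustrative choices, not part of any particular attack from the literature.

```python
import numpy as np

def flip_labels(labels, source, target, fraction, rng):
    """Return a copy of `labels` with a fraction of `source` entries relabeled as `target`."""
    flipped = labels.copy()
    source_idx = np.flatnonzero(labels == source)
    n_flip = int(round(fraction * source_idx.size))
    chosen = rng.choice(source_idx, size=n_flip, replace=False)
    flipped[chosen] = target
    return flipped

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)      # hypothetical local labels, 3 classes
poisoned_labels = flip_labels(labels, source=0, target=1, fraction=0.5, rng=rng)

n_source = int((labels == 0).sum())
n_changed = int((labels != poisoned_labels).sum())
print(f"class-0 examples: {n_source}, relabeled as class 1: {n_changed}")
```

The inputs are never touched, and the client would then run its ordinary local training on `poisoned_labels`, which is why the resulting update has no unusual norm.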

Clean-label attacks are subtler still. Here the attacker does not touch the labels at all, so every poisoned example would pass a human audit as correctly labeled. Instead it perturbs the inputs, often imperceptibly, so that they sit near a target in feature space and drag the decision boundary toward a chosen mistake. Feature-collision and gradient-matching constructions in this family let an attacker plant a targeted misclassification using examples that look entirely legitimate. Clean-label poisoning is the strongest argument that data inspection alone cannot save you: the poison is, by construction, indistinguishable from clean data at the input level, which is the only level a careful curator could examine, and that level is invisible to a federated server anyway.
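The feature-collision idea can be illustrated on a toy problem. The sketch below assumes a hypothetical linear feature extractor $f(x) = Wx$ in place of a real network, so that the usual objective, $\lVert f(x) - f(t) \rVert^2 + \beta \lVert x - b \rVert^2$ for base input $b$ and target $t$, has a closed-form minimizer. Real attacks optimize the same kind of objective through a deep network by gradient descent. All names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
dim_in, dim_feat = 20, 5
W = rng.standard_normal((dim_feat, dim_in))   # hypothetical linear feature map f(x) = W x

base = rng.standard_normal(dim_in)            # correctly labeled input the attacker starts from
target = rng.standard_normal(dim_in)          # input the attacker wants misclassified
beta = 5.0                                    # weight on staying close to the base input

# Minimize ||W x - W target||^2 + beta * ||x - base||^2.
# Setting the gradient to zero gives (W^T W + beta I) x = W^T W target + beta * base.
gram = W.T @ W
poison = np.linalg.solve(gram + beta * np.eye(dim_in), gram @ target + beta * base)

feat_gap_before = np.linalg.norm(W @ base - W @ target)
feat_gap_after = np.linalg.norm(W @ poison - W @ target)
input_shift = np.linalg.norm(poison - base)
input_gap = np.linalg.norm(target - base)

print(f"feature distance to target, base input  : {feat_gap_before:.3f}")
print(f"feature distance to target, poison input: {feat_gap_after:.3f}")
print(f"input-space shift of poison from base   : {input_shift:.3f} (target is {input_gap:.3f} away)")
```

The poison keeps the base example's correct label and stays closer to the base than the target is in input space, yet it moves toward the target in feature space, which is the property the attack relies on.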

The lesson carries directly into distribution. A federated availability attack is often just coordinated label flipping spread across the attacker's clients; a federated targeted attack is often a clean-label perturbation that the attacker trains on locally. The aggregator's blindness means the entire burden of detection shifts from the data to the update, which motivates the model-update view in the next section.

3. Model and Update Poisoning in Federated Learning (Intermediate)

Data poisoning works through the optimizer: the attacker corrupts inputs and lets honest training carry the poison into the update. Model poisoning, also called update poisoning, skips that indirection. Because the malicious client controls the code that produces its update, it can fabricate the update vector directly, with no pretense that any real data produced it. This freedom is unique to the federated setting, where the client is a black box to the server, and it makes update poisoning strictly more powerful than data poisoning: any update that poisoned data could produce can also be fabricated outright, and the attacker is additionally free to optimize the update for its effect on the global model rather than for fidelity to some local objective.

The canonical instance is the model-replacement, or scaling, attack. Recall from Chapter 14 that federated averaging combines $K$ client updates into the global model as a weighted mean. Write the aggregate as

$$w_{\text{global}} = \sum_{k=1}^{K} \alpha_k\, w_k, \qquad \sum_{k=1}^{K} \alpha_k = 1,$$

where $w_k$ is client $k$'s reported model (or update) and $\alpha_k$ is its aggregation weight. With equal weighting, $\alpha_k = 1/K$, every client owns a $1/K$ share of the result. An attacker controlling client $m$ who wants the global model to become a chosen target $w_{\text{target}}$ can solve the averaging equation for its own contribution. Treating the honest clients' weighted sum $\sum_{k \ne m} \alpha_k\, w_k$ as known and fixed, the malicious client submits

$$w_m = \frac{1}{\alpha_m}\Big(w_{\text{target}} - \sum_{k \ne m} \alpha_k\, w_k\Big),$$

which, with equal weights, reduces to $w_m = K\,w_{\text{target}} - \sum_{k \ne m} w_k$. The factor $1/\alpha_m = K$ is the boost: the attacker scales its update up by the number of clients precisely to cancel the dilution that averaging would otherwise apply, so that after the division by $K$ its contribution survives at full strength and overwrites everyone else's. This is why the construction is called model replacement; a single client, given a large enough norm, can replace the global model with one of its choosing. Output 35.4.1 carries out exactly this calculation and confirms the global estimate lands on the attacker's target to numerical precision.
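As a quick check, substituting the crafted $w_m$ back into the averaging rule shows the honest contributions cancel term by term:

$$w_{\text{global}} = \alpha_m\, w_m + \sum_{k \ne m} \alpha_k\, w_k = \Big(w_{\text{target}} - \sum_{k \ne m} \alpha_k\, w_k\Big) + \sum_{k \ne m} \alpha_k\, w_k = w_{\text{target}}.$$

The cancellation is exact only when the attacker knows the honest weighted sum. With an estimate in its place, the global model lands at $w_{\text{target}}$ plus the estimation error.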

Practical Example: The Keyboard That Learned a Secret Phrase

Who: A security engineer auditing a federated next-word-prediction model trained across millions of phone keyboards.

Situation: The model improved steadily and held its accuracy on standard text benchmarks, so the team considered the training pipeline healthy.

Problem: A red-team exercise revealed that typing a specific rare prefix caused the keyboard to suggest a particular attacker-chosen completion, a behavior no legitimate data would have taught.

Dilemma: The misbehavior was invisible to every aggregate metric, because the model was correct on all ordinary text; only the triggering prefix exposed it, and the server had never seen the offending clients' keystrokes.

Decision: They treated it as a targeted model-replacement attack rather than a data-quality bug, and reproduced it by having a handful of simulated clients submit boosted updates of the form in Section 3.

How: A few clients trained locally on the trigger-to-completion pair, then scaled their updates by roughly the inverse of their aggregation weight so the average preserved the implanted association.

Result: The reproduction matched the field behavior exactly, confirming that a tiny minority of clients, each holding only a $1/K$ share of the aggregation weight, had steered the global model while leaving benchmark accuracy untouched.

Lesson: Aggregate accuracy is blind to targeted poisoning by design. A model can be globally excellent and locally compromised at the same time, which is why the norm-based and agreement-based screens of Section 35.5 watch the updates, not the metrics.

4. Quantifying the Leverage of One Update (Intermediate)

The model-replacement formula assumed the attacker knew the honest clients' contributions exactly, which is optimistic. A more robust way to see the danger is to bound how far one update can move the mean using nothing but its magnitude. Suppose the aggregator computes the equal-weighted average of $K$ updates and the honest updates are fixed. If the attacker replaces one honest update with a malicious vector $w_m$, the shift in the global model is

$$\Delta = w_{\text{global}}^{\text{poison}} - w_{\text{global}}^{\text{honest}} = \frac{1}{K}\big(w_m - w_{\text{honest},m}\big),$$

so the global model moves by $1/K$ of the gap the attacker opens between its real and fabricated updates. The leverage of a single client is therefore exactly $1/K$ per unit of injected difference, and it has no upper bound in the direction the attacker chooses unless the attacker's norm is bounded. If the defense caps every update at norm $B$ (a clipping rule that Section 35.5 develops), then the triangle inequality gives $\lVert \Delta \rVert \le \big(\lVert w_m \rVert + \lVert w_{\text{honest},m} \rVert\big)/K \le 2B/K$, and the poison's reach shrinks in inverse proportion to the number of clients. This single inequality is the hinge of the whole defense story: poisoning power is the ratio of attacker norm to client count, so robust aggregation works by bounding the numerator and the honest majority works by growing the denominator.

The demonstration below makes the two regimes concrete on the simplest possible federated task, estimating a scalar mean by averaging one number per client. It shows a clean global estimate, then the shift from a single sign-flipping client (a $1/K$ effect, bounded because the bad value is ordinary in magnitude), and finally the unbounded model-replacement attack that boosts its update by $K$ to land the global estimate exactly on an attacker-chosen target.

import numpy as np

rng = np.random.default_rng(7)
K = 20                          # number of federated clients
true_mean = 5.0                 # the quantity honest clients agree on

# Each honest client reports a noisy local estimate of the true mean.
honest = true_mean + 0.2 * rng.standard_normal(K)

# FedAvg with equal weights: the global estimate is the plain average.
clean_global = honest.mean()

# --- Attack 1: one label-flip / sign-flip client reports a wrong value ---
poisoned = honest.copy()
poisoned[0] = -true_mean        # malicious client pushes the opposite sign
flip_global = poisoned.mean()

# --- Attack 2: model-replacement (scaling) attack ---
# The adversary wants the GLOBAL estimate to equal a target t = 0.0 (a backdoor
# "off" value). To force (sum_others + x)/K = target, the attacker submits
#   x = K*target - sum_others,
# the classic FedAvg "boost by K" model-replacement update.
sum_others = honest[1:].sum()
target = 0.0
malicious_update = K * target - sum_others
scaled = honest.copy()
scaled[0] = malicious_update
replace_global = scaled.mean()

leverage = 1.0 / K              # one equal-weighted client's share of the mean

print(f"clients K                 : {K}")
print(f"honest global estimate    : {clean_global:.4f}")
print(f"single-client leverage    : {leverage:.4f}  (1/K)")
print(f"after one sign-flip client: {flip_global:.4f}  (shift {flip_global-clean_global:+.4f})")
print(f"attacker target value     : {target:.4f}")
print(f"crafted scaled update     : {malicious_update:.4f}")
print(f"after scaling attack      : {replace_global:.4f}  (hit target? {abs(replace_global-target)<1e-9})")

Code 35.4.1: One-step FedAvg mean estimation under two attacks. The sign-flip client moves the average by a bounded $1/K$ amount; the model-replacement client boosts its update by $K$ to overwrite the result entirely, illustrating the leverage bound $\lVert \Delta \rVert \le 2B/K$ and what happens when $B$ is not bounded.

clients K                 : 20
honest global estimate    : 4.9367
single-client leverage    : 0.0500  (1/K)
after one sign-flip client: 4.4367  (shift -0.5000)
attacker target value     : 0.0000
crafted scaled update     : -93.7341
after scaling attack      : 0.0000  (hit target? True)

Output 35.4.1: With $K = 20$, each client owns a $0.05$ share. A single ordinary-magnitude bad value shifts the global estimate by only $0.5$, but one update boosted to $-93.73$ drives the global estimate onto the attacker's target of $0$ exactly. The difference between a nuisance and a takeover is the norm the attacker is allowed to use, which is why bounding it is the first job of the defense.
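The bound $\lVert \Delta \rVert \le 2B/K$ can be checked numerically on vector-valued updates. The following is a minimal sketch, assuming L2-norm clipping applied to every update before averaging; the helper `clip_to_norm` and the sizes `K`, `dim`, and `B` are illustrative choices, and the clipping rule itself is developed properly in Section 35.5.

```python
import numpy as np

def clip_to_norm(update, bound):
    """Scale `update` down so its L2 norm is at most `bound`; leave it unchanged otherwise."""
    norm = np.linalg.norm(update)
    if norm <= bound:
        return update
    return update * (bound / norm)

rng = np.random.default_rng(1)
K, dim, B = 20, 8, 1.0                      # illustrative client count, dimension, bound

honest = [clip_to_norm(0.3 * rng.standard_normal(dim), B) for _ in range(K)]
honest_global = np.mean(honest, axis=0)

# The attacker replaces client 0's update with an enormous vector.
huge = 1e6 * rng.standard_normal(dim)

unclipped = [huge] + honest[1:]
clipped = [clip_to_norm(huge, B)] + honest[1:]

shift_unclipped = np.linalg.norm(np.mean(unclipped, axis=0) - honest_global)
shift_clipped = np.linalg.norm(np.mean(clipped, axis=0) - honest_global)

print(f"bound 2B/K            : {2 * B / K:.4f}")
print(f"shift without clipping: {shift_unclipped:.4f}")
print(f"shift with clipping   : {shift_clipped:.4f}")
```

Clipping does not detect the attacker. It only caps how far any one client, honest or not, can move the mean in a single round.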

Thesis Thread: Averaging Was the Feature; Now It Is the Vulnerability

The exactness of data-parallel averaging was the seed of this entire book: in Chapter 1 we celebrated that summing one vector per worker and dividing reconstructs the true gradient. Federated averaging is that same operation scaled out across untrusted participants, and the property that made it beautiful, that every contribution counts linearly, is exactly the property an attacker exploits. The $1/K$ share that guarantees a fair combine also guarantees a $1/K$ lever for poison. Scale-out does not create the vulnerability so much as inherit it from the primitive it is built on, and the defense in the next section is the price of keeping the primitive safe across a trust boundary.

5. Backdoor and Trojan Attacks (Advanced)

A backdoor, or trojan, attack is the targeted poison taken to its sharpest form. The compromised model behaves normally on essentially all inputs, so it passes validation and deploys without suspicion, but it carries a hidden rule: whenever an input contains a specific trigger (a small pixel patch, a particular word, an inaudible audio tag), the model produces the attacker's chosen output regardless of the true label. The trigger is the key, the wrong output is the lock, and the rest of the model is a perfectly ordinary classifier that gives the backdoor cover. Figure 35.4.1 showed this duality on the right: the clean input keeps its correct label while the trigger-stamped input flips.

Backdoors and federated learning are an unfortunate match. The attacker needs only to train its local model on a mix of clean data and trigger-stamped data labeled with the target, then submit the resulting update, optionally boosted by the scaling trick of Section 3 so the implanted behavior survives averaging. Because the backdoor costs almost nothing in clean accuracy, the malicious update looks statistically similar to an honest one, especially after the attacker constrains its norm to evade detection. A patient adversary can even inject the backdoor slowly across many rounds, each contribution small enough to hide in the variance of honest updates, letting the global model accumulate the trigger over time. The defender's bind is now complete: the data is invisible, the update looks normal, and the misbehavior shows up only on inputs the defender does not know to test.
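The local data preparation described above, a mix of clean examples and trigger-stamped copies labeled with the target, can be sketched as follows. This assumes grayscale images stored as a `(N, H, W)` NumPy array and a square corner patch as the trigger. The helper names, patch size, target label, and poison fraction are all hypothetical, and the random arrays are placeholders for a real local dataset.

```python
import numpy as np

def stamp_trigger(images, patch_size=3, value=1.0):
    """Return a copy of `images` with a bright square patch in the bottom-right corner."""
    stamped = images.copy()
    stamped[:, -patch_size:, -patch_size:] = value
    return stamped

def make_backdoor_batch(images, labels, target_label, poison_fraction, rng):
    """Append trigger-stamped copies, relabeled to `target_label`, to the clean data."""
    n_poison = int(poison_fraction * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned_x = stamp_trigger(images[idx])
    poisoned_y = np.full(n_poison, target_label, dtype=labels.dtype)
    mixed_x = np.concatenate([images, poisoned_x])
    mixed_y = np.concatenate([labels, poisoned_y])
    return mixed_x, mixed_y

rng = np.random.default_rng(2)
images = rng.random((100, 28, 28)) * 0.5    # placeholder grayscale images in [0, 0.5)
labels = rng.integers(0, 10, size=100)      # placeholder labels, 10 classes

mixed_x, mixed_y = make_backdoor_batch(images, labels, target_label=7,
                                       poison_fraction=0.2, rng=rng)
print(mixed_x.shape, mixed_y.shape)
```

Keeping the clean examples in the mix is what preserves accuracy on ordinary inputs, so the model learns the trigger rule in addition to the task rather than instead of it.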

Research Frontier: Durable and Stealthy Federated Backdoors (2024 to 2026)

Federated backdoors remain an active arms race. Durability-focused work studies why naively injected backdoors fade as honest updates wash them out, and constructs attacks that persist long after the malicious clients stop participating, including edge-case and distributed-trigger variants that split the trigger across colluding clients so no single update reveals it. On the stealth side, attackers shape their updates to mimic the norm and direction statistics of honest ones, defeating screens that look only at update geometry, while constrain-and-scale formulations explicitly add an evasion penalty so the poison stays inside the honest cloud. The defensive response runs through certified and robust aggregation and post-hoc trigger reconstruction, but no method certifiably removes every backdoor without assumptions on the attacker fraction. The honest summary is that backdoor robustness in federated learning is unsolved in general, which is why Section 35.5 frames its defenses as raising the attacker's cost rather than closing the door.

6. Sybil Amplification: One Adversary, Many Faces (Advanced)

Every bound so far depended on the attacker controlling a small fraction of the clients, because the leverage of one update is $1/K$ and an honest majority can outvote a single liar. A sybil attack undermines that assumption itself. The leverage analysis assumes the $K$ identities are $K$ distinct participants; in an open federated system, where any device can join, a single adversary can register many fake clients and submit many coordinated updates, manufacturing a majority where the protocol assumed none could exist. With $s$ sybil identities out of $K$ total, the attacker's aggregate weight is $s/K$ rather than $1/K$, and the honest-majority defenses that rely on outvoting the adversary fail once $s$ crosses the threshold those defenses tolerate.

Sybils amplify every attack in this section. They turn a $1/K$ data-poisoning nuisance into an $s/K$ availability attack; they let colluding identities split a backdoor trigger so no single update is suspicious; and they defeat agreement-based filtering by making the poison the majority opinion rather than an outlier. The defense cannot live purely in the aggregation rule, because aggregation reasons about updates and sybils are an identity problem. It needs an admission cost (a proof of work, a stake, a vetted enrollment, or an attested device identity) so that minting a thousand clients is expensive rather than free. This is the point where the security of federated learning reaches outside the math of aggregation and into the systems question of who is allowed to contribute at all, a theme that returns in the privacy and trust machinery of the next section and in the clinical-trust constraints of the federated medical case study (Chapter 37).
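The $s/K$ weight is easy to see on the scalar mean-estimation task. The sketch below assumes an equal-weight mean and sybils that each report the same ordinary-magnitude bad value, with no boosting. The function `global_with_sybils` and the specific counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_honest = 50
true_mean = 5.0
poison_value = -5.0                          # ordinary-magnitude bad value, not boosted

honest = true_mean + 0.2 * rng.standard_normal(n_honest)

def global_with_sybils(honest, n_sybils, poison_value):
    """Equal-weight mean after `n_sybils` fake clients each report `poison_value`."""
    reports = np.concatenate([honest, np.full(n_sybils, poison_value)])
    return reports.mean()

for s in (0, 1, 10, 50):
    total = n_honest + s                     # K counts every registered identity
    estimate = global_with_sybils(honest, s, poison_value)
    print(f"sybils={s:3d}  attacker weight s/K={s / total:.3f}  global estimate={estimate:+.4f}")
```

No single report here would trip a norm screen. The damage comes entirely from the head count, which is why the remedy has to act on identity rather than on the updates.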

Library Shortcut: Simulating Poisoning Without Hand-Rolling FedAvg

Code 35.4.1 built the averaging and the attack by hand to expose the arithmetic. To study these attacks on real models you do not reimplement federated training; simulation frameworks let you inject malicious clients into a standard FedAvg loop in a few lines. In Flower (flwr), an attacker is just a custom NumPyClient whose fit returns a scaled or label-flipped update, dropped into the same start_simulation harness as the honest clients:

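# Install with: pip install "flwr[simulation]"  (the simulation engine is an optional extra)
import flwr as fl
import numpy as np

NUM_CLIENTS = 20                                  # K, a value chosen for this sketch

class MaliciousClient(fl.client.NumPyClient):
    def fit(self, parameters, config):
        # The boost 1/alpha_m equals K only if every client has equal aggregation weight.
        boost = config["num_clients"]
        target = [np.zeros_like(p) for p in parameters]   # attacker's chosen model
        # Model replacement: global + boost * (target - global).
        poisoned = [boost * (t - p) + p for t, p in zip(target, parameters)]
        # FedAvg weights each client by its reported example count; reporting 1 matches
        # the boost above only if honest clients also report 1 (otherwise use 1/alpha_m).
        return poisoned, 1, {}

# Stock FedAvg sends an empty config, so the server must pass K to clients explicitly.
strategy = fl.server.strategy.FedAvg(             # the rule under attack
    on_fit_config_fn=lambda server_round: {"num_clients": NUM_CLIENTS},
)
# Swapping in MaliciousClient for a fraction of client ids reproduces the effect
# of Output 35.4.1 on a real neural network.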

Code 35.4.2: A model-replacement client in Flower. The roughly thirty lines of manual averaging and attack bookkeeping behind Output 35.4.1 collapse to a custom fit method, while the framework handles client scheduling, parameter serialization, and the FedAvg aggregation that the attack targets. The same harness is where the robust strategies of Section 35.5 are swapped in to measure their resistance.

Fun Note: The Voting Booth With No ID Check

A sybil attack is the oldest trick in democracy, ballot-box stuffing, wearing a numerical disguise. Federated averaging is a town hall that counts every voice equally and never asks for identification at the door. The honest townsfolk assume one body equals one vote; the adversary walks in wearing a thousand coats. Every serious defense eventually rediscovers what real elections learned centuries ago: the hard part is not counting the votes, it is deciding who gets to cast one.

7. The Picture the Next Section Inherits (Intermediate)

We can now state precisely what makes a defense necessary. Federated training combines two facts that, together, hand an attacker a usable weapon. The first is invisibility: the server sees only updates, so neither label flipping nor clean-label poisoning nor a backdoor trigger can be caught by inspecting data, because there is no data to inspect. The second is leverage: the averaging rule gives each client a $1/K$ lever on the global model, unbounded in magnitude unless something bounds it, and sybils let an adversary multiply that lever by minting identities. Availability attacks exploit the leverage to wreck accuracy; targeted and backdoor attacks exploit the invisibility to hide; sybil attacks undermine the count that the $1/K$ bound rests on.

Each of these has a corresponding defensive move, and they map onto the same two facts. Against unbounded leverage, bound the update: clip every contribution to a norm $B$ so that no single client can move the mean by more than $2B/K$. Against a poisoned minority of values, replace the mean with a robust aggregator that ignores outliers (the coordinate-wise median, trimmed mean, Krum, and their relatives), so that an honest majority survives a minority of liars. Against sybils, charge for identity so the majority cannot be manufactured. This is the Byzantine-robust aggregation story, the transformation of the fault-tolerance arc that ran from recovery in Chapter 2 through elastic training in Chapter 18 into outright adversarial robustness here. Section 35.5 builds those aggregators, derives the fraction of malicious clients each can tolerate, and measures them against the very attacks this section constructed.
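As a preview of the second move, the sketch below compares the plain mean with the coordinate-wise median when a single client submits a boosted update. It is a toy illustration under assumed values (21 clients, honest updates clustered near 1.0, one malicious update of -1000 in every coordinate), not the analysis of Section 35.5.

```python
import numpy as np

rng = np.random.default_rng(4)
K, dim = 21, 4
honest = 1.0 + 0.1 * rng.standard_normal((K, dim))   # honest updates cluster near 1.0

updates = honest.copy()
updates[0] = -1000.0                                  # one boosted malicious update

mean_agg = updates.mean(axis=0)                       # dragged far from the honest cluster
median_agg = np.median(updates, axis=0)               # stays among the honest values

print("mean  :", np.round(mean_agg, 3))
print("median:", np.round(median_agg, 3))
```

With one outlier among 21 values, the median in each coordinate is still one of the honest entries, whereas the mean absorbs the full $1/K$ share of the boosted value.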

Exercise 35.4.1: Availability versus Targeted, and Who Sees Them (Conceptual)

For each scenario, classify the attack as availability or targeted, state whether a defender watching only aggregate validation accuracy would notice, and explain why: (a) one third of the clients flip every label in their local data; (b) three clients implant a pixel-patch backdoor that flips patched stop signs to speed-limit signs while leaving all other accuracy intact; (c) a single client submits a gradient of enormous norm in a random direction every round. For the case the defender would miss, name the property of federated learning from Section 1 that hides it.

Exercise 35.4.2: Boost Factor Under Unequal Weights (Coding)

Modify Code 35.4.1 so the aggregation is weighted by client dataset size rather than equal, with weights $\alpha_k$ proportional to a vector of example counts you choose (make the attacker's client small). Recompute the model-replacement update the attacker must send to hit the target, using the general formula $w_m = \frac{1}{\alpha_m}(w_{\text{target}} - \sum_{k \ne m} \alpha_k w_k)$, and verify it lands on the target. Then report the norm of the attacker's update as its weight $\alpha_m$ shrinks, and explain why a small client must shout louder, and what that implies for a norm-clipping defense.

Exercise 35.4.3: How Many Sybils Defeat the Median? (Analysis)

The coordinate-wise median of the next section stays within the range of the honest values as long as corrupted values are a strict minority, but the adversary can move it once corrupted values are at least half the inputs. Suppose there are $K = 100$ honest clients and an adversary that can mint sybil identities for free. Derive the smallest number of sybils $s$ that lets the adversary control the median of each coordinate, expressing the threshold in terms of the total client count $K + s$. Then argue why no robust aggregation rule alone can fix this, and which class of defense from Section 6 must be added. Connect your answer to the leverage bound $\lVert \Delta \rVert \le 2B/K$: what does growing the sybil count do to the effective $K$ that protects honest clients?