"A gradient, audited before it is allowed to leave the building. I was sure I carried only slopes; the privacy officer showed me whose biopsy I had memorized."
A Gradient Held at the Hospital Firewall
In federated medical AI, privacy is not a feature added at the end; it is the binding constraint that forces the architecture to be federated in the first place and that caps how good the model is allowed to become. Patient records cannot be pooled into one training set because law and ethics forbid the raw data from leaving the hospital that holds it. So the data stays home and only model updates travel, which is exactly federated learning. But a model update is not innocent: a gradient computed on a handful of patients can be inverted back toward an image, and the presence of one patient with a rare diagnosis can be detected from the update alone. This section states the legal frame the system must satisfy, names the leakage threats precisely, and assembles the defense stack (differential privacy, secure aggregation, and the limits of de-identification) that the rest of the chapter implements. The recurring tension is concrete: every unit of privacy you buy with noise is paid for in model accuracy, and a clinical model that must be safe cannot afford unlimited noise.
The previous sections fixed the clinical task and the federation topology: a diagnostic model trained across many hospitals that never share patient records. This section answers the question those sections assumed away, namely why the records cannot simply be centralized. The answer is not bandwidth and not convenience; it is that centralizing the raw data is unlawful, and that even the updates the federation does exchange can leak the very records the law protects. We build the privacy treatment by specializing two results the reader already has. Differential privacy and its accounting were developed for distributed learning in Section 35.6, and the on-device privacy posture of edge learning was treated in Section 34.9. We do not re-derive those mechanisms; we apply them to the clinical setting, where the stakes and the regulatory surface are unusually sharp. The federated-learning machinery itself (FedAvg, the round structure, secure aggregation) comes from Chapter 14.
1. The Legal Frame the System Must Satisfy Beginner
The constraints begin with statute, not with algorithms. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) governs protected health information and, through its Privacy and Security Rules, makes a hospital legally accountable for any disclosure of identifiable patient data; its Safe Harbor method treats records as de-identified, and therefore shareable without patient authorization, only after eighteen categories of identifiers (names, dates finer than a year, ZIP codes, device identifiers, and so on) have been removed. In the European Union, the General Data Protection Regulation (GDPR) treats health data as a special category requiring an explicit lawful basis, grants the patient rights of access and erasure, and constrains automated decision-making. Layered on top is data residency: many jurisdictions require that records describing their residents physically remain within national borders, so a single global training set is not merely discouraged but illegal to assemble. Finally, any research use of the data must clear an Institutional Review Board (IRB), whose approval is scoped to a specific protocol and does not generalize to whatever a future model might want.
These four pressures (HIPAA, GDPR, residency, IRB) point the same direction. None of them forbids learning from the data; they forbid moving and concentrating it. Federated learning is the architecture that satisfies the letter of all four at once, because the data never crosses the hospital boundary and only a model object does. That alignment is why this case study is federated: the legal frame, read carefully, is itself an argument for scale-out across institutions rather than scale-up on one pooled corpus. Table 37.3.1 maps each legal pressure to the specific design choice it forces.
| Legal pressure | What it forbids | Design choice it forces |
|---|---|---|
| HIPAA (US) | Disclosing identifiable records without authorization | Raw records stay on hospital infrastructure; only updates leave |
| GDPR (EU) | Processing health data without a lawful basis; blocking erasure | Per-site control, auditable consent, right-to-erasure handling |
| Data residency | Moving residents' records across borders | Training compute runs in-region; no global data lake |
| IRB approval | Use beyond the approved protocol | Scope the federation to the approved task and cohort |
2. The Threat: Model Updates Leak Patients Intermediate
It is tempting to believe the problem is solved once the raw data stays home. It is not, because the object that does travel, the model update, carries information about the records it was computed from. Two attacks make this precise. The first is gradient inversion: given the gradient a client sent and knowledge of the model architecture, an adversary can solve an optimization problem that reconstructs an approximation of the input batch, and for small batches of medical images the reconstruction can be recognizably a particular patient's scan. The second is membership inference: an adversary who can query or observe the model asks not "what did this patient look like?" but the subtler "was this specific patient in the training set?", and a positive answer is itself a disclosure. Membership inference is especially dangerous for a rare diagnosis, because a single carrier of a rare condition moves the model update in a distinctive direction that a sufficiently large cohort would have averaged away.
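The simplest case of this leak can be shown without any optimization at all. The sketch below is not the iterative gradient-inversion attack described above; it is a minimal illustration, on synthetic data, of why a gradient computed on a single example is dangerous. It assumes a logistic-regression model with a bias term and a batch of one, in which case the weight gradient is the input scaled by the bias gradient, so the input falls out by division. Deep networks and larger batches require the optimization-based attack and yield only approximations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one patient's scan: a flattened 8x8 image.
x = rng.random(64)
y = 1.0  # synthetic binary label

# Logistic regression: one linear layer with a bias and a sigmoid output.
w = rng.normal(0.0, 0.1, size=64)
b = 0.0

def gradients(w, b, x, y):
    """Gradient of binary cross-entropy for a batch of ONE example."""
    z = w @ x + b
    p = 1.0 / (1.0 + np.exp(-z))
    err = p - y              # dLoss/dz
    return err * x, err      # dLoss/dw, dLoss/db

grad_w, grad_b = gradients(w, b, x, y)

# The observer sees only (grad_w, grad_b). Since grad_w = err * x and
# grad_b = err, dividing one by the other recovers the input.
x_reconstructed = grad_w / grad_b

print("max reconstruction error:", np.max(np.abs(x_reconstructed - x)))
```

The reconstruction is exact up to floating-point error here only because the batch holds one example; averaging over a larger batch blurs it, which is why small batches of medical images are the dangerous regime.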
The threat is structural, not incidental. A gradient is a function of the data; any function of the data is, by construction, a channel through which the data can leak. The defense cannot be to hope the channel is narrow; it must be to bound, with a number, how much any single patient can change what leaves the building. That number is the sensitivity, and bounding it is the bridge from "we send updates instead of data" to a defensible privacy guarantee.
Federation removes the easy disclosure (no raw record crosses the boundary) but leaves a harder one (the update is a function of the records, so it leaks). The legal frame is satisfied by federation; the residual leakage is addressed only by a quantitative guarantee on the update itself. This is why differential privacy and secure aggregation are not optional add-ons in the medical setting: without them, a federated system is still disclosing patients, just in a form regulators and attackers can both decode.
3. The Defense Stack: Differential Privacy Intermediate
Differential privacy (DP) gives the quantitative guarantee the threat demands: a randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if for any two datasets $D$ and $D'$ differing in one unit and any output set $S$,
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta.$$
The full development of this definition, the Gaussian mechanism, and the accountant that tracks it across steps is in Section 35.6; here we specialize the "one unit" to the clinical setting, where it carries a decision with real consequences. Under record-level DP the protected unit is one record (one visit, one scan), and neighboring datasets differ by a single record. Under patient-level DP the protected unit is one patient, who may contribute many records (a longitudinal history of visits), and neighboring datasets differ by every record belonging to that patient. Patient-level is the guarantee a regulator actually wants, because the entity with rights and a rare diagnosis is the patient, not the visit. It is also strictly harder: if a patient contributes up to $m$ records, the sensitivity of a sum-style query is multiplied by $m$, so the same protection costs more noise.
Concretely, the Gaussian mechanism adds noise scaled to the sensitivity. When each per-record gradient is clipped so that its $L_2$ norm is at most $C$ and one patient contributes at most $m$ records, the patient-level sensitivity of the summed update over a round is
$$\Delta_2 = \max_{D \sim D'} \lVert g(D) - g(D') \rVert_2 = m\, C,$$
where $m$ bounds the records one patient contributes and $C$ is the clipping bound. To achieve $(\varepsilon, \delta)$-DP for one release, the Gaussian mechanism sets the noise standard deviation
$$\sigma = \frac{\Delta_2}{\varepsilon}\sqrt{2 \ln\!\frac{1.25}{\delta}},$$
so smaller $\varepsilon$ (more privacy) and larger $m$ (sicker, more-visited patients) both demand more noise. (This classical calibration is proved for $\varepsilon < 1$; for larger $\varepsilon$ it is not guaranteed to suffice, and an accountant or the analytic Gaussian mechanism supplies the exact noise.) Training is not one release but $T$ rounds, and each round spends privacy. Naive sequential composition is pessimistic, spending $T\varepsilon$ after $T$ rounds; the tight accounting used in practice (moments accountant, Rényi DP) spends roughly $\varepsilon\sqrt{T}$ for the same per-round noise, a difference that decides whether a long training run is affordable at all:
$$\varepsilon_{\text{basic}}(T) = T\,\varepsilon_{\text{round}}, \qquad \varepsilon_{\text{tight}}(T) \approx \varepsilon_{\text{round}}\sqrt{T}.$$
The total $\varepsilon$ spent across all rounds is the privacy budget, a quantity fixed before training and consumed monotonically; when it is exhausted, training must stop, whatever the model's accuracy. This makes the budget a first-class hyperparameter of a clinical training run, on par with the learning rate.
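To make the sensitivity relation $\Delta_2 = mC$ tangible, the following sketch checks it numerically on synthetic gradients and then evaluates the $\sigma$ formula for a record-level and a patient-level unit. It assumes per-record clipping to $C$ and at most $m$ records per patient; the helper names (`clip_l2`, `gaussian_sigma`, `clipped_sum`) and all data are illustrative, not part of any library.

```python
import numpy as np

def clip_l2(v, C):
    """Scale v so its L2 norm is at most C (unchanged if already within C)."""
    norm = np.linalg.norm(v)
    return v * min(1.0, C / norm) if norm > 0 else v

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (the sigma formula above)."""
    return sensitivity / eps * np.sqrt(2.0 * np.log(1.25 / delta))

rng = np.random.default_rng(0)
C, m, dim = 1.0, 12, 5   # clip bound, max records per patient, update dimension

# Synthetic per-record gradients for 50 patients, each with 1..m records.
patients = [rng.normal(size=(rng.integers(1, m + 1), dim)) for _ in range(50)]

def clipped_sum(patient_list):
    """Sum of per-record gradients, each clipped to norm C."""
    total = np.zeros(dim)
    for records in patient_list:
        for g in records:
            total += clip_l2(g, C)
    return total

full = clipped_sum(patients)

# Largest change observed when ONE patient (all of their records) is removed.
worst_change = max(
    np.linalg.norm(full - clipped_sum(patients[:i] + patients[i + 1:]))
    for i in range(len(patients))
)

eps, delta = 0.5, 1e-5
sigma_record = gaussian_sigma(C, eps, delta)        # protected unit: one record
sigma_patient = gaussian_sigma(m * C, eps, delta)   # protected unit: one patient

print(f"worst observed change = {worst_change:.3f}  (bound m*C = {m * C:.1f})")
print(f"sigma record-level  = {sigma_record:.3f}")
print(f"sigma patient-level = {sigma_patient:.3f}  ({sigma_patient / sigma_record:.0f}x)")
```

The observed change sits well below $mC$ because random gradients rarely align; the bound is a worst case, and the noise must be sized to the worst case regardless.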
The clipping, Gaussian noising, and $\sqrt{T}$-style accounting above are tedious and easy to get subtly wrong (a wrong clip-before-or-after-average ordering silently voids the guarantee). Opacus wraps a PyTorch optimizer so per-sample gradients are clipped to $C$ and noised to $\sigma$ automatically, and exposes the spent budget through its accountant. The same machinery that took a page of math becomes a wrapper plus one query:
from opacus import PrivacyEngine

# Assumes model, optimizer, data_loader, and T are already defined by the
# ordinary (non-private) training setup.
privacy_engine = PrivacyEngine()

# Attach DP to an ordinary training setup; the engine clips per-sample grads
# to max_grad_norm (the C above) and adds Gaussian noise sized to the target.
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=data_loader,
    target_epsilon=2.0, target_delta=1e-5, epochs=T,
    max_grad_norm=1.0,  # the L2 clipping bound C
)

# ... run the T training rounds as usual ...

print("spent epsilon:", privacy_engine.get_epsilon(delta=1e-5))  # the accountant
make_private_with_epsilon picks the noise multiplier that hits a target $\varepsilon$, and get_epsilon reports the budget consumed. To make the guarantee patient-level rather than record-level, the data loader must group all of a patient's records into one logical unit before clipping, which Opacus does not infer for you.
4. The Defense Stack: Secure Aggregation and De-identification Limits Intermediate
Differential privacy bounds what the final model leaks, but it does not by itself hide each hospital's individual update from the server during training, and a server that sees one hospital's raw update could mount gradient inversion on that hospital alone. Secure aggregation closes this gap. Using cryptographic masks that cancel when the masked updates are summed, it lets the server learn only the aggregate $\sum_k g_k$ across clients and never any single $g_k$, so the inversion target (one site's gradient) is never exposed. This is the privacy primitive introduced for federation in Chapter 14 and developed as a protocol in Section 35.6; this chapter implements it for our hospital federation in Section 37.6. The two defenses compose: secure aggregation hides the per-site update in transit, and differential privacy bounds what the released aggregate reveals, so the pair covers both the server-side and the output-side threat. Figure 37.3.1 shows where each defense sits along the path an update travels.
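The cancelling-mask idea can be sketched in a few lines. This is a toy under strong assumptions, not the protocol of Section 37.6: the pairwise masks are drawn from one shared random generator rather than derived by key agreement between sites, the arithmetic is real-valued rather than over a finite field (so the masks here do not give cryptographic hiding), and no site drops out. It shows only the algebra: each pair of sites shares a mask that one adds and the other subtracts, so every mask vanishes in the sum.

```python
import numpy as np

rng = np.random.default_rng(0)
num_sites, dim = 4, 6

# Synthetic per-site updates g_k (what each hospital would otherwise send in the clear).
updates = [rng.normal(size=dim) for _ in range(num_sites)]

# One shared mask per pair (i, j) with i < j. A real protocol would derive it
# from a key exchange between the two sites; the shared RNG is a stand-in.
pair_masks = {
    (i, j): rng.normal(scale=100.0, size=dim)
    for i in range(num_sites)
    for j in range(i + 1, num_sites)
}

def masked_update(k):
    """Site k adds the mask for each pair where it is the lower index and subtracts otherwise."""
    out = updates[k].copy()
    for (i, j), mask in pair_masks.items():
        if k == i:
            out += mask
        elif k == j:
            out -= mask
    return out

masked = [masked_update(k) for k in range(num_sites)]

server_sum = np.sum(masked, axis=0)   # all the server ever computes
true_sum = np.sum(updates, axis=0)    # shown only to check correctness

print("aggregate recovered:", np.allclose(server_sum, true_sum))
print("site 0 update visible in its masked form:", np.allclose(masked[0], updates[0]))
```

The server recovers the aggregate while each individual masked vector is dominated by mask values, which is the property that removes the single-site inversion target.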
One defense the medical setting is repeatedly tempted to over-trust is de-identification. Stripping the eighteen HIPAA identifiers is necessary and is the legal floor, but it is not a privacy guarantee in the DP sense, because re-identification through linkage (matching a "de-identified" record against an external dataset using quasi-identifiers like age, rare diagnosis, and admission week) has been demonstrated repeatedly. De-identification reduces the easy linkage; it does not bound the information a model update carries, which is precisely what DP does. The correct reading is that de-identification, DP, and secure aggregation are complementary layers: the first satisfies a legal checklist, the second bounds output leakage with a number, the third hides the transit. A medical federation needs all three, and treating any one as the whole answer is the most common privacy error in the field.
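A linkage attack of the kind described above needs nothing more than a join on quasi-identifiers. The sketch below uses entirely synthetic records and hypothetical field names (`age`, `zip3`, `admit_year`); it assumes the attacker holds an external list with names and the same quasi-identifiers. A person is re-identified whenever their quasi-identifier combination matches exactly one "de-identified" record.

```python
from collections import Counter

# Synthetic "de-identified" records: direct identifiers are gone, but
# quasi-identifiers remain. All fields and values are hypothetical.
deidentified = [
    {"age": 34, "zip3": "021", "admit_year": 2024, "diagnosis": "asthma"},
    {"age": 34, "zip3": "021", "admit_year": 2024, "diagnosis": "migraine"},
    {"age": 61, "zip3": "100", "admit_year": 2024, "diagnosis": "type 2 diabetes"},
    {"age": 61, "zip3": "100", "admit_year": 2024, "diagnosis": "hypertension"},
    {"age": 47, "zip3": "945", "admit_year": 2024, "diagnosis": "rare-condition-X"},
]

# Hypothetical external dataset held by an attacker (names plus quasi-identifiers).
external = [
    {"name": "Person A", "age": 47, "zip3": "945", "admit_year": 2024},
    {"name": "Person B", "age": 34, "zip3": "021", "admit_year": 2024},
]

QUASI_IDENTIFIERS = ("age", "zip3", "admit_year")

def quasi_key(record):
    """Tuple of quasi-identifier values used as the join key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

# How many de-identified records share each quasi-identifier combination.
group_sizes = Counter(quasi_key(r) for r in deidentified)
smallest_group = min(group_sizes.values())

# Link each external person to a diagnosis when exactly one record matches.
linked = {}
for person in external:
    matches = [r for r in deidentified if quasi_key(r) == quasi_key(person)]
    if len(matches) == 1:
        linked[person["name"]] = matches[0]["diagnosis"]

print("smallest quasi-identifier group:", smallest_group)
print("uniquely re-identified:", linked)
```

In this toy data the only person linked is the one with the rare condition, because theirs is the only combination held by a single record; the two people who share a combination are shielded by each other.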
5. The Privacy/Utility Tension for Clinical Models Advanced
The defenses above are not free, and in the clinical setting their cost is sharp enough to change whether a model may be deployed. Every unit of privacy bought by shrinking $\varepsilon$ is paid for in noise, and noise degrades the model. For a model that ranks cat photos this is a mild accuracy tax; for a model that must decide whether a rate of adverse events has crossed a safety threshold, too much noise does not merely lower an accuracy score, it makes the safety-critical decision unreliable. The tension is therefore not "privacy versus a metric" but "privacy versus whether the system is safe to use", which is a higher bar.
The demonstration below makes the tension concrete on a single clinical statistic: the per-site rate of a rare adverse drug event, released under patient-level DP, where a regulator flags the drug if the true rate exceeds one percent. It sweeps the privacy budget $\varepsilon$ and reports both the relative error injected by the noise and, more importantly, the probability that the noisy release lands on the correct side of the safety threshold.
import numpy as np
# A safety-critical clinical statistic computed per site: the rate of a rare
# adverse drug event among treated patients. We release it under patient-level
# differential privacy (each patient contributes at most ONE record here, so
# patient-level and record-level sensitivity coincide for this query).
rng = np.random.default_rng(7)
n = 4000 # patients at this site
true_rate = 0.012 # 1.2% true adverse-event rate (rare)
events = rng.random(n) < true_rate
p_hat = events.mean() # the non-private per-site rate
# Counting query "number of events" has sensitivity 1 (one patient flips it by 1);
# the released rate p = count / n therefore has L2 sensitivity Delta = 1/n.
delta_sens = 1.0 / n
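# Gaussian mechanism: to satisfy (eps, delta_dp)-DP for one release, sigma scales
# as Delta * sqrt(2 ln(1.25/delta_dp)) / eps. This classical calibration is proved
# only for eps < 1; the eps >= 1 rows of the sweep use it as an illustrative scaling.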
delta_dp = 1e-5
c = np.sqrt(2.0 * np.log(1.25 / delta_dp))
# Safety threshold: a regulator flags the drug if the rate exceeds 1.0%.
threshold = 0.010
trials = 20000
print(f"non-private per-site rate p_hat = {p_hat*100:.3f}% (true {true_rate*100:.1f}%)")
print(f"records n = {n}, query sensitivity Delta = 1/n = {delta_sens:.2e}")
print()
print(f"{'epsilon':>8} | {'sigma':>10} | {'noise sd (pp)':>13} | {'rel err':>8} | {'P(correct flag)':>15}")
print("-" * 70)
for eps in [0.1, 0.5, 1.0, 2.0, 4.0, 8.0]:
    sigma = delta_sens * c / eps
    noisy = p_hat + rng.normal(0.0, sigma, size=trials)
    rel_err = np.mean(np.abs(noisy - p_hat)) / p_hat
    # safety decision: does the noisy release land on the correct side of 1.0%?
    correct = np.mean((noisy > threshold) == (p_hat > threshold))
    print(f"{eps:>8.1f} | {sigma:>10.2e} | {sigma*100:>13.3f} | {rel_err:>7.1%} | {correct:>14.1%}")
print()
# Privacy budget across T training rounds: basic (sequential) composition spends
# T*eps. Tight Gaussian/RDP accounting spends ~ eps*sqrt(T) for the same noise.
eps_round = 1.0
for T in [1, 10, 50, 100]:
    basic = T * eps_round
    rdp_like = eps_round * np.sqrt(T)
    print(f"T={T:>4} rounds at eps/round={eps_round}: basic composition eps={basic:>5.1f} "
          f"advanced (~sqrt) eps={rdp_like:>4.1f}")
non-private per-site rate p_hat = 1.100% (true 1.2%)
records n = 4000, query sensitivity Delta = 1/n = 2.50e-04
epsilon | sigma | noise sd (pp) | rel err | P(correct flag)
----------------------------------------------------------------------
0.1 | 1.21e-02 | 1.211 | 87.2% | 53.2%
0.5 | 2.42e-03 | 0.242 | 17.6% | 65.4%
1.0 | 1.21e-03 | 0.121 | 8.8% | 79.8%
2.0 | 6.06e-04 | 0.061 | 4.4% | 94.8%
4.0 | 3.03e-04 | 0.030 | 2.2% | 100.0%
8.0 | 1.51e-04 | 0.015 | 1.1% | 100.0%
T= 1 rounds at eps/round=1.0: basic composition eps= 1.0 advanced (~sqrt) eps= 1.0
T= 10 rounds at eps/round=1.0: basic composition eps= 10.0 advanced (~sqrt) eps= 3.2
T= 50 rounds at eps/round=1.0: basic composition eps= 50.0 advanced (~sqrt) eps= 7.1
T= 100 rounds at eps/round=1.0: basic composition eps=100.0 advanced (~sqrt) eps=10.0
The lesson is that the privacy budget is not a slider you can push to the safe-looking extreme. Push $\varepsilon$ too low and the model, or a statistic derived from it, becomes too noisy to support a clinical decision that must be safe, which is its own kind of harm. The defensible operating point is found by measurement, choosing the largest privacy (smallest $\varepsilon$) at which the downstream clinical decision still meets its required reliability, and then spending that budget across rounds with the tight accountant so the run can be long enough to converge. This is the binding constraint promised at the top of the section made quantitative: privacy bounds model quality, and the bound is tight enough in medicine that it shapes the whole design.
This chapter advances the book's spine from an unusual direction. Elsewhere, distribution is forced by a resource ceiling (data, model, or throughput too big for one machine). Here it is forced by a legal and ethical ceiling: the data may not be pooled at all, so the only lawful way to learn across institutions is to scale out across them and exchange protected updates. Privacy is thus both the reason the system is distributed and the constraint that bounds how good the distributed model may become. The same federation that Chapter 14 built for statistical and engineering reasons is here mandated by law, and the privacy budget joins communication cost and fault tolerance as a tax that scale-out must pay.
Who: A machine learning lead building a sepsis-onset predictor across six hospital systems under an IRB-approved federated protocol.
Situation: The protocol fixed a total patient-level privacy budget of $\varepsilon = 8$ at $\delta = 10^{-5}$, agreed with the privacy officers before any data was touched.
Problem: The team's first plan spent budget with naive sequential composition, which at one unit per round exhausted $\varepsilon = 8$ after only eight rounds, far too few for the model to converge.
Dilemma: Raise $\varepsilon$ and break the agreement with the privacy officers, or keep the budget and ship an undertrained, possibly unsafe model.
Decision: Neither; they switched to a Rényi-DP accountant, whose roughly $\varepsilon\sqrt{T}$ spending let the same budget cover about sixty rounds instead of eight.
How: They wrapped the per-site optimizer with Opacus as in Code 37.3.1, grouped each patient's records into one DP unit for patient-level protection, and tracked the spent $\varepsilon$ after every round, stopping exactly when the budget hit $8$.
Result: The model converged within budget, the alerting threshold met its required reliability (the analogue of the $\varepsilon = 4$ row of Output 37.3.2), and the privacy officers signed off because the accounted budget matched the agreement.
Lesson: The accountant is not bookkeeping; choosing tight composition over naive composition is often the difference between a trainable model and an untrainable one under a fixed clinical budget.
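The round counts in this case study follow from inverting the two composition rules of Section 3. The arithmetic below is a back-of-envelope sketch using the $\varepsilon\sqrt{T}$ heuristic, not the output of a real accountant, which would report somewhat different numbers; the function names are illustrative.

```python
import math

def rounds_basic(eps_total, eps_round):
    """Rounds affordable under basic composition: T * eps_round <= eps_total."""
    return math.floor(eps_total / eps_round)

def rounds_sqrt_heuristic(eps_total, eps_round):
    """Rounds affordable under the heuristic eps_round * sqrt(T) <= eps_total."""
    return math.floor((eps_total / eps_round) ** 2)

eps_total, eps_round = 8.0, 1.0
print("basic composition:", rounds_basic(eps_total, eps_round), "rounds")
print("sqrt-style accounting:", rounds_sqrt_heuristic(eps_total, eps_round), "rounds")
```

With a budget of 8 and one unit per round, the heuristic gives 64 rounds against 8, consistent with the "about sixty" the team obtained.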
Three threads are narrowing the privacy/utility gap that Output 37.3.2 dramatizes. First, accounting has moved from the moments accountant toward numerically tight privacy-loss-distribution methods (the lineage of Gopi et al. and the connect-the-dots accountant) that squeeze more usable utility from the same $\varepsilon$, which in medicine directly buys more training rounds. Second, privacy auditing turns the abstract $\varepsilon$ into an empirically tested claim: membership-inference and canary-based audits (Nasr, Steinke, and collaborators, 2023 to 2025) estimate the privacy actually delivered and have repeatedly found implementations that promised more than they gave. Third, the community is converging on patient-level (user-level) DP as the standard for health data rather than the easier record-level guarantee, with federated training recipes and DP-aware foundation-model fine-tuning built to honor it. We meet the clinical-safety side of these guarantees again in Section 37.8, where the question shifts from "is it private?" to "is it safe to act on?".
A patient in a chronic-care cohort contributes up to $m = 12$ visits, each a record. The team currently reports a record-level $(\varepsilon, \delta)$ guarantee and wants to upgrade to patient-level at the same $\varepsilon$. Using the sensitivity relation $\Delta_2 = mC$ from Section 3, explain what happens to the required noise standard deviation $\sigma$, and therefore to model utility, when the protected unit changes from a record to a patient. Then argue why a regulator concerned with a rare diagnosis insists on the patient-level guarantee despite its higher utility cost.
Extend Code 37.3.2 so that, instead of sweeping a fixed list of $\varepsilon$ values, it searches for the smallest $\varepsilon$ (strongest privacy) at which the probability of a correct safety flag is at least 95 percent. Then vary the cohort size $n$ over $\{1000, 4000, 16000\}$ and report how the minimal safe $\varepsilon$ changes. Explain the relationship you observe between site size and the privacy you can afford, and connect it to why small rural hospitals are the hardest clients to protect well in a federation.
A federated training run has a fixed total patient-level budget of $\varepsilon_{\text{total}} = 6$ at $\delta = 10^{-5}$ and needs at least $40$ rounds to converge. Using the contrast $\varepsilon_{\text{basic}}(T) = T\varepsilon_{\text{round}}$ versus $\varepsilon_{\text{tight}}(T) \approx \varepsilon_{\text{round}}\sqrt{T}$ from Section 3, compute the per-round budget $\varepsilon_{\text{round}}$ each accounting method would permit to reach $40$ rounds within $\varepsilon_{\text{total}}$, and state whether convergence in $40$ rounds is feasible under each. Then explain, referencing the secure-aggregation step of Section 37.6, why the choice of accountant is an architectural decision and not merely a reporting convenience.