Part VIII: Case Studies and Capstone Projects
Chapter 37: Federated Medical AI

Safety and Responsibility

"They asked me to predict whether the patient would deteriorate overnight. I gave a number. What I could not give them, until they made me, was how much to trust it, and for whom it was wrong."

A Prediction That a Clinician Will Either Trust or Override
Big Picture

A federated clinical model is only as safe as its weakest participating hospital, and responsibility scales the same way the model does: across institutions and across the patient populations they serve. The prior sections of this chapter built a model that trains across hospitals without moving patient records (Section 37.4), survives non-identical site distributions (Section 37.5), aggregates privately (Section 37.6), and is watched for drift (Section 37.7). This section adds the layer that decides whether the model is allowed to touch a patient at all. It asks five questions a regulator and a clinician will ask: is the model a cleared medical device that was validated prospectively; does it perform as well for a minority hospital's population as for the majority; is its risk score calibrated, and does it know when to abstain; can it survive a single compromised or malicious site; and can anyone audit a decision whose training data was never centralized. Each answer is harder in a federation than on one machine, because the failure of one site is a failure of the whole.

Every earlier section in this chapter treated the federation as an engineering artifact: a thing to be trained, aggregated, and monitored. A deployed clinical model is also a regulated artifact and a moral one. The same property that made federation attractive, that no hospital surrenders its raw records, removes the comfortable assumption that a central team can inspect every training example, recompute every metric on a held-out pool, and trace any prediction back to the data that produced it. Responsibility does not disappear when the data stays put; it becomes distributed, and it has to be engineered as deliberately as the gradient all-reduce. This section applies the responsibility and governance machinery of Section 35.8 and Section 35.7 to the specific stakes of medicine, where the cost of an undetected failure is measured in patients rather than dollars.

[Figure 37.8.1 diagram. Federated clinical safety stack: a per-site fairness check, calibration + uncertainty, and robustness to a bad site feed a deployment gate; the gate passes a calibrated score + abstain flag to the clinician in the loop; an audit log records model + site versions.]
Figure 37.8.1: The clinical safety stack for the federated model. Three per-site checks (fairness across sites and demographic groups, calibration with an uncertainty-driven abstain, and robustness to a poisoned or Byzantine site) feed a deployment gate that must pass before any release. The gate emits a calibrated risk score and an abstain flag to a clinician who makes the final decision, and every step is written to an audit log that records which model version and which site versions were involved. The checks sit upstream of the clinician because a defect at any one site, left undetected, propagates into every prediction the federation makes.

1. The Model Is a Medical Device, Not a Demo (Beginner)

A model that outputs a clinical risk score and influences a treatment decision is, in most jurisdictions, software as a medical device, and it is governed accordingly. That status changes the engineering problem in three ways that the rest of this chapter has so far been free to ignore. First, the evidence standard is prospective, not retrospective: a model that scores well on historical records still has to be validated on patients seen after the model was frozen, because retrospective performance is contaminated by the very correlations the model may be exploiting. Second, the unit of approval is a frozen, versioned artifact; a federation that keeps learning from new hospital data is, from a regulator's point of view, producing a new device on every update, which is why the audit trail of Section 5 of this section is not optional bookkeeping but part of the clearance. Third, and most important, the cleared system is almost never the model alone; it is the model plus a clinician who can override it. The human in the loop is a designed safety component, and the model's job is to make that human's judgment better, not to replace it.

This reframes what "good" means for the federated model. A higher area-under-curve on a pooled validation set is necessary but nowhere near sufficient. The model must also be fair so its errors are not concentrated on a vulnerable subpopulation, calibrated so its score means what it says, robust so a single bad actor cannot steer it, and auditable so a reviewer can reconstruct why it said what it said. The remaining sections take those four properties in turn, each one made harder by the fact that the data never sits in one place.

Key Insight: In a Federation, Responsibility Is a Collective Property

On one machine, you can in principle inspect every training record, recompute every metric on one held-out set, and trace any prediction to its inputs. A federation forfeits all three by construction, because the data is the one thing that never moves. Safety therefore cannot be a final check performed centrally; it has to be decomposed into per-site computations that the coordinator aggregates, exactly as the gradient is. A fairness metric, a calibration curve, and a poisoning defense all become collective operations, and a single site that computes its share wrongly, or maliciously, can poison the aggregate the same way one bad gradient can.

2. Fairness Across Sites and Across Demographics (Intermediate)

A federated model is trained to minimize loss averaged over the federation, and an average is exactly where a minority hospital disappears. If one small rural site contributes five percent of the patients, a model that is excellent for the other ninety-five percent and useless for that site still posts a strong global score. The same arithmetic operates inside a site: a demographic subgroup that is a minority everywhere can be under-served everywhere while every pooled metric looks healthy. Section 37.5 framed this distributional unevenness as a training problem; here it is a fairness problem, and the two are the same phenomenon seen from different ends. The defense is to refuse to trust any single global number and instead require the model to clear a bound on the worst group, not the average group.
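To make the arithmetic concrete, take two hypothetical accuracies (illustrative values, not results from this chapter): $0.95$ on the ninety-five percent majority and $0.50$ at the five-percent site. The pooled accuracy is the patient-weighted average,

$$0.95 \times 0.95 + 0.05 \times 0.50 = 0.9025 + 0.0250 = 0.9275,$$

so the pooled figure drops by barely two points from the majority's $0.95$ while the small site's own accuracy sits forty-five points below it. The worst-group metric, $\min_g M_g = 0.50$, reports the failure that the average absorbs.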

Concretely, partition patients into groups indexed by $g$ (a group can be a site, a demographic class, or their intersection), let $M_g$ be a performance metric computed on group $g$, and define the fairness gap as the spread across groups,

$$\Delta_{\text{fair}} = \max_{g} M_g - \min_{g} M_g.$$

A deployment gate then demands $\Delta_{\text{fair}} \le \epsilon$ for a clinically agreed tolerance $\epsilon$, alongside a floor $\min_g M_g \ge M_{\min}$ so that no group falls below an absolute standard. The subtlety unique to federation is that $M_g$ cannot be computed centrally: the records that make up group $g$ are scattered across hospitals and never leave them. Each site computes its own per-group confusion counts (true positives, false negatives, and so on) and sends only those aggregate counts, which the coordinator sums to recover the federation-wide $M_g$. This mirrors the evaluation discipline of Chapter 5, now applied to a fairness metric rather than a throughput metric, and it ties directly to the per-site disparity that Section 37.5 and Section 37.7 measure for training and drift.
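The count-summing step can be sketched in a few lines. This is a minimal illustration, assuming each site reports only true-positive and false-negative counts per group and that the gated metric $M_g$ is recall; the site reports, group labels, tolerance `eps`, and floor are all made-up values chosen for the example, not figures from this chapter.

```python
from collections import defaultdict

# Hypothetical per-site reports: {group: (true_positives, false_negatives)}.
# Only these counts leave each hospital; the numbers are illustrative.
site_reports = [
    {"A": (80, 20), "B": (8, 4)},    # site 0
    {"A": (90, 10), "B": (5, 5)},    # site 1
    {"A": (70, 30), "B": (12, 6)},   # site 2
]


def federated_recall(reports):
    """Sum TP/FN counts across sites, then compute recall per group."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for report in reports:
        for g, (t, f) in report.items():
            tp[g] += t
            fn[g] += f
    # Assumes every group has at least one positive case somewhere.
    return {g: tp[g] / (tp[g] + fn[g]) for g in tp}


def fairness_gate(metric_by_group, eps, floor):
    """Check the gap bound (max - min <= eps) and the floor (min >= floor)."""
    worst = min(metric_by_group.values())
    gap = max(metric_by_group.values()) - worst
    return (gap <= eps and worst >= floor), gap, worst


recall = federated_recall(site_reports)
passes, gap, worst = fairness_gate(recall, eps=0.10, floor=0.70)
print(recall, f"gap={gap:.3f}", f"worst={worst:.3f}", f"passes={passes}")
```

With these illustrative counts, group A pools to a recall of $0.800$ and group B to $0.625$, so the gate fails on both the gap and the floor. Summing counts before dividing matters: averaging per-site recalls would weight a site with four positive cases the same as one with forty.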

The trap is that the most-quoted fairness number, demographic parity (whether the model flags each group at the same rate), can look perfect while the model is harming a group. The demo in Section 6 of this section shows exactly this: the flag-rate gap between groups is a tiny $0.013$, yet the model systematically under-flags the high-risk minority group, a defect that only surfaces once the flag rate is compared against the group's true outcome rate. The lesson is that the right metric compares flags against outcomes, for example a gap in false-negative rates or, as in the demo, the gap between each group's flag rate and its true outcome rate, and that the metric must be co-computed against ground truth on the same patients, never read off a parity dashboard.

3. Calibration and the Discipline of Abstaining (Intermediate)

A clinical score that reads "0.7 risk of deterioration" is only useful if, among all patients scored near 0.7, roughly seventy percent actually deteriorate. That property is calibration, and it is what lets a clinician combine the model's number with everything else they know. A model can rank patients perfectly (high discrimination) while being badly miscalibrated, so calibration is a separate gate, not a corollary of accuracy. The standard summary is the expected calibration error, which bins predictions by confidence and measures, in each bin $b$, the gap between the mean predicted probability $\text{conf}(b)$ and the observed outcome frequency $\text{acc}(b)$, weighted by the share of samples $n_b / n$ in the bin:

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{n}\,\bigl|\,\text{acc}(b) - \text{conf}(b)\,\bigr|.$$

As with fairness, ECE must be computed per site and aggregated, because the calibration of a model trained on the whole federation can differ sharply from site to site: a hospital whose case mix the model saw little of will typically be the worst calibrated. The demo in Section 6 of this section shows the site-to-site spread: its three synthetic sites are nearly equal in size, and the highest-acuity one (site 2) posts the largest local ECE.
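One way to aggregate ECE without moving records is to note that $\frac{n_b}{n}\,|\text{acc}(b) - \text{conf}(b)|$ simplifies to $\frac{1}{n}\,\bigl|\sum_{i \in b} y_i - \sum_{i \in b} p_i\bigr|$, so each site only needs to send, per bin, a count, a sum of predictions, and a sum of outcomes. The sketch below assumes equal-width bins that are identical at every site; the helper names `local_bin_stats` and `ece_from_stats` are hypothetical, and the bins are half-open $[lo, hi)$, which differs from a $(lo, hi]$ convention only for predictions exactly on a bin edge.

```python
import numpy as np


def local_bin_stats(p, y, n_bins=10):
    """Per-bin (count, sum of predictions, sum of outcomes) for ONE site."""
    # Equal-width bins on [0, 1]; the minimum keeps p == 1.0 in the last bin.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    count = np.bincount(idx, minlength=n_bins)
    sum_p = np.bincount(idx, weights=p, minlength=n_bins)
    sum_y = np.bincount(idx, weights=y, minlength=n_bins)
    return np.stack([count, sum_p, sum_y])


def ece_from_stats(stats):
    """ECE from bin statistics: sum_b |sum_y_b - sum_p_b| / n."""
    count, sum_p, sum_y = stats
    return np.abs(sum_y - sum_p).sum() / count.sum()


# Two tiny illustrative sites; only the 3 x n_bins arrays leave each hospital.
site_a = local_bin_stats(np.array([0.15, 0.85]), np.array([0, 1]))
site_b = local_bin_stats(np.array([0.15, 0.85]), np.array([1, 1]))

federated = ece_from_stats(site_a + site_b)   # coordinator adds the arrays
print(f"site A ECE   : {ece_from_stats(site_a):.3f}")
print(f"site B ECE   : {ece_from_stats(site_b):.3f}")
print(f"federated ECE: {federated:.3f}")
```

The coordinator should add the bin statistics and take the absolute value afterward, not average the per-site ECE values: the absolute value does not commute with the sum, so over-prediction at one site can cancel under-prediction at another in the federated figure while both remain visible in the per-site numbers. That is one more reason the gate looks at both.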

Calibration tells you the score is honest on average; it does not tell you the score is trustworthy for the patient in front of you. For that, the model must be willing to say "I do not know." An abstention rule defers the decision to the clinician whenever the predicted probability $p$ sits in a low-confidence band around the decision threshold $\tau$,

$$\text{abstain}(p) \iff |p - \tau| < \delta,$$

for a margin $\delta$ chosen so that the cases the model does decide are the ones it decides well. Abstention is the algorithmic embodiment of the human-in-the-loop principle from Section 1 of this section: the model handles the confident majority and routes the genuinely uncertain cases to a person, which is precisely where a clinician's judgment adds the most. The demo shows the model's accuracy on its auto-decided cases rising well above its accuracy on all cases once the uncertain band is handed off.

Library Shortcut: fairlearn and a Calibration Snippet

You do not implement per-group metrics or reliability curves by hand in production. The fairlearn library computes grouped metrics and disparity in a few lines, and scikit-learn gives the calibration curve and a recalibrator directly. In a federation you still run these per site and aggregate the resulting counts, but the local computation is one call each:

from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import recall_score
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

# Per-group metrics on ONE site's local data (flag rate and recall by group).
mf = MetricFrame(
    metrics={"flag_rate": selection_rate, "recall": recall_score},
    y_true=y_local, y_pred=flag_local, sensitive_features=group_local)
print(mf.by_group)              # one row per demographic group
print(mf.difference())          # the worst-case gap = the fairness gap Delta

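# Local reliability curve (feeds the aggregated ECE) and a calibration fit.
# y_local, p_local, and base_estimator stand for one site's labels, predicted
# probabilities, and already-trained model.
frac_pos, mean_pred = calibration_curve(y_local, p_local, n_bins=10)
# cv="prefit" wraps an already-fitted model; scikit-learn 1.6+ deprecates it in
# favor of wrapping the model in sklearn.frozen.FrozenEstimator.
calibrated = CalibratedClassifierCV(base_estimator, method="isotonic", cv="prefit")
# The recalibrator must still be fit on held-out local data before it is used.

Code 37.8.1: The from-scratch ECE loop and fairness-gap arithmetic of Code 37.8.2 collapse to one MetricFrame and one calibration_curve call per site; fairlearn handles the grouping and disparity reduction, and scikit-learn supplies the reliability curve and an isotonic recalibrator. The federation still owns the aggregation across sites.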

4. Robustness to a Poisoned or Compromised Site (Advanced)

In a federation, the attack surface is the membership list. Every hospital that contributes an update is trusted to have computed it correctly, and a single compromised site, whether through a malicious insider, a breached credential, or corrupted local data, can submit a poisoned update designed to degrade the global model or to plant a backdoor that misfires only on chosen inputs. This is the data-and-model-poisoning threat of Section 35.4 arriving in a clinical setting, where a backdoor is not an abstract risk but a model that silently misclassifies a particular patient profile. Standard federated averaging is defenseless against it, because the mean is dominated by any update large enough, and a poisoned update can be made arbitrarily large.

The remedy is robust aggregation, the subject of Section 35.5: replace the mean of the site updates with an estimator that tolerates a fraction of Byzantine participants, such as a coordinate-wise median, a trimmed mean, a geometric median, or a distance-based selection rule like Krum. These estimators bound the influence any single site can have on the aggregate, so one hospital's poisoned gradient is clipped or discarded rather than averaged in. The tension specific to this chapter is that robust aggregation and the secure aggregation of Section 37.6 pull against each other: secure aggregation deliberately hides the individual site updates so the coordinator learns only their sum, which is precisely the information a robust estimator needs in order to spot and reject an outlier. Reconciling privacy with robustness, through cryptographic protocols that compute a median over hidden inputs or through verifiable aggregation, is an active engineering frontier rather than a solved trade-off, and a federation must decide consciously where on that spectrum it sits.
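The difference between the mean and a robust estimator is easy to see on toy numbers. The sketch below assumes the coordinator can see every site's update in the clear (that is, no secure aggregation), uses five hypothetical sites with three parameters each, and implements only the coordinate-wise median and a trimmed mean; the `trimmed_mean` helper and all values are illustrative.

```python
import numpy as np


def trimmed_mean(updates, trim):
    """Coordinate-wise mean after dropping the `trim` lowest and highest values."""
    s = np.sort(updates, axis=0)              # sort each coordinate across sites
    return s[trim: len(updates) - trim].mean(axis=0)


rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(4, 3))   # 4 honest sites near 1.0
poisoned = np.full((1, 3), -100.0)                      # 1 attacker, scaled update
updates = np.vstack([honest, poisoned])                 # shape (5 sites, 3 params)

mean_agg = updates.mean(axis=0)               # plain federated averaging
median_agg = np.median(updates, axis=0)       # coordinate-wise median
trimmed_agg = trimmed_mean(updates, trim=1)   # drop 1 value at each extreme

print("mean        :", np.round(mean_agg, 2))
print("median      :", np.round(median_agg, 2))
print("trimmed mean:", np.round(trimmed_agg, 2))
```

The mean is dragged to roughly $-19$ in every coordinate by the single attacker, while the median and the trimmed mean stay near the honest value of $1$. Both robust estimators need the individual rows of `updates`, which is exactly what secure aggregation withholds.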

Thesis Thread: Responsibility Scales Out the Way the Model Does

The whole book has argued that a primitive introduced for one machine returns, multiplied, across many. Safety is no exception. Calibration, a fairness check, and a poisoning defense are single-model ideas; spread across a federation they become collective operations with the same structure as the gradient all-reduce of Section 1.1: each site computes a local share, the coordinator combines the shares, and one corrupted or biased share can spoil the result. The scale-out lesson is that distributing intelligence across institutions also distributes the responsibility for it, and a system that combines gradients correctly but cannot tell which site sent a poisoned one has scaled its capability without scaling its conscience.

5. Accountability When the Data Was Never Centralized (Advanced)

When a clinician overrides the model, or a regulator audits an adverse event, someone has to answer a precise question: why did the model produce this score for this patient, and what was it trained on. In a centralized system the answer is a query against the training set and a model card. In a federation the training data was never centralized and, under the privacy constraints of Section 37.3, cannot be, so accountability has to be reconstructed from records the coordinator is allowed to keep. This is the governance and auditability problem of Section 35.7, sharpened by the privacy wall that forbids the obvious solution.

What the federation can keep is a tamper-evident log of the process rather than the data: which sites participated in each training round, the version hash of the global model at every release, the aggregation rule and its robustness parameters, the per-site and per-group metrics that the deployment gate evaluated, and, at inference time, the model version, which input features were used, the score returned, and whether the model abstained. None of this exposes a patient record, yet together it lets a reviewer reconstruct the lineage of any decision: the cleared device version, the evidence that cleared it, and the sites that shaped it. This audit spine is what makes the frozen-artifact discipline of Section 1 of this section enforceable, and it is the same lineage Section 37.7 relies on to attribute drift to a specific site. Accountability, like fairness and robustness, becomes a property the federation builds deliberately, because the architecture removed the centralized record that would otherwise have provided it for free.
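One common way to make such a log tamper-evident is a hash chain, in which each entry's hash covers the previous entry's hash. The sketch below is an illustration under that assumption, not the chapter's prescribed design: the record fields, version strings, and helper names are hypothetical, and a real deployment would also need to anchor the latest hash somewhere the log's writer cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder hash that precedes the first entry


def _digest(prev_hash, record):
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, record):
    """Append a record whose hash depends on every entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": _digest(prev_hash, record)})


def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"event": "train_round", "round": 12,
                   "sites": ["A", "B", "C"], "aggregator": "trimmed_mean"})
append_entry(log, {"event": "release", "model_version": "v1.3.0",
                   "gate_passed": True})
append_entry(log, {"event": "inference", "model_version": "v1.3.0",
                   "score": 0.62, "abstained": True})
print("chain intact:", verify(log))
```

Editing any stored record after the fact, for instance flipping `gate_passed`, changes that entry's recomputed hash and makes `verify` fail, which is the property a reviewer relies on when reconstructing a decision's lineage. No record here contains patient data.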

Practical Example: The Sepsis Model That Passed Globally and Failed a Site

Who: A clinical informatics team operating a five-hospital federation for early sepsis prediction.

Situation: A federated model reached a strong pooled area-under-curve and was approved for a phased rollout across all five sites.

Problem: Within weeks, nurses at the smallest hospital, a children's facility, reported that the alerts felt random, while the four adult hospitals saw useful warnings.

Dilemma: Trust the global metric that cleared the model and attribute the complaints to user skepticism, or pull the model at the one site and risk undermining confidence in the whole federation.

Decision: The team computed the deployment-gate metrics per site rather than pooled, exactly the federated per-group computation of Section 2 of this section.

How: Each site recomputed its local ECE and per-group recall and sent only the aggregate counts; the coordinator found the pediatric site's ECE more than double the federation's, and its recall far below the floor, a minority population the average had hidden.

Result: The model was gated off for the pediatric site pending a site-aware recalibration and a raised abstention margin there, while it continued serving the four adult hospitals; the audit log made the targeted rollback defensible to the regulator.

Lesson: A federation's worst-served group is invisible in its best metric. Compute fairness and calibration per site, gate on the worst group, and keep the audit trail that lets you act on one site without condemning the whole.

6. Demo: Federated ECE, a Fairness Gap, and an Abstention Rule (Intermediate)

The code below builds a small synthetic federation of three hospitals, each reporting per-patient tuples of predicted risk, true outcome, and demographic group, the only information the coordinator is allowed to aggregate. It computes the expected calibration error globally and per site, a fairness gap across the two demographic groups, and the effect of an abstention rule. The model is deliberately constructed to be well-calibrated on the majority group but to under-predict the minority group's risk by a fixed offset, the defect Section 2 of this section warned about.

import numpy as np

rng = np.random.default_rng(7)

# A federation of 3 hospitals. Each sends per-patient (predicted risk, true outcome,
# demographic group). The coordinator never sees raw records, only these tuples,
# which is enough to audit calibration and fairness.
N = 3000
site = rng.integers(0, 3, N)                      # 0,1,2 -> three hospitals
group = rng.integers(0, 2, N)                      # 0 = majority, 1 = minority subpopulation

# True latent risk rises with the site index (site 2 is the highest-acuity site)
# and is higher for the minority group at every site.
base = 0.15 + 0.25 * site                           # site 2 has the highest base risk
true_p = np.clip(base + 0.20 * group + 0.05 * rng.standard_normal(N), 0.01, 0.99)
y = (rng.random(N) < true_p).astype(int)            # realized outcomes

# Model's predicted probability. It is decently calibrated on the majority group
# but under-predicts minority-group risk by a fixed 0.18 (a fairness defect).
pred = np.clip(true_p - 0.18 * group + 0.04 * rng.standard_normal(N), 0.01, 0.99)


def ece(p, outcomes, n_bins=10):
    """Expected Calibration Error: |confidence - accuracy| averaged over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(p)
    err = 0.0
    for b in range(n_bins):
        lo, hi = edges[b], edges[b + 1]
        mask = (p > lo) & (p <= hi) if b > 0 else (p >= lo) & (p <= hi)
        if mask.sum() == 0:
            continue
        conf = p[mask].mean()                       # mean predicted prob in bin
        acc = outcomes[mask].mean()                 # observed outcome rate in bin
        err += (mask.sum() / total) * abs(conf - acc)
    return err


# 1. Federated calibration: each site computes its bin counts locally; here we
#    show the global ECE and the per-site ECE the federation would aggregate.
print("Global ECE            :", f"{ece(pred, y):.4f}")
for s in range(3):
    m = site == s
    print(f"  site {s} ECE (n={m.sum():4d}) :", f"{ece(pred[m], y[m]):.4f}")

# 2. Federated fairness gap: the rate at which the model FLAGS patients at a
#    0.5 decision threshold, per demographic group, next to each group's true
#    outcome rate. The flag-rate gap alone is the (misleading) parity number.
thr = 0.5
flag = (pred >= thr).astype(int)
rate = [flag[group == g].mean() for g in (0, 1)]
true_rate = [y[group == g].mean() for g in (0, 1)]
print("\nFlag rate group 0 / 1 :", f"{rate[0]:.3f} / {rate[1]:.3f}")
print("True rate group 0 / 1 :", f"{true_rate[0]:.3f} / {true_rate[1]:.3f}")
print("Fairness gap |dflag|  :", f"{abs(rate[0] - rate[1]):.3f}")
print("Under-flagging gap    :", f"{abs((rate[1]-true_rate[1]) - (rate[0]-true_rate[0])):.3f}")

# 3. Abstention rule: defer to a clinician when the predicted probability sits in
#    a low-confidence band around the threshold, |p - thr| < tau. Here `tau` is
#    the margin (delta in the text) and `thr` is the decision threshold.
tau = 0.15
abstain = np.abs(pred - thr) < tau
print("\nAbstention band tau   :", tau)
print("Fraction abstained    :", f"{abstain.mean():.3f}")
print("Fraction abstained g1 :", f"{abstain[group == 1].mean():.3f}")
# Accuracy on the AUTO-decided cases (where the model does NOT abstain).
auto = ~abstain
auto_acc = (flag[auto] == y[auto]).mean()
all_acc = (flag == y).mean()
print("Accuracy, all cases   :", f"{all_acc:.3f}")
print("Accuracy, auto cases  :", f"{auto_acc:.3f}")

Code 37.8.2: A from-scratch safety audit over a synthetic three-hospital federation. The coordinator aggregates only per-patient tuples (predicted risk, outcome, group), never raw records, and from them computes the per-site ECE, the demographic fairness gap, and the effect of routing low-confidence cases to a clinician.

Global ECE            : 0.0816
  site 0 ECE (n= 973) : 0.0767
  site 1 ECE (n= 993) : 0.0730
  site 2 ECE (n=1034) : 0.0972

Flag rate group 0 / 1 : 0.364 / 0.377
True rate group 0 / 1 : 0.402 / 0.595
Fairness gap |dflag|  : 0.013
Under-flagging gap    : 0.180

Abstention band tau   : 0.15
Fraction abstained    : 0.433
Fraction abstained g1 : 0.418
Accuracy, all cases   : 0.678
Accuracy, auto cases  : 0.754

Output 37.8.2: The audit tells three stories. The highest-acuity site (site 2) is the worst calibrated, with an ECE of $0.097$ against the federation's $0.082$. The naive demographic-parity gap is a reassuring $0.013$, yet the model under-flags the genuinely higher-risk minority group by $0.180$ once flag rates are compared to true outcome rates, the hidden harm Section 2 warned about. Abstaining on the uncertain band, which defers $43.3\%$ of cases to a clinician, yields an accuracy of $0.754$ on the auto-decided cases against $0.678$ on all cases, the value the clinician-in-the-loop captures.

Three conclusions follow, and each maps to a row in Figure 37.8.1. The per-site ECE shows why calibration must be checked site by site: the federation-wide figure ($0.082$) understates the worst site ($0.097$), the same masking that hid the pediatric site in the practical example. The two fairness numbers show why a single parity metric is dangerous: the rate gap looked fine while the outcome-aware gap revealed real under-service. And the abstention result shows the human-in-the-loop paying off in measurable accuracy on exactly the cases the model keeps. A deployment gate that runs all three checks, on aggregated per-site counts, is the operational form of the responsibility this section has argued for.
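A gate of this kind can be sketched as a function that collects failures instead of stopping at the first one, so the audit log records every reason a release was blocked. The per-site ECE values and the fairness gap below are taken from Output 37.8.2; the `GateInputs` structure, the thresholds `max_ece` and `max_gap`, and the boolean robustness flag are assumptions made for illustration, and real tolerances would be agreed clinically.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    site_ece: dict             # site id -> local ECE
    outcome_gap: float         # outcome-aware fairness gap across groups
    robust_aggregation: bool   # was a Byzantine-tolerant aggregator used?


def deployment_gate(x, max_ece, max_gap):
    """Return (passed, failures); every check must pass before release."""
    failures = []
    # Gate on the WORST site, not the federation average.
    worst_site, worst_ece = max(x.site_ece.items(), key=lambda kv: kv[1])
    if worst_ece > max_ece:
        failures.append(f"calibration: site {worst_site} ECE {worst_ece:.4f} exceeds {max_ece}")
    if x.outcome_gap > max_gap:
        failures.append(f"fairness: gap {x.outcome_gap:.3f} exceeds {max_gap}")
    if not x.robust_aggregation:
        failures.append("robustness: aggregation rule is not Byzantine-tolerant")
    return (not failures), failures


inputs = GateInputs(site_ece={0: 0.0767, 1: 0.0730, 2: 0.0972},
                    outcome_gap=0.180, robust_aggregation=True)
passed, failures = deployment_gate(inputs, max_ece=0.09, max_gap=0.05)
print("release allowed:", passed)
for reason in failures:
    print(" -", reason)
```

Under these assumed thresholds the demo model is blocked twice, once for site 2's calibration and once for the under-flagging gap, even though a gate on the global ECE of $0.0816$ and the parity gap of $0.013$ would have let it through.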

7. Frontier and Closing the Loop (Advanced)

The hard problems here are open ones. Reconciling secure aggregation with Byzantine robustness, so a federation can both hide individual site updates and reject a poisoned one, is an active line of work, as is producing a per-prediction uncertainty that is itself trustworthy under distribution shift across sites. The research frontier below points at where these threads are heading; the exercises then ask you to reason about the trade-offs the demo made visible.

Research Frontier: Trustworthy Federated Clinical AI (2024 to 2026)

Three strands are converging on the safety layer of this section. Regulators have moved from static clearance toward predetermined change-control plans for adaptive medical AI, which formalize exactly the frozen-artifact-plus-audit-trail discipline of Sections 1 and 5 of this section and let a federation update under a pre-approved envelope. On the technical side, work on privacy-preserving robust aggregation seeks protocols that compute a robust estimator (a median or trimmed mean) over cryptographically hidden site updates, directly attacking the secure-versus-robust tension of Section 4; recent schemes combine secure aggregation with verifiable outlier rejection. A third strand, conformal prediction in the federated setting, produces calibrated prediction sets with finite-sample coverage guarantees per site, turning the abstention heuristic of Section 3 into a procedure with a provable error rate even when each hospital's distribution differs. We met the privacy primitives these build on in Chapter 14 and the robustness primitives in Chapter 35; the case study is where they have to hold simultaneously, for real patients.

This section completes the responsibility layer of the federated medical model: it is a regulated device validated prospectively, gated on its worst group rather than its average, calibrated and willing to abstain, robust to a single bad site, and auditable without ever centralizing a record. Section 37.9 turns the whole chapter outward, proposing project extensions that take this federation from a case study into a system you can build and defend.

Exercise 37.8.1: Why the Parity Number Lied (Conceptual)

In Output 37.8.2 the flag-rate gap between groups is $0.013$ while the under-flagging gap is $0.180$. Explain precisely why demographic parity (equal flag rates) can be satisfied while the model still harms the minority group, using the true outcome rates ($0.402$ versus $0.595$) in your argument. Then propose which single fairness metric you would put on the deployment gate, and state the clinical consequence of choosing parity instead.

Exercise 37.8.2: Tune the Abstention Margin (Coding)

Modify Code 37.8.2 to sweep the abstention margin $\delta$ (the variable tau) from $0.0$ to $0.4$ and, for each value, record the fraction of cases abstained and the accuracy on the auto-decided cases. Plot or tabulate the trade-off. Then add a per-group breakdown: does a single global $\delta$ abstain fairly across groups, or does the miscalibrated minority group get a different abstention rate? Recommend whether the margin should be set globally or per site, and justify the choice from your numbers.

Exercise 37.8.3: Secure Aggregation Versus a Poisoned Site (Analysis)

Section 4 of this section argued that secure aggregation (which hides individual site updates) and Byzantine-robust aggregation (which needs to see outlier updates to reject them) are in tension. Sketch a concrete attack: one of three sites submits a model update scaled by a factor of $100$ to dominate plain federated averaging. Explain what a coordinate-wise median would do to that update if the coordinator could see all three, and why secure aggregation that reveals only the sum would let the attack through. Then describe, in two or three sentences, what a privacy-preserving robust aggregation protocol would need to compute to defend against the attack without exposing the honest sites' updates, referencing the robust-aggregation methods of Section 35.5.