Part VIII: Case Studies and Capstone Projects
Chapter 37: Federated Medical AI

Monitoring and Drift Across Sites

"I watch the same model age at twelve hospitals, and it ages differently at every one. One site's patients drifted, another's scanner was replaced, a third just started seeing a disease the model never met. I cannot read a single chart of theirs, so they each send me a number, and I decide who needs the model back."

A Monitor, Watching a Model Age Differently at Every Hospital
Big Picture

A federated model does not stop being federated the day it is deployed; its health must be monitored the same way it was trained, by rolling up evidence from sites that will never let raw data leave their walls. The hard part of training the model was that no hospital would share patients; the hard part of operating it is identical. Patient populations shift, clinical practice changes, scanners get swapped, and outcomes that tell you whether a prediction was right arrive weeks after the prediction was made. Each of those signals lives inside a hospital and must be measured there. The monitoring system therefore mirrors the training system: every site computes its own local metrics and drift statistics on data it keeps, and reports only small aggregates and alerts to a fleet dashboard that decides, fleet-wide, when a site has degraded enough to justify re-federating the model. This section applies the MLOps machinery of Chapter 26 and the fleet governance of Section 35.7 to that medical fleet, and shows the drift roll-up in code you can run.

The previous sections of this chapter built and deployed one global model from many hospitals that could not pool their data, and they confronted the heterogeneity of those hospitals head-on in Section 37.5. That heterogeneity does not disappear at deployment; it returns, transformed, as the central problem of this section. A model that was trained to compromise across very different patient populations will not serve all of them equally well, and the populations it serves keep moving after training stops. Monitoring is how we notice. The constraint that shaped everything upstream, that raw clinical data must stay inside each institution, now shapes monitoring too: we cannot ship a validation set to a central server and score the model on it, because there is no central validation set and there never will be. Monitoring itself has to be federated.

1. Monitoring Without Centralizing Data (Intermediate)

The default MLOps picture from Chapter 26 assumes a serving fleet whose inputs and outputs flow back to a central logging system, where a monitoring service computes drift and performance against a held-out reference. In a hospital federation that pipe does not exist. Predictions are made on patient records that are bound by the same privacy rules that prevented pooling for training (the clinical-privacy constraint developed in Section 37.3 and rooted in the secure-aggregation discipline of Chapter 14). So the monitoring computation has to be pushed to where the data already is. Each site runs the same monitoring code on its own live traffic, computes a fixed set of summary statistics, and emits only those summaries: bin counts for a drift histogram, a confusion-matrix tally once labels arrive, a calibration curve, an alert flag. None of those carry an individual patient, and several of them can be passed through the same secure-aggregation or differential-privacy layer the training used (Chapter 35).

This is the same inversion that defined federated training, applied to evaluation. In Section 37.4 we moved the gradient computation to the data; here we move the metric computation to the data. The fleet monitor never sees a record, only a stream of small reports, and its job is to combine them into a fleet-wide view and to decide where to intervene. The metrics it rolls up are exactly the ones defined in Chapter 5, now computed per site and aggregated rather than computed once on a central set, and the rollup is a streaming aggregation in the sense of Chapter 9: reports arrive continuously and the dashboard maintains running estimates. Figure 37.7.1 shows the shape of the system.

[Figure 37.7.1 diagram: four hospital sites keep their data local. Site A computes PSI / KL drift, local AUC, and calibration on local records; Site B runs the same monitoring code with labels arriving weeks late; Site C has a new scanner and detects data drift; Site D is a minority cohort where the global model underperforms. Only secure aggregates (n, PSI, AUC, alert) leave a site for the fleet dashboard, which shows rolled-up drift and per-site disparity and feeds the re-federation trigger: when a site breaches a threshold, a new FedAvg round runs and the model is redeployed.]
Figure 37.7.1: Federated monitoring mirrors federated training. Each hospital runs identical monitoring code on records that never leave the building, and emits only small aggregates (a count vector for drift, a confusion tally for performance, an alert flag) through a secure-aggregation layer. The fleet dashboard rolls these into a fleet-wide drift estimate and a per-site disparity view, and evaluates a re-federation trigger. The dashed arrow closes the loop: a fired trigger launches a new federated round and redeploys the refreshed model back to the sites.
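The report stream in the figure can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: the names `SiteReport` and `FleetMonitor`, and the choice to keep only the latest report per site, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SiteReport:
    """What one site emits per window: aggregates only, never records."""
    site: str
    n: int        # number of local samples behind this report
    psi: float    # local input-drift statistic
    alert: bool   # local alert flag


@dataclass
class FleetMonitor:
    """Hypothetical fleet-side state: the latest report per site."""
    latest: dict = field(default_factory=dict)

    def ingest(self, report: SiteReport) -> None:
        # A newer report from the same site replaces the older one.
        self.latest[report.site] = report

    def fleet_psi(self) -> float:
        # Sample-weighted running estimate over the sites heard from so far.
        n_tot = sum(r.n for r in self.latest.values())
        if n_tot == 0:
            return 0.0
        return sum(r.n * r.psi for r in self.latest.values()) / n_tot

    def sites_in_alert(self) -> list:
        return sorted(r.site for r in self.latest.values() if r.alert)


monitor = FleetMonitor()
for rep in [
    SiteReport("A", 4000, 0.01, False),
    SiteReport("B", 1000, 0.30, True),
    SiteReport("A", 4000, 0.02, False),   # a later window from site A
]:
    monitor.ingest(rep)

print(f"fleet PSI = {monitor.fleet_psi():.4f}, alerts = {monitor.sites_in_alert()}")
```

The monitor holds nothing but the small reports, so its state stays tiny however much traffic each hospital sees.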

2. Data Drift Versus Concept Drift (Intermediate)

Two different things can go wrong after deployment, and they call for different responses, so it is worth keeping them apart. Write the joint distribution of inputs $x$ and outcomes $y$ at a site as $p(x, y) = p(x)\,p(y \mid x)$. Data drift (also called covariate shift) is a change in $p(x)$ with $p(y \mid x)$ unchanged: the patients coming through the door look different, but the relationship between findings and outcome is the same. A hospital that opens a geriatric ward, or replaces a CT scanner with a newer scanner model that produces subtly different pixel statistics, shifts $p(x)$. Concept drift is a change in $p(y \mid x)$ itself: the same input now maps to a different outcome, because clinical practice changed, a new treatment altered prognosis, or the coding of a diagnosis was revised. Data drift you can detect from inputs alone, immediately. Concept drift you can only confirm once outcomes arrive, which in medicine is slow.

Data drift at a site is measured by comparing the current input distribution against the reference distribution fixed at federation time. Two standard one-feature measures are the Population Stability Index and the Kullback-Leibler divergence over a shared binning. With reference bin proportions $r_b$ and current local proportions $q_b$ across $B$ bins, the site computes

$$\mathrm{PSI} = \sum_{b=1}^{B} (q_b - r_b)\,\ln\frac{q_b}{r_b}, \qquad \mathrm{KL}(q \,\|\, r) = \sum_{b=1}^{B} q_b \ln\frac{q_b}{r_b}.$$

The conventional reading of PSI is that values below $0.10$ indicate a stable population, $0.10$ to $0.20$ a moderate shift worth watching, and $0.20$ or above a significant shift that warrants action. Both quantities are computed entirely from the histogram counts, so a site reports a vector of $B$ small integers and nothing else. The bin edges are part of the shared monitoring schema, agreed once and distributed to every site, which is what makes the per-site numbers comparable in the rollup.
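A small worked example makes the two formulas concrete. The bin counts below are invented for illustration, and the helper names are not from the chapter. The sketch also checks an identity that follows directly from the definitions above: PSI equals $\mathrm{KL}(q \,\|\, r) + \mathrm{KL}(r \,\|\, q)$, so it is a symmetrized KL divergence.

```python
import numpy as np


def proportions(counts, eps=1e-6):
    """Turn bin counts into proportions, flooring empty bins at eps."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return np.clip(p, eps, None)


def psi(q, r):
    return float(np.sum((q - r) * np.log(q / r)))


def kl(q, r):
    return float(np.sum(q * np.log(q / r)))


ref_counts = [100, 100, 100, 100, 100]     # illustrative reference histogram
local_counts = [60, 80, 100, 120, 140]     # illustrative current histogram

r = proportions(ref_counts)
q = proportions(local_counts)

print(f"PSI      = {psi(q, r):.4f}")
print(f"KL(q||r) = {kl(q, r):.4f}")
print(f"KL(r||q) = {kl(r, q):.4f}")
```

With these counts the PSI lands below $0.10$, the stable band, even though the histogram is visibly tilted, which gives a feel for how much movement the conventional thresholds tolerate.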

Key Insight: Drift Detection Travels as a Histogram, Not as Data

The entire input-drift signal for a monitored feature compresses to one vector of bin counts per site. That is why federated monitoring is feasible at all: a $B$-bin histogram with well-populated bins exposes no individual record, yet it carries the shape of $p(x)$, at bin resolution, that PSI and KL need. Sparse bins at small sites are the exception, and are one reason the reports may still pass through the privacy layer described in Section 1. Fix the bin edges once at federation time, ship them to every site, and the fleet monitor can compare a hundred hospitals on a common ruler while none of them disclose a record. Drift detection is the rare monitoring task that is both privacy-friendly and cheap to aggregate.
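Where bin counts are small enough to be sensitive, a site could noise its histogram before emitting it. The following is a minimal sketch of the standard Laplace mechanism applied to bin counts, not a method the chapter prescribes. It assumes add-or-remove-one-record neighboring datasets (so one patient changes one bin by one), and the function name, the value of `epsilon`, and the counts are illustrative.

```python
import numpy as np


def privatize_counts(counts, epsilon, rng):
    """Laplace mechanism on a histogram (sensitivity 1 under add/remove-one)."""
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0.0, None)   # post-processing: no negative counts


def psi_from_counts(local_counts, ref_counts, eps=1e-6):
    q = np.clip(local_counts / np.sum(local_counts), eps, None)
    r = np.clip(ref_counts / np.sum(ref_counts), eps, None)
    return float(np.sum((q - r) * np.log(q / r)))


rng = np.random.default_rng(0)
ref_counts = np.full(10, 2000.0)                       # illustrative reference
local_counts = np.array(
    [350, 400, 450, 480, 500, 520, 540, 560, 580, 620], dtype=float
)                                                      # illustrative site counts

noisy_counts = privatize_counts(local_counts, epsilon=1.0, rng=rng)
psi_clean = psi_from_counts(local_counts, ref_counts)
psi_noisy = psi_from_counts(noisy_counts, ref_counts)

print(f"PSI from exact counts : {psi_clean:.4f}")
print(f"PSI from noisy counts : {psi_noisy:.4f}")
```

With hundreds of records per bin the noise barely moves the PSI; the cost of the privacy guarantee grows as bins get sparse, which is exactly where the protection matters most.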

Concept drift is harder precisely because it lives in $p(y \mid x)$ and so depends on labels. The honest move is to keep two separate alarms: an input-drift alarm that fires fast on $p(x)$ changes from histograms alone, and a performance alarm that fires later, once outcomes let a site recompute the conditional behavior. A scanner swap should light the first alarm and not the second; a shift in how a disease is treated should eventually light the second with the first staying quiet. Conflating them sends you re-federating in response to a population change that the model already handles correctly, or worse, ignoring a genuine concept change because the inputs still look familiar.
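The two alarms can be seen separating in a toy simulation. Everything here is an assumption made for illustration: one feature, a fixed deployed rule that predicts 1 when $x > 0$, a noisy threshold outcome, and accuracy standing in for the performance metric.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000


def draw(mean_x, threshold):
    """x ~ N(mean_x, 1); y = 1 when x plus noise exceeds `threshold`."""
    x = rng.normal(mean_x, 1.0, N)
    y = (x + rng.normal(0.0, 0.5, N) > threshold).astype(int)
    return x, y


def psi(x, x_ref, edges):
    q = np.clip(np.histogram(x, bins=edges)[0] / len(x), 1e-6, None)
    r = np.clip(np.histogram(x_ref, bins=edges)[0] / len(x_ref), 1e-6, None)
    return float(np.sum((q - r) * np.log(q / r)))


def accuracy(x, y):
    """Fixed deployed model: predict 1 when x > 0."""
    return float(np.mean((x > 0).astype(int) == y))


x_ref, y_ref = draw(0.0, 0.0)
edges = np.quantile(x_ref, np.linspace(0, 1, 11))
edges[0], edges[-1] = -np.inf, np.inf

x_data, y_data = draw(0.8, 0.0)   # p(x) moves, p(y|x) unchanged
x_conc, y_conc = draw(0.0, 1.0)   # p(x) unchanged, p(y|x) moves

results = {
    "reference":     (psi(x_ref, x_ref, edges), accuracy(x_ref, y_ref)),
    "data drift":    (psi(x_data, x_ref, edges), accuracy(x_data, y_data)),
    "concept drift": (psi(x_conc, x_ref, edges), accuracy(x_conc, y_conc)),
}
for label, (p, a) in results.items():
    print(f"{label:<14} PSI = {p:.3f}   accuracy = {a:.3f}")
```

Under data drift the PSI jumps while accuracy holds; under concept drift the PSI stays near zero and only the label-dependent metric reveals the problem.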

3. Per-Site Performance Disparity (Advanced)

A federated model is a compromise across heterogeneous sites, and a compromise underserves its extremes. The global model that minimizes average loss across hospitals can still be substantially worse at a minority site whose patient mix is unlike the federation's center of mass, an effect that Section 37.5 analyzed at training time and that reappears here as a deployment-time disparity you must keep watching. Once labels arrive, each site $k$ computes its local performance metric $m_k$ (an AUC, an F1, a calibration error, any metric from Chapter 5) on its own outcomes and reports the scalar. The fleet monitor tracks the disparity gap between the best and worst served site,

$$\Delta = \max_k m_k - \min_k m_k,$$

and, just as important, watches whether the worst site is persistently the same site. A large $\Delta$ that always names the same minority hospital is a fairness failure, not noise; it says the global compromise has abandoned that site, and it is a signal that a personalized or site-adapted model (the local-fine-tuning options of Chapter 14) may serve it better than the shared global one. Per-site disparity is therefore a first-class monitored quantity, not an afterthought, and the governance practices of Section 35.7, fleet-wide audit and per-node health tracking, are what give it an owner and a paper trail.
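Tracking both the gap and the identity of the worst site takes very little code. In the sketch below the weekly AUC values are hypothetical, and the 75% persistence cutoff is an illustrative choice rather than a recommended standard.

```python
from collections import Counter

# Hypothetical per-site AUC reports, one dict per weekly window.
weekly_auc = [
    {"A": 0.88, "B": 0.86, "C": 0.74},
    {"A": 0.87, "B": 0.89, "C": 0.73},
    {"A": 0.89, "B": 0.85, "C": 0.75},
    {"A": 0.86, "B": 0.88, "C": 0.72},
]


def disparity(metrics):
    """Return (gap between best and worst site, name of the worst site)."""
    worst = min(metrics, key=metrics.get)
    best = max(metrics, key=metrics.get)
    return metrics[best] - metrics[worst], worst


gaps, worst_sites = zip(*(disparity(week) for week in weekly_auc))
site, weeks_worst = Counter(worst_sites).most_common(1)[0]
persistent = weeks_worst / len(weekly_auc) >= 0.75   # illustrative cutoff

for week, (gap, worst) in enumerate(zip(gaps, worst_sites), start=1):
    print(f"week {week}: gap = {gap:.2f}, worst site = {worst}")
print(f"most frequent worst site: {site} ({weeks_worst} of {len(weekly_auc)} weeks)")
print(f"structural disparity flag: {persistent}")
```

Here the best site changes from week to week while the worst never does, which is the pattern that separates a structural disparity from noise.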

Practical Example: The Model That Was Fine on Average and Failing in One City

Who: A clinical ML lead operating a federated sepsis-risk model across fourteen hospitals.

Situation: The fleet dashboard showed a healthy fleet-average AUC of 0.88 and no input-drift alerts for three months.

Problem: A safety review flagged that one urban safety-net hospital, the smallest site in the federation, had a local AUC of 0.71, far below the rest.

Dilemma: Re-federate the global model in the hope the next round helps that site, which risks regressing the thirteen sites that were fine, or branch a site-adapted model for the outlier, which adds a second artifact to govern and version.

Decision: They kept the global model for the fleet and fine-tuned a personalized head for the outlier site, because the disparity gap $\Delta$ had named the same hospital every week for a quarter, marking it as structural rather than transient.

How: The outlier site's local AUC and calibration curve, already in the monitoring reports, became the acceptance test for the adapted model; nothing about its patients left the building.

Result: The adapted model lifted the outlier's local AUC to 0.84 while the other thirteen sites kept the unchanged global model, and the disparity gap entered the governance log as a tracked, owned metric.

Lesson: A healthy fleet average can hide a failing minority site. Monitor the gap and the identity of the worst site, not just the mean.

4. Label Delay and the Triggers for Re-Federation (Advanced)

Medical outcomes are slow. Whether a sepsis prediction was correct, whether a tumor flagged as benign stayed benign, whether a discharged patient was readmitted, these labels settle days or weeks after the prediction. This label delay means the performance alarm is structurally lagged: at any moment a site can score the model only on the cohort whose outcomes have matured, and that cohort is weeks stale. Two consequences follow. First, input-drift monitoring carries the early-warning load, because it needs no labels and fires immediately; performance monitoring confirms later. Second, the fleet monitor must track which predictions are still awaiting outcomes, so it does not mistake an empty recent window for good performance. This is a streaming bookkeeping problem of exactly the kind Chapter 9 formalizes: outcomes join predictions on a delay, and the running metric is always over a watermark behind the present.
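The bookkeeping can be sketched as follows. The `Prediction` record, the 21-day delay, and the sample data are assumptions for illustration. The sketch scores only predictions at or behind the watermark, on the further assumption that outcomes which settle early may not be representative of their cohort.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    day: int                 # day the prediction was made
    predicted: int           # model output (0 or 1)
    outcome: Optional[int]   # None until the label matures


def windowed_accuracy(preds, today, label_delay_days):
    """Score only predictions old enough to have a settled outcome."""
    watermark = today - label_delay_days
    matured = [p for p in preds if p.day <= watermark and p.outcome is not None]
    pending = [p for p in preds if p.outcome is None]
    acc = None
    if matured:
        acc = sum(p.predicted == p.outcome for p in matured) / len(matured)
    return acc, len(matured), len(pending), watermark


preds = [
    Prediction(day=1, predicted=1, outcome=1),
    Prediction(day=3, predicted=0, outcome=1),
    Prediction(day=5, predicted=1, outcome=1),
    Prediction(day=20, predicted=1, outcome=None),   # still awaiting outcome
    Prediction(day=25, predicted=0, outcome=None),   # still awaiting outcome
]

acc, n_matured, n_pending, watermark = windowed_accuracy(
    preds, today=30, label_delay_days=21
)
print(f"watermark day     : {watermark}")
print(f"matured / pending : {n_matured} / {n_pending}")
print(f"accuracy (matured): {acc:.3f}")
```

Reporting the pending count alongside the metric is what stops an empty recent window from reading as good performance.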

Given these signals, a sequential change detector is the right tool to separate a real shift from week-to-week noise. A one-sided CUSUM accumulates the excess of a monitored statistic $z_t$ (a per-site PSI, or a drop in local performance) over a tolerance $k$ (a per-step allowance, unrelated to the site index $k$ of Section 3) and alarms when the running sum crosses a threshold $h$:

$$S_0 = 0, \qquad S_t = \max\!\left(0,\; S_{t-1} + (z_t - k)\right), \qquad \text{alarm when } S_t > h.$$
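The recurrence translates directly into code. The weekly PSI series and the values of $k$ and $h$ below are illustrative choices, not recommended settings.

```python
def cusum_alarm(z, k, h):
    """One-sided CUSUM; returns (index of first alarm or None, running sums)."""
    s, sums, first = 0.0, [], None
    for t, z_t in enumerate(z):
        s = max(0.0, s + (z_t - k))     # S_t = max(0, S_{t-1} + (z_t - k))
        sums.append(s)
        if first is None and s > h:     # alarm when S_t > h
            first = t
    return first, sums


weekly_psi = [0.05, 0.08, 0.04, 0.30, 0.35, 0.40]   # illustrative site PSI series
first, sums = cusum_alarm(weekly_psi, k=0.10, h=0.40)

for t, (z_t, s_t) in enumerate(zip(weekly_psi, sums)):
    marker = "  <- alarm" if t == first else ""
    print(f"t={t}  z={z_t:.2f}  S={s_t:.2f}{marker}")
```

The quiet weeks leave the sum pinned at zero, the first elevated reading alone does not cross $h$, and the alarm comes only once the elevation persists into a second week.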

A re-federation trigger is then a governed policy over these alarms rather than a single threshold. A retraining round is expensive (it pulls every site into a new federated computation, the very FedAvg loop of Chapter 14) and a clinical redeploy passes through validation gates, so you do not fire it on one noisy week. A defensible policy fires when any single site's PSI breaches the significant-shift threshold, or when a CUSUM on local performance alarms at any site, or when the fleet-weighted drift trends up across consecutive windows, with the decision and its evidence written to the audit trail that Section 35.7 mandates. The model registry of Chapter 26 then records the new global model, the trigger that produced it, and the per-site metrics that justified the redeploy, so the fleet's history is reconstructable.
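The policy just described can be written as one function over the alarms. This is a sketch under assumptions: the names `TriggerDecision` and `refederation_decision`, and the reading of "trends up across consecutive windows" as a strict rise over the last three windows, are illustrative. The per-site PSI values are the ones reported in Output 37.7.1; the fleet history is invented.

```python
from dataclasses import dataclass

PSI_SIGNIFICANT = 0.20   # significant-shift threshold from Section 2


@dataclass
class TriggerDecision:
    fire: bool
    reasons: list        # evidence that would be written to the audit trail


def refederation_decision(site_psi, perf_cusum_alarm, fleet_psi_history,
                          trend_windows=3):
    reasons = []

    breached = sorted(s for s, v in site_psi.items() if v >= PSI_SIGNIFICANT)
    if breached:
        reasons.append("PSI breach at: " + ", ".join(breached))

    alarmed = sorted(s for s, a in perf_cusum_alarm.items() if a)
    if alarmed:
        reasons.append("performance CUSUM alarm at: " + ", ".join(alarmed))

    # Assumed reading of "trends up": strictly rising over the last windows.
    recent = fleet_psi_history[-(trend_windows + 1):]
    rising = (len(recent) == trend_windows + 1
              and all(b > a for a, b in zip(recent, recent[1:])))
    if rising:
        reasons.append(f"fleet PSI rising for {trend_windows} windows")

    return TriggerDecision(fire=bool(reasons), reasons=reasons)


site_psi = {"A-Metro": 0.0012, "B-Rural": 0.0204, "C-Pediatric": 0.0294,
            "D-Geriatric": 0.4543, "E-NewScanner": 0.1474}
perf_alarm = {name: False for name in site_psi}
decision = refederation_decision(site_psi, perf_alarm, [0.04, 0.05, 0.05, 0.10])

print("FIRED" if decision.fire else "clear", decision.reasons)
```

Returning the reasons alongside the boolean keeps the decision and its evidence together, which is the shape an audit record needs.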

Thesis Thread: The Roll-Up Is the All-Reduce, One More Time

The fleet drift estimate in this section is computed exactly the way the gradient was in Section 1.1: every site contributes a small local summary, and the monitor combines them into one fleet-wide number, weighting each site by its sample count. The combine step that began life as gradient all-reduce returns here as a metric roll-up, and the privacy inversion that defined federated training, move the computation to the data, defines federated monitoring without change. Operating the model is the same distributed problem as training it, which is why the same primitives keep reappearing.

5. Rolling Up Drift to a Fleet Alert (Intermediate)

The code below makes the whole pipeline concrete. A reference distribution is fixed and reduced to ten shared bins (eleven decile edges). Five hospitals each draw their own current patient mix, compute a local PSI against the shared reference from histogram counts alone, and report only the pair (sample count, PSI). The fleet monitor combines those reports into a sample-weighted fleet PSI, names the worst site, and fires the re-federation trigger when any site breaches the significant-shift threshold. Two sites carry injected shifts: a geriatric site with a heavy population change, and a site that swapped a scanner. No site ever exposes a record; the monitor sees only the small reports.

import numpy as np

rng = np.random.default_rng(7)

# Reference distribution of one monitored feature (e.g. standardized patient age),
# fixed at federation time and reduced to 10 shared bins (11 edges). Sites receive
# the edges and return counts only; no raw patient record ever leaves a site.
ref = rng.normal(0.0, 1.0, 200_000)
edges = np.quantile(ref, np.linspace(0, 1, 11))      # 11 edges -> 10 decile bins
edges[0], edges[-1] = -np.inf, np.inf                # open the tails
ref_p = np.histogram(ref, bins=edges)[0] / len(ref)  # reference proportions r_b

def psi(local_p, ref_p, eps=1e-6):
    """Population Stability Index between a local and the reference histogram."""
    a = np.clip(local_p, eps, None)
    b = np.clip(ref_p, eps, None)
    return float(np.sum((a - b) * np.log(a / b)))

# Five hospitals. Site D has an aging, sicker cohort (population drift); site E
# swapped a scanner (a smaller equipment-induced mean shift).
sites = {
    "A-Metro":      rng.normal(0.00, 1.00, 4000),
    "B-Rural":      rng.normal(0.10, 1.05, 1500),
    "C-Pediatric":  rng.normal(-0.15, 0.95, 2500),
    "D-Geriatric":  rng.normal(0.80, 1.40, 1800),    # heavy population drift
    "E-NewScanner": rng.normal(0.35, 1.00, 1200),    # equipment-induced shift
}

ALERT = 0.20   # PSI >= 0.20 is the conventional "significant shift" threshold

print(f"{'site':<13}{'n':>6}{'PSI':>9}  status")
print("-" * 38)
reports = {}
for name, x in sites.items():
    local_p = np.histogram(x, bins=edges)[0] / len(x)   # site computes locally
    s = psi(local_p, ref_p)
    reports[name] = (len(x), s)                          # only (n, PSI) is emitted
    flag = "ALERT" if s >= ALERT else ("watch" if s >= 0.10 else "ok")
    print(f"{name:<13}{len(x):>6}{s:>9.4f}  {flag}")

# Fleet roll-up: only the per-site (n, PSI) aggregates are combined centrally.
n_tot     = sum(n for n, _ in reports.values())
fleet_psi = sum(n * s for n, s in reports.values()) / n_tot   # sample-weighted
worst     = max(reports, key=lambda k: reports[k][1])
n_alert   = sum(1 for _, s in reports.values() if s >= ALERT)

print("-" * 38)
print(f"sites reporting        : {len(reports)}")
print(f"sites in ALERT (>= {ALERT}) : {n_alert}")
print(f"fleet weighted PSI     : {fleet_psi:.4f}")
print(f"worst site             : {worst} (PSI = {reports[worst][1]:.4f})")
print(f"re-federation trigger  : {'FIRED' if n_alert >= 1 else 'clear'}")
Code 37.7.1: Federated drift monitoring end to end. Each site computes PSI from histogram counts on data it keeps and emits only the pair (count, PSI); the monitor combines the reports into a sample-weighted fleet PSI, flags the worst site, and fires the re-federation trigger on any site's significant-shift breach.
site              n      PSI  status
--------------------------------------
A-Metro        4000   0.0012  ok
B-Rural        1500   0.0204  ok
C-Pediatric    2500   0.0294  ok
D-Geriatric    1800   0.4543  ALERT
E-NewScanner   1200   0.1474  watch
--------------------------------------
sites reporting        : 5
sites in ALERT (>= 0.2) : 1
fleet weighted PSI     : 0.1003
worst site             : D-Geriatric (PSI = 0.4543)
re-federation trigger  : FIRED
Output 37.7.1: The geriatric site breaches the 0.20 significant-shift threshold and fires the trigger, while the scanner-swap site sits in the "watch" band; the fleet-weighted PSI of 0.10 alone would not have alarmed, which is why the per-site breakdown, not just the fleet average, drives the decision.

Notice what the fleet average hides. The sample-weighted fleet PSI is only $0.10$, sitting right at the bottom of the "watch" band and well short of an alarm; read alone it would suggest a healthy fleet. The per-site breakdown tells the real story: one hospital has drifted hard. This is the same lesson as the disparity gap in Section 3, now for inputs rather than performance. A monitor that watches only the rolled-up average will miss the single site that needs the model back, which is exactly the site the trigger exists to catch.

Library Shortcut: Evidently Computes the Per-Site Drift Report for You

Code 37.7.1 spells out PSI by hand to show that the signal is just a histogram comparison. In production each site runs a drift library on its local data and emits the resulting report; the library handles binning, multiple features, statistical tests, and the per-feature thresholds. With Evidently the site-local step is a few lines:

# Runs INSIDE one hospital, on data that never leaves the building.
# Written against the legacy Evidently Report API (0.4.x); later releases moved
# these imports and changed the result layout, so check your installed version.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# ref_df / local_df: pandas DataFrames of the reference and current local features.
report = Report(metrics=[DataDriftPreset()])           # one drift test per feature
report.run(reference_data=ref_df, current_data=local_df)   # fills the report in place
summary = report.as_dict()                             # only this aggregate is emitted
n_drifted = summary["metrics"][0]["result"]["number_of_drifted_columns"]
Code 37.7.2: The same per-site drift step as Code 37.7.1 using Evidently, which runs a drift test on every monitored feature at once (PSI, KS, and Wasserstein distance are among the tests it offers). Only the small summary aggregate (drifted-column counts and per-feature scores) leaves the site for the fleet roll-up; the reference and current frames stay local.

6. What Triggers a Round, and What Does Not (Intermediate)

Bringing the threads together, a deployed federated medical model is governed by a small set of monitored quantities and a policy over them. Input drift per site (PSI or KL from histograms) gives the fast, label-free early warning. Per-site performance and the disparity gap $\Delta$, lagged by label delay, give the slower confirmation and the fairness signal. A CUSUM over either statistic separates a sustained shift from a noisy week. The re-federation trigger fires on a governed combination of these, never on a single noisy reading, and every firing writes its evidence to the audit trail and registers the resulting model. The cost asymmetry is the whole reason for the care: a federated round and a clinical redeploy are expensive and slow, so the policy is tuned to fire when a site genuinely needs the model and to stay quiet otherwise.

Research Frontier: Privacy-Preserving Federated Monitoring (2024 to 2026)

The open problem is making the monitoring reports themselves as private as the training updates. Recent work pushes differential privacy and secure aggregation, the mechanisms of Chapter 35, from the gradient roll-up onto the metric roll-up, so that even the per-site PSI histograms and confusion tallies are aggregated under formal privacy guarantees rather than trusted to a central monitor. A parallel thread studies federated drift detection that distinguishes covariate shift from genuine concept drift without sharing labels, using label-free proxies and conformal methods to bound performance under unlabeled drift, and personalized federated learning that adapts to per-site drift on the fly rather than waiting for a full re-federation round. The unifying question, how to govern and monitor a model whose evaluation data is as locked away as its training data, is precisely the question this case study raises, and it remains genuinely open for safety-critical clinical fleets.

Fun Note

The model never finds out which hospital is unhappy with it. The hospital sends one number, the number says "0.45", and the fleet monitor quietly schedules a retraining round the way a building superintendent schedules a repair from a single blinking light, no idea what the room looks like, certain only that something in it has changed.

Exercise 37.7.1: Two Alarms, One Scanner (Conceptual)

A hospital replaces an MRI scanner; the new machine produces images with slightly different intensity statistics, but radiologists read them the same way and outcomes are unaffected. Using the $p(x, y) = p(x)\,p(y \mid x)$ decomposition from Section 2, state which of the two alarms (input-drift, performance) should fire and which should stay quiet, and explain why a monitor that fired re-federation on input drift alone would waste an expensive federated round here. Then describe a change at the same hospital that should fire the performance alarm while leaving the input-drift alarm quiet.

Exercise 37.7.2: A Disparity Detector (Coding)

Extend Code 37.7.1 so that each site, in addition to its PSI, reports a local AUC computed on a labeled cohort (simulate per-site labels with a logistic outcome whose noise level differs across sites, so one site is genuinely harder). At the fleet monitor, compute the disparity gap $\Delta = \max_k m_k - \min_k m_k$ and identify the worst-served site, and fire a separate fairness alert when $\Delta$ exceeds a threshold you choose. Confirm that the worst PSI site and the worst AUC site need not be the same site, and explain in one paragraph what that mismatch tells the operator.

Exercise 37.7.3: Tuning the Trigger Against Label Delay (Analysis)

Suppose outcomes arrive with a mean delay of three weeks, and your performance metric at any moment is computed only over the matured cohort. Using the CUSUM recurrence from Section 4, argue how the delay affects the threshold $h$ you should pick for a performance-based trigger versus an input-drift-based trigger, and why running both in parallel (fast input alarm, slow performance alarm) is more defensible than either alone. Quantify the trade-off: if you lower $h$ to react faster, what is the cost in false re-federation rounds, and how does the expense of a federated round (every site pulled into a new Chapter 14 round plus a clinical redeploy) bound how aggressive that threshold should be?