"They asked me to learn from a million patients I am never allowed to meet. So I sent my parameters to visit each hospital instead, and came home a little wiser, carrying no one's secrets."
A Global Model That Has Never Seen a Single Record
Federated medical AI is the case where the central move of every prior chapter, gather the data and compute over it, is forbidden by law, so the computation must travel to the data instead of the data travelling to the computation. A diagnostic model wants the statistical power of many hospitals' patients, yet patient records cannot leave the institution that created them: privacy regulation, contractual data-residency clauses, and the simple ethics of clinical confidentiality make a central data lake illegal rather than merely expensive. This chapter takes one such system apart end to end: a predictive model trained across $K$ hospitals that never share a record, only model updates. This opening section states the problem precisely, fixes the requirements that distinguish it from the freely centralized web corpus of Chapter 36, decomposes the federated pipeline into the stages the rest of the chapter builds, and maps each stage onto the six axes of distribution from Chapter 1. Everything that follows is engineering against the constraint written down here: no data movement, ever.
The previous chapter studied web-scale retrieval-augmented generation, where the corpus was enormous but free to move: pages are crawled, copied, cleaned, embedded, and re-indexed without anyone's permission, and the only ceilings were size, freshness, latency, and cost. Medical data inverts the premise. A single hospital's electrocardiograms, chest radiographs, or longitudinal records may be modest in volume, well within the memory of one machine, yet they are the most tightly governed bytes in the building. The difficulty here is not that the data is too big for one machine; it is that the data is locked at its source and cannot be collected onto any machine at all. The interesting model, and the one this chapter trains, is the one that wants every hospital's patients at once and is permitted to hold none of them.
We treat the system as a concrete target. The running specification is a clinical risk model, say a thirty-day readmission or early-sepsis predictor, trained jointly across $K$ hospitals of very different sizes, from a small rural clinic to a large academic medical center. Each site keeps its own patient records behind its own firewall and regulatory boundary. The sites exchange only model parameters or gradients with a coordinator, never a single patient row, and the trained model must predict well, leak nothing recoverable about any individual, satisfy the auditors at every participating institution, and generalize across sites whose patient populations and equipment differ. Those four demands, predictive quality, privacy, regulatory compliance, and cross-site generalization, are the requirements; the rest of the chapter is the design that meets all four under the no-data-movement constraint.
1. The Requirements: Four Demands Under One Hard Constraint Beginner
A problem definition is only useful if it commits to what success means, because the requirements decide which machinery the chapter must assemble. We fix four requirements and one overriding constraint, and carry them through the chapter. The constraint comes first because it dominates everything: no patient record may leave the institution that holds it. This is not a performance preference but a legal boundary, enforced by privacy regulation such as HIPAA in the United States and the GDPR in the European Union, by data-residency clauses in institutional contracts, and by research-ethics approvals that scope each dataset to its owning site. Under that constraint, the four requirements are as follows. The first is predictive quality: the federated model must approach the accuracy a centralized model would reach if pooling were legal, because a model that protects privacy but predicts poorly helps no patient. The second is a privacy guarantee: the updates that do leave each hospital must not allow an adversary, including the coordinator itself, to reconstruct or confirm any individual's record. The third is regulatory compliance: the entire training procedure must be auditable and defensible to the governance board of every participating institution, since any one of them can veto the collaboration. The fourth is cross-site generalization: the model must perform well at every hospital despite differences in patient demographics, disease prevalence, and even imaging equipment, the statistical heterogeneity that makes pooled data so valuable and federated training so delicate.
These four requirements interact, and the interaction is the whole subject of the chapter. Privacy fights predictive quality, because the noise that protects an individual also blurs the signal the model learns from. Generalization fights privacy, because adapting closely to one site's distribution leaks more about that site. Compliance fights everything, because an auditor may reject an aggregation scheme that the engineers find convenient. The contrast with Chapter 36 is sharp and worth stating plainly: there, the data was free to move and the only enemy was scale; here, the data is forbidden to move and scale is almost beside the point. A single hospital's data often fits comfortably on one machine, so the reason we distribute is not a memory or throughput ceiling but a legal wall that no faster chip can remove.
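The first of these tensions, privacy against predictive quality, can be made tangible with a few lines of NumPy. The sketch below is illustrative only: `privatize_update`, `clip_norm`, and `noise_multiplier` are hypothetical names, and the noise scale is not calibrated to any formal privacy budget. It clips a unit-norm update, adds Gaussian noise of increasing scale, and measures how far the noisy update lands from the true one.

```python
import numpy as np

def privatize_update(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm <= clip_norm, then add Gaussian noise.

    Illustrative only: the noise is NOT calibrated to a formal (epsilon, delta) budget.
    """
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(0)
true_update = rng.standard_normal(50)
true_update /= np.linalg.norm(true_update)  # unit-norm "signal" a site would send

def mean_error(noise_multiplier, trials=200):
    """Average distance between the noisy update and the true one."""
    errs = [np.linalg.norm(privatize_update(true_update, 1.0, noise_multiplier, rng)
                           - true_update)
            for _ in range(trials)]
    return float(np.mean(errs))

errors = {s: mean_error(s) for s in (0.0, 0.1, 0.5, 1.0)}
for s, e in errors.items():
    print(f"noise multiplier {s:>4}: mean distance from true update {e:.3f}")
```

Because the true update has unit norm, the printed distance is also a relative error. It grows with the noise multiplier, which is the blur the paragraph above describes: more protection for the individual, less signal for the model.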
Every earlier case in this book distributed because a resource ran out: memory for the data, memory for the model, throughput for the requests. Federated medical AI distributes for a categorically different reason. The data would fit; the law forbids collecting it. That single change rewrites the design space. The objective is no longer "compute the centralized answer faster" but "compute as close to the centralized answer as possible while the data stays put and the updates leak nothing." Recognizing that the binding constraint is a regulatory boundary rather than a hardware ceiling is the first thing this chapter teaches, because it is what makes federated learning the right axis and a central data lake simply illegal.
2. Why Centralizing Is Off the Table, and What Replaces It Intermediate
Suppose, against the rules, that we could pool the data. Let hospital $k$ hold $n_k$ patient records, and let the total across all $K$ sites be $N = \sum_{k=1}^{K} n_k$. The centralized objective we would like to minimize is the average loss over every patient in the union,
$$L(w) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \ell(w; x_{k,i}, y_{k,i}) = \sum_{k=1}^{K} \frac{n_k}{N}\, L_k(w), \qquad L_k(w) = \frac{1}{n_k} \sum_{i=1}^{n_k} \ell(w; x_{k,i}, y_{k,i}).$$
The right-hand identity is the entire mathematical foundation of federated training, and it is exact: the pooled average loss equals the sample-weighted sum of the per-site average losses $L_k$, with site $k$ weighted by its share $n_k / N$ of the patients. Crucially, computing $L_k$ and its gradient $\nabla L_k$ requires only hospital $k$'s own data, which it already holds locally. So the coordinator never needs the records; it needs only the per-site quantities $L_k$ or $\nabla L_k$, weighted by the counts $n_k$. This is the federated averaging (FedAvg) objective introduced in Chapter 14: minimize the weighted sum of local losses by repeatedly broadcasting the model, training locally, and aggregating the updates. The hard constraint that data cannot move is satisfied for free, because the only things that move are the model and its updates.
The decomposition is exact only when the aggregation respects the sample counts. A natural and wrong simplification is to let every hospital cast an equal vote, averaging the per-site updates uniformly rather than by $n_k / N$. Under the unequal site sizes that are the norm in medicine, a large academic center beside a tiny rural clinic, the unweighted average optimizes a different objective and drifts away from the centralized answer. The code below makes both claims concrete on a linear model whose loss has a closed form. It evaluates the pooled loss and gradient that centralization would produce, then the sample-weighted FedAvg aggregation, then the naive unweighted average, and reports how far each federated quantity sits from the centralized target.
import numpy as np
rng = np.random.default_rng(0)
d, K = 4, 5
w = rng.standard_normal(d) # point where we evaluate loss/gradient
# Unequal site sizes are the medical norm: a rural clinic next to an academic center.
n = np.array([80, 120, 4000, 60, 1740]) # patients per hospital n_k
N = n.sum()
# Each hospital holds its OWN (X_k, y_k); the data never leaves the site. Sites also
# differ in their label-generating shift, the statistical heterogeneity of Section 37.5.
sites = []
for k in range(K):
    Xk = rng.standard_normal((n[k], d))
    shift = rng.standard_normal(d) * 0.3
    yk = Xk @ (w + shift) + 0.1 * rng.standard_normal(n[k])
    sites.append((Xk, yk))
# Pooled objective: the centralized answer we are forbidden to compute for real.
Xall = np.vstack([Xk for Xk, _ in sites]); yall = np.concatenate([yk for _, yk in sites])
pooled_loss = np.mean((Xall @ w - yall) ** 2)
pooled_grad = (2.0 / N) * (Xall.T @ (Xall @ w - yall))
# Per-site quantities, each computed AT the hospital from local data only.
local_loss = np.array([np.mean((Xk @ w - yk) ** 2) for Xk, yk in sites])
local_grad = [(2.0 / n[k]) * (sites[k][0].T @ (sites[k][0] @ w - sites[k][1])) for k in range(K)]
weights = n / N # FedAvg weight_k = n_k / N
fedavg_loss = float(weights @ local_loss)
fedavg_grad = sum(weights[k] * local_grad[k] for k in range(K))
unweighted_loss = float(local_loss.mean()) # one-hospital-one-vote: the WRONG rule
unweighted_grad = sum(local_grad) / K
print("hospitals K :", K)
print("patients per site n_k :", n.tolist())
print("total patients N :", N)
print("aggregation weights n_k/N :", np.round(weights, 4).tolist())
print()
print(f"pooled (centralized) loss : {pooled_loss:.6f}")
print(f"FedAvg sample-weighted loss : {fedavg_loss:.6f} abs diff {abs(fedavg_loss-pooled_loss):.2e}")
print(f"unweighted-average loss : {unweighted_loss:.6f} abs diff {abs(unweighted_loss-pooled_loss):.2e}")
print()
print(f"FedAvg grad vs pooled, rel err : {np.linalg.norm(fedavg_grad-pooled_grad)/np.linalg.norm(pooled_grad):.2e}")
print(f"unweighted vs pooled, rel err : {np.linalg.norm(unweighted_grad-pooled_grad)/np.linalg.norm(pooled_grad):.2e}")
hospitals K : 5
patients per site n_k : [80, 120, 4000, 60, 1740]
total patients N : 6000
aggregation weights n_k/N : [0.0133, 0.02, 0.6667, 0.01, 0.29]
pooled (centralized) loss : 0.221104
FedAvg sample-weighted loss : 0.221104 abs diff 0.00e+00
unweighted-average loss : 0.288648 abs diff 6.75e-02
FedAvg grad vs pooled, rel err : 1.81e-15
unweighted vs pooled, rel err : 8.44e-01
The output settles the foundation. With the correct sample weights, the federated objective is identical to the centralized objective up to floating-point rounding, so the no-data-movement constraint costs nothing in the formulation itself: the coordinator can pursue exactly the centralized target using only quantities that each hospital computes locally. The unweighted average, by contrast, is off by a relative gradient error near $0.84$, a different objective entirely, because it silently overweights small sites and underweights the academic center that holds two thirds of the patients. The gap between these two numbers is the difference between a federated system that learns the right model and one that learns a confidently wrong one, and it is why Chapter 14 insists on sample-weighted aggregation. The complexity the rest of this chapter manages is not in this algebra; it is in protecting the updates that cross the boundary, in handling sites whose local objectives $L_k$ genuinely disagree, and in proving the whole thing to an auditor.
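The same identity can be followed through a full training run. The sketch below is a minimal illustration on synthetic data, not the chapter's production protocol; `local_step`, `fedavg_round`, the site sizes, and the learning rate are all assumptions chosen for the example. Each round broadcasts the model, lets every site take one full-batch gradient step on its own data, and averages the returned models by $n_k / N$. With a single local step per round, that average is exactly one step of centralized gradient descent, so the loop should land on the pooled least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
n = np.array([40, 400, 60])  # hypothetical, unequal site sizes
w_true = rng.standard_normal(d)

# Synthetic "private" data: in a real system each (X, y) would exist only at its site.
sites = []
for n_k in n:
    X = rng.standard_normal((n_k, d))
    y = X @ w_true + 0.05 * rng.standard_normal(n_k)
    sites.append((X, y))

def local_step(w, X, y, lr):
    """One full-batch gradient step on the site's mean squared error."""
    grad = (2.0 / len(y)) * (X.T @ (X @ w - y))
    return w - lr * grad

def fedavg_round(w, sites, lr):
    """Broadcast w, take one local step per site, average the results by n_k / N."""
    sizes = np.array([len(y) for _, y in sites])
    local_models = np.stack([local_step(w, X, y, lr) for X, y in sites])
    return (sizes / sizes.sum()) @ local_models

w = np.zeros(d)
for _ in range(300):
    w = fedavg_round(w, sites, lr=0.1)

# Reference solution, computed on pooled data ONLY to check the sketch.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
w_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
gap = float(np.linalg.norm(w - w_pooled))
print(f"distance from pooled least-squares solution: {gap:.2e}")
```

The equivalence holds here because each site takes only one step before aggregation. When sites run several local steps on heterogeneous data, the averaged model generally no longer tracks centralized gradient descent, which is part of what the heterogeneity discussion later in the chapter has to manage.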
This case study marks a distinct turn of the book's thesis. Earlier chapters distributed because the data, the model, or the request load outgrew one machine; the data itself was always free to move, and scale-out was about moving it efficiently. Here the scale-out moment is the inverse: the data is forbidden to move, so intelligence is distributed by sending the model to the data and returning only what the data taught it. The coordinator builds a global model that has never seen a single record, and it is no weaker for it, because, as Output 37.1.1 shows, the federated objective equals the pooled one. When you reach the capstone in Chapter 41, this is the example to invoke whenever the binding constraint is governance rather than gigabytes: the axis to reach for is not "distribute the data" but "distribute the learning while the data stays home."
3. Decomposing the Pipeline Onto the Six Axes Intermediate
The six axes of distribution from Section 1.1 (distribute data, distribute training, distribute the model, distribute inference, coordinate the cluster, and distribute intelligence) give us the map onto which the federated pipeline is placed. This case study loads two axes far more heavily than the others. It distributes training, because the gradient work is spread across hospitals that never share examples, the federated form of data-parallel learning under the unusual condition that the shards are sovereign and immovable. And it distributes intelligence, because each hospital is an autonomous party with its own governance, its own incentives, and its own data distribution, so the system is closer to a coordinated alliance of agents than to an obedient cluster of workers. The other axes appear, but in service of these two. Table 37.1.1 assigns each stage of the chapter to the axis it loads most heavily, names the earlier chapter that owns the underlying machinery, and points to the later section of this chapter that builds it. It doubles as the chapter's table of contents.
| Pipeline stage | Primary axis | Owning earlier chapter | Built in this chapter |
|---|---|---|---|
| Multi-hospital data, kept local | Distribute data | Ch 8 (storage), Ch 5 | Section 37.2 |
| Privacy constraints and threat model | Distribute intelligence | Ch 35 (privacy, DP) | Section 37.3 |
| Federated training setup | Distribute training | Ch 14 (FedAvg) | Section 37.4 |
| Statistical and systems heterogeneity | Distribute training | Ch 14, Ch 34 | Section 37.5 |
| Secure aggregation | Coordinate the cluster | Ch 35 (secure agg) | Section 37.6 |
| Monitoring and drift | Distribute intelligence | Ch 26, Ch 5 | Section 37.7 |
| Safety and governance | Distribute intelligence | Ch 35 | Section 37.8 |
| Capstone project | All axes | Ch 41 | Section 37.9 |
Reading the table top to bottom traces the chapter; reading the third column traces the path back through the book. The multi-hospital data stage is data distribution turned inside out: the shards are immovable, so the storage and loading machinery of Chapter 8 runs once per site rather than once over a pooled lake, with the evaluation discipline of Chapter 5 measuring each site separately. The federated setup and the heterogeneity stages together are the heart of distributed training in its federated form, owned by Chapter 14 with its edge-learning extension in Chapter 34; the privacy constraints that bound them sit on the distribute-intelligence axis and draw on the privacy material of Chapter 35. Secure aggregation, the cryptographic guarantee that the coordinator learns the sum of updates without learning any single one, is the cluster-coordination problem made adversarial, drawing on the differential-privacy and secure-aggregation material of Chapter 35. Monitoring, drift detection, safety, and governance are where the system stops being a training job and becomes a long-lived, multi-party institution, which is why they live on the distribute-intelligence axis. No stage is new machinery; the contribution of this chapter is assembling it under a constraint none of the source chapters faced alone.
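The secure-aggregation guarantee named above, that the coordinator learns the sum of updates without learning any single one, rests on an idea that can be shown in a few lines: pairwise masks that cancel in the sum. The sketch below is a toy illustration under strong assumptions, not a real protocol. `pair_masks` and `mask_update` are hypothetical names, a shared random generator stands in for the key agreement between sites, arithmetic is ordinary floating point rather than a finite field, and site dropout is ignored.

```python
import numpy as np

rng = np.random.default_rng(2)
K, d = 4, 3
updates = rng.standard_normal((K, d))  # row i: site i's private update

# Every pair of sites (i, j), i < j, agrees on one random mask. A real protocol would
# derive it from a key exchange; a shared RNG stands in for that step here.
pair_masks = {(i, j): 100.0 * rng.standard_normal(d)
              for i in range(K) for j in range(i + 1, K)}

def mask_update(i, update):
    """Site i adds the masks it shares with higher-indexed sites, subtracts the rest."""
    masked = update.copy()
    for (a, b), m in pair_masks.items():
        if a == i:
            masked += m
        elif b == i:
            masked -= m
    return masked

# The coordinator sees only the masked rows, each dominated by large random masks.
masked = np.stack([mask_update(i, updates[i]) for i in range(K)])

aggregate = masked.sum(axis=0)   # every mask appears once with + and once with -
true_sum = updates.sum(axis=0)
print("max |aggregate - true sum| :", float(np.max(np.abs(aggregate - true_sum))))
print("per-site distortion norms  :",
      np.round(np.linalg.norm(masked - updates, axis=1), 1).tolist())
```

Each masked row looks nothing like the update behind it, yet the sum is recovered up to rounding. To keep the sample weighting of Section 2, each site could scale its update by $n_k$ before masking and let the coordinator divide the aggregate by $N$.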
Who: A clinical machine learning team at a hospital network coordinating an early-sepsis predictor across five partner institutions.
Situation: Each hospital alone had too few labeled sepsis cases to train a reliable model, but pooling the five datasets would have given ample signal.
Problem: The legal teams of all five institutions refused to let patient records leave their respective firewalls, citing HIPAA and institutional data-residency policy, so the central data lake the engineers had planned was simply not permitted.
Dilemma: Abandon the collaboration and ship five weak single-site models, or train a joint model federated across the sites, which required building update aggregation, privacy accounting, and per-site evaluation that a centralized run would never need.
Decision: They went federated, broadcasting a shared model each round, training it locally inside each hospital, and returning only sample-weighted updates, the exact topology of Figure 37.1.1.
How: They sized the aggregation weights by each site's labeled case count, ran the sanity check of Code 37.1.1 to confirm the weighted objective matched what a pooled run would target, and had each institution's governance board sign off on the no-data-movement design before the first round.
Result: The federated model markedly outperformed every single-site model on held-out cases at each hospital, and because no record ever moved, the collaboration cleared every institution's review without a privacy exception.
Lesson: When the data cannot be centralized, the model can still be. Distributing the learning rather than the data turned five datasets that were legally unmergeable into one trained model that respected every boundary.
In Code 37.1.1 we performed the broadcast, the local computation, and the weighted aggregation by hand to expose the objective. A production federated system never hand-rolls the round loop, the per-site clients, or the network transport; a framework such as Flower supplies them, with sample-weighted FedAvg as the default strategy:
# pip install flwr
import flwr as fl

# model, local_loader, local_dataset, get_weights, set_weights, train_one_round, and
# local_eval are site-specific placeholders that are not defined in this sketch.
class HospitalClient(fl.client.NumPyClient):  # runs INSIDE each hospital
    def get_parameters(self, config):
        return get_weights(model)

    def fit(self, parameters, config):  # local training on private data
        set_weights(model, parameters)
        train_one_round(model, local_loader)  # data never leaves this process
        return get_weights(model), len(local_dataset), {}  # n_k rides along as the weight

    def evaluate(self, parameters, config):
        set_weights(model, parameters)
        # local_eval is expected to return (loss, num_examples, metrics)
        return local_eval(model, local_loader)

# Coordinator: FedAvg already weights each update by the returned sample count n_k.
fl.server.start_server(strategy=fl.server.strategy.FedAvg(),
                       config=fl.server.ServerConfig(num_rounds=50))
The returned len(local_dataset) is the $n_k$ that makes the aggregation match Output 37.1.1. Secure aggregation and differential privacy plug in as additional strategy wrappers, the subject of Section 37.6.
4. What This Chapter Builds, Section by Section Beginner
With the constraint fixed, the requirements stated, the foundational objective proven, and the axes mapped, the path through the chapter is set, and it follows the rows of Table 37.1.1. Section 37.2 examines the multi-hospital data itself: how records sit behind each institution's boundary, why their distributions differ, and how to evaluate per site when no global validation set can exist. Section 37.3 states the privacy constraints precisely, naming the threat model and the regulatory demands that the no-data-movement rule is only the start of. Section 37.4 builds the federated training setup, turning the FedAvg objective of Section 2 into a running round protocol, drawing directly on Chapter 14. Section 37.5 confronts heterogeneity, both the statistical kind where sites genuinely disagree and the systems kind where a small clinic's hardware lags an academic center's, the federated edge concern of Chapter 34. Section 37.6 adds secure aggregation and differential privacy so the updates that cross the boundary leak nothing recoverable, building on Chapter 35. Section 37.7 monitors the deployed model for drift across sites and over time. Section 37.8 addresses clinical safety and multi-institution governance, where a wrong prediction harms a patient and a broken agreement ends the collaboration. Section 37.9 turns the whole system into a capstone project. Each section opens with the slice of Figure 37.1.1 it owns and the requirement from Section 1 it must satisfy.
The frontier of this problem advances on every requirement at once. On privacy, the lineage of secure aggregation and differential privacy is tightening the guarantee against an honest-but-curious coordinator, and recent work shows that gradient-inversion attacks can reconstruct training images from unprotected updates, which is precisely why the bare FedAvg of Section 2 is never deployed in the clinic without the protections of Section 37.6. On heterogeneity, FedProx-style methods add a proximal term that keeps each site's local training from drifting too far from the global model, while methods in the personalized-federated-learning family let each hospital keep a locally adapted head while sharing a common backbone, trading a little of the pooled objective for far better per-site fit. On scale and realism, large multi-institution efforts such as the federated brain-tumor segmentation studies and consortium-scale clinical collaborations have demonstrated that a model trained without any data leaving its site can match one trained on the pooled cohort, the empirical confirmation of Output 37.1.1 on real images. On foundation models, the open question is federated fine-tuning of large pretrained clinical models, where communicating full gradients is prohibitive and parameter-efficient updates become the unit that crosses the boundary, connecting this case study forward to the agentic and large-model applications of Chapter 40. The constant across all of it is the four-requirement tension of Section 1: every advance is judged by whether it improves predictive quality, privacy, compliance, or generalization without surrendering the other three, and always under the unmovable constraint that the data stays home.
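FedProx, named above, is usually described as adding a proximal penalty $\frac{\mu}{2}\lVert w - w_{\text{global}} \rVert^2$ to each site's local objective, so that local training cannot wander arbitrarily far from the broadcast model. The sketch below illustrates that one idea on a synthetic least-squares site. `fedprox_local_train`, the step size, and the values of `mu` are assumptions for the example, and the site's optimum is deliberately placed far from the global model to mimic a hospital whose population differs from the rest.

```python
import numpy as np

def fedprox_local_train(w_global, X, y, mu, lr=0.05, steps=200):
    """Gradient descent on  mean((Xw - y)^2) + (mu / 2) * ||w - w_global||^2."""
    w = w_global.copy()
    for _ in range(steps):
        grad_loss = (2.0 / len(y)) * (X.T @ (X @ w - y))  # local data term
        grad_prox = mu * (w - w_global)                   # pull back toward global
        w -= lr * (grad_loss + grad_prox)
    return w

rng = np.random.default_rng(3)
d = 4
w_global = np.zeros(d)                 # model broadcast by the coordinator
w_site = rng.standard_normal(d) + 2.0  # this site's own optimum, far from global
X = rng.standard_normal((200, d))
y = X @ w_site + 0.05 * rng.standard_normal(200)

# How far does local training move from the broadcast model for each mu?
drift = {mu: float(np.linalg.norm(fedprox_local_train(w_global, X, y, mu) - w_global))
         for mu in (0.0, 1.0, 10.0)}
for mu, dist in drift.items():
    print(f"mu = {mu:>4}: local model moved {dist:.3f} from the global model")
```

With `mu = 0` the site fits its own data freely, which is plain local training; larger `mu` shrinks the distance travelled. Choosing `mu` is therefore one concrete knob on the trade between per-site fit and agreement with the shared model.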
There is a pleasing inversion at the center of this chapter. In ordinary distributed computing we move data to where the compute is, and the whole craft is making that movement cheap. Here the patient record is the one thing that may never travel, so we send the model on tour instead: it visits each hospital, learns a little from records it is allowed to read but not keep, and rides home to the coordinator carrying lessons but no secrets. Output 37.1.1 is the receipt that the trip lost nothing: the federated objective equals the pooled one up to floating-point rounding. The data stayed home; the intelligence went everywhere.
For each of the four requirements in Section 1 (predictive quality, privacy guarantee, regulatory compliance, cross-site generalization), name the single stage of Figure 37.1.1 that it pressures hardest, the axis of distribution from Table 37.1.1 that stage sits on, and one concrete failure that occurs if that requirement is ignored. Then identify which two of the four requirements are in the most direct tension with each other and explain why improving one cheaply makes the other harder. Finally, state in one sentence what is fundamentally different about why this system is distributed compared with the web-scale RAG system of Chapter 36.
Starting from Code 37.1.1, make the size imbalance more extreme: set the academic center to ninety thousand patients and each of the four clinics to fifty, then recompute the relative gradient error of the unweighted average against the pooled gradient. Next, simulate a realistic failure where the two smallest clinics drop out of a round (their updates do not arrive) and the coordinator aggregates only the sites that responded, still sample-weighted by the responding $n_k$. Report how far the resulting gradient sits from the full-participation pooled gradient, and explain what this implies about client availability and the fairness of a federated medical model toward small sites. Connect your reasoning to the client-selection and straggler handling of Section 37.5.
The constraint of Figure 37.1.1 lets only model updates cross the boundary, yet an update still carries information about the data that produced it. Argue, without writing the cryptography, why sending a per-site gradient $\nabla L_k(w)$ can leak more about hospital $k$'s patients than sending a single scalar loss $L_k(w)$, and why a hospital with very few patients in a round is at greater risk than the large academic center. Using the sample-weighted objective $L(w) = \sum_k (n_k/N) L_k(w)$ from Section 2, explain why differential privacy adds noise calibrated to the per-site contribution rather than a single global amount, and why that noise trades directly against the predictive-quality requirement. Sketch the shape of the privacy-versus-accuracy curve you would expect, and connect it to the secure-aggregation and differential-privacy machinery forthcoming in Section 37.6 and owned by Chapter 35.