Section 37.9: Project Extension | Building Scalable AI

"I trained on four hospitals I was never allowed to visit. They sent me their lessons, never their patients, and one constraint at a time I learned to be trusted with the wards."
A Simulated Federation, Rehearsing for Hospitals It Has Never Seen

Big Picture

This chapter taught federated medical AI as a sequence of constraints layered onto a learning problem; this final section hands it back to you as a buildable project that grows a centralized single-site model into a privacy-preserving federation across simulated hospitals, one constraint at a time, with a measurable milestone at every step. You will start where the law would never let a real deployment start, with all the data pooled on one machine, because that centralized model is the upper bound every later stage is measured against. Then you will dismantle that convenience: split a public medical-style dataset across $K$ virtual hospitals with induced non-IID skew, train with FedAvg so the data never moves (Section 37.4), add FedProx to tame the heterogeneity that skew creates (Section 37.5), add differential privacy so no patient leaks through an update (Section 37.3), wrap the updates in secure aggregation so the server never sees an individual site (Section 37.6), and finish with per-site monitoring (Section 37.7) and a fairness-and-calibration gate (Section 37.8). The project is federated medical AI in miniature: it distributes the training and the intelligence while the data stays pinned, by law and by design, exactly where it was collected.

A case study read passively teaches the shape of a system; a case study rebuilt teaches the price of each promise it makes. The eight sections before this one walked federated medical AI as a finished artifact, naming the mechanism behind each constraint and the chapters that own it. This section inverts that posture. It gives you a staged construction plan in which you begin with a centralized baseline that no clinical setting would permit but every honest evaluation requires, then add one real-world constraint per stage, measuring what the constraint cost you before moving on. Each stage draws on a specific earlier section, so the project doubles as a guided tour back through the chapter and the book: when you average local models you are applying the FedAvg of Section 37.4 and Chapter 14, when you add the proximal term you are applying Section 37.5, when you clip and add noise you are applying the differential privacy of Section 37.3 and Chapter 35, and when you mask the updates you are applying the secure aggregation of Section 37.6. The discipline that makes the project worth its time is the one Section 1.1 opened with: add a constraint only when the setting forces it, and prove with a number what it cost.

Figure 37.9.1: The staged build. The baseline at the top pools all data on one machine to fix the centralized upper bound; each numbered stage below adds one real-world constraint, drawing on the section that owns that mechanism, and attaches a measurable milestone (accuracy retained, disparity reduced, privacy budget met, server blindness, live drift alarms, a gate that blocks unsafe runs). Carried to the end, the six stages turn a forbidden centralized model into a deployable federation that never moves a patient record, the chapter made buildable.

1. The Centralized Baseline You Are Not Allowed to Ship Beginner

Every honest federated project begins, paradoxically, with the centralized model it exists to avoid. The reason is measurement, not defiance. The centralized model, trained on all sites' data pooled in one place, is the statistical upper bound: it is what you could achieve if privacy and data-residency law did not exist, and every federated number you later report is a gap below it. Without that reference you cannot say whether your federation lost half a point of AUROC to the constraints or ten. So your first artifact is a single process that pools a public medical-style dataset, trains one classifier, and records its global AUROC, its accuracy, and, critically, its accuracy on each cohort you will later split into separate hospitals. You build it knowing you will never deploy it, because in a real clinical setting the pooling step it starts with is precisely the thing the law forbids.

The code below is that baseline and its first scale-out move compressed to one dependency-light file. It builds a synthetic two-class dataset, induces non-IID skew by giving each virtual hospital a contiguous slice of the score-ordered data so that disease prevalence ranges from rare to common across sites, trains the pooled baseline, then trains the same model with FedAvg in which each site optimizes only its own records and the server averages the weights (Section 37.4). It reports the federated global AUROC against the pooled baseline and, because skew is the villain of this chapter, the per-site accuracy disparity that motivates every later stage.

import numpy as np

rng = np.random.default_rng(7)
N, d, K = 6000, 12, 4          # examples, features, virtual hospitals
w_star = rng.standard_normal(d)
X = rng.standard_normal((N, d))
logits = X @ w_star
p = 1.0 / (1.0 + np.exp(-logits))
y = (rng.random(N) < p).astype(float)

# Induce non-IID skew: sort by score, give each site a contiguous slice so the
# positive-class prevalence differs sharply from one virtual hospital to the next.
order = np.argsort(logits)
shards = np.array_split(order, K)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(Xs, ys, w0, epochs=200, lr=0.5):   # local logistic-regression training
    w = w0.copy()
    n = len(ys)
    for _ in range(epochs):
        g = Xs.T @ (sigmoid(Xs @ w) - ys) / n   # mean log-loss gradient
        w -= lr * g
    return w

def auroc(scores, labels):                 # rank-based AUROC, no sklearn
    o = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[o] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    pos = labels == 1
    npos = pos.sum()
    nneg = len(labels) - npos
    # Mann-Whitney U of the positives, normalized by the number of pos/neg pairs
    return (ranks[pos].sum() - npos * (npos + 1) / 2) / (npos * nneg)

def acc(scores, labels):
    return float(((scores > 0.5) == (labels == 1)).mean())

# Pooled (centralized) baseline: one machine sees all data. The upper bound.
w_pool = fit(X, y, np.zeros(d)); s_pool = sigmoid(X @ w_pool)

# FedAvg: each site trains locally on ITS shard, the server averages weights
# (size-weighted), broadcasts, repeats. Data never leaves the site.
w_glob = np.zeros(d)
for rnd in range(15):                       # 15 communication rounds
    locals_ = [fit(X[s], y[s], w_glob, epochs=20) for s in shards]
    sizes = np.array([len(s) for s in shards], float)
    w_glob = np.average(locals_, axis=0, weights=sizes / sizes.sum())
s_fed = sigmoid(X @ w_glob)
site_acc = [acc(sigmoid(X[s] @ w_glob), y[s]) for s in shards]

print("sites K                :", K)
print("site pos-prevalence    :", [f"{y[s].mean():.2f}" for s in shards])
print("pooled  AUROC / acc    :", f"{auroc(s_pool, y):.4f} / {acc(s_pool, y):.4f}")
print("FedAvg  AUROC / acc    :", f"{auroc(s_fed, y):.4f} / {acc(s_fed, y):.4f}")
print("AUROC gap (pool-fed)   :", f"{auroc(s_pool, y) - auroc(s_fed, y):.4f}")
print("per-site acc           :", [f"{a:.3f}" for a in site_acc])
print("per-site disparity     :", f"{max(site_acc) - min(site_acc):.4f}")

Code 37.9.1: The federated-in-miniature baseline and its first constraint. The pooled model fixes the centralized upper bound; FedAvg matches its global AUROC without the four sites ever exchanging a record, but the per-site disparity it leaves behind is the heterogeneity problem the next stage attacks.

sites K                : 4
site pos-prevalence    : ['0.08', '0.33', '0.63', '0.93']
pooled  AUROC / acc    : 0.8752 / 0.7865
FedAvg  AUROC / acc    : 0.8752 / 0.7858
AUROC gap (pool-fed)   : 0.0000
per-site acc           : ['0.916', '0.665', '0.633', '0.930']
per-site disparity     : 0.2973

Output 37.9.1: Under severe non-IID skew (site prevalence runs from $0.08$ to $0.93$), FedAvg matches the pooled AUROC of $0.8752$ to four decimals, so distributing the training cost nothing in aggregate ranking. Yet the global model's per-site accuracy ranges from $0.633$ to $0.930$, a disparity of $0.297$: the aggregate looks healthy while two sites are served far worse than the others. That gap is what Stages 2 onward must close.

Key Insight: The Aggregate Can Pass While a Site Fails

Output 37.9.1 is the whole chapter in two numbers. The global AUROC matches the centralized upper bound, so a careless report would call the federation a success. But the per-site accuracy spans thirty points, because a single weight vector cannot be equally right for a hospital where the disease is rare and one where it is common. In medicine the aggregate is not the deliverable; the worst-served site is. This is why a federated medical project is never finished at FedAvg, and why every later stage in this section is judged by a per-site or per-cohort metric, not just a global one. A model that helps the average patient while failing a whole hospital is not safe to ship.

2. Staging the Constraints, Milestone by Milestone Intermediate

With the baseline in hand, you add the real-world constraints one stage at a time, in the order of Figure 37.9.1, never advancing until the current stage hits its milestone. The discipline of one-constraint-at-a-time matters because each constraint trades accuracy for something else (fairness, privacy, confidentiality, safety), and isolating it lets you price that trade exactly: when AUROC drops or disparity moves, exactly one thing changed and you know which section's mechanism to credit or blame. Table 37.9.1 is the project plan. Each row names the stage, the real-world pressure that forces it, the section that supplies the mechanism, and the measurable milestone that tells you the stage is done.

Table 37.9.1: The staged build plan for federated medical AI. Add constraints top to bottom; do not advance until the milestone is met. Each stage layers one real-world requirement onto the baseline and draws its mechanism from the named section.

Stage	Pressure that forces it	Mechanism (section)	Milestone to hit
1. Simulate federation + FedAvg	Data may not leave the hospital	FedAvg (Section 37.4, Ch 14)	Global AUROC within 0.01 of pooled
2. Add FedProx	Sites are non-IID, local drift hurts	Proximal term (Section 37.5)	Per-site disparity cut by a third
3. Add differential privacy	Updates can leak a patient	Clip + noise (Section 37.3, Ch 35)	$\varepsilon \le 8$ at the chosen $\delta$
4. Add secure aggregation	Server must not see one site's update	Masked sums (Section 37.6, Ch 14)	Server reads only the sum, not any term
5. Per-site monitoring	Sites drift apart after deployment	Per-site metrics (Section 37.7)	Drift alarm fires on an injected shift
6. Fairness + calibration gate	A model may be unsafe to release	Release gate (Section 37.8, Ch 5)	Gate blocks a run that violates either

The milestones are quantitative on purpose. "Added privacy" is not a milestone; "the model trains to within a stated accuracy under a measured privacy budget of $\varepsilon \le 8$" is. Stages 1 and 2 are accuracy-and-fairness stages, judged against the centralized upper bound and the per-site disparity of Output 37.9.1. Stage 3 is the privacy stage, where clipping each site's update to a norm bound and adding calibrated Gaussian noise (Section 37.3) spends a quantifiable privacy budget that the milestone caps. Stage 4 is the confidentiality stage, where secure aggregation (Section 37.6) lets the server learn the summed update while learning nothing about any single site's term. Stage 5 is the operations stage, where per-site monitoring (Section 37.7) must catch a deliberately injected distribution shift at one hospital. Stage 6 closes the loop with the release gate of Section 37.8: a model that fails a fairness floor or a calibration check is blocked before it ever reaches a ward, no matter how strong its aggregate AUROC.
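The Stage 3 and Stage 4 mechanics described above can be sketched for a single round. This is a minimal illustration under stated assumptions, not the method of Sections 37.3 or 37.6: the names `clip_update` and `pairwise_masks` and the constants `CLIP` and `NOISE_MULT` are hypothetical, the local updates are random stand-ins, and the masks are plain floating-point vectors, whereas a real secure-aggregation protocol would typically use key agreement and modular arithmetic. Clipping whole-site updates bounds each site's contribution; a per-patient guarantee and the resulting $\varepsilon$ need per-record treatment and a privacy accountant, neither of which is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 12          # sites and weights, matching Code 37.9.1
CLIP = 1.0            # hypothetical L2 bound on one site's update
NOISE_MULT = 0.5      # hypothetical noise multiplier, not tuned

def clip_update(delta, clip=CLIP):
    """Rescale an update so its L2 norm is at most `clip`."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, clip / max(norm, 1e-12))

def pairwise_masks(k, dim, rng, scale=100.0):
    """For each pair (i, j), site i adds a shared random mask and site j
    subtracts it, so all masks cancel once the k uploads are summed."""
    masks = np.zeros((k, dim))
    for i in range(k):
        for j in range(i + 1, k):
            m = scale * rng.standard_normal(dim)
            masks[i] += m
            masks[j] -= m
    return masks

# Stand-in local updates (w_local - w_glob); in the project they come from fit().
updates = 3.0 * rng.standard_normal((K, d))

clipped = np.array([clip_update(u) for u in updates])   # Stage 3: bound each site
uploads = clipped + pairwise_masks(K, d, rng)            # Stage 4: mask before upload
server_sum = uploads.sum(axis=0)                         # masks cancel in the sum
noise = rng.normal(0.0, NOISE_MULT * CLIP, size=d)       # Stage 3: Gaussian noise
w_step = (server_sum + noise) / K                        # noisy mean update

print("max clipped norm           :", f"{np.linalg.norm(clipped, axis=1).max():.3f}")
print("sum survives masking       :", np.allclose(server_sum, clipped.sum(axis=0)))
print("upload 0 equals its update :", np.allclose(uploads[0], clipped[0]))
```

The two checks at the end correspond to the Stage 4 milestone: the server recovers the sum, while an individual upload no longer resembles the site's update. The plain division by `K` matches the size-weighted average of Code 37.9.1 only because the four shards there are equal in size.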

Practical Example: The Federation That Passed in Aggregate and Failed a Ward

Who: A research team building a sepsis early-warning model across four partner hospitals under a data-sharing agreement that forbade moving records.

Situation: Their first FedAvg run matched the (internally computed) centralized AUROC almost exactly, and the team prepared to publish.

Problem: A clinician reviewer asked for the per-site numbers, and at the smallest hospital, where sepsis was rarest, sensitivity was far below the others: the same strong-aggregate, weak-site pattern that Output 37.9.1 shows.

Dilemma: Ship the strong aggregate and footnote the weak site, fast but indefensible in a clinical setting, or stage the remaining constraints and pay the schedule cost of FedProx, privacy accounting, and a fairness gate.

Decision: They staged it. FedProx pulled the worst site's metric up toward the others, differential privacy was added with a budget the ethics board signed off on, and a fairness gate was made a hard release condition.

How: Each stage was measured against the baseline from the first run; the team advanced only after the per-site disparity, the privacy $\varepsilon$, and the gate each met its target on held-out site data.

Result: The released model gave up a fraction of aggregate AUROC for a far smaller per-site gap and a documented privacy guarantee, and the fairness gate later blocked a retraining run that would have regressed the smallest site.

Lesson: In federated medicine the aggregate is the easy part. Stage the constraints, measure each per site, and let a gate, not a deadline, decide what ships.
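A gate of the kind the team made a hard release condition can be as small as one function. The sketch below is an assumption-laden illustration rather than the gate of Section 37.8: `release_gate`, `expected_calibration_error`, and every threshold (`floor`, `max_disparity`, `max_ece`) are hypothetical, the per-site accuracies are copied from Output 37.9.1, and the calibration value passed in is a made-up input.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned gap between mean predicted probability and observed positive rate."""
    probs = np.asarray(probs, float)
    labels = np.asarray(labels, float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap       # weight by the bin's share of examples
    return float(ece)

def release_gate(site_metric, ece, floor=0.70, max_disparity=0.20, max_ece=0.05):
    """Return (ok, reasons). All thresholds are hypothetical placeholders."""
    reasons = []
    worst, best = min(site_metric), max(site_metric)
    if worst < floor:
        reasons.append(f"worst site {worst:.3f} is below the floor {floor:.2f}")
    if best - worst > max_disparity:
        reasons.append(f"disparity {best - worst:.3f} exceeds {max_disparity:.2f}")
    if ece > max_ece:
        reasons.append(f"calibration error {ece:.3f} exceeds {max_ece:.2f}")
    return (not reasons), reasons

# Per-site accuracies from Output 37.9.1; ece=0.03 is an illustrative input.
ok, reasons = release_gate([0.916, 0.665, 0.633, 0.930], ece=0.03)
print("release allowed:", ok)
for r in reasons:
    print(" -", r)
```

With these placeholder thresholds the FedAvg model of Output 37.9.1 is blocked on two counts, the worst-site floor and the disparity cap, even though its aggregate AUROC matches the pooled baseline.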

3. The Numbers Your Project Must Hit Intermediate

A federated medical project lives or dies by whether its milestones are targets you compute rather than feelings, so each stage gets a number defined in advance. The accuracy milestone is a gap below the centralized upper bound: with pooled ranking quality $\text{AUROC}_{\text{pool}}$ and federated quality $\text{AUROC}_{\text{fed}}$, you require

$$\text{AUROC}_{\text{pool}} - \text{AUROC}_{\text{fed}} \le 0.01,$$

which Output 37.9.1 already meets at a gap of $0.000$ for FedAvg. The fairness milestone is a cap on the spread of a per-site metric. Writing $a_k$ for the global model's accuracy (or sensitivity) at site $k$, the disparity is

$$\Delta = \max_k a_k - \min_k a_k,$$

and the Stage-2 target is to cut the baseline $\Delta = 0.297$ by at least a third, which is what the FedProx proximal term of Section 37.5 buys by keeping each site's local solution close to the shared model. The privacy milestone is an explicit budget: a run satisfies $(\varepsilon, \delta)$-differential privacy when, for adjacent datasets differing in one patient and any output set $S$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,$$

and your milestone fixes a $\delta$ (commonly no larger than $1/N$, where $N$ is the number of patient records) and caps $\varepsilon \le 8$, the loose-but-defensible range Section 37.3 and Chapter 35 discuss. Finally, the cost you trade all of this against is communication, and for synchronous FedAvg it is essentially

$$C \;=\; R \times K \times 2|\theta|,$$

that is, the number of rounds $R$ times the number of sites $K$ times twice the model size $|\theta|$ (download the global model, upload the local one). This single product is why the chapter cares so much about rounds: differential privacy and FedProx both tend to need more rounds $R$ to reach the same accuracy, so each privacy or fairness gain shows up directly as more bytes on the wire. Compute these four targets before you build, so each milestone is a prediction you test rather than a result you rationalize.
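As a worked instance of these targets, the short sketch below plugs in the numbers already reported in Output 37.9.1 and the settings of Code 37.9.1. It covers the accuracy, fairness, and communication targets only; the privacy $\varepsilon$ has to come from an accountant and is not computed here. The helper `comm_bytes` and the constant `BYTES_PER_PARAM` are illustrative names, and the 8-byte figure assumes the float64 NumPy weights are sent uncompressed.

```python
# Numbers copied from Output 37.9.1 and the settings of Code 37.9.1.
auroc_pool, auroc_fed = 0.8752, 0.8752
site_acc = [0.916, 0.665, 0.633, 0.930]
R, K, n_params = 15, 4, 12
BYTES_PER_PARAM = 8   # assumption: float64 weights, no compression

def comm_bytes(rounds, sites, params, bytes_per_param):
    """C = R * K * 2|theta|, converted from parameters to bytes."""
    return rounds * sites * 2 * params * bytes_per_param

gap = auroc_pool - auroc_fed                 # accuracy milestone: <= 0.01
delta = max(site_acc) - min(site_acc)        # fairness metric: per-site disparity
stage2_target = delta * 2 / 3                # "cut by a third" leaves two thirds
cost = comm_bytes(R, K, n_params, BYTES_PER_PARAM)

print(f"AUROC gap      : {gap:.4f} (meets <= 0.01: {gap <= 0.01})")
print(f"disparity      : {delta:.3f}")
print(f"Stage-2 target : {stage2_target:.3f}")
print(f"communication  : {cost} bytes")
```

With these inputs the disparity is 0.297, the Stage-2 target is 0.198, and the fifteen-round FedAvg run moves 11,520 bytes, a figure small enough to show how tiny the miniature is compared with the models in Exercise 37.9.3.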

Library Shortcut: Flower Runs the Whole Simulated Federation

The hand-rolled round loop in Code 37.9.1 is for understanding; in the real project the orchestration (broadcast, local-fit, weighted-average, repeat, with $K$ virtual clients on one machine) is exactly what a federated-learning framework gives you. Flower's simulation engine spins up $K$ in-process clients, runs FedAvg or FedProx as a named strategy, and lets you swap in differential privacy or secure aggregation without rewriting the loop, which is how Stages 1 through 4 become configuration rather than code:

# pip install "flwr[simulation]"
import flwr as fl

# A client wraps ONE virtual hospital; fit()/evaluate() touch only its shard.
class Hospital(fl.client.NumPyClient):
    def __init__(self, cid):            # cid selects this site's shard
        self.cid = cid
    def get_parameters(self, config): ...
    def fit(self, params, config):      # local epochs on THIS site's data only
        ...
        return new_params, num_examples, {}
    def evaluate(self, params, config):
        ...
        return loss, num_examples, {"auroc": site_auroc}

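# FedProx passes proximal_mu to each client in the fit config; the client's
# local loss in fit() still has to add the proximal term itself.
strategy = fl.server.strategy.FedProx(proximal_mu=0.1)   # swap FedAvg -> FedProx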
fl.simulation.start_simulation(                          # K clients, one machine
    client_fn=lambda cid: Hospital(cid),
    num_clients=4, config=fl.server.ServerConfig(num_rounds=15), strategy=strategy)

Code 37.9.2: The same staged federation as Code 37.9.1, now as a Flower simulation. The round loop, the size-weighted average, and the FedAvg-to-FedProx switch collapse to a strategy object and a start_simulation call; differential-privacy and secure-aggregation wrappers plug into the same strategy, so Stages 1 through 4 of Table 37.9.1 become parameters rather than a rewrite.

4. Extension Challenges Worth the Federation Advanced

Once the six stages hit their milestones you have a working privacy-preserving federation, and the project becomes a platform for the harder questions the chapter only gestured at. Each extension below adds one capability that a real cross-hospital deployment needs, and each reaches into a different part of the book, so finishing them turns the capstone from a pipeline into a system. Add SCAFFOLD, the control-variate method that corrects the client-drift FedProx only dampens, and measure how many fewer communication rounds $R$ it needs to reach the same disparity target, which shows up directly in the communication cost $C = R \times K \times 2|\theta|$. Personalize per site by splitting the model into a shared backbone and a site-local head, then quantify how much of the $0.297$ disparity in Output 37.9.1 a per-site head removes that a single global model cannot, the natural answer to the aggregate-passes-while-a-site-fails problem of Stage 1.

Two further extensions test the federation against an adversary rather than mere heterogeneity, and they connect the project to the reliability material of Part VII. Introduce a Byzantine site that sends corrupted or poisoned updates, then defend the aggregation with a robust rule (coordinate-wise median, trimmed mean, or Krum) in place of the plain weighted average, and measure how much corruption the defended federation tolerates before its AUROC collapses, the robust-aggregation idea from Chapter 35. Then compose the defenses: run robust aggregation under secure aggregation and differential privacy at once, and report the combined cost, because in a real federation these constraints are not optional alternatives but simultaneous requirements. Each extension is a small, bounded change to a working system, which is exactly the posture in which distributed-systems concepts are learned best: against a baseline you can measure, in a federation you already understand.
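The robust rules named above are short enough to sketch directly. The example below is a minimal illustration, not the treatment of Chapter 35: the helper names are hypothetical, the updates are synthetic, Krum is omitted, and both rules ignore site sizes, so they replace the size-weighted average of Code 37.9.1 rather than extend it.

```python
import numpy as np

def coordinate_median(updates):
    """Per-coordinate median across sites (site sizes are ignored)."""
    return np.median(updates, axis=0)

def trimmed_mean(updates, trim=1):
    """Per coordinate, drop the `trim` smallest and largest values, then average."""
    if 2 * trim >= len(updates):
        raise ValueError("trim would remove every site")
    ordered = np.sort(updates, axis=0)
    return ordered[trim:len(updates) - trim].mean(axis=0)

rng = np.random.default_rng(0)
honest = rng.normal(1.0, 0.1, size=(3, 5))   # three honest sites, updates near 1.0
poisoned = np.full((1, 5), 1000.0)           # one Byzantine site sends huge values
updates = np.vstack([honest, poisoned])

plain = updates.mean(axis=0)                 # the undefended average
median = coordinate_median(updates)
trimmed = trimmed_mean(updates, trim=1)

print("plain mean   :", np.round(plain, 2))
print("median       :", np.round(median, 2))
print("trimmed mean :", np.round(trimmed, 2))
```

A single poisoned site drags every coordinate of the plain mean far from the honest value of about 1.0, while both robust rules stay close to it. Measuring how many such sites the federation tolerates before AUROC collapses is the experiment the extension asks for.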

Research Frontier: Where Federated Medical AI Is Heading (2024 to 2026)

The extensions above track live research lines, so a capstone that implements them is working at the current edge. Personalized federated learning has moved past a single global model toward methods that learn a shared representation with site-specific heads (the FedRep and Ditto lineage) and toward clustered federation that groups hospitals with similar distributions, directly targeting the disparity of Output 37.9.1. On privacy, the field is tightening the accounting: Rényi and PRV accountants give far less pessimistic $\varepsilon$ for the same noise, and DP-FTRL-style methods reach useful accuracy at single-digit $\varepsilon$ on realistic clinical tasks, narrowing the privacy-utility gap that Stage 3 pays. Robust-and-private aggregation, long thought to conflict because robustness wants to inspect updates while secure aggregation hides them, is being reconciled by methods that compute robust statistics under masking. And foundation models have reached federated medicine: parameter-efficient federated fine-tuning (LoRA adapters averaged across sites) lets hospitals jointly adapt a shared imaging or clinical-text model while exchanging only the small adapters, which is, as of 2026, one of the most active frontiers in the field and the bridge from this case study to the foundation-model training of Chapter 19.

5. Chapter Summary and What You Built Beginner

This section closes Chapter 37, so it is worth stating the through-line the whole chapter built. We began with the problem definition (Section 37.1): train a clinically useful model across hospitals that may not move their patient data, under privacy, heterogeneity, and safety constraints that a centralized project never faces. From there the chapter walked the federation constraint by constraint, and every constraint was the same move applied to a different requirement: distribute the training while pinning the data, then pay a measured price to keep the result private, fair, confidential, and safe. The multi-hospital data setting (Section 37.2) established the non-IID reality that makes a single global model insufficient, and the privacy constraints (Section 37.3) established why raw updates cannot be shared in the clear. The federated learning setup (Section 37.4) applied the FedAvg of Chapter 14 to train without moving data, and the data-heterogeneity treatment (Section 37.5) added FedProx to keep skewed sites from pulling the shared model apart. Secure aggregation (Section 37.6) hid each site's update from the server, monitoring across sites (Section 37.7) watched for the drift that federation makes invisible by default, and safety and responsibility (Section 37.8) gated the release on fairness and calibration, not aggregate accuracy alone. The chapter is, end to end, one distributed system that distributes the training and the intelligence while the data stays exactly where it was collected.

Thesis Thread: Distribute the Learning, Pin the Data

The book's spine is that AI at scale is the engineering of systems whose data, computation, models, inference, and decisions are distributed across many machines, and that each distribution is forced by a constraint, not chosen for elegance. Federated medical AI is the purest expression of that thesis in the book, because here the constraint is absolute: the data physically may not move, so only the learning can be distributed. The privacy arc the book has tracked since secure aggregation in Chapter 14 and differential privacy in Chapter 35 reaches its clinical test here, where the cost of a leaked record is a patient, not a metric. The staged project in this section is the thesis made buildable: you do not read about distributing training under privacy and heterogeneity, you implement it, one constraint at a time, and watch a forbidden centralized model become a federation a hospital could actually trust. That is the bridge into the remaining case studies, each of which pins a different resource in place and distributes the rest.

Key Takeaway: Chapter 37 as a Buildable System

Federated medical AI is not a list of privacy tricks; it is one distributed system in which every stage adds a measured constraint to a learning problem whose data may never move. (1) Pool the data once, in simulation only, to fix the centralized upper bound you are forbidden to ship but must measure against. (2) Simulate a federation with non-IID skew and run FedAvg so training distributes while data stays put. (3) Add FedProx so heterogeneous sites stop pulling the shared model apart and the per-site disparity shrinks. (4) Add differential privacy so no patient leaks through an update, at a budget you cap. (5) Add secure aggregation so the server learns the sum and nothing about any one site. (6) Add per-site monitoring and a fairness-and-calibration gate so the worst-served site, not the average, decides what ships. Built in this order against milestones, the project distributes the intelligence while pinning the data, which is federated medical AI in miniature.

Project Ideas: Build the Federation, Then Harden It

Each idea is sized so that carrying it through the staged plan of Table 37.9.1 becomes a capstone in the sense of Chapter 41. Core build: start from the Code 37.9.1 baseline, split a public medical-style dataset (for example a tabular clinical-risk set) across $K$ virtual hospitals with induced non-IID skew, and add the six constraints in order, recording the global AUROC gap, the per-site disparity, the privacy $\varepsilon$, and the communication cost $C = R \times K \times 2|\theta|$ at each stage; the deliverable is a writeup in which every number is measured against the centralized baseline. SCAFFOLD: replace FedProx with the control-variate method and report how many fewer rounds it needs to reach the Stage-2 disparity target. Personalization: split the model into a shared backbone and per-site heads and quantify the disparity it removes that a single global model cannot. Byzantine defense: add a malicious site that poisons its updates and defend with robust aggregation (median, trimmed mean, or Krum), measuring the corruption fraction the federation tolerates before AUROC collapses (Chapter 35). Compose the constraints: run robust aggregation under secure aggregation and differential privacy at once, and report the combined accuracy, privacy, and communication cost, the configuration a real cross-hospital federation actually faces.

Exercise 37.9.1: Name the Constraint, Stage by Stage Conceptual

For each of the six stages in Table 37.9.1, state the single real-world pressure that forces it and the per-site or aggregate metric its milestone is judged by. Then explain why the centralized baseline that precedes Stage 1, which you are forbidden to deploy, is nonetheless the first thing you must build, and what specific later judgments become impossible without it. Finally, identify the one stage whose milestone is about confidentiality rather than accuracy or privacy budget, and explain the difference between hiding an individual site's update from the server (secure aggregation) and bounding what any patient contributes to the released model (differential privacy).

Exercise 37.9.2: Extend the Miniature Federation Coding

Starting from Code 37.9.1, (a) add the FedProx proximal term to the local fit step (add $\frac{\mu}{2}\lVert w - w_{\text{glob}}\rVert^2$ to the local objective, so the gradient gains a $\mu (w - w_{\text{glob}})$ term) and report how the per-site disparity changes as you sweep $\mu \in \{0, 0.01, 0.1, 1\}$. (b) Add a simple differential-privacy step on each site's update: clip the local weight change to an $L_2$ norm bound, add Gaussian noise, and measure the AUROC drop as you increase the noise. (c) Increase $K$ from $4$ to $8$ virtual hospitals while holding $N$ fixed, and report what happens to both the global AUROC and the per-site disparity, relating the result to the Stage-1 and Stage-2 milestones.

Exercise 37.9.3: Price the Constraints in Rounds and Bytes Analysis

A federation has $K = 20$ hospitals and a model of $|\theta| = 5 \times 10^6$ float32 parameters. (a) Using $C = R \times K \times 2|\theta|$, compute the total bytes communicated for a plain FedAvg run that converges in $R = 30$ rounds. (b) Suppose adding differential privacy raises the round count needed for the same accuracy to $R = 50$, and adding FedProx on top raises it to $R = 60$; compute the communication cost of each and express the privacy-and-fairness surcharge as a percentage over the FedAvg baseline. (c) A SCAFFOLD variant reaches the target in $R = 35$ rounds but doubles the per-round payload (it also sends control variates). Compute its cost, compare it to the $R = 60$ FedProx run, and state the condition on round reduction under which sending the extra control variates is still a net win.