Section 37.4: Federated Learning Setup

"A global model, visiting each hospital in turn, learning what it could without ever taking a single chart home."
A Coordinator That Never Saw a Patient

Big Picture

The training system for this case study moves model parameters between institutions instead of moving patient data, so the network of hospitals trains one shared model while every record stays inside the firewall that legally owns it. The privacy and governance constraints established earlier in this chapter rule out the obvious move of pooling all the data on one cluster; what remains is federated learning, the cross-silo variant developed in Chapter 14. This section configures that machinery concretely for clinical sites: it fixes the round structure, chooses a server optimizer among FedAvg, FedProx, and FedAdam, sets the local-epoch budget against the communication cost, picks the model architecture, and lays out the coordinator-and-client software that runs behind each hospital's firewall. We do not re-derive federated averaging here; we treat it as a known primitive and engineer it for the cross-silo medical regime, where a handful of trusted institutions, each with a large local dataset and a reliable network, all participate in every round.

The previous sections of this chapter established what the system must learn and the constraints under which it may learn it: a shared clinical risk model, trained across several hospitals, with the absolute requirement that raw records never leave the institution that holds them. That constraint is what forces distribution here. It is not a throughput ceiling or a model-size ceiling of the kind that drove data parallelism in Chapter 1; it is a governance ceiling. The data physically cannot be co-located, so the computation must come to the data. Federated learning is the discipline that makes this work, and the cross-silo setting (a small number of organizational data silos rather than millions of phones) is the variant that fits a hospital consortium. We now turn the abstract FedAvg of Chapter 14 into a deployable configuration.

1. The Federated Round, Concretely (Beginner)

Federated training proceeds in synchronized rounds, each one a complete cycle of broadcast, local work, upload, and aggregation. The coordinator holds the authoritative global model $w^{(t)}$. At the start of round $t$ it broadcasts that model to every participating hospital. Each hospital $k$ then trains locally on its own records for $E$ epochs, producing an updated local model $w_k^{(t+1)}$ that it never shares as data, only as parameters. The hospitals upload those parameter vectors (or, equivalently, the deltas $w_k^{(t+1)} - w^{(t)}$) to the coordinator, which combines them into the next global model $w^{(t+1)}$ and begins the following round. Figure 37.4.1 traces one such round across the consortium.

Figure 37.4.1: One cross-silo federated round. The coordinator broadcasts the global model $w^{(t)}$ (solid arrows down) to all hospitals, which participate together. Each hospital trains for $E$ local epochs on records that never leave its firewall, then uploads only its parameter vector (dashed arrows up). The coordinator combines the uploads with a size-weighted average, the reduce operation of Chapter 4, to form $w^{(t+1)}$. Parameters move; data does not.

The aggregation rule is federated averaging, FedAvg, exactly as introduced in Chapter 14. With $K$ hospitals, where hospital $k$ holds $n_k$ records and $n = \sum_k n_k$ is the consortium total, the new global model is the size-weighted mean of the local models,

$$w^{(t+1)} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{(t+1)}, \qquad w_k^{(t+1)} = w^{(t)} - \eta \sum_{e=1}^{E} \nabla F_k\big(w_k^{(t,e)}\big).$$

The size weighting matters because hospitals differ enormously in patient volume; a large academic center should count for more than a small rural clinic, in proportion to the evidence each contributes. This weighted combine is a reduce in the exact sense of Chapter 4: each worker holds a vector, and the system folds those vectors into one with an associative, commutative operation. The inner update, $E$ epochs of gradient steps on the local objective $F_k$, is the local SGD of Chapter 10, run inside each hospital between communications.
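As a minimal sketch of that weighted reduce, the snippet below computes the size-weighted mean directly and then again by folding per-site (weighted sum, record count) pairs, which is the associative, commutative form a reduce collective needs. The three sites, their sizes, and the helper names `fedavg` and `combine` are illustrative assumptions, not part of the case study.

```python
import functools
import numpy as np

def fedavg(local_models, sizes):
    """Size-weighted mean of local parameter vectors: sum_k (n_k / n) * w_k."""
    sizes = np.asarray(sizes, dtype=np.float64)
    stacked = np.stack(local_models)            # shape (K, d)
    return (sizes / sizes.sum()) @ stacked      # shape (d,)

def combine(a, b):
    """Associative, commutative fold over (weighted_sum, record_count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical consortium: one large center and two small clinics.
local_models = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [8000, 1000, 1000]

w_direct = fedavg(local_models, sizes)

# The same result as a reduce: each site contributes (n_k * w_k, n_k).
partials = [(n * w, n) for w, n in zip(local_models, sizes)]
weighted_sum, n_total = functools.reduce(combine, partials)
w_folded = weighted_sum / n_total

print(w_direct)   # approximately [0.9, 0.2]: the large site dominates
print(w_folded)   # the same vector, reached by pairwise folding
```

Because the fold carries the record count alongside the weighted sum, partial results can be combined in any order or grouping before the final division, which is what lets the aggregation run as a tree or ring reduce rather than only at a single coordinator.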

Key Insight: The Round Is a Reduce With Local Computation Stuffed Inside It

A federated round is structurally a broadcast followed by a reduce, the same two collectives that drive data-parallel training. The one difference that changes everything is what happens between them: instead of one gradient step, each worker takes $E$ full epochs of local optimization before the reduce. That single design choice converts a network-bound synchronization into a computation-bound one, which is precisely why federated learning tolerates the slow, firewalled links between hospitals that would cripple ordinary distributed SGD.

2. What "Cross-Silo" Changes (Beginner)

The federated literature is dominated by the cross-device setting: millions of phones, each with a few records, most of them offline at any moment, joined by flaky cellular links. A hospital consortium is the opposite regime on almost every axis, and recognizing this lets us simplify the design considerably. There are few participants, perhaps five to fifty institutions, and they are known and trusted parties bound by a data-use agreement rather than anonymous clients. Every site participates in every round, so there is no client sampling and no need to reason about which subset showed up. Connectivity is reliable, since these are data-center-grade links between known endpoints, not consumer radios. Each site holds a large local dataset, thousands to millions of records, which makes many local epochs both feasible and statistically meaningful. Because each round already does substantial work and the links are good, the system converges in tens of rounds rather than the thousands a cross-device deployment may need.

These properties compound favorably. Full participation removes the variance that client sampling injects into cross-device FedAvg. Reliable links mean a straggler is a rare degraded node, not the norm, so the straggler mitigation of Chapter 18 is a safety net rather than a load-bearing mechanism. Large local datasets mean the dominant statistical problem is not data scarcity per client but data heterogeneity across clients, because each hospital's patient population, equipment, and coding practices differ. That heterogeneity is the central technical risk of this case study, and Section 37.5 is devoted to it; here it shapes one decision, the choice of server optimizer, which we take up next.

3. Choosing the Server Optimizer (Intermediate)

FedAvg is the baseline, and on identically distributed data it is hard to beat. The complication in a hospital consortium is that the data is not identically distributed: site $k$ minimizes its own local objective $F_k$, and when those objectives disagree, many local epochs pull each $w_k$ toward a different local minimum. The weighted average of models that have drifted apart can land in a poor place, an effect called client drift that slows or destabilizes convergence as heterogeneity grows. Two well-established remedies address this, and they intervene at different ends of the round.

FedProx changes the local objective. Each hospital adds a proximal term that penalizes drifting too far from the broadcast global model $w^{(t)}$, replacing $F_k(w)$ with

$$\min_{w}\; F_k(w) + \frac{\mu}{2}\, \big\lVert w - w^{(t)} \big\rVert^2.$$

The coefficient $\mu$ tunes a leash: $\mu = 0$ recovers FedAvg exactly, while larger $\mu$ keeps local models close to the global one, trading a little local progress per round for stability under skew. It is a one-line change on the client and requires no change on the server, which is why it is the natural first defense against heterogeneity. We adopt it conditionally and revisit the tuning of $\mu$ in Section 37.5.
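To make the leash concrete, here is a small sketch under a deliberately simple assumption: the local objective is the quadratic $F_k(w) = \tfrac{1}{2}\lVert w - c_k \rVert^2$, whose proximal minimizer is known in closed form as $(c_k + \mu\, w^{(t)})/(1+\mu)$. The function name `fedprox_local`, the target `c_k`, and the step size are hypothetical choices for illustration only.

```python
import numpy as np

def fedprox_local(w_global, grad_fn, mu, lr=0.05, steps=400):
    """Gradient descent on F_k(w) + (mu / 2) * ||w - w_global||^2."""
    w = w_global.copy()
    for _ in range(steps):
        # The proximal term contributes mu * (w - w_global) to the gradient.
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

# Assumed local objective F_k(w) = 0.5 * ||w - c_k||^2, minimized at c_k.
c_k = np.array([2.0, -1.0])

def grad_fn(w):
    return w - c_k

w_global = np.zeros(2)
mus = [0.0, 0.1, 1.0, 10.0]
results = {mu: fedprox_local(w_global, grad_fn, mu) for mu in mus}

for mu, w_k in results.items():
    drift = np.linalg.norm(w_k - w_global)
    print(f"mu={mu:>5}: local model {np.round(w_k, 3)}, drift from global {drift:.3f}")
```

With $\mu = 0$ the site runs all the way to its own optimum $c_k$; as $\mu$ grows the local model stops progressively closer to the broadcast model, which is the stability-for-progress trade described above.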

FedAdam changes the server instead. Rather than averaging models directly, the coordinator treats the aggregated client delta $\Delta^{(t)} = \sum_k \tfrac{n_k}{n}\,(w_k^{(t+1)} - w^{(t)})$ as a pseudo-gradient and applies an adaptive optimizer to it,

$$m^{(t)} = \beta_1 m^{(t-1)} + (1-\beta_1)\,\Delta^{(t)}, \quad v^{(t)} = \beta_2 v^{(t-1)} + (1-\beta_2)\,(\Delta^{(t)})^2, \quad w^{(t+1)} = w^{(t)} + \eta_s \frac{m^{(t)}}{\sqrt{v^{(t)}}+\varepsilon}.$$

Server momentum and per-coordinate scaling smooth out noisy rounds and accelerate convergence, which is valuable precisely because cross-silo training affords only a few rounds. FedProx and FedAdam are not mutually exclusive; the proximal term tames the clients while the adaptive server optimizer stabilizes the aggregate, and serious deployments often combine them. For this case study we begin with FedAvg, keep FedProx ready as the heterogeneity remedy, and reach for FedAdam when round budget is tight. Table 37.4.1 summarizes the choice.

Table 37.4.1: Server-optimizer options for the consortium, the problem each one targets, and where it intervenes in the round.

Method	Where it acts	What it fixes	Cost
FedAvg	Server: plain weighted mean	Nothing extra; the baseline	None; drifts under skew
FedProx	Client: proximal term $\tfrac{\mu}{2}\lVert w - w^{(t)}\rVert^2$	Client drift from local heterogeneity	One hyperparameter $\mu$ to tune
FedAdam	Server: adaptive step on the aggregate delta	Slow, noisy convergence in few rounds	Server state $m, v$; step size $\eta_s$
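The FedAdam update above translates almost line for line into code. The sketch below follows the three equations as written in this section (no bias correction); the class name `FedAdamServer` and the hyperparameter defaults are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np

class FedAdamServer:
    """Server-side adaptive step on the size-weighted client delta."""

    def __init__(self, w0, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-3):
        self.w = np.asarray(w0, dtype=np.float64).copy()
        self.m = np.zeros_like(self.w)   # first-moment (momentum) state
        self.v = np.zeros_like(self.w)   # second-moment (scale) state
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps

    def step(self, local_models, sizes):
        weights = np.asarray(sizes, dtype=np.float64)
        weights = weights / weights.sum()
        # Delta = sum_k (n_k / n) * (w_k - w): the pseudo-gradient.
        delta = weights @ (np.stack(local_models) - self.w)
        self.m = self.b1 * self.m + (1 - self.b1) * delta
        self.v = self.b2 * self.v + (1 - self.b2) * delta**2
        # Plus sign: delta already points in a descent direction.
        self.w = self.w + self.lr * self.m / (np.sqrt(self.v) + self.eps)
        return self.w

# Two hypothetical sites that both moved by (+1, -1) during local training.
server = FedAdamServer(np.zeros(2))
local_models = [server.w + np.array([1.0, -1.0]),
                server.w + np.array([1.0, -1.0])]
w1 = server.step(local_models, sizes=[3000, 1000])
print(w1)   # moves in the direction the clients moved, with a normalized step
```

Note the design consequence of the per-coordinate scaling: the first step has roughly the same magnitude in every coordinate regardless of how large the raw delta was, so the server step size $\eta_s$, not the client learning rate alone, sets how far the global model moves per round.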

4. Local Epochs Versus Communication (Intermediate)

The single most consequential knob is $E$, the number of local epochs each hospital runs before uploading. It sets the ratio of computation to communication, and the two effects pull in opposite directions. Suppose the consortium needs a total of roughly $S$ local gradient passes to reach the target accuracy. The number of communication rounds is then approximately

$$R \approx \frac{S}{E},$$

so doubling $E$ roughly halves the number of rounds, and therefore halves the number of times the (slow, firewalled, audited) cross-institution links must carry a full model. Communication is the expensive resource in this deployment, which argues for large $E$. The opposing force is heterogeneity: with non-identical site data, each extra local epoch lets $w_k$ drift further toward its own local optimum, so very large $E$ degrades the quality of the averaged model, and beyond a point the loss per round stops improving or worsens. The convergence theory of Chapter 14 makes this dependence explicit; informally, the optimality gap after $R$ rounds behaves like

$$\mathbb{E}\big[F(w^{(R)})\big] - F^\star \;\lesssim\; \underbrace{\frac{C_1}{R}}_{\text{fewer rounds hurt}} \;+\; \underbrace{C_2\, E\, \sigma^2}_{\text{drift grows with } E \text{ and heterogeneity } \sigma^2},$$

where $\sigma^2$ measures how much the site objectives disagree. The first term shrinks only as rounds accumulate, and since every round costs a full model transfer, it is the pressure to pack more local work (a larger $E$) into each one; the second term punishes large $E$ directly, scaled by the heterogeneity $\sigma^2$. The right $E$ is therefore not a constant but a function of how skewed the consortium's data is, which is exactly why $E$ and the FedProx coefficient $\mu$ are tuned together in Section 37.5. For the homogeneous-to-mild regime of the demo below, a modest $E$ in the single digits sits comfortably in the sweet spot.
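The communication side of this trade-off is simple arithmetic, sketched below. Every number here is a hypothetical placeholder (a budget of $S = 100$ local passes, five sites, a one-million-parameter model sent as 4-byte floats), and the helper `communication_plan` is an illustrative name. The sketch counts only traffic; it says nothing about the drift penalty, which has to be measured empirically.

```python
import math

def communication_plan(total_local_epochs, local_epochs, n_params, n_sites,
                       bytes_per_param=4):
    """Rounds needed (R ~= S / E) and total bytes crossing institutional links."""
    rounds = math.ceil(total_local_epochs / local_epochs)
    # Each round moves the model down to every site and back up once.
    bytes_per_round = 2 * n_sites * n_params * bytes_per_param
    return rounds, rounds * bytes_per_round

S, K, P = 100, 5, 1_000_000   # assumed budget, sites, and parameter count
for E in [1, 2, 5, 10, 20]:
    R, total_bytes = communication_plan(S, E, P, K)
    print(f"E={E:>2}  rounds={R:>3}  total traffic={total_bytes / 1e9:.2f} GB")
```

Under these assumptions, moving from $E = 1$ to $E = 10$ cuts the number of audited cross-institution transfers by a factor of ten, which is the saving that has to be weighed against the extra drift each additional local epoch allows.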

5. Model Architecture and Privacy Wrapping (Intermediate)

The architecture must be agreed before the first round, because every site trains the identical model and the coordinator averages parameters position by position; a mismatch in shape makes the reduce meaningless. This constrains the choice toward a single, fixed, consortium-wide architecture rather than per-site customization. For structured electronic-health-record features (labs, vitals, demographics, coded diagnoses) a compact model such as regularized logistic regression or a small multilayer perceptron is the right starting point: it is cheap to communicate, since fewer parameters mean smaller uploads, it is data-efficient enough to train well even at the smaller sites, and its behavior is auditable, which matters for clinical sign-off. For imaging tasks the same federated round wraps a convolutional or transformer backbone, at the cost of much larger parameter uploads, which raises the value of the large-$E$ communication savings of Section 4. The federated machinery is indifferent to the architecture; what changes with model size is only the bytes moved per round.
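Two of these points lend themselves to a quick sketch: checking that a site's parameters match the agreed architecture before they enter the reduce, and estimating the bytes a given architecture puts on the wire. The helper names, the hidden width of 32, and the 4-bytes-per-parameter wire format are all assumptions for illustration.

```python
import numpy as np

def init_logreg(d):
    return {"w": np.zeros(d), "b": np.zeros(1)}

def init_mlp(d, hidden):
    return {"W1": np.zeros((d, hidden)), "b1": np.zeros(hidden),
            "W2": np.zeros((hidden, 1)), "b2": np.zeros(1)}

def payload_bytes(params, bytes_per_param=4):
    """Upload size, assuming parameters are serialized as 4-byte floats."""
    return bytes_per_param * sum(p.size for p in params.values())

def check_same_architecture(reference, candidate):
    """Raise if a site's parameters cannot be averaged position by position."""
    if reference.keys() != candidate.keys():
        raise ValueError("parameter names differ from the agreed architecture")
    for name in reference:
        if reference[name].shape != candidate[name].shape:
            raise ValueError(f"shape mismatch in {name}: "
                             f"{candidate[name].shape} vs {reference[name].shape}")

d = 8   # same feature count as the demo cohorts
print("logistic regression upload:", payload_bytes(init_logreg(d)), "bytes")
print("small MLP (hidden=32) upload:", payload_bytes(init_mlp(d, 32)), "bytes")

agreed = init_mlp(d, 32)
try:
    check_same_architecture(agreed, init_mlp(d, 16))   # a site that customized
except ValueError as err:
    print("rejected before aggregation:", err)
```

Running the check at the coordinator before aggregation turns a silent, meaningless average into an explicit failure, which is the safer behavior when sites are administered independently.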

Critically, the parameter upload is not sent in the clear. Each hospital's update is wrapped by the privacy mechanisms developed in Section 37.3: secure aggregation so the coordinator learns only the summed update and never any single hospital's contribution, and optionally differential-privacy noise added to each update before it leaves the firewall. These protections sit between the local-train and upload steps of every round; the FedAvg arithmetic above is unchanged, but each $w_k^{(t+1)}$ that reaches the aggregator has been encrypted or noised first. The privacy layer and the federated layer are orthogonal, which is what lets this section configure the training dynamics while Section 37.3 owns the confidentiality guarantees.
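The orthogonality claim can be illustrated with a toy version of pairwise-mask secure aggregation, a common construction in which every pair of sites shares a random mask that one adds and the other subtracts. This is only a sketch of why the FedAvg arithmetic survives masking, and it is an assumption that it resembles the mechanism of Section 37.3: real protocols use key agreement, modular arithmetic on uniformly random masks, and dropout recovery, none of which appear here. The function `pairwise_masks` and the mask scale are hypothetical.

```python
import numpy as np

def pairwise_masks(n_sites, dim, rng, scale=1e6):
    """One mask per site; masks sum to zero across the consortium."""
    masks = [np.zeros(dim) for _ in range(n_sites)]
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            # Stands in for a secret that only sites i and j can derive.
            shared = scale * rng.standard_normal(dim)
            masks[i] += shared
            masks[j] -= shared
    return masks

rng = np.random.default_rng(0)
sizes = np.array([4000, 2500, 3500])
local_models = [rng.standard_normal(4) for _ in sizes]

masks = pairwise_masks(len(sizes), 4, rng)
# Each site uploads its size-weighted model plus its mask, never the raw model.
uploads = [n * w + m for n, w, m in zip(sizes, local_models, masks)]

w_secure = sum(uploads) / sizes.sum()
w_plain = sum(n * w for n, w in zip(sizes, local_models)) / sizes.sum()
print(np.allclose(w_secure, w_plain))   # True: the masks cancel in the sum
```

The coordinator in this sketch sees only the masked uploads, each of which is dominated by its mask, yet their sum divided by $n$ is the ordinary FedAvg result. The sketch also assumes the site sizes $n_k$ are shared in the clear.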

Thesis Thread: Distribution Forced by Governance, Not by Size

Every earlier form of distribution in this book was driven by a resource ceiling: too much data, too large a model, too many requests. Federated medical AI distributes for a different reason entirely, a legal and ethical one, and yet it reuses the same primitives. The broadcast and the size-weighted reduce are the collectives of Chapter 4; the local epochs are the local SGD of Chapter 10; the coordinator is the parameter-server pattern of Chapter 11. The thesis holds even when the motivation is not scale but sovereignty: the way to make many institutions act as one model is to move the parameters and leave the data home.

6. The Runnable Federated Loop (Intermediate)

The code below builds the whole round structure on synthetic patient cohorts so the dynamics are visible end to end. Five hospitals each hold a different number of records drawn with a different covariate shift, modeling the heterogeneity of real sites, while sharing the same underlying risk model. Each round broadcasts the global model, runs $E$ local epochs of logistic-regression gradient descent at every site, and combines the results with the size-weighted FedAvg reduce. The script runs both plain FedAvg and FedProx ($\mu = 0.1$) and compares the federated result against a centralized model trained on the pooled data, the privacy-violating baseline the consortium is forbidden from building.

import numpy as np

rng = np.random.default_rng(7)
H, d = 5, 8                       # hospitals, patient features
w_star = rng.standard_normal(d)   # the shared clinical risk model
b_star = 0.3

def make_hospital(n, shift):
    # Site distributions differ (covariate shift); the risk model is shared.
    X = rng.standard_normal((n, d)) + shift
    p = 1.0 / (1.0 + np.exp(-(X @ w_star + b_star)))
    y = (rng.random(n) < p).astype(np.float64)
    return X, y

sizes  = [4000, 2500, 3500, 1500, 5000]          # heterogeneous volumes
shifts = [0.0, 0.8, -0.6, 1.2, -1.0]             # heterogeneous distributions
sites  = [make_hospital(n, s) for n, s in zip(sizes, shifts)]
n_total = sum(sizes)

def sigmoid(z): return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def local_loss(W, X, y):
    p = sigmoid(X @ W[:-1] + W[-1]); eps = 1e-9
    return -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

def global_loss(W):
    return sum(local_loss(W, X, y)*len(y) for X, y in sites) / n_total

def local_train(W_global, X, y, epochs, lr, mu=0.0):
    # E local epochs of GD (Ch 10 local SGD); mu>0 is the FedProx proximal pull.
    W = W_global.copy(); n = len(y)
    Xb = np.hstack([X, np.ones((n, 1))])
    for _ in range(epochs):
        grad = Xb.T @ (sigmoid(Xb @ W) - y) / n + mu * (W - W_global)
        W -= lr * grad
    return W

def run_federated(rounds, epochs, lr, mu=0.0):
    W = np.zeros(d + 1); history = []
    for _ in range(rounds):
        updates = [local_train(W, X, y, epochs, lr, mu) for X, y in sites]  # broadcast + local train
        W = sum(u*len(y) for u, (X, y) in zip(updates, sites)) / n_total    # FedAvg reduce
        history.append(global_loss(W))
    return W, history

ROUNDS, EPOCHS, LR = 15, 5, 0.5
print("hospitals H            :", H)
print("total patients         :", n_total)
print("rounds x local epochs  :", f"{ROUNDS} x {EPOCHS}\n")

_, hist_fa = run_federated(ROUNDS, EPOCHS, LR, mu=0.0)
_, hist_fp = run_federated(ROUNDS, EPOCHS, LR, mu=0.1)
print("round   FedAvg loss   FedProx loss")
for r in [0, 1, 2, 4, 7, 11, 14]:
    print(f"{r+1:>4}      {hist_fa[r]:.4f}        {hist_fp[r]:.4f}")

# Centralized reference: pool everything on one machine (forbidden in practice).
Xc = np.vstack([X for X, y in sites]); yc = np.concatenate([y for X, y in sites])
Wc = np.zeros(d + 1); Xcb = np.hstack([Xc, np.ones((len(yc), 1))])
for _ in range(400):
    Wc -= LR * (Xcb.T @ (sigmoid(Xcb @ Wc) - yc) / len(yc))
print("\nfinal FedAvg loss      :", f"{hist_fa[-1]:.4f}")
print("final FedProx loss     :", f"{hist_fp[-1]:.4f}")
print("centralized (pooled)   :", f"{global_loss(Wc):.4f}")
print("FedAvg vs pooled gap   :", f"{hist_fa[-1] - global_loss(Wc):.4f}")

Code 37.4.1: Fifteen FedAvg rounds across five simulated hospital cohorts, with FedProx and a centralized pooled baseline for comparison. The run_federated loop is the entire broadcast / local-train / reduce cycle of Figure 37.4.1; what makes it federated rather than ordinary centralized training is that each site takes $E$ local gradient steps on its own shifted cohort before the size-weighted average.

hospitals H            : 5
total patients         : 16500
rounds x local epochs  : 15 x 5

round   FedAvg loss   FedProx loss
   1      0.5393        0.5490
   2      0.4937        0.5013
   3      0.4736        0.4792
   5      0.4570        0.4600
   8      0.4494        0.4507
  12      0.4467        0.4472
  15      0.4462        0.4464

final FedAvg loss      : 0.4462
final FedProx loss     : 0.4464
centralized (pooled)   : 0.4458
FedAvg vs pooled gap   : 0.0004

Output 37.4.1: The global model improves every round and converges in roughly a dozen rounds to a loss of $0.4462$, within $0.0004$ of the centralized model trained on pooled data ($0.4458$). The consortium reached essentially the pooled-data answer without any record ever leaving its hospital. FedProx is marginally slower here because the data skew is mild and its proximal leash is not yet needed; under the stronger heterogeneity of Section 37.5, that ordering reverses.

Practical Example: The Sepsis Model Three Hospitals Could Not Pool

Who: A data-science lead coordinating an early-warning sepsis model across three regional hospitals.

Situation: Each hospital alone had too few positive cases to train a robust model, and a pooled dataset would have been ideal.

Problem: The hospitals' legal teams forbade transferring identifiable records across institutional boundaries, so the pooled dataset could never be built.

Dilemma: Ship three weaker single-site models that each underperform, or stand up a federated consortium that needs a coordinator, on-prem clients, and a governance agreement before a single round can run.

Decision: They built the federated system, because the gap between a single-site model and the pooled-quality model was the difference between a tool clinicians trusted and one they ignored.

How: A coordinator in a neutral cloud account broadcast a shared logistic model; each hospital ran five local epochs behind its firewall and uploaded only secure-aggregated parameters, exactly the loop in Code 37.4.1.

Result: The federated model matched the would-be pooled model to within a fraction of a point of AUROC, consistent with the tiny pooled gap in Output 37.4.1, and no record ever crossed a hospital boundary.

Lesson: When governance forbids pooling, federated training recovers nearly all of the pooled-data accuracy at the price of coordination, not privacy.

Library Shortcut: Flower Runs the Coordinator and Clients for You

The hand-written round in Code 37.4.1 simulates broadcast, local training, and the reduce in one process. In a real deployment those steps run on separate machines across institutional networks, with serialization, transport, retries, and secure aggregation to handle. Frameworks such as Flower and NVIDIA FLARE provide exactly this; in Flower you implement a client class with fit and let the server drive FedAvg over the wire:

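# pip install flwr ; one coordinator process + one client process per hospital
# get_model_weights, set_model_weights, train_locally, and num_local_records are
# site-side placeholders you supply; they are not Flower functions.
# Newer Flower releases deprecate the start_* entry points in favor of ClientApp / ServerApp.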
import flwr as fl

class HospitalClient(fl.client.NumPyClient):
    def get_parameters(self, config):           # send local weights up
        return get_model_weights()
    def fit(self, parameters, config):          # E local epochs behind the firewall
        set_model_weights(parameters)
        train_locally(epochs=config["local_epochs"])
        return get_model_weights(), num_local_records(), {}

# On each hospital node, behind its own firewall:
fl.client.start_client(server_address="coordinator:8080",
                       client=HospitalClient().to_client())

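# On the coordinator (FedAvg is the default strategy). on_fit_config_fn supplies the
# per-round config that fit() reads; without it, config["local_epochs"] would be missing.
fl.server.start_server(
    config=fl.server.ServerConfig(num_rounds=15),
    strategy=fl.server.strategy.FedAvg(
        min_fit_clients=5,
        min_available_clients=5,
        on_fit_config_fn=lambda server_round: {"local_epochs": 5},
    ),
)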

Code 37.4.2: The same federated loop as Code 37.4.1, now distributed across real hospital nodes with Flower. Roughly forty lines of by-hand orchestration collapse to a client class and two start_* calls; the framework owns serialization, the broadcast / reduce transport, client coordination, and the secure-aggregation hooks. Swapping FedAvg for FedProx or FedAdam is a small change to the strategy argument: Flower's FedProx strategy takes a proximal_mu coefficient that the client applies in its local loss, and FedAdam needs the initial global parameters.

7. Coordinator and On-Premise Clients (Advanced)

The deployment topology follows the algorithm directly. A single coordinator process, hosted in a neutral environment that no single hospital controls (a shared cloud account or a consortium-operated server), holds the global model and orchestrates rounds; it is the parameter-server pattern of Chapter 11, specialized to the case where its clients are whole institutions. Inside each hospital, a client process runs on-premise, with read access to that hospital's training data and outbound connectivity to the coordinator but no inbound exposure: the hospital firewall stays closed, the client dials out, and only model parameters cross the boundary. This dial-out design is what makes the system deployable in real clinical IT, where opening an inbound port into a hospital network is effectively impossible.

Because the parties are few, known, and trusted, the coordinator's role is orchestration rather than defense against adversarial clients; the Byzantine-robust aggregation of Chapter 35 is available if the threat model later demands it, but a trusted consortium does not start there. NVIDIA FLARE targets exactly this cross-silo, on-premise, institution-as-client deployment and adds production concerns Flower leaves to the operator: client authentication, audit logging of every round for regulatory review, and admin controls for starting and stopping jobs across sites. The training algorithm is identical to Code 37.4.1; what these frameworks contribute is the operational shell that lets a real hospital run a client behind its real firewall.

Research Frontier: Federated Foundation Models in Healthcare (2024 to 2026)

The frontier is moving from small federated classifiers toward federated training and tuning of foundation models on multi-institutional clinical data. Parameter-efficient federated tuning, where sites exchange only low-rank LoRA adapters rather than full backbones, slashes the per-round upload that Section 4 identified as the binding cost and makes federating large models practical; recent work adapts FedAvg-style aggregation to these adapters and to heterogeneous client compute. In parallel, the medical-imaging community has demonstrated genuinely multi-hospital and multi-continent federated studies (in the lineage of the NVIDIA FLARE breast-density and EXAM efforts), and current research pushes toward personalized federated models that keep a shared global backbone while letting each hospital retain a site-specific head, directly attacking the heterogeneity that Section 37.5 takes up. The open questions are aggregating across sites with different model capacities and certifying privacy end to end, both active as of 2026.

We now have the training system fully specified: synchronized rounds of broadcast, $E$ local epochs, and a size-weighted reduce; FedAvg as the baseline with FedProx and FedAdam held ready; a fixed shared architecture whose uploads are privacy-wrapped per Section 37.3; and a coordinator-with-dial-out-clients topology that respects every hospital firewall. The one risk this configuration has only deferred, not solved, is the disagreement between site data distributions, which the demo's mild skew let us ignore. Section 37.5 confronts that heterogeneity head on and shows why it is the force that turns FedProx from an option into a necessity.

Exercise 37.4.1: Cross-Silo Versus Cross-Device (Conceptual)

For each cross-device assumption, state the cross-silo counterpart for a hospital consortium and name one design simplification it permits: (a) only a small random subset of clients participates per round; (b) clients have tiny local datasets; (c) clients may disconnect mid-round; (d) thousands of rounds are needed. Then identify the one cross-device problem that gets worse, not better, in the cross-silo setting, and explain why full participation does not fix it.

Exercise 37.4.2: Sweep the Local-Epoch Knob (Coding)

Modify Code 37.4.1 to sweep $E \in \{1, 2, 5, 10, 20, 50\}$ at a fixed total compute budget (set ROUNDS so that $\texttt{ROUNDS} \times E$ stays constant). Plot or print the final FedAvg loss against $E$. Then increase the site shifts to make the cohorts far more heterogeneous and repeat. Show that the loss-minimizing $E$ shrinks as heterogeneity grows, and connect your curves to the two competing terms in the convergence bound of Section 4.

Exercise 37.4.3: When Does FedProx Earn Its Coefficient? (Analysis)

Using the modified heterogeneous setup from Exercise 37.4.2, run FedAvg and FedProx across $\mu \in \{0, 0.01, 0.1, 1.0\}$ at a large $E$ (say $E = 20$). Report the final global loss for each. Identify the heterogeneity level at which FedProx first beats FedAvg, and the $\mu$ beyond which the proximal leash is so tight that progress stalls. Argue, from the proximal term's effect on each local update, why an over-large $\mu$ makes the federated model approach a do-nothing baseline.