"They all sent me a glucose value. One hospital meant milligrams per deciliter, one meant millimoles per liter, and one had stored the units in a free-text note that nobody parsed. I averaged them anyway, and the model learned that diabetes is a rounding error."
A Lab Value, Recorded in Units No Other Hospital Uses
Before a federated model can learn from many hospitals, each hospital must reshape its own data into a form the others can recognize, because the model may touch that data, which never leaves the building, only indirectly, through gradients and summaries. A patient record is not a clean tensor waiting to be batched. It is tabular vitals from one electronic health record vendor, imaging in a scanner-specific DICOM dialect, ICU waveforms at a sampling rate that varies by ward, and clinical notes in a dozen abbreviation styles, all coded against terminologies that disagree across silos. Federation does not remove this mess; it forbids the usual remedy of pooling the raw data centrally to sort it out. The precondition for cross-silo federated learning is therefore local harmonization: every site independently maps its data into a common data model so that a single global model definition makes sense everywhere at once. This section surveys that data reality and shows, in one calculation, why harmonization is the load-bearing step on which the rest of the chapter rests.
Chapter 14 introduced federated learning as an algorithm: keep the data on the client, send the model to the data, and aggregate updates instead of records. Chapter 34 deepened it for the edge, where clients are millions of phones and sensors. This chapter studies a different and, in practice, more common federated regime, the hospital network, and this section establishes the ground truth that every later section in the chapter must respect. The central difficulty is not the optimization, which we develop in Section 37.4; it is that the participating hospitals do not record the same things in the same way, and the federation cannot see the discrepancy directly to correct it. The model learns from data it can only touch through each site's local pipeline, so what that pipeline produces is what the model actually sees.
1. Cross-Silo, Not Cross-Device (Beginner)
It is worth fixing the regime precisely, because it changes nearly every design decision that follows. The federated edge setting of Chapter 34, and in particular the cross-device analysis of Section 34.6, assumes an enormous population of weak, unreliable, intermittently available clients: a phone joins a round when it is charging on Wi-Fi and may never be seen again. A hospital network is the opposite on every axis. There are few clients, perhaps three to fifty institutions; each is large, holding tens of thousands to millions of patient records; each is reliable, with real servers and staff; each is stateful, participating across many rounds and keeping its identity; and each is, for legal reasons, a known and identified party rather than an anonymous client. This is the cross-silo federated setting, the regime Chapter 14 formalized, and Table 37.2.1 contrasts it with the cross-device case so the rest of the chapter can assume it.
| Property | Cross-device (Ch 34.6) | Cross-silo (this chapter) |
|---|---|---|
| Number of clients $K$ | $10^4$ to $10^9$ phones or sensors | $3$ to $\sim 50$ hospitals |
| Data per client | tiny, a few samples | large, $10^4$ to $10^6$ records |
| Availability | intermittent, unpredictable | reliable, always reachable |
| State across rounds | stateless, sampled fresh | stateful, persistent identity |
| Per-client compute | weak, battery-bound | a real ETL cluster per site |
| Dominant heterogeneity | system noise, dropout | schema, coding, and population skew |
The consequence is that the hard problem moves. With few, reliable, stateful clients, the systems-level worry of cross-device federation, sampling a representative cohort from a churning population, largely disappears. What remains, and grows, is statistical and semantic heterogeneity: the fact that each silo is a large, internally consistent, but globally idiosyncratic world. Because each silo owns a real compute cluster, it can afford to run a serious local ETL job, which is precisely what the common data model below requires. The federation's strength, large reliable clients, is exactly what makes per-site harmonization feasible.
2. Four Modalities, Four Kinds of Mess (Beginner)
A clinical record spans modalities that no single tensor captures, and each arrives with its own form of disagreement across sites. The structured electronic health record (EHR) holds tabular data: demographics, diagnoses, procedures, medications, and laboratory results, stored in vendor-specific relational schemas. Imaging arrives as DICOM studies (radiographs, computed tomography, magnetic resonance), where pixel spacing, slice thickness, contrast protocol, and scanner manufacturer vary by site and even by machine within a site. Physiological waveforms (the electrocardiogram and the ICU monitor streams of Chapter 9) differ in sampling rate, lead placement, and filtering. Clinical notes are free text in the local dialect of abbreviations, templates, and languages. Figure 37.2.1 traces how these modalities, recorded incompatibly at $K$ hospitals, are each mapped by a local pipeline into a common model before any federated round begins.
The four kinds of mess do not reduce to one. Tabular heterogeneity is a problem of terminology and schema; imaging is a problem of acquisition physics and normalization; waveforms are a problem of resampling and alignment; notes are a problem of language. This section concentrates on the tabular case, because it is where the common data model is most mature and where the harmonization arithmetic is cleanest, and because the imaging-and-domain-shift story gets its own deep treatment in Section 37.5. The unifying point is that all four are resolved the same way: not by a central cleaner, which federation forbids, but by a local transform at each site whose output conforms to a shared contract.
3. Schema and Coding Heterogeneity (Intermediate)
Consider a single clinical concept, the serum potassium result. Hospital A stores it in an Epic schema, keyed by an internal lab code, with the value in milliequivalents per liter. Hospital B stores it in a Cerner schema under a different internal code, possibly with a different reference range attached. The two are the same physical measurement, but no join key connects them and no shared column name exists. Worse, clinical facts are coded against incompatible terminologies: ICD-9 or ICD-10 for billing diagnoses, SNOMED CT for clinical concepts, and LOINC for laboratory observations, with sites disagreeing on which they populate and how completely. A feature that the model treats as one dimension of $x$ is, at the raw level, up to $K$ different columns in $K$ different schemas under $K$ coding systems.
Harmonization is the act of making these agree. Formally, let site $k$ hold raw records in a local space $\mathcal{R}_k$. Each site applies a local map
$$\phi_k : \mathcal{R}_k \longrightarrow \mathcal{X}, \qquad x = \phi_k(r), \quad x \in \mathbb{R}^d,$$

that targets one shared feature space $\mathcal{X}$ common to every site. The map $\phi_k$ resolves the local code to a shared concept identifier (a LOINC code for a lab, a SNOMED concept for a diagnosis), converts the value to a canonical unit, and places it in the agreed column. The federation never sees $\phi_k$ or $\mathcal{R}_k$; it only ever interacts with the harmonized $x \in \mathcal{X}$. Correctness of the global model rests entirely on the maps $\phi_k$ producing semantically identical features. If site B's $\phi_B$ forgets a unit conversion, the federation has no way to notice, because it cannot inspect B's raw rows; it will simply train on a feature whose distribution is silently wrong at one silo.
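The idea of a per-site map $\phi_k$ can be made concrete with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the local lab codes, the raw column names, the concept name `potassium_serum`, and the helper `make_phi`. The one fact relied on is that potassium is monovalent, so 1 mEq/L equals 1 mmol/L.

```python
# Hypothetical per-site lookups: local lab code -> (shared concept, factor to canonical unit).
# The local codes and the concept name "potassium_serum" are invented for illustration.
SITE_A_CODES = {"A_LAB_0417": ("potassium_serum", 1.0)}  # mEq/L; for K+, 1 mEq/L = 1 mmol/L
SITE_B_CODES = {"B-POT-77": ("potassium_serum", 1.0)}    # already mmol/L


def make_phi(code_map, code_key, value_key):
    """Build a site's local map phi_k from its own column names and code table."""
    def phi(raw):
        concept, factor = code_map[raw[code_key]]  # KeyError means an unmapped local code
        return {concept: raw[value_key] * factor}
    return phi


phi_A = make_phi(SITE_A_CODES, code_key="lab_code", value_key="result")
phi_B = make_phi(SITE_B_CODES, code_key="test_id", value_key="val")

# Two differently shaped raw rows describing the same physical measurement.
x_a = phi_A({"lab_code": "A_LAB_0417", "result": 4.1, "ref_range": "3.5-5.0"})
x_b = phi_B({"test_id": "B-POT-77", "val": 4.1})

print(x_a == x_b)  # both rows land in the same shared feature
```

The raw rows share no key and no column name, yet both maps emit the same harmonized feature. Only the outputs `x_a` and `x_b` would ever be visible to federated training; the code tables and raw rows stay inside each site.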
In a centralized project you would pool all hospitals' data into one warehouse and reconcile the schemas once, in the open, where you can see every discrepancy. Federation forbids the pooling, so it forbids that central reconciliation. The work does not vanish; it is pushed to the edge, where each site must independently map its raw records into a common data model that it agrees to in advance but executes alone. The shared contract $\mathcal{X}$ replaces the shared warehouse. Every guarantee the global model relies on, that a feature means the same thing everywhere, is now a guarantee about $K$ local pipelines you cannot audit from the center, which is why the common data model and its validation are the real engineering of cross-silo medical AI.
4. A Common Data Model: OMOP and FHIR (Intermediate)
The shared contract $\mathcal{X}$ is not invented per project; the field has converged on standard common data models. The OMOP Common Data Model (Observational Medical Outcomes Partnership, maintained by the OHDSI community) defines a fixed relational schema and a standardized vocabulary so that a query written once runs unchanged against any conforming site. HL7 FHIR defines a resource-and-API standard for exchanging clinical data with shared coding systems (LOINC for labs, SNOMED CT for conditions, RxNorm for drugs, UCUM for units). A hospital's harmonization pipeline, the local $\phi_k$, is in practice an extract-transform-load job that reads the vendor schema and writes OMOP tables or FHIR resources, mapping each local code to the standard vocabulary and each value to the canonical unit. This is per-site Spark ETL of exactly the kind built in Chapter 7, reading from and writing to the distributed storage and data-loading layer of Chapter 8, run independently inside each silo.
The conceptual core of $\phi_k$, mapping a local lab code to a shared LOINC concept and converting the value to a canonical unit, is a join and a scaled column. With a per-site vocabulary table (the OMOP concept and concept_relationship tables provide exactly this mapping for real deployments), the harmonization of one lab feature reduces to:
import pandas as pd

# local_labs: raw rows from this site's EHR; vocab: site-specific code -> (LOINC, unit, factor)
harmonized = (
    local_labs
    .merge(vocab, on="local_code", how="left")  # resolve local code to a shared concept
    .assign(value_canonical=lambda df: df["value"] * df["to_mgdl_factor"])  # UCUM unit conversion
    .loc[:, ["person_id", "loinc_code", "value_canonical"]]  # OMOP-shaped output
)
# every site runs this with ITS OWN vocab; the output schema is identical across sites
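The snippet above assumes `local_labs` and `vocab` already exist. A self-contained version with toy data, sketched under stated assumptions, shows one extra step a site might reasonably take: surfacing local codes that have no vocabulary entry instead of letting them pass through as missing values. The local codes, the values, and the string `LOINC_GLUCOSE` (a placeholder, not a real LOINC code) are invented for illustration.

```python
import pandas as pd

# Toy raw rows from one site that reports glucose in mmol/L (values are invented).
local_labs = pd.DataFrame({
    "person_id": [1, 2, 3],
    "local_code": ["GLU_F", "GLU_F", "GLU_OLD"],
    "value": [5.8, 7.2, 6.1],
})

# Hypothetical site vocabulary; "GLU_OLD" is deliberately left unmapped.
vocab = pd.DataFrame({
    "local_code": ["GLU_F"],
    "loinc_code": ["LOINC_GLUCOSE"],   # placeholder, not a real LOINC code
    "to_mgdl_factor": [18.0182],
})

# indicator=True adds a "_merge" column recording whether each row found a match;
# validate="many_to_one" raises if the vocabulary lists a local code twice.
merged = local_labs.merge(
    vocab, on="local_code", how="left", indicator=True, validate="many_to_one"
)

unmapped = merged.loc[merged["_merge"] == "left_only", ["person_id", "local_code"]]
harmonized = (
    merged.loc[merged["_merge"] == "both"]
    .assign(value_canonical=lambda df: df["value"] * df["to_mgdl_factor"])
    .loc[:, ["person_id", "loinc_code", "value_canonical"]]
)

print(harmonized)
print("unmapped local codes:", sorted(unmapped["local_code"].unique()))
```

Reporting the unmapped codes locally is a design choice rather than a requirement of OMOP or FHIR; the point is that a left join alone would silently turn an unmapped code into a row with no concept and no canonical value.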
The payoff of the common data model is the federation contract: because every site emits the same schema with the same codes and units, a single global model definition, the same feature vector $x \in \mathbb{R}^d$ and the same label $y$, is meaningful at every site at once. That is the precondition the federated training setup of Section 37.4 assumes when it sends one model out to all $K$ silos. Without it, the model that fits site A's columns is undefined at site B.
5. Why Units Alone Can Sink the Model (Intermediate)
The starkest harmonization failure is the unit mismatch, and it is worth seeing numerically because it shows both the danger and the fix. Fasting glucose is reported in milligrams per deciliter (mg/dL) in much of the world and in millimoles per liter (mmol/L) elsewhere, related by $\text{mg/dL} = 18.0182 \times \text{mmol/L}$. A value of $5.8$ mmol/L and $104$ mg/dL describe the same blood; pooled naively, the model sees a feature with a bimodal, meaningless distribution. The demonstration below simulates five hospitals reporting glucose, two in mmol/L and three in mg/dL, then harmonizes each to the canonical unit and forms the federated, size-weighted summary the server could compute from per-site means alone. Let site $k$ have $n_k$ harmonized records with local mean $\bar{x}_k$; the population-weighted global mean
$$\bar{x} = \frac{\sum_{k=1}^{K} n_k\, \bar{x}_k}{\sum_{k=1}^{K} n_k}, \qquad w_k = \frac{n_k}{\sum_j n_j},$$

uses exactly the FedAvg sample-count weights $w_k$ of Chapter 14, and it recovers the true pooled mean despite the federation never touching a raw row. The same weights $w_k$ reappear when the server averages model updates in Section 37.4.
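To see why the size-weighted combination is exact rather than approximate, it may help to write the pooled mean out in full. The notation $x_{k,i}$ for the $i$-th harmonized value at site $k$ and $N = \sum_j n_j$ for the total record count is introduced here for illustration only:

$$\bar{x}_{\text{pooled}} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k} x_{k,i} = \frac{1}{N}\sum_{k=1}^{K} n_k \left(\frac{1}{n_k}\sum_{i=1}^{n_k} x_{k,i}\right) = \sum_{k=1}^{K} \frac{n_k}{N}\,\bar{x}_k = \sum_{k=1}^{K} w_k\,\bar{x}_k.$$

Each site therefore needs to report only the pair $(n_k, \bar{x}_k)$. The identity holds for any partition of the rows, provided every $x_{k,i}$ is expressed in the same unit.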
import numpy as np

rng = np.random.default_rng(7)

# Five hospitals. Each records fasting glucose, but two report mmol/L and three mg/dL,
# and the underlying populations differ (a diabetes clinic vs general wards).
sites = {
    "Mercy-East (Epic)": dict(unit="mg/dL", n=4200, mu=104, sd=22, pos=0.11),
    "Lakeside (Cerner)": dict(unit="mmol/L", n=3100, mu=5.8, sd=1.3, pos=0.09),
    "Univ-Hosp (Epic)": dict(unit="mg/dL", n=5600, mu=118, sd=30, pos=0.18),
    "Coastal Diabetes Ctr": dict(unit="mmol/L", n=1800, mu=8.9, sd=2.4, pos=0.41),
    "Rural-Net (Meditech)": dict(unit="mg/dL", n=2300, mu=99, sd=19, pos=0.07),
}
MMOL_TO_MGDL = 18.0182

local_means_mgdl, weights, all_harmonized, labels_pos = [], [], [], []
for name, c in sites.items():
    raw = rng.normal(c["mu"], c["sd"], c["n"])
    h = raw * MMOL_TO_MGDL if c["unit"] == "mmol/L" else raw  # local phi_k: unit convert
    all_harmonized.append(h)
    local_means_mgdl.append(h.mean())
    weights.append(c["n"])
    labels_pos.append(c["pos"])

weights = np.array(weights, float)
local_means_mgdl = np.array(local_means_mgdl)
labels_pos = np.array(labels_pos)

fed_mean = np.sum(weights * local_means_mgdl) / weights.sum()  # FedAvg-weighted summary
true_mean = np.concatenate(all_harmonized).mean()              # oracle pooled (for check)
global_prior = np.sum(weights * labels_pos) / weights.sum()    # population label prior

print(f"federated size-weighted mean : {fed_mean:8.2f} mg/dL")
print(f"oracle pooled mean (all rows) : {true_mean:8.2f} mg/dL")
print(f"weighted vs oracle abs diff : {abs(fed_mean - true_mean):8.4f} mg/dL")
print(f"global label prior pi : {global_prior:8.3f}")
print(f"label-prior skew (max/min) : {labels_pos.max() / labels_pos.min():8.2f}x")
Output 37.2.2 (Python on a standard NumPy install):
federated size-weighted mean : 113.87 mg/dL
oracle pooled mean (all rows) : 113.87 mg/dL
weighted vs oracle abs diff : 0.0000 mg/dL
global label prior pi : 0.156
label-prior skew (max/min) : 5.86x
The lesson of Output 37.2.2 is twofold. First, once $\phi_k$ converts units correctly, the federated weighted summary is exact up to floating-point rounding, the same all-reduce-then-divide exactness proved for gradients in Section 1.1, now applied to a clinical feature. Second, that exactness is conditional on the harmonization being right: had even one site skipped the conversion, the weighted mean would be a confident, wrong number, and the federation could not detect it from the center. Harmonization is what makes the federated arithmetic both exact and trustworthy.
6. Label Noise, Missingness, and Population Skew (Advanced)
Even with schemas and units reconciled, three statistical realities remain, and they are not bugs to be cleaned away but properties of the data the model must be designed around. The first is label noise. A diagnosis label derived from billing codes is a proxy, not ground truth; coding practices differ by site, so the same patient might be labeled positive at one hospital and negative at another for reasons of reimbursement rather than physiology. The second is missingness, and crucially it is not missing-at-random: a lab is ordered because a clinician suspected something, so the very presence of a value is informative, and the pattern of what is missing differs across sites with different ordering cultures. Imputing such values with a global mean, blind to site, injects exactly the cross-site bias federation is supposed to respect.
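The contrast between site-blind and site-local imputation can be sketched with a toy example. This is one common pattern (site-local mean fill plus an explicit missingness flag), shown under stated assumptions and not presented as the chapter's prescribed method; the lactate values, site names, and the helper `local_features` are invented for illustration.

```python
import numpy as np

# Hypothetical lactate values (mmol/L) at two sites; np.nan means the lab was not ordered.
site_values = {
    "A": np.array([1.1, np.nan, 2.4, np.nan, 1.3]),
    "B": np.array([3.9, 4.4, np.nan, 5.1, 4.0]),
}


def local_features(values):
    """Return columns [imputed value, was_missing flag] using only this site's own data."""
    missing = np.isnan(values)
    fill = np.nanmean(values)                  # site-local statistic, computed on site
    imputed = np.where(missing, fill, values)
    return np.column_stack([imputed, missing.astype(float)])


features = {site: local_features(v) for site, v in site_values.items()}

# A site-blind fill value, computed here only to show how far it sits from each site.
global_fill = np.nanmean(np.concatenate(list(site_values.values())))

for site, v in site_values.items():
    print(site, "local fill:", round(float(np.nanmean(v)), 2),
          "missing rate:", float(np.isnan(v).mean()))
print("site-blind fill:", round(float(global_fill), 2))
```

Keeping the `was_missing` column preserves the information carried by the ordering decision itself, and the differing missing rates per site are exactly the kind of ordering-culture signal a single global fill value would erase.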
The third, and the one most specific to the cross-silo setting, is population skew. Each hospital draws from a different catchment: a tertiary referral center sees sicker patients than a rural clinic, a specialty diabetes center sees far higher glucose-related prevalence, and demographics, age, and equipment differ throughout. Statistically, the silos are not independent and identically distributed samples of one population; they are samples from $K$ different populations, each with its own prior. Writing $\pi_k = P_k(y = 1)$ for the positive-class prevalence at site $k$, the global prior is the weighted mixture
$$\pi = \sum_{k=1}^{K} w_k\, \pi_k, \qquad w_k = \frac{n_k}{\sum_j n_j},$$

and in Output 37.2.2 the $\pi_k$ ranged from $0.07$ at the rural network to $0.41$ at the diabetes center, a $5.86\times$ spread around a global $\pi = 0.156$. A classifier trained naively to a site's local prior will be miscalibrated everywhere else; the label prior is itself a form of non-IID heterogeneity, distinct from the feature shift of imaging, and it is the precise quantity the heterogeneity methods of Section 37.5 are built to handle. The evaluation problem this creates, a model whose accuracy is excellent on the population it was effectively weighted toward and poor elsewhere, is exactly why distributed evaluation must report per-site metrics, not a single pooled number, the discipline established in Chapter 5.
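One standard way to see what a prior mismatch does to a predicted probability, assuming pure label shift (the class-conditional feature distributions are the same at every site and only $\pi_k$ differs), is the Bayes prior-shift re-scaling. The sketch below is illustrative and is not presented as the method of Section 37.5; the function name `adjust_for_prior` and the example score of 0.30 are hypothetical, while the priors are the ones quoted above.

```python
def adjust_for_prior(p, pi_train, pi_site):
    """Re-scale a probability calibrated at prevalence pi_train to prevalence pi_site.

    Assumes label shift only: p(x | y) is identical at both prevalences.
    """
    pos = p * pi_site / pi_train
    neg = (1.0 - p) * (1.0 - pi_site) / (1.0 - pi_train)
    return pos / (pos + neg)


PI_GLOBAL = 0.156                      # global prior from the text
SITE_PRIORS = {"Coastal Diabetes Ctr": 0.41, "Rural-Net": 0.07}

score = 0.30                           # hypothetical output of a globally calibrated model
for site, pi_k in SITE_PRIORS.items():
    print(site, round(adjust_for_prior(score, PI_GLOBAL, pi_k), 3))
```

The same raw score corresponds to a noticeably higher probability at the high-prevalence site and a lower one at the low-prevalence site, which is the miscalibration the paragraph describes. When feature distributions also differ across sites, this correction alone is not sufficient.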
Who: A clinical machine learning team building an early-warning sepsis model across a four-hospital network.
Situation: The federated training loop converged cleanly and the pooled validation AUC looked strong, so the team prepared to deploy at all four sites.
Problem: One site, a large academic center, contributed the majority of records, and its higher sepsis prevalence and richer lab ordering dominated the population-weighted prior. The model had effectively specialized to that site.
Dilemma: Reweight the federation to equalize site influence, losing statistical efficiency from the large site, or keep the size weights and accept that smaller sites are underserved.
Decision: They first fixed the data, not the weights: a per-site audit revealed that two hospitals had skipped a unit conversion on lactate and that missingness patterns differed sharply, so the apparent population skew was partly a harmonization defect masquerading as biology.
How: They corrected each site's $\phi_k$, re-ran the per-site distribution checks, and only then reported per-site AUC alongside the pooled number, following the Chapter 5 evaluation discipline.
Result: Two of the four per-site AUCs rose once the units were right, the genuine prevalence skew that remained was visible and quantified, and the team chose a prevalence-aware aggregation rather than guessing.
Lesson: Before treating cross-site disagreement as population skew to be modeled, rule out harmonization error, which is cheaper to fix and invisible from the center. A pooled metric can hide both.
The community is increasingly treating data harmonization and federated optimization as a joint problem rather than sequential steps. On the data side, the OHDSI ecosystem and large network studies have pushed OMOP harmonization to hundreds of millions of patient records, and FHIR-native feature extraction is becoming standard for clinical machine learning. On the modeling side, federated foundation models for medical imaging and EHR, and benchmarks such as those in the FLamby suite (Ogier du Terrail et al., 2022) for realistic cross-silo healthcare tasks, expose how badly methods that ignore label-prior and feature shift degrade across sites. Recent work on personalization and clustered federation lets each silo keep a locally adapted head while sharing a common body, and prevalence-aware and label-shift-corrected aggregation directly targets the $\pi_k$ skew quantified above. The frontier question is no longer only how to aggregate gradients privately, the privacy machinery of Chapter 35, but how to make heterogeneous, imperfectly harmonized silos train one model that is fair and calibrated at each of them.
A frequent finding in real OMOP conversions is a patient cohort with an implausible number of birthdays recorded on January 1, because a legacy system stored an unknown date of birth as the start of the year. The harmonization map dutifully imports it, and a naive age feature suddenly reports a suspicious New Year's Day cluster. The model does not know it is a placeholder; it learns that being born on January 1 is mildly prognostic. Every common data model accumulates a small museum of such default values, and reading them is half of what site validation is for.
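A site-side check for this kind of placeholder can be very small. The sketch below flags any calendar day whose birthday count far exceeds what a uniform spread would give; the helper name `flag_default_birthdays`, the threshold `ratio=10.0`, and the toy cohort are all assumptions made for illustration, and a real validation suite would tune the threshold to the cohort.

```python
from collections import Counter
from datetime import date


def flag_default_birthdays(birth_dates, ratio=10.0):
    """Return (month, day) pairs whose count exceeds `ratio` times the uniform expectation."""
    counts = Counter((d.month, d.day) for d in birth_dates)
    expected = len(birth_dates) / 365.25
    return sorted(md for md, c in counts.items() if c > ratio * expected)


# Toy cohort: one birthday on every day of a non-leap year, plus 40 placeholder records.
start = date(1970, 1, 1).toordinal()
spread = [date.fromordinal(start + i) for i in range(365)]
placeholders = [date(1950 + i, 1, 1) for i in range(40)]

suspicious = flag_default_birthdays(spread + placeholders)
print(suspicious)
```

Run inside the silo before any summary is reported, a check like this turns an invisible default value into a visible flag, which the site can then map to an explicit missing value.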
With the data reality fixed, the regime named (cross-silo), the modalities and their heterogeneity surveyed, the common data model in place, and the population skew quantified, the chapter can proceed to the part federation actually adds: training one model across these harmonized silos without moving their data. Section 37.3 first fixes the privacy constraints that govern what may cross each hospital boundary, and then Section 37.4 stands up the training loop itself, sending a single global model definition, valid everywhere precisely because of the harmonization built here, out to all $K$ hospitals and aggregating what comes back.
For each scenario, decide whether it is the cross-silo regime of this chapter or the cross-device regime of Section 34.6, and name the one property from Table 37.2.1 that decides it: (a) a wearable-ECG study spanning two million consumer smartwatches; (b) a consortium of eight cancer centers training a shared tumor-segmentation model; (c) a pharmacy chain learning from point-of-sale terminals in ten thousand stores. For each, state which dominant heterogeneity, system noise and dropout, or schema and population skew, you would budget the most engineering effort to handle, and why.
Modify Code 37.2.2 so that exactly one mmol/L site silently skips its unit conversion (treat its values as if already mg/dL). Recompute the federated size-weighted mean and compare it to the oracle pooled mean from the correctly harmonized rows. Report the absolute error in mg/dL and argue why the federation server, seeing only the five per-site means, cannot detect the defect. Then add a simple per-site sanity check, a plausible physiological range for fasting glucose, that the site itself could run before reporting, and show it flags the offending silo.
Using the per-site priors $\pi_k$ and weights $w_k$ from Output 37.2.2, suppose a classifier is trained to minimize population-weighted error and therefore predicts near the global prior $\pi = 0.156$. Estimate qualitatively how its false-negative rate would behave at the Coastal Diabetes Center ($\pi_k = 0.41$) versus the rural network ($\pi_k = 0.07$). Explain why a single pooled accuracy or AUC can look acceptable while the model is clinically unsafe at the high-prevalence site, and connect your answer to why Chapter 5 insists on per-site reporting for heterogeneous federations.