Section 14.1: Motivation for Federated Learning

"For thirteen chapters they shuffled me wherever the compute was. Then a hospital said the data stays put, and for once the model had to come visit."
A Shard That Was Never Allowed to Leave Home

Big Picture

Every distributed method so far began with one central dataset that we were free to shard across workers; federated learning removes that freedom and inverts the whole arrangement. The data is born distributed across millions of phones, dozens of hospitals, or a handful of banks, and it cannot be collected into one place for legal, privacy, or bandwidth reasons. So we stop moving the data to the training and start moving the training to the data: a coordinator sends the current model to each participant, each participant trains locally on data that never leaves its premises, and only model updates travel back to be combined. This section establishes why this constraint is real and increasingly common, shows the basic protocol, and runs a small demonstration that local training plus periodic averaging can recover a centrally trained model without any record ever crossing a boundary.

Through all of Part II and the optimization chapters of Part III, one assumption was so natural that we rarely stated it: the training data sits in a store we control, and we may copy it, partition it, and ship the shards to whichever workers have spare compute. Data parallelism, the parameter server, and the distributed solvers of Chapter 10 all take a central corpus and spread it out for speed. The deciding factor was always where the compute was; the data went to meet it. Federated learning begins by denying that this is allowed. There are settings where the data cannot be centralized, whether for legal, privacy, or bandwidth reasons, and in those settings the entire flow of the system reverses.

Consider three of them. A keyboard app would like to improve next-word prediction from what people actually type, but the text people type is among the most private data they own, and uploading it would be both a privacy disaster and a bandwidth burden across billions of devices. A consortium of hospitals would like to train a diagnostic model on far more patient scans than any one of them holds, but patient records are bound by regulation (HIPAA in the United States, GDPR in Europe) that sharply restricts pooling them into a shared bucket. Several banks would like a stronger fraud model trained on the union of their transaction histories, but that union is precisely the competitive and regulatory asset none of them may export. In each case the data is valuable, it is distributed by birth, and it is immovable.

Key Insight: The Data Cannot Move, So the Computation Must

Earlier chapters distributed work because one machine ran out of memory or throughput; the data was always ours to relocate. Federated learning distributes work because of a constraint on the data itself: privacy, regulation, or bandwidth makes centralizing it impossible. The remedy is a clean inversion. Instead of bringing data to a central trainer, you bring the model to wherever the data already lives, train locally, and return only what was learned. Communication carries model updates, never raw records, and that single rule is what makes the whole approach acceptable where pooling would not be.

1. From Sharding a Central Corpus to Training In Place (Beginner)

The federated setup has three moving parts and one strict rule. The parts are a coordinator (often called the server), a population of clients (phones, hospital servers, bank data centers), and a shared model. The rule is that raw data never leaves a client. A round of training proceeds as a loop. The coordinator broadcasts the current global model to a set of clients. Each selected client trains that model on its own local data for several steps of ordinary stochastic gradient descent, producing a locally improved model. Each client sends back only its update (the new weights, or equivalently the change in weights), and the coordinator combines these updates, typically by averaging them, into a new global model. The loop repeats until the global model converges.
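The round described above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a production protocol: the names `local_train` and `run_round`, the five synthetic clients, and the choice of three selected clients per round are all hypothetical. The sketch also checks the claim that returning new weights and returning weight changes are equivalent ways to report an update.

```python
import numpy as np

def local_train(w, X, y, lr=0.05, steps=5):
    """Hypothetical client routine: a few full-batch gradient steps on local data."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * (2.0 / len(y)) * (X.T @ (X @ w - y))
    return w

def run_round(w_global, clients, rng, n_selected=3):
    """One round: select clients, broadcast, train locally, combine by averaging."""
    chosen = rng.choice(len(clients), size=n_selected, replace=False)
    new_weights = [local_train(w_global, *clients[k]) for k in chosen]
    deltas = [w_k - w_global for w_k in new_weights]   # the "change in weights" form
    by_weights = np.mean(new_weights, axis=0)          # average the returned models
    by_deltas = w_global + np.mean(deltas, axis=0)     # or apply the averaged change
    return by_weights, by_deltas

rng = np.random.default_rng(0)
w_true = rng.standard_normal(4)
clients = []                                           # each entry is one client's private (X, y)
for _ in range(5):
    X = rng.standard_normal((50, 4))
    clients.append((X, X @ w_true + 0.1 * rng.standard_normal(50)))

w = np.zeros(4)
for _ in range(20):
    w, w_alt = run_round(w, clients, rng)
    assert np.allclose(w, w_alt)                       # both reporting forms agree

print("distance to w_true:", np.linalg.norm(w - w_true))
```

Only weight vectors cross the boundary between `local_train` and `run_round`; the coordinator-side code never indexes into a client's `X` or `y` except to hand them to that client's own routine.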

This is local-update SGD wearing a privacy constraint. We already met the idea that workers can take several local steps between communication rounds to save network traffic; in the communication-efficient and local-SGD material of Section 10.7 it was an optimization to reduce the cost of the combine step. Federated learning reuses exactly that machinery but for a different reason: here the local steps are not merely a bandwidth saving, they are mandatory, because the only way a client can contribute is to compute on data the coordinator is never permitted to see. The averaging step that fuses the client models is the federated descendant of the all-reduce that opened this book, except that what is averaged is whole models trained in isolation rather than single-step gradients. Figure 14.1.1 shows the round in full.

Figure 14.1.1: One round of federated learning. The coordinator broadcasts the global model down to the selected clients (solid arrows). Each client trains it for several steps on private data that stays inside the green box, then returns only the resulting model update (dashed arrows). The coordinator averages the returned updates into a new global model and the round repeats. Compare this to Figure 1.1.1, where the data was ours to shard and ship; here the data is pinned in place and the model does the traveling.
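A small numerical check makes the link to gradient averaging concrete. The sketch below assumes equal-size shards and a least-squares loss, both chosen for illustration, and all names in it are hypothetical. With a single local step, averaging the client models reproduces an averaged-gradient step exactly; with ten local steps the two procedures part ways slightly, which is the sense in which whole models rather than single-step gradients are being averaged.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, C = 1200, 5, 4
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)
shards = np.array_split(np.arange(N), C)       # four equal-size shards (assumption)
lr = 0.05

def grad(w, Xs, ys):
    return (2.0 / len(ys)) * (Xs.T @ (Xs @ w - ys))

def local(w, Xs, ys, steps):
    """Run `steps` full-batch gradient steps starting from w."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w, Xs, ys)
    return w

w0 = rng.standard_normal(d)

# (a) all-reduce style: average the shard gradients, then take one step
w_allreduce = w0 - lr * np.mean([grad(w0, X[i], y[i]) for i in shards], axis=0)
# (b) FedAvg style with ONE local step: step on each shard, then average the models
w_fed1 = np.mean([local(w0, X[i], y[i], 1) for i in shards], axis=0)
# (c) a single centralized step on the pooled data
w_central1 = local(w0, X, y, 1)

# With TEN local steps, model averaging no longer equals ten centralized steps
w_fed10 = np.mean([local(w0, X[i], y[i], 10) for i in shards], axis=0)
w_central10 = local(w0, X, y, 10)

gap1 = np.linalg.norm(w_fed1 - w_central1)
gap10 = np.linalg.norm(w_fed10 - w_central10)
print("1 local step   gap:", gap1)    # zero up to floating-point rounding
print("10 local steps gap:", gap10)   # small but nonzero
```

The ten-step gap is nonzero even though the shards here are drawn from the same distribution, because each client follows the curvature of its own sample between averages.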

2. Why This Matters Now (Beginner)

Federated learning is an old idea made urgent by three converging pressures. The first is law. Privacy regulation has tightened worldwide over the past decade, and the cost of mishandling personal data has risen from an engineering inconvenience to a board-level liability; "do not centralize the data" is increasingly a hard legal boundary rather than a courtesy. The second is where the data now lives. Smartphones, wearables, cars, and clinical instruments generate enormous volumes of data at the edge, far from any datacenter, and shipping all of it upstream is often impossible on the available uplink and undesirable for the people it describes. The third is the appetite for cross-institution collaboration: hospitals, banks, and manufacturers each hold a slice of a problem and would all benefit from a model trained on the union, if only there were a way to train on that union without forming it.

Federated learning is the way. It lets a model learn from data it is never allowed to see, which is exactly the situation that privacy law, edge data gravity, and competitive boundaries keep producing. This is why a technique first deployed at scale for phone keyboards around 2017 has since spread to medical imaging, finance, and industrial sensing, and why it now anchors a whole chapter of this book. The on-device and edge side of the story continues in Chapter 34, and a full clinical deployment, with all of its regulatory teeth, is the subject of the federated medical AI case study in Chapter 37.

Fun Note: The Model Goes on Tour

For thirteen chapters the data was the celebrity and the model was furniture: we shipped terabytes around the cluster and the model sat wherever it was convenient. Federated learning flips the billing. Now the model is the touring act that visits each venue, plays a short set on the local crowd, and moves on, while the data never leaves its hometown. The coordinator is less a warehouse and more a booking agent that collects the night's takings (the updates) and plans the next leg of the tour.

3. A Model Trained Without Moving the Data (Intermediate)

It is fair to suspect that training in isolated pieces and averaging the results must give up something against training on the pooled data. For data that is identically distributed across clients, that suspicion is largely unfounded, and a small experiment shows why the basic recipe works before we spend later sections on the cases where it strains. We set up a linear-regression problem, split the examples across ten clients, and compare two models. The first is a central baseline trained on all the data at once, the model we would build if pooling were allowed. The second is a federated model produced by the protocol of Figure 14.1.1: each round, every client copies the global weights, runs several local gradient steps on its own shard, and returns only the updated weights, which the coordinator averages. This averaging of locally trained models is the algorithm known as federated averaging, or FedAvg, which Section 14.3 develops in full. No client's data is ever read by the coordinator or by any other client.

import numpy as np

rng = np.random.default_rng(0)
N, d, C = 60_000, 20, 10          # total examples, features, clients
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(N)

# Split the data across C clients. It NEVER leaves the client.
client_idx = np.array_split(rng.permutation(N), C)

def grad(w, Xs, ys):
    return (2.0 / len(ys)) * (Xs.T @ (Xs @ w - ys))

# --- Baseline: one machine sees ALL the data (the answer we want to match). ---
w_central = np.zeros(d)
lr, steps = 0.05, 300
for _ in range(steps):
    w_central -= lr * grad(w_central, X, y)

# --- Federated: coordinator broadcasts w, each client runs LOCAL updates on
#     its own private data, only the updated weight vector returns, the
#     coordinator AVERAGES them (FedAvg). No raw data is ever shared. ---
w_fed = np.zeros(d)
rounds, local_steps = 30, 10
for r in range(rounds):
    client_models = []
    for idx in client_idx:                       # each client, in isolation
        wk = w_fed.copy()                        # receive the global model
        Xs, ys = X[idx], y[idx]                  # private local shard
        for _ in range(local_steps):             # several LOCAL update steps
            wk -= lr * grad(wk, Xs, ys)
        client_models.append(wk)                 # send back ONLY the model
    w_fed = np.mean(client_models, axis=0)       # average the models (equal shards, so unweighted)

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

print("clients C                    :", C)
print("central MSE (all data)       :", f"{mse(w_central):.6f}")
print("federated MSE (no data move) :", f"{mse(w_fed):.6f}")
print("central vs federated  ||dw|| :", f"{np.linalg.norm(w_central - w_fed):.2e}")
print("federated vs true     ||dw|| :", f"{np.linalg.norm(w_fed - w_true):.2e}")

Code 14.1.1: Federated averaging from first principles. The baseline w_central trains on the pooled data; w_fed is built entirely from per-client local training and a coordinator-side average of whole models, with the loop structured so that each client only ever touches its own Xs, ys.

clients C                    : 10
central MSE (all data)       : 0.009953
federated MSE (no data move) : 0.009953
central vs federated  ||dw|| : 3.10e-05
federated vs true     ||dw|| : 1.47e-03

Output 14.1.1: The federated model matches the central model to the sixth decimal of mean-squared error and sits within $3.1 \times 10^{-5}$ of it in weight space, even though ten clients trained in isolation and the coordinator never saw a single record. Both land within $1.5 \times 10^{-3}$ of the true generating weights.

The two models are, for practical purposes, the same. The federated procedure reached a mean-squared error indistinguishable from the central baseline while every client trained alone on data the coordinator never read. Writing the global objective makes the target explicit. With $C$ clients, where client $k$ holds $n_k$ examples out of $N = \sum_k n_k$ and has local loss $F_k(w)$ (the mean per-example loss on its own data), federated training minimizes the size-weighted average

$$F(w) = \sum_{k=1}^{C} \frac{n_k}{N}\, F_k(w),$$

which is exactly the loss a central trainer would minimize on the pooled data: multiplying each client's mean loss by $n_k$ recovers its sum of per-example losses, and dividing the total by $N$ gives the pooled mean. The clients optimize this shared objective collaboratively, each contributing the gradient signal from its own slice, and the coordinator's average is what stitches those contributions into progress on $F$. (Code 14.1.1 takes a plain mean because its ten shards are equal in size, so every weight $n_k/N$ equals $1/C$.) The clean agreement in Output 14.1.1 rests on a quiet assumption that the next sections dismantle: here the data was shuffled uniformly across clients, so every client saw the same distribution. When clients hold systematically different data, local models pull in different directions and the simple average drifts from the central answer. That is the core difficulty, previewed in subsection 4 below and treated in full in Section 14.4.
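The identity between the size-weighted objective and the pooled loss can be checked numerically. The sketch below assumes a squared-error loss and three hypothetical clients with deliberately unequal sizes (50, 200, and 750 examples); the helper names are illustrative. It also shows one way a coordinator might weight returned models by $n_k$ using `np.average`, which reduces to a plain mean when all shards are the same size. The local least-squares fits stand in for whatever each client's local training would return.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
sizes = [50, 200, 750]                      # hypothetical, deliberately unequal n_k
N = sum(sizes)
w_true = rng.standard_normal(d)

clients = []
for n in sizes:
    X = rng.standard_normal((n, d))
    clients.append((X, X @ w_true + 0.1 * rng.standard_normal(n)))

def F_k(w, X, y):
    """Local loss: mean squared error on one client's data."""
    return float(np.mean((X @ w - y) ** 2))

w = rng.standard_normal(d)                  # any candidate weight vector

# Size-weighted average of the local losses: sum_k (n_k / N) * F_k(w)
weighted = sum(len(y) / N * F_k(w, X, y) for X, y in clients)

# Loss on the pooled data (formed here only to verify the identity)
X_all = np.vstack([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
pooled = F_k(w, X_all, y_all)

# An unweighted mean of the local losses is a different quantity in general
unweighted = float(np.mean([F_k(w, X, y) for X, y in clients]))

print("weighted  :", weighted)
print("pooled    :", pooled)
print("unweighted:", unweighted)

# Size-weighted aggregation of client models (stand-ins: local least-squares fits)
client_models = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in clients]
w_agg = np.average(client_models, axis=0, weights=sizes)
```

Pooling appears in this sketch only as a verification step; in a real deployment the coordinator would know the counts $n_k$ but never hold `X_all`.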

Library Shortcut: Flower and TensorFlow Federated Run the Round for You

Code 14.1.1 hand-rolled the broadcast, the local loop, and the average in one process to expose the mechanism. Real federated systems run those clients on separate machines, handle dropouts and secure transport, and orchestrate the rounds for you. Flower (the flwr package) lets you keep your existing training code and supply two methods, one to train locally and one to report parameters, while it manages selection, communication, and aggregation across thousands of real clients:

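# pip install flwr ; the client wraps YOUR local training, data stays put
# model, local_data, n_local, set_weights, get_weights, and train_locally are
# placeholders for your own objects and helpers; they are not part of Flower.
import flwr as fl

class Client(fl.client.NumPyClient):
    def get_parameters(self, config):
        return get_weights(model)                # report current weights

    def fit(self, parameters, config):
        set_weights(model, parameters)           # receive global model
        train_locally(model, local_data)         # YOUR code, private data
        return get_weights(model), n_local, {}   # return ONLY weights

# coordinator side: FedAvg aggregation, no client data ever reaches here
# (entry points differ across Flower releases; check the docs for your version)
fl.server.start_server(
    strategy=fl.server.strategy.FedAvg(),
    config=fl.server.ServerConfig(num_rounds=30),
)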

Code 14.1.2: The same round as Code 14.1.1, now framework-driven. The hand-written broadcast-train-average loop collapses to a client class plus one start_server call; Flower (and TensorFlow Federated, for the same pattern) handles client selection, network transport, dropout, and the FedAvg aggregation. Helper names such as set_weights and train_locally stand in for your own code. The training code and the data never leave the client process.

4. What Federated Learning Makes Harder (Intermediate)

The clean result in Output 14.1.1 is the best case, and it is best because we engineered the data to be friendly. Bringing training to the data instead of the data to the training introduces three difficulties that datacenter training never had to face, and they are the reason the rest of this chapter exists. The first is non-IID data. In a datacenter we shuffle a central corpus so that every worker sees a representative sample; on real devices that is impossible. One phone's typing, one hospital's patient mix, and one bank's customer base are each systematically different from the others, so client objectives genuinely disagree and the average of locally trained models no longer lands where the central optimum sits. This is the dominant challenge of the field and gets its own treatment in Section 14.4.

The second difficulty is the clients themselves. Datacenter workers are uniform, always available, and connected by a fast interconnect. A federated population is the opposite: clients are heterogeneous and flaky, a phone participates only when it is idle, charging, and on unmetered Wi-Fi, and it may vanish mid-round. Uplink bandwidth is scarce and often metered, so the number of communication rounds and the size of each update become first-class constraints, which is why Section 14.5 is devoted to communication efficiency. The third difficulty is that sending model updates instead of raw data is necessary for privacy but not by itself sufficient, because updates can leak information about the data that produced them; closing that gap with secure aggregation and differential privacy is the subject of Section 14.6. That privacy thread runs straight through to the differentially private distributed training of Chapter 35.
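The idea behind secure aggregation, that a coordinator can learn the sum of the updates without seeing any single one, can be illustrated with pairwise masks that cancel. The sketch below is a toy under strong assumptions: every client stays online, the masks are plain random vectors generated in one process, and the arithmetic is floating point. Practical protocols derive masks from shared keys, work in modular integer arithmetic, and must cope with dropouts; none of that is modeled here, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_clients, d = 4, 6
updates = [rng.standard_normal(d) for _ in range(n_clients)]   # private client updates

# Each pair (i, j) with i < j agrees on a random mask.
# Client i adds it and client j subtracts it, so every mask cancels in the sum.
masked = [u.copy() for u in updates]
for i in range(n_clients):
    for j in range(i + 1, n_clients):
        mask = 100.0 * rng.standard_normal(d)   # large relative to the updates
        masked[i] += mask
        masked[j] -= mask

server_sum = np.sum(masked, axis=0)   # all the coordinator computes
true_sum = np.sum(updates, axis=0)    # shown only to verify the cancellation

print("max error in recovered sum:", np.max(np.abs(server_sum - true_sum)))
print("client 0 true update  :", np.round(updates[0], 2))
print("client 0 masked upload:", np.round(masked[0], 2))
```

Each individual upload looks like noise to the coordinator, yet the aggregate it needs for averaging is recovered up to rounding error.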

Practical Example: The Keyboard That Learned Without Reading What You Typed

Who: A mobile platform team responsible for the next-word suggestion model on a smartphone keyboard used by hundreds of millions of people.

Situation: Suggestion quality was mediocre because the public text the model was trained on did not match how real people type, with their slang, names, and abbreviations.

Problem: The data that would fix the model, what users actually type, is the single most sensitive stream on the device, and both policy and law forbade uploading it to a central trainer.

Dilemma: Ship a weak model trained on safe but unrepresentative public text, or find a way to learn from real typing without ever collecting it, accepting a far more complex training pipeline in exchange.

Decision: They trained federated. The model was sent to phones, each phone improved it overnight on that user's own recent typing while idle and charging, and only the model updates were returned and averaged.

How: A coordinator selected eligible devices per round, devices ran a few local SGD epochs in the same broadcast, train, and return pattern as Code 14.1.1, and updates were combined with secure aggregation so the server saw only the sum, never any individual contribution.

Result: Suggestion accuracy improved measurably from real usage while no keystroke ever left a device, and the system became a reference deployment for federated learning at scale.

Lesson: When the data cannot move, moving the model is not a compromise but the whole strategy; the engineering cost of rounds, selection, and secure aggregation buys access to data that would otherwise be entirely off limits.

Thesis Thread: The Same Averaging Primitive, A New Reason to Distribute

This book's spine is that AI at scale is the engineering of systems distributed across many machines, and every chapter so far distributed work to defeat a resource ceiling. Federated learning advances the thesis by adding a second, independent reason to distribute: the data is born apart and must stay apart. The mechanism is familiar: local computation fused by averaging, the descendant of the all-reduce of Chapter 1 and the local-SGD of Section 10.7. The motivation is new. Whenever you meet a constraint that pins data in place, remember that the distribution toolkit still applies: you simply send the model toward the data rather than the reverse.

Research Frontier: Federated Foundation Models and Internet-Scale Collaboration (2024 to 2026)

Two fronts are active right now. The first carries federated learning up to foundation-model scale, where broadcasting and returning full models is far too expensive, so the field federates only small adapters: parameter-efficient schemes such as federated LoRA tuning of large language models (work in the FedIT and related lineage, 2023 to 2025) keep the frozen base model on each client and exchange only low-rank updates, slashing per-round traffic by orders of magnitude. The second pushes toward genuinely cross-organization and over-the-internet collaboration, building on the local-update, communication-sparse training of DiLoCo (Douillard et al., 2024) and on production cross-device systems that now coordinate millions of phones per task. A persistent open problem across both is robustness, since clients you do not control can send poisoned updates, which motivates the Byzantine-robust aggregation we reach in Chapter 35. The takeaway for this section is that the basic loop of Figure 14.1.1 is being stretched, not replaced: send less, train more locally, and trust the participants less.
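The traffic saving from exchanging only low-rank adapters follows from simple parameter counting. The sketch below is back-of-the-envelope arithmetic under assumed numbers: a single 4096 by 4096 weight matrix and rank 8 are hypothetical values chosen for illustration, not figures from any cited system. A rank-$r$ adapter for a $d_{out} \times d_{in}$ matrix stores two factors with $r(d_{in} + d_{out})$ entries in total, against $d_{in} d_{out}$ for the full matrix.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Entries in a dense d_out x d_in weight matrix."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Entries in the two low-rank factors (d_out x rank and rank x d_in)."""
    return rank * (d_in + d_out)

d_in = d_out = 4096     # hypothetical layer width
rank = 8                # hypothetical adapter rank

full = full_params(d_in, d_out)
low = lora_params(d_in, d_out, rank)

print("full matrix entries :", full)         # 16777216
print("adapter entries     :", low)          # 65536
print("reduction factor    :", full // low)  # 256
```

Under these assumed sizes each round moves 256 times fewer values for that layer, and the ratio grows with layer width at fixed rank.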

We now have the inversion that defines this chapter (the data stays, the model travels), the basic protocol of broadcast, local training, and average, and a demonstration that it recovers the central answer when the data is friendly. The next section makes the first sharp distinction that organizes everything after it, between the cross-device regime of countless unreliable phones and the cross-silo regime of a few stable institutions, because almost every design choice in federated learning depends on which of those two worlds you are in. That split begins in Section 14.2.

Exercise 14.1.1: When Is Federated the Right Call? (Conceptual)

For each scenario, decide whether ordinary centralized distributed training (shard a central dataset across workers) or federated training (bring the model to immovable data) is appropriate, and name the specific pressure (privacy, regulation, bandwidth, or none) that decides it: (a) a streaming service training a recommendation model on its own server-side click logs; (b) a network of three cancer-research hospitals pooling rare-tumor scans they may not legally share; (c) a fitness-band maker wanting to improve heart-rate anomaly detection from data on tens of millions of wrists with limited uplink; (d) a research team training on a 5-terabyte public image corpus they downloaded once. Explain why choosing federated training where it is not needed would only add cost.

Exercise 14.1.2: Break It With Non-IID Data (Coding)

Modify Code 14.1.1 so the split is no longer uniform. Sort the examples by their target value $y$ before calling np.array_split, so each client holds a contiguous, systematically different slice of the label range (a crude non-IID partition). Rerun and report how federated MSE and the central-versus-federated weight distance change relative to the IID run. Then increase local_steps from 10 to 50 and observe whether more local work helps or hurts under this skew. Explain in two sentences why heavy local training can pull client models apart when their data distributions disagree, previewing the difficulty of Section 14.4.

Exercise 14.1.3: The Cost of a Round (Analysis)

Suppose the shared model has $P = 5 \times 10^{6}$ parameters at 4 bytes each, the coordinator runs one round per hour with 200 participating clients, and each client's uplink is a metered cellular link at 1 megabyte per second. Estimate the bytes a single client must upload per round and the seconds that upload takes, then the total bytes the coordinator ingests per round across all clients. Argue from these numbers alone why reducing the number of rounds and compressing each update (the agenda of Section 14.5) matters far more in the federated setting than in a datacenter with a fast interconnect, and contrast this with the bandwidth assumptions of the data-parallel training in Chapter 1.