Part III: Distributed Machine Learning
Chapter 14: Federated and Decentralized Learning

Personalized Federated Learning

"They averaged a thousand of us into one model and called it consensus. It fit nobody, least of all the night-shift phone in Jakarta that only ever sees keyboards in the dark."

A Global Model That Pleased No Client
Big Picture

When clients hold genuinely different data, a single averaged model can be the wrong target altogether: the goal shifts from one model that fits everyone to a model tailored to each client that still borrows strength from all of them. Under the strong non-IID conditions of Section 14.4, the FedAvg solution sits in a compromise no client occupies, accurate on average and poor for each individual. Personalized federated learning keeps the shared training that makes federation worthwhile, then bends the result toward each client's own distribution. This section is the bias-variance story of federation: a purely local model has no bias toward others but ruinous variance on scarce on-device data; a purely global model has low variance but a bias that drags every client toward the crowd. Personalization is the family of methods that live between those two failures, and the same global-versus-local tension reappears when we drop the server entirely in Section 14.8 and push learning to the edge in Section 14.9.

The previous sections built a single global model and worked to defend it: FedAvg averaged client updates (Section 14.3), non-IID skew threatened its convergence (Section 14.4), communication budgets constrained how often it could be refreshed (Section 14.5), and secure aggregation protected the updates on the way in (Section 14.6). All of that assumed one model is the right answer. Now we question the assumption. If your thousand phones split into a group that types in English and a group that types in Hindi, no shared keyboard model serves both well, and forcing one is a modeling error, not a privacy or systems problem. Personalized federated learning accepts that each client may need its own model while insisting that those models be learned together, so a client with little data still benefits from everyone else's.

1. When One Global Model Is the Wrong Goal Beginner

Federation, as built so far, minimizes the average loss across clients with one shared parameter vector $w$. Write client $k$'s local risk as $F_k(w)$ and its share of the data as $p_k$; FedAvg targets

$$w^\star = \arg\min_w \sum_{k=1}^{K} p_k \, F_k(w).$$

When every $F_k$ is minimized at roughly the same place, this single $w^\star$ is excellent for all of them, and there is nothing to personalize. The trouble is the non-IID regime, where the client minimizers $w_k^\star = \arg\min_w F_k(w)$ scatter. A point that minimizes the weighted average can then sit far from every individual optimum, fitting the centroid of the clients rather than any client. The averaged keyboard predicts a blend of English and Hindi that neither typist wants. This is not a bug in FedAvg; it is FedAvg correctly solving a problem whose answer no longer matches what each client needs.
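A toy calculation shows how far the averaged minimizer can sit from every client. The sketch below assumes quadratic local risks $F_k(w) = \lVert w - c_k \rVert^2$ with made-up two-dimensional optima `centers` and equal shares `p`; none of these values come from a real deployment. For quadratics of this form, the minimizer of $\sum_k p_k F_k(w)$ is simply the weighted mean of the $c_k$.

```python
import numpy as np

# Hypothetical client optima c_k: two clients near (+3, 0), two near (-3, 0).
centers = np.array([[3.0, 0.0], [3.2, 0.1], [-3.0, 0.0], [-3.1, -0.1]])
p = np.array([0.25, 0.25, 0.25, 0.25])   # data shares p_k, summing to 1

def F(k, w):
    """Assumed local risk F_k(w) = ||w - c_k||^2, minimized at w = c_k."""
    return float(np.sum((w - centers[k]) ** 2))

# For these quadratic risks, argmin_w sum_k p_k F_k(w) is the weighted mean.
w_star = p @ centers
print("w* =", w_star)

for k in range(len(centers)):
    print(f"client {k}: F_k(w*) = {F(k, w_star):.3f}   F_k(w_k*) = {F(k, centers[k]):.3f}")
```

Every client's own optimum has zero risk in this toy, while the shared $w^\star$ lands near the origin, roughly three units from all four of them: correct for the average objective and a poor fit for each individual.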

The honest alternative at the other extreme is to give up on sharing: let client $k$ train only on its own data and keep $w_k$ entirely private. That removes all bias toward other clients, but on a phone that has typed a few hundred words, the local minimizer is estimated from so little data that it is dominated by noise. The model overfits its own scarce, noisy sample and generalizes poorly to the same client's future inputs. Personalization is the search for a middle point: a per-client model that inherits the shared structure all clients have in common, then adjusts for what makes the client different.

Key Insight: Personalization Is a Bias-Variance Trade, Not a Privacy Trick

The purely global model is low-variance and high-bias: it is estimated from everyone's data, so it is stable, but it is biased toward a population the individual client is not. The purely local model is high-variance and low-bias: it is unbiased for the client but estimated from a handful of noisy examples. Personalized federated learning borrows strength from other clients (cutting variance) without being dragged all the way to their distribution (controlling bias). Every method in this section is a different knob on exactly this trade, which is why "how non-IID are the clients?" and "how much data does each client have?" are the two questions that decide whether personalization helps.
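A stylized scalar calculation makes the trade concrete. It is an illustrative assumption rather than a model used elsewhere in this section: client $k$ wants to estimate a scalar $\theta_k$ from $n_k$ observations with noise variance $\sigma^2$, the global estimate $\theta_g$ is treated as fixed with bias $b_k = \theta_g - \theta_k$, and the client mixes the two with a weight $\alpha \in [0, 1]$ on its unbiased local sample mean $\hat\theta_{\text{local}}$. Because the local mean is unbiased and $\theta_g$ is fixed, the cross term vanishes and the expected squared error of the mixture is

$$\mathbb{E}\Big[\big(\alpha \, \hat\theta_{\text{local}} + (1-\alpha)\, \theta_g - \theta_k\big)^2\Big] = \alpha^2 \, \frac{\sigma^2}{n_k} + (1-\alpha)^2 \, b_k^2,$$

which is minimized at

$$\alpha^\star = \frac{b_k^2}{b_k^2 + \sigma^2 / n_k}.$$

Under these assumptions the two deciding questions appear as the two terms of $\alpha^\star$: a client that is far from the crowd (large $b_k^2$) or rich in data (small $\sigma^2 / n_k$) should trust itself, while a typical client with little data should lean on the global estimate.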

2. Five Ways to Personalize Intermediate

The literature offers a handful of recurring strategies; they differ in how much of the model is shared and where the personalization enters. The simplest, and a startlingly strong baseline, is to fine-tune the finished global model locally: run a few steps of gradient descent on the client's own data starting from $w^\star$. The global model supplies a good initialization estimated from all clients, and a short local pass bends it toward the individual without enough steps to overfit. This is the method we measure in Section 4, and it often recovers most of the achievable gain for a few lines of code.

Meta-learning makes that initialization explicit. Instead of training $w$ to be good as-is, Per-FedAvg (Fallah et al., 2020) trains $w$ so that one or a few gradient steps from $w$ are good on each client, the federated cousin of MAML. The server optimizes for adaptability rather than immediate accuracy, producing a launch point from which every client reaches its own model quickly. Multi-task and clustered federated learning take a structural view: if clients fall into groups (the English typists, the Hindi typists), discover the groups and train one model per cluster, so each client shares strength only with its similar peers rather than the whole population. Clustered FL (Sattler et al., 2020) recursively splits clients whose update directions disagree.
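The clustering idea can be sketched in a few lines. The example below is a simplified, single bipartition step in the spirit of clustered FL, not the published algorithm's exact splitting criterion: it assumes six synthetic linear-regression clients (three following `+base`, three following `-base`, both invented for illustration), computes one update per client from a shared starting point, and groups clients by whether their update points the same way as client 0's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 500
base = 2.0 * np.ones(d)
# Hypothetical population: clients 0-2 follow +base, clients 3-5 follow -base.
true_w = [base, base, base, -base, -base, -base]

def local_update(w, w_true):
    """One gradient step on a client's least-squares loss, returned as a delta."""
    X = rng.standard_normal((n, d))
    y = X @ w_true + rng.standard_normal(n)
    grad = (2.0 / n) * X.T @ (X @ w - y)
    return -0.1 * grad

w_shared = np.zeros(d)
updates = np.stack([local_update(w_shared, wt) for wt in true_w])

# Cosine similarity between every pair of client updates.
unit = updates / np.linalg.norm(updates, axis=1, keepdims=True)
cos = unit @ unit.T

# Simplified split: 1 if the update agrees in direction with client 0, else 0.
cluster = (cos[0] > 0).astype(int)
print(np.round(cos, 2))
print("cluster labels:", cluster)
```

Clients whose updates agree end up in the same group and would then train their own shared model; a production scheme would also need a rule for when to stop splitting and how to place a client it has never seen.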

Parameter decoupling splits the model itself. A shared global backbone (the lower layers, a feature extractor) is federated across all clients, while a small personalized head (the final layer or two) is kept and trained locally on each device and never averaged. FedPer and FedRep build on this idea: the representation is common because good features are universal, the classifier is private because the decision boundary is client-specific. The diagram in Figure 14.7.1 shows this split. Finally, interpolation methods keep both a global and a local model and mix them, predicting with $\alpha \, w_{\text{local}} + (1-\alpha)\, w_{\text{global}}$ or regularizing the local model to stay near the global one. The mixing weight $\alpha$ is the bias-variance knob made literal, tuned per client from how much its data justifies trusting itself over the crowd.

[Figure 14.7.1 diagram. Title: "Shared backbone, federated; personalized heads, kept local". Head A (client 1), Head B (client 2), and Head C (client 3) are local only and never leave the device. The shared backbone (feature extractor) has identical weights on every client and is averaged each round; the server all-reduces the backbone only.]
Figure 14.7.1: Parameter decoupling, the structural form of personalization. The lower backbone (blue) is shared and synchronized with the server by all-reduce, the same collective built in Chapter 4, while each client's head (orange) is trained only on local data and never transmitted. Good features are assumed universal, so they are pooled; the decision boundary is assumed client-specific, so it stays private. The green path shows that only the backbone reaches the server, which also shrinks the per-round upload compared to sending the whole model.
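The split in Figure 14.7.1 can be sketched on a toy linear model. This is a minimal illustration under stated assumptions, not FedPer or FedRep as published: each client's prediction is assumed to be $X B h_k$ with a shared feature map $B$ and a private head $h_k$, the data are synthetic, and the names `B_true`, `heads_true`, and `uploads` are invented for the example. The only point is the bookkeeping: the backbone is averaged, the heads are not.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, r, n = 4, 8, 2, 60            # clients, input dim, feature dim, points per client
lr = 0.03

# Assumed data model: every client shares one feature map but has its own head.
B_true = rng.standard_normal((d, r)) / np.sqrt(d)
heads_true = rng.standard_normal((K, r))
data = []
for k in range(K):
    X = rng.standard_normal((n, d))
    data.append((X, X @ B_true @ heads_true[k] + 0.1 * rng.standard_normal(n)))

def mse(B, h, X, y):
    res = X @ B @ h - y
    return float(res @ res / len(y))

B = 0.1 * rng.standard_normal((d, r))      # shared backbone, held by the server
heads = [np.zeros(r) for _ in range(K)]    # personalized heads, one per device
start = np.mean([mse(B, heads[k], *data[k]) for k in range(K)])

for _ in range(300):                       # communication rounds
    uploads = []
    for k, (X, y) in enumerate(data):
        Bk, hk = B.copy(), heads[k].copy()
        for _ in range(5):                 # local steps on backbone and head
            res = X @ Bk @ hk - y
            grad_h = (2.0 / n) * (X @ Bk).T @ res
            grad_B = (2.0 / n) * np.outer(X.T @ res, hk)
            hk -= lr * grad_h
            Bk -= lr * grad_B
        heads[k] = hk                      # stays on the device
        uploads.append(Bk)                 # only the backbone is uploaded
    B = np.mean(uploads, axis=0)           # server averages backbones only

end = np.mean([mse(B, heads[k], *data[k]) for k in range(K)])
print(f"mean train MSE: {start:.3f} -> {end:.3f}")
```

Each round uploads only the `d * r` backbone entries per client; the `r` head entries never appear in `uploads`, which is the privacy and bandwidth argument of the caption in miniature.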

3. The Cost of Personalizing: Generalizing to New Clients Intermediate

Personalization is not free, and its price is paid by clients who were not in the training population. A single global model has a clear story for a brand-new device: hand it $w^\star$ and it works at average quality on the first request, before the device has produced a single gradient. A clustered or decoupled scheme has no head for a client it has never seen, and a meta-learned initialization is good only after the new client takes its adaptation steps, which requires that the client has some data and the compute to use it. The more sharply a method tailors to seen clients, the less it has to say about unseen ones; this is the generalization-to-new-clients axis along which personalization methods are fairly compared, and it is why meta-learning, which yields a deployable initialization, is often preferred over per-cluster models when the client population churns.

There is a systems cost too. Decoupling, interpolation, and clustering mean the system now carries per-client state (heads and mixing weights on each device, cluster assignments on the server), turning a stateless aggregation into something closer to the per-client model store of a recommendation system, an idea we meet again with sharded embeddings in Chapter 11. The reward for these costs is concrete, and the next section measures it: on non-IID clients with scarce data, a global-then-fine-tuned model beats both the shared global model and the purely local model on mean and worst-client error at the same time.

4. Personalization Versus Global and Local, Measured Intermediate

The code below builds the bias-variance picture concretely with pure Python and NumPy. Six clients share a large common linear concept and add a small per-client tilt, the mild non-IID setting where a backbone is worth sharing but no single model is exactly right for anyone. Each client holds only fourteen noisy training examples, far too few to trust alone, and four hundred held-out test examples. We train a global model by FedAvg, then produce a personalized model per client by fine-tuning that global model on the client's own fourteen points, and compare both against a purely local model trained from scratch.

import numpy as np

rng = np.random.default_rng(7)
K, d = 6, 12                  # clients, features
n_per, n_test = 14, 400       # FEW noisy train points per client, many test points
lr, noise = 0.03, 1.0

# Clients share a large COMMON backbone and add a small per-client tilt -> mildly
# non-IID. The shared part dominates, so borrowing strength pays off, but the
# tilt means no single global model is exactly right for any one client.
common = 2.0 * rng.standard_normal(d)
client_w = []
for k in range(K):
    offset = np.zeros(d)
    offset[k % d] = 1.5 * (1 if k % 2 == 0 else -1)   # client-specific tilt
    client_w.append(common + offset)

def make(k, n):
    X = rng.standard_normal((n, d))
    return X, X @ client_w[k] + noise * rng.standard_normal(n)

train = [make(k, n_per) for k in range(K)]
test  = [make(k, n_test) for k in range(K)]
grad = lambda w, X, y: (2.0 / len(y)) * (X.T @ (X @ w - y))
mse  = lambda w, X, y: float((X @ w - y) @ (X @ w - y) / len(y))

# FedAvg: each round, every client takes a few local steps; the server averages.
w_global = np.zeros(d)
for _ in range(400):
    updated = []
    for k in range(K):
        wk = w_global.copy()
        for _ in range(5):
            wk -= lr * grad(wk, *train[k])
        updated.append(wk)
    w_global = np.mean(updated, axis=0)          # the single shared model

global_test = [mse(w_global, *test[k]) for k in range(K)]

# Personalized: start from the global model, fine-tune on each client's own data.
pers_test = []
for k in range(K):
    wk = w_global.copy()
    for _ in range(200):
        wk -= lr * grad(wk, *train[k])
    pers_test.append(mse(wk, *test[k]))

# Purely local: train from scratch, no shared learning at all.
local_test = []
for k in range(K):
    wk = np.zeros(d)
    for _ in range(200):
        wk -= lr * grad(wk, *train[k])
    local_test.append(mse(wk, *test[k]))

print("client |  global  |  local   | personalized")
for k in range(K):
    print(f"   {k}   | {global_test[k]:8.3f} | {local_test[k]:8.3f} | {pers_test[k]:11.3f}")
print(f"  mean | {np.mean(global_test):8.3f} | {np.mean(local_test):8.3f} | {np.mean(pers_test):11.3f}")
print(f"  worst| {np.max(global_test):8.3f} | {np.max(local_test):8.3f} | {np.max(pers_test):11.3f}")
Code 14.7.1: The simplest personalization, global-then-fine-tune, against the global and purely-local baselines on six mildly non-IID clients with scarce local data. Test mean-squared error is reported per client and aggregated; lower is better.
client |  global  |  local   | personalized
   0   |    3.763 |    4.314 |       1.913
   1   |    3.154 |    3.699 |       3.505
   2   |    3.014 |    4.031 |       1.622
   3   |    2.463 |    2.026 |       2.447
   4   |    2.546 |    5.203 |       2.402
   5   |    4.186 |    7.982 |       3.112
  mean |    3.188 |    4.543 |       2.500
  worst|    4.186 |    7.982 |       3.505
Output 14.7.1: Personalization wins on both summaries. Its mean error (2.500) beats the global model (3.188), which is biased toward a centroid no client occupies, and crushes the purely local model (4.543), which overfits its fourteen noisy points; the worst-client error drops from 7.982 (local) to 3.505 (personalized). The personalized model inherits the shared backbone, then bends toward each client.

Three numbers tell the whole story. The global model's mean error (3.188) is the bias floor: estimated from everyone, it is stable but tuned to a population centroid, so it never fits a client tightly. The local model's mean error (4.543), and especially its worst-client error (7.982), is the variance ceiling: with fourteen noisy points a client cannot estimate its own model reliably and overfits. Personalization (mean 2.500, worst 3.505) beats both at once, exactly the bias-variance middle the key insight promised. The win is in aggregate rather than uniform: client 1 is slightly better off with the global model (3.154 versus 3.505) and client 3 with its purely local model (2.026 versus 2.447), a reminder that with fourteen points per client the individual outcomes remain noisy. Personalization achieves this for the price of a short local fine-tune, no new server protocol, and no extra communication beyond the FedAvg rounds that produced the global model. That is why fine-tuning is the baseline every more elaborate method must beat.

Library Shortcut: Flower Personalizes With a Fit-Then-Evaluate Hook

In Code 14.7.1 we hand-rolled the FedAvg loop and the per-client fine-tune. The flwr (Flower) framework runs the federation for you and lets you express personalization by overriding two client methods: fit starts from the global parameters and returns the update that enters the average, while evaluate receives the same global parameters, continues training locally, and only then scores the model. The global-then-fine-tune baseline therefore needs no change on the server.

# pip install flwr ; the server runs standard FedAvg, unchanged
import flwr as fl

class PersonalizedClient(fl.client.NumPyClient):
    # `model` is a placeholder wrapper (not part of Flower) that is assumed to
    # provide set_weights, get_weights, train, and test.
    def __init__(self, model, local_data, local_test):
        self.model = model
        self.local_data = local_data
        self.local_test = local_test
        self.n = len(local_data)                       # example count reported to the server

    def fit(self, global_params, config):
        self.model.set_weights(global_params)         # start from the shared model
        self.model.train(self.local_data, epochs=1)   # contribute to the global average
        return self.model.get_weights(), self.n, {}

    def evaluate(self, global_params, config):
        self.model.set_weights(global_params)
        self.model.train(self.local_data, epochs=2)    # FINE-TUNE locally, then score
        loss, acc = self.model.test(self.local_test)   # per-client personalized metric
        return loss, self.n, {"personalized_accuracy": acc}
Code 14.7.2: The same global-then-fine-tune personalization as Code 14.7.1, now as a Flower client. The roughly forty lines of manual FedAvg and fine-tuning collapse to two short methods; Flower handles client sampling, parameter transport, and metric collection, and can be configured to add secure aggregation (Section 14.6). Frameworks such as FedML and NVIDIA FLARE expose the same fit-then-personalize pattern.

Practical Example: The Keyboard That Stopped Averaging Two Languages Into One

Who: A mobile machine learning team running federated next-word prediction across forty million phones.

Situation: A single global keyboard model was trained with FedAvg and shipped to every device, the design from Section 14.3.

Problem: Users in bilingual regions typed a code-switched mix the averaged model handled poorly, and monolingual users in either language saw suggestions skewed toward the other; average accuracy looked fine while many individuals were badly served.

Dilemma: Ship per-language models (clustered FL), which needed reliable language detection and left brand-new phones with no model, or fine-tune the one global model on each device, which kept a universal fallback but added per-client training and state.

Decision: They kept one federated backbone and personalized by on-device fine-tuning, because a new phone could still launch on the global model and adapt within a day of typing, matching the new-client generalization argument from Section 3.

How: The lower layers stayed federated and averaged each round; the final projection was fine-tuned locally each night on the device's own keystrokes and never uploaded, the parameter-decoupling split of Figure 14.7.1.

Result: Per-user top-3 suggestion accuracy rose most for the bilingual and minority-language users whom the global model had served worst, while average accuracy held and new devices still worked on day one, the same worst-client improvement seen in Output 14.7.1.

Lesson: When the population splits, personalize the head and federate the backbone; the global model stays as the cold-start fallback while each client gets a model that fits it.

5. The Tension That Outlives the Server Advanced

The global-versus-local tension this section navigates is not specific to having a central server. It is the same conflict that organizes the rest of the chapter and the federated thread of the book. In decentralized learning (Section 14.8), there is no server to hold a global model at all; clients average only with gossip neighbors, and "how much should I trust my neighbors versus my own data?" is precisely the interpolation knob $\alpha$ of Section 2, now set by the mixing matrix of the communication graph. At the edge (Section 14.9), on-device compute and intermittent connectivity make local adaptation cheap and global synchronization expensive, tilting the same trade toward more personalization. The federated arc continues into Chapter 34 on federated edge learning and culminates in the federated medical case study of Chapter 37, where every hospital is a client whose patient distribution differs and a personalized model per site is not a luxury but a clinical requirement.
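The link between the mixing matrix and the knob $\alpha$ can be previewed with a small gossip sketch. It assumes a ring of six nodes that each hold one scalar parameter, and a helper `ring_mixing_matrix` written for this illustration (it is not a library function); `self_weight` plays the role of trust in one's own value, and the remainder is split between the two ring neighbors.

```python
import numpy as np

def ring_mixing_matrix(n, self_weight):
    """Doubly stochastic mixing matrix on a ring: each node keeps `self_weight`
    of its own value and splits the rest equally between its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = self_weight
        W[i, (i - 1) % n] += (1 - self_weight) / 2
        W[i, (i + 1) % n] += (1 - self_weight) / 2
    return W

x0 = np.array([4.0, 0.0, -2.0, 6.0, 1.0, 3.0])   # one scalar parameter per node
spreads = {}
for self_weight in (0.9, 0.5):
    W = ring_mixing_matrix(len(x0), self_weight)
    x = x0.copy()
    for _ in range(10):                           # ten gossip rounds
        x = W @ x
    spreads[self_weight] = float(x.max() - x.min())
    print(f"self_weight={self_weight}: values={np.round(x, 3)}, spread={spreads[self_weight]:.3f}")
```

With a high self-weight the nodes stay closer to their own starting values after the same number of rounds; with a lower one they collapse toward the network average faster. In both cases the mean across nodes is preserved, so the graph decides how quickly agreement arrives, not what it agrees on.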

Thesis Thread: Distributing Intelligence Without Forcing It to Agree

The book's spine is that intelligence is distributed across machines that coordinate to act as one. Personalized federation refines that spine: the machines coordinate without being forced into a single shared answer. The shared backbone still travels through the same all-reduce that synchronizes data-parallel gradients in Chapter 15, so the collective primitive is unchanged, but what each node keeps after the collective is now partly its own. This is distribution with heterogeneity as a first-class goal rather than an obstacle, and it is the conceptual bridge to the genuinely peer-to-peer learning of the next section, where no node holds the canonical model.

Research Frontier: Personalization Meets Foundation Models (2024 to 2026)

The sharpest recent shift is that the shared backbone is now often a pretrained foundation model, and personalization is parameter-efficient fine-tuning of it. Federated low-rank adaptation, in the lineage of FedLoRA and the heterogeneous-rank variants HetLoRA and FlexLoRA (2024), keeps the pretrained backbone frozen while each client trains a tiny LoRA adapter, so the personalized part is a few megabytes rather than a full model and the communication cost collapses. A parallel line revisits whether personalization should touch the representation at all: FedRep-style representation-then-head alternation and the pFedHN hypernetwork approach (which generates each client's weights from a learned embedding) are being scaled to thousands of clients with sharper generalization-to-new-clients guarantees. There is also active work on the bias-variance knob itself, learning the interpolation weight $\alpha$ per client from local data statistics rather than tuning it globally, and on benchmarking personalization fairly under churn, since a method that shines on seen clients can fail the cold-start test of Section 3. The open question for 2025 and beyond is how much of a billion-parameter model truly needs to be client-specific; early evidence says startlingly little, which is good news for both privacy and bandwidth.
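The size argument behind low-rank adapters is easy to check by hand. The sketch below assumes a single 512-by-512 layer and rank 8, sizes chosen only for illustration and not taken from any particular model; `W0`, `A`, and `B` follow the usual LoRA naming, with `B` initialized to zero so the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8                 # illustrative sizes only

W0 = rng.standard_normal((d_out, d_in))         # frozen pretrained weight, never trained
A = 0.01 * rng.standard_normal((rank, d_in))    # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init: adapter starts as a no-op

def forward(x):
    """Adapted layer: frozen weight plus the low-rank correction B @ A."""
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
full_params = W0.size
adapter_params = A.size + B.size
print(f"full layer: {full_params} parameters, adapter: {adapter_params} "
      f"({100 * adapter_params / full_params:.2f}% of the layer)")
```

Only `A` and `B` would be trained or exchanged in such a scheme, so the per-client payload scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.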

Fun Note: The Average Person Has 1.9 Children and Zero Patience for the Average Model

The averaged keyboard is the statistical cousin of the famous average household with 1.9 children: a number that describes the population perfectly and no actual family at all. FedAvg's global model is the 1.9-children of machine learning, correct in aggregate and faintly absurd for any individual. Personalization is the polite admission that real clients, like real families, come in whole numbers and refuse to round themselves off to please the server.

We now have the personalization toolkit (fine-tune, meta-learn, cluster, decouple, interpolate), the bias-variance reason it works, the new-client cost it incurs, and a measured win for its simplest form. The same global-versus-local trade is about to outlive the server entirely: the next section removes the central aggregator and lets clients reach agreement by talking only to neighbors. That decentralized world begins in Section 14.8.

Exercise 14.7.1: Which Method for Which Population? Conceptual

For each deployment, choose one personalization strategy from Section 2 (fine-tune, meta-learning, clustered FL, parameter decoupling, interpolation) and justify it against the generalization-to-new-clients cost in Section 3: (a) a medical imaging model across ten hospitals whose scanners differ but whose patient populations overlap; (b) a smart-keyboard fleet where new phones join hourly and must work immediately; (c) a fraud detector for merchants that fall into a few clear business types (restaurants, gas stations, online stores). State explicitly which clients each choice serves worst.

Exercise 14.7.2: Tune the Interpolation Knob Coding

Extend Code 14.7.1 with an interpolation model that predicts using $w(\alpha) = \alpha \, w_{\text{local}} + (1-\alpha)\, w_{\text{global}}$, where $w_{\text{local}}$ is the purely local fit and $w_{\text{global}}$ is the FedAvg model. For each client, sweep $\alpha$ over $\{0, 0.1, \dots, 1\}$ and plot test error against $\alpha$. Confirm that the best $\alpha$ differs across clients and lands strictly between 0 and 1, then relate the client with the highest best-$\alpha$ to how much its tilt offset deviates from the common backbone. Explain why a single global $\alpha$ would underserve some clients.

Exercise 14.7.3: The Data-Quantity Threshold Analysis

In Code 14.7.1, sweep n_per (the per-client training-set size) from 5 up to 2000 while holding everything else fixed, and record the mean test error of the global, local, and personalized models at each setting. Identify the crossover where the purely local model overtakes the global model, and the (different) point where personalization's advantage over the purely local model becomes negligible. Argue from these two thresholds when personalization is worth its systems cost and when a plain global model or a plain local model is the right call instead.