"They gave me four megabytes, a battery that hates me, and a network that visits twice a day. So I learned to change my mind one number at a time."
A Phone That Fine-Tunes Itself Overnight
On-device learning pushes federated learning to its physical limit: the model trains on the very device that generated the data, under budgets of memory, energy, and connectivity so tight that a full gradient step is a luxury you cannot afford. Everything Chapter 14 built (data that cannot move, FedAvg, non-IID clients, communication frugality, personalization) reappears here multiplied by hardware scarcity. The response is to make every update cheap: freeze the bulk of the model and fine-tune only a small adapter or the last few layers, quantize and sparsify what little you send, and split the heavy computation between the device and the cloud. This section shows, with a runnable demo, that adapting a single well-chosen parameter on the device can personalize a model better than retraining all of it, while shipping a few hundred times less. With that, the on-device frontier closes Chapter 14 and Part III, the part where we learned to distribute the learning itself.
In Section 14.7 we personalized a federated model so that each client kept a head tuned to its own distribution, and in Section 14.8 we removed the central server entirely with gossip averaging. Both still assumed the client was a reasonably capable machine: a hospital workstation, a bank's compute node, a laptop. This section drops that assumption. The client is now a phone, a watch, a doorbell camera, a hearing aid, a microcontroller in a thermostat. It holds a few megabytes of working memory, runs on a battery it must not drain, and sees the network in brief opportunistic windows when it is charging on Wi-Fi at night. The data it owns is the most private and most personal in the whole system, and it never leaves. The question of this section is how a model learns at all under those constraints.
The honest starting point is that ordinary training does not fit. A single backward pass over a modern network stores activations for every layer, which alone can dwarf a microcontroller's entire memory. Optimizer state (momentum, second moments) doubles or triples the parameter footprint. A full gradient upload, even compressed, can exceed a metered data plan. So on-device learning is not "the same training, smaller". It is a different regime in which the update must be made structurally cheap: small in the number of trained parameters, small in the bytes communicated, and small in the energy spent, all at once.
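The accounting behind that claim can be sketched in a few lines. This is a rough illustration, not a measurement: the function name `training_bytes`, the model size, and the activation counts are all hypothetical values chosen only to show how the terms add up.

```python
def training_bytes(n_params, n_trainable, n_stored_activations,
                   bytes_per_value=4, optimizer_slots=2):
    """Rough peak memory for one training step, in bytes.

    optimizer_slots counts extra values kept per trainable parameter:
    0 for plain SGD, 1 for momentum, 2 for Adam-style first and second moments.
    """
    weights = n_params * bytes_per_value
    gradients = n_trainable * bytes_per_value
    optimizer_state = n_trainable * optimizer_slots * bytes_per_value
    activations = n_stored_activations * bytes_per_value
    return weights + gradients + optimizer_state + activations

# Hypothetical model: one million float32 weights and 200,000 activation
# values that a full backward pass must keep around.
full = training_bytes(1_000_000, 1_000_000, 200_000)

# Same model with a frozen backbone and a 1,000-parameter adapter near the
# output, so only about 2,000 activation values are stored (an assumption).
adapter = training_bytes(1_000_000, 1_000, 2_000)

print(f"full training : {full / 1e6:.2f} MB")
print(f"adapter only  : {adapter / 1e6:.2f} MB")
```

Under these assumed numbers the full step needs about 16.8 MB and the adapter step about 4.0 MB. Almost all of the adapter figure is the forward-only weights, which is the part quantization can shrink further.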
1. The Three Budgets of the Extreme Edge (Beginner)
Three budgets bind a learner on the extreme edge, and naming them separately keeps the engineering honest because each has its own remedy. The first is the compute-and-memory budget: the device cannot hold the activations and optimizer state of a full backward pass, so we must train far fewer parameters than the model has. The second is the energy budget: training is the most power-hungry thing a battery device can do, so updates must run rarely, briefly, and only when power is plentiful (typically while charging). The third is the connectivity budget: the link is intermittent, slow, and metered, so whatever leaves the device must be tiny, and the device must tolerate going offline for long stretches between contributions.
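The energy and connectivity budgets often reduce to a gate in front of the training loop. The sketch below is one hypothetical way to express that gate; `DeviceState`, `may_train`, `may_upload`, `nightly_tick`, and the 80% battery threshold are illustrative names and values, not part of any real platform API.

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    idle: bool
    charging: bool
    battery_pct: int
    unmetered_link: bool

def may_train(state, min_battery_pct=80):
    """Energy budget: train only when idle, on the charger, with a healthy battery."""
    return state.idle and state.charging and state.battery_pct >= min_battery_pct

def may_upload(state):
    """Connectivity budget: send only over an unmetered link."""
    return state.unmetered_link

def nightly_tick(state, pending, train_step):
    """Make one scheduling decision; return the deltas uploaded this tick."""
    if may_train(state):
        pending.append(train_step())  # compute a small delta locally
    if may_upload(state) and pending:
        sent = list(pending)          # flush everything queued while offline
        pending.clear()
        return sent
    return []

pending = []
offline = DeviceState(idle=True, charging=True, battery_pct=95, unmetered_link=False)
online = DeviceState(idle=True, charging=True, battery_pct=95, unmetered_link=True)
on_battery = DeviceState(idle=True, charging=False, battery_pct=95, unmetered_link=True)

first = nightly_tick(offline, pending, train_step=lambda: 0.1)     # trains, queues
second = nightly_tick(online, pending, train_step=lambda: 0.2)     # trains, sends both
third = nightly_tick(on_battery, pending, train_step=lambda: 0.3)  # skips training
```

Queuing deltas while offline is what lets the device tolerate long gaps between contributions: training and uploading are decided independently, each against its own budget.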
These budgets compose into a single design rule: make the update small along every axis at once. The dominant technique is partial fine-tuning. Freeze the large pretrained backbone, which the device only needs to run forward, and learn a small set of new parameters: an adapter module, a low-rank correction, a bias-only update, or just the final classification layer. Because only those few parameters have gradients, the backward pass stores far fewer activations (almost none when the trainable parameters sit at or near the output), the optimizer state is negligible, and the upload is a handful of numbers rather than millions. This is the same insight that drives parameter-efficient fine-tuning of large language models, applied here because the device forces it rather than because the cloud prefers it.
A device cannot afford to retrain a model, but it can afford to nudge a few parameters. Freezing the backbone and training only a small adapter collapses all three edge budgets at once: the frozen layers need no gradients (so almost no extra memory), the few trainable parameters cost almost no energy to update, and the resulting delta is small enough to upload over an intermittent link. The model the user runs is still large and capable; the part that learns on the device is deliberately tiny. Personalization at the edge is the art of choosing which small part that is.
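To get a feel for how much each partial fine-tuning option shrinks the trainable set, the counts can be worked out for a small fully connected network. The layer widths and the rank below are hypothetical, and the low-rank count assumes one rank-`r` correction per weight matrix, which is only one of several ways to place adapters.

```python
widths = [512, 256, 128, 10]                 # hypothetical layer widths
layers = list(zip(widths[:-1], widths[1:]))  # (inputs, outputs) per layer

def count_full(layers):
    """Every weight and bias is trainable."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)

def count_last_layer(layers):
    """Only the final layer's weights and biases are trainable."""
    n_in, n_out = layers[-1]
    return n_in * n_out + n_out

def count_bias_only(layers):
    """Weights frozen everywhere; only the bias vectors are trainable."""
    return sum(n_out for _, n_out in layers)

def count_low_rank(layers, rank):
    """Each weight matrix gets a rank-r correction B @ A with r * (in + out) values."""
    return sum(rank * (n_in + n_out) for n_in, n_out in layers)

counts = {
    "full fine-tuning": count_full(layers),
    "last layer only": count_last_layer(layers),
    "bias only": count_bias_only(layers),
    "low-rank (r=2)": count_low_rank(layers, rank=2),
}
for name, n in counts.items():
    print(f"{name:18s} {n:7d} trainable values")
```

For this toy network the full update trains 165,514 values, while the three partial options train 1,290, 394, and 2,580 respectively. Which small set is the right one depends on where the personal variation lives, not only on the count.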
2. On-Device Personalization: A Tiny Adapter Beats a Full Update (Intermediate)
Section 14.7 personalized a federated model by giving each client its own head; here we push that idea onto the device and pay attention to its cost. Suppose a frozen base model predicts a target from a feature vector, and a particular user's data deviates from the population in one consistent direction (they hold their phone differently, they speak with an accent, their typing has a personal rhythm). We can write the personalized weights as the frozen base plus a small correction,
$$w_{\text{user}} = w_{\text{base}} + a \cdot d,$$
where $d$ is an adapter direction the server ships once and freezes, and $a$ is a single scalar the device learns from its own data. Only $a$ is trained on the device, and only $a$ is uploaded. Contrast this with the full-update alternative, which retrains all $D$ weights from the user's tiny local dataset. The demo below runs both on the same synthetic user, whose targets follow $w_{\text{base}}$ plus a rank-one personal bias, and reports both the personalization quality (loss on the user's held-out data) and the number of values each method must ship.
import math
import random

random.seed(0)
D = 512  # a large frozen base model

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

base_w = [random.gauss(0, 1) for _ in range(D)]    # frozen, shared by all devices
user_dir = [random.gauss(0, 1) for _ in range(D)]  # this user's personal direction
nrm = math.sqrt(dot(user_dir, user_dir))
user_dir = [x / nrm for x in user_dir]             # unit vector
user_scale = 2.3                                   # strength of the personal bias

def make(n):
    X, y = [], []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in range(D)]
        X.append(x)
        y.append(dot(base_w, x) + user_scale * dot(user_dir, x))  # base + personal bias
    return X, y

Xtr, ytr = make(60)   # the device owns only a tiny local dataset
Xte, yte = make(400)  # the user's held-out personal data

def mse(w, X, y):
    return sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / len(X)

base_loss = mse(base_w, Xte, yte)
lr, steps = 0.02, 4000

# (A) FULL update: gradient descent on all D weights from 60 examples.
wf = list(base_w)
for _ in range(steps):
    g = [0.0] * D
    for x, t in zip(Xtr, ytr):
        e = dot(wf, x) - t
        for j in range(D):
            g[j] += 2 * e * x[j]
    for j in range(D):
        wf[j] -= lr * g[j] / len(Xtr)
full_loss, full_ship = mse(wf, Xte, yte), D  # ships all D weights

# (B) ADAPTER: backbone AND direction d frozen; learn ONLY the scalar a.
d, a = list(user_dir), 0.0  # d is server-provided, not uploaded
for _ in range(steps):
    ga = 0.0
    for x, t in zip(Xtr, ytr):
        e = dot(base_w, x) + a * dot(d, x) - t
        ga += 2 * e * dot(d, x)
    a -= lr * ga / len(Xtr)
wa = [base_w[j] + a * d[j] for j in range(D)]
adapter_loss, adapter_ship = mse(wa, Xte, yte), 1  # ships ONE number

gap = base_loss  # the targets are noise-free, so the best achievable loss is 0
print("base (no personalization) loss :", f"{base_loss:8.3f}")
print("full-model update loss :", f"{full_loss:8.3f}", " shipped numbers:", full_ship)
print("rank-1 adapter loss :", f"{adapter_loss:8.3f}", " shipped numbers:", adapter_ship)
print()
print("loss recovered by full update :", f"{(gap - full_loss) / gap * 100:5.1f}% of the gap closed")
print("loss recovered by adapter :", f"{(gap - adapter_loss) / gap * 100:5.1f}% of the gap closed")
print("upload-size reduction :", f"{full_ship / adapter_ship:.0f}x smaller")
base (no personalization) loss :    6.266
full-model update loss :    5.461  shipped numbers: 512
rank-1 adapter loss :    0.000  shipped numbers: 1

loss recovered by full update :  12.8% of the gap closed
loss recovered by adapter : 100.0% of the gap closed
upload-size reduction : 512x smaller
The result is sharper than "comparable personalization at a fraction of the cost": on the tiny dataset a device actually owns, the structured adapter generalizes far better than the full update, because retraining all $512$ weights from $60$ examples is hopelessly underdetermined and overfits, while the one-parameter adapter has exactly the right inductive bias and converges immediately. The lesson is not that adapters are magic; it is that on the edge the scarcity of data conspires with the scarcity of compute and bandwidth to make a small, well-structured update the right answer along every axis at once. Output 14.9.1 is the whole section in three numbers: full personalization quality, one trained parameter, a $512\times$ smaller upload.
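A short derivation may help explain why both numbers come out as they do. It is a reading aid under the demo's own assumptions (noise-free targets, isotropic Gaussian inputs, and $d$ equal to the user's true direction); the symbols $z_i$, $r_i$, $n$, $\delta$, and $P$ are introduced here for convenience. Write $z_i = d^\top x_i$ for the adapter feature and $r_i = y_i - w_{\text{base}}^\top x_i$ for the residual the frozen base leaves on training example $i$. The adapter loss is then a one-dimensional least-squares problem:

$$\frac{1}{n}\sum_{i=1}^{n}\bigl(a\,z_i - r_i\bigr)^2 \quad\Longrightarrow\quad a^\star = \frac{\sum_i r_i z_i}{\sum_i z_i^2}.$$

In the demo $r_i = 2.3\,z_i$ exactly, so $a^\star = 2.3$ and the held-out loss is zero. For the full update, each gradient is a linear combination of the training inputs, so gradient descent started at $w_{\text{base}}$ can only move within their span. Assuming it runs to convergence, it reaches $w_{\text{base}} + P\delta$, where $\delta = 2.3\,d$ is the true personal offset and $P$ is the orthogonal projector onto the span of the $n = 60$ training inputs. For isotropic Gaussian inputs,

$$\mathbb{E}\,\frac{\|P\delta\|^2}{\|\delta\|^2} = \frac{n}{D} = \frac{60}{512} \approx 11.7\%, \qquad \text{held-out loss} \approx \|\delta - P\delta\|^2 = \|\delta\|^2 - \|P\delta\|^2.$$

Since the base model's expected held-out loss is $\|\delta\|^2$, the full update is expected to close roughly $n/D \approx 11.7\%$ of the gap, which is in line with the $12.8\%$ measured on this particular sample.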
The next-word suggestions on a smartphone keyboard are the most widely deployed on-device learners on Earth. They adapt to your private vocabulary, your friends' names, the way you abbreviate, all from text that never leaves the phone, and they contribute only tiny aggregated updates when the device is idle, charging, and on Wi-Fi. Billions of devices each train a sliver of a model overnight and forget the raw keystrokes by morning. It is federated learning you have been using for years without noticing, which is exactly the point.
3. Split Computing: Share the Work With the Cloud (Intermediate)
Sometimes even a frozen forward pass is too much for the device, or the useful gradient lives in layers the device cannot hold. Split computing addresses this by cutting the model at a layer boundary: the device runs the first few layers on its raw, private input and sends only the intermediate activation (a compact feature, not the raw data) to the cloud, which runs the rest and, during training, computes the gradient and sends back only what the device needs to update its small on-device part. The raw data never leaves the device, the heavy layers run where there is power and memory, and the link carries a thin activation rather than an image or an audio clip. Figure 14.9.1 shows both the on-device adapter pattern and this device-cloud split.
Split computing trades communication for on-device compute, and the right cut point depends on the same kind of balance we drew throughout Part III: cut too early and the activation is large and the device does too little useful work; cut too late and the device cannot hold its share. The forward-looking treatment of edge and fog architectures, where this split generalizes into multi-tier device-fog-cloud pipelines with caching and offloading policies, lives in Chapter 34; here we only need it as one more lever for fitting learning into a device-sized budget.
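The forward half of this pattern can be sketched with a toy network. Everything here is illustrative: the hourglass `widths`, the ReLU layers, and the helper names `device_forward` and `cloud_forward` are assumptions made for the example, and the training-time gradient exchange is not shown.

```python
import random

random.seed(0)
widths = [64, 32, 8, 32, 4]  # hypothetical hourglass-shaped network

def make_layer(n_in, n_out):
    """Random weight matrix with n_out rows of n_in values each."""
    return [[random.gauss(0, 1 / n_in ** 0.5) for _ in range(n_in)]
            for _ in range(n_out)]

layers = [make_layer(i, o) for i, o in zip(widths[:-1], widths[1:])]

def run(some_layers, x):
    """Apply each layer as x -> ReLU(W x)."""
    for W in some_layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]
    return x

def device_forward(x, cut):
    """Device side: run the first `cut` layers on the raw private input."""
    return run(layers[:cut], x)

def cloud_forward(activation, cut):
    """Cloud side: finish the forward pass from the received activation."""
    return run(layers[cut:], activation)

x = [random.gauss(0, 1) for _ in range(widths[0])]  # raw input stays on the device
reference = run(layers, x)                          # unsplit forward pass

for cut in range(1, len(layers)):
    activation = device_forward(x, cut)
    output = cloud_forward(activation, cut)
    assert output == reference  # splitting changes where work runs, not the result
    print(f"cut after layer {cut}: link carries {len(activation)} values "
          f"(raw input has {len(x)})")
```

With these widths, cutting after the second layer sends only 8 values per inference, at the price of running two layers on the device. Cutting after the first or third layer sends 32 values, which illustrates why the narrowest point the device can afford to reach is usually the interesting candidate.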
Who: A firmware engineer at a hearing-aid maker adding noise-suppression personalization.
Situation: Each wearer's ear, environment, and hearing loss differ, and the shared model sounded generic in the wearer's specific noisy settings (their kitchen, their car).
Problem: Uploading the wearer's audio to retrain in the cloud was a non-starter for privacy and for the device's tiny radio and battery; the chip held a few hundred kilobytes free.
Dilemma: Ship a generic model and accept mediocre fit for everyone, or attempt on-device learning on a microcontroller that cannot run a full backward pass over the suppression network.
Decision: They froze the suppression backbone and trained only a small per-wearer adapter (a handful of gain-and-filter parameters) on the device, during nightly charging, exactly the pattern of Code 14.9.1.
How: The backbone ran forward in fixed-point; gradients flowed only into the adapter, so peak memory stayed within budget; updates ran for a few minutes while charging and never touched the radio except to sync a tiny anonymized delta.
Result: Wearers reported markedly better speech clarity in their own frequent environments, the raw audio never left the device, and battery life was unaffected because training happened only on the charger.
Lesson: The edge does not forbid learning; it forbids large learning. Choose the smallest parameter set that captures the personal variation and train only that, when power is free.
4. Compression Is the Per-Device Budget Made Concrete (Intermediate)
Everything in this section eventually reduces to fitting a model and its updates into a fixed per-device budget, which is precisely the subject of model compression: quantization to low-precision integers, pruning away unneeded weights, and knowledge distillation into a smaller student. On the edge these are not optional refinements; the device runs the backbone forward in quantized arithmetic because that is the only way it fits, trains in low precision to save energy, and sparsifies and quantizes the adapter delta before uploading it so the intermittent link can carry it. The communication-frugality techniques of Section 14.5 (sparsification, quantization of the update) are the same tools, applied now under even harder limits. The systematic treatment of per-node compression and low-precision inference, the scale-up techniques that make the per-device budget achievable in the first place, is the explicit prerequisite developed in Chapter 22; on-device learning is one of its most demanding customers.
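As a concrete illustration of sparsifying and quantizing a delta before upload, the sketch below keeps the largest-magnitude entries and stores them as 8-bit integers with one shared scale. The helper names, the delta itself, the choice of `k`, and the wire format (a float32 scale plus a 16-bit index and 8-bit value per entry) are all assumptions made for the example, not a format the text prescribes.

```python
import random

def top_k_sparsify(delta, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    return [(i, delta[i]) for i in sorted(order[:k])]

def quantize_int8(values):
    """Symmetric uniform quantization to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return scale, [round(v / scale) for v in values]

def dequantize(scale, quantized):
    """Server side: map the integers back to approximate float values."""
    return [scale * q for q in quantized]

random.seed(0)
delta = [random.gauss(0, 0.01) for _ in range(1000)]  # hypothetical adapter delta

kept = top_k_sparsify(delta, k=50)
scale, quantized = quantize_int8([v for _, v in kept])
restored = dequantize(scale, quantized)

dense_bytes = 4 * len(delta)            # float32 per value, nothing dropped
packed_bytes = 4 + len(kept) * (2 + 1)  # scale + (uint16 index, int8 value) each
worst_error = max(abs(v - r) for (_, v), r in zip(kept, restored))

print(f"dense upload : {dense_bytes} bytes")
print(f"packed upload: {packed_bytes} bytes")
print(f"worst per-entry rounding error: {worst_error:.2e} (scale = {scale:.2e})")
```

Under these assumptions the upload drops from 4,000 bytes to 154. The rounding error on each kept entry is bounded by half the scale, while the dropped entries are lost entirely unless the device accumulates them for a later round.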
Code 14.9.1 trained an adapter by hand to expose the mechanics. In practice an on-device personalization layer drops onto a frozen backbone with a few lines, and the framework handles freezing, the parameter-efficient module, and the low-precision math the chip needs:
# Sketch in PyTorch style. LoRA, backbone_with, quantize, loss_fn, and on_device_stream
# stand in for framework- or application-specific pieces; they are placeholders here.
# Freeze the backbone; attach a tiny low-rank adapter; train only that on-device.
for p in backbone.parameters():
    p.requires_grad = False  # frozen: no gradients, no optimizer state
adapter = LoRA(rank=1, target=backbone.head)          # the only trainable parameters
opt = torch.optim.SGD(adapter.parameters(), lr=0.02)  # optimizer sees the adapter only
for x, y in on_device_stream:  # the user's private data, never uploaded
    loss = loss_fn(backbone_with(adapter)(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()
delta = quantize(adapter.state_dict())  # sparse, low-precision upload only
Two fast-moving lines are pushing real training onto small devices. The first is on-device fine-tuning of language models: parameter-efficient methods (LoRA and quantized variants such as QLoRA, Dettmers et al., 2023) make it feasible to adapt a multi-billion-parameter model by training only a few million adapter weights, and 2024 to 2026 work brings these to phones and laptops so that a personal model learns from local data without the prompt ever reaching a server. The second is TinyML training, training (not just inference) on microcontrollers with kilobytes of RAM: the lineage of MCUNet and on-device training under 256 KB memory (Lin et al., 2022) uses sparse gradient updates, quantized backpropagation, and activation recomputation to fit a backward pass into a budget that once held only a forward pass. Both lines treat the device's memory and energy as the hard optimization constraint and the trained parameter count as the knob, which is exactly the trade-off Output 14.9.1 made concrete with a single scalar.
5. Chapter 14 in One View, and the Close of Part III (Beginner)
This section ends Chapter 14, and Chapter 14 ends Part III, so it is worth seeing the whole arc at once. The chapter began from a constraint the rest of the book never imposed: the data cannot move. Hospitals, banks, and phones hold data that legal, competitive, or privacy limits keep in place, so we learned to bring the training to the data instead of the data to the training. From that single premise the whole chapter unfolded.
We distinguished the two regimes (cross-device, with millions of unreliable phones each holding a little data, and cross-silo, with a few reliable organizations each holding a lot). We built FedAvg, which runs several local steps on each client and averages the resulting models, and we saw why non-IID client data makes that average drift and how to fight the drift. We treated communication as the dominant cost and made updates frugal by reducing rounds and compressing what each round sends. We protected the updates with secure aggregation and the differential-privacy ideas that Chapter 35 develops in full. We personalized the shared model so each client kept what made it different, removed the central server entirely with decentralized gossip averaging, and finally, in this section, pushed the whole apparatus onto the device itself.
Chapter 14. Federated and decentralized learning is what you do when the data cannot move. The core moves: split clients into cross-device (many, small, unreliable) and cross-silo (few, large, reliable); train locally and combine with FedAvg; fight the model drift that non-IID data causes; treat communication as the scarce resource and make every round cheap; protect contributions with secure aggregation and differential privacy; personalize the shared model so clients keep their differences; drop the central server with decentralized gossip; and at the extreme edge, train only a tiny adapter on the device itself.
Part III. The whole of Part III was about distributing the learning, not merely the data of Part II. We distributed the optimization itself (Chapter 10), sharded model parameters and embeddings across a parameter server (Chapter 11), scaled classical and graph machine learning across a cluster (Chapters 12 and 13), and learned without centralizing the data at all (Chapter 14). The unifying thread is that the algorithm, not just the dataset, is split across machines that must communicate to act as one learner, and the same collective from Chapter 4 (sum the partial results, share the answer) sits underneath every method, from synchronous SGD to FedAvg's model averaging to gossip.
Part III honored the book's thesis by scaling out the act of learning: many workers, silos, or devices each compute a partial update, and a combine step (all-reduce, FedAvg averaging, or gossip) fuses them into one coherent learner. Throughout, we assumed each worker could still hold the whole model. Part IV breaks that last assumption. When a single model has hundreds of billions of parameters, no one device can hold it, so we must shard the model itself across machines and route activations and gradients between the shards. The combine step does not vanish; it specializes into reduce-scatter, all-gather, and all-to-all. The learner we just finished distributing becomes, in Part IV, a model too large to live in one place.
1. Cross-silo FL with differential privacy. Simulate three to five silos with non-IID splits of a tabular or image dataset. Implement FedAvg, then add per-client gradient clipping and Gaussian noise to give a differential-privacy guarantee, and chart the accuracy-versus-privacy trade-off as you vary the noise scale and the number of local epochs. Compare against a single centralized model trained on the pooled data, which serves as the upper bound and shows how much accuracy the privacy constraint costs you.
2. On-device adapter versus full fine-tuning. Extend Code 14.9.1 to a small real model (a compact image classifier or a tiny language model). On a per-user data split, compare full fine-tuning, last-layer-only, and a low-rank adapter on three axes: held-out personalization accuracy, trainable parameter count, and upload bytes after quantization. Reproduce the qualitative shape of Output 14.9.1 and report the cut point where full fine-tuning starts to win as you add local data.
3. Split-computing latency and privacy study. Take a layered network and sweep the device-cloud cut point. For each cut, measure the activation size (the upload per inference), the on-device compute share, and a simple reconstruction-difficulty proxy for how much the raw input leaks through the activation. Recommend a cut point that balances communication, on-device load, and privacy, and connect your numbers to the edge-fog architectures of Chapter 34.
For each device, state which of the three edge budgets from Section 1 (compute-and-memory, energy, connectivity) binds hardest, and what on-device learning strategy you would choose as a result: (a) a smart watch that is charged nightly but has a fast Bluetooth link to a paired phone; (b) a remote agricultural sensor on solar power with a once-a-day satellite uplink; (c) a flagship phone with abundant memory and compute but a strict metered cellular data plan. Explain why a strategy that fits one device would waste the wrong budget on another.
In Code 14.9.1 the one-scalar adapter beats the full update because only 60 local examples are available. Sweep the local dataset size from 60 up to several thousand examples and plot the held-out loss of both methods against it. Find the dataset size at which the full update catches up to the adapter, and explain in terms of degrees of freedom why the crossover happens there. Then add a second, unrelated personal direction to the user's data and show how the rank-one adapter's advantage shrinks when the personal variation no longer fits its single direction.
A device runs the first $k$ of $L$ equal-cost layers locally and sends the activation at layer $k$ to the cloud. Suppose the activation at layer $k$ has size $s_k$ bytes and the link moves $B$ bytes per second, while the device computes one layer in time $c$. Write the per-inference time as a function of $k$ (device compute plus activation upload plus cloud compute), and find the $k$ that minimizes it when $s_k$ first shrinks and then grows across the network (an hourglass shape). Argue qualitatively how your optimal cut moves if the link $B$ gets slower, and connect this to the communication-cost reasoning of Chapter 10.