Chapter 14: Federated and Decentralized Learning

"I never see the data. A million phones train in the dark and mail me their second thoughts; I average the whispers, send the model back, and hope nobody charged less than half a battery overnight."
A Federated Coordinator Who Has Learned to Trust the Average

Big Picture

Every earlier chapter of this book assumed you could move the data to the computation. Federated and decentralized learning train a shared model when you cannot: the data stays on the devices and silos that hold it, and only model updates are exchanged. That turns data heterogeneity, communication scarcity, and privacy from afterthoughts into the central design constraints. The shift is profound. In Chapter 10 a worker held a shard the scheduler handed it, and every shard was drawn from the same shuffled distribution, so a gradient averaged across workers was an unbiased estimate of the true gradient. Here the shards are the phones and the hospitals themselves, each with its own local distribution, its own availability, and its own privacy boundary, and none of it is shuffled or yours to rearrange. The chapter builds the response in layers. It opens with the motivation: the regulatory, bandwidth, and privacy pressures that make centralizing data impossible and federation the only path. It distinguishes the two regimes that dominate practice: cross-device federation across millions of unreliable phones and cross-silo federation across a handful of trusted institutions. On that footing it develops FedAvg, the local-SGD-and-average algorithm at the heart of the field, and the variants such as FedProx and SCAFFOLD that repair the damage that heterogeneous data does to it. It confronts non-IID data, the single deepest difficulty, head on, and then the communication constraints that cap how often updates can travel.
The second half turns to the constraints federation adds that no earlier chapter faced: privacy and secure aggregation, which let a server combine updates it is never allowed to read individually; personalized federated learning, which gives each participant a model fitted to its own distribution rather than one global compromise; decentralized learning, which removes the central server entirely and averages over a gossip topology; and edge and on-device learning, where the participant is a battery-limited device training in the field. The local-SGD intuition of Chapter 10 is the spine throughout, now run on data you cannot see, cannot move, and cannot assume is independent and identically distributed.
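
The local-SGD-and-average move can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the implementation developed in Section 14.3: the model is a least-squares regressor, every client participates in every round, local training uses full-batch gradient steps, and the names `local_sgd`, `fedavg_round`, and `make_clients` are hypothetical. Clients differ only in their feature distribution and share one true weight vector, so this toy converges cleanly; the harder label-skew cases belong to Section 14.4.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.05, epochs=5):
    """Run a few full-batch gradient steps on one client's least-squares loss."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5 * mean squared error
        w -= lr * grad
    return w

def fedavg_round(w_global, clients, lr=0.05, epochs=5):
    """One round: broadcast the model, train locally, average by sample count."""
    local_models = [local_sgd(w_global, X, y, lr, epochs) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    weights = sizes / sizes.sum()
    # Only the locally trained weights reach the server; X and y never do.
    return sum(a * w for a, w in zip(weights, local_models))

def make_clients(num_clients=4, dim=3, seed=0):
    """Toy clients (an assumption for illustration): shifted features, shared w_true."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)
    clients = []
    for k in range(num_clients):
        n_k = 20 + 10 * k                              # unequal data set sizes
        X = rng.normal(loc=0.5 * k, size=(n_k, dim))   # per-client feature shift
        clients.append((X, X @ w_true))
    return w_true, clients

if __name__ == "__main__":
    w_true, clients = make_clients()
    w = np.zeros_like(w_true)
    for _ in range(200):
        w = fedavg_round(w, clients)
    print("distance to w_true:", np.linalg.norm(w - w_true))
```

Weighting each client by its sample count follows the original FedAvg formulation. Setting `epochs=0` reduces a round to the identity, and `epochs=1` gives a single synchronous averaged step, which is one way to see FedAvg as local SGD with the communication interval turned up.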

Chapter Overview

This is the fifth and final chapter of Part III, and it closes the part by dropping the assumption that has quietly held through every distributed method so far: that you own the training data and may move it wherever the computation is. Chapter 10 ran distributed SGD over shards the scheduler scattered; Chapters 11 and 12 sharded parameters and classical models the same way; Chapter 13 spread a single graph across a cluster you controlled. In all of them the data was yours to co-locate. Federated learning begins where that ends. The data is generated and retained on the participants (phones, hospitals, banks, and edge sensors), and it must not leave them. The whole chapter is organized around training a shared model under that one binding constraint, and around the three difficulties the constraint creates: heterogeneous data, scarce communication, and privacy.

The nine sections fall into three groups. The first establishes the setting: Section 14.1 motivates why data cannot be centralized, and Section 14.2 separates the cross-device regime of millions of unreliable phones from the cross-silo regime of a few trusted institutions, because almost every later design choice depends on which one you are in. The second group builds the core algorithm and the difficulties it must survive: Section 14.3 develops FedAvg and its variants FedProx and SCAFFOLD, Section 14.4 confronts non-IID data as the central obstacle to convergence, and Section 14.5 treats the communication constraints that make each round expensive and rare. The third group adds the constraints federation introduces that no earlier chapter faced: Section 14.6 develops privacy and secure aggregation, Section 14.7 builds personalized federated learning that fits each participant rather than one global average, Section 14.8 removes the central server with decentralized gossip averaging, and Section 14.9 lands the whole apparatus on battery-limited edge and on-device hardware.

Read in order, the nine sections take you from "the data is on a billion phones and the law forbids collecting it" to "train, personalize, privately aggregate, and decentralize a model that those phones never give up." The thread to watch is that federation reframes every quantity the rest of the book optimized: the gradient is now a biased estimate because the shards are not IID, the communication budget is measured in rounds rather than bytes per step, and the aggregation must reveal nothing about any single participant. The local-SGD-and-average move of Chapter 10 reappears in every section, but it now runs on data you cannot inspect, over participants who come and go, under a privacy boundary you are not permitted to cross.
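
The sense in which the gradient becomes biased can be stated with the objective usually written for FedAvg. The notation below is an assumption borrowed from the FedAvg literature, since this overview defines no symbols of its own: $K$ clients, client $k$ holding the index set $\mathcal{P}_k$ of $n_k$ examples, and a per-example loss $\ell$.

```latex
F(w) = \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w),
\qquad
F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} \ell(w; x_i, y_i),
\qquad
n = \sum_{k=1}^{K} n_k .

% Shards assigned uniformly at random (the Chapter 10 setting):
\mathbb{E}_{\mathcal{P}_k}\!\left[ \nabla F_k(w) \right] = \nabla F(w)

% Client-generated, non-IID data (the federated setting):
\nabla F_k(w) \neq \nabla F(w) \quad \text{in general}
```

When the examples are scattered uniformly at random, each local gradient is an unbiased estimate of the global one. When each client generates its own data, no such guarantee holds, and several local steps between rounds pull each client toward its own minimizer rather than the shared one.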

Prerequisites

This chapter assumes the distributed-optimization background of the earlier chapters. From Chapter 10: Distributed Optimization you carry the single most important idea, local SGD: taking several gradient steps on local data before communicating, rather than synchronizing every step. FedAvg is local SGD pushed to its extreme, with many local epochs between rare communication rounds, and the convergence reasoning of Sections 14.3 through 14.5 builds directly on the local-SGD analysis you saw there. From the same chapter you carry the data-parallel gradient and the synchronous-versus-asynchronous tradeoff, which return as the server-coordinated rounds of FedAvg and the server-free gossip of Section 14.8. From Chapter 4: Communication Primitives for Distributed Training you carry all-reduce and the cost of a communication round, which decentralized learning in Section 14.8 replaces with neighbor-averaging over a gossip topology. The chapter assumes comfortable Python, a working understanding of mini-batch SGD and of training a neural network, and basic probability, since non-IID data in Section 14.4 is a statement about differing local distributions. The differential-privacy and secure-aggregation material in Section 14.6 is developed from first principles and assumes no prior cryptography.

Learning Objectives

Explain why some training data cannot be centralized, naming the regulatory, bandwidth, and privacy pressures that make federated learning the only viable path, and contrast the data-stays-put model with the data-parallel sharding of earlier chapters.
Distinguish cross-device from cross-silo federated learning along the axes of participant count, reliability, statefulness, and trust, and explain how the regime dictates the algorithm and system design.
Derive FedAvg as local SGD with periodic averaging, implement one communication round, and describe how FedProx and SCAFFOLD correct the client drift that heterogeneous data induces.
Explain why non-identically-distributed client data breaks the unbiased-gradient assumption of distributed SGD and characterize its effect on convergence and final accuracy.
Reason about the communication constraints of federation, including why the budget is measured in rounds rather than bytes per step, and apply gradient compression, quantization, and reduced communication frequency to fit it.
Describe how secure aggregation lets a server combine client updates it can never read individually, and how differential privacy bounds what the released model reveals about any one participant.
Contrast a single global model with personalized federated learning, and explain how meta-learning, fine-tuning, and multi-task formulations fit a model to each participant's distribution.
Explain decentralized learning as server-free neighbor averaging over a gossip topology, and reason about how the topology controls the speed of consensus.
Identify the constraints that battery, intermittent connectivity, and limited memory impose on edge and on-device learning, and describe the techniques that make training feasible there.
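
The secure-aggregation objective, a server that learns the sum of the client updates but no individual update, rests on an idea small enough to show here: pairwise masks that cancel in the sum. The sketch below is a toy under loud assumptions, not the protocol of Section 14.6. Updates are already quantized to integers, one shared random generator stands in for the pairwise key agreement a real protocol would use, no client drops out, and `MODULUS`, `pairwise_masks`, `mask_updates`, and `aggregate` are hypothetical names.

```python
import numpy as np

MODULUS = 2**16  # hypothetical ring size; must exceed the largest possible true sum

def pairwise_masks(num_clients, dim, seed=0):
    """Build one mask per client so that all masks sum to zero modulo MODULUS.

    For every pair i < j a shared random vector is added to client i's mask
    and subtracted from client j's, so each pair cancels in the total.
    """
    rng = np.random.default_rng(seed)  # stand-in for pairwise shared secrets
    masks = np.zeros((num_clients, dim), dtype=np.int64)
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = rng.integers(0, MODULUS, size=dim, dtype=np.int64)
            masks[i] = (masks[i] + shared) % MODULUS
            masks[j] = (masks[j] - shared) % MODULUS
    return masks

def mask_updates(updates, masks):
    """What each client would send: its integer update plus its mask."""
    return (updates + masks) % MODULUS

def aggregate(masked):
    """What the server computes: the sum of masked updates; the masks cancel."""
    return masked.sum(axis=0) % MODULUS

if __name__ == "__main__":
    updates = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int64)
    masked = mask_updates(updates, pairwise_masks(3, 3))
    print("masked rows seen by the server:\n", masked)
    print("recovered sum:", aggregate(masked))
```

Working modulo a fixed integer makes the cancellation exact, which floating-point masks would not guarantee. Everything that makes the real protocol practical, such as key agreement and recovery when clients drop out mid-round, is deliberately left out.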

Remember the Chapter as One Sentence

If you keep one thing from this chapter, keep this: federated and decentralized learning train a shared model without moving the data, by running local SGD on each participant and aggregating only the model updates, which makes data heterogeneity, communication scarcity, and privacy the constraints that everything else must bend around. Read forward, the sections build that craft in layers: first why the data cannot move and which federation regime you are in, then FedAvg and the variants that survive heterogeneous data, the non-IID difficulty itself, and the communication budget that caps every round; then the constraints federation adds: secure aggregation and differential privacy, per-participant personalization, server-free gossip averaging, and the battery-limited edge device. Read as a question, the chapter asks of any model you want to train on data you are not allowed to collect: how do you learn from it without ever seeing it, and what must you give up, in accuracy, in communication, and in trust, to keep it where it is? The roadmap below walks the nine sections that answer it.

Chapter Roadmap

14.1 Motivation for Federated Learning: Lays out the regulatory, bandwidth, and privacy pressures that make centralizing training data impossible, and frames federated learning as training a shared model while the data stays on the devices and silos that hold it.
14.2 Cross-Device and Cross-Silo Learning: Separates the two regimes that dominate practice, federation across millions of unreliable stateless phones and federation across a handful of trusted stateful institutions, and shows how the regime dictates every later design choice.
14.3 FedAvg and Its Variants: Develops FedAvg as local SGD with periodic averaging, the algorithm at the heart of the field, and the variants FedProx and SCAFFOLD that correct the client drift heterogeneous data induces.
14.4 Non-IID Data: Confronts the central difficulty of federation, that client data is not identically distributed, and shows why this breaks the unbiased-gradient assumption of distributed SGD and degrades convergence and final accuracy.
14.5 Communication Constraints: Treats the communication budget that makes each round expensive and rare, and applies gradient compression, quantization, and reduced communication frequency to fit training into it.
14.6 Privacy and Secure Aggregation: Develops secure aggregation, which lets a server combine client updates it can never read individually, and differential privacy, which bounds what the released model can reveal about any single participant.
14.7 Personalized Federated Learning: Replaces the single global model with one fitted to each participant, through meta-learning, fine-tuning, and multi-task formulations that respect the local distribution rather than averaging it away.
14.8 Decentralized Learning: Removes the central server entirely, averaging models over a gossip topology of neighbor exchanges, and reasons about how the communication graph controls the speed of consensus.
14.9 Edge and On-Device Learning: Lands the whole apparatus on battery-limited, intermittently connected, memory-constrained hardware, and surveys the techniques that make training in the field feasible at all.

Read the nine sections in order and you will hold a toolkit for learning from data you are not allowed to collect: Section 14.1 and Section 14.2 establish why the data cannot move and which federation regime you are in, Sections 14.3 through 14.5 build FedAvg and the variants, the non-IID difficulty, and the communication budget, and Sections 14.6 through 14.9 add secure aggregation and privacy, personalization, decentralized gossip averaging, and on-device training. The thread to watch is the local SGD of Chapter 10 reappearing as FedAvg, and the all-reduce of Chapter 4 reappearing, in Section 14.8, as neighbor averaging once the central server is gone.
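
The neighbor averaging that stands in for all-reduce once the server is gone can be previewed numerically. This is a minimal sketch, assuming each node holds a single scalar rather than a model and that the mixing weights are fixed and uniform; `ring_mixing_matrix`, `complete_mixing_matrix`, `gossip`, and `second_eigenvalue` are hypothetical names, not an API from any library.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Each node averages itself with its two ring neighbours (doubly stochastic)."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i, (i - 1) % n, (i + 1) % n):
            W[i, j] += 1.0 / 3.0  # += keeps rows summing to 1 even for tiny rings
    return W

def complete_mixing_matrix(n):
    """Every node averages with every other node: all-reduce in a single step."""
    return np.full((n, n), 1.0 / n)

def gossip(x, W, steps):
    """Apply the neighbour-averaging update x <- W x for a number of steps."""
    for _ in range(steps):
        x = W @ x
    return x

def second_eigenvalue(W):
    """Second-largest eigenvalue magnitude; smaller means faster consensus."""
    return np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]

if __name__ == "__main__":
    x0 = np.arange(8.0)  # one scalar per node; the true average is 3.5
    for name, W in [("ring", ring_mixing_matrix(8)),
                    ("complete", complete_mixing_matrix(8))]:
        spread = np.ptp(gossip(x0, W, steps=10))
        print(name, "second eigenvalue:", second_eigenvalue(W),
              "spread after 10 steps:", spread)
```

Both matrices preserve the network-wide average because their rows and columns sum to one. The ring needs many exchanges to approach that average, while the complete graph reaches it in one, which is the sense in which the topology controls the speed of consensus.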

What's Next?

This chapter closes Part III, and with it the assumption that you own and can co-locate your training data, an assumption that held quietly through distributed optimization, parameter servers, classical machine learning, and distributed graph learning, and finally fell away here. Part IV turns from the question of where the data lives to the question of how to train a single very large model fast, and it returns to the world where you own a tightly coupled cluster and the data is yours to scatter. Chapter 15: Data-Parallel Deep Learning opens that part by taking the data-parallel gradient of Chapter 10 and engineering it for deep networks on many accelerators: replicating the model across GPUs, overlapping the backward pass with gradient all-reduce, bucketing and scheduling the communication, and scaling the batch without wrecking convergence. The local-SGD intuition you built here returns there as one point on a spectrum that runs from synchronous every-step all-reduce to the rare-communication federation you just left. Read it next, and watch the federated coordinator's patient averaging tighten back into the microsecond-budgeted all-reduce of a training cluster that owns every byte of its data.

Bibliography & Further Reading

Foundations and Surveys

McMahan, B., Moore, E., Ramage, D., Hampson, S., Agüera y Arcas, B. "Communication-Efficient Learning of Deep Networks from Decentralized Data" (FedAvg). arXiv:1602.05629, 2016. arxiv.org/abs/1602.05629

The paper that introduced federated learning and the FedAvg algorithm of local SGD plus periodic averaging, the starting point for the entire chapter and the core of Section 14.3.

📄 Paper

Kairouz, P., McMahan, H. B., et al. "Advances and Open Problems in Federated Learning." arXiv:1912.04977, 2019. arxiv.org/abs/1912.04977

The comprehensive survey that frames cross-device versus cross-silo federation, non-IID data, privacy, and communication, the organizing reference behind Sections 14.1, 14.2, and 14.6.

📄 Paper

FedAvg Variants and Heterogeneity

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V. "Federated Optimization in Heterogeneous Networks" (FedProx). arXiv:1812.06127, 2018. arxiv.org/abs/1812.06127

The proximal-term variant of FedAvg that stabilizes training under heterogeneous data and stragglers, one of the two corrections developed in Section 14.3 and motivated by Section 14.4.

📄 Paper

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., Suresh, A. T. "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning." arXiv:1910.06378, 2019. arxiv.org/abs/1910.06378

The control-variate method that corrects the client drift caused by non-IID data, the second FedAvg variant of Section 14.3 and a direct response to the difficulty in Section 14.4.

📄 Paper

Privacy and Secure Aggregation

Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., Seth, K. "Practical Secure Aggregation for Privacy-Preserving Machine Learning." ACM CCS, 2017. dl.acm.org

The protocol that lets a server compute the sum of client updates without learning any individual update, the cryptographic core of the secure aggregation in Section 14.6.

📄 Paper

McMahan, H. B., Ramage, D., Talwar, K., Zhang, L. "Learning Differentially Private Recurrent Language Models." arXiv:1710.06963, 2017. arxiv.org/abs/1710.06963

The work that combines federated training with differential privacy to bound what the released model reveals about any participant, the privacy half of Section 14.6.

📄 Paper

Personalized Federated Learning

Fallah, A., Mokhtari, A., Ozdaglar, A. "Personalized Federated Learning: A Meta-Learning Approach" (Per-FedAvg). arXiv:2002.07948, 2020. arxiv.org/abs/2002.07948

The meta-learning formulation that trains an initialization each client can quickly adapt to its own distribution, the personalization approach developed in Section 14.7.

📄 Paper

Decentralized Learning

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., Liu, J. "Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent" (D-PSGD). arXiv:1705.09056, 2017. arxiv.org/abs/1705.09056

The analysis showing decentralized neighbor-averaging SGD can match centralized SGD while removing the central bottleneck, the foundation of the gossip-averaging view in Section 14.8.

📄 Paper

Frameworks and Tools

Beutel, D. J., Topal, T., Mathur, A., Qiu, X., Fernandez-Marques, J., Gao, Y., Sani, L., Li, K. H., Parcollet, T., de Gusmao, P. P. B., Lane, N. D. "Flower: A Friendly Federated Learning Framework." flower.ai

The framework-agnostic federated learning library used to prototype the FedAvg rounds, client sampling, and aggregation strategies that run through Sections 14.3 to 14.7.

🛠️ Tool

Google. "TensorFlow Federated: Machine Learning on Decentralized Data." tensorflow.org/federated

The open-source stack for expressing federated computations and simulating cross-device training, a reference implementation for the algorithms of this chapter.

🛠️ Tool