Part III: Distributed Machine Learning
Chapter 14: Federated and Decentralized Learning

Cross-Device and Cross-Silo Learning

"They told me I had a million colleagues. I have never met more than a thousand of them at once, and a third of those wandered off before the round finished."

A Server Waiting on Its Clients
Big Picture

Federated learning is not one setting but two, and almost every design decision in this chapter changes depending on which one you are in. In the cross-device regime the participants are millions of phones, watches, and browsers: tiny, unreliable, intermittently online, and each holding only a handful of examples, so only a small random sample joins any given round and many of those drop out before they report. In the cross-silo regime the participants are a few dozen institutions: hospitals, banks, or regional data centers, each holding a large dataset, always reachable, and stable enough to keep state across rounds. The number of clients $N$, the data per client $n$, the availability, the statefulness, and the trust model differ by orders of magnitude between the two, and those differences decide which aggregation rule, which privacy mechanism, and which fault-tolerance strategy you can actually use. This section draws the line cleanly so that the FedAvg variants in Section 14.3 land in the right regime.

In the previous section we motivated federated learning as training a shared model on data that never leaves the machines that own it, coordinated by a server that only ever sees model updates. That definition is regime-neutral: it holds whether the clients are a billion Android phones or twelve hospitals. The moment you try to build the system, though, the neutrality evaporates. A protocol that assumes every client answers every round is fine for twelve hospitals and a disaster for a billion phones; a protocol that pays a heavy per-client cryptographic setup is fine for twelve hospitals and ruinous at a billion. The federated-learning literature settled on two named regimes precisely because the engineering forks so sharply. Naming them, and knowing which constraints bind in each, is the skill this section teaches.

[Figure 14.2.1 diagram. Left panel, "Cross-device: a fleet, a sample reports": one server and millions of clients, tiny $n$ each, most offline or dropped. Right panel, "Cross-silo: a few orgs, all report": one aggregator with Hospital A, Bank B, and Lab C, each with large $n$; tens of clients, always available, stateful.]
Figure 14.2.1: The two regimes side by side. On the left, a cross-device fleet of millions: the server samples a few clients per round (orange, connected) while the vast majority are offline or drop out mid-round (grey). On the right, a cross-silo federation of a handful of institutions, each holding a large dataset and every one of them participating in every round. The contrast in client count $N$, data per client $n$, and availability drives every system choice discussed below.

1. Two Regimes, One Definition Beginner

Both regimes implement the same loop: the server holds a global model, sends it to a set of clients, each client trains locally on its private data and returns an update, and the server aggregates those updates into a new global model. What separates them is the population of clients and what you can assume about it. Cross-device federated learning targets a fleet of consumer endpoints, the canonical example being Google's mobile keyboard learning to predict the next word across hundreds of millions of Android phones. Cross-silo federated learning targets a small set of organizations that cannot or will not pool their raw data, the canonical example being a consortium of hospitals training a shared diagnostic model while each patient record stays inside its own hospital, a setting we develop fully in the medical-AI case study of Chapter 37.

The two differ along five axes that matter for system design: the scale of the client count $N$ versus the data per client $n$, client availability and sampling, statefulness, the communication pattern, and the trust model. We take each in turn, because each one forces a different remedy, and the remedies are what the rest of the chapter spends its pages on.

2. The Scale of $N$ Versus $n$ Beginner

Write $N$ for the number of clients and $n$ for the number of training examples a typical client holds. In cross-device, $N$ is enormous (millions to billions) and $n$ is tiny (a single user generates a few hundred keystrokes or photos). In cross-silo, $N$ is small (a few to a few hundred) and $n$ is large (a hospital holds millions of records). The total data $\sum_k n_k$ can be comparable in the two settings, but it is split very differently, and the split decides what a single round can accomplish.

When $n$ per client is tiny, the local update from any one client is statistically noisy and carries little signal on its own, so the system leans on aggregating many clients to extract that signal. When $n$ per client is large, each client's update is already a meaningful, low-variance estimate, so a round of a few dozen silos can move the global model substantially. This is the same average-of-averages reasoning behind the data-parallel gradient identity of Section 1.1, but with a crucial twist: the per-client example counts $n_k$ are wildly unequal, so the aggregation must be a size-weighted average, with client $k$ weighted by $n_k / \sum_j n_j$, rather than a plain mean. We make that weighting explicit in the code of Section 6 and unpack its consequences for non-IID data in Section 14.4.
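The size-weighting argument can be checked numerically. The sketch below is a toy illustration rather than part of the chapter's simulation: the two clients and their scalar per-example values are made up, and `weighted_mean` is a hypothetical helper name. It shows that weighting each client's local mean by its example count recovers the mean over the pooled examples, while a plain mean of the client means does not.

```python
# Toy data (hypothetical): per-example values held by two clients of unequal size.
client_a = [1.0] * 2   # n_a = 2 examples, local mean 1.0
client_b = [4.0] * 8   # n_b = 8 examples, local mean 4.0

def mean(values):
    return sum(values) / len(values)

def weighted_mean(local_means, counts):
    """Size-weighted average: sum over k of (n_k / sum_j n_j) * mean_k."""
    total = sum(counts)
    return sum(m * n / total for m, n in zip(local_means, counts))

local_means = [mean(client_a), mean(client_b)]
counts = [len(client_a), len(client_b)]

pooled = mean(client_a + client_b)             # what pooling the raw data would give
weighted = weighted_mean(local_means, counts)  # size-weighted average of client means
plain = mean(local_means)                      # unweighted mean of client means

print(f"pooled={pooled:.2f}  weighted={weighted:.2f}  plain={plain:.2f}")
```

With these toy values the pooled and size-weighted means agree at 3.40, while the plain mean lands at 2.50 because it lets the two-example client count as much as the eight-example one.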

Key Insight: The Regime Is Set by the Shape of the Data Split, Not the Total

Two federations can hold the same total number of examples and still demand opposite system designs. What matters is the shape of the split: many clients with tiny $n$ (cross-device) versus few clients with large $n$ (cross-silo). Tiny $n$ forces partial participation, statelessness, and reliance on aggregating thousands of weak updates; large $n$ permits full participation, per-client state, and heavier per-round computation and cryptography. Before you choose an algorithm, classify the split; the regime tells you which assumptions you are allowed to make.

3. Availability, Sampling, and the Dropout Problem Intermediate

A cross-silo client is a data center: it is online, has reliable power and a fast wired link, and can be counted on to finish whatever round it joins. A cross-device client is a phone in someone's pocket: it participates only when it is idle, charging, and on an unmetered network, and it may go offline at any moment because the user picks it up, the battery dips, or the signal drops. The practical consequence is that cross-device rounds sample a small fraction of currently eligible clients (often a few hundred out of millions) and must tolerate that a meaningful fraction of even those sampled will never report back within the round's deadline.

This is exactly the participant-availability problem, and it is the federated face of fault tolerance, the theme introduced in Section 2.4. There, a distributed computation had to make progress despite crashed or slow nodes; here, a training round must produce a usable aggregate despite clients that vanish mid-round. The server cannot wait for stragglers indefinitely (waiting for the last of a million phones would stall every round), so it sets a target count and a deadline, aggregates whoever reported by the deadline, and discards the late arrivals. A round is therefore designed from the start to tolerate dropouts: it over-samples (selects more clients than it strictly needs so that enough survive), and it treats the surviving set, not the selected set, as the true participant population. Cross-silo, by contrast, can demand that every silo reports before the round closes, because a federation of a dozen institutions can afford to block on the slowest one.
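The deadline-and-target rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production protocol: `close_round` and `weighted_average` are hypothetical names, updates are scalars, the arrival times are invented, and the choice to abandon a round that misses its target is our assumption rather than something the text prescribes.

```python
def close_round(arrivals, deadline, target):
    """Keep the reports that beat the deadline and say whether the round succeeded.

    arrivals: list of (arrival_seconds, update, n_examples); arrival_seconds is
              None for a client that dropped out and never reported.
    """
    on_time = [(update, n) for t, update, n in arrivals
               if t is not None and t <= deadline]
    return on_time, len(on_time) >= target

def weighted_average(reports):
    """Size-weighted average over scalar updates from the surviving clients only."""
    total = sum(n for _, n in reports)
    return sum(update * n for update, n in reports) / total

# Hypothetical round: five selected clients, a 60-second deadline, a target of 3 reports.
arrivals = [
    (12.0, 0.9, 80),    # reported early
    (45.0, 1.1, 120),   # reported in time
    (None, 1.0, 60),    # dropped out, never reported
    (70.0, 1.3, 100),   # straggler: arrives after the deadline and is discarded
    (30.0, 1.0, 100),   # reported in time
]
survivors, succeeded = close_round(arrivals, deadline=60.0, target=3)
if succeeded:
    aggregate = weighted_average(survivors)
    print(f"{len(survivors)} of {len(arrivals)} reported; aggregate = {aggregate:.4f}")
else:
    print("too few reports; abandon the round")
```

Note that the weights are normalized over the three survivors, not the five selected clients, which is the sense in which the surviving set is the true participant population.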

Practical Example: The Keyboard Model That Could Not Wait for Everyone

Who: A mobile-AI team training a next-word prediction model across a large fleet of Android phones.

Situation: Each training round selected a few thousand eligible phones (idle, charging, on Wi-Fi) and asked each to train locally for a few minutes and upload a model delta.

Problem: Between selection and the upload deadline, roughly a quarter to a third of the selected phones dropped out: users unplugged them, walked out of Wi-Fi range, or simply started using them again.

Dilemma: Wait longer so more phones report, which stalls the round and lets the freshest data go stale, or close the round early on whoever survived, which risks aggregating too few or a skewed set of clients.

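Decision: They over-sampled deliberately, selecting roughly a third to a half more phones than the target count (what it takes to hit the target when a quarter to a third of the selected phones drop out), set a fixed deadline, and aggregated the survivors, accepting that the selected set and the reporting set would differ.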

How: The server tracked a target of reporting clients per round, over-provisioned the selection to hit it after expected losses, and used a size-weighted average over only the clients that reported by the deadline.

Result: Rounds completed on a predictable schedule, the aggregate stayed close to what a full-participation round would have produced (the gap is quantified in Output 14.2.1), and the model improved steadily despite never once hearing from most of the fleet.

Lesson: In cross-device federated learning you never train on "all clients"; you train on "whoever survived this round", and the protocol must be correct under that reality from the first line.
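The over-provisioning in this example is a one-line calculation. The sketch below assumes each selected phone drops out independently with a fixed probability and sizes the selection so that the expected number of survivors reaches the target; `clients_to_select` and the target of 1,000 reports are hypothetical, chosen only for illustration.

```python
import math

def clients_to_select(target_reports, dropout_rate):
    """Smallest selection size whose expected survivor count reaches the target."""
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(target_reports / (1.0 - dropout_rate))

TARGET = 1000  # hypothetical number of reports wanted per round
for rate in (0.25, 0.33):
    selected = clients_to_select(TARGET, rate)
    extra = 100.0 * (selected - TARGET) / TARGET
    print(f"dropout {rate:.0%}: select {selected:,} phones ({extra:.0f}% over target)")
```

Under these assumptions a 25% dropout rate calls for 1,334 selected phones and a 33% rate for 1,493. Sizing for the expected count still falls short roughly half the time, so a real deployment would likely add a safety margin on top.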

4. Statefulness and Communication Patterns Intermediate

Because a cross-device client may be sampled once and then never again (out of millions of phones, the chance that a given phone returns next round is vanishingly small), the protocol must be stateless on the client side: a client cannot rely on remembering anything from a previous round, because it almost certainly was not in one. Every cross-device round therefore starts each participating client from the broadcast global model, with no client-specific optimizer state, no per-client control variate carried across rounds, and no assumption of continuity. Cross-silo is the opposite: the same dozen silos participate round after round, so they can be stateful, keeping local optimizer momentum, per-client correction terms (as in SCAFFOLD-style variance reduction, covered in Section 14.3), or even a personalized local head that persists between rounds.
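The phrase "vanishingly small" can be made concrete with a short calculation. The sketch below borrows the fleet size and per-round sample from Code 14.2.1 and assumes, as a simplification, that every round samples uniformly at random and independently of earlier rounds; real eligibility rules are not uniform, so treat the numbers as an order-of-magnitude guide.

```python
FLEET = 1_000_000   # fleet size used in Code 14.2.1
PER_ROUND = 1_000   # clients sampled per round in Code 14.2.1

# Chance that one particular phone is selected in any single round.
p_select = PER_ROUND / FLEET

# Mean number of rounds between two selections of the same phone
# (the mean of a geometric distribution with success probability p_select).
expected_gap = 1 / p_select

# Chance the phone is selected at least once in the next 100 rounds.
p_within_100 = 1 - (1 - p_select) ** 100

print(f"per-round selection probability : {p_select:.4f}")
print(f"expected rounds between visits  : {expected_gap:,.0f}")
print(f"selected again within 100 rounds: {p_within_100:.3f}")
```

A phone that waits on the order of a thousand rounds between visits would be carrying state computed against a global model that has long since moved on, which is the practical reason the cross-device client starts from the broadcast model every time.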

Statefulness changes the communication pattern too. Cross-device communication is bursty and one-shot per client: download the model, train, upload once, disconnect, with the round's bottleneck being the upload over a slow, metered, high-latency cellular link, which is why communication compression is a first-class concern in Section 14.5. Cross-silo communication is repeated and sustained between stable endpoints on fast links, so the bytes per round matter less and the bottleneck shifts toward the per-round cryptographic and coordination overhead the silos are willing to pay for stronger privacy guarantees.
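A back-of-envelope calculation shows why the uplink dominates in one regime and not the other. Every number below is a hypothetical placeholder (model size, phone uplink, silo link speed), and `transfer_seconds` is an illustrative helper; substitute measured values for a real deployment.

```python
def transfer_seconds(num_params, bytes_per_param, link_bytes_per_second):
    """Time to send one uncompressed model update over a link of the given speed."""
    return num_params * bytes_per_param / link_bytes_per_second

NUM_PARAMS = 5_000_000       # hypothetical model size
FLOAT32_BYTES = 4
PHONE_UPLINK = 1_000_000     # hypothetical cellular uplink: 1 MB/s
SILO_LINK = 1_000_000_000    # hypothetical data-center link: 1 GB/s

phone_s = transfer_seconds(NUM_PARAMS, FLOAT32_BYTES, PHONE_UPLINK)
silo_s = transfer_seconds(NUM_PARAMS, FLOAT32_BYTES, SILO_LINK)

print(f"phone upload : {phone_s:.2f} s")
print(f"silo upload  : {silo_s:.2f} s")
```

With these placeholder figures the same 20 MB update takes 20 seconds from the phone and 0.02 seconds from the silo, a thousandfold gap that compression can narrow for the phone and that barely registers for the silo.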

5. Trust Models and How the Regime Maps onto Later Choices Intermediate

The trust model differs sharply. In cross-device, individual clients are anonymous, untrusted, and present in such numbers that some are likely malicious or simply broken, so the system worries about a small fraction of poisoned or garbage updates and leans on robust aggregation and on secure aggregation that hides any single client's contribution in the crowd. In cross-silo, the participants are identifiable, contractually bound institutions that mostly trust each other's intent but are legally forbidden from sharing raw data, so the concern shifts from "is this client malicious?" toward "can I prove to a regulator that no raw record left my walls?", which favors cryptographic guarantees like secure multiparty computation and stronger, auditable differential-privacy accounting.
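To see how secure aggregation can hide one client in the crowd, consider the pairwise-masking idea in miniature. The sketch below is a toy under heavy assumptions: updates are small integers, every pair of clients is assumed to already share a random mask, no client drops out, and Python's `random` module stands in for a cryptographic generator. The names `pairwise_masks` and `mask_update` are hypothetical. Real protocols add key agreement and recovery for dropped clients, which this sketch omits.

```python
import random

MOD = 2 ** 32  # all arithmetic is modulo a fixed integer so masks wrap around cleanly

def pairwise_masks(num_clients, rng):
    """One shared random mask per unordered pair of clients (i < j)."""
    return {(i, j): rng.randrange(MOD)
            for i in range(num_clients) for j in range(i + 1, num_clients)}

def mask_update(i, value, masks, num_clients):
    """Client i adds the mask shared with each higher-numbered peer and
    subtracts the mask shared with each lower-numbered peer."""
    masked = value
    for j in range(num_clients):
        if i < j:
            masked += masks[(i, j)]
        elif j < i:
            masked -= masks[(j, i)]
    return masked % MOD

rng = random.Random(0)           # NOT cryptographically secure; illustration only
true_updates = [3, 14, 15, 92]   # toy integer-encoded client updates
masks = pairwise_masks(len(true_updates), rng)
masked = [mask_update(i, v, masks, len(true_updates))
          for i, v in enumerate(true_updates)]

# The server sees only the masked values; each mask is added once and
# subtracted once, so the masks cancel in the sum.
server_sum = sum(masked) % MOD
print(f"sum recovered by server: {server_sum}")
print(f"true sum               : {sum(true_updates)}")
```

The server recovers the sum of 124 without seeing any individual update in the clear. The sketch also hints at why the regime matters: if one client vanishes after masking, its masks no longer cancel, so a cross-device protocol has to pay for dropout recovery that a stable cross-silo federation rarely needs.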

These regime differences are not academic; they pre-decide the menu for the rest of the chapter. Partial participation and statelessness shape which FedAvg variants converge (Section 14.3); the skew of tiny non-IID per-client datasets is far more severe in cross-device (Section 14.4); the slow metered uplink makes compression mandatory cross-device and optional cross-silo (Section 14.5); and the trust model selects between crowd-hiding secure aggregation and institution-level cryptography for privacy, a thread that runs through Section 14.6 and on into the secure-aggregation and differential-privacy machinery of Chapter 35. Table 14.2.1 collects the full contrast.

Table 14.2.1: The two regimes of federated learning across the axes that drive system design. Each row changes which algorithm, privacy mechanism, and fault-tolerance strategy is feasible.
| Axis | Cross-device | Cross-silo |
| --- | --- | --- |
| Clients $N$ | Millions to billions | A few to a few hundred |
| Data per client $n$ | Tiny (one user's data) | Large (an institution's data) |
| Examples | Phones, watches, browsers | Hospitals, banks, data centers |
| Availability | Intermittent, unreliable | Always online, reliable |
| Participation per round | Small random sample | All clients |
| Dropouts mid-round | Common, must be tolerated | Rare, can block on slowest |
| Client state across rounds | Stateless (likely never sampled again) | Stateful (same clients return) |
| Communication | Bursty, one-shot, slow uplink | Repeated, sustained, fast links |
| Bottleneck | Bytes uploaded | Per-round crypto/coordination |
| Trust model | Anonymous, some malicious | Identified, contractual |
| Privacy emphasis | Hide one client in the crowd | Provable, auditable guarantees |

6. A Federated Round Under Both Regimes Intermediate

The cleanest way to feel the difference is to run one round under each regime and watch how partial participation and dropouts move the aggregated update. The code below is pure Python: it builds a small cross-silo federation in which every silo reports, and a million-client cross-device fleet from which it samples one tenth of one percent and then drops each sampled client independently with probability 30% before it reports. Both regimes aggregate with the same size-weighted FedAvg rule; the only difference is who shows up. We measure the cross-device round against an ideal in which the entire fleet reported, to see how far partial participation pulls the aggregate from the full-fleet answer.

import random
random.seed(7)

D = 20  # model dimension

def vec_sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def vec_norm(a):
    return sum(x * x for x in a) ** 0.5

def make_clients(n_clients, n_per_client, spread):
    """Each client carries a local update DELTA and an example count."""
    clients = []
    for _ in range(n_clients):
        n = max(1, int(random.gauss(n_per_client, n_per_client * 0.15)))
        # local optimum, shifted per client (a touch of non-IID heterogeneity)
        local_opt = [random.gauss(1.0, spread) for _ in range(D)]
        delta = vec_sub(local_opt, [0.0] * D)  # update relative to the zero global model
        clients.append((delta, n))
    return clients

def fedavg(updates):
    """Size-weighted average of client deltas: the FedAvg aggregation rule."""
    total = sum(n for _, n in updates)
    agg = [0.0] * D
    for delta, n in updates:
        w = n / total
        for j in range(D):
            agg[j] += w * delta[j]
    return agg, total

def report(name, n_selected, n_reported, agg, examples):
    print(name)
    print(f"  clients selected     : {n_selected:,}")
    print(f"  clients reported     : {n_reported:,}")
    print(f"  examples aggregated  : {examples:,}")
    print(f"  ||aggregated update|| : {vec_norm(agg):.4f}")

# ---- Cross-silo: a few dozen reliable orgs, ALL participate every round ----
silo = make_clients(n_clients=24, n_per_client=50_000, spread=0.5)
silo_agg, silo_examples = fedavg(silo)  # no dropouts, every silo reports

# ---- Cross-device: a huge fleet; sample a fraction, some sampled drop out ----
FLEET = 1_000_000
device_fleet = make_clients(n_clients=FLEET, n_per_client=80, spread=0.5)
FRACTION = 0.001   # 0.1% sampled per round
DROPOUT = 0.30     # 30% of the sampled never report back
n_sample = int(FLEET * FRACTION)
sampled = random.sample(device_fleet, n_sample)
survived = [c for c in sampled if random.random() > DROPOUT]
device_agg, device_examples = fedavg(survived)

# Ideal: the update if the ENTIRE fleet had reported this round.
ideal_agg, _ = fedavg(device_fleet)

report("CROSS-SILO  (all participate, stateful)", len(silo), len(silo), silo_agg, silo_examples)
print()
report("CROSS-DEVICE (sampled fraction, dropouts)", n_sample, len(survived), device_agg, device_examples)
print()
gap = vec_norm(vec_sub(device_agg, ideal_agg)) / vec_norm(ideal_agg)
print(f"cross-device round vs full-fleet ideal : relative gap = {gap:.3f}")
print(f"cross-silo round saw {silo_examples / device_examples:,.0f}x the examples of the cross-device round")
Code 14.2.1: One federated round simulated under both regimes with the same size-weighted FedAvg aggregation. The cross-silo path aggregates all 24 silos; the cross-device path samples 1,000 of a million clients and then drops each one with probability 0.30, aggregating only the survivors.
CROSS-SILO  (all participate, stateful)
  clients selected     : 24
  clients reported     : 24
  examples aggregated  : 1,152,832
  ||aggregated update|| : 4.5604

CROSS-DEVICE (sampled fraction, dropouts)
  clients selected     : 1,000
  clients reported     : 687
  examples aggregated  : 54,375
  ||aggregated update|| : 4.4770

cross-device round vs full-fleet ideal : relative gap = 0.018
cross-silo round saw 21x the examples of the cross-device round
Output 14.2.1: The cross-silo round aggregated all 24 clients and roughly 1.15 million examples; the cross-device round heard from only 687 of the 1,000 sampled clients (313 dropped out) and about 54 thousand examples. Even so, sampling and dropouts pulled the cross-device aggregate just 1.8% away from the full-fleet ideal, because averaging hundreds of clients smooths over the missing ones. The cross-silo round, by contrast, drew on about 21 times as much data in a single pass.

Two lessons fall out of Output 14.2.1. First, partial participation is survivable: sampling 0.1% of the fleet and then losing nearly a third of those sampled left the aggregate under two percent from the full-fleet ideal, because the law of large numbers protects an average taken over hundreds of clients. This is why cross-device protocols can be cavalier about losing individual clients yet still converge. The protection has a condition, though: the simulation drops clients uniformly at random, so the survivors remain an unbiased sample, whereas dropout that correlates with a client's data would add a bias that averaging does not remove. Second, the regimes operate at completely different per-round data scales, and that gap, not the algorithm, is why a cross-silo federation often converges in tens of rounds while a cross-device one needs thousands; Chapter 10 gives the convergence-rate machinery that turns this intuition into bounds.
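The size of the gap in Output 14.2.1 can be predicted from the simulation's own parameters. The estimate below is a back-of-envelope sketch under simplifying assumptions: it treats all survivors as equally weighted, assumes dropouts are independent of the updates, and takes the full-fleet aggregate to be the mean vector. Under those assumptions each coordinate of the survivor average has standard error `SPREAD / sqrt(SURVIVORS)`, the model dimension cancels in the ratio, and the relative gap reduces to one line.

```python
import math

SPREAD = 0.5      # per-coordinate standard deviation of client updates in Code 14.2.1
MEAN = 1.0        # per-coordinate mean of client updates in Code 14.2.1
SURVIVORS = 687   # clients that reported in Output 14.2.1

# Relative gap ~ (sqrt(D) * SPREAD / sqrt(m)) / (sqrt(D) * MEAN); sqrt(D) cancels.
predicted_gap = SPREAD / (MEAN * math.sqrt(SURVIVORS))
print(f"predicted relative gap ~ {predicted_gap:.3f}")
```

The prediction of roughly 0.019 sits close to the measured 0.018, and the formula makes the dependence explicit: the gap shrinks with the square root of the number of survivors, not with the fraction of the fleet that was sampled.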

Fun Note: The Server That Talks to Strangers

A cross-device server is in the strange position of coordinating millions of participants it will likely never hear from twice. It is less a manager with a team and more a radio station taking call-ins: it broadcasts to everyone, a handful call back this hour, and it makes the most of whoever happened to be listening. The protocol has to be polite to no-shows, because almost everyone is one.

Library Shortcut: Flower Picks the Regime for You

The hand-rolled selection-and-dropout loop in Code 14.2.1 is exactly what a federated-learning framework manages internally. In Flower, a regime is a configuration: the strategy class fixes the aggregation rule and the client-manager fixes sampling and the survivor-versus-selected accounting, so switching from cross-silo (all clients) to cross-device (a sampled fraction with a minimum-available threshold) is a few constructor arguments rather than a rewrite.

import flwr as fl

# Uses Flower's flwr.server API as in the original listing; newer Flower
# releases favor a ServerApp entry point, so check the version you have installed.

# Cross-device: sample 0.1% of the connected clients each round and aggregate
# whoever reports, tolerating failures among the sampled clients.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.001,         # selection fraction (cross-device sampling)
    min_fit_clients=10,         # floor on the number of clients sampled per round
    min_available_clients=1000, # wait until this many clients are connected
    accept_failures=True,       # aggregate the survivors even if some sampled clients fail
)
# Cross-silo is the same class with fraction_fit=1.0 and small client counts:
# every institution participates every round.
fl.server.start_server(
    strategy=strategy,
    config=fl.server.ServerConfig(
        num_rounds=100,
        round_timeout=600.0,    # per-round deadline in seconds (illustrative value)
    ),
)
Code 14.2.2: The same selection, dropout-tolerance, and size-weighted aggregation as Code 14.2.1, expressed as four arguments to Flower's FedAvg strategy plus a round timeout on the server configuration. The framework handles the client registry, the sampling, the deadline, and the survivor accounting; the regime is a configuration, not a separate codebase.

7. Why the Split Decides Everything Downstream Advanced

It is worth stating plainly why so much hangs on a classification that sounds like mere taxonomy. Every aggregation rule, privacy mechanism, and robustness defense in the rest of this chapter carries an implicit assumption about participation and state, and that assumption is satisfied in one regime and violated in the other. A method that keeps a per-client correction term assumes the client returns; deploy it cross-device and the term is stale on arrival. A secure-aggregation protocol whose setup cost is amortized over many rounds assumes stable membership; deploy it across an ever-changing sample of phones and the setup never pays off. A robustness defense tuned for a handful of identifiable silos assumes you can audit each participant; deploy it across anonymous millions and the audit is impossible. The regime is the contract under which a federated algorithm is correct, and reading that contract first is what keeps the FedAvg variants of the next section from being applied where they cannot hold.

Research Frontier: Blurring the Two Regimes (2024 to 2026)

The clean cross-device versus cross-silo split is increasingly a spectrum rather than a binary. Cross-silo work in 2024 to 2026 pushes toward asynchronous federation, dropping the assumption that every silo reports in lockstep so that a hospital with a slow pipeline no longer stalls the round, which imports cross-device-style dropout tolerance into the silo setting. In the other direction, FedDP-style work and production accounting tools (such as the differential-privacy reporting behind on-device learning at Google and Apple) tighten privacy guarantees for cross-device fleets to the auditable standard once reserved for silos. A third thread, "cross-silo with many silos", studies federations of hundreds to thousands of mid-size organizations (regional clinics, bank branches) that are stateful like silos but numerous enough to need cross-device-style sampling. The practical takeaway for a system designer: classify your deployment on each axis of Table 14.2.1 independently, because real federations now mix and match, and a 2026 system may be stateful, partially available, and privacy-audited all at once. The secure-aggregation and differential-privacy primitives this frontier relies on build on the collective-communication framing of Chapter 4, and we return to them in Chapter 35.

With the two regimes drawn and their constraints made concrete, we can finally write the aggregation algorithm itself. The next section introduces FedAvg, the workhorse of federated learning, and its variants, each of which we will read against the regime contract established here: which ones tolerate partial participation, which ones require client state, and which ones survive the non-IID skew that the tiny-$n$ cross-device setting makes unavoidable. That development begins in Section 14.3.

Exercise 14.2.1: Classify the Federation Conceptual

For each deployment, place it on the cross-device versus cross-silo spectrum by reasoning through every axis of Table 14.2.1, and name one axis where it does not fit the textbook regime cleanly: (a) a smart-speaker vendor training a wake-word model across 50 million home devices; (b) five national weather agencies jointly training a forecasting model on their proprietary sensor archives; (c) a federation of 800 retail bank branches, each with a stable server but only a few thousand customers. Explain which axis would most strongly steer your choice of aggregation and privacy mechanism in each case.

Exercise 14.2.2: How Much Dropout Can a Round Survive? Coding

Extend Code 14.2.1 to sweep the cross-device dropout rate from 0% to 90% in steps of 10%, holding the sampled count fixed, and plot or print the relative gap between the survivor aggregate and the full-fleet ideal at each rate. Then repeat the sweep with the sampled count reduced from 1,000 to 50. Explain from your two curves why over-sampling protects the aggregate, and identify the regime where dropout starts to matter: is it the dropout rate alone, or the rate combined with a small surviving count?

Exercise 14.2.3: The Cost of Statefulness Analysis

A cross-silo method keeps a per-client correction vector of $P = 10^7$ floats (4 bytes each) on each of 20 silos to reduce client drift. Estimate the total state stored across the federation and argue why this is affordable for 20 silos. Then suppose you tried to carry the same per-client state in a cross-device fleet of $10^8$ phones where each phone is sampled on average once every few thousand rounds. Compute the state that would have to persist and explain, using the statefulness argument of Section 4, why the correction term is worthless by the time a client is sampled again. Tie your conclusion to the fault-tolerance reasoning of Section 2.4.