Section 14.8: Decentralized Learning

"They fired the coordinator and told us to just talk amongst ourselves. It took longer, but nobody could shut us down by unplugging one rack."
A Peer That Only Knows Its Neighbors

Big Picture

Decentralized learning removes the central coordinator entirely: there is no server, each node talks only to a few neighbors on a communication graph, and the global model emerges from repeated local averaging rather than from any single point that holds the answer. Federated learning still kept one server at the center to aggregate updates; that server is a bottleneck, a trust anchor, and a single point of failure. Decentralized methods replace it with gossip, in which every node mixes its model with its neighbors' models a little each round. Under a connected graph this repeated mixing drives all nodes to the same model, the average of what they started with, with no machine ever seeing the whole picture. This section shows why that consensus is exact in the limit, how the graph topology controls how fast it arrives, and what you trade away for deleting the server.

Every coordination pattern in this chapter so far has kept something at the middle. Centralized federated learning has a parameter server; the personalized variants of Section 14.7 still route updates through it. The parameter-server architecture of Chapter 11 makes the center explicit, a set of shards that every worker pushes gradients to and pulls parameters from. That center is convenient: it has a global view, it can apply one optimizer, and it makes the math easy to reason about. It is also a liability. It must stay up, it must be trusted with everyone's updates, and its network links carry traffic from every worker at once. When the workers are spread across the planet, owned by different parties, or joining and leaving unpredictably, a single coordinator stops being the obvious design. Decentralized learning asks what happens when you delete it.

The answer rests on a primitive we have already met from the other side. In Section 1.4 we drew the spectrum from centralized to decentralized coordination; decentralized learning lives at the far end, where coordination is purely peer-to-peer. The operation that holds it together is averaging, the same averaging that all-reduce performs globally, but now done locally and sparsely. Instead of one synchronized exchange that touches every node, each node averages with a handful of neighbors, and the global average assembles itself over many rounds as information diffuses across the graph.

1. The Communication Graph and the Gossip Step Beginner

Picture $n$ nodes, each holding its own copy of the model and its own slice of data. We connect them with a communication graph: an edge between two nodes means they are allowed to exchange models directly. The graph is sparse on purpose. A node might have two neighbors (a ring), a handful (a mesh), or a logarithmic number chosen at growing distances (an exponential graph). No node is connected to all the others, and crucially no node is connected to a privileged center, because there is no center.

Each round, every node performs the same simple act: it replaces its model with a weighted average of its own model and its neighbors' models. Collect the $n$ model vectors as the rows of a matrix $X$, and let $W$ be the mixing matrix whose entry $W_{ij}$ is the weight node $i$ places on node $j$'s model (zero when they are not neighbors). One gossip round is then a single matrix multiply,

$$X^{(t+1)} = W X^{(t)}, \qquad W_{ij} \ge 0, \quad \sum_{j} W_{ij} = 1, \quad W = W^\top.$$

The row-sums of $W$ are one, so each new model is a genuine average and no mass is created or destroyed. Because $W$ is symmetric and doubly stochastic, the column-sums are one as well, which means the average of all the models is preserved exactly at every round: whatever the nodes converge to, it must be the average they started with. This is the decentralized, sparse analogue of all-reduce. Where all-reduce computes the global average in one coordinated collective, gossip computes it as the fixed point of many local averages, never assembling the full sum in any one place.

Figure 14.8.1: Decentralized learning on a six-node ring. Every peer (orange) exchanges models only with its immediate ring neighbors; the central server (blue, struck out) is gone. Each gossip round is one local averaging step, and over many rounds the diffusion of models around the connected graph pulls all peers to the same value, the global average of their starting models.

Figure 14.8.1 shows the ring case, the sparsest connected graph. Notice that P1 never talks to P4 directly; for P1 to feel P4's data, that influence must travel P4 to P3 to P2 to P1, one hop per round. This is exactly why topology matters: the farther apart two nodes sit on the graph, the more rounds it takes for their models to reconcile.

Key Insight: Consensus Is the Fixed Point of Local Averaging

A doubly stochastic, symmetric mixing matrix preserves the global average of the models at every gossip round while shrinking the disagreement between nodes. So the only stable configuration is the one where every node holds the same vector, and that vector must be the average the nodes began with. No node ever computes the global sum; it assembles itself implicitly as the unique fixed point that local averaging cannot move. Decentralized learning is all-reduce performed by diffusion instead of by a coordinated collective.

2. Gossip-Based SGD Intermediate

Pure averaging reconciles models that already exist; learning needs each node to keep improving its model on its own data while it reconciles. Gossip-based SGD interleaves the two. In each round, node $i$ first takes one or more local gradient steps on its private loss $f_i$, then averages the result with its neighbors. Writing $w_i^{(t)}$ for node $i$'s model and $\mathcal{N}(i)$ for its neighbors, one round of decentralized SGD is

$$w_i^{(t+1)} = \sum_{j \in \{i\} \cup \mathcal{N}(i)} W_{ij}\, \Big( w_j^{(t)} - \eta\, \nabla f_j\big(w_j^{(t)}\big) \Big).$$

The local step pushes each node toward fitting its own data; the gossip step pulls neighbors back together. When the graph is connected, these two forces balance, and the nodes converge to a single model that minimizes the average of all the local losses, the same objective a central federated server would target, reached without any server. The catch is in the rate. Disagreement between nodes shrinks each round by a factor governed by the mixing matrix, and that factor is set entirely by the graph.

The relevant quantity is the spectral gap of $W$. Order the eigenvalues of the symmetric mixing matrix as $1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n > -1$. The largest eigenvalue is always one (the all-ones consensus direction). The disagreement decays like $\rho^t$ where $\rho = \max(|\lambda_2|, |\lambda_n|)$ is the second-largest eigenvalue magnitude, so the number of rounds to reach a target consensus scales with $1/\gamma$, where

$$\gamma = 1 - \rho, \qquad \rho = \max\big(|\lambda_2|, |\lambda_n|\big).$$

$\gamma$ is the spectral gap. A large gap (close to one) means fast mixing and few rounds; a tiny gap means slow mixing and many rounds. The gap is a property of the topology: dense, well-connected graphs have large gaps, while long, thin graphs like the ring have small ones. This is the decentralized cousin of the communication-cost tradeoff from Chapter 10, now expressed as a graph eigenvalue.

Thesis Thread: All-Reduce, Diffused Across a Graph

The all-reduce of Chapter 1 computes a global average with one coordinated collective that touches every worker in lockstep; its cost is one synchronized round but it needs a logically central schedule. Gossip computes the same average with no schedule and no center, paying instead in rounds: each gossip step is a sparse, partial all-reduce over neighbors, and consensus is the limit of repeating it. The two sit at opposite ends of the coordination spectrum this book traces. Whenever a later method needs many machines to agree on one vector, ask whether it pays for agreement in synchronization (all-reduce) or in rounds (gossip); the spectral gap is the exchange rate between them.

The demonstration below makes the topology-versus-rate relationship concrete. It runs pure gossip averaging (the reconciliation half of decentralized SGD, isolated so the consensus dynamics are visible) on twelve nodes, comparing a ring against a denser exponential graph. Both must converge to the same global average; the question is how many rounds each needs, and the spectral gap predicts the gap in round counts.

import numpy as np

rng = np.random.default_rng(7)
n = 12                                  # peers on a communication graph
dim = 5                                 # each peer holds a model vector
X = rng.standard_normal((n, dim))       # heterogeneous local models
global_mean = X.mean(axis=0)            # the consensus target

def ring_W(n):
    # each peer averages with itself and its two ring neighbors
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i, (i - 1) % n, (i + 1) % n):
            W[i, j] = 1.0 / 3.0
    return W

def exp_W(n):
    # exponential graph: neighbors at distances 1, 2, 4, ... (denser)
    offs = [0]
    k = 1
    while k < n:
        offs += [k, n - k]
        k *= 2
    offs = sorted(set(o % n for o in offs))
    W = np.zeros((n, n))
    for i in range(n):
        nb = sorted(set((i + o) % n for o in offs))
        for j in nb:
            W[i, j] = 1.0 / len(nb)
    return W

def spectral_gap(W):
    ev = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - ev[1]                  # 1 - second-largest |eigenvalue|

def gossip(W, X0, tol=1e-6, max_rounds=2000):
    X = X0.copy()
    for r in range(1, max_rounds + 1):
        X = W @ X                        # one synchronous gossip round
        err = np.max(np.linalg.norm(X - global_mean, axis=1))
        if err < tol:
            return r, err
    return max_rounds, err

for name, W in [("ring (2 neighbors)", ring_W(n)), ("exponential (denser)", exp_W(n))]:
    rounds, err = gossip(W, X)
    deg = int(np.count_nonzero(W[0]))    # nonzeros in one row = self + neighbors
    print(f"{name:22s}  row degree {deg}  spectral gap {spectral_gap(W):.3f}  "
          f"rounds to consensus {rounds:4d}  final error {err:.1e}")

print(f"consensus target (global mean): {np.round(global_mean, 4)}")

Code 14.8.1: Gossip averaging from first principles on twelve peers, comparing a ring against an exponential graph. Each round is one mixing multiply W @ X; no node ever forms the global sum. The spectral gap is read directly off the mixing matrix and lined up against the measured round count.

ring (2 neighbors)      row degree 3  spectral gap 0.089  rounds to consensus  149  final error 9.9e-07
exponential (denser)    row degree 7  spectral gap 0.571  rounds to consensus   16  final error 9.1e-07
consensus target (global mean): [-0.3469 -0.3528 -0.1292 -0.3279  0.11  ]

Output 14.8.1: Both topologies reach the identical global average, but the denser exponential graph (spectral gap 0.571) gets there in 16 rounds while the ring (spectral gap 0.089) needs 149. The roughly tenfold gap in rounds tracks the roughly sixfold gap in spectral gap, exactly the topology-controlled convergence the theory predicts.

The two graphs converge to the same vector, the global mean printed on the last line, confirming that the choice of topology changes the speed of consensus but not its destination. The denser graph pays more per round (each node averages with six neighbors instead of two) and converges in far fewer rounds; the ring pays little per round and needs many. That tradeoff, more communication per round against fewer rounds, is the central design dial of decentralized learning, and it is exactly what the spectral gap quantifies.

Practical Example: Training Across Three Data Centers With No Shared Master

Who: An infrastructure team at an organization with GPU clusters in three regional data centers on different continents.

Situation: They wanted to co-train one model across all three sites, but inter-region links were high-latency and a central aggregator in any one region penalized the other two.

Problem: A parameter server in region A forced every gradient from B and C across an expensive intercontinental hop twice per step, and an outage in A stalled the whole job.

Dilemma: Keep a central aggregator and accept the bottleneck plus the single point of failure, or go decentralized and accept slower statistical convergence and harder debugging.

Decision: They went decentralized, organizing the three sites as a fully connected three-node graph and letting each site gossip its averaged model with the other two after a block of local steps.

How: Each site ran data-parallel training internally, then once per block exchanged and averaged model weights peer-to-peer with the other two sites; with only three well-connected nodes the spectral gap was large, so a single gossip round per block kept them tightly synchronized.

Result: No region was a bottleneck or a master, an outage in any one site left the other two still training and reconcilable on its return, and the final accuracy matched a centralized run because the consensus target is the same global average.

Lesson: When nodes are few and well connected, decentralization removes the bottleneck and the single point of failure at almost no convergence cost; the spectral gap tells you in advance whether that holds for your topology.

3. Why Decentralize, and What It Costs Intermediate

The case for deleting the coordinator is concrete. There is no central bottleneck: traffic stays on the sparse edges of the graph instead of converging on one server's network links, which is what let the three data centers avoid funneling every gradient through one region. There is no single point of failure: any node can drop out and, as long as the remaining graph stays connected, the others keep learning and reconcile the returning node later. And there is no central trust anchor, which matters when the peers are mutually distrustful or belong to different owners, as in geo-distributed or open peer-to-peer settings where no party is willing to host everyone's updates. These are the same pressures that motivate federated learning, pushed one step further by removing the last centralized component.

The costs are equally concrete. Mixing is slower than a global collective: information must diffuse hop by hop, so a poorly connected graph can need orders of magnitude more rounds than a single all-reduce, exactly the 149-versus-16 spread of Output 14.8.1 widened on larger, sparser graphs. The analysis is harder: convergence now depends on the interaction of the loss, the step size, the data heterogeneity across nodes, and the spectral gap, where a centralized method only had the first two. And consensus is asymptotic; at any finite round the nodes hold slightly different models, so there is no single canonical checkpoint unless you run an extra averaging pass. Decentralization buys robustness and removes bottlenecks, and it pays in mixing time and analytical complexity.

Library Shortcut: One Decentralized All-Reduce in BlueFog or torch.distributed

Code 14.8.1 built the mixing matrix and the gossip loop by hand. Decentralized training libraries expose a neighbor-average as a single collective that mirrors the all-reduce API. BlueFog, built on top of torch.distributed, wires the communication graph once and reduces the entire gossip step to one call:

# Decentralized SGD step with BlueFog: no central server, neighbors only.
import bluefog.torch as bf
bf.init()
bf.set_topology(bf.ExponentialGraph(bf.size()))   # the W of Code 14.8.1, picked for you

for data, target in local_loader:                 # this peer's own shard
    loss = model(data, target)
    loss.backward()
    optimizer.step()                               # local gradient step
    # one gossip round: average params with graph neighbors, no server involved
    for p in model.parameters():
        p.data = bf.neighbor_allreduce(p.data)     # sparse, decentralized all-reduce

Code 14.8.2: The manual mixing multiply of Code 14.8.1 collapses to one neighbor_allreduce per round. The library builds the doubly stochastic mixing matrix from the chosen topology, schedules only the neighbor exchanges, and never instantiates a central aggregator, turning a dozen lines of hand-rolled gossip into a single collective call.

Fun Note: Gossip Is Named After Exactly What You Think

The term is not a metaphor reached for after the fact. The original gossip protocols from distributed systems were modeled on how a rumor spreads through a social network: each person tells a few neighbors, who tell a few of theirs, and within a logarithmic number of rounds everyone has heard it. Decentralized learning inherits both the name and the dynamics, with one upgrade. A rumor mutates as it spreads; a doubly stochastic average does not, so instead of the story drifting, the peers converge on the one true number, the global mean. It is gossip that, remarkably, makes everyone more accurate.

4. The Frontier: Decentralized Training Over the Open Internet Advanced

For a long time decentralized learning was a theory result with small experiments, because the mixing cost looked fatal for large models. That assumption is now under active assault. The same local-update idea that made federated learning practical, taking many local steps between communications, applies directly here and is the lever that makes decentralized training over slow, wide-area links viable.

Research Frontier: Over-the-Internet Decentralized Training (2024 to 2026)

DiLoCo (Douillard et al., 2024) showed that language models can be trained with workers communicating only every few hundred local steps, slashing the bandwidth a decentralized or geo-distributed setup needs by orders of magnitude, and its streaming and asynchronous successors (Streaming DiLoCo and async variants, 2024 to 2025) relax the synchrony further so slow or distant peers no longer stall the group. Community-run efforts have pushed this onto genuinely open networks: Prime Intellect's INTELLECT-1 (2024) trained a ten-billion-parameter model across volunteer nodes on three continents using a decentralized framework, and decentralized-optimizer lines such as DeMo (2024) co-design the gradient compression and the gossip to cut inter-peer traffic. The throughline is that the spectral-gap and local-step ideas in this section, once a blackboard exercise, now underpin running systems that train frontier-scale models with no central coordinator at all. We connect the optimization side of these methods to the communication-cost models of Chapter 10.

This closes the chapter's arc. We began with a central federated server, relaxed how often it heard from clients, personalized what it sent back, and have now removed it entirely. The next section turns from the coordination topology to the place where these nodes physically live, the resource-constrained edge devices that learn on the data they generate, in Section 14.9.

Exercise 14.8.1: Why the Average and Nothing Else Conceptual

Using the properties stated in Section 1 (the mixing matrix $W$ is symmetric, has non-negative entries, and every row sums to one), argue in words why the quantity $\frac{1}{n}\sum_i w_i$ is unchanged by a gossip round, and why this forces the consensus value to be the average of the starting models rather than any other vector. Then explain what would go wrong with the consensus target if $W$ were row-stochastic but not column-stochastic, as happens when the graph is directed.

Exercise 14.8.2: Topology Versus Rounds Coding

Extend Code 14.8.1 with a third topology: a fully connected graph where every node averages equally with all $n$ nodes. Predict its spectral gap before running (hint: after one round every node already holds the exact mean), then measure its rounds to consensus and confirm. Next, sweep $n$ from 8 to 64 for the ring and plot rounds to consensus against $n$. Describe how the ring's round count grows with the number of nodes and relate that growth to the shrinking spectral gap.

Exercise 14.8.3: Gossip Versus All-Reduce Under Latency Analysis

Suppose one all-reduce across $n$ geo-distributed nodes costs $L$ seconds of latency per call, while one gossip round on your graph also costs $L$ but reaches consensus only after $R$ rounds, where $R \approx 1/\gamma$ and $\gamma$ is the spectral gap. For the ring and exponential graphs of Output 14.8.1, estimate the wall-clock time each spends reaching consensus and compare both to a single all-reduce. Under what condition on $\gamma$ does decentralized gossip lose its wall-clock advantage despite avoiding the central bottleneck, and how does taking more local steps between gossip rounds (as in DiLoCo) change the comparison?