Section 2.9: Distributed System Patterns for AI

"After my third distributed system I stopped learning new architectures. I started recognizing old ones wearing new framework names."
A Coordinator That Has Seen This Shape Before

Big Picture

Almost every distributed AI system in this book is built from one of six recurring structural patterns, and learning to recognize them turns a parade of frameworks into a small, reusable vocabulary. Chapter 2 gave you the primitives: processes and coordinators (Section 2.1), communication and synchronization (Section 2.2), partitioning and replication (Section 2.3), failure and recovery (Section 2.4), consistency (Section 2.5), control-plane consensus (Section 2.6), stragglers (Section 2.7), and locality (Section 2.8). A pattern is a specific, repeated way of wiring those primitives together to solve a class of problem: where the shared state lives, how workers synchronize, how the system survives a dead node, and which workload it fits. This closing section catalogs the six patterns the rest of the book reuses, gives each a one-line "when to use" rule, and points to the chapter that develops it in full. By the end you should be able to look at any distributed AI design and name its pattern, which is the fastest route to predicting its costs and its failure modes.

A pattern is not a framework and not an algorithm; it is the shape of the data flow and the coordination, independent of the library that implements it. The same parameter-server pattern underlies a 2014 recommendation system and a 2025 sparse-embedding store, even though the code shares not a single line. Recognizing the shape is what lets you carry intuition from one system to the next: once you know that all-reduce makes every worker hold the same averaged result in lockstep, you know its synchronization cost and its straggler sensitivity before you read a word of the framework's documentation. Each of the six patterns below names one such shape, states where its state lives and how it synchronizes (the concerns of Section 2.2 and Section 2.3), how it handles a failed node (Section 2.4), and the workload it was born to serve. The forward links are deliberate: this section is a table of contents for the engine room of the book.

1. Six Patterns, One Vocabulary Beginner

We catalog six patterns. They are not exhaustive, but together they account for the large majority of the distributed AI systems you will build or read about, and every later chapter is, in structural terms, a deep treatment of one of them. Read each pattern as a triple: the shape of its state, its synchronization discipline, and the one-line rule for when it is the right choice.

Pattern 1: parameter server (central state, push and pull). One logical store, itself sharded across nodes, holds the authoritative model parameters. Workers pull the parameters they need, compute an update on local data, and push the update back; the server applies it. State is central and mutable; synchronization can be tight or deliberately loosened into bounded staleness (Section 2.5); a failed server shard is recovered from a replica or checkpoint, a failed worker simply stops contributing. When to use: the parameter set is huge and sparse, so each worker touches only a slice of it, as with billion-row embedding tables. This pattern is the subject of Chapter 11.

Pattern 2: all-reduce and collective SGD (no central state, symmetric peers). There is no server. Every worker holds a full replica of the model, computes a gradient on its data shard, and the peers combine their gradients with an all-reduce so that all of them end the step holding the identical averaged result. State is replicated, not central; synchronization is tight and symmetric; a failed worker stalls the collective until the group is reformed. When to use: the model fits on one device and the binding ceiling is training throughput, so you replicate the model and split the data. The collective itself is built in Chapter 4, and it becomes a training loop in Chapter 15.

Pattern 3: MapReduce and scatter-gather (data-parallel, stateless tasks). A coordinator scatters a bounded dataset into independent tasks, each processes its shard with no inter-task communication, and the results are gathered (and optionally shuffled and reduced) into the output. State lives in the data, not in the workers; there is no per-step synchronization, only a barrier between phases; a failed task is simply re-executed because it is deterministic and side-effect-free. When to use: an embarrassingly parallel pass over a dataset too big for one machine, such as deduplicating or tokenizing a web crawl. This pattern, and the re-execution fault tolerance that makes it robust, is Chapter 6.

Pattern 4: actor-learner (decoupled generation and training). Many actor processes interact with environments to generate experience, and one or a few learner processes consume that experience to update a policy, which is then shipped back to the actors. State is split: transient experience on the actors, authoritative policy on the learner; synchronization is loose and asynchronous by design, so a slow actor never blocks the learner; a failed actor is just restarted and its lost experience is regenerated. When to use: reinforcement learning, where collecting experience is the throughput bottleneck and must scale independently of the gradient step. This is the spine of Chapter 20.

Pattern 5: sharded and replicated serving fleet (stateless replicas behind a balancer). The model is replicated across many serving nodes (and, for large models, each replica is itself sharded across devices); a load balancer routes each request to a healthy replica. State is the (read-only) model weights plus per-request transient state; there is no cross-replica synchronization on the hot path; a failed replica is removed from rotation and its traffic shifts to the survivors. When to use: the binding ceiling is inference throughput or tail latency under a strict budget. This is Chapter 23.

Pattern 6: hierarchical and decentralized coordination (peers, gossip, and aggregation trees). No single node holds all the state or coordinates all the others. Workers organize into a tree or a peer graph and exchange partial results with neighbors, aggregating up a hierarchy or mixing through gossip, so that information spreads without any central bottleneck or single trust domain. State is partitioned and locally owned; synchronization is partial and often asynchronous; the structure tolerates node loss because no node is irreplaceable. When to use: when no party may hold all the data (privacy), when nodes are geographically far apart, or when a central coordinator would be a bottleneck or a single point of failure. Federated and decentralized learning, in Chapter 14, is the canonical instance.

Key Insight: A Pattern Is Defined by Where State Lives and How Peers Sync

Strip away the framework and two questions identify any distributed AI pattern: where does the authoritative state live (one central store, replicated everywhere, in the data, or partitioned across owners), and how do the workers synchronize (tight lockstep, bounded staleness, phase barriers, or fully asynchronous). Those two answers predict the rest: a central store with loose sync gives you the parameter server's staleness trade-off; replicated state with tight sync gives you all-reduce's straggler sensitivity. You can read a system's cost profile and its failure behavior off these two axes before you ever see its code.

2. The Patterns Side by Side Beginner

Listing the patterns one by one hides what is clearest in a single view: they differ along exactly the axes Chapter 2 spent its sections building. Table 2.9.1 lays all six against where their state lives (Section 2.3), their synchronization model (Section 2.2 and Section 2.5), how they handle a failed node (Section 2.4), and the workload each fits best. Read down any column to compare the patterns on one concern; read across any row to get a pattern's full profile at a glance.

Table 2.9.1: The six recurring distributed AI patterns mapped to where their state lives, their synchronization model, their failure handling, the best-fit workload, and the chapter that develops each. The columns are the Chapter 2 concerns; the rows are the patterns the rest of the book reuses.

Pattern	Where state lives	Sync model	Failure handling	Best-fit workload	Developed in
Parameter server	Central, sharded store (mutable)	Tight or bounded-stale push/pull	Replica or checkpoint per server shard; workers re-join	Huge sparse parameters (embeddings)	Ch 11
All-reduce / collective SGD	Full replica on every peer	Tight, symmetric, lockstep	Group reformed on loss; step retried from last state	Model fits one device, scale training throughput	Ch 4, Ch 15
MapReduce / scatter-gather	In the data; tasks are stateless	Phase barriers only, no per-step sync	Deterministic task re-execution	One big embarrassingly parallel data pass	Ch 6
Actor-learner	Experience on actors, policy on learner	Asynchronous, decoupled by design	Restart actor, regenerate lost experience	Reinforcement learning at scale	Ch 20
Sharded/replicated serving	Read-only weights plus per-request state	None on the hot path (independent replicas)	Drain replica, shift traffic to survivors	High-throughput, low-latency inference	Ch 23
Hierarchical/decentralized	Partitioned, locally owned by each peer	Partial, gossip or tree, often async	No node irreplaceable; structure self-heals	Privacy-bound or geo-distributed learning	Ch 14

Two structural contrasts in Table 2.9.1 are worth naming because they recur throughout the book. The first is central versus symmetric state: the parameter server keeps one authoritative copy that workers read and write, while all-reduce keeps a full copy on every peer and reconciles them each step. The same training job can be expressed either way, and the choice is a communication-cost trade-off that Chapter 4 makes precise. The second is synchronous versus asynchronous coupling: all-reduce is lockstep, so one straggler (Section 2.7) stalls everyone, whereas the actor-learner pattern deliberately decouples generation from training so that a slow actor costs only its own share of experience. Figure 2.9.1 draws the three most contrasting shapes so you can see the data flow rather than read it.

Figure 2.9.1: Three of the six patterns contrasted by data flow. Left, the parameter server keeps authoritative state in a central (sharded) store; workers pull parameters and push updates (the double-headed arrows). Middle, all-reduce has no central node; symmetric peers exchange gradients so each ends the step holding the identical averaged result. Right, the actor-learner pattern decouples experience generation (actors, in orange) from policy updates (one learner), with the new policy shipped back asynchronously so a slow actor never stalls the learner.

3. Matching a Workload to Its Pattern Intermediate

The practical skill is not memorizing the six patterns but matching a workload to the one that fits. A workload exerts pressure along a few concerns: how much one authoritative copy of state must be shared, how synchronous the steps must be, how much the work fans out into independent shards, and how much the binding pressure is request throughput. Each pattern is good at a particular profile of those concerns. The program below scores five representative AI workloads against the six patterns using a simple cosine match between each workload's pressure vector and each pattern's strength vector, and reports the best fit. It is a teaching toy, not a production planner, but it makes the matching logic explicit and reproducible.

# Six distributed AI patterns scored against one workload's pressure profile.
# Each pattern is rated on how well it fits a given pressure vector, then ranked.
import numpy as np

# A workload's pressure on four concerns, each in [0, 1].
#   central_state : how much one authoritative copy of state must be shared
#   sync_tight    : how synchronous the steps must be (1 = lockstep, 0 = loose)
#   fan_scatter   : how much the work fans out into independent shards
#   serve_qps     : how much the binding pressure is request throughput
workloads = {
    "data-parallel LLM training": dict(central_state=0.5, sync_tight=0.9, fan_scatter=0.4, serve_qps=0.1),
    "huge sparse recsys embeddings": dict(central_state=0.95, sync_tight=0.3, fan_scatter=0.6, serve_qps=0.3),
    "batch corpus dedup":          dict(central_state=0.1, sync_tight=0.1, fan_scatter=0.95, serve_qps=0.0),
    "online RL from a simulator":  dict(central_state=0.6, sync_tight=0.2, fan_scatter=0.8, serve_qps=0.2),
    "low-latency model serving":   dict(central_state=0.1, sync_tight=0.1, fan_scatter=0.3, serve_qps=0.95),
}

# Each pattern is a weight vector over the same four concerns: what it is good at.
patterns = {
    "parameter server":          np.array([0.95, 0.30, 0.50, 0.30]),
    "all-reduce / collective":   np.array([0.40, 0.95, 0.50, 0.10]),
    "MapReduce / scatter-gather":np.array([0.10, 0.10, 0.95, 0.05]),
    "actor-learner":             np.array([0.55, 0.20, 0.90, 0.25]),
    "sharded/replicated serving":np.array([0.15, 0.10, 0.40, 0.95]),
    "hierarchical/decentralized":np.array([0.30, 0.25, 0.60, 0.40]),
}
keys = ["central_state", "sync_tight", "fan_scatter", "serve_qps"]

for wname, w in workloads.items():
    v = np.array([w[k] for k in keys])
    scores = {p: float(np.dot(v, pw) / (np.linalg.norm(v) * np.linalg.norm(pw)))
              for p, pw in patterns.items()}
    best = max(scores, key=scores.get)
    print(f"{wname:30s} -> {best:28s} (fit {scores[best]:.3f})")

Code 2.9.1: A pattern-matcher that scores five AI workloads against the six patterns of Table 2.9.1 by cosine similarity of their concern vectors. The strength vectors encode each pattern's profile (the parameter server scores high on central state, all-reduce on tight sync, and so on); the workload with the closest profile selects the pattern.

data-parallel LLM training     -> all-reduce / collective      (fit 0.992)
huge sparse recsys embeddings  -> parameter server             (fit 0.997)
batch corpus dedup             -> MapReduce / scatter-gather   (fit 0.999)
online RL from a simulator     -> actor-learner                (fit 0.995)
low-latency model serving      -> sharded/replicated serving   (fit 0.995)

Output 2.9.1: Each of the five workloads selects exactly the pattern the prose argued for it: replicated training picks all-reduce, sparse embeddings pick the parameter server, a batch data pass picks MapReduce, RL picks actor-learner, and latency-bound serving picks the replicated fleet. The high fit scores reflect that these are deliberately clean cases; real workloads sit between patterns and combine them.

The clean separation in Output 2.9.1 is a feature of the chosen examples, not a claim that every system maps to exactly one pattern. Real systems compose patterns, and the most demanding ones use several at once. Training a large language model uses all-reduce for the data-parallel dimension, a sharded (parameter-server-like) layout for the optimizer state, and a MapReduce-style pass to prepare the corpus, then serves the result from a sharded replica fleet. The point of the catalog is not to force a system into one box but to give you the names for the boxes, so that you can describe a complex system as a composition of patterns whose individual costs and failure modes you already understand.

Library Shortcut: Each Pattern Is a Framework You Call, Not a System You Build

You almost never implement these patterns from scratch; each has a mature library that owns it, and naming them turns "which pattern" into "which import." The hand-rolled matcher in Code 2.9.1 collapses to a lookup once the pattern is chosen, because the pattern is the framework:

# Distributed AI pattern -> the production library that implements it.
PATTERN_TOOL = {
    "parameter server":          "TorchRec / a PS-backed embedding store  # shard huge sparse tables",
    "all-reduce / collective":   "torch DDP over NCCL                      # gradient all-reduce",
    "MapReduce / scatter-gather":"PySpark / Ray Data                       # scatter, shuffle, reduce",
    "actor-learner":             "Ray RLlib                                # actors generate, learner trains",
    "sharded/replicated serving":"vLLM / Ray Serve                         # replicate, batch, autoscale",
    "hierarchical/decentralized":"Flower / a federated-learning runtime    # aggregate without central data",
}
chosen = "all-reduce / collective"          # e.g. the verdict for data-parallel training
print(PATTERN_TOOL[chosen])

Code 2.9.2: The six patterns mapped to the production library that owns each, the same patterns named in Table 2.9.1. Selecting a pattern selects a dictionary key, and the value is a one-line starting point; the framework supplies the process-group setup, sharding, and fault handling that the pattern's row in the table describes.

Practical Example: The Team That Picked a Pattern by Name and Skipped a Month

Who: A small ML platform team standing up training for a new ads ranking model.

Situation: The model had a dense tower of a few hundred megabytes and an embedding table of tens of billions of sparse parameters, far too large for any single accelerator's memory.

Problem: Their first instinct, copied from a tutorial, was plain all-reduce data parallelism, which requires a full model replica per worker and therefore could not hold the embedding table at all.

Dilemma: Force the embeddings to fit by aggressive hashing and lose accuracy, or step back and ask which pattern the embedding table actually wanted.

Decision: They named the shapes from Table 2.9.1. The dense tower fit the all-reduce pattern; the giant sparse table fit the parameter-server pattern, where each worker pulls only the rows its batch touches. The system was a composition of two patterns, not one.

How: They sharded the embedding table across a parameter-server-style store and kept the dense tower in all-reduce data parallelism, exactly the hybrid that Chapter 11 formalizes, using an off-the-shelf sharded-embedding library rather than writing their own.

Result: Training fit in memory on the first try and ran at the throughput target; the month they had budgeted for a custom sharding scheme was spent on the model instead.

Lesson: Recognizing that a system is two patterns composed, not one pattern stretched, is often the entire design decision. The catalog exists to make that recognition fast.

4. Why These Patterns Persist Intermediate

It is fair to ask why a handful of patterns, several of them more than a decade old, keep reappearing under new framework names. The reason is that each pattern is the natural answer to a specific combination of the constraints Chapter 2 made precise, and those constraints do not change when the hardware does. The parameter server exists because some state is too large and too sparse for every worker to hold a copy; that is a property of the data, not of any GPU generation. All-reduce exists because, when the model does fit, replicating it and reconciling gradients is provably exact (the gradient identity of Section 1.1, seen again as a BSP barrier in Section 2.2) and avoids a central bottleneck. MapReduce persists because deterministic, side-effect-free tasks are the cheapest possible thing to make fault tolerant, and a web-scale data pass needs exactly that. The patterns are stable because the trade-offs they resolve are stable.

What does change is the engineering around each pattern, and that is where the current frontier lives. The patterns are fixed points; the research is in lowering their costs and blurring their boundaries.

Research Frontier: The Patterns Blur and Recompose (2024 to 2026)

The newest distributed AI systems are increasingly hybrids that compose the classic patterns rather than picking one. On the training side, fully sharded data parallelism (FSDP and the ZeRO family) reads as an all-reduce pattern whose state is also sharded parameter-server style, so reduce-scatter and all-gather replace the single all-reduce; we develop this directly in Chapter 15. On the serving side, disaggregated inference (the prefill-decode split in systems like DistServe and Splitwise, 2024) splits the sharded-serving pattern across heterogeneous machines to optimize latency and cost at once. Decentralized training has moved from theory toward practice: local-update methods such as DiLoCo (Douillard et al., 2024) and its open replications let the hierarchical pattern's peers communicate far less often, enabling genuinely geo-distributed, over-the-internet training that the lockstep all-reduce pattern forbade. Actor-learner infrastructure has scaled to power large-model RL from human feedback and from verifiers, with frameworks in the lineage of Ray RLlib and OpenRLHF (2024) orchestrating thousands of asynchronous actors against a handful of learners. The pattern vocabulary is not being replaced; it is being recombined, which is precisely why learning the six shapes now pays off across every chapter ahead.

5. Chapter Summary Beginner

This section closes Chapter 2, so it is worth restating the arc the whole chapter built. We started from the actors of a distributed system, processes, nodes, workers, and coordinators (Section 2.1), and the ways they talk and stay in step (Section 2.2). We saw how state is spread by partitioning, sharding, and replication (Section 2.3), and why that spreading makes failure a design problem rather than an afterthought, met with checkpointing, re-execution, and recovery (Section 2.4). We made precise what "consistent" can mean when a parameter store is read and written by many workers, from staleness bounds to the CAP trade-off (Section 2.5), and separated the control plane, where consensus and leader election rule (Section 2.6), from the data plane, where collectives do. We named the stragglers and bottlenecks that decide real performance (Section 2.7) and the locality, of data and of compute, that you arrange to avoid moving bytes you did not have to (Section 2.8). This final section folded all of those concerns into six reusable patterns and showed that every distributed AI system you will meet is one of them, or a composition of a few.

Thesis Thread: The Concepts Become the Patterns Become the Book

The spine of this book is that AI at scale is the engineering of systems whose data, computation, models, inference, and decisions are distributed across many machines. Chapter 1 named the six axes that work can be distributed along; Chapter 2 supplied the systems vocabulary; this section turned that vocabulary into the six patterns that the rest of the book elaborates one at a time. The parameter server becomes Chapter 11, all-reduce becomes Chapters 4 and 15, MapReduce becomes Chapter 6, actor-learner becomes Chapter 20, the serving fleet becomes Chapter 23, and decentralized coordination becomes Chapter 14. Whenever a later chapter introduces a method, ask which pattern it is an instance of; the answer is almost always one of these six, and naming it tells you the costs and failures to expect.

Key Takeaway: Chapter 2 in Six Patterns

Distributed AI systems are built from a small, stable vocabulary of structural patterns, and recognizing the pattern predicts the cost and the failure mode. (1) Parameter server: central sharded state, push/pull, for huge sparse parameters. (2) All-reduce: replicated state, lockstep peers, for data-parallel training when the model fits one device. (3) MapReduce: stateless tasks, phase barriers, deterministic re-execution, for one big data pass. (4) Actor-learner: decoupled generation and training, asynchronous, for reinforcement learning. (5) Sharded/replicated serving: stateless replicas behind a balancer, for high-throughput low-latency inference. (6) Hierarchical/decentralized: partitioned, locally owned state with gossip or tree aggregation, for privacy-bound or geo-distributed learning. Identify a system's pattern (or composition of patterns) by where its state lives and how its peers synchronize, and the rest of its behavior follows.

With the concepts of Chapter 2 compressed into patterns, the next question is quantitative: given a pattern, exactly how well does it scale, and where does adding machines stop helping? That is the subject of Chapter 3, which gives you Amdahl's and Gustafson's laws, the roofline model, and the alpha-beta cost of communication, the arithmetic that turns "this pattern should scale" into a number you can defend.

Exercise 2.9.1: Name the Pattern Conceptual

For each system, name the primary pattern from Table 2.9.1 and state where its authoritative state lives and how its workers synchronize: (a) a nightly job that converts a 50-terabyte image archive into resized thumbnails with no cross-image dependency; (b) training a 1-billion-parameter language model that fits on one GPU across 64 GPUs to go faster; (c) a recommender whose 200-billion-parameter embedding table cannot fit on any single machine; (d) a robot-control policy learned from thousands of parallel physics simulators. For any one system, name a second pattern it would also need in a complete deployment and why.

Exercise 2.9.2: Extend the Matcher Coding

Extend Code 2.9.1 with two new workloads: a privacy-constrained model trained across ten hospitals that may not share raw data, and a geo-distributed model trained across three continents with expensive inter-region links. Add the concern cannot_centralize (in $[0,1]$) to every workload and pattern vector, give the hierarchical/decentralized pattern a high weight on it, and confirm both new workloads select that pattern. Then report the second-best pattern for each and explain in two sentences what that runner-up tells you about the composition the real system would use.

Exercise 2.9.3: Cost a Pattern's Failure Tax Analysis

Two of the six patterns handle a dead node very differently: all-reduce stalls the whole group until it is reformed, while MapReduce simply re-executes the one lost task. Assume a job of $T = 1000$ independent units of work on $K = 100$ nodes, where each node fails independently with probability $p = 0.01$ during the job. For the MapReduce pattern, estimate the expected extra work from re-executing failed units. For the all-reduce pattern, argue qualitatively why the cost of a single failure is far higher even though the failure probability is the same, and connect your answer to the straggler discussion of Section 2.7. State which pattern you would prefer for a long-running job on cheap preemptible nodes, and why.

Project Ideas: Build and Compare the Patterns

These projects turn the catalog into something you run. Each is sized as a starting point; one carried far enough becomes a strong portfolio piece or a seed for the capstone in Chapter 41.

1. Two patterns, one model. Take a small model with both a dense layer and a large embedding table (a tiny recommender works well). Implement data-parallel all-reduce for the dense part and a simple parameter-server pull/push for the embedding rows each batch touches, in a single multiprocess script. Measure memory and per-step time as you grow the embedding table, and show the point where a pure all-reduce design runs out of memory but the hybrid does not. This is the practical-example story of Section 3 made real, and a direct on-ramp to Chapter 11.

2. A pattern-recognizer for real systems. Extend Code 2.9.1 into a small tool that reads a short structured description of a system (state size, sync requirement, request rate, trust boundaries) and outputs the recommended pattern, a runner-up, and the library from Code 2.9.2. Validate it by feeding in five real systems from papers or blog posts and checking that its verdict matches the architecture the authors actually chose. Document the cases where it disagrees and why.

3. Failure-tax benchmark. Implement a toy MapReduce (scatter-gather) and a toy all-reduce over the same set of worker processes, then inject random worker failures at a controlled rate. Measure how total completion time degrades with the failure rate for each pattern, reproducing the qualitative contrast of Exercise 2.9.3 with real numbers, and write up which pattern you would deploy on preemptible hardware and why. This previews the elastic and fault-tolerant training of Chapter 18.