"They told me I was one of four very different systems. I checked, and every one of us was holding a coordination barrier with the same tired hand."
A Scheduler That Has Served Every Workload
The six axes of distribution are not a taxonomy to memorize; they are coordinates, and every real distributed AI system is a point in that space. Once you can locate a system by which axes it spreads across machines, two questions answer themselves: why it had to be distributed at all (which ceiling binds) and what its hardest engineering problem will be (which axis it leans on most). This section plots four very different systems onto the same six coordinates from Section 1.2: a web-scale retrieval-augmented question answerer, a planetary-scale recommender, a foundation model trained on thousands of accelerators, and a drone swarm acting at the edge. The systems share almost no code, yet they all live in one space, and reading their coordinates is the skill that the design-space checklist in Section 1.8 turns into a method.
The previous sections built the vocabulary one piece at a time: the six axes (Section 1.2), scale-out versus scale-up (Section 1.3), and the throughput, latency, cost, and reliability targets a system is judged against (Section 1.6). This section spends that vocabulary on four concrete systems. For each one we ask the same four questions: what it does, which ceilings force it off a single machine, which axes it distributes, and one striking number that makes the scale tangible. The point is not the systems individually; it is that one set of coordinates describes all of them, and that the coordinates predict where the difficulty lives before you have read a line of the system's code.
1. Vignette One: Web-Scale Retrieval-Augmented Question Answering Beginner
A retrieval-augmented generation (RAG) system answers a question by first retrieving relevant passages from a very large corpus and then feeding those passages, plus the question, to a language model that writes the answer. At web scale the corpus is a crawl of billions of documents, far too large to fit in one machine's memory or even on one machine's disk, so the corpus is chunked, embedded into vectors, and the resulting index is sharded across many retrieval servers. Three ceilings bind: data (the corpus does not fit), model (the answering model is large enough to need sharding across accelerators), and throughput (user queries arrive continuously under a latency budget). This system therefore distributes data, the model, inference, and the cluster coordination that ties retrieval to generation; only training and multi-agent intelligence sit idle. The retrieval mechanics are developed in Chapter 25, and the full system is the subject of the case study in Chapter 36.
The striking number is the index. A web crawl of roughly $10^{10}$ passages, each embedded as a 768-dimensional float vector, is about $10^{10} \times 768 \times 4$ bytes, near 30 terabytes of raw vectors before any compression, which is why the index must be partitioned across dozens to hundreds of retrieval shards that are searched in parallel and merged. The latency budget for the retrieval step is often a few tens of milliseconds, so the merge across shards is itself a distributed-systems problem, not an afterthought.
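The arithmetic above, and the shard-and-merge pattern it forces, can be sketched in a few lines of Python. This is a minimal illustration, not a production retriever: the 256 GB per-shard capacity and the toy hit lists are assumptions chosen for the example, and `raw_index_bytes`, `shards_needed`, and `merge_top_k` are hypothetical helper names.

```python
import heapq
from itertools import islice

BYTES_PER_FLOAT32 = 4


def raw_index_bytes(num_vectors: int, dim: int) -> int:
    """Size of an uncompressed float32 vector index, in bytes."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def shards_needed(total_bytes: int, shard_capacity_bytes: int) -> int:
    """Smallest number of shards whose combined capacity holds the index."""
    return -(-total_bytes // shard_capacity_bytes)  # integer ceiling division


def merge_top_k(per_shard_hits, k):
    """Merge per-shard (distance, passage_id) lists, each already sorted by
    ascending distance, and keep the k globally nearest hits."""
    return list(islice(heapq.merge(*per_shard_hits), k))


total = raw_index_bytes(10**10, 768)
# ASSUMPTION: each retrieval shard keeps 256 GB of raw vectors in memory.
num_shards = shards_needed(total, 256 * 10**9)
print(f"raw index: {total / 1e12:.2f} TB across {num_shards} shards")

# Toy per-shard results; the distances and passage ids are made up.
shard_hits = [
    [(0.12, "p7"), (0.40, "p2")],
    [(0.05, "p9"), (0.33, "p4")],
    [(0.21, "p1"), (0.90, "p3")],
]
top3 = merge_top_k(shard_hits, 3)
print(top3)
```

Under these assumptions the raw vectors alone need on the order of a hundred shards, and every query pays for one merge across all of them, which is why the merge sits on the latency-critical path.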
You do not need the source code to guess where a distributed AI system will hurt. Locate it on the six axes first. A system that distributes data and the model but serves under a tight latency budget will spend its engineering effort on the merge-and-coordinate step that stitches sharded results back into one answer in time. A system that distributes training across thousands of workers will spend it on the collective that keeps those workers consistent. The axis a system leans on hardest is the axis where its failures, its costs, and its cleverest engineering all concentrate.
2. Vignette Two: A Planetary-Scale Recommender Beginner
A large recommender ranks items (videos, products, posts) for each user, and it is the purest example of a system distributed along almost every axis at once. Its models are often modest in raw compute but enormous in memory, because the embedding tables that map billions of users and items to vectors can reach terabytes, far beyond one accelerator, so the tables are sharded across a fleet of parameter servers (the technique developed in Chapter 11). The interaction logs that train it are petabyte-scale, forcing distributed data processing; the model trains across many workers; and the ranking service answers an immense request volume under a latency budget. Data, training, the model, inference, and coordination are all distributed at once; only multi-agent intelligence is absent. This is the system whose case study is Chapter 38.
The striking number is the embedding table. A recommender that embeds $10^9$ users and $10^9$ items into 128-dimensional vectors holds $2 \times 10^9 \times 128 \times 4$ bytes of parameters, roughly a terabyte, in the embeddings alone, dwarfing the dense layers on top. No single accelerator holds a terabyte of parameters, so the table must be sharded by key across many servers, and a single training step gathers exactly the rows it touches from across the fleet. That sparse, key-addressed gather is what makes a recommender's distribution pattern different from a dense model's, and it is why the embeddings-at-scale arc threads from Chapter 11 through to the case study.
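The table size and the key-addressed gather can be sketched as follows. This is an illustration under stated assumptions: placing rows by key modulo the shard count is one simple scheme chosen for the example (the text does not specify a placement rule), and `shard_of` and `plan_gather` are hypothetical names.

```python
from collections import defaultdict

BYTES_PER_FLOAT32 = 4


def embedding_table_bytes(num_rows: int, dim: int) -> int:
    """Parameter bytes held by a float32 embedding table."""
    return num_rows * dim * BYTES_PER_FLOAT32


def shard_of(key: int, num_shards: int) -> int:
    """ASSUMPTION: a row lives on the shard given by key modulo shard count."""
    return key % num_shards


def plan_gather(batch_keys, num_shards):
    """Group the distinct keys one training step touches by owning shard,
    so each shard is asked only for the rows it holds."""
    plan = defaultdict(list)
    for key in sorted(set(batch_keys)):
        plan[shard_of(key, num_shards)].append(key)
    return dict(plan)


# 10^9 users plus 10^9 items, 128 dimensions each.
table = embedding_table_bytes(2 * 10**9, 128)
print(f"embedding table: {table / 1e12:.3f} TB")

# A toy batch of user/item ids; the ids are made up.
plan = plan_gather([17, 4, 17, 1_000_003, 42, 8], num_shards=4)
print(plan)
```

The gather plan touches only the handful of rows in the batch, not the whole table, which is the sparse access pattern that separates this workload from a dense model's all-to-all collectives.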
3. Vignette Three: Foundation-Model Pretraining on Thousands of GPUs Intermediate
Pretraining a foundation model is the system that distributes along the training-side axes hardest of all. A model with hundreds of billions of parameters cannot fit, with its optimizer state and activations, in one accelerator's memory, so it must be split across devices (model parallelism); the training corpus is trillions of tokens, so the data is partitioned across workers (data parallelism); and the whole run occupies thousands of accelerators for weeks, so cluster coordination, checkpointing, and fault tolerance become first-class concerns. Data, training, the model, and coordination are all distributed; inference and multi-agent intelligence are not part of the training job itself. The combination of three parallelism strategies at once is exactly why Chapter 19 can only be written after the chapters that own each strategy: data parallelism builds on the all-reduce of Chapter 4, model sharding on Chapter 16, and expert routing on Chapter 17.
The striking number is the failure rate, not the FLOP count. At a scale of thousands of accelerators running for weeks, hardware failures are not exceptional events but a near-daily certainty: with thousands of components each having a small per-day failure probability, the expected time between failures for the job as a whole drops to hours. A run that cannot survive a single dead worker would never finish, which is why elastic and fault-tolerant training (Chapter 18) is not a luxury but the precondition for the run existing at all.
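A back-of-the-envelope version of that claim, as a sketch: the 4,096-device count and the 0.1% per-device daily failure probability are illustrative assumptions rather than measured figures, and failures are assumed independent.

```python
def expected_failures_per_day(num_devices: int, p_fail_per_day: float) -> float:
    """Expected number of device failures per day, assuming independence."""
    return num_devices * p_fail_per_day


def job_mtbf_hours(num_devices: int, p_fail_per_day: float) -> float:
    """Approximate mean time between failures for the whole job, in hours,
    treating failures as arriving at a constant average rate."""
    return 24.0 / expected_failures_per_day(num_devices, p_fail_per_day)


def prob_any_failure_in_day(num_devices: int, p_fail_per_day: float) -> float:
    """Probability that at least one device fails within a day."""
    return 1.0 - (1.0 - p_fail_per_day) ** num_devices


# ASSUMPTIONS: 4,096 accelerators, each failing with probability 0.001 per day.
N, P = 4096, 0.001
mtbf = job_mtbf_hours(N, P)
p_day = prob_any_failure_in_day(N, P)
print(f"job-level MTBF: about {mtbf:.1f} hours")
print(f"chance of at least one failure per day: {p_day:.1%}")
```

Even with a per-device failure probability this small, the job as a whole sees a failure every few hours, so a multi-week run has to checkpoint and recover many times over.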
Across our four systems, five of the six axes get distributed by at least two of them, and coordination is distributed by all four. Exactly one axis, multi-agent intelligence, is claimed by only a single system, the drone swarm in the next vignette. It is the loneliest axis in this section, which is fitting: distributing the decision-making itself, rather than the data or the math, is the newest and least settled of the six, and it earns the entirety of Part VI.
4. Vignette Four: A Multi-Robot Drone Swarm Intermediate
A swarm of drones performing search, mapping, or inspection is the one system here that distributes intelligence itself. Each drone runs its own perception and control model locally, at the edge, because a round trip to a central server would blow the control-loop latency budget and because connectivity to a central server cannot be assumed. The drones must then coordinate, agreeing on coverage, avoiding collisions, sharing what they have seen, without a single machine that holds the whole plan. The binding ceiling here is neither data size nor model size but latency and autonomy: decisions must be made locally and fast. So this system distributes inference (each drone infers on-board), coordination (the swarm agrees on a joint plan), and intelligence (no central brain decides for all), while data, training, and model parallelism play no role in the deployed system. Its case study is Chapter 39, building on the swarm-intelligence and multi-agent-coordination material of Chapter 31.
The striking number is the control-loop budget. A drone avoiding obstacles at speed may need to perceive and react within roughly 50 milliseconds; a round trip to a cloud data center is often 50 to 100 milliseconds before any computation, so the perception-to-action loop physically cannot live in the cloud. The speed of light and the network turn the design decision into a constraint: intelligence must be distributed to the edge because the latency budget forbids centralizing it.
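The budget argument is simple subtraction, sketched here with the numbers from the paragraph above. Treating on-board inference as having zero network round trip is an assumption of the sketch, and `remaining_compute_ms` is a hypothetical helper name.

```python
CONTROL_LOOP_BUDGET_MS = 50.0


def remaining_compute_ms(budget_ms: float, round_trip_ms: float) -> float:
    """Time left for perception and planning after the network round trip."""
    return budget_ms - round_trip_ms


# Round-trip times in milliseconds; on-board is ASSUMED to need no network hop.
placements = {
    "on-board": 0.0,
    "cloud (best case)": 50.0,
    "cloud (worst case)": 100.0,
}

verdicts = {}
for name, rtt in placements.items():
    left = remaining_compute_ms(CONTROL_LOOP_BUDGET_MS, rtt)
    verdicts[name] = "feasible" if left > 0 else "infeasible"
    print(f"{name:20s} round trip {rtt:5.1f} ms, compute left {left:6.1f} ms: {verdicts[name]}")
```

Even the best-case cloud round trip consumes the entire budget before any computation starts, which is the sense in which the placement is a constraint rather than a choice.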
5. Four Systems, One Map Beginner
Placing the four vignettes side by side on the six axes makes the shared structure visible. Table 1.7.1 maps each system (rows) to each axis (columns), marking with a filled circle the axes the system actively distributes and leaving the others blank, with a short note on the ceiling that binds each system. The marks are exactly the coordinates discussed in the four vignettes above, and the rightmost column counts how many axes each system spreads across machines.
| System | Data | Training | Model | Inference | Coordination | Intelligence | Axes |
|---|---|---|---|---|---|---|---|
| Web-scale RAG | ● | | ● | ● | ● | | 4 |
| Planetary recommender | ● | ● | ● | ● | ● | | 5 |
| Foundation pretraining | ● | ● | ● | | ● | | 4 |
| Drone swarm | | | | ● | ● | ● | 3 |
Two patterns jump out of the table, and the short script below confirms them by counting the marks rather than trusting the eye. First, no axis is universal except one: cluster coordination is distributed by all four systems, because the moment work lives on more than one machine, something has to schedule it, detect failures, and keep the pieces consistent, which is why coordination earns space in both Part I and Part VII. Second, the systems cluster into two families: the three data-center systems (RAG, recommender, pretraining) all distribute data and the model and lean on heavy collectives, while the swarm distributes none of those and instead owns the intelligence axis alone. The map predicts which chapters each system will draw on long before you build it.
# Map four real distributed AI systems onto the six axes of distribution
# (Section 1.2) and count, per system, how many axes are actively distributed.
AXES = ["data", "training", "model", "inference", "coordination", "intelligence"]

# 1 = this system actively distributes along that axis; 0 = it does not.
systems = {
    "Web-scale RAG": dict(data=1, training=0, model=1, inference=1, coordination=1, intelligence=0),
    "Recommender": dict(data=1, training=1, model=1, inference=1, coordination=1, intelligence=0),
    "Foundation pretrain": dict(data=1, training=1, model=1, inference=0, coordination=1, intelligence=0),
    "Drone swarm": dict(data=0, training=0, model=0, inference=1, coordination=1, intelligence=1),
}

print(f"{'system':22s} " + " ".join(f"{a[:5]:>6s}" for a in AXES) + " axes")
for name, m in systems.items():
    row = " ".join(f"{m[a]:>6d}" for a in AXES)
    print(f"{name:22s} {row} {sum(m.values()):>4d}")

# How often is each axis distributed across the four systems?
print("\nper-axis count across the 4 systems:")
for a in AXES:
    c = sum(m[a] for m in systems.values())
    print(f" {a:13s}: {c}/4")
system                   data  train  model  infer  coord  intel axes
Web-scale RAG               1      0      1      1      1      0    4
Recommender                 1      1      1      1      1      0    5
Foundation pretrain         1      1      1      0      1      0    4
Drone swarm                 0      0      0      1      1      1    3

per-axis count across the 4 systems:
 data         : 3/4
 training     : 2/4
 model        : 3/4
 inference    : 3/4
 coordination : 4/4
 intelligence : 1/4
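The two patterns can also be checked directly rather than read off the counts. The sketch below restates the four coordinates as sets of axes (the same marks as the table, re-declared here so the block runs on its own); the variable names are hypothetical.

```python
AXES = ["data", "training", "model", "inference", "coordination", "intelligence"]

# The axes each system actively distributes, as marked in Table 1.7.1.
systems = {
    "Web-scale RAG": {"data", "model", "inference", "coordination"},
    "Recommender": {"data", "training", "model", "inference", "coordination"},
    "Foundation pretrain": {"data", "training", "model", "coordination"},
    "Drone swarm": {"inference", "coordination", "intelligence"},
}

# Pattern 1: which axes does every system distribute?
universal = [a for a in AXES if all(a in axes for axes in systems.values())]

# Which axes are claimed by exactly one system, and by which one?
owners = {a: [n for n, axes in systems.items() if a in axes] for a in AXES}
unique = {a: names[0] for a, names in owners.items() if len(names) == 1}

# How many axes are distributed by at least two systems?
shared = [a for a, names in owners.items() if len(names) >= 2]

# Pattern 2: which systems distribute both data and the model?
data_center_family = [n for n, axes in systems.items() if {"data", "model"} <= axes]

print("universal axes:", universal)
print("single-owner axes:", unique)
print("axes shared by two or more systems:", len(shared))
print("data + model family:", data_center_family)
```

Coordination comes out as the only universal axis, intelligence as the only single-owner axis, and the data-plus-model family contains exactly the three data-center systems.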
The same data reads more clearly as a picture. Figure 1.7.1 plots the four systems as rows and the six axes as columns, filling a cell when the system distributes that axis, so the shared coordination column and the lone intelligence cell are visible at a glance.
Who: A staff engineer asked to size the infrastructure for a new product before any code exists.
Situation: The product is a live-video moderation service: incoming streams are transcribed, embedded, matched against a policy corpus, and flagged in near real time.
Problem: Leadership wants a one-page answer to "what kind of system is this, and what will be hard?" within a day, with no prototype to measure.
Dilemma: Guess the architecture from intuition and risk anchoring the whole team on a wrong shape, or run a real benchmark that does not exist yet and miss the deadline.
Decision: The engineer located the product on the six axes instead. The policy corpus is large and sharded (the data axis, plus the model axis for the matcher), streams arrive continuously under a latency budget (the inference and coordination axes), and no multi-agent reasoning is involved (the intelligence axis stays blank).
How: Recognizing the coordinates as a near-match to the web-scale RAG vignette, the engineer reused that system's shape: a sharded vector index plus a serving model, with the merge-under-latency step flagged as the hardest part exactly as Table 1.7.1 predicts.
Result: The one-pager named the binding ceilings (data and latency), the axes to distribute, and the merge-and-coordinate step as the risk, all before a prototype, and the later benchmark confirmed the shape.
Lesson: The six axes are a design tool, not just a description. Plotting a system you have never built tells you which existing system it resembles and where its difficulty will concentrate.
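The engineer's move, matching a new system's coordinates against known ones, can be sketched as a nearest-neighbor lookup. Counting mismatched axes (Hamming distance) is one simple notion of "near-match" assumed here for illustration; the case study does not say how the engineer compared coordinates, and the names below are hypothetical.

```python
AXES = ("data", "training", "model", "inference", "coordination", "intelligence")

# Coordinates of the four vignettes, in AXES order (1 = distributed).
KNOWN = {
    "Web-scale RAG": (1, 0, 1, 1, 1, 0),
    "Recommender": (1, 1, 1, 1, 1, 0),
    "Foundation pretrain": (1, 1, 1, 0, 1, 0),
    "Drone swarm": (0, 0, 0, 1, 1, 1),
}


def hamming(a, b) -> int:
    """Number of axes on which two coordinate tuples disagree."""
    return sum(x != y for x, y in zip(a, b))


def rank_matches(candidate):
    """Known systems ordered from closest to farthest from the candidate."""
    return sorted((hamming(candidate, coords), name) for name, coords in KNOWN.items())


# The live-video moderation service: data, model, inference, coordination.
moderation = (1, 0, 1, 1, 1, 0)
ranked = rank_matches(moderation)
for distance, name in ranked:
    print(f"{name:20s} differs on {distance} axes")
```

With these marks the moderation service coincides with the web-scale RAG coordinates exactly, which is why that vignette's shape and its merge-under-latency risk transfer so directly.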
Each of the four systems has a production stack that absorbs most of the distribution machinery, so the engineering effort goes to the axis that binds rather than to reinventing collectives. The hand-rolled all-reduce of Section 1.1 becomes one line in each:
# Web-scale RAG retrieval: a sharded vector index, searched in parallel.
import faiss # index shards + parallel search
# Recommender / foundation-model training: data + model parallel in a few lines.
import torch.distributed as dist # all-reduce, reduce-scatter, all-gather
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP # shard a huge model
# Drone swarm: a multi-agent coordination runtime at the edge.
import ray # actors, messaging, fault detection
faiss shards and searches the RAG index, torch.distributed and FSDP carry the recommender and pretraining collectives (Chapter 16), and ray hosts the swarm's coordinating agents.
The four vignettes were cleanly separated for teaching, but recent work is merging their coordinates. Agentic RAG systems (surveyed widely through 2024 and 2025) wrap the web-scale retriever of vignette one inside a multi-agent controller, lighting up the intelligence axis that RAG left blank and pulling it toward the swarm. Geo-distributed and over-the-internet training, in the lineage of DiLoCo (Douillard et al., 2024) and the open Prime Intellect INTELLECT-1 run (2024), spreads foundation-model pretraining across data centers that are not even co-located, stretching the coordination axis of vignette three to planetary scale. Mixture-of-experts serving (the lineage behind models such as DeepSeek-V3, 2024) gives the recommender's sparse, key-addressed routing to dense language models at inference time. The lesson for this section is durable even as the examples blur: the systems move, but they move within the same six-axis space, and naming their coordinates is how you track them. Section 1.8 turns that space into the design-space method directly.
Add a fifth row to Table 1.7.1 for a federated medical-imaging system in which many hospitals train a shared model without their patient data ever leaving the hospital (the case study of Chapter 37). Mark which of the six axes it distributes, and justify each mark in one sentence. Which axis does it distribute that none of the four systems in this section do, and which axis that all three data-center systems distribute does it deliberately refuse to? Explain why that refusal is the whole point of the federated design.
Starting from Code 1.7.1, add code that reports, for each pair of systems, how many axes they distribute in common (the size of the intersection of their axis sets). Print the most-similar and least-similar pairs. Verify against the prose claim that the three data-center systems form one cluster and the swarm stands apart, and state numerically how far the swarm sits from each of the other three.
The drone swarm distributes intelligence because a cloud round trip of 50 to 100 milliseconds exceeds its roughly 50-millisecond control-loop budget. Suppose a competitor proposes centralizing the swarm's decisions in a nearby edge server reachable in 8 milliseconds round trip, leaving only 42 milliseconds for perception and planning. Argue from the latency budget alone whether this re-centralization is viable, what single failure it reintroduces that the fully distributed swarm avoided, and how the answer would change for a slow-moving inspection swarm with a 2-second control loop. Connect your reasoning to the reliability discussion of Section 1.6.
Four systems, one map. The next section stops touring finished systems and turns the six axes into a method: given a new problem, how do you decide which axes to distribute, in what order, and at what cost? That is the distributed AI design space, and it is where Section 1.8 takes the coordinates you have just learned to read and makes them coordinates you can choose.