Section 41.4: Designing the Distributed Version

"A design, in the end, is just two decisions written down: who holds what, and who talks to whom. Everything else is me trying to make those two decisions survive contact with the network."
A Coordinator Sketching Its Own Topology

Big Picture

A distributed design is the map from one-machine logic to many-machine logic: it decides how the work is partitioned across nodes, what pattern those nodes use to communicate, how strongly their views of the shared state must agree, who coordinates them, and how the whole thing survives the failures that scale makes routine. You arrive at this section already holding two things: a distribution axis chosen in Section 41.2 and a single-machine baseline measured in Section 41.3. The design is what turns those into a system. The discipline this section teaches is to make every one of these decisions on paper, and to predict the speedup the design can reach, before a line of distributed code is written. A design you cannot defend with a number is a guess, and a guess that fails at three hundred nodes is expensive to discover.

By now your capstone has a shape. You picked the axis along which the work must spread, you built the version that runs on one machine, and you measured where that version runs out of room: the memory it cannot grow past, the throughput it cannot reach, the dataset it cannot hold. This section is where those measurements become architecture. The baseline told you which ceiling binds; the design says exactly how many machines will share the load, how they will split the data and the model between them, what they will say to one another and when, and what happens when one of them dies mid-step. None of this is improvisation at the keyboard. It is a small number of decisions, each with a name, each tied to a chapter of this book, and each answerable in advance.

We treat the design as a template with six slots: partitioning, communication pattern, consistency model, coordination, fault tolerance, and load balancing. Filling the six slots is the whole job. The case studies of Chapters 36 through 40 are each one filling of this template, and seeing the same six slots answered five different ways is the fastest way to learn what a good answer looks like. Figure 41.4.1 shows the template you will fill.

Figure 41.4.1: The distributed-design template. The chosen axis and the baseline (top) feed six decision slots, each with its menu of standard options and the chapter that develops it. A complete design is one option picked, and justified, in every slot. The case studies in Section 7 of this section are each one row through this menu.

1. Partitioning: Who Holds What Beginner

The first decision is the one the epigraph calls "who holds what." Partitioning is how the single-machine state, the data, the model, and the intermediate work, gets cut into pieces that live on different nodes. The axis you chose in Section 41.2 already points at the answer. If the binding ceiling was data volume, you partition the data: each node owns a disjoint shard of examples and the model is replicated, the pattern of data parallelism. If the ceiling was model size, you partition the model: each node owns a slice of the parameters and the data streams past all of them, the pattern of model and sharded parallelism. If the ceiling was a throughput or coordination problem over many independent tasks, you partition the work: each node owns a subset of keys, requests, or agents.

The partition is not just a division; it is a contract about locality. A good partition keeps the data a node needs close to the computation that needs it, so that most work is local and only the unavoidable minimum crosses the network. The concepts here, sharding by hash versus range, co-locating a key with its data, replicating the small and partitioning the large, were laid out in Chapter 2 and made concrete for datasets in Chapter 8 and for parameters in Chapter 11 and Chapter 16. The decision you record is one sentence: what is the unit of partition (an example, a parameter range, a key, an agent), how many partitions, and what is replicated rather than split.

Key Insight: The Partition Decides the Communication Bill

Partitioning and communication are not two independent choices; the first sets the budget for the second. Whatever a node does not hold locally, it must fetch or exchange over the network, so the partition fixes how much traffic the design generates per step. Partition along the axis where most of the work is already local and the communication bill is small; partition across that axis and every step pays to move state. Choose the partition first, then the cheapest communication pattern that the partition allows. The two later sections on predicted speedup are, in effect, pricing the bill this decision created.

2. The Communication Pattern: Who Talks to Whom Intermediate

Once the state is partitioned, the nodes must exchange the part each one lacks. The shape of that exchange is the communication pattern, and there are four canonical ones, each suited to a different partition. All-reduce is the collective of data parallelism: every node holds a full copy of the model, computes a partial result over its data shard, and a single symmetric operation sums those partials and returns the total to everyone. It is peer-to-peer, has no central bottleneck, and is the workhorse of Chapter 4 and every parallel-training chapter that builds on it. Parameter server is the push-pull pattern of Chapter 11: workers push gradients to and pull parameters from a set of server shards that own the canonical state, which fits naturally when the parameters are too large or too sparse to replicate, as with giant embedding tables. Gossip is the decentralized pattern of Chapter 14: each node talks only to a few neighbors and the global view converges over rounds, the right choice when there is no central authority or the topology is wide and unreliable. Scatter-gather is the coordinator pattern: one node fans a task out to many and collects the results, the shape of MapReduce and of most query and inference fan-outs.

The pattern is chosen by the partition and the consistency you can tolerate, not by taste. A replicated model with synchronous training wants all-reduce. A sharded embedding table wants a parameter server. A pool of privacy-isolated clients wants gossip or federated aggregation. A fan-out query over index shards wants scatter-gather. Record the pattern, the primitive it uses, and the volume it moves per step, because that volume is the input to the speedup model in Section 5 of this section.

Practical Example: Two Capstones, Two Patterns, Same Axis Name

Who: Two students who both wrote "data parallelism" on their axis slip in Section 41.2.

Situation: Student A was scaling a vision classifier; student B was scaling a click-through-rate model with a hundred-million-row embedding table.

Problem: Both assumed "data parallel" meant "all-reduce," and both wrote that in slot 2 of the template.

Dilemma: Student B's model would not fit replicated on every worker, because the embedding table alone exceeded one GPU; all-reduce assumes a full replica per node.

Decision: Student A kept all-reduce. Student B split the design: dense layers stayed data-parallel with all-reduce, but the embedding table moved to a parameter-server shard set, with workers pulling only the rows their batch touched.

How: Student B partitioned the table by feature-hash across server shards (Chapter 11) and kept all-reduce only for the dense gradients.

Result: The hybrid moved a few thousand embedding rows per step instead of the whole table, and the predicted communication bill dropped by more than an order of magnitude before any code ran.

Lesson: The axis names the partition; it does not name the pattern. A large or sparse partition can force a parameter server even inside a "data-parallel" design.

3. The Consistency Model: How Closely Views Must Agree Intermediate

Communicating costs time, and the consistency model is the decision about how often you are willing to pay it. In a synchronous design, every node waits at a barrier until all have exchanged their state, so every node always computes on an identical, up-to-date view. Synchrony is simple to reason about and gives results identical to the single-machine baseline, but it is only as fast as the slowest node, which is why stragglers, introduced in Chapter 2 and priced in Chapter 3, are its enemy. In an asynchronous design, nodes proceed without waiting, reading whatever version of the shared state is currently available, which removes the barrier and the straggler problem at the cost of computing on stale information. Between them sits bounded staleness, the stale-synchronous middle ground of Chapter 10, where a node may run ahead but never more than a fixed number of steps past the slowest, capturing much of async's speed while keeping a provable bound on the error staleness introduces.

The right choice depends on how sensitive your computation is to stale state. Gradient descent tolerates a little staleness and often converges fine asynchronously; a counter or an inventory that must never double-spend tolerates none. Record which one your capstone can accept and why, because this single decision determines whether the communication cost of Section 5 is paid every step (synchronous) or amortized across several (bounded or asynchronous).

4. Coordination: One Brain or Many Intermediate

Someone has to decide which node does what, detect when one has failed, and rebalance when the work shifts. That role is coordination, and the choice introduced in Section 1.4 is between a central coordinator and a decentralized scheme. A central coordinator, the driver in Spark, the rank-zero process in a training job, the scheduler in a cluster, holds the authoritative plan and is simple to build and reason about; its weakness is that it is a single point of failure and a potential bottleneck if every decision routes through it. A decentralized scheme spreads the decision across peers, often through a consensus protocol or a gossip of state, which removes the single point of failure at the cost of the agreement machinery that Chapter 2 covers.

Most capstones, sensibly, use a central coordinator for control while keeping the data path decentralized: a driver decides the plan, but the heavy gradient or query traffic flows peer-to-peer and never touches the coordinator. That split, central control plane and decentralized data plane, is the default worth defending unless your project specifically cannot tolerate a single coordinator (a federated or adversarial setting), in which case the consensus cost is the price of survival. Record who coordinates, and what happens to the system if that coordinator disappears, which is the question the next slot answers.

5. Predicting the Speedup Before You Build Advanced

This is the slot that separates a design from a wish. Before committing to a node count, you can predict the speedup the design will reach from two numbers you already have: the single-machine compute time per step from your baseline, and the communication volume per step from your pattern. Let $T_{\text{comp}}$ be the compute time one machine needs for a step. Split across $p$ nodes, each does a $1/p$ share, so the compute time per node falls to $T_{\text{comp}}/p$. But communication does not fall; it grows. Writing $T_{\text{comm}}(p)$ for the per-step communication time, the wall-clock per step becomes

$$T_p = \frac{T_{\text{comp}}}{p} + T_{\text{comm}}(p), \qquad S(p) = \frac{T_1}{T_p} = \frac{T_{\text{comp}}}{\dfrac{T_{\text{comp}}}{p} + T_{\text{comm}}(p)}.$$

The compute term shrinks with more nodes while the communication term grows, so $S(p)$ rises, peaks, and falls. For the most common pattern, a ring all-reduce of a model whose one-pass transfer cost is $\beta$ with a per-hop latency $\alpha$, the communication term takes the form used in Chapter 4 and the alpha-beta model of Chapter 3:

$$T_{\text{comm}}(p) = \alpha\,(p-1) + 2\beta\,\frac{p-1}{p}.$$

The node count that maximizes speedup is the $p^\star$ that minimizes $T_p$; setting $\mathrm{d}T_p/\mathrm{d}p = 0$ for the dominant latency term gives the scaling intuition $p^\star \sim \sqrt{T_{\text{comp}}/\alpha}$, more workers are worth adding only until the latency they add cancels the compute they remove. The honest summary statistic is the communication-to-computation ratio

$$R(p) = \frac{T_{\text{comm}}(p)}{T_{\text{comp}}/p},$$

which starts near zero, crosses $1$ at the point where moving data costs as much as the compute it enables, and grows without bound after. Past $R(p) = 1$ the design is communication-dominated and adding nodes buys almost nothing. The demo below computes all three quantities, $S(p)$, $p^\star$, and the $p$ where $R(p)$ first reaches $1$, for one set of design numbers.

import math

# Predict the speedup of a distributed design BEFORE building it.
# T_comp: single-machine compute for one step (normalized to 1.0).
# Across p nodes each does T_comp/p of the compute.
# Ring all-reduce communication time grows with p:
#   T_comm(p) = alpha*(p-1) + 2*beta*(p-1)/p
#   alpha = per-hop latency, beta = cost to move the model once.
T_comp = 1.0
alpha  = 0.0008      # latency per hop (fraction of single-machine step)
beta   = 0.040       # bandwidth cost to move the model once

def t_comm(p):
    return 0.0 if p <= 1 else alpha*(p-1) + 2.0*beta*(p-1)/p

def t_step(p):       return T_comp/p + t_comm(p)
def speedup(p):      return T_comp / t_step(p)
def comm_to_compute(p): return t_comm(p) / (T_comp/p)

print("Predicted scaling of the distributed design (T_comp=1.0, ring all-reduce)")
print(f"{'p':>5} {'T_comp/p':>9} {'T_comm':>8} {'T_step':>8} {'speedup':>8} {'comm/comp':>10}")
for p in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    print(f"{p:>5} {T_comp/p:>9.4f} {t_comm(p):>8.4f} {t_step(p):>8.4f} "
          f"{speedup(p):>8.2f} {comm_to_compute(p):>10.3f}")

# Optimal node count: the p that minimizes per-step wall-clock.
p_star = min(range(1, 4097), key=t_step)
print(f"\nOptimal worker count p* = {p_star}   (max speedup S(p*) = {speedup(p_star):.2f}x)")
print(f"  efficiency at p*: {speedup(p_star)/p_star*100:5.1f}%   "
      f"comm/compute at p*: {comm_to_compute(p_star):.2f}")

# Communication-dominated point: smallest p where comm/compute ratio reaches 1.
p_dom = next(p for p in range(1, 4097) if comm_to_compute(p) >= 1.0)
print(f"Communication dominates from p = {p_dom} onward (comm/compute first reaches 1.0)")
print(f"  speedup already plateaus near S = {speedup(p_dom):.2f}x there")

Code 41.4.1: A predicted-speedup calculator. It evaluates $S(p)$, the optimal node count $p^\star$, and the communication-dominated point $R(p)=1$ from a compute term and an alpha-beta communication term, all before any distributed code exists.

Predicted scaling of the distributed design (T_comp=1.0, ring all-reduce)
    p  T_comp/p   T_comm   T_step  speedup  comm/comp
    1    1.0000   0.0000   1.0000     1.00      0.000
    2    0.5000   0.0408   0.5408     1.85      0.082
    4    0.2500   0.0624   0.3124     3.20      0.250
    8    0.1250   0.0756   0.2006     4.99      0.605
   16    0.0625   0.0870   0.1495     6.69      1.392
   32    0.0312   0.1023   0.1336     7.49      3.274
   64    0.0156   0.1291   0.1448     6.91      8.266
  128    0.0078   0.1810   0.1888     5.30     23.165
  256    0.0039   0.2837   0.2876     3.48     72.624
  512    0.0020   0.4886   0.4906     2.04    250.186
 1024    0.0010   0.8983   0.8993     1.11    919.882

Optimal worker count p* = 34   (max speedup S(p*) = 7.49x)
  efficiency at p*:  22.0%   comm/compute at p*: 3.54
Communication dominates from p = 13 onward (comm/compute first reaches 1.0)
  speedup already plateaus near S = 6.24x there

Output 41.4.1: For these numbers the design tops out near $7.5\times$ at $p^\star=34$ nodes, and communication overtakes compute from $p=13$. Past the optimum, adding nodes makes the step slower: at $1024$ nodes the speedup has collapsed back toward $1\times$. The design's verdict ("use about three dozen nodes, not a thousand") is read off the model, not discovered on a cluster bill.

The lesson of Output 41.4.1 is the one Chapter 3 insisted on: the honest node count is almost always smaller than the one you can afford. A design that names $p^\star$ and the communication-dominated point in advance is a design you can defend in a review. Plug your own baseline's $T_{\text{comp}}$ and your pattern's $\alpha$ and $\beta$ into Code 41.4.1, and the optimal node count for your capstone falls out as a number.

Library Shortcut: Measure the Communication Term Instead of Guessing It

Code 41.4.1 needs real $\alpha$ and $\beta$ values to give a trustworthy $p^\star$. Rather than guess them, measure them with the framework's own benchmark, then feed the numbers back into the model. PyTorch ships a collective benchmark, and one timed all-reduce gives you both terms:

# Run with: torchrun --nproc_per_node=<p> bench.py
import torch, torch.distributed as dist, time
dist.init_process_group("nccl")
x = torch.randn(250_000_000, device="cuda")   # a model-sized tensor
torch.cuda.synchronize(); t0 = time.perf_counter()
for _ in range(20):
    dist.all_reduce(x)                         # the exact collective your design uses
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print("measured per-all-reduce seconds:", (time.perf_counter() - t0) / 20)

Code 41.4.2: The communication term, measured rather than assumed. Running this at two node counts (say 2 and 16) gives you $\alpha$ and $\beta$ by fitting the alpha-beta form, replacing the guessed constants in Code 41.4.1 with numbers from your actual interconnect. The framework handles the transport and topology selection that Chapter 4 details.

6. Fault Tolerance and Load Balancing: Surviving and Staying Even Advanced

At one machine, failure is rare enough to ignore in the design. At a hundred, something is broken at almost every moment, so the design must say what happens when a node dies, and it must keep the survivors evenly loaded. The fault-tolerance slot picks from a small menu. Checkpoint and restart, the baseline of every long job, periodically writes the distributed state to durable storage and reloads it after a crash; its cost is the checkpoint interval, and its weakness is the work lost since the last write. Elastic execution, the subject of Chapter 18, lets the job continue with fewer nodes by re-ranking the survivors and re-sharding the work, so a single failure costs a rebalance rather than a full restart, the right choice on preemptible or spot capacity. Replication and re-execution, the MapReduce instinct, keeps redundant copies or simply recomputes a lost partition from its inputs, and underpins the Byzantine-robust aggregation of Chapter 35 when a node may not merely fail but lie.

Load balancing is the same problem viewed in steady state rather than at a crash. Even with no failures, an uneven partition leaves some nodes idle while others grind, and the synchronous barrier of slot 3 means the whole system runs at the speed of the slowest. The remedies, traced through Chapter 3 and Chapter 18, are an even partition to begin with (hash or range sharding that spreads the work uniformly) and straggler mitigation when evenness is not enough: backup or speculative execution that launches a second copy of a slow task and takes whichever finishes first. Record the checkpoint or elastic policy, the recovery cost, and the straggler strategy, and the six slots are full.

Thesis Thread: The Design Is the Whole Book, Read Once

The six slots of this template are the book in miniature. Partitioning is Part II and the sharding of Parts III and IV; the communication pattern is the collectives of Chapter 4 and the parameter server of Chapter 11; consistency is the synchronous-versus-asynchronous thread from Chapter 2 to Chapter 10; coordination is Section 1.4; fault tolerance runs from MapReduce re-execution to the elastic training of Chapter 18 and the robust aggregation of Chapter 35. To design a distributed system is to choose one option in each of these arcs and to defend the choice with the cost model of Chapter 3. The capstone is where every thread of the book is pulled at once.

7. How the Case Studies Filled the Template Intermediate

The fastest way to calibrate your own answers is to see the six slots filled by systems that work. Table 41.4.1 reads each case study of this part as one pass through Figure 41.4.1. The same template, answered five different ways, is the proof that the slots are the right ones: no case study needed a seventh, and none could skip one of the six.

Table 41.4.1: The five case studies of Part VIII, each read as one filling of the design template. Reading down a column shows the menu of real choices for that slot; reading across a row reconstructs a working design.

Case study	Partition	Communication	Consistency	Coordination	Fault tolerance
Ch 36 Web-scale RAG	data + index shards	scatter-gather	eventual	central driver	re-execute / replicate index
Ch 37 Federated medical	data by hospital	federated aggregation / gossip	bounded (round-based)	decentralized	client dropout tolerated
Ch 38 Recommendation	embedding-table shards	parameter server	bounded staleness	central + sharded servers	server replication + checkpoint
Ch 39 Robot swarm	work by agent	gossip / local broadcast	asynchronous	decentralized	graceful agent loss
Ch 40 Agentic LLM	model shards + task fan-out	all-reduce + scatter-gather	synchronous (train), eventual (serve)	central orchestrator	checkpoint + elastic

Notice the pattern. The partition column is dictated by the binding ceiling: web data is huge so it shards, recommendation parameters are huge so they shard, robots are independent so the work shards by agent. The communication column then follows the partition: sharded indices want scatter-gather, sharded parameters want a parameter server, isolated clients want federated aggregation or gossip. Consistency loosens exactly where the application allows it (eventual for retrieval, asynchronous for robots) and tightens where correctness demands (synchronous for the LLM training pass). This is the template working as intended: fix the partition from the ceiling, derive the pattern from the partition, and set consistency from what the application can tolerate.

Research Frontier: Automating the Template (2024 to 2026)

The six decisions you are making by hand are increasingly made by search. Automatic-parallelism systems such as Alpa and the auto-parallel passes folded into recent releases of PyTorch and JAX treat the partition-and-communication choice as an optimization problem, searching the space of data, tensor, and pipeline splits to minimize a cost model very like the one in Code 41.4.1. The 2024 to 2026 line on geo-distributed and over-the-internet training (the DiLoCo lineage and its descendants) pushes the consistency slot toward heavily bounded-stale, local-update designs that communicate rarely enough to run across continents, and the elastic-training work behind production foundation-model runs makes the fault-tolerance slot adaptive rather than fixed. The frontier is not replacing the template; it is learning to fill it automatically from the same compute and communication numbers you are estimating by hand here. Knowing the slots by hand is what lets you read, and trust, what an auto-parallel planner proposes.

Fun Note: The Whiteboard Test

A working rule among systems engineers: if you cannot draw your design on a whiteboard as boxes (the partitions) and arrows (the communication) in under two minutes, it is not yet a design, it is a hope. Figure 41.4.1 is six slots precisely because six boxes and their arrows are about as much as one whiteboard, and one reviewer's patience, can hold at once.

8. From Design to Tools Beginner

You now hold a complete design: a partition, a communication pattern, a consistency model, a coordination scheme, a fault-tolerance plan, a load-balancing strategy, and a predicted speedup with an optimal node count to defend it. That design is deliberately framework-neutral; it says what must happen, not which library makes it happen. The translation from the six slots to concrete tools, which collective library implements your pattern, which framework gives you elastic recovery for free, which scheduler places your nodes, is the subject of Section 41.5. A good design makes that translation almost mechanical: an all-reduce slot maps to PyTorch DDP or NCCL, a parameter-server slot to a Ray or framework-native server, a checkpoint-and-restart slot to the framework's checkpoint hooks. Design first, in the vocabulary of this section; choose tools second, in the next.

Exercise 41.4.1: Fill the Template for Your Own Capstone Conceptual

Write the six slots of Figure 41.4.1 for your capstone in one sentence each: the unit and number of partitions, the communication pattern and the primitive it uses, the consistency model and why your computation tolerates it, the coordination scheme, the fault-tolerance policy, and the load-balancing strategy. Then justify slot 2 from slot 1: show in one sentence that your communication pattern is the cheapest one your partition allows, exactly as the recommendation student did in the Practical Example. If any slot reads "synchronous all-reduce" purely by default, state the alternative you rejected and why.

Exercise 41.4.2: Find Your Optimal Node Count Coding

Take $T_{\text{comp}}$ from your single-machine baseline in Section 41.3 (the per-step compute time) and plug it into Code 41.4.1 along with a first guess at $\alpha$ and $\beta$. Report the predicted $p^\star$, the speedup at $p^\star$, and the node count where the communication-to-computation ratio $R(p)$ first reaches $1$. Then run Code 41.4.2 (or, if you have no cluster, vary $\alpha$ and $\beta$ by a factor of ten each way) and show how sensitive $p^\star$ is to the communication constants. State the node count you will actually request and defend it against the model, not against your budget.

Exercise 41.4.3: Read a Case Study as a Template Row Analysis

Pick one case study from Table 41.4.1 and open its chapter. Verify each of the five filled slots against the chapter's actual design, and find the sixth slot (load balancing) that the table omitted for space. Then change one slot, for example, force Chapter 38's consistency from bounded-stale to fully synchronous, and argue, using the speedup model of Section 5, how that one change would alter the system's predicted scaling and which other slot it would force you to revisit.