Section 41.1: Choosing a Distributed AI Problem

"They gave me an empty repository and a single term. I could have filled it with a clever model on one laptop and called it done. Instead I went looking for the place where one machine runs out of room, because that is the only place a distributed system has any reason to be born."
A Blank Repository, Waiting to Become a System

Big Picture

A capstone in this book is not "build an AI model"; it is "find a place where one machine genuinely runs out, and build the smallest honest distributed system that gets past it." Every chapter so far handed you a problem already known to need scale-out. The capstone reverses the burden of proof: you must choose a problem and then defend, with arithmetic rather than enthusiasm, that distributing it actually pays. This first section is about that choice. It names the four ways a single machine runs out (data too big, model too big, throughput or latency too high, too many agents to coordinate), gives you a quantitative test that rejects distributing anything that fits comfortably on one node, shows where to source problems and reproducible baselines, and tells you how to scope the whole thing to one term with a single-machine baseline you can actually beat. The rest of the chapter, from axis to report, is execution; this section is the decision everything else depends on.

The five chapters before this one, Chapter 36 through Chapter 40, each took a problem whose scale was already a settled fact: a ten-billion-document corpus, a fleet of hospitals that cannot share data, a recommendation system answering millions of requests, a swarm of robots, an agentic application fanning work across machines. You watched experts decompose those problems onto the six axes of distribution from Chapter 1. The capstone asks you to do the same thing, but from the other end: nobody hands you a problem that obviously needs a cluster. You go and find one, and the first and most consequential mistake you can make is to pick a problem that did not need distributing at all. A beautifully engineered all-reduce that synchronizes a model fitting on one GPU is not a distributed-systems project; it is a slower way to train a small model. This section exists to keep you out of that trap.

1. Four Ways One Machine Runs Out Beginner

A problem benefits from scale-out only when some specific resource on the best single machine you can reasonably buy is exhausted. Section 1.1 named three such ceilings for training and inference; for a capstone that may also involve many cooperating agents, a fourth joins them. The first is data: the dataset does not fit in the memory, or even on the disk, of one node, so storage and processing must be partitioned, the world of Chapter 6 and Chapter 7. The second is the model: the parameters plus optimizer state plus activations exceed one accelerator's memory, forcing the sharded and pipeline parallelism of Chapter 16. The third is throughput or latency: the request volume or the deadline exceeds what one server delivers, the replicated-serving problem of Chapter 24. The fourth is coordination: the task is inherently many interacting decision-makers, the multi-agent and swarm setting of Chapter 29 and Chapter 31, where distribution is not an optimization but the structure of the problem itself.

These four are independent, and a good capstone usually binds on exactly one of them clearly. Binding on one is enough: it gives you a sharp thesis, a single axis to defend, and a baseline to beat. Trying to bind on all four at once is how a one-term project becomes a system nobody finishes. The decision tree in Figure 41.1.1 walks the four ceilings in the order that is cheapest to check, and routes each to the axis and the chapter that owns its remedy. Read it as the first thing you do to any candidate problem.

Figure 41.1.1: The problem-selection decision tree. A candidate capstone is tested against the four ceilings in increasing order of how expensive they are to check; the first ceiling it clears routes the project to a distribution axis (developed in Section 41.2) and the earlier chapter that owns the remedy. If the problem clears none of the four, the honest verdict is that it does not need a cluster, and the right move is to choose a different problem rather than distribute for its own sake. A problem may clear more than one test; pick the one ceiling you can measure and defend most cleanly.

The figure also doubles as an idea menu. Reading the right-hand column top to bottom, the four axes (data, model, inference, intelligence) are four genres of capstone, each anchored to a worked case study you have already seen: data-bound projects resemble Chapter 36, agent-coordination projects resemble Chapter 39, and so on. When you are short on ideas, start from the axis you most want to learn and work backward to a problem that loads it.

Key Insight: The Capstone Reverses the Burden of Proof

In every prior chapter the problem arrived pre-certified as needing scale-out. In the capstone, distribution is guilty until proven innocent. Before you design anything, you must point at one resource on the best single machine you could plausibly use and show, in numbers, that it is exhausted. If you cannot name that resource, you do not yet have a distributed-systems project; you have a model you happen to want to run on more than one machine, which is a different and usually worse idea. The whole of this section is machinery for finding and certifying that one exhausted resource.

2. The Justification Test: Does Distributing Actually Pay? Intermediate

Clearing one of the four ceilings in Figure 41.1.1 says distribution is possible. It does not yet say distribution pays. Adding machines buys parallel compute but charges communication, and the charge can exceed the purchase. The discipline for settling this is the scalability machinery of Chapter 3, and the cleanest summary is Amdahl's law with a communication term. Let $s$ be the serial fraction of the work that cannot be parallelized, let the remaining $1 - s$ split across $p$ workers, and let each added worker cost a fixed communication overhead $c$ measured as a fraction of the original single-machine runtime. Normalizing the single-machine time to one, the parallel time and the resulting speedup are

$$T(p) = s + \frac{1 - s}{p} + c\,(p - 1), \qquad S(p) = \frac{1}{T(p)}.$$

The justification test is then a single inequality: distribute only if $\max_{p > 1} S(p) > 1$ by a margin that matters. If the best achievable speedup over all worker counts is barely above one, the communication tax has eaten the parallel gain, and the project is a single machine wearing a cluster as a costume. Two limiting cases bound the picture. Setting $c = 0$ recovers the classical Amdahl ceiling: no matter how many workers you add, speedup cannot exceed

$$S_{\infty} = \lim_{p \to \infty} \frac{1}{s + (1-s)/p} = \frac{1}{s},$$

so a problem that is forty percent serial can never run more than $2.5$ times faster, and that ceiling is the same whether you own two machines or two thousand. The other case, $c > 0$, is worse and more realistic: speedup rises, peaks at a finite optimal worker count $p^\star$, then falls, because past $p^\star$ each new worker adds more communication than computation it removes. A capstone whose $p^\star$ is two or three is a capstone that did not need a cluster. The code below computes both the ceiling and the break-even $p^\star$ for a range of problems, so you can locate any candidate on the map before committing a term to it.

import math

# Amdahl with a per-worker communication tax. s is the serial fraction;
# (1 - s) parallelizes over p workers; each worker adds a fixed overhead c
# (as a fraction of the single-machine runtime). Single-machine time = 1.
def speedup(s, p, c):
    return 1.0 / (s + (1.0 - s) / p + c * (p - 1))

def best_p(s, c, pmax=4096):                      # the worker count that maximizes speedup
    p = max(range(1, pmax + 1), key=lambda q: speedup(s, q, c))
    return p, speedup(s, p, c)

print("Justification test: only distribute if best achievable S(p) > 1")
print(f"{'serial s':>9} {'comm c':>8} {'p*':>6} {'S(p*)':>8} {'verdict':>22}")
for s, c in [(0.02, 0.001), (0.10, 0.002), (0.40, 0.002), (0.90, 0.02), (0.97, 0.05)]:
    p, sp = best_p(s, c)
    verdict = "distribute" if (sp > 1.05 and p > 1) else "stay single-machine"
    print(f"{s:>9.2f} {c:>8.4f} {p:>6d} {sp:>8.2f} {verdict:>22}")

print()
for s in [0.02, 0.10, 0.40]:                      # Amdahl ceiling 1/s: the hard cap
    print(f"Amdahl ceiling at s={s:.2f}: max speedup -> {1.0 / s:6.1f}x")

print()
s, c = 0.05, 0.002                                # a healthy capstone-sized problem
p, sp = best_p(s, c)
print(f"With s={s}, comm c={c}/worker: optimal p* = {p}, S(p*) = {sp:.2f}x")
print(f"  (adding workers past p*={p} makes the job SLOWER, not faster)")

Code 41.1.1: The justification test as runnable arithmetic. For each $(s, c)$ pair it finds the speedup-maximizing worker count $p^\star$, reports the best achievable speedup, and renders a verdict; it then prints the pure Amdahl ceiling $1/s$ and the break-even $p^\star$ for one healthy capstone-sized problem.

Justification test: only distribute if best achievable S(p) > 1
 serial s   comm c     p*    S(p*)                verdict
     0.02   0.0010     31    12.25             distribute
     0.10   0.0020     21     5.47             distribute
     0.40   0.0020     17     2.14             distribute
     0.90   0.0200      2     1.03    stay single-machine
     0.97   0.0500      1     1.00    stay single-machine

Amdahl ceiling at s=0.02: max speedup ->   50.0x
Amdahl ceiling at s=0.10: max speedup ->   10.0x
Amdahl ceiling at s=0.40: max speedup ->    2.5x

With s=0.05, comm c=0.002/worker: optimal p* = 22, S(p*) = 7.40x
  (adding workers past p*=22 makes the job SLOWER, not faster)

Output 41.1.1: Real output. The first three rows are healthy capstones: a small serial fraction lets many workers earn real speedup. The last two are warnings: a ninety-seven-percent-serial task has $p^\star = 1$, meaning the optimal "cluster" is one machine, and distributing it would only add cost. The bottom block shows a well-posed project peaking at twenty-two workers and then degrading, the exact shape your capstone's scaling curve should have.

The output is the section's whole argument in one table. A problem with a small serial fraction and a modest communication tax (the top rows) earns a real, defensible speedup, and that is the curve Section 41.6 will ask you to measure and Section 41.7 will ask you to explain. A problem that is mostly serial (the bottom rows) has an optimal worker count of one or two: the honest verdict is to keep it on a single machine, exactly the red terminal box of Figure 41.1.1. Note the subtle trap the last row exposes: even a problem that clears a ceiling can fail the justification test if its serial fraction is high, which is why clearing a ceiling and passing the test are two separate gates a capstone must pass.

Thesis Thread: The Capstone Is Where You Defend the Thesis Yourself

The spine of this book is that AI at scale is the engineering of systems whose work is distributed across machines, and that scale-out is chosen because a ceiling forces it, never for elegance. Every chapter advanced that thesis on a problem the authors selected. The capstone is the one place where you select the problem and the defense is yours to make. Output 41.1.1 is the form that defense takes: a serial fraction you measured, a communication cost you measured, an optimal worker count those two numbers imply, and a speedup curve that rises and then, as the arithmetic demands, falls. When you reach Section 41.9 and write the report, this is the argument the report is built around: not "I used a cluster," but "here is the ceiling, here is why distributing past it pays, and here is the measured curve that proves it."

3. Sourcing Problems and Reproducible Baselines Intermediate

A good capstone problem is one where a real ceiling meets an available dataset and a baseline you can reproduce. The three must coincide, because a ceiling with no data is a thought experiment, and a result with no baseline is unmeasurable. Datasets that naturally exceed one machine are the richest source: a multi-terabyte web crawl such as Common Crawl, a large image or video corpus, a graph with billions of edges, a continuous event stream. The benchmark and dataset appendix of this book (Appendix D) and the standard public repositories give you corpora whose size alone clears the data ceiling without any contrivance. When the ceiling is the model rather than the data, the source is a published open-weights model too large for one accelerator, which lets you build a model-parallel capstone in the lineage of Chapter 16. When the ceiling is throughput, the source is a serving workload with a stated request rate and latency target, the setting of Chapter 24.

The reproducible baseline matters as much as the dataset, and beginners underweight it. A distributed result means nothing in isolation; it means something only against a single-machine baseline that you ran yourself, on the same data and the same metric. The baseline is what your speedup is a speedup over, and Section 41.3 is devoted to building it before you write a line of distributed code. Source baselines from papers with released code, from framework example repositories, and from the case studies of Chapter 36 through Chapter 40, each of which is a worked template you can scale down to a single-term version. Pick a problem where the single-machine baseline runs, even if slowly, because a baseline that does not run on one machine gives you nothing to compare against and nothing to prove your distribution actually improved.

Library Shortcut: Let the Dataset Loader Tell You the Scale

Before committing to a problem, you can size it in a few lines without downloading the whole thing, because the major dataset hubs expose metadata and streaming. The Hugging Face datasets library reports the number of examples and on-disk bytes from the metadata alone, and streams examples without materializing the corpus, so you can confirm a dataset really does exceed one node before you build anything:

# pip install datasets
from datasets import load_dataset_builder

info = load_dataset_builder("HuggingFaceFW/fineweb", "default").info
gb = (info.dataset_size or 0) / 1e9
print(f"examples: {info.splits['train'].num_examples:,}")   # billions of rows
print(f"on-disk size: {gb:,.0f} GB")                        # far past one node
# stream without downloading: load_dataset(..., streaming=True) yields rows lazily

Code 41.1.2: Sizing a candidate dataset from metadata before downloading it. One call to load_dataset_builder returns the row count and byte size, turning "is this dataset too big for one machine?" from a guess into a number, which is the data-ceiling test of Figure 41.1.1 done in three lines instead of a provisioning experiment.

Practical Example: The Capstone That Almost Distributed a Toy

Who: A graduate student choosing a one-term capstone for this course.

Situation: The first idea was an ambitious data-parallel training pipeline for an image classifier, wrapped in DistributedDataParallel across eight rented GPUs, because data parallelism felt like the canonical distributed-AI project.

Problem: The chosen model and dataset fit comfortably on one GPU and trained in twenty minutes there. The serial fraction of the run, dominated by per-epoch validation and checkpointing, was high, and a quick pass of Code 41.1.1 with the measured numbers put the optimal worker count at two and the best speedup near $1.1$ times.

Dilemma: Push ahead with the data-parallel plan because it was already half-built and looked impressive, or abandon it because the justification test said the cluster would barely beat one machine and might lose to it once real network latency entered.

Decision: They abandoned it and reselected, walking Figure 41.1.1 from the top. The data did not exceed one node, the model did not exceed one accelerator, but the target serving workload they actually cared about (real-time inference on a video stream at a stated frame rate) blew past one server's throughput budget.

How: They repointed the capstone at the throughput ceiling, built a single-server baseline that measured the maximum sustainable frame rate, and designed a replicated-inference fleet against it, the Chapter 24 pattern scaled to one term.

Result: The rebuilt project had a sharp thesis (one server tops out at a measured frame rate; the fleet sustains many times that), a clean baseline to beat, and a speedup curve that rose and plateaued exactly as a healthy distributed system should.

Lesson: Run the justification test on the idea you already love before you fall further in love with it. The cheapest term you will ever save is the one you do not spend distributing a problem that fit on one machine.

4. Scoping for a Single Term Beginner

A capstone that passes the justification test can still fail by being too large to finish. Scoping is the act of carving a problem down to something achievable in one term, measurable against a baseline, and still genuinely distributed. The discipline is to fix the smallest version of the problem that still binds on its chosen ceiling. If the ceiling is data, you do not need the full Common Crawl; you need a slice large enough that it does not fit on one node, which may be a single-digit number of terabytes rather than petabytes. If the ceiling is the model, you need a model one accelerator cannot hold, not the largest model in existence. The case studies are calibrated examples here: Chapter 36 specified a ten-billion-document system to teach the full shape, but its opening section showed how a two-hundred-million-document version exercises every axis at one-fiftieth the cost. Scale the template down until it fits a term, never up until it impresses.

Three properties make a scoped capstone safe. It is achievable: the single-machine baseline runs end to end on hardware you have, even if slowly, so the distributed version always has something to beat. It is measurable: there is one primary metric (a wall-clock, a throughput, a tail latency, a cost) on which the distributed system is supposed to win, plus the quality metric it must not lose, the pairing Section 41.6 formalizes. And it is singly-binding: it stresses one ceiling clearly rather than gesturing at four, so the design defends one axis well instead of four poorly. With those three settled, the path through the rest of this chapter is fixed, and Figure 41.1.1 has done its job: it told you which axis you are on, and everything downstream is execution against the baseline you scoped here.

That path is the spine of Chapter 41. Section 41.2 turns the axis Figure 41.1.1 selected into a concrete distribution strategy. Section 41.3 builds the single-machine baseline you will beat. Section 41.4 designs the distributed system around that axis. Section 41.5 picks the tools and frameworks. Section 41.6 defines the metrics, pairing a scale-out metric with a quality metric. Section 41.7 analyzes the results, including the scaling curve of Output 41.1.1. Section 41.8 makes the whole thing reproducible. Section 41.9 writes the report, and Section 41.10 prepares the presentation. Every later section assumes the problem you choose here is real, justified, and scoped; this section is the foundation the other nine stand on.

Research Frontier: Where Good Capstone Problems Are Being Born (2024 to 2026)

The most fertile capstone territory right now sits where the four ceilings are actively shifting. On the data axis, openly released web-scale corpora such as FineWeb and DCLM have made petabyte-class deduplication and quality-filtering pipelines reproducible by students, not just by labs, so a data-ceiling capstone over a real crawl is now within reach of one term. On the coordination axis, multi-agent LLM frameworks and the agentic systems of Chapter 40 have turned "many cooperating models under a coordinator" into a problem with off-the-shelf scaffolding, opening the distribute-intelligence axis to projects that were research-only two years ago. On the training axis, communication-avoiding and geo-distributed schemes in the DiLoCo lineage (Douillard et al., 2024) let a capstone study scale-out over slow, unreliable links rather than a pristine datacenter fabric, which is both cheaper to provision and closer to the real edge settings of Chapter 34. The constant across all three is the test of this section: a frontier problem is a good capstone only when you can still point at one exhausted resource and measure the speedup that distributing it buys.

Fun Note: The Most Impressive Capstone Sometimes Has the Fewest Machines

Output 41.1.1 says a problem peaking at twenty-two workers is healthier than one peaking at two. The instinct is to chase the largest cluster you can rent, but the grade and the engineering both reward the project that found the right number of machines and proved it, then stopped. A capstone that reports "we measured $p^\star = 22$ and adding a twenty-third worker slowed us down, here is the curve" understands distribution better than one that throws a thousand nodes at a problem and never asks whether the thousandth one helped. The blank repository becomes a real system at the worker count where the math says stop, not the worker count your budget allows.

Exercise 41.1.1: Certify or Reject Three Candidates Conceptual

For each candidate capstone, walk Figure 41.1.1 and state the first ceiling it clears (or that it clears none), the axis it lands on, and one concrete single-machine failure that justifies distributing it: (a) training a sentiment classifier on a dataset of two hundred thousand reviews that fits in eight gigabytes of RAM; (b) serving a code-completion model to a class of three hundred students who each expect responses under half a second during a live exam; (c) deduplicating and quality-filtering a four-terabyte slice of a web crawl before any model training. For the candidate that clears no ceiling, explain what you would change about its specification to make it a legitimate distributed-AI capstone, and what you would refuse to change because it would be distributing for its own sake.

Exercise 41.1.2: Plot Your Own Scaling Verdict Coding

Extend Code 41.1.1 to take a measured serial fraction $s$ and per-worker communication cost $c$ from your own candidate problem (estimate $s$ as the fraction of single-machine wall-clock spent in unparallelizable steps such as final reduction, checkpointing, or sequential I/O). Compute and print the speedup $S(p)$ for $p$ from $1$ to $64$, mark the optimal $p^\star$, and report the speedup at $p^\star$ and the Amdahl ceiling $1/s$. Then add a second communication cost $c' = 4c$ (a slow, geo-distributed link instead of a fast datacenter fabric) and report how $p^\star$ and the best speedup change. State, from your two curves, whether your candidate passes the justification test on each kind of link.

Exercise 41.1.3: The Cost of a Costume Cluster Analysis

Suppose a candidate capstone has serial fraction $s = 0.6$ and you run it on $p = 16$ workers, ignoring communication cost. Using $S(p) = 1 / (s + (1-s)/p)$, compute the speedup and the parallel efficiency $E(p) = S(p)/p$. Interpret the efficiency number: how much of your sixteen rented machines is actually doing useful work, and where did the rest go? Then argue, without running anything, why this problem should probably be rejected as a capstone even though it technically "runs on a cluster," and connect your reasoning to the Amdahl ceiling of Chapter 3 and the efficiency metric you will be asked to report in Section 41.6.