Section 19.1: Foundation Models as Distributed Systems

"They told me I was one model. I have since met my data shards, my pipeline stages, my optimizer partitions, and the spare node that took over when I crashed at 3 a.m. I contain multitudes, and most of them are on a different rack."
A Shard That Believes It Is the Whole Model

Big Picture

A foundation-model training run is the integration test for this entire book. A single run simultaneously distributes the data (Part II), the training computation (Part III), and the model itself (Chapters 15 to 18), all coordinated on one cluster (Part VII) and all running long enough that hardware failure is a certainty, not a risk. Nothing in this chapter is a new primitive; it is the assembly of primitives you have already met into one system that must work end to end for weeks. That is exactly why this chapter sits where it does: it could not be written until the chapters that own each axis were in place. This section walks the full pipeline, from a raw web crawl to an aligned model, and shows that every stage is rate-limited by the same thing: distribution. Then it sizes a real run in one short calculation, so the scale of the undertaking is a number you can hold, not a slogan.

Up to this point, Part IV has developed each axis of large-model training in isolation. Chapter 15 made the gradient exact across data-parallel replicas. Chapter 16 split a model too big for one device across pipeline stages and sharded its parameters and optimizer state. Chapter 17 routed tokens to experts living on different machines. Chapter 18 kept a long run alive when nodes vanished. Each was a clean idea studied on its own. A foundation model is what happens when you must run all of them at once, on the same cluster, against the same fifteen trillion tokens, for the better part of two weeks. The composition is not a footnote. It is the engineering problem, because the axes interact: the parallelism layout you choose changes the communication pattern, which changes the failure blast radius, which changes how often you checkpoint, which changes throughput, which changes how many tokens you can afford. This section frames the chapter as that assembly, and the rest of the chapter fills in each stage.

1. The Pipeline End to End Beginner

A foundation model is not trained in one step; it is produced by a pipeline whose stages each live in a different part of this book. You begin by building a corpus: crawling or licensing web-scale text, then cleaning and filtering it, a distributed-data problem in the lineage of Chapter 6, developed for this setting in Sections 19.3 and 19.4. You tokenize that corpus into the integer sequences the model actually consumes, itself a distributed pass over terabytes (Section 19.5). You pretrain the model with composed parallelism, the heart of the run, drawing on Chapters 15 through 18 and orchestrated in Section 19.6. Finally you fine-tune the pretrained base on narrower data and align it to human preferences (Sections 19.7 and 19.8), turning a raw next-token predictor into something usable. Figure 19.1.1 lays out these stages and, crucially, labels each one with the axis of distribution that bottlenecks it.

Figure 19.1.1: The end-to-end pipeline. Five stages run left to right; the orange center stage (distributed pretrain) is the dominant cost and the one that composes Chapters 15 through 18. Each stage carries its distribution axis in italics and the resource that bottlenecks it underneath. The recurring pattern, visible across the whole row, is that no stage is limited by raw computation on a single machine; each is limited by the cost of moving data between machines.

Read Figure 19.1.1 left to right and one fact dominates: the bottleneck label under every box names a communication or data-movement cost, never a single-machine arithmetic cost. Corpus construction is bounded by how fast a cluster can stream and shuffle bytes. Tokenization is bounded by I/O over a distributed filesystem. Pretraining is bounded by the all-reduce and all-to-all collectives that synchronize parameters and route tokens, plus the checkpoint writes that protect against failure. Even alignment, the smallest stage, is bounded by the coupled training-and-inference loop that generates and scores samples across machines. This is the chapter's thesis in one picture: at foundation-model scale, distribution is not an implementation detail layered on top of training; it is the thing that sets the clock.

2. Why This Chapter Comes Last in Part IV Beginner

The dependency structure of this chapter is unusual and worth stating plainly. Most chapters in this book teach one technique you could apply on its own. This one teaches no new technique; it teaches integration. Composed (or "3D") parallelism stacks three independent ways of splitting the work: data parallelism replicates the model and splits the batch (Chapter 15), tensor and pipeline parallelism split a single model layer and the layer stack across devices (Section 16.9 assembles exactly this 3D layout), and when the model is a mixture of experts, expert parallelism adds a fourth axis that scatters experts across the cluster (Chapter 17). On top of all of it sits the elastic, checkpoint-driven recovery of Chapter 18, because a run this long will lose nodes mid-flight. You cannot reason about the composition until you understand each component, which is why the components come first and the assembly comes here.

Key Insight: A Foundation Model Is an Integration, Not an Invention

The hard part of training a frontier model is rarely a single clever algorithm; it is making data loading, three or four axes of parallelism, fault tolerance, and checkpointing all hold together at the same time, at a scale where each one interacts with the others. The parallelism layout sets the communication pattern; the communication pattern sets the failure blast radius; the failure rate sets the checkpoint cadence; the checkpoint cadence taxes throughput. Tune one in isolation and you usually pay for it somewhere else. The skill this chapter builds is reading the whole system at once and finding the layout where all the constraints are satisfied together.

3. The Recurring Constraint: Every Stage Is Bottlenecked by Distribution Intermediate

It is tempting to picture a training run as a giant pile of matrix multiplications, with the network as a minor overhead. At small scale that picture is roughly right. At foundation-model scale it inverts. Consider the three stages in turn. Building the corpus moves petabytes through deduplication and quality filters; the arithmetic per byte is trivial, so the run is gated by aggregate disk and network bandwidth across the cluster, a throughput problem inherited directly from Part II. Pretraining synchronizes a gradient of tens of billions of numbers after every step, and routes tokens to experts on remote machines within every layer; those collectives must finish before the next step starts, so the per-step time has a communication floor that no faster accelerator can lower. Checkpointing writes hundreds of gigabytes of parameters and optimizer state to durable storage often enough to survive the next crash, stealing bandwidth and stalling the step. The unifying observation is that at this scale you are not paying mainly for FLOPs; you are paying for the movement of data between machines and for the insurance against losing machines.

This is the same tension the book opened with in Section 1.1: scale-out buys you capacity but charges you in communication and in failure. A foundation-model run is simply the most extreme point on that curve in the entire book. Everything you learned about minimizing the communication tax (the cost models of Chapter 3, the collective algorithms of Chapter 4, the overlap tricks of Chapters 15 and 16) is deployed here at once, against the largest bill the book ever presents.

Fun Note: The Run Is Mostly Waiting

A useful sanity check on any large training job is the model-FLOPs utilization, the fraction of the cluster's theoretical peak arithmetic the run actually achieves. Frontier runs frequently report figures in the 35 to 50 percent range. Read that the other way around: for more than half of the wall-clock, an enormous fleet of the world's fastest accelerators is sitting idle, waiting on a collective, a data load, or a checkpoint. The single biggest lever in foundation-model engineering is buying back those idle fractions, which is to say, attacking distribution overhead. The compute was never the scarce thing.

4. Sizing the Undertaking: Compute, GPUs, Wall-Clock, and Failures Intermediate

To make the scale concrete, we estimate the four numbers that decide whether a run is feasible: the total compute, the cluster size, the wall-clock, and the expected number of hardware failures over that wall-clock. The compute estimate rests on a single rule of thumb that holds remarkably well for dense transformer training. Training a model with $N$ parameters on $D$ tokens costs approximately

$$C \approx 6 N D \text{ FLOPs},$$

where the factor of six counts roughly two FLOPs per parameter per token for the forward pass and four for the backward pass. This 6ND rule is the back-of-the-envelope that governs every scaling decision in the next section; we use it here only to size the run, and Section 19.2 turns it into the compute-optimal scaling laws that decide how to spend a fixed compute budget between $N$ and $D$. Given $C$, the wall-clock follows from dividing by the cluster's effective throughput, which is the per-GPU peak times the realized utilization times the number of GPUs. The expected failure count follows from Chapter 18: aggregate GPU-hours divided by the mean time between failures of one GPU. The script below does this arithmetic for a Llama-3-class run.

import math

# A foundation-model training run, sized as one estimate.
N_params   = 70e9        # 70-billion-parameter model
D_tokens   = 15e12       # 15-trillion-token budget (Llama-3-class)
gpu_pflops = 9.89e14     # H100 BF16 dense, FLOP/s (989 TFLOP/s)
mfu        = 0.40        # realized model-FLOPs utilization on a real cluster
n_gpus     = 16384       # cluster size
mtbf_gpu_h = 50000       # mean time between failures per GPU, hours

# The 6ND rule: forward + backward dense training costs ~6 FLOPs per
# parameter per token.
flops = 6.0 * N_params * D_tokens

# Wall-clock = total FLOPs / (effective per-GPU throughput * number of GPUs).
eff_per_gpu = gpu_pflops * mfu
wall_seconds = flops / (eff_per_gpu * n_gpus)
wall_days = wall_seconds / 86400.0
wall_hours = wall_seconds / 3600.0

# Expected hardware failures: GPU-hours / mean-time-between-failures.
gpu_hours = n_gpus * wall_hours
expected_failures = gpu_hours / mtbf_gpu_h
mean_uptime_between_failures_min = (wall_hours / expected_failures) * 60.0

print(f"model parameters N        : {N_params:.2e}")
print(f"token budget D            : {D_tokens:.2e}")
print(f"total training FLOPs (6ND): {flops:.3e}")
print(f"cluster size              : {n_gpus} GPUs")
print(f"effective FLOP/s per GPU  : {eff_per_gpu:.3e}  (MFU {mfu:.0%})")
print(f"wall-clock                : {wall_days:.1f} days  ({wall_hours:.0f} GPU-job hours)")
print(f"aggregate GPU-hours       : {gpu_hours:.3e}")
print(f"expected failures in run  : {expected_failures:.0f}")
print(f"mean time between failures: {mean_uptime_between_failures_min:.0f} min of wall-clock")

Code 19.1.1: A pure-Python sizing estimate for a 70-billion-parameter, 15-trillion-token run. It applies the 6ND rule for compute, divides by effective cluster throughput for wall-clock, and divides aggregate GPU-hours by per-GPU mean-time-between-failures to predict the failure count that Chapter 18 must absorb.

model parameters N        : 7.00e+10
token budget D            : 1.50e+13
total training FLOPs (6ND): 6.300e+24
cluster size              : 16384 GPUs
effective FLOP/s per GPU  : 3.956e+14  (MFU 40%)
wall-clock                : 11.2 days  (270 GPU-job hours)
aggregate GPU-hours       : 4.424e+06
expected failures in run  : 88
mean time between failures: 183 min of wall-clock

Output 19.1.1: The four numbers that define the run. Six point three septillion FLOPs, sixteen thousand GPUs, eleven days of wall-clock, and roughly ninety expected hardware failures along the way, one every three hours. The failure number is why fault tolerance is not optional at this scale.

Three things in Output 19.1.1 deserve emphasis. First, the compute is genuinely astronomical: $6.3 \times 10^{24}$ FLOPs is more arithmetic than a single fast GPU could finish in two centuries, which is the whole reason sixteen thousand of them run in parallel. Second, even with a cluster that large, the wall-clock is over a week, so the run is a sustained operation, not a job you babysit and restart. Third, and most consequential, the run expects about ninety hardware failures, one roughly every three hours. A naive system that restarted from scratch on every failure would never finish. That single number, derived here in a few lines of arithmetic, is the entire justification for the elastic, checkpoint-driven machinery of Chapter 18: at this scale, recovering from failure faster than failures arrive is a precondition for the run terminating at all.

Thesis Thread: Every Axis of the Book, Composed Into One Run

This section is where the book's spine becomes a single object. The data axis (Part II) builds and tokenizes the corpus. The training axis (Part III, Chapter 15) keeps the gradient exact across replicas. The model axis (Chapters 16 and 17) splits a model too large for any one device across pipeline stages and experts. The coordination axis (Chapter 18 and Part VII) holds the run together through ninety failures. A foundation model is the place where "distribute the essential work across machines," the thesis of Section 1.1, stops being one axis at a time and becomes all of them, simultaneously, on one cluster. Read the rest of this chapter as the assembly manual.

Library Shortcut: Composed Parallelism Is Configured, Not Coded

You do not hand-write 3D parallelism. Production stacks expose the layout as a configuration, and the framework wires up the process groups, the collective schedules, and the checkpoint logic. In Megatron-Core, for example, the entire composed layout is a handful of arguments:

# Megatron-Core / Megatron-LM launch arguments (one line per parallel axis)
# torchrun ... pretrain_gpt.py \
TP=8                 # tensor-parallel size (split each layer across 8 GPUs)
PP=16                # pipeline-parallel size (split the layer stack into 16 stages)
DP=128               # data-parallel size (128 replicas of the above)
# total GPUs = TP * PP * DP = 8 * 16 * 128 = 16384, the cluster in Output 19.1.1
# the framework derives every NCCL process group, the pipeline schedule,
# the gradient all-reduce, and the distributed (sharded) checkpoint from these.

Code 19.1.2: The composed layout of Output 19.1.1 as three numbers whose product is the cluster size. Thousands of lines of process-group setup, collective scheduling, and sharded checkpointing, the subject of Chapters 15 through 18, collapse into a parallelism configuration the framework expands for you.

Practical Example: Reading the Run Before Buying the Cluster

Who: An ML infrastructure lead at a startup planning its first in-house foundation-model pretraining run.

Situation: Leadership wanted a 70-billion-parameter model trained on a fresh fifteen-trillion-token corpus, and asked for a cluster budget and a delivery date.

Problem: A vendor quote for a fixed GPU count came with no honest estimate of wall-clock or of how often the run would break, so the date and the budget were guesses.

Dilemma: Commit to a large cluster and a hard date on intuition, risking a run that either misses the date or burns budget idling, or first model the run on paper and let the numbers set the plan.

Decision: They ran the Code 19.1.1 estimate before signing anything, sweeping cluster size and realized utilization to see where wall-clock and failure rate landed.

How: The 6ND term fixed the total compute at $6.3 \times 10^{24}$ FLOPs; dividing by effective throughput gave eleven days on sixteen thousand GPUs at forty percent utilization; the GPU-hours over mean-time-between-failures predicted roughly ninety failures, one every three hours.

Result: The failure estimate, not the FLOP count, drove the plan: the team budgeted for elastic restart and frequent sharded checkpoints up front (Chapter 18), and set a delivery date with realistic slack instead of an optimistic one.

Lesson: A few lines of arithmetic size the entire undertaking, and the number that most shapes the system design is usually the failure count, not the compute.

5. From Sizing to Building Intermediate

The estimate in Output 19.1.1 tells you the run is feasible and tells you what it will cost; it does not tell you how to spend the budget wisely. That is the work of the rest of the chapter. The very first question is allocation: given a fixed compute budget $C \approx 6ND$, how large should the model $N$ be and how many tokens $D$ should it see? Make the model too big for the token budget and it is undertrained; make it too small and you waste compute it could have used. The answer is the compute-optimal scaling law, the subject of the next section, and it is the reason the run in Output 19.1.1 paired a 70-billion-parameter model with a fifteen-trillion-token budget rather than some other split.

From there the chapter walks the pipeline of Figure 19.1.1 in order: constructing the corpus (19.3), deduplicating and quality-filtering it (19.4), tokenizing it (19.5), orchestrating the pretraining run with the composed parallelism sized above (19.6), then fine-tuning (19.7) and aligning (19.8) the result, before closing with the energy, cost, and responsibility of operating at this scale (19.9). Each section is one stage of the integration test. We begin where every responsible run begins, with the question of how to spend the compute, in Section 19.2.

Research Frontier: Open Frontier Runs as System Reports (2024 to 2026)

The clearest window into foundation-model engineering is the technical reports that accompany open releases, because they document the system, not just the benchmark scores. The Llama 3 herd of models report (Dubey et al., 2024) describes a run on the order of sixteen thousand H100 GPUs and is candid about the failure rate, noting that hardware faults interrupted training many times and that automated recovery was essential, precisely the regime Output 19.1.1 predicts. DeepSeek-V3 (DeepSeek-AI, 2024) trained a 671-billion-parameter mixture-of-experts model and reports striking compute efficiency through co-designed expert parallelism, low-precision (FP8) training, and a custom communication scheme, a direct attack on the distribution tax of Section 3. The Qwen2.5 technical report (Qwen Team, 2024) documents pretraining on roughly eighteen trillion tokens and the data-pipeline engineering that makes a corpus that size usable. Across all three, the recurring lesson matches this section's thesis: at the frontier, the differentiating work is the distributed-systems engineering, not a secret architecture. We unpack the data-side of these reports in Sections 19.3 to 19.5 and the parallelism side in 19.6.

Exercise 19.1.1: Trace the Bottleneck Conceptual

For each of the five stages in Figure 19.1.1 (build corpus, tokenize, pretrain, fine-tune, align), name the specific distributed operation or resource that bottlenecks it, and the part or chapter of this book that develops the remedy. Then argue, in two or three sentences, why buying faster individual accelerators would do little for at least three of the five stages. Your answer should make explicit the claim of Section 3 that these stages are limited by data movement between machines, not by arithmetic on one machine.

Exercise 19.1.2: Resize the Run Coding

Modify Code 19.1.1 to model a 405-billion-parameter model trained on the same fifteen-trillion-token budget on the same sixteen-thousand-GPU cluster. Report the new total FLOPs, wall-clock in days, and expected failure count. Then, holding the model and token budget fixed, find by sweeping n_gpus the cluster size at which the expected number of failures during the run first exceeds two hundred. Explain how that failure count would change your checkpointing strategy, citing the trade-off from Chapter 18 between checkpoint frequency and throughput.

Exercise 19.1.3: The Communication Floor Analysis

The 6ND rule counts only arithmetic and ignores communication. Suppose the data-parallel replicas in Output 19.1.1 must all-reduce a gradient of $N = 7 \times 10^{10}$ numbers (2 bytes each in BF16) once per training step, over an interconnect delivering an effective 100 gigabytes per second per GPU for the collective. Estimate the all-reduce time per step using the standard ring-all-reduce cost of roughly $2(K-1)/K$ times the gradient size in bytes divided by bandwidth, for the data-parallel group size $K$ implied by Code 19.1.2. Compare it to the per-step compute time implied by Output 19.1.1, and state whether communication or computation dominates a step, and what that implies for the realized utilization figure. We make this kind of estimate rigorous in Chapter 3.