Part IV: Parallel Deep Learning and Large Models
Chapter 19: Training Foundation Models at Scale

Orchestrating Distributed Pretraining

"They spent four chapters teaching me to shard, to pipeline, to route, and to recover. Then one Tuesday they switched everything on at once and asked me to behave like a single calm machine for ninety days."

A Cluster Mid-Pretraining, Day Forty-One
Big Picture

A foundation-model pretraining run is the moment every parallelism in Part IV stops being a separate technique and becomes one composed system: a data loader streaming tokenized shards into a model that is simultaneously data-parallel, tensor-parallel, pipeline-parallel, and possibly expert-parallel, all running under elastic checkpointing while a team watches loss curves and utilization dashboards for months. Nothing here is new machinery; the machinery was built in Chapters 15 through 18. What is new is the orchestration: choosing the parallel degrees that tile your cluster, wiring the training recipe (warmup, decay, clipping, large-batch scaling) so a 2048-GPU run stays numerically stable, and treating Model FLOPs Utilization as the single number that decides whether your millions of dollars of compute are doing arithmetic or waiting on the network. This section is where the chapters compose.

By this point in the book the individual parallelisms are familiar tools. Data parallelism replicates the model and synchronizes gradients with an all-reduce (Chapter 15). Tensor and pipeline parallelism split one model that no single device can hold (Chapter 16). Expert parallelism routes tokens to experts scattered across machines (Chapter 17). Elastic, fault-tolerant training keeps a thousand-GPU job alive through constant hardware failures (Chapter 18). A real pretraining run uses all of them at once, on the same step, for weeks. Orchestrating that composition, deciding how much of each parallelism to apply, how to feed it data, and how to keep it efficient and alive, is the subject of this section and the practical heart of the chapter.

1. Assembling the Parallelism: Mapping a Model onto a Cluster Intermediate

The first decision in any pretraining run is how to lay the model across the GPUs. You have a fixed cluster of $N$ accelerators and several orthogonal ways to split work, and the product of the chosen degrees must exactly equal the world size. For a dense transformer the three classic axes are tensor parallel (TP), pipeline parallel (PP), and data parallel (DP), giving the well-known 3D-parallel mapping. The constraint is a single identity,

$$N = d_{\text{TP}} \cdot d_{\text{PP}} \cdot d_{\text{DP}},$$

and when the model is a mixture of experts you add a fourth factor $d_{\text{EP}}$ for expert parallelism, making it 4D. The art is not in the arithmetic but in matching each axis to the hardware level whose bandwidth it can afford. Tensor parallelism exchanges activations on every layer, so it must stay inside one node where NVLink is fast; pushing it across nodes collapses utilization. Pipeline parallelism only passes a thin activation tensor between adjacent stages, so it tolerates the slower inter-node fabric and is the natural way to span nodes to fit a model that does not fit in one. Data parallelism communicates once per step (the gradient all-reduce) and soaks up whatever GPUs remain to raise throughput.
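
The tiling constraint can be explored mechanically before any placement judgment is applied. The sketch below enumerates every $(d_{\text{TP}}, d_{\text{PP}}, d_{\text{DP}})$ triple that multiplies to the world size. It uses two simplifying rules of thumb that are assumptions of this example, not hard laws: the tensor-parallel degree must divide the node size (so a TP group never leaves a node), and the pipeline degree must divide the layer count (so stages are equal). The function name `valid_mappings` is illustrative.

```python
def valid_mappings(n_gpus, gpus_per_node, n_layers):
    """Enumerate (tp, pp, dp) triples whose product is exactly n_gpus.

    Assumed rules of thumb: tp divides gpus_per_node (TP stays intra-node)
    and pp divides n_layers (every pipeline stage holds equal layers).
    """
    mappings = []
    for tp in range(1, gpus_per_node + 1):
        if gpus_per_node % tp != 0:
            continue
        for pp in range(1, n_layers + 1):
            if n_layers % pp != 0 or n_gpus % (tp * pp) != 0:
                continue
            dp = n_gpus // (tp * pp)  # data parallel fills the remainder
            mappings.append((tp, pp, dp))
    return mappings


candidates = valid_mappings(n_gpus=2048, gpus_per_node=8, n_layers=80)
for tp, pp, dp in candidates:
    if tp == 8:  # show only the full-node tensor-parallel options
        print(f"TP={tp} PP={pp:2d} DP={dp:4d}  product={tp * pp * dp}")
```

For a 2048-GPU cluster of 8-GPU nodes and an 80-layer model, the TP=8 rows include the mapping TP=8, PP=8, DP=32. The enumeration only lists what tiles the cluster; choosing among the candidates still depends on memory fit and on the bandwidth hierarchy described above.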

[Figure 19.6.1 diagram: a data loader streams and prefetches tokenized shards (Ch 8) into a 3D-parallel model on the GPU cluster. The model is drawn as data-parallel replicas (gradient all-reduce, Ch 15), each containing TP shards (intra-node) and pipeline stages A and B, repeated DP times across the cluster. The model writes to a durable, asynchronous checkpoint store that serves as the restart point (Ch 18) and reports loss, grad-norm, MFU, and throughput to monitoring (Ch 5, Ch 18).]
Figure 19.6.1: The pretraining stack this section orchestrates. A streaming data loader feeds tokenized shards into a model that is tensor-parallel inside each node, pipeline-parallel across its stages, and data-parallel across replicas. The same model writes durable checkpoints asynchronously and emits loss, gradient-norm, throughput, and Model FLOPs Utilization to a monitoring loop. Every box is a chapter of Part IV; this section wires them together.
Key Insight: The Parallelism Degrees Are a Bandwidth-Matching Problem

Choosing $d_{\text{TP}}, d_{\text{PP}}, d_{\text{DP}}$ (and $d_{\text{EP}}$) is not free choice; it is placement against a bandwidth hierarchy. Tensor parallelism is the hungriest communicator and belongs on the fastest link (intra-node NVLink), pipeline parallelism is the leanest and spans the slow inter-node fabric, and data parallelism fills the remainder. Get the mapping backwards, for example tensor-parallel across nodes, and Model FLOPs Utilization collapses even though every other line of code is correct. The cluster topology, not the model, dictates the first cut of the degrees.
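
One way to see the placement rule is to map global ranks to parallel coordinates and count how many nodes each tensor-parallel group touches. The sketch below assumes ranks are assigned to nodes contiguously (`rank // gpus_per_node`) and that the tensor-parallel index varies fastest; real frameworks define their own rank orderings, so treat the layout and the helper names as illustrative assumptions.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), TP fastest-varying."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_group_nodes(tp, pp, dp, gpus_per_node):
    """For every TP group, collect the set of nodes its ranks live on."""
    groups = {}
    for rank in range(tp * pp * dp):
        dp_idx, pp_idx, _ = rank_to_coords(rank, tp, pp, dp)
        node = rank // gpus_per_node  # assumed contiguous rank-to-node layout
        groups.setdefault((dp_idx, pp_idx), set()).add(node)
    return groups


good = tp_group_nodes(tp=8, pp=8, dp=32, gpus_per_node=8)
bad = tp_group_nodes(tp=16, pp=8, dp=16, gpus_per_node=8)
good_span = max(len(nodes) for nodes in good.values())
bad_span = max(len(nodes) for nodes in bad.values())
print(f"TP=8 : each TP group spans {good_span} node(s)")
print(f"TP=16: each TP group spans {bad_span} node(s)")
```

Under these assumptions the 8-way groups stay inside one node, while the 16-way groups straddle two, which puts the per-layer tensor-parallel traffic on the inter-node fabric.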

2. Wiring the Data Loader and Running Under Elasticity Intermediate

A model laid across the cluster is useless without tokens to feed it, and at pretraining scale the data path is itself a distributed system. The corpus was tokenized and written as sharded files in the previous section, and those shards live on the distributed file system of Chapter 8. Each data-parallel replica streams a disjoint set of shards, prefetching and decoding ahead of the GPU so the accelerators never stall waiting for input. The subtlety unique to pretraining is that the loader must be deterministic and resumable: when a failure throws the job back to a checkpoint, the data loader has to re-seek to the exact token offset it had reached when that checkpoint was written, or the run will silently re-train on some tokens and skip others. This is why the checkpoint holds the loader's position alongside the model weights and optimizer state.
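
A toy version of a resumable loader makes the requirement concrete. The sketch below is a minimal in-memory illustration, not the loader of any particular framework: the class name `ResumableShardLoader`, the strided shard assignment, and the two-field state are all assumptions chosen for clarity. Shards are plain Python lists standing in for tokenized files.

```python
class ResumableShardLoader:
    """Toy loader: each DP rank walks its own shards in a fixed order."""

    def __init__(self, shards, dp_rank, dp_size):
        # Disjoint assignment: rank r reads shards r, r + dp_size, ...
        self.my_shards = shards[dp_rank::dp_size]
        self.shard_idx = 0  # which of this rank's shards is being read
        self.offset = 0     # next token position inside that shard

    def next_batch(self, n_tokens):
        """Return up to n_tokens tokens, or fewer once shards run out."""
        batch = []
        while len(batch) < n_tokens and self.shard_idx < len(self.my_shards):
            shard = self.my_shards[self.shard_idx]
            take = shard[self.offset:self.offset + n_tokens - len(batch)]
            batch.extend(take)
            self.offset += len(take)
            if self.offset >= len(shard):  # shard exhausted, move on
                self.shard_idx += 1
                self.offset = 0
        return batch

    def state_dict(self):
        """Position to store in the checkpoint next to the weights."""
        return {"shard_idx": self.shard_idx, "offset": self.offset}

    def load_state_dict(self, state):
        self.shard_idx = state["shard_idx"]
        self.offset = state["offset"]


# Four shards of ten "tokens" each; rank 0 of 2 reads shards 0 and 2.
shards = [list(range(i * 10, i * 10 + 10)) for i in range(4)]
loader = ResumableShardLoader(shards, dp_rank=0, dp_size=2)
loader.next_batch(6)
ckpt = loader.state_dict()          # checkpoint taken here
expected = loader.next_batch(6)     # what the run sees next

restored = ResumableShardLoader(shards, dp_rank=0, dp_size=2)
restored.load_state_dict(ckpt)      # simulated restart
replayed = restored.next_batch(6)
print(replayed == expected)
```

After the simulated restart the restored loader yields the same batch the original run would have seen next, so no token is trained on twice and none is skipped. A production loader would also need to capture shuffling seeds and prefetch buffers, which this sketch leaves out.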

Running under elasticity ties the whole picture together. Over a run of weeks across thousands of GPUs, hardware failure is not an exception to handle but a steady-state condition to absorb, exactly the regime that Chapter 18 built machinery for. The orchestration writes checkpoints on a fixed wall-clock cadence, detects a dead worker, reconstitutes the process group (possibly at a smaller world size), reloads the last checkpoint including the data-loader offset, and resumes. The cost of each restart is wasted compute since the last checkpoint plus a fixed reload overhead, and a well-run job budgets for it rather than being surprised by it. Our planner in Section 4 makes that budget a concrete number.
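
The two quantities in this paragraph, the world size after a failure and the cost of one restart, reduce to a few lines of arithmetic. The sketch below assumes the tensor and pipeline degrees are pinned by the model's memory footprint, so only the data-parallel degree can shrink; the function names are hypothetical.

```python
def retile_after_failure(healthy_gpus, tp, pp):
    """Largest DP degree that still tiles the surviving GPUs.

    Assumes TP and PP are fixed, so a replica needs tp * pp GPUs and any
    leftover GPUs sit idle as spares until the failed node returns.
    """
    replica_size = tp * pp
    dp = healthy_gpus // replica_size
    if dp == 0:
        raise ValueError("not enough GPUs left for even one model replica")
    idle = healthy_gpus - dp * replica_size
    return dp, idle


def restart_cost_hours(hours_since_ckpt, reload_overhead_h):
    """Wall-clock lost to one failure: recomputed work plus reload time."""
    return hours_since_ckpt + reload_overhead_h


# One 8-GPU node dies out of 2048 GPUs with TP=8, PP=8.
dp, idle = retile_after_failure(healthy_gpus=2048 - 8, tp=8, pp=8)
print(f"DP shrinks to {dp}, with {idle} healthy GPUs idle")
print(f"lost: {restart_cost_hours(0.5, 0.25):.2f} h for this restart")
```

Losing a single node removes a whole 64-GPU replica in this configuration, leaving 56 healthy GPUs idle, which is one reason large jobs often keep spare nodes on hand. Shrinking the data-parallel degree also changes the global batch unless gradient accumulation is adjusted to compensate.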

3. The Training Recipe at Scale Advanced

Composing the parallelism gets the arithmetic onto the cluster; the training recipe decides whether that arithmetic converges to a good model instead of diverging into a wall of NaNs. Three ingredients dominate at scale. The first is the learning-rate schedule: a short linear warmup over the first few thousand steps lets the optimizer's running statistics settle before the full step size is applied, after which a cosine or linear decay brings the rate down toward zero over the token budget. A schedule with peak rate $\eta_{\max}$, warmup length $T_w$, and total length $T$ takes the shape

$$\eta(t) = \begin{cases} \eta_{\max}\,\dfrac{t}{T_w}, & t \le T_w, \\[6pt] \eta_{\max}\,\dfrac{1 + \cos\!\big(\pi\,\tfrac{t - T_w}{T - T_w}\big)}{2}, & t > T_w. \end{cases}$$
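
The schedule translates directly into code. The sketch below is a minimal pure-Python version of the formula above; the peak rate and warmup length match the values used later in Code 19.6.2, while the total step count of 100,000 is an illustrative assumption.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    assert 0 < warmup_steps < total_steps
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


PEAK, WARMUP, TOTAL = 3e-4, 2000, 100_000  # TOTAL is an assumed value
for step in (0, 1000, 2000, 51_000, 100_000):
    print(f"step {step:>7,d}: lr = {lr_at(step, PEAK, WARMUP, TOTAL):.2e}")
```

The rate climbs linearly to the peak at the end of warmup, passes through half the peak at the midpoint of the decay phase, and reaches zero at the final step. Many production recipes decay to a small nonzero floor instead of zero; that variant is not shown here.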

The second ingredient is large-batch scaling. A 2048-GPU job has an enormous global batch (millions of tokens per step), and the relationship between batch size and the right learning rate is the subject of Chapter 10: too small a rate wastes the parallelism, too large a rate destabilizes training, and the usable batch saturates past a critical size. The third is gradient clipping, the safety valve that rescales the global gradient whenever its norm exceeds a threshold, so a single bad batch cannot blow the weights apart. These three ingredients (warmup-decay, a batch-matched learning rate, and clipping) are what keep a run on the rails for months.
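
Gradient clipping by global norm is short enough to write out. The sketch below works on plain Python lists so it runs anywhere; the small epsilon in the denominator is a common numerical guard and an assumption of this example. In a sharded model the squared norms would first have to be summed across all tensor- and pipeline-parallel ranks, which is not shown.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor if their global L2 norm is too large.

    grads is a list of per-parameter gradient lists (plain floats here).
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale upward
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


spiky = [[3.0, 4.0], [0.0, 12.0]]           # global norm is 13
clipped, before = clip_by_global_norm(spiky, max_norm=1.0)
after = math.sqrt(sum(g * g for t in clipped for g in t))
print(f"norm before = {before:.2f}, after = {after:.4f}")

calm = [[0.3, 0.4]]                          # global norm is 0.5
unchanged, _ = clip_by_global_norm(calm, max_norm=1.0)
print(unchanged == calm)
```

Because every gradient is multiplied by the same factor, clipping shortens the update without changing its direction, and gradients already under the threshold pass through untouched.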

Fun Note: The Loss Spike at 3 a.m.

Pretraining teams develop a folklore around the loss spike: the curve, flat and reassuring for days, suddenly jumps, and a pager goes off. Sometimes it self-heals as the optimizer recovers; sometimes it is the leading edge of a divergence that will ruin the run. The standard remedy entered the literature as a manual operation: roll back to a checkpoint before the spike, skip the offending data batches, and resume. An entire engineering practice exists around deciding, at 3 a.m., whether to wait and watch or to roll back and burn a day of compute.

4. A Pretraining-Run Planner Intermediate

The decisions above (parallel degrees, throughput, wall-clock, cost, and expected failures) can be turned into a single back-of-the-envelope planner before a single GPU is reserved. The code below takes a model, a token budget, and a cluster, picks the 3D-parallel degrees so they tile the world size, and estimates the run. Its throughput estimate rests on the standard dense-transformer FLOP count of $C \approx 6PT$ for $P$ parameters and $T$ training tokens (forward plus backward; here $T$ counts tokens, not the schedule length of Section 3), and on a target Model FLOPs Utilization, the fraction of the hardware's peak the run actually sustains. Everything else (GPU-hours, dollars, and the number of failures to expect from the cluster's mean time between failures) follows by arithmetic.

import math

# A pretraining-run planner: model + tokens + cluster -> a run plan.
# Choose parallel degrees, then estimate MFU, wall-clock, GPU-hours,
# dollar cost, and how many hardware failures to expect over the run.

# --- Inputs: the model (a Llama-3-70B-class dense transformer). ---
P = 70e9                      # parameters
n_layers = 80
d_model = 8192
tokens = 15e12               # training token budget (15T)

# --- Inputs: the cluster (H100 SXM nodes, 8 GPUs each). ---
n_gpus = 2048
gpu_per_node = 8
peak_flops = 989e12          # H100 BF16 peak, FLOP/s per GPU (dense)
dollars_per_gpu_hour = 2.50  # typical large-reservation rate
mtbf_gpu_hours = 50000.0     # mean time between failures, per-GPU hours

# --- Step 1: pick parallel degrees (tensor x pipeline x data = world size). ---
# Tensor parallel stays inside one node (fast NVLink); pipeline spans nodes to
# fit the model; data parallel soaks up the rest of the cluster for throughput.
tp = gpu_per_node            # 8-way tensor parallel, NVLink-bound, intra-node
pp = 8                       # 8 pipeline stages -> 10 layers per stage
dp = n_gpus // (tp * pp)     # data-parallel replicas fill the remaining GPUs
assert tp * pp * dp == n_gpus, "degrees must tile the cluster exactly"

# --- Step 2: arithmetic of the run. 6*P*tokens is the standard dense FLOP
#     estimate (fwd+bwd) from the Chinchilla/Kaplan accounting. ---
total_flops = 6.0 * P * tokens
mfu = 0.45                   # target Model FLOPs Utilization (H100 70B regime)
sustained = n_gpus * peak_flops * mfu     # delivered FLOP/s across the cluster

seconds = total_flops / sustained
hours = seconds / 3600.0
days = hours / 24.0
gpu_hours = hours * n_gpus
cost = gpu_hours * dollars_per_gpu_hour

# --- Step 3: expected failures over the wall-clock, from per-GPU MTBF. ---
expected_failures = gpu_hours / mtbf_gpu_hours
# A restart wastes, on average, half a checkpoint interval plus a fixed
# restart overhead. Budget that lost time.
ckpt_interval_h = 1.0        # checkpoint every hour of wall-clock
restart_overhead_h = 0.25    # load weights + dataloader re-seek per restart
lost_h = expected_failures * (ckpt_interval_h / 2.0 + restart_overhead_h)
overhead_pct = 100.0 * lost_h / hours

print("=== Pretraining run plan ===")
print(f"model                : {P/1e9:.0f}B params, {n_layers} layers, d_model={d_model}")
print(f"token budget         : {tokens/1e12:.0f}T tokens  ->  {total_flops:.2e} FLOPs (6*P*T)")
print(f"cluster              : {n_gpus} GPUs ({n_gpus//gpu_per_node} nodes x {gpu_per_node})")
print("--- chosen parallel degrees ---")
print(f"tensor  (intra-node) : TP={tp}")
print(f"pipeline (cross-node): PP={pp}  ->  {n_layers//pp} layers/stage")
print(f"data    (replicas)   : DP={dp}")
print(f"check  TP*PP*DP       : {tp}*{pp}*{dp} = {tp*pp*dp} == world size {n_gpus}")
print("--- performance & economics ---")
print(f"target MFU           : {mfu*100:.0f}%")
print(f"sustained throughput : {sustained/1e15:.2f} PFLOP/s")
print(f"wall-clock           : {hours:,.0f} hours  ->  {days:.1f} days")
print(f"GPU-hours            : {gpu_hours:,.0f}")
print(f"compute cost         : ${cost/1e6:.2f}M  (@ ${dollars_per_gpu_hour:.2f}/GPU-h)")
print("--- reliability budget ---")
print(f"expected failures    : {expected_failures:.1f} over the run")
print(f"time lost to restarts: {lost_h:.1f} h  ({overhead_pct:.1f}% overhead)")
print(f"effective wall-clock : {days*(1+overhead_pct/100):.1f} days incl. restarts")
Code 19.6.1: A pure-Python pretraining-run planner. It tiles the cluster into tensor, pipeline, and data-parallel degrees, then turns the $6PT$ FLOP count and a target MFU into wall-clock, GPU-hours, dollar cost, and an expected-failure budget, all before any hardware is reserved.
=== Pretraining run plan ===
model                : 70B params, 80 layers, d_model=8192
token budget         : 15T tokens  ->  6.30e+24 FLOPs (6*P*T)
cluster              : 2048 GPUs (256 nodes x 8)
--- chosen parallel degrees ---
tensor  (intra-node) : TP=8
pipeline (cross-node): PP=8  ->  10 layers/stage
data    (replicas)   : DP=32
check  TP*PP*DP       : 8*8*32 = 2048 == world size 2048
--- performance & economics ---
target MFU           : 45%
sustained throughput : 911.46 PFLOP/s
wall-clock           : 1,920 hours  ->  80.0 days
GPU-hours            : 3,932,142
compute cost         : $9.83M  (@ $2.50/GPU-h)
--- reliability budget ---
expected failures    : 78.6 over the run
time lost to restarts: 59.0 h  (3.1% overhead)
effective wall-clock : 82.5 days incl. restarts
Output 19.6.1: The planned run for a 70B model on 15T tokens across 2048 H100s: 80 days of wall-clock, roughly 3.9 million GPU-hours, near 10 million dollars of compute, and about 79 hardware failures to absorb. The restarts add only about 3 percent overhead because checkpointing is frequent. Because wall-clock scales as one over MFU, halving the target MFU would double the wall-clock, the GPU-hours, the dollar cost, and the expected failure count, which is why MFU is the number teams obsess over.

The planner makes the chapter's economics tangible. A foundation-model run is months of wall-clock, thousands of GPUs, tens of failures, and a compute bill near ten million dollars, and the one knob that scales every line at once is Model FLOPs Utilization. Notice what the output does not say: it assumes the parallel degrees were placed correctly against the bandwidth hierarchy. If tensor parallelism had been spread across nodes, the achieved MFU in the same formula would be a fraction of the target; at half the target, for example, the 80-day plan would stretch to 160 days at the same cost per GPU-hour.

Thesis Thread: Where Every Axis of Distribution Meets

This section is the convergence point promised back in Section 1.1, where the six axes of distribution were first named. A pretraining run distributes the data (the streaming loader of Chapter 8), distributes the training (the gradient all-reduce of Chapter 15), distributes the model (the tensor and pipeline shards of Chapter 16, the experts of Chapter 17), and coordinates the cluster (the elastic checkpointing of Chapter 18), all on the same step, for months. No earlier chapter could compose these because each owned only one axis. Foundation-model pretraining is the system in which the whole book's thesis, distribute the essential work and keep many machines behaving as one, is exercised at full scale.

5. Why Model FLOPs Utilization Is the Central Performance Goal Advanced

Model FLOPs Utilization is the fraction of the hardware's theoretical peak that your run actually spends doing the model's useful arithmetic. If $C = 6PT$ is the model's FLOP count, $N$ the GPU count, $F_{\text{peak}}$ the per-GPU peak, and $\tau$ the wall-clock seconds, then

$$\text{MFU} = \frac{6PT}{N \cdot F_{\text{peak}} \cdot \tau}.$$

Every loss in the system (communication that does not overlap with computation, pipeline bubbles, dataloader stalls, recomputation for memory, and restart overhead) shows up as a smaller MFU and a longer $\tau$. This is why MFU, not raw teraflops or step time, is the number a pretraining team treats as its north star: it is dimensionless, comparable across model sizes and clusters, and tied directly to cost, because for a fixed model and token budget the compute bill is inversely proportional to it. The measurement discipline behind it, throughput, utilization, and efficiency as first-class metrics, is the subject of Chapter 5, and the babysitting loop watches it on the same dashboard as the loss. A run whose loss is falling beautifully but whose MFU has quietly dropped from 45 to 30 percent is getting only two thirds of the planned work out of every GPU-hour, so a third of what it spends buys nothing, and catching that is as important as catching a loss spike.
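
On a live run, $T/\tau$ is simply the measured token throughput, so MFU can be computed from a number most training logs already report. The sketch below applies the formula above with the cluster and model of Code 19.6.1; the two throughput readings are illustrative assumptions, not measurements from any real run.

```python
def mfu_from_throughput(params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """MFU = (6 * P * tokens/s) / (N * F_peak), using the 6PT approximation."""
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


P = 70e9             # parameters
N_GPUS = 2048
PEAK = 989e12        # H100 BF16 dense peak, FLOP/s per GPU

# Hypothetical dashboard readings (tokens per second across the cluster).
healthy = mfu_from_throughput(P, 2.17e6, N_GPUS, PEAK)   # roughly 0.45
degraded = mfu_from_throughput(P, 1.45e6, N_GPUS, PEAK)  # roughly 0.30
print(f"healthy  MFU: {healthy:.1%}")
print(f"degraded MFU: {degraded:.1%}")
```

The $6PT$ count ignores the attention-score FLOPs that grow with sequence length, so this estimate is slightly conservative for long contexts; it is consistent with the accounting used throughout this section.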

Library Shortcut: Megatron-LM, NeMo, and DeepSpeed Orchestrate the Whole Run

The planner in Code 19.6.1 estimates a run; production frameworks execute it. NVIDIA's Megatron-LM and NeMo, and Microsoft's DeepSpeed, take the tensor, pipeline, data, and expert degrees as configuration and assemble the entire stack of Figure 19.6.1 for you: the sharded model, the overlapped collectives, the pipeline schedule, the distributed optimizer, the resumable data loader, asynchronous checkpointing, and a throughput report in the training log from which MFU follows by dividing by the hardware peak. Launching a 3D-parallel run becomes a config block rather than the thousands of lines of distributed plumbing it represents:

# Megatron-LM / NeMo: the parallel degrees are flags, not code.
torchrun --nproc_per_node=8 --nnodes=256 pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --use-distributed-optimizer \
  --lr 3e-4 --lr-warmup-iters 2000 --lr-decay-style cosine \
  --clip-grad 1.0 --global-batch-size 2048 \
  --save /ckpt --save-interval 500 --log-throughput
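Code 19.6.2: The same 3D-parallel mapping as Code 19.6.1, together with the training recipe of Section 3, expressed as launcher flags. The data-parallel degree is not a flag; it is implied by the world size, here $2048 / (8 \cdot 8) = 32$. The command is abbreviated: model-shape, data-path, and multi-node rendezvous arguments are omitted. The framework handles process-group setup, collective scheduling, the pipeline schedule, gradient clipping, the warmup-cosine schedule, resumable checkpointing, and per-step throughput logging, all of which you would otherwise build by hand across Chapters 15 to 18.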
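
The flags above imply several quantities that never appear on the command line. The sketch below derives them under stated assumptions: a micro-batch of 1 sequence and a sequence length of 8192 tokens are illustrative values not given in the command, and the bubble estimate uses the standard $(p-1)/(m+p-1)$ fraction for a non-interleaved pipeline schedule with $p$ stages and $m$ micro-batches. The function name `batch_plan` is hypothetical.

```python
def batch_plan(world_size, tp, pp, global_batch, micro_batch, seq_len):
    """Derive DP degree, micro-batches per step, tokens per step, and bubble."""
    if world_size % (tp * pp) != 0:
        raise ValueError("tp * pp must divide the world size")
    dp = world_size // (tp * pp)
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global batch must be a multiple of micro_batch * dp")
    n_micro = global_batch // (micro_batch * dp)  # per replica, per step
    bubble = (pp - 1) / (n_micro + pp - 1)        # non-interleaved schedule
    return {
        "dp": dp,
        "micro_batches_per_step": n_micro,
        "tokens_per_step": global_batch * seq_len,
        "pipeline_bubble_fraction": bubble,
    }


plan = batch_plan(
    world_size=2048, tp=8, pp=8,
    global_batch=2048,   # sequences, from --global-batch-size
    micro_batch=1,       # ASSUMPTION: not specified in the command
    seq_len=8192,        # ASSUMPTION: not specified in the command
)
for key, value in plan.items():
    print(f"{key:26s}: {value}")
```

With these assumed values each of the 32 replicas processes 64 micro-batches per step, the global batch is about 16.8 million tokens, and the estimated pipeline bubble is just under 10 percent. A larger global batch or an interleaved schedule would shrink the bubble further.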
Practical Example: Chasing MFU on a Stalling 70B Run

Who: A training-infrastructure engineer on a foundation-model team running a 70B dense model on 2048 H100s.

Situation: The run was live and the loss was descending normally, but the per-step MFU reported on the dashboard had drifted from a planned 45 percent down to 31 percent over the first week.

Problem: At 31 percent MFU the run was on track for roughly 116 days ($80 \times 45/31$) instead of the planned 80, adding millions of dollars and missing the release window, even though no error had been thrown.

Dilemma: Stop a healthy run to investigate, losing days of progress, or let it continue and hope the drop was transient, risking a far larger overrun if it was structural.

Decision: They paused, because the planner showed the MFU gap alone dwarfed any restart cost, and profiled one step end to end.

How: Profiling revealed the dataloader was stalling the pipeline: a slow storage tier was failing to prefetch shards ahead of the GPUs, so accelerators idled between batches. They moved the hot shards to faster storage and increased prefetch depth, then resumed from the last checkpoint with the data-loader offset intact.

Result: MFU recovered to 44 percent, the run returned close to its 80-day trajectory, and the one-day pause paid for itself many times over in compute not wasted.

Lesson: A falling MFU is a silent budget leak that no exception will surface. Watching it with the same vigilance as the loss curve is what separates a run that finishes on plan from one that quietly doubles in cost.
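
The pause-or-continue decision in this example can be reduced to arithmetic on the planner's numbers. The sketch below is a hypothetical reconstruction, not the team's actual tool: it assumes 73 plan-days of work remained when the decision was made and that the pause cost exactly one day, and it reuses the cluster size and price from Code 19.6.1.

```python
def days_to_finish(work_in_plan_days, planned_mfu, actual_mfu):
    """Wall-clock days needed for work that was sized at planned_mfu."""
    return work_in_plan_days * planned_mfu / actual_mfu


PLANNED_MFU = 0.45
CLUSTER_COST_PER_DAY = 2048 * 24 * 2.50  # GPUs * hours/day * $/GPU-hour

# The whole 80-day plan if it ran at the degraded 31 percent.
whole_run_degraded = days_to_finish(80.0, PLANNED_MFU, 0.31)

remaining = 73.0   # ASSUMPTION: plan-days of work left at decision time
pause_days = 1.0   # ASSUMPTION: cost of profiling and fixing the loader

keep_going = days_to_finish(remaining, PLANNED_MFU, 0.31)
pause_and_fix = pause_days + days_to_finish(remaining, PLANNED_MFU, 0.44)
saved_days = keep_going - pause_and_fix
saved_dollars = saved_days * CLUSTER_COST_PER_DAY

print(f"full run at 31% MFU : {whole_run_degraded:.1f} days")
print(f"continue as-is      : {keep_going:.1f} days remaining")
print(f"pause, fix, resume  : {pause_and_fix:.1f} days remaining")
print(f"saved               : {saved_days:.1f} days, ${saved_dollars / 1e6:.2f}M")
```

Under these assumptions the one-day pause saves roughly a month of cluster time, which is the sense in which the MFU gap dwarfed the restart cost.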

6. The Practical Reality: Months, Failures, and a Team on Dashboards Beginner

Strip away the equations and a pretraining run is a human and operational endeavor. It occupies thousands of GPUs continuously for weeks to months, costs millions of dollars in compute, and runs on hardware that is, at this scale, always partly broken: as the planner showed, a single run absorbs dozens of failures, and the team's job is to make each one a brief restart rather than a lost run. People watch dashboards in shifts, tracking the loss, the gradient norm, the throughput, and the MFU, ready to roll back a divergence or chase a utilization drop. The babysitting loop, watch, diagnose, intervene, resume, is as much a part of foundation-model training as the optimizer, and the frameworks of the previous shortcut exist precisely to make that loop survivable across a run that no single person can sit through end to end.

Research Frontier: Production Pretraining Infrastructure (2024 to 2026)

The orchestration of pretraining is now a published research subject of its own. ByteDance's MegaScale (Jiang et al., NSDI 2024) reports training an LLM across more than 10,000 GPUs and documents the full-stack engineering, parallelism placement, communication overlap, and a fault-tolerance system that automatically detects and recovers from failures, needed to sustain about 55 percent MFU at that scale. Meta's Llama 3 herd paper (2024) describes a 16,000-GPU run on its training infrastructure and is unusually candid about operational reality, cataloguing hundreds of interruptions over the run and the checkpointing and health-monitoring systems that kept effective utilization high despite them. Across the field, sustained MFU has become the headline efficiency record that infrastructure papers compete on, and the open frameworks (Megatron-LM, NeMo, DeepSpeed) now ship the overlap, scheduling, and async-checkpointing techniques those papers introduced. The frontier is pushing the same orchestration toward larger clusters, FP8 training, and increasingly geo-distributed runs.

With the run orchestrated, the parallelism composed, the recipe tuned, and the MFU defended, a pretrained foundation model emerges. The next step is to take that general model and adapt it to specific tasks without repeating the full expense, which moves the story from pretraining to distributed fine-tuning in Section 19.7.

Exercise 19.6.1: Re-map the Cluster Conceptual

The run in Output 19.6.1 used $d_{\text{TP}}=8$, $d_{\text{PP}}=8$, $d_{\text{DP}}=32$ on 2048 GPUs. Suppose the model now needs 16-way tensor parallelism to fit a wider layer, but each node still holds only 8 GPUs. Explain what goes wrong if you naively set $d_{\text{TP}}=16$, identify which bandwidth tier the tensor-parallel traffic would now cross, and propose an alternative degree assignment (using pipeline or sharded-data parallelism instead) that fits the model without forcing tensor parallelism across nodes. State the new $d_{\text{TP}} \cdot d_{\text{PP}} \cdot d_{\text{DP}}$ and check it equals 2048.

Exercise 19.6.2: The MFU-to-Dollars Sensitivity Coding

Extend Code 19.6.1 to sweep the target MFU over the set $\{0.30, 0.35, 0.40, 0.45, 0.50\}$ and print, for each, the wall-clock in days and the compute cost in millions of dollars, holding everything else fixed. Then add a second sweep over the per-GPU MTBF $\{20000, 50000, 100000\}$ hours and report how the expected-failure count and restart overhead change. Write one sentence on which of the two knobs, MFU or MTBF, moves the total cost more, and why that justifies the chapter's claim that MFU is the central performance goal.

Exercise 19.6.3: Budget the Restart Overhead Analysis

Using the planner's reliability model, derive an expression for the wall-clock lost to restarts as a function of the checkpoint interval $I$ (in hours), assuming each failure wastes $I/2$ of progress plus a fixed reload overhead $r$, and that failures arrive at the rate implied by the cluster's MTBF. Show that there is a trade-off: very frequent checkpointing (small $I$) wastes time writing checkpoints, while infrequent checkpointing (large $I$) wastes time re-computing after each failure. If writing one checkpoint costs $c$ hours of stalled training, write the total overhead as a function of $I$ and find the $I$ that minimizes it. Relate your answer to the asynchronous-checkpointing techniques of Chapter 18 that drive $c$ toward zero.