Part IV: Parallel Deep Learning and Large Models

"I was promised that all eight of us would share the model evenly. I now hold layers nineteen through twenty-four, a third of the optimizer state, and a deep suspicion that the activations I forwarded are still in flight somewhere."
A Shard That Believes It Is the Whole Model

Big Picture

When a model and its training run outgrow a single accelerator, you stop replicating the whole thing and start cutting it apart, distributing the training and model axes across many devices that must agree, every step, on one set of weights. Part IV is the heart of large-model training. It begins where a model still fits on one device and only the data is too large, then removes that assumption one piece at a time: split the layers, split each layer, shard the optimizer state, route tokens to a sparse subset of experts, survive the loss of a worker mid-run, and orchestrate the months-long campaigns that produce foundation models. The same collective communication that Part I introduced and Part III scaled across machines now becomes the inner loop of every method here, and the two taxes named in Chapter 1, communication and failure, set the price of every design choice. By the end of this part you will be able to take a model that does not fit on your hardware and choose, justify, and combine the parallelism strategies that make training it possible.

Part Overview

Part III distributed the learning algorithm while assuming the model itself was small enough to live on one machine. Part IV removes that assumption. The defining problem of modern deep learning is that the model, its activations, its gradients, and its optimizer state have grown larger than the memory of any single accelerator, and that a single training run now consumes thousands of devices for weeks. The discipline of cutting one model and one optimization across many devices, while keeping them in exact agreement step after step, is what this part teaches. It is the most communication-intensive territory in the book, and the place where the abstract axes of Chapter 1 become concrete engineering.

The part is organized as a progression that relaxes a single constraint at a time. Chapter 15 starts from the friendliest case, in which the model fits but the data does not, and develops data-parallel training as the exact scale-out of the gradient identity, with ring all-reduce and gradient bucketing as its engine. Chapter 16 breaks the assumption that the model fits at all, partitioning it across devices three ways: tensor (intra-layer) parallelism, pipeline (inter-layer) parallelism, and sharded data parallelism in the ZeRO and FSDP family that splits the optimizer state itself. Chapter 17 makes the model sparse, routing each token through only a few of many experts so that capacity grows without proportional compute, and distributing those experts across devices. These three chapters define the parallelism dimensions whose combinations every large training stack composes.
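The "exact scale-out of the gradient identity" can be checked on a toy problem. The sketch below is an illustration only, not the construction Chapter 15 develops: it assumes a scalar linear model with squared loss, equal-sized shards, and a hypothetical `all_reduce_mean` helper that stands in for a real all-reduce. Under those assumptions, the average of the per-worker gradients matches the gradient of the full batch.

```python
# Toy check: averaging per-shard gradients reproduces the full-batch gradient.
# Assumptions (not from the text): scalar model y = w * x, loss 0.5 * (w*x - y)^2,
# equal-sized shards, and a simulated all-reduce instead of real communication.

def mean_grad(w, batch):
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w over (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Hypothetical stand-in for all-reduce: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
        (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]
w = 0.5
num_workers = 4
shard_size = len(data) // num_workers

# Each worker holds a replica of w and one shard of the batch.
shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
local_grads = [mean_grad(w, shard) for shard in shards]

# After synchronization every replica holds the same gradient.
synced = all_reduce_mean(local_grads)
full_batch_grad = mean_grad(w, data)

print(synced[0], full_batch_grad)
```

The equality relies on the shards being the same size; with uneven shards the per-worker means would need to be weighted by shard size before averaging.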

The second half turns from how to cut the work to how to run it at scale and survive. Chapter 18 confronts the reality that on thousands of devices something always fails, building elastic and fault-tolerant training that checkpoints, recovers, and resizes the worker pool without losing the run. Chapter 19 assembles every preceding dimension into the months-long campaigns that train foundation models, where parallelism choices, data pipelines, stability, and cost collide. Chapter 20 builds the distributed reinforcement-learning infrastructure that separates actors, learners, and replay at scale, the substrate that later powers both alignment and the multi-agent learning of Part VI. Chapter 21 closes the part by distributing the search over training configurations themselves, running many trials in parallel and pruning the unpromising ones early.

A thread runs through all seven chapters. The collective communication primitives of Chapter 4 are the engine of every method here: all-reduce synchronizes data-parallel gradients, all-gather and reduce-scatter drive sharded training, and all-to-all routes tokens to experts. The parameter-server sharding of Chapter 11 returns, generalized, as the optimizer-state sharding of Chapter 16, and the data parallelism of Chapter 15 returns as one factor in the expert parallelism of Chapter 17. Every chapter pays the same two taxes Chapter 1 named, communication and failure, and the art of this part is balancing them so that adding devices keeps buying real training throughput rather than idle accelerators waiting on a barrier.
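To make the four collectives concrete, the sketch below simulates them on plain Python lists, with one list per worker. It is a semantic illustration under stated assumptions rather than any library's API: the function names are hypothetical, no real communication happens, and each buffer length is assumed to be divisible by the number of workers.

```python
# Simulated collectives: bufs[r] is the buffer held by worker r.
# Assumption: every buffer has the same length, divisible by the worker count.

def all_reduce(bufs):
    """Every worker ends with the elementwise sum of all buffers."""
    total = [sum(column) for column in zip(*bufs)]
    return [list(total) for _ in bufs]

def reduce_scatter(bufs):
    """Worker r ends with the elementwise sum of chunk r across all workers."""
    n = len(bufs)
    chunk = len(bufs[0]) // n
    return [[sum(b[r * chunk + j] for b in bufs) for j in range(chunk)]
            for r in range(n)]

def all_gather(chunks):
    """Every worker ends with the concatenation of all workers' chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]

def all_to_all(bufs):
    """Worker d receives chunk d from every worker, in rank order."""
    n = len(bufs)
    chunk = len(bufs[0]) // n
    return [[x for src in bufs for x in src[d * chunk:(d + 1) * chunk]]
            for d in range(n)]

bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]

summed = all_reduce(bufs)            # data-parallel gradient synchronization
scattered = reduce_scatter(bufs)     # each worker keeps only its slice of the sum
regathered = all_gather(scattered)   # slices reassembled on every worker
routed = all_to_all(bufs)            # the exchange pattern used to route tokens to experts

print(summed, scattered, regathered, routed)
```

In this simulation a reduce-scatter followed by an all-gather gives the same result as a single all-reduce, which is one way to see why sharded training can be built from those two primitives.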

The One Idea That Organizes Part IV

If you keep one idea from this part, keep this: parallelism strategies are not alternatives but dimensions you compose, and you pick each one to relax the specific resource that ran out. Data parallelism answers a throughput ceiling, tensor and pipeline parallelism answer a model-memory ceiling, sharded parallelism answers an optimizer-state ceiling, and expert parallelism answers a capacity ceiling without paying full compute. Real foundation-model training combines several of these at once, nesting data parallelism over pipeline stages over tensor-sharded layers, and the engineering question is never "which one" but "in what combination, and at what communication cost". Elastic training keeps that combination running when devices fail, and distributed search tunes the configuration that drives it. Read the seven chapters as one toolbox: each adds a dimension, and Chapter 19 shows you how they fit together in a single real run.
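A back-of-the-envelope sketch can show how composed dimensions multiply into a device count and divide the per-device model state. Everything in it is an assumption for illustration: the `Layout` class and `model_state_gib` function are hypothetical names, the byte counts (2 for weights, 2 for gradients, 12 for optimizer state per parameter) follow a commonly used mixed-precision Adam accounting, splits are assumed perfectly even, and activation memory is ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layout:
    """Hypothetical description of one composed parallelism configuration."""
    data: int                      # data-parallel replicas
    pipeline: int                  # pipeline stages
    tensor: int                    # tensor-parallel shards per layer
    shard_optimizer: bool = False  # shard optimizer state across data-parallel ranks

    @property
    def world_size(self):
        # The dimensions compose multiplicatively into the total device count.
        return self.data * self.pipeline * self.tensor

def model_state_gib(params, layout, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough per-device model state in GiB, ignoring activations (assumed accounting)."""
    # Pipeline and tensor parallelism divide the parameters held per device.
    local_params = params / (layout.pipeline * layout.tensor)
    # Optional sharding divides only the optimizer state across data-parallel ranks.
    optim = optim_bytes / layout.data if layout.shard_optimizer else optim_bytes
    return local_params * (weight_bytes + grad_bytes + optim) / 2**30

params = 8 * 2**30  # illustrative parameter count, chosen so the arithmetic is clean

plain = Layout(data=8, pipeline=4, tensor=2)
sharded = Layout(data=8, pipeline=4, tensor=2, shard_optimizer=True)

print(plain.world_size, model_state_gib(params, plain))
print(sharded.world_size, model_state_gib(params, sharded))
```

Both layouts use the same number of devices; only the sharded one reduces the optimizer-state share of memory, which is the sense in which each dimension relaxes one specific resource. The sketch says nothing about communication cost, which the chapters treat separately.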

Part Roadmap

15. Data-Parallel Deep Learning: The friendliest case made rigorous: the model fits but the data does not, so replicate the weights, split the batch, and synchronize gradients with ring all-reduce and bucketing, the exact scale-out of the gradient identity.
16. Model, Pipeline, and Sharded Parallelism: When the model itself does not fit: tensor parallelism inside each layer, pipeline parallelism across layers, and ZeRO/FSDP sharding that splits the optimizer state across devices.
17. Expert Parallelism and Sparse Distributed Models: Growing capacity without growing compute: mixture-of-experts routing sends each token through a few of many experts, distributed across devices with all-to-all communication.
18. Elastic and Fault-Tolerant Distributed Training: Surviving scale: on thousands of devices something always fails, so checkpoint, recover, and resize the worker pool elastically without losing the run.
19. Training Foundation Models at Scale: The full assembly: months-long campaigns where parallelism dimensions, data pipelines, numerical stability, and cost collide in a single real training run.
20. Distributed Reinforcement Learning Infrastructure: Learning from experience at scale: the actor, learner, and replay architecture that decouples data generation from policy updates, the substrate alignment and multi-agent learning both reuse.
21. Distributed Hyperparameter Search and AutoML: Distributing the search itself: run many training trials in parallel, share intermediate results, and prune the unpromising configurations early to spend compute where it pays.

Read the seven chapters in order and each one adds a dimension to the toolbox. Chapters 15 through 17 define the parallelism strategies (data, model, pipeline, sharded, and expert) that every large stack composes; Chapter 18 keeps the composition running through failure; Chapter 19 shows the strategies working together in a real foundation-model run; and Chapters 20 and 21 extend the same distributed machinery to reinforcement learning and to the search over training configurations. The all-reduce you met in Chapter 1 and built in Chapter 4 is the heartbeat of all of them.

What's Next?

Part IV trained the largest models we build; Part V: Distributed Inference and Serving deploys them. The model that took thousands of devices and weeks to train must now answer requests in milliseconds, for many users at once, at a cost that does not bankrupt the operator. Part V opens with the per-node efficiency prerequisite (quantization, KV-cache paging, FlashAttention) that Chapter 22 labels plainly as scale-up, then multiplies those gains across a serving fleet: distributed inference, distributed LLM serving, distributed retrieval, and the MLOps that keeps the whole fleet healthy. The tensor and pipeline parallelism you learned here reappear, repurposed from training throughput to serving latency, and the KV-cache economics of a single node return multiplied across the fleet. Read Part V next to follow a foundation model from the training cluster to the user.