Part IV: Parallel Deep Learning and Large Models
Chapter 19: Training Foundation Models at Scale

Training Foundation Models at Scale

Everything before this chapter built a capability in isolation: data parallelism spread the batch, model and sharded parallelism spread the parameters, expert parallelism spread the layers, and the previous chapter armored the whole thing against failure. This chapter is where those capabilities stop being separate techniques and become one system pointed at one target: the multi-week, multi-thousand-GPU pre-training run that produces a frontier foundation model. The work splits into two halves that are easy to underestimate. The first is everything before the optimizer ever runs, the petabyte-scale data pipeline that crawls, cleans, deduplicates, and tokenizes trillions of tokens, all of it distributed because no single machine can hold or process it. The second is the run itself, where scaling laws decide how big a model the compute budget can afford, where every parallelism axis from this part is composed into one job, and where fine-tuning and alignment turn a raw next-token predictor into something useful. The chapter closes on the bill: the energy, the dollars, and the responsibility that come with spending a small data center for a single model.

Conceptual illustration for Chapter 19: Training Foundation Models at Scale

"They told me I was one model. I have since learned that I am a data pipeline that crawled half the web, a thousand GPUs that agreed to stay synchronized for six weeks, and a learning-rate schedule nobody was allowed to touch. The weights are just where we all finally agreed."

A Foundation Model, Reflecting on Its Own Pre-Training
Big Picture

Training a foundation model is a single distributed system that composes the data processing of Part II, the optimization and sharding of Part III, and every parallelism and fault-tolerance axis of Part IV into one end-to-end run, where the binding constraint shifts from any single component to the orchestration that keeps trillions of tokens flowing through thousands of synchronized accelerators for weeks without losing a single irreplaceable hour. This is the chapter where the book's threads converge. The MapReduce shuffle you first met in Chapter 6 reappears as the distributed deduplication that scrubs trillions of near-duplicate documents; the all-reduce of Chapter 4 is the heartbeat of every gradient step; the sharded optimizer of Chapter 16 is what lets the parameters fit at all; and the checkpoint-resume-rescale machinery of Chapter 18 is what keeps a run alive across the inevitable failures. What is genuinely new here is the integration and the economics. Scaling laws turn a fixed compute budget into a principled choice of how many parameters to train on how many tokens, so the run is sized before it starts rather than guessed at. The data pipeline that feeds the run is itself a massive distributed-systems problem, larger in bytes than the model and decisive for final quality, because a frontier model is only as good as the corpus it never gets to see twice. Pre-training orchestration is the discipline of babysitting a job too expensive to repeat: stability engineering, learning-rate schedules, loss-spike recovery, and the operational vigilance that catches a silent divergence before it wastes a week. Fine-tuning and alignment then take the raw base model and, through distributed supervised tuning, preference optimization, and reinforcement learning from human feedback, turn a next-token predictor into an assistant. And because a single run can consume the energy of a small town for a month, the chapter closes on a sober accounting of energy, dollars, and the responsibility that scale imposes. The thesis the chapter serves is the book's whole spine, made concrete: a foundation model is not a single artifact you download, it is the frozen output of a distributed system that had to be engineered end to end.

Chapter Overview

This is the fifth chapter of Part IV, and it is the one that stops teaching a technique and starts building a system. Chapters 15 through 18 each gave you one axis: data parallelism, model and pipeline and sharded parallelism, expert parallelism, and the elasticity and fault tolerance that keep a long job alive. This chapter assumes you hold all of them and asks the harder question of how they combine, with the data pipeline of Part II underneath, into the single most ambitious distributed-systems artifact most engineers will ever touch. The binding constraint is no longer any one resource; it is the orchestration that keeps every piece moving in lockstep, because at this scale the bottleneck is whichever stage is currently failing to keep up, and the cost of getting it wrong is measured in weeks of wasted accelerator time and millions of dollars.

The nine sections fall into four movements. The first frames the run and sizes it: Section 19.1 reframes a foundation model as a distributed system rather than a single artifact, and Section 19.2 develops the scaling laws that turn a compute budget into a principled choice of model and dataset size. The second movement builds the data, which is most of the work: Section 19.3 constructs the trillion-token dataset as a distributed pipeline, Section 19.4 deduplicates it and enforces quality at scale, and Section 19.5 turns raw text into tokens across a corpus too large to process on one machine. The third movement runs the model: Section 19.6 orchestrates the pre-training job itself, composing every parallelism axis with the fault tolerance of the previous chapter and the stability engineering a one-shot run demands. The fourth movement makes the model useful and counts the cost: Section 19.7 fine-tunes it across machines, Section 19.8 aligns it through a systems view of supervised tuning, preference optimization, and reinforcement learning from human feedback, and Section 19.9 accounts rigorously for the energy, dollars, and responsibility of scaling.

Read in order, the nine sections take you from "a foundation model is the output of a distributed system, not a thing you simply download" to "you can size a run from a compute budget, build the petabyte pipeline that feeds it, orchestrate the multi-thousand-GPU job that trains it, align the result into something useful, and account for what it cost." The argument is cumulative and integrative: the scaling laws size the run, the data pipeline feeds it, the orchestration composes every axis of this part into one job, alignment turns the raw model into an assistant, and the cost accounting closes the loop on what spending a data center for a single model actually means.

Prerequisites

This chapter is the capstone of Part IV and assumes you have read the rest of it. From Chapter 16: Model, Pipeline, and Sharded Parallelism you carry the sharded optimizer and the tensor and pipeline partitioning that let a model larger than any single device fit and train; from Chapter 18: Elastic and Fault-Tolerant Distributed Training you carry the checkpointing, deterministic restart, and elasticity that keep a multi-week run alive on hardware that will not stay healthy for its full duration. The data half of this chapter assumes the distributed data processing of Part II, and in particular the Chapter 6: The MapReduce Model and Distributed Algorithms shuffle, because the trillion-token deduplication and cleaning pipeline is a distributed-algorithms problem before it is a machine-learning one. The chapter also leans on the all-reduce of Chapter 4 and the data-parallel step of Chapter 15. It assumes comfortable Python and PyTorch and a working memory of how the parallelism axes of this part compose. No prior experience running a frontier-scale job is required; Section 19.1 builds the systems framing from the ground up before any single component is detailed.

Learning Objectives

Remember the Chapter as One Sentence

If you keep one thing from this chapter, keep this: a foundation model is not a single artifact but the frozen output of a distributed system that had to be engineered end to end, sized by scaling laws, fed by a petabyte-scale data pipeline, trained by composing every parallelism and fault-tolerance axis of this part into one multi-week run, aligned into usefulness across machines, and paid for in energy and dollars that demand an honest accounting. Read forward, the sections build the system in the order it is actually constructed: first the systems framing and the scaling laws that decide how big to build, then the data pipeline that is most of the real work, then the orchestration that runs the job, and finally the alignment and cost accounting that turn the raw model into something useful and count what it took. Read as a question, the chapter asks of any frontier run: how do you decide its size before spending a dollar, how do you assemble and clean the trillions of tokens it eats, how do you keep thousands of accelerators descending the same loss for weeks, how do you make the result helpful, and how do you justify the bill. The roadmap below walks the nine sections that answer it.

Chapter Roadmap

Read the nine sections in order and you will hold a working blueprint for the most ambitious distributed system in the book: Section 19.1 reframes the whole undertaking as one system, Section 19.2 sizes it from a compute budget, Sections 19.3 through 19.5 build and clean and tokenize the trillions of tokens that feed it, Section 19.6 orchestrates the run that trains it, Sections 19.7 and 19.8 adapt and align the result into something useful, and Section 19.9 counts the cost. The thread to watch is the rest of the book reappearing as load-bearing infrastructure: the Chapter 6 shuffle inside deduplication, the Chapter 4 all-reduce inside every step, and the Chapter 18 fault tolerance keeping the whole multi-week run from losing a single irreplaceable hour.

What's Next?

This chapter trained, adapted, and aligned a foundation model by composing every axis of parallel deep learning into one supervised run, but it treated the data as a fixed corpus that arrives, gets cleaned, and is consumed. The next chapter relaxes that assumption in the most demanding way possible. Chapter 20: Distributed Reinforcement Learning Infrastructure turns to the setting where the training data does not exist until the model itself generates it, where actors roll out experience against an environment while learners consume it, and where the central engineering problem becomes keeping sampling throughput and learning throughput in balance across a cluster. The alignment-by-reinforcement-learning machinery you met at the end of this chapter is one instance of that infrastructure; Chapter 20 builds it in full, developing the actor-learner architecture, distributed experience collection and replay, off-policy correction at scale, and the synchronous-versus-asynchronous tradeoffs that decide whether an RL system saturates its accelerators or starves them. Read it next, and watch the data pipeline of this chapter become a feedback loop that the model itself drives.

Bibliography & Further Reading

Scaling Laws and Compute-Optimal Training

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020. arxiv.org/abs/2001.08361

The paper that established the power-law relationship between loss, parameters, data, and compute, the empirical foundation for sizing a run in Section 19.2.

📄 Paper

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., et al. "Training Compute-Optimal Large Language Models (Chinchilla)." arXiv:2203.15556, 2022. arxiv.org/abs/2203.15556

The Chinchilla result that corrected the parameter-to-token ratio for compute-optimal training, the direct reference for the budget-allocation reasoning of Section 19.2.

📄 Paper

Foundation Models and Pre-Training Recipes

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. "Language Models are Few-Shot Learners (GPT-3)." arXiv:2005.14165, 2020. arxiv.org/abs/2005.14165

The 175-billion-parameter model that demonstrated emergent few-shot ability at scale, the motivating example for the orchestration challenge of Section 19.6.

📄 Paper

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288, 2023. arxiv.org/abs/2307.09288

An openly documented pre-training and alignment recipe, a concrete reference for the end-to-end pipeline of Sections 19.6 through 19.8.

📄 Paper

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., et al. "The Llama 3 Herd of Models." arXiv:2407.21783, 2024. arxiv.org/abs/2407.21783

A detailed account of a frontier-scale data, training, and infrastructure stack, the closest published mirror of the whole-chapter system view.

📄 Paper

Dataset Construction, Deduplication, and Data Quality

Penedo, G., Kydlíček, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., Wolf, T. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557, 2024. arxiv.org/abs/2406.17557

A reproducible, large-scale web-data pipeline with ablations on filtering and deduplication, the closest reference for the dataset construction of Sections 19.3 and 19.4.

📄 Paper

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., Carlini, N. "Deduplicating Training Data Makes Language Models Better." arXiv:2107.06499, 2021. arxiv.org/abs/2107.06499

The study showing that near-duplicate removal improves models and reduces memorization, the motivating result for the distributed deduplication of Section 19.4.

📄 Paper

Fine-Tuning and Parameter-Efficient Adaptation

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685, 2021. arxiv.org/abs/2106.09685

The low-rank adapter method that makes fine-tuning cheap by training a tiny fraction of parameters, the per-node enabler placed in the distributed picture of Section 19.7.

📄 Paper

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L. "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314, 2023. arxiv.org/abs/2305.14314

The method that fine-tunes a quantized base model with low-rank adapters to fit large models on a single accelerator, extending the per-node enabler of Section 19.7.

📄 Paper

Alignment: SFT, RLHF, and DPO

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)." arXiv:2203.02155, 2022. arxiv.org/abs/2203.02155

The paper that established the supervised-tuning-plus-RLHF alignment recipe, the systems backbone for the alignment view of Section 19.8.

📄 Paper

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., Finn, C. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290, 2023. arxiv.org/abs/2305.18290

The method that aligns a model directly from preference pairs without a separate reward model or RL loop, a simpler alternative weighed in Section 19.8.

📄 Paper

Energy, Cost, and Responsible Scaling

Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., Dean, J. "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350, 2021. arxiv.org/abs/2104.10350

The study that quantifies the energy and carbon cost of large training runs and the factors that move them, the foundation for the responsible-scaling accounting of Section 19.9.

📄 Paper