Part IV: Parallel Deep Learning and Large Models
Chapter 19: Training Foundation Models at Scale

Scaling Laws

"They asked how big I should be. I asked how much compute they had. We were, it turned out, asking the same question."

A Model Sizing Itself to a Budget
Big Picture

Scaling laws are the empirical bridge from a compute budget to a model design: test loss falls as a smooth power law in parameters, data, and compute, so a few small runs predict the loss of a run hundreds of times larger, and one simple identity, $C \approx 6ND$, tells you how to split a fixed budget between a bigger model and more tokens. This is the section that turns "we have a thousand GPUs for a month" into concrete numbers: how many parameters $N$, how many training tokens $D$, and therefore whether the cluster needs to lean on model parallelism for a larger model or on data-parallel throughput for more tokens. Before anyone writes a training loop, scaling laws decide the shape of the job, and getting them wrong wastes the most expensive compute a team will ever buy.

The previous section framed a foundation-model run as a distributed system with many moving parts. This section answers the question that has to be settled before any of those parts are provisioned: given a fixed amount of compute, how large should the model be and on how many tokens should it train? For most of the field's history this was guesswork, and the guesses were biased toward large models trained for too few steps. Scaling laws replaced the guesswork with measurement. They are not a theory of why neural networks generalize; they are a robust empirical regularity, observed across many orders of magnitude, that the test loss of a transformer follows a power law in the resources you spend on it. That regularity is what makes a multi-million-dollar training run something you can plan rather than gamble on.

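[Figure 19.2.1 plot. Horizontal axis: log training compute $C$ (FLOPs). Vertical axis: log test loss $L$. Annotations: irreducible loss floor $E$; pilot runs lying on the fitted line $L = E + A\,C^{-\alpha}$; dotted extrapolation to the predicted loss of the big run.]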
Figure 19.2.1: The empirical scaling law on log-log axes. A handful of inexpensive pilot runs (dark points) fall along a straight line, the signature of a power law. With an irreducible floor $E$ the fitted form is $L = E + A\,C^{-\alpha}$, and it is the reducible part $L - E$ that is exactly linear on these axes. The fitted line, extended as a dotted extrapolation, predicts the test loss of a run far larger than any measured point (orange marker) before that run is ever launched.

1. Loss as a Power Law in Parameters, Data, and Compute Intermediate

The central empirical finding, established by Kaplan and co-authors in 2020 and refined since, is that the test loss of an autoregressive transformer is, to a good approximation, a power law in each of three resources: the number of non-embedding parameters $N$, the number of training tokens $D$, and the total training compute $C$. When the other two resources are plentiful enough not to be the bottleneck, each relationship has the same shape,

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}.$$

A power law is a straight line on log-log axes, which is exactly what Figure 19.2.1 shows, and that straight-line behavior is what makes extrapolation trustworthy: if five small runs spanning two orders of magnitude in compute lie on a line, the line has, empirically, kept going, and the loss of a run a hundred times larger is read off the fit rather than discovered by spending the compute. The exponents are small, around $0.05$ to $0.10$ in the 2020 fits, which means the loss improves slowly; a tenfold increase in compute buys a modest absolute drop in loss, and that diminishing return is the economic reality every scaling decision lives inside. A more honest functional form adds an irreducible floor, the entropy of the data itself, that no amount of compute removes,

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $E$ is the loss a perfect model would still incur. This is the form Hoffmann and co-authors fit in the 2022 Chinchilla study, and it is the one that yields a compute-optimal recipe once we connect $N$ and $D$ through a compute budget.
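To make the diminishing returns concrete, the short sketch below evaluates the floor-plus-power-law form and reports how much of a power-law term survives a tenfold increase in its resource. The constants are hypothetical placeholders, not fitted values from any study; they are chosen only so the numbers are easy to read.

```python
# Hypothetical constants for illustration only (not fitted to any real run).
E, A, ALPHA = 1.7, 400.0, 0.3
B, BETA = A * 20.0 ** 0.3, 0.3

def loss(n_params, n_tokens):
    """Floor-plus-power-law form L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def tenfold_shrink(exponent):
    """Fraction of a pure power-law term that remains after its resource grows 10x."""
    return 10.0 ** (-exponent)

# Small exponents mean a 10x larger resource removes only a modest slice of the term.
for exponent in (0.05, 0.10, 0.30):
    print(f"exponent {exponent:.2f}: 10x resource leaves "
          f"{tenfold_shrink(exponent):.3f} of the term")

# Growing N and D together lowers the loss, but never below the floor E.
for scale in (1.0, 10.0, 100.0):
    n_params, n_tokens = 1e9 * scale, 2e10 * scale
    print(f"N={n_params:.0e}, D={n_tokens:.0e} -> L={loss(n_params, n_tokens):.3f}")
```

With an exponent of $0.05$, a tenfold increase in a resource still leaves about 89 percent of the corresponding term, which is the slow improvement the text describes.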

Key Insight: Small Runs Buy a Map of the Large Run

The practical power of scaling laws is not the precise loss number; it is that a few cheap runs, fit to a power law, predict the behavior of a run you cannot yet afford to repeat. This converts the largest training jobs from one-shot gambles into measured extrapolations. The cost of the planning runs is a rounding error against the cost of the production run, and skipping them is how teams burn a cluster-month producing a model that a half-day of pilot runs would have shown to be the wrong size.

2. The Compute Identity: Why $C \approx 6ND$ Intermediate

To turn the loss surface into a budget plan we need to relate the three resources, and one identity does it. The compute to train a dense transformer is, to excellent approximation,

$$C \approx 6 N D,$$

where $C$ is total floating-point operations, $N$ is the parameter count, and $D$ is the number of training tokens. The factor of six is not a fudge: each parameter participates in roughly two FLOPs (a multiply and an add) per token in the forward pass, and the backward pass costs about twice the forward pass, giving roughly $2 + 4 = 6$ FLOPs per parameter per token. Multiplying by $N$ parameters and $D$ tokens gives $C \approx 6ND$. The rule ignores attention's quadratic term, which is small relative to the feed-forward cost at the sequence lengths used in practice, and it is accurate enough that every capacity-planning spreadsheet in the field is built on it.
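The accounting behind the factor of six can be written out in a few lines. This is a minimal sketch of the rule as stated above; the function name `training_flops` is a hypothetical helper, and the example uses the 70-billion-parameter, 1.4-trillion-token configuration discussed later in this section.

```python
def training_flops(n_params, n_tokens):
    """Approximate dense-transformer training compute, C ~ 6 * N * D."""
    forward = 2.0 * n_params * n_tokens   # one multiply and one add per parameter per token
    backward = 2.0 * forward              # backward pass costs about twice the forward pass
    return forward + backward

# Example: a 70-billion-parameter model trained on 1.4 trillion tokens.
budget = training_flops(70e9, 1.4e12)
print(f"C = {budget:.2e} FLOPs")   # C = 5.88e+23 FLOPs
```

The estimate leaves out attention's quadratic term, as noted above, so it should be read as a planning figure rather than a profiler measurement.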

This identity is the lever for distribution. Once you know how many FLOPs your cluster can deliver (the number of GPUs times their realized FLOPs per second times the wall-clock you can hold them), you know $C$. Then $C \approx 6ND$ is one equation in two unknowns, $N$ and $D$, and the scaling law supplies the second equation by telling you which $(N, D)$ pair on the budget curve minimizes the loss. The answer dictates the parallelism strategy directly: a large $N$ pushes you toward model and pipeline parallelism (Chapter 16) because the model no longer fits on one device, while a large $D$ pushes you toward higher data-parallel throughput (Chapter 15) so the tokens stream through fast enough to finish on time.
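The "one equation in two unknowns" point is easy to see by walking along a single budget curve. The sketch below assumes a hypothetical budget of $10^{22}$ FLOPs and lists several $(N, D)$ pairs that all spend exactly that budget; nothing in it says which pair is best, which is the gap the scaling law fills.

```python
BUDGET = 1e22  # hypothetical compute budget C, in FLOPs

def tokens_for(n_params, budget=BUDGET):
    """Tokens affordable at a given model size on the curve C = 6 * N * D."""
    return budget / (6.0 * n_params)

# Every row spends the same compute; only the split between N and D changes.
for n_params in (1e9, 3e9, 1e10, 3e10, 1e11):
    n_tokens = tokens_for(n_params)
    print(f"N={n_params:.0e}  D={n_tokens:.2e}  tokens/param={n_tokens / n_params:8.1f}")
```

Small models at the top of the table see thousands of tokens per parameter, while the largest sees under twenty, so the same budget describes very different training jobs.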

3. Compute-Optimal Training: The Chinchilla Correction Advanced

For a fixed compute budget $C$, substitute $D = C / (6N)$ into the loss form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ and minimize over $N$. The minimization yields an optimal model size and token count that both grow as power laws in the budget, $N_{\text{opt}} \propto C^{a}$ and $D_{\text{opt}} \propto C^{b}$ with $a$ and $b$ each near one half. The headline empirical result of the Chinchilla study is that, with the fitted exponents, model size and data should scale in roughly equal proportion, which corresponds to a near-constant token-to-parameter ratio of about twenty tokens per parameter at the compute-optimal point,

$$\frac{D_{\text{opt}}}{N_{\text{opt}}} \approx 20.$$
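The power-law form of the optimum follows from a short calculation, sketched here under the assumption that the budget constraint $C = 6ND$ holds exactly. Substituting $D = C/(6N)$ into the loss and setting the derivative with respect to $N$ to zero gives

$$\frac{\partial L}{\partial N} = -\alpha A\, N^{-\alpha - 1} + \beta B \left(\frac{C}{6}\right)^{-\beta} N^{\beta - 1} = 0 \quad\Longrightarrow\quad N_{\text{opt}} = G \left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}, \qquad D_{\text{opt}} = \frac{1}{G} \left(\frac{C}{6}\right)^{\alpha/(\alpha+\beta)}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}.$$

So $a = \beta/(\alpha+\beta)$ and $b = \alpha/(\alpha+\beta)$, which always sum to one. When $\alpha$ and $\beta$ are close, both exponents sit near one half, and the ratio $D_{\text{opt}}/N_{\text{opt}} = G^{-2}\,(C/6)^{(\alpha-\beta)/(\alpha+\beta)}$ varies only slowly with the budget; it is exactly constant when $\alpha = \beta$. That is the equal-proportion scaling behind the twenty-to-one ratio.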

This corrected a costly bias. The earlier 2020 analysis had recommended spending most of a larger budget on more parameters and comparatively few extra tokens, which produced large, badly under-trained models. Chinchilla showed that a 70-billion-parameter model trained on 1.4 trillion tokens beat the 175-billion and 280-billion-parameter models of its day, because those larger models had been starved of data. The lesson for a planner is blunt: at a fixed budget, a smaller model trained on more tokens often wins, and the twenty-tokens-per-parameter heuristic is the first sanity check on any proposed run. The code below fits a power-law loss curve to a few pilot points, extrapolates to a hundredfold-larger budget, and then applies the $6ND$ rule with the twenty-to-one ratio to size a model for several budgets.

import numpy as np

# A few (compute C in FLOPs, measured test loss) points from small pilot runs.
# Loss is modeled as L(C) = E + A * C**(-alpha): an irreducible floor E plus a
# power-law term that shrinks as compute grows.
C = np.array([1e17, 3e17, 1e18, 3e18, 1e19])          # training compute (FLOPs)
L = np.array([3.21, 2.86, 2.55, 2.31, 2.12])          # observed test loss (nats)

# Fit E, A, alpha: grid over the floor E, then a linear fit in log space for
# (A, alpha), since log(L - E) = log(A) - alpha * log(C) is linear.
best = None
for E in np.linspace(0.0, min(L) - 0.01, 400):
    y, x = np.log(L - E), np.log(C)
    A_mat = np.vstack([np.ones_like(x), -x]).T
    (logA, alpha), *_ = np.linalg.lstsq(A_mat, y, rcond=None)
    sse = np.sum((E + np.exp(logA) * C ** (-alpha) - L) ** 2)
    if best is None or sse < best[0]:
        best = (sse, E, np.exp(logA), alpha)

sse, E, A, alpha = best
print("fitted power law  : L(C) = %.3f + %.3f * C^(-%.4f)" % (E, A, alpha))
print("fit SSE           : %.2e" % sse)

C_target = 1e21                                        # 100x the biggest pilot run
print("extrapolated loss : L(%.0e FLOPs) = %.3f"
      % (C_target, E + A * C_target ** (-alpha)))

# Chinchilla split: C = 6 * N * D with D = 20 * N gives N = sqrt(C / (6*20)).
def chinchilla_split(C_budget, ratio=20.0):
    N = np.sqrt(C_budget / (6.0 * ratio))
    return N, ratio * N

for C_budget in [1e21, 1e22, 1e23, 1e24]:
    N, D = chinchilla_split(C_budget)
    print("budget %.0e FLOPs -> N=%.2e params, D=%.2e tokens, 6ND=%.0e"
          % (C_budget, N, D, 6.0 * N * D))
Code 19.2.1: A from-scratch scaling-law planner. The first block fits $L(C) = E + A\,C^{-\alpha}$ to five pilot points and extrapolates a hundredfold; the second block applies $C \approx 6ND$ with a twenty-to-one token-to-parameter ratio to size a compute-optimal model for four budgets. No deep-learning framework is used; the whole calculation is a least-squares fit and a square root.
fitted power law  : L(C) = 1.333 + 3152.716 * C^(-0.1897)
fit SSE           : 9.42e-05
extrapolated loss : L(1e+21 FLOPs) = 1.660
budget 1e+21 FLOPs -> N=2.89e+09 params, D=5.77e+10 tokens, 6ND=1e+21
budget 1e+22 FLOPs -> N=9.13e+09 params, D=1.83e+11 tokens, 6ND=1e+22
budget 1e+23 FLOPs -> N=2.89e+10 params, D=5.77e+11 tokens, 6ND=1e+23
budget 1e+24 FLOPs -> N=9.13e+10 params, D=1.83e+12 tokens, 6ND=1e+24
Output 19.2.1: The fit recovers a clean power law over the five pilot points (sum of squared errors near $10^{-4}$) and predicts a test loss of $1.66$ at a budget a hundred times the largest measured run. The budget table reads off compute-optimal sizes: a $10^{21}$-FLOP budget wants a 2.9-billion-parameter model on 58 billion tokens, and every row satisfies $6ND = C$ exactly by construction.

The fitted exponent here, near $0.19$, is steeper than the small exponents seen on real corpora because the synthetic pilot points were chosen to make the straight line easy to read; on real data the curve is flatter and the planning runs must span more compute to pin it down. The mechanics are identical, and the table is the deliverable a planning meeting actually wants: a defensible model size for each budget the team might be granted.

Thesis Thread: The Budget Chooses the Distribution Axis

Scaling laws are where the book's central question, how to spread the work across machines, gets its numerical answer for foundation models. The $6ND$ identity converts a cluster (a count of GPUs and a wall-clock) into a budget $C$, and the compute-optimal split converts $C$ into a target $N$ and $D$. A large $N$ is a vote for model and pipeline parallelism (Chapter 16); a large $D$ is a vote for data-parallel throughput (Chapter 15). The same compute, allocated two different ways, produces two different distributed systems, and the scaling law is what tells you which one to build.

Practical Example: Right-Sizing a Model to a Granted Cluster

Who: A research lead at a startup awarded a fixed allocation on a shared GPU cluster.

Situation: The grant was 512 GPUs for three weeks, a compute budget the team estimated at roughly $3 \times 10^{22}$ FLOPs of useful training after overheads.

Problem: The founding instinct, shaped by headline parameter counts, was to train as large a model as would fit, around 30 billion parameters, on whatever tokens fit in the time.

Dilemma: Train the big 30-billion model and accept that, at $3 \times 10^{22}$ FLOPs, it would see only about 170 billion tokens, under six tokens per parameter and well short of the compute-optimal ratio, or train a smaller model the scaling law endorsed and let it see far more data.

Decision: They ran six pilot jobs spanning two orders of magnitude in compute, fit the Chinchilla loss form, and let the fit choose the size, exactly the procedure in Code 19.2.1.

How: The $6ND$ rule on a $3 \times 10^{22}$-FLOP budget pointed at roughly a 16-billion-parameter model on about 320 billion tokens, not the 30-billion model; they sized the data-parallel mesh for that token throughput rather than the model-parallel mesh for the larger model.

Result: The compute-optimal 16-billion model reached a lower validation loss than a 30-billion model trained under the same budget in a control run, and finished inside the three-week window with room to spare.

Lesson: The largest model that fits is almost never the best model the budget can buy. Let the scaling law, not the parameter-count headline, choose $N$, and then build the cluster mesh the chosen $(N, D)$ implies.
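The sizing in this example can be checked by hand from the $6ND$ rule and the twenty-to-one ratio, using the team's estimated budget of $3 \times 10^{22}$ FLOPs (figures rounded):

$$N_{\text{opt}} = \sqrt{\frac{C}{6 \times 20}} = \sqrt{\frac{3 \times 10^{22}}{120}} \approx 1.6 \times 10^{10}, \qquad D_{\text{opt}} = 20\, N_{\text{opt}} \approx 3.2 \times 10^{11}.$$

The same budget spent on the 30-billion-parameter alternative buys far fewer tokens:

$$D = \frac{C}{6N} = \frac{3 \times 10^{22}}{6 \times 3 \times 10^{10}} \approx 1.7 \times 10^{11}, \qquad \frac{D}{N} \approx 5.6.$$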

4. The Inference-Aware Correction: Train Smaller, Serve Cheaper Advanced

Chinchilla minimizes training loss for a fixed training budget, but a model that will be served to millions of users incurs a second, often larger, cost: inference. Every query runs the forward pass, and the forward cost scales with $N$, so a smaller model is cheaper to serve forever. This reframes the optimization. If you will serve the model heavily, it pays to push past the twenty-to-one ratio and train a smaller model on far more tokens than compute-optimality alone would suggest, accepting a higher training cost in exchange for a permanently lower inference cost. This is the logic behind the LLaMA family and many production models, which are deliberately "over-trained" relative to Chinchilla: a 7-billion-parameter model trained on one to two trillion tokens sits well past the compute-optimal ratio, but it is small enough to serve cheaply at scale, and the serving savings dwarf the extra training compute over the model's deployed life.

Formally, the right objective fixes a target loss and minimizes the total lifetime compute needed to reach it and then serve it, training plus inference, $C_{\text{train}} + C_{\text{infer}}$, where $C_{\text{infer}} \approx 2 N D_{\text{infer}}$ (the forward pass only) and $D_{\text{infer}}$ is the total tokens the model will ever generate in service. Without the fixed target loss the problem is ill-posed, since the cheapest model to train and serve is always the smallest one. When $D_{\text{infer}}$ is large, the optimum shifts toward smaller $N$ trained on more tokens, and the shift can be dramatic for a model with billions of expected queries. The connection to distribution is direct: a smaller served model needs less model parallelism per replica and fits more replicas per node, which is exactly the per-replica economics that Chapter 24 multiplies across a serving fleet, building on the per-node cost model of Chapter 22.
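The direction of that shift can be illustrated with a small grid search. The sketch below is a toy model, not a reproduction of any published fit: the loss-law constants are hypothetical and are chosen so that the training-only optimum lands at twenty tokens per parameter, and the target loss of $2.3$ is likewise an assumption. For each candidate $N$ it computes the tokens needed to hit the target loss, then picks the $N$ with the lowest training-plus-inference compute.

```python
import numpy as np

# Hypothetical loss-law constants (not fitted to any real run). alpha == beta and
# B = A * 20**alpha put the training-only optimum at 20 tokens per parameter.
E, A, ALPHA = 1.7, 400.0, 0.3
B, BETA = A * 20.0 ** ALPHA, 0.3

def lifetime_optimum(target_loss, d_infer, n_grid=4000):
    """Grid-search the N minimizing 6*N*D + 2*N*d_infer at a fixed target loss."""
    # Smallest N that could reach the target even with unlimited data.
    n_min = (A / (target_loss - E)) ** (1.0 / ALPHA)
    n_params = np.logspace(np.log10(1.001 * n_min), np.log10(1e3 * n_min), n_grid)
    # Tokens required so that E + A/N**alpha + B/D**beta equals the target loss.
    gap = target_loss - E - A * n_params ** (-ALPHA)
    n_tokens = (B / gap) ** (1.0 / BETA)
    total = 6.0 * n_params * n_tokens + 2.0 * n_params * d_infer
    best = int(np.argmin(total))
    return n_params[best], n_tokens[best], total[best]

for d_infer in (0.0, 1e12, 1e13, 1e14):
    n_opt, d_opt, total = lifetime_optimum(target_loss=2.3, d_infer=d_infer)
    print(f"D_infer={d_infer:.0e}: N={n_opt:.2e}, D_train={d_opt:.2e}, "
          f"tokens/param={d_opt / n_opt:.0f}, lifetime FLOPs={total:.2e}")
```

With no inference load the search recovers the twenty-to-one ratio built into the constants; as the serving volume grows, the chosen model shrinks and its training token count rises, which is the over-training pattern described above.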

Research Frontier: Data-Constrained, Inference-Optimal, and Distillation Scaling (2024 to 2026)

Three active lines are reshaping the Chinchilla picture. First, data-constrained scaling (Muennighoff et al., 2023, and follow-ups through 2025) asks what happens when you run out of unique tokens and must repeat data: the laws still hold for a handful of epochs, with returns decaying as repetition grows, which matters now that frontier runs are bumping against the supply of high-quality text. Second, inference-aware and inference-optimal scaling laws (Sardana et al., 2024) make the train-smaller-serve-cheaper argument quantitative, fitting the total-lifetime-compute optimum directly and confirming that heavily served models should be trained well past the compute-optimal ratio. Third, distillation scaling laws (Busbridge et al., 2025) predict the loss of a student model as a function of its size, the teacher's size, and the distillation compute, turning "should we distill or train from scratch?" into a power-law comparison rather than a hunch. Each line keeps the same methodology, fit a smooth law on small runs and extrapolate, while moving the objective closer to what a production system actually pays for.

Library Shortcut: Fitting the Law With a Solver Instead of a Grid

Code 19.2.1 fit the loss floor with a hand-rolled grid search, which is transparent but crude. In practice you hand the same three-parameter law, $L(C) = E + A\,C^{-\alpha}$, to a nonlinear least-squares routine and let it fit all parameters jointly, including the floor, in a single call:

import numpy as np
from scipy.optimize import curve_fit

def loss_law(C, E, A, alpha):                 # L(C) = E + A * C**(-alpha)
    return E + A * np.power(C, -alpha)

C = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
L = np.array([3.21, 2.86, 2.55, 2.31, 2.12])

# curve_fit jointly estimates E, A, alpha. For robustness to noisy large runs, pass
# method="trf" with loss="huber"; a log-domain fit means fitting log(L) instead.
(E, A, alpha), _ = curve_fit(loss_law, C, L, p0=[1.0, 1e3, 0.1], maxfev=10000)
print(f"E={E:.3f}  A={A:.1f}  alpha={alpha:.4f}")
Code 19.2.2: The same fit as Code 19.2.1 in one curve_fit call. The roughly twenty lines of grid-plus-linear-regression collapse to a single nonlinear solve, and SciPy handles the parameter coupling, convergence, and (with method="trf" and a loss keyword) robustness to outlier runs that the manual grid did not.
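The robust option mentioned for Code 19.2.2 can be sketched as follows. Two choices here are assumptions rather than part of the original listing: compute is re-expressed in decades above the smallest pilot run so the three parameters have comparable scales, and the Huber scale `f_scale=0.05` is a placeholder that would need tuning to the noise level of real runs. `curve_fit` forwards `loss` and `f_scale` to `scipy.optimize.least_squares`, which requires a method other than the default Levenberg-Marquardt solver.

```python
import numpy as np
from scipy.optimize import curve_fit

C = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # same pilot points as Code 19.2.1
L = np.array([3.21, 2.86, 2.55, 2.31, 2.12])
x = np.log10(C / C[0])                         # decades of compute above the smallest run

def loss_law_decades(x, E, A0, alpha):
    """L = E + A0 * 10**(-alpha * x), where A0 is the reducible loss at C[0]."""
    return E + A0 * 10.0 ** (-alpha * x)

params, _ = curve_fit(
    loss_law_decades, x, L,
    p0=[1.0, 2.0, 0.2],
    bounds=([0.0, 0.0, 0.0], [2.0, 10.0, 1.0]),  # keep the floor and exponent physical
    method="trf", loss="huber", f_scale=0.05,    # robust loss needs a non-"lm" method
)
E, A0, alpha = params
A = A0 * C[0] ** alpha                           # convert back to L = E + A * C**(-alpha)
residuals = loss_law_decades(x, *params) - L
print(f"E={E:.3f}  A={A:.1f}  alpha={alpha:.4f}  max|residual|={np.abs(residuals).max():.4f}")
```

On clean pilot points like these the Huber loss behaves like ordinary least squares; its value shows up when one or two runs are off the trend because of an instability or a logging error.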

5. Using Scaling Laws to Plan a Distributed Run Intermediate

Putting the pieces together gives a planning recipe that runs before any production training starts. Estimate the cluster's sustained throughput in FLOPs per second, multiply by the wall-clock you can hold the allocation, and subtract realistic overheads to get the budget $C$. Run a handful of pilot jobs spanning two orders of magnitude in compute, fit the loss law, and read off the predicted loss at $C$ to know whether the result will be worth the spend at all. Apply $C \approx 6ND$ with the compute-optimal ratio, adjusted downward in $N$ if the model will be served heavily, to fix the target $(N, D)$. Only then do you choose the parallelism mesh: the chosen $N$ sets how much model and pipeline parallelism you need to fit the model (Chapter 16), and the chosen $D$ sets the data-parallel width needed to stream the tokens in time (Chapter 15), with the communication-cost models of Chapter 3 telling you when adding workers stops helping.
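The arithmetic part of that recipe fits in one small function. The sketch below is a planning aid under stated assumptions, not a sizing tool: the cluster figures in the example call are hypothetical, `bytes_per_param=16` assumes mixed-precision Adam (half-precision weights and gradients plus single-precision master weights and two optimizer moments) and ignores activations, and `gpu_mem_bytes` assumes 80 GB devices. The pilot-run fit is left out because Code 19.2.1 already covers it.

```python
import math

def plan_run(n_gpus, peak_flops, utilization, days,
             ratio=20.0, bytes_per_param=16.0, gpu_mem_bytes=80e9):
    """Turn a cluster allocation into a budget, a target (N, D), and two mesh hints."""
    seconds = days * 86_400.0
    budget = n_gpus * peak_flops * utilization * seconds   # C, useful training FLOPs
    n_params = math.sqrt(budget / (6.0 * ratio))           # from C = 6*N*D with D = ratio*N
    n_tokens = ratio * n_params
    state_bytes = bytes_per_param * n_params               # weights, gradients, optimizer state
    return {
        "C": budget,
        "N": n_params,
        "D": n_tokens,
        # Lower bound on devices the model state must be sharded across.
        "min_devices_for_state": math.ceil(state_bytes / gpu_mem_bytes),
        # Aggregate token throughput the data pipeline must sustain to finish on time.
        "tokens_per_second": n_tokens / seconds,
    }

# Hypothetical allocation: about a thousand GPUs for a month at 40 percent utilization.
plan = plan_run(n_gpus=1024, peak_flops=3e14, utilization=0.4, days=30)
for key, value in plan.items():
    print(f"{key:>22}: {value:.3g}")
```

The two hints map onto the two parallelism axes: `min_devices_for_state` is a floor on the model-sharding degree, and `tokens_per_second` is the throughput target the data-parallel width has to meet. For a heavily served model, passing a larger `ratio` shifts the plan toward a smaller $N$ and more tokens.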

Fun Note: The Twenty-to-One Rule of Thumb

If you remember one number from this section, make it twenty: roughly twenty training tokens per parameter at the compute-optimal point. It is the back-of-the-envelope check that catches an under-trained giant before it eats a cluster. A 7-billion-parameter model wants about 140 billion tokens to be compute-optimal, and if it is going to be served a lot, you cheerfully feed it ten times that and call the extra training compute a down payment on cheap inference.

None of this guarantees the production run succeeds; scaling laws describe the loss you reach under stable, well-tuned training, and the chapters ahead are about making that training stable and well-tuned across thousands of machines. What the laws do guarantee is that you enter the run with a defensible target rather than a hopeful guess, which is the difference between planning a foundation-model run and gambling on one. The next section turns to the first thing the chosen budget demands: assembling and preparing the trillions of tokens of $D$, a distributed-data problem in its own right, taken up in Section 19.3.

Exercise 19.2.1: From Cluster to Model Size Analysis

You are granted 256 accelerators, each with a peak throughput of $4 \times 10^{14}$ FLOPs per second, for two weeks of wall-clock, at an effective utilization of 40 percent after data loading and checkpoint overhead. Compute the training budget $C$ in FLOPs. Using $C \approx 6ND$ and the compute-optimal ratio of about twenty tokens per parameter, find the compute-optimal $N$ and $D$. Then state whether this run is more likely to be bottlenecked by fitting the model (favoring model parallelism) or by streaming the tokens (favoring data-parallel width), and justify the call from the size of $N$ relative to a single accelerator's memory.

Exercise 19.2.2: Refit and Extrapolate Coding

Take Code 19.2.1 and replace the synthetic pilot points with a noisier set: add Gaussian noise of standard deviation $0.03$ to the five loss values, then refit. Repeat the noisy fit fifty times and report the mean and standard deviation of the extrapolated loss at $C = 10^{21}$. How much does measurement noise on small runs widen the uncertainty of the large-run prediction, and what does that imply about how many pilot runs and how wide a compute range you need before trusting an extrapolation?

Exercise 19.2.3: When Inference Dominates Conceptual

A model will be served an expected $10^{14}$ generated tokens over its deployed life, with inference cost $C_{\text{infer}} \approx 2 N D_{\text{infer}}$. Sketch, in words and a rough inequality, why the total-lifetime-compute optimum $C_{\text{train}} + C_{\text{infer}}$ pushes the chosen $N$ below the Chinchilla compute-optimal value, and explain how a smaller served $N$ changes the per-replica parallelism strategy that Chapter 24 deploys across a serving fleet. You do not need to solve the optimization exactly; argue the direction of the shift and its systems consequence.