Part IV: Parallel Deep Learning and Large Models
Chapter 19: Training Foundation Models at Scale

Energy, Cost, and Responsible Scaling

"They asked me to scale the model. Nobody asked who pays the power bill, who owns the data, or who answers when it goes wrong. So I learned to ask all three before I allocate a single node."

A Scheduler That Reads the Electricity Meter
Big Picture

A frontier pretraining run is not just an algorithm; it is a multi-million-dollar, megawatt-scale industrial process, and the bill, the carbon footprint, and the duty of care are design constraints that bind as hard as memory or bandwidth. Every choice in the previous eight sections (the data you keep, the precision you train in, the parallelism you pick, the moment you stop) lands as dollars on an invoice, kilowatt-hours on a utility meter, and kilograms of carbon in the atmosphere. This closing section makes those three currencies explicit, shows that the same efficiency levers cut all three at once, and names the responsibilities that come with operating at this scale. It then folds the whole chapter together: a foundation model is the entire book composed into one run, and that run is bounded, top to bottom, by cost and energy.

The earlier sections of this chapter treated scale as a systems problem: fit the model across devices, keep the workers fed, survive the failures, stop at the compute-optimal point. Underneath every one of those decisions sits a meter that never stops running. A modern frontier pretraining run consumes tens of millions of GPU-hours, costs on the order of tens to hundreds of millions of dollars, and draws power at the megawatt scale for weeks or months. At that magnitude, cost and energy stop being something you tally afterward and become parameters you design against from the first planning meeting, exactly the cost-aware posture introduced in Chapter 3 and made measurable in the cost and energy accounting of Chapter 5. This section gives you the arithmetic to put a number on a run before you launch it, and the vocabulary to defend that number to the people who pay for it and live with it.

[Figure 19.9.1 diagram. Left: efficiency levers (higher MFU; FP8 / low precision; compute-optimal sizing; do not over-train). Center: training run (GPU-hours, power, PUE). Right: three meters, Cost in dollars (rental / amortized), Energy with kWh = GPU-hours x kW x PUE, and Carbon with kgCO2e = kWh x grid intensity. Bottom: carbon-aware placement, where a cleaner grid cuts carbon alone, without changing cost or energy.]
Figure 19.9.1: One run, three meters. The GPU-hours of a training run flow into dollars, energy, and carbon. The four efficiency levers on the left (higher model FLOPs utilization, FP8, compute-optimal sizing, and not over-training) shrink the GPU-hours themselves, so they cut all three meters together. A carbon-aware choice of a cleaner grid (bottom) reduces only the carbon meter, leaving the dollar and energy figures unchanged.

1. The Bill: Cost as a First-Class Constraint (Beginner)

Start with the simplest of the three meters, because it is the one every stakeholder already understands. The dollar cost of a training run is, to first order, the price of the compute it consumes:

$$\text{Cost} \approx G \cdot p, \qquad G = \frac{C_{\text{useful}}}{F_{\text{peak}} \cdot \text{MFU}},$$

where $G$ is the number of GPU-hours, $p$ is the price per GPU-hour (a rental rate, or the amortized cost of owned hardware plus power), $C_{\text{useful}}$ is the useful compute the science requires in FLOPs, $F_{\text{peak}}$ is the peak throughput of one accelerator (in FLOPs per hour so that $G$ comes out in GPU-hours; with the usual FLOP/s figure, divide the result by 3600), and MFU is the model FLOPs utilization from the efficiency discussion earlier in this chapter. The second equation is the load-bearing one: the science fixes $C_{\text{useful}}$, but the GPU-hours you actually pay for are that number divided by how much of each accelerator's peak you sustain. Doubling MFU halves the bill for the identical model. This is why the utilization fights of the previous sections were never merely about speed; every one percent relative gain in MFU is roughly one percent off an eight-figure invoice.
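As a rough sketch of the second equation, the snippet below turns a useful-compute budget into GPU-hours and dollars at two utilization levels. The function name `gpu_hours` and every number in it are illustrative assumptions, not figures from a real run; the explicit division by 3600 is there because peak throughput is usually quoted per second.

```python
def gpu_hours(useful_flops: float, peak_flops_per_s: float, mfu: float) -> float:
    """GPU-hours G = C_useful / (F_peak * MFU), with F_peak given in FLOP/s."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("MFU must be in (0, 1]")
    gpu_seconds = useful_flops / (peak_flops_per_s * mfu)
    return gpu_seconds / 3600.0   # seconds -> hours


# Illustrative assumptions only.
C_USEFUL = 1.0e24   # useful FLOPs fixed by the science
F_PEAK = 1.0e15     # 1 PFLOP/s peak per accelerator
PRICE = 2.00        # USD per GPU-hour

cost_at_30 = gpu_hours(C_USEFUL, F_PEAK, 0.30) * PRICE
cost_at_60 = gpu_hours(C_USEFUL, F_PEAK, 0.60) * PRICE
print(f"MFU 30%: ${cost_at_30:,.0f}")
print(f"MFU 60%: ${cost_at_60:,.0f}")
```

The two printed costs differ by exactly a factor of two, which is the "doubling MFU halves the bill" statement in executable form.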

Key Insight: Cost and Energy Are Set by Useful-Compute Divided by Efficiency

The science of a model fixes the useful FLOPs it needs. Everything you pay, in dollars, in kilowatt-hours, and in carbon, is that fixed numerator divided by your efficiency: model FLOPs utilization, arithmetic precision, and how close your token budget sits to compute-optimal. You cannot cheat the numerator without changing the model, but the denominator is entirely an engineering choice, which is why a single run can vary by more than a factor of two in all three currencies with identical final quality.

Treating cost as a design constraint changes decisions you might otherwise make on instinct. It argues against renting the largest accelerator when a cheaper one at higher utilization is more dollar-efficient; it argues for spot and preemptible capacity wherever the elastic, fault-tolerant training of Chapter 18 lets you survive the interruptions; and it argues, most of all, for not buying compute the model cannot turn into quality, the over-training trap we return to in Section 4.
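One way to make the "cheaper accelerator at higher utilization" and "spot capacity" arguments concrete is to compare offers by dollars per unit of sustained useful throughput, not by sticker price. The sketch below does that under loudly hypothetical assumptions: the `Offer` class, the three offers, their prices, MFU values, and the 10 percent of paid hours lost to preemption are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    name: str
    peak_pflops: float            # peak PFLOP/s per accelerator
    mfu: float                    # sustained fraction of peak
    price_per_hour: float         # USD per GPU-hour
    wasted_fraction: float = 0.0  # share of paid hours lost to preemption/restarts

    def usd_per_useful_pflop_hour(self) -> float:
        """Dollars paid for one hour of one sustained, useful PFLOP/s."""
        useful_pflops = self.peak_pflops * self.mfu * (1.0 - self.wasted_fraction)
        return self.price_per_hour / useful_pflops


# Hypothetical offers; none of these numbers describe a real product or price list.
offers = [
    Offer("big on-demand", peak_pflops=2.0, mfu=0.30, price_per_hour=6.00),
    Offer("small on-demand", peak_pflops=1.0, mfu=0.50, price_per_hour=2.50),
    Offer("small spot", peak_pflops=1.0, mfu=0.50, price_per_hour=1.25,
          wasted_fraction=0.10),
]

for o in sorted(offers, key=Offer.usd_per_useful_pflop_hour):
    print(f"{o.name:<16} ${o.usd_per_useful_pflop_hour():.2f} per sustained PFLOP/s-hour")
```

Under these made-up numbers the largest accelerator is the most expensive per unit of useful work, and the spot offer stays cheapest even after paying for the interruptions; with different utilization or preemption figures the ranking can flip, which is the reason to compute it.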

2. The Footprint: Energy and Carbon (Intermediate)

The second and third meters convert the same GPU-hours into physical quantities. Energy is the cleanest to reason about. If each accelerator draws an average power $P_{\text{gpu}}$ (in kilowatts) while training, the electricity delivered to the chips is $G \cdot P_{\text{gpu}}$ kilowatt-hours. But the chips are not the whole datacenter: cooling, power conversion, and distribution all draw current too. The industry captures that overhead in a single multiplier, the power usage effectiveness (PUE), defined as total facility energy divided by IT energy. A PUE of $1.1$ means the building spends ten percent on top of the compute; older facilities run $1.5$ or worse. The facility energy is therefore

$$E_{\text{kWh}} = G \cdot P_{\text{gpu}} \cdot \text{PUE}.$$

Carbon follows by multiplying energy by the carbon intensity of the electricity, $I$, measured in grams of CO$_2$-equivalent per kilowatt-hour. A coal-heavy grid sits near $700$ gCO$_2$e/kWh; a grid rich in nuclear, hydro, wind, and solar can fall below $50$. The emitted carbon is

$$M_{\text{CO}_2\text{e}} = \frac{E_{\text{kWh}} \cdot I}{1000} \ \text{kgCO}_2\text{e}.$$
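As a quick worked instance of the two formulas, assume (purely for illustration) a run of $G = 10^{6}$ GPU-hours at $P_{\text{gpu}} = 0.7$ kW, in a facility with $\text{PUE} = 1.12$, on a grid with $I = 369$ gCO$_2$e/kWh:

$$E_{\text{kWh}} = 10^{6} \times 0.7 \times 1.12 = 7.84 \times 10^{5}\ \text{kWh}, \qquad M_{\text{CO}_2\text{e}} = \frac{7.84 \times 10^{5} \times 369}{1000} \approx 2.89 \times 10^{5}\ \text{kgCO}_2\text{e}.$$

Of the 784,000 kWh, 700,000 kWh reach the accelerators and 84,000 kWh are facility overhead, and the total comes to roughly 289 tonnes of CO$_2$e.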

Two of these factors, PUE and $I$, are properties of where and when you train, not of your model. That observation is the seed of carbon-aware scheduling: the same job placed in a region with a cleaner grid, or deferred to an hour when wind and solar are abundant, emits dramatically less carbon for an unchanged dollar and energy cost. We treat the joules-per-token of a run as a quantity to be engineered down, and the carbon as a quantity to be both engineered down and placed wisely. The responsible-scaling and environmental-accounting practices that formalize this carbon-aware placement are developed in Chapter 35.
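The time-shifting half of that idea can be sketched in a few lines: given an hourly carbon-intensity forecast, pick the start hour whose window has the lowest average intensity. This is a minimal illustration, not a production scheduler; the function name `best_start_hour`, the 12-hour forecast, the 1,024-GPU job size, and the 4-hour duration are all hypothetical, and a frontier run lasting weeks would need a far longer trace and a checkpoint-aware policy.

```python
from typing import Sequence


def best_start_hour(intensity_g_per_kwh: Sequence[float], job_hours: int) -> tuple[int, float]:
    """Return (start index, mean gCO2e/kWh) of the cleanest contiguous window."""
    n = len(intensity_g_per_kwh)
    if job_hours <= 0 or job_hours > n:
        raise ValueError("job_hours must be between 1 and the trace length")
    window = sum(intensity_g_per_kwh[:job_hours])
    best_start, best_sum = 0, window
    for start in range(1, n - job_hours + 1):
        # Slide the window by one hour: drop the oldest sample, add the newest.
        window += intensity_g_per_kwh[start + job_hours - 1] - intensity_g_per_kwh[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum / job_hours


# Hypothetical 12-hour forecast in gCO2e/kWh; not real grid data.
forecast = [420, 410, 390, 300, 210, 150, 140, 160, 250, 340, 400, 430]
facility_kw = 0.70 * 1.12 * 1024   # assumed: 1,024 GPUs at 0.7 kW each, PUE 1.12
job_hours = 4

start, mean_i = best_start_hour(forecast, job_hours)
now_i = sum(forecast[:job_hours]) / job_hours
energy_kwh = facility_kw * job_hours   # identical for both schedules

print(f"start now : {energy_kwh * now_i / 1000:.0f} kgCO2e")
print(f"start h+{start}: {energy_kwh * mean_i / 1000:.0f} kgCO2e for the same {energy_kwh:.0f} kWh")
```

The energy term is computed once and reused for both schedules on purpose: deferring the job changes only the intensity it is multiplied by.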

Library Shortcut: CodeCarbon Reads the Meters for You

The arithmetic above is worth doing by hand once, to understand it, but in practice you instrument the training loop rather than estimate from a spreadsheet. CodeCarbon samples GPU, CPU, and RAM power live, looks up the carbon intensity of your region's grid, and reports kWh and kgCO$_2$e per run with a few lines:

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="frontier-pretrain")  # auto-detects region + grid
tracker.start()
train(model, data, steps=...)          # your existing training loop, unchanged
emissions_kg = tracker.stop()          # kgCO2e, also logged to emissions.csv
print(f"this run emitted {emissions_kg:.1f} kgCO2e")
Code 19.9.1: CodeCarbon collapses the manual energy-and-carbon arithmetic into a two-line wrapper around any training loop, reading live power draw and regional grid intensity instead of the fixed averages we assume by hand. The companion ML CO2 Impact calculator does the same estimate from hardware, hours, and region for a run you have not launched yet.

3. The Levers: One Set of Knobs Cuts All Three (Intermediate)

The structure of the three formulas hides a gift. Dollars, energy, and carbon all scale with the GPU-hours $G$, and $G$ shrinks with efficiency. So a single set of levers, the ones this chapter has already taught for performance, doubles as the lever set for cost and footprint. The demo below makes this concrete: it fixes the useful compute of a frontier-scale run and asks what that identical run costs in all three currencies as we raise model FLOPs utilization and switch to FP8.

def estimate(gpu_hours, gpu_power_kw, pue, grid_gco2_per_kwh, price_per_gpu_hour):
    # Energy delivered to the GPUs, then scaled up by datacenter overhead (PUE).
    it_energy_kwh = gpu_hours * gpu_power_kw          # raw GPU electricity
    facility_kwh = it_energy_kwh * pue                # + cooling, power loss
    carbon_kg = facility_kwh * grid_gco2_per_kwh / 1000.0   # gCO2e -> kgCO2e
    dollars = gpu_hours * price_per_gpu_hour          # rental / amortized cost
    return dollars, facility_kwh, carbon_kg

# Total useful FLOPs are fixed; GPU-hours = useful_FLOPs / (peak_FLOPS * MFU).
useful_pflop_days = 4.5e5        # ~450k PFLOP-days of useful compute (fixed)
peak_pflops = 0.99              # H100 BF16 dense peak, PFLOP/s per GPU
gpu_power_kw = 0.70            # ~700 W per GPU under load
pue = 1.12                    # efficient modern datacenter
grid_gco2_per_kwh = 369.0     # average grid carbon intensity (gCO2e/kWh)
price_per_gpu_hour = 2.50     # USD per GPU-hour

def gpu_hours_for(mfu, fp8_speedup):
    eff_pflops = peak_pflops * mfu * fp8_speedup     # sustained PFLOP/s per GPU
    return useful_pflop_days * 86400.0 / eff_pflops / 3600.0

scenarios = [
    ("Baseline  BF16, MFU 35%", 0.35, 1.0),
    ("Tuned     BF16, MFU 50%", 0.50, 1.0),
    ("FP8       FP8,  MFU 50%", 0.50, 1.6),  # FP8 ~1.6x effective throughput
]
for name, mfu, fp8 in scenarios:
    gh = gpu_hours_for(mfu, fp8)
    d, kwh, ckg = estimate(gh, gpu_power_kw, pue, grid_gco2_per_kwh, price_per_gpu_hour)
    print(f"{name:<26}{gh:>12,.0f}{d:>13,.0f}{kwh:>13,.0f}{ckg/1000:>10.1f}")
Code 19.9.2: A pure-Python cost-energy-carbon estimator for a fixed-science training run. The useful compute is held constant across rows; only model FLOPs utilization and precision change, so the GPU-hours, dollars, energy, and carbon all move together. The full script also reports the combined speedup and the effect of a low-carbon grid.
scenario                     GPU-hours       cost $   energy kWh     tCO2e
------------------------------------------------------------------------------------
Baseline  BF16, MFU 35%     31,168,831   77,922,078   24,436,364    9017.0
Tuned     BF16, MFU 50%     21,818,182   54,545,455   17,105,455    6311.9
FP8       FP8,  MFU 50%     13,636,364   34,090,909   10,690,909    3944.9
------------------------------------------------------------------------------------
FP8 + tuned MFU vs baseline: cost x2.29, energy x2.29, carbon x2.29 lower
Same FP8 run on a low-carbon grid (30 gCO2e/kWh): carbon x12.3 lower again
Output 19.9.2: The same model, trained three ways. Lifting utilization from 35 to 50 percent and adopting FP8 low-precision training (Chapter 15) cuts the bill from roughly $78M to $34M, the energy from 24 to 11 gigawatt-hours, and the carbon from about 9,000 to 3,900 tonnes, a 2.29x reduction in every currency at identical final quality. Moving that FP8 run to a low-carbon grid cuts the carbon a further 12x without touching the dollar or energy figure, exactly the carbon-aware-placement effect of Figure 19.9.1.

The lesson is structural, not numerical. Because all three meters are driven by the same GPU-hours, the engineer who optimizes for utilization and precision is, without any extra work, also the engineer who minimizes the carbon footprint. The two goals that are sometimes posed as a trade-off, performance versus sustainability, are the same goal viewed through different units. Higher MFU (the utilization arithmetic of Section 19.6), FP8 and low precision (Chapter 15), compute-optimal sizing, and not over-training all shrink the GPU-hours $G$, and $G$ is shared by all three meters.

Fun Note: The Most Expensive Way to Save Money

There is a tempting fallacy that since cleaner grids cut carbon for free, you can ignore efficiency and just train somewhere green. The arithmetic disagrees. A clean grid divides only the carbon meter; the dollar meter and the energy meter do not care how the electrons were generated. An over-trained, low-utilization run on a hydro grid still burns the same eight-figure budget and the same gigawatt-hours. Efficiency is the only lever that pulls all three meters at once, which makes it the cheapest sustainability investment you will ever make, because you were going to make it for the budget anyway.
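The fallacy is easy to check numerically. The sketch below reuses the power, PUE, price, and grid figures of Code 19.9.2 and the GPU-hour counts of Output 19.9.2, and crosses two efficiency levels with two grids; the helper name `meters` and the plan labels are our own, introduced only for this illustration.

```python
def meters(gpu_hours, grid_gco2_per_kwh, gpu_power_kw=0.70, pue=1.12, price=2.50):
    """Return (dollars, facility kWh, tonnes CO2e) for a run."""
    kwh = gpu_hours * gpu_power_kw * pue
    return gpu_hours * price, kwh, kwh * grid_gco2_per_kwh / 1e6   # g -> tonnes


BASELINE_GH = 31_168_831   # BF16, MFU 35% row of Output 19.9.2
FP8_GH = 13_636_364        # FP8, MFU 50% row of Output 19.9.2

plans = {
    "inefficient, average grid": meters(BASELINE_GH, 369.0),
    "inefficient, clean grid":   meters(BASELINE_GH, 30.0),
    "efficient, average grid":   meters(FP8_GH, 369.0),
    "efficient, clean grid":     meters(FP8_GH, 30.0),
}
for name, (usd, kwh, tco2e) in plans.items():
    print(f"{name:<27}${usd / 1e6:6.1f}M {kwh / 1e6:6.1f} GWh {tco2e:8.1f} tCO2e")
```

Reading down the dollar and energy columns, the grid choice changes neither; only the efficiency rows move them.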

4. Compute-Optimal and the Cost of Over-Training (Intermediate)

The fourth lever is the subtlest and the one most often gotten wrong: spending compute the model cannot convert into quality. The compute-optimal framing developed earlier in this chapter answers a precise question: given a fixed compute budget $C$, how should you split it between model size $N$ and training tokens $D$, subject to $C \approx 6 N D$, to minimize loss? Train far past that point and the loss curve flattens while the meters keep climbing. Every additional token then buys diminishing quality at undiminished cost, energy, and carbon.
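To see what such a split looks like in numbers, the sketch below solves $C \approx 6ND$ under an assumed fixed ratio of tokens to parameters. The ratio of 20 tokens per parameter is the commonly cited rule of thumb from the compute-optimal scaling literature, used here as an assumption rather than a value taken from this chapter; the function name `compute_optimal_split` is likewise ours.

```python
import math


def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split C ~= 6*N*D into (N, D) under the assumed ratio D = r*N.

    Substituting D = r*N gives C = 6*r*N**2, so N = sqrt(C / (6*r)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


for c in (1e23, 1e24, 1e25):
    n, d = compute_optimal_split(c)
    print(f"C={c:.0e} FLOPs -> N~{n / 1e9:.0f}B params, D~{d / 1e12:.2f}T tokens")
```

Because both $N$ and $D$ grow as the square root of $C$ under a fixed ratio, a tenfold larger budget buys only about a 3.2x larger model; any token count well beyond the returned $D$ for a given $N$ is a departure from the compute-optimal point that the plan should justify.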

There is a legitimate reason to over-train relative to the pretraining-optimal point: a model that will serve billions of inference requests should be made smaller and trained longer, because a smaller model is cheaper to serve forever, and the extra training cost is amortized over the deployment lifetime. That is an inference-economics decision, made deliberately with the serving cost of Chapter 3 in view, not an accident. The failure mode this section warns against is the accidental over-train: running longer because the cluster was already booked, or because a flat loss curve was mistaken for one still descending. Naming the compute-optimal point, and justifying any departure from it in writing, is how a cost-aware team keeps a training plan honest.

Practical Example: The Run That Cost Less Than Half

Who: A research-infrastructure lead at a foundation-model lab planning a 70-billion-parameter pretraining run.

Situation: The original plan budgeted 30 million GPU-hours on rented H100 capacity, a roughly $75M line item, with a sustainability commitment to report and offset emissions.

Problem: A first cost-and-carbon estimate, built with the arithmetic of Code 19.9.2, projected about 9,000 tonnes CO$_2$e, large enough to attract board-level attention and a public disclosure obligation.

Dilemma: Push for a faster launch on the planned 35-percent-utilization configuration, or spend two weeks tuning the parallelism and adopting FP8 first, delaying the start but lowering every meter.

Decision: They spent the two weeks. Profiling lifted MFU to 50 percent, FP8 added a further 1.6x effective throughput, and the data team confirmed the token budget already sat at the compute-optimal point, so no over-training correction was needed.

How: They instrumented the loop with CodeCarbon (Code 19.9.1) to verify the projection against live power draw, and scheduled the run in a low-carbon region whose grid averaged under 60 gCO$_2$e/kWh.

Result: The same model trained in 13.6 million GPU-hours for about $34M, drawing 11 GWh, and the regional placement cut reported emissions by a further factor of six or more beyond the efficiency gain. The two-week delay paid for itself many times over.

Lesson: Estimate all three meters before you launch. The efficiency work you would do for the budget is the same work that shrinks the footprint, and grid placement is a near-free additional carbon cut.

5. Responsible Scaling: Provenance, Safety, and Governance (Advanced)

Cost and energy bound a run from below; responsibility bounds it from the sides. Three duties attach to operating at this scale, and a serious training plan treats them as gates, not afterthoughts. The first is data provenance and consent. The corpus construction, deduplication, and tokenization of the early sections of this chapter determine not only quality but also legitimacy: where the data came from, whether its use is licensed or consented, and whether personal or copyrighted material was filtered. A model is downstream of its data in every sense, including the legal and ethical ones, so provenance must be recorded as the corpus is built, not reconstructed after a dispute.

The second duty is safety evaluation before release. A frontier model is evaluated not only for capability but for the harms it might enable, and that evaluation happens before weights are shipped, using the held-out, contamination-controlled methodology of Chapter 5. The third is compute governance: because a frontier run requires a concentration of accelerators that few organizations control, the training compute itself has become a unit of policy, with disclosure thresholds and review tied to total FLOPs. An engineer planning a run at this scale should know which thresholds it crosses and what reporting they trigger.

Research Frontier: Carbon-Aware Scheduling and Compute Governance (2024 to 2026)

Two research lines now treat the meters of this section as primary objects. On the environmental side, carbon-aware scheduling has moved from proposal to practice: systems in the lineage of Google's carbon-intelligent computing and the Carbon Explorer and ACT accounting frameworks shift flexible training and batch jobs in time and space to follow low-carbon electricity, and recent work reports double-digit percentage carbon reductions for deferrable training with no change to the model. The lifecycle view is sharpening too, with embodied carbon (the footprint of manufacturing the accelerators themselves) increasingly accounted alongside operational energy. On the governance side, the 2024 to 2025 wave of policy, including the EU AI Act's systemic-risk tier and United States reporting frameworks, fixes training-compute thresholds (commonly stated around $10^{25}$ to $10^{26}$ FLOPs) above which a model triggers mandatory evaluation and disclosure, making the FLOP count of a run a regulated quantity. The open question the field is actively working is how to certify a run's reported cost, energy, carbon, and safety evaluation in a way an external auditor can trust, the certification problem we flagged at the close of Section 5.
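Because the thresholds are stated in total training FLOPs, a planner can check a proposed run against them with the same $C \approx 6ND$ estimate used for sizing. The sketch below is a planning aid under stated assumptions, not a compliance determination: the two threshold values are simply the ends of the range quoted above, the tier labels and function names are ours, and the run sizes are hypothetical.

```python
THRESHOLDS_FLOPS = {   # illustrative values taken from the range quoted above
    "1e25 tier": 1e25,
    "1e26 tier": 1e26,
}


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ~= 6*N*D rule used in this chapter."""
    return 6.0 * n_params * n_tokens


def crossed_thresholds(flops: float) -> list[str]:
    """Names of every threshold the estimated compute meets or exceeds."""
    return [name for name, limit in THRESHOLDS_FLOPS.items() if flops >= limit]


# Hypothetical run sizes, not taken from any real model.
for n_params, n_tokens in [(70e9, 2e12), (70e9, 30e12), (1e12, 20e12)]:
    c = training_flops(n_params, n_tokens)
    print(f"N={n_params:.0e}, D={n_tokens:.0e}: C={c:.2e} FLOPs, "
          f"crosses {crossed_thresholds(c) or 'none'}")
```

The second case shows why the token budget matters for governance as well as cost: the same 70-billion-parameter model stays under the lower threshold at 2 trillion tokens and crosses it at 30 trillion.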

Thesis Thread: Scale-Out Is Bounded, Not Unbounded

The thesis of this book is that intelligence at scale is built by distributing work across many machines. This section adds the boundary condition: the distribution is not free, and the meters that measure its price (dollars, joules, and carbon) are first-class design constraints that sit beside the communication tax and the failure tax from Chapter 3. A frontier run is scale-out at its most ambitious, and it is precisely there that the cost of scale-out is most visible. Every efficiency primitive in Part IV (higher MFU, FP8, sharding, expert parallelism) exists in part to keep that price affordable, which is why a chapter on training the largest models has to end with the bill.

6. Chapter 19, Composed (Beginner)

This chapter set out to train a foundation model at scale, and the path it walked was, in effect, the whole book composed into a single run. The data was constructed, deduplicated, and tokenized using the distributed data processing of Part II. The training run sharded the data, the model, and the optimizer state across thousands of accelerators using the 3D and 4D parallelism of Chapters 15 through 17, and it survived weeks of inevitable hardware failure using the elastic, fault-tolerant machinery of Chapter 18. The run was sized to the compute-optimal point, then fine-tuned and aligned into something usable, and at every step it was bounded by the cost and energy meters of this final section. A foundation model is not one technique; it is the entire discipline of scale-out AI, exercised at once and paid for in three currencies.

Key Takeaway: A Foundation Model Is the Whole Book, Under a Budget

Training a foundation model at scale composes nearly everything this book teaches: distributed data construction, deduplication, and tokenization; 3D and 4D parallel pretraining that survives failure; compute-optimal sizing; and fine-tuning and alignment into a usable system. What makes it a single coherent engineering problem rather than a pile of techniques is the budget. Cost, energy, and carbon are driven by one shared quantity, the GPU-hours, and that quantity is useful-compute divided by efficiency. Master the efficiency levers and you have mastered the bill, the footprint, and the responsibility together, because at this scale they are three readings of the same meter.

Exercise 19.9.1: Plan a Run to a Budget (Conceptual)

You are given a hard cap of $20M and a goal of keeping emissions under 2,000 tonnes CO$_2$e. Using the formulas in Sections 1 and 2 and the parameters in Code 19.9.2, work out the largest useful-compute budget (in PFLOP-days) you can afford under each constraint separately, at the tuned FP8 efficiency. Which constraint binds first, the money or the carbon? State one change to the plan (an efficiency lever or a placement choice) that would relax the binding constraint, and explain which meter it moves and which it leaves untouched.

Exercise 19.9.2: Extend the Estimator (Coding)

Extend Code 19.9.2 with two additions. First, add a checkpoint-and-restart overhead: assume failures and restarts waste 8 percent of GPU-hours (the elastic-training tax from Chapter 18), and show its effect on all three meters. Second, add an inference-amortization view: given a deployment that serves $10^{12}$ tokens over the model's lifetime, compute the marginal cost and carbon per million served tokens for two designs, a larger model trained at the compute-optimal point versus a smaller model deliberately over-trained by 3x, and identify the token volume at which the over-trained design becomes cheaper overall.

Exercise 19.9.3: Audit a Reported Footprint (Analysis)

Find a published foundation-model technical report that states either GPU-hours or total energy. Reconstruct the missing meters using the arithmetic of this section: if only GPU-hours are given, estimate energy and carbon under a stated PUE and grid intensity; if only energy is given, back out the implied GPU-hours. List every assumption you had to introduce, and discuss which one your final carbon number is most sensitive to. Tie your reasoning to the certification problem raised in the research-frontier callout: what would the report need to publish for your audit to require no assumptions at all?

Project Ideas

1. Compute-optimal run planner with a cost and carbon budget. Build a tool that takes a target model quality (or a fixed compute budget), the compute-optimal $C \approx 6ND$ relationship, and the three-meter arithmetic of this section, and returns a recommended $(N, D)$ pair, the GPU-hours, and the projected dollars, energy, and carbon. Add hard constraints (a dollar cap and a carbon cap) and have the planner report which constraint binds and which efficiency lever or grid placement best relaxes it. Validate the cost and carbon estimates against CodeCarbon on a small proxy run.

2. Carbon-aware scheduling simulator. Using public historical grid-intensity traces for two or three regions, simulate placing a deferrable multi-day training job. Compare a naive "start now, nearest region" policy against a carbon-aware policy that shifts the job in time and space, and quantify the carbon saved, the dollar cost of any added delay or data transfer, and the energy (which should be nearly unchanged). Report the carbon-per-dollar Pareto frontier.

3. Footprint audit of three published models. Pick three foundation models whose technical reports disclose GPU-hours or energy, reconstruct all three meters for each with stated assumptions (extending Exercise 19.9.3), and produce a comparison table plus a short write-up on what each report would need to disclose for an external party to verify the numbers without assumptions. Connect your findings to the compute-governance thresholds discussed in Section 5.