"They asked me to scale the model. Nobody asked who pays the power bill, who owns the data, or who answers when it goes wrong. So I learned to ask all three before I allocate a single node."
A Scheduler That Reads the Electricity Meter
A frontier pretraining run is not just an algorithm; it is a multi-million-dollar, megawatt-scale industrial process, and the bill, the carbon footprint, and the duty of care are design constraints that bind as hard as memory or bandwidth. Every choice in the previous eight sections (the data you keep, the precision you train in, the parallelism you pick, the moment you stop) lands as dollars on an invoice, kilowatt-hours on a utility meter, and kilograms of carbon in the atmosphere. This closing section makes those three currencies explicit, shows that the same efficiency levers cut all three at once, and names the responsibilities that come with operating at this scale. It then folds the whole chapter together: a foundation model is the entire book composed into one run, and that run is bounded, top to bottom, by cost and energy.
The earlier sections of this chapter treated scale as a systems problem: fit the model across devices, keep the workers fed, survive the failures, stop at the compute-optimal point. Underneath every one of those decisions sits a meter that never stops running. A modern frontier pretraining run consumes tens of millions of GPU-hours, costs on the order of tens to hundreds of millions of dollars, and draws power at the megawatt scale for weeks or months. At that magnitude, cost and energy stop being something you tally afterward and become parameters you design against from the first planning meeting, exactly the cost-aware posture introduced in Chapter 3 and made measurable in the cost and energy accounting of Chapter 5. This section gives you the arithmetic to put a number on a run before you launch it, and the vocabulary to defend that number to the people who pay for it and live with it.
1. The Bill: Cost as a First-Class Constraint Beginner
Start with the simplest of the three meters, because it is the one every stakeholder already understands. The dollar cost of a training run is, to first order, the price of the compute it consumes:
$$\text{Cost} \approx G \cdot p, \qquad G = \frac{C_{\text{useful}}}{F_{\text{peak}} \cdot \text{MFU}},$$
where $G$ is the number of GPU-hours, $p$ is the price per GPU-hour (a rental rate, or the amortized cost of owned hardware plus power), $C_{\text{useful}}$ is the useful compute the science requires in FLOPs, $F_{\text{peak}}$ is the peak throughput of one accelerator (in FLOPs per hour here, so that $G$ comes out in GPU-hours; a FLOP/s figure must be multiplied by 3600), and MFU is the model FLOPs utilization from the efficiency discussion earlier in this chapter. The second equation is the load-bearing one: the science fixes $C_{\text{useful}}$, but the GPU-hours you actually pay for are that number divided by how much of each accelerator's peak you sustain. Doubling MFU halves the bill for the identical model. This is why the utilization fights of the previous sections were never merely about speed; every relative gain in MFU shrinks an eight-figure invoice by nearly the same proportion.
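As a minimal sketch of the two equations above, the following Python turns a useful-compute budget into GPU-hours and dollars. The function names and the example figures (a $10^{25}$-FLOP budget, a 0.99 PFLOP/s peak, \$2.50 per GPU-hour) are illustrative assumptions, not values from any particular run. Note the unit conversion: dividing FLOPs by FLOP/s yields seconds, so the result is divided by 3600 to get hours.

```python
SECONDS_PER_HOUR = 3600.0


def gpu_hours(useful_flops, peak_flops_per_s, mfu):
    """GPU-hours G = C_useful / (F_peak * MFU), converted from seconds to hours."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("MFU must be in (0, 1]")
    sustained = peak_flops_per_s * mfu  # FLOP/s actually delivered per GPU
    return useful_flops / sustained / SECONDS_PER_HOUR


def cost_usd(useful_flops, peak_flops_per_s, mfu, price_per_gpu_hour):
    """Cost ~= G * p."""
    return gpu_hours(useful_flops, peak_flops_per_s, mfu) * price_per_gpu_hour


# Illustrative (assumed) inputs.
C_USEFUL = 1e25    # useful FLOPs fixed by the science
F_PEAK = 0.99e15   # peak FLOP/s of one accelerator
PRICE = 2.50       # USD per GPU-hour

cost_35 = cost_usd(C_USEFUL, F_PEAK, 0.35, PRICE)
cost_70 = cost_usd(C_USEFUL, F_PEAK, 0.70, PRICE)
print(f"MFU 35%: ${cost_35:,.0f}")
print(f"MFU 70%: ${cost_70:,.0f}  (ratio {cost_35 / cost_70:.2f}x)")
```

The ratio printed on the last line is the "doubling MFU halves the bill" claim made numerically: the numerator never changes, only the sustained throughput in the denominator.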
The science of a model fixes the useful FLOPs it needs. Everything you pay, in dollars, in kilowatt-hours, and in carbon, is that fixed numerator divided by your efficiency: model FLOPs utilization, arithmetic precision, and how close your token budget sits to compute-optimal. You cannot cheat the numerator without changing the model, but the denominator is entirely an engineering choice, which is why a single run can vary by more than a factor of two in all three currencies with identical final quality.
Treating cost as a design constraint changes decisions you might otherwise make on instinct. It argues against renting the largest accelerator when a cheaper one at higher utilization costs less per unit of useful compute; it argues for spot and preemptible capacity wherever the elastic, fault-tolerant training of Chapter 18 lets you survive the interruptions; and it argues, most of all, for not buying compute the model cannot turn into quality, the over-training trap we return to in Section 4.
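One way to compare such options on equal footing is dollars per unit of useful compute rather than dollars per GPU-hour. The sketch below is illustrative only: the option names, prices, peak throughputs, MFU values, and the fraction of GPU-hours lost to preemptions are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    price_per_gpu_hour: float    # USD
    peak_pflops: float           # peak PFLOP/s per accelerator
    mfu: float                   # sustained fraction of peak
    lost_fraction: float = 0.0   # share of GPU-hours wasted on preemptions/restarts

    def usd_per_useful_eflop(self) -> float:
        # Useful exaFLOPs delivered by one paid GPU-hour:
        # sustained PFLOP/s * 3600 s, then PFLOP -> EFLOP.
        sustained = self.peak_pflops * self.mfu * (1.0 - self.lost_fraction)
        useful_eflop = sustained * 3600.0 / 1000.0
        return self.price_per_gpu_hour / useful_eflop


# Hypothetical candidates.
options = [
    Option("big-ondemand", 4.00, 2.0, 0.30),
    Option("small-ondemand", 2.50, 1.0, 0.50),
    Option("small-spot", 1.20, 1.0, 0.50, lost_fraction=0.15),
]

for opt in sorted(options, key=Option.usd_per_useful_eflop):
    print(f"{opt.name:<16}{opt.usd_per_useful_eflop():>8.3f} USD per useful EFLOP")

best = min(options, key=Option.usd_per_useful_eflop)
```

With these made-up numbers the interruptible option wins despite wasting 15 percent of its hours, which is the pattern the paragraph describes; different prices or a higher loss rate could reverse the ranking.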
2. The Footprint: Energy and Carbon Intermediate
The second and third meters convert the same GPU-hours into physical quantities. Energy is the cleanest to reason about. If each accelerator draws an average power $P_{\text{gpu}}$ (in kilowatts) while training, the electricity delivered to the chips is $G \cdot P_{\text{gpu}}$ kilowatt-hours. But the chips are not the whole datacenter: cooling, power conversion, and distribution all draw current too. The industry captures that overhead in a single multiplier, the power usage effectiveness (PUE), defined as total facility energy divided by IT energy. A PUE of $1.1$ means the building spends ten percent on top of the compute; older facilities run $1.5$ or worse. The facility energy is therefore
$$E_{\text{kWh}} = G \cdot P_{\text{gpu}} \cdot \text{PUE}.$$
Carbon follows by multiplying energy by the carbon intensity of the electricity, $I$, measured in grams of CO$_2$-equivalent per kilowatt-hour. A coal-heavy grid sits near $700$ gCO$_2$e/kWh; a grid rich in nuclear, hydro, wind, and solar can fall below $50$. The emitted carbon is
$$M_{\text{CO}_2\text{e}} = \frac{E_{\text{kWh}} \cdot I}{1000} \ \text{kgCO}_2\text{e}.$$
Two of these factors, PUE and $I$, are properties of where and when you train, not of your model. That observation is the seed of carbon-aware scheduling: the same job placed in a region with a cleaner grid, or deferred to an hour when wind and solar are abundant, emits dramatically less carbon for an unchanged dollar and energy cost. We treat the joules-per-token of a run as a quantity to be engineered down, and the carbon as a quantity to be both engineered down and placed wisely. The responsible-scaling and environmental-accounting practices that formalize this carbon-aware placement are developed in Chapter 35.
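To make the time-shifting idea concrete, here is a minimal sketch that picks the start hour for a deferrable job by sliding a window over an hourly carbon-intensity forecast. The forecast values are made up for illustration, and the function name `best_start_hour` is hypothetical; a real scheduler would pull forecasts from a grid-data provider and would also weigh deadlines and cluster availability.

```python
def best_start_hour(intensity_g_per_kwh, job_hours):
    """Return (start_index, mean_intensity) of the cleanest contiguous window."""
    n = len(intensity_g_per_kwh)
    if not 0 < job_hours <= n:
        raise ValueError("job must fit inside the forecast horizon")
    window = sum(intensity_g_per_kwh[:job_hours])
    best_sum, best_start = window, 0
    for start in range(1, n - job_hours + 1):
        # Slide the window one hour: add the newest hour, drop the oldest.
        window += intensity_g_per_kwh[start + job_hours - 1]
        window -= intensity_g_per_kwh[start - 1]
        if window < best_sum:
            best_sum, best_start = window, start
    return best_start, best_sum / job_hours


def carbon_kg(energy_kwh, intensity_g_per_kwh):
    """M = E * I / 1000, in kgCO2e."""
    return energy_kwh * intensity_g_per_kwh / 1000.0


# Hypothetical 12-hour forecast in gCO2e/kWh.
forecast = [520, 500, 480, 400, 300, 210, 180, 190, 260, 380, 470, 510]
job_hours = 4
energy_kwh = 1000.0  # facility energy for the job, the same whenever it runs

start, mean_i = best_start_hour(forecast, job_hours)
now_i = sum(forecast[:job_hours]) / job_hours
print(f"start now:      {carbon_kg(energy_kwh, now_i):.0f} kgCO2e")
print(f"start at h+{start}:  {carbon_kg(energy_kwh, mean_i):.0f} kgCO2e")
```

The energy argument is identical in both calls, which mirrors the point in the text: deferral changes $I$ only, so the dollar and energy meters are untouched.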
The arithmetic above is worth doing by hand once, to understand it, but in practice you instrument the training loop rather than estimate from a spreadsheet. CodeCarbon samples GPU, CPU, and RAM power while the job runs, applies a carbon-intensity figure for the region it detects, and reports kWh and kgCO$_2$e per run with a few lines:
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(project_name="frontier-pretrain") # auto-detects region + grid
tracker.start()
train(model, data, steps=...) # your existing training loop, unchanged
emissions_kg = tracker.stop() # kgCO2e, also logged to emissions.csv
print(f"this run emitted {emissions_kg:.1f} kgCO2e")
3. The Levers: One Set of Knobs Cuts All Three Intermediate
The structure of the three formulas hides a gift. Dollars, energy, and carbon all scale with the GPU-hours $G$, and $G$ shrinks with efficiency. So a single set of levers, the ones this chapter has already taught for performance, doubles as the lever set for cost and footprint. The demo below makes this concrete: it fixes the useful compute of a frontier-scale run and asks what that identical run costs in all three currencies as we raise model FLOPs utilization and switch to FP8.
def estimate(gpu_hours, gpu_power_kw, pue, grid_gco2_per_kwh, price_per_gpu_hour):
    """Return (dollars, facility kWh, kgCO2e) for a given number of GPU-hours."""
    # Energy delivered to the GPUs, then scaled up by datacenter overhead (PUE).
    it_energy_kwh = gpu_hours * gpu_power_kw                # raw GPU electricity
    facility_kwh = it_energy_kwh * pue                      # + cooling, power loss
    carbon_kg = facility_kwh * grid_gco2_per_kwh / 1000.0   # gCO2e -> kgCO2e
    dollars = gpu_hours * price_per_gpu_hour                # rental / amortized cost
    return dollars, facility_kwh, carbon_kg

# Total useful FLOPs are fixed; GPU-hours = useful_FLOPs / (peak_FLOPS * MFU).
useful_pflop_days = 4.5e5        # ~450k PFLOP-days of useful compute (fixed)
peak_pflops = 0.99               # H100 BF16 dense peak, PFLOP/s per GPU
gpu_power_kw = 0.70              # ~700 W per GPU under load
pue = 1.12                       # efficient modern datacenter
grid_gco2_per_kwh = 369.0        # average grid carbon intensity (gCO2e/kWh)
price_per_gpu_hour = 2.50        # USD per GPU-hour
low_carbon_gco2_per_kwh = 30.0   # low-carbon grid used for the placement comparison

def gpu_hours_for(mfu, fp8_speedup):
    eff_pflops = peak_pflops * mfu * fp8_speedup   # sustained PFLOP/s per GPU
    # PFLOP-days -> PFLOP-seconds, / sustained rate -> seconds, / 3600 -> hours.
    return useful_pflop_days * 86400.0 / eff_pflops / 3600.0

scenarios = [
    ("Baseline BF16, MFU 35%", 0.35, 1.0),
    ("Tuned BF16, MFU 50%", 0.50, 1.0),
    ("Tuned FP8, MFU 50%", 0.50, 1.6),   # FP8 ~1.6x effective throughput
]

rule = "-" * 74
print(f"{'scenario':<26}{'GPU-hours':>12}{'cost $':>13}{'energy kWh':>13}{'tCO2e':>10}")
print(rule)
rows = []
for name, mfu, fp8 in scenarios:
    gh = gpu_hours_for(mfu, fp8)
    d, kwh, ckg = estimate(gh, gpu_power_kw, pue, grid_gco2_per_kwh, price_per_gpu_hour)
    rows.append((d, kwh, ckg))
    print(f"{name:<26}{gh:>12,.0f}{d:>13,.0f}{kwh:>13,.0f}{ckg/1000:>10.1f}")
print(rule)

# Ratios of the baseline (first row) to the tuned FP8 run (last row).
(d0, e0, c0), (d1, e1, c1) = rows[0], rows[-1]
print(f"FP8 + tuned MFU vs baseline: cost x{d0/d1:.2f}, energy x{e0/e1:.2f}, carbon x{c0/c1:.2f} lower")
print(f"Same FP8 run on a low-carbon grid ({low_carbon_gco2_per_kwh:.0f} gCO2e/kWh): "
      f"carbon x{grid_gco2_per_kwh / low_carbon_gco2_per_kwh:.1f} lower again")

scenario                     GPU-hours       cost $   energy kWh     tCO2e
--------------------------------------------------------------------------
Baseline BF16, MFU 35%      31,168,831   77,922,078   24,436,364    9017.0
Tuned BF16, MFU 50%         21,818,182   54,545,455   17,105,455    6311.9
Tuned FP8, MFU 50%          13,636,364   34,090,909   10,690,909    3944.9
--------------------------------------------------------------------------
FP8 + tuned MFU vs baseline: cost x2.29, energy x2.29, carbon x2.29 lower
Same FP8 run on a low-carbon grid (30 gCO2e/kWh): carbon x12.3 lower again
The lesson is structural, not numerical. Because all three meters are driven by the same GPU-hours, the engineer who optimizes for utilization and precision is, without any extra work, also the engineer who minimizes the carbon footprint. The two goals that are sometimes posed as a trade-off, performance versus sustainability, are the same goal viewed through different units. Higher MFU (the utilization arithmetic of Section 19.6), FP8 and low precision (Chapter 15), compute-optimal sizing, and not over-training all push the denominator up, and the denominator is shared.
There is a tempting fallacy that since cleaner grids cut carbon for free, you can ignore efficiency and just train somewhere green. The arithmetic disagrees. A clean grid divides only the carbon meter; the dollar meter and the energy meter do not care how the electrons were generated. An over-trained, low-utilization run on a hydro grid still burns the same eight-figure budget and the same gigawatt-hours. Efficiency is the only lever that pulls all three meters at once, which makes it the cheapest sustainability investment you will ever make, because you were going to make it for the budget anyway.
4. Compute-Optimal and the Cost of Over-Training Intermediate
The fourth lever is the subtlest and the one most often gotten wrong: spending compute the model cannot convert into quality. The compute-optimal framing developed earlier in this chapter answers a precise question: given a fixed compute budget $C$, how should you split it between model size $N$ and training tokens $D$, subject to $C \approx 6 N D$, to minimize loss? Train far past the resulting optimum and the loss curve flattens while the meters keep climbing. Every additional token then buys diminishing quality at undiminished cost, energy, and carbon.
There is a legitimate reason to over-train relative to the pretraining-optimal point: a model that will serve billions of inference requests should be made smaller and trained longer, because a smaller model is cheaper to serve forever, and the extra training cost is amortized over the deployment lifetime. That is an inference-economics decision, made deliberately with the serving cost of Chapter 3 in view, not an accident. The failure mode this section warns against is the accidental over-train: running longer because the cluster was already booked, or because a flat loss curve was mistaken for one still descending. Naming the compute-optimal point, and justifying any departure from it in writing, is how a cost-aware team keeps a training plan honest.
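The split and the amortization argument can be sketched numerically. The code below assumes, beyond what this section states, a tokens-per-parameter ratio of about 20 as the compute-optimal rule of thumb and roughly $2N$ FLOPs per served token at inference; both are common approximations used here only for illustration, and the function names are hypothetical. With $C \approx 6ND$ and $D = rN$, the split is $N = \sqrt{C/(6r)}$. The comparison counts FLOPs only and takes as given that both designs reach the quality the product needs.

```python
import math

TOKENS_PER_PARAM = 20.0  # assumed compute-optimal ratio D/N (rule of thumb)


def compute_optimal_split(c_flops, ratio=TOKENS_PER_PARAM):
    """Solve C = 6*N*D with D = ratio*N for (N, D)."""
    n = math.sqrt(c_flops / (6.0 * ratio))
    return n, ratio * n


def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (~6*N*D) plus inference (~2*N per served token, an assumption)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens


def break_even_served_tokens(n_big, d_big, n_small, d_small):
    """Served-token volume above which the smaller model is cheaper overall."""
    if n_small >= n_big:
        raise ValueError("the alternative design must have fewer parameters")
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    return max(extra_training, 0.0) / saving_per_token


n_big, d_big = compute_optimal_split(1e24)
# Hypothetical alternative: half the parameters, trained on 3x the tokens.
n_small, d_small = n_big / 2.0, 3.0 * d_big
tokens = break_even_served_tokens(n_big, d_big, n_small, d_small)
print(f"compute-optimal: N={n_big:.3g} params, D={d_big:.3g} tokens")
print(f"over-trained design wins beyond ~{tokens:.3g} served tokens")
```

Writing the break-even volume into the training plan is one way to make a deliberate over-train auditable: if expected serving volume sits well below it, the extra tokens are hard to justify on cost grounds.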
Who: A research-infrastructure lead at a foundation-model lab planning a 70-billion-parameter pretraining run.
Situation: The original plan budgeted 30 million GPU-hours on rented H100 capacity, a roughly $75M line item, with a sustainability commitment to report and offset emissions.
Problem: A first cost-and-carbon estimate, built with the arithmetic of Code 19.9.2, projected about 9,000 tonnes CO$_2$e, large enough to attract board-level attention and a public disclosure obligation.
Dilemma: Push for a faster launch on the planned 35-percent-utilization configuration, or spend two weeks tuning the parallelism and adopting FP8 first, delaying the start but lowering every meter.
Decision: They spent the two weeks. Profiling lifted MFU to 50 percent, FP8 added a further 1.6x effective throughput, and the data team confirmed the token budget already sat at the compute-optimal point, so no over-training correction was needed.
How: They instrumented the loop with CodeCarbon (Code 19.9.1) to verify the projection against live power draw, and scheduled the run in a low-carbon region whose grid averaged under 60 gCO$_2$e/kWh.
Result: The same model trained in 13.6 million GPU-hours for about $34M, drawing about 11 GWh, and the regional placement cut reported emissions by a further factor of six or more on top of the efficiency gain (an average grid at 369 gCO$_2$e/kWh versus one under 60). The two-week delay paid for itself many times over.
Lesson: Estimate all three meters before you launch. The efficiency work you would do for the budget is the same work that shrinks the footprint, and grid placement is a near-free additional carbon cut.
5. Responsible Scaling: Provenance, Safety, and Governance Advanced
Cost and energy bound a run from below; responsibility bounds it from the sides. Three duties attach to operating at this scale, and a serious training plan treats them as gates, not afterthoughts. The first is data provenance and consent. The corpus construction, deduplication, and tokenization of the early sections of this chapter determine not only quality but also legitimacy: where the data came from, whether its use is licensed or consented, and whether personal or copyrighted material was filtered. A model is downstream of its data in every sense, including the legal and ethical ones, so provenance must be recorded as the corpus is built, not reconstructed after a dispute.
The second duty is safety evaluation before release. A frontier model is evaluated not only for capability but for the harms it might enable, and that evaluation happens before weights are shipped, using the held-out, contamination-controlled methodology of Chapter 5. The third is compute governance: because a frontier run requires a concentration of accelerators that few organizations control, the training compute itself has become a unit of policy, with disclosure thresholds and review tied to total FLOPs. An engineer planning a run at this scale should know which thresholds it crosses and what reporting they trigger.
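Recording provenance as the corpus is built, the first duty above, can be as simple as writing one structured record per document alongside the data. The schema below is a hypothetical sketch: the field names, the license labels, and the JSON Lines layout are assumptions chosen for illustration, not a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    source: str           # where the document was obtained
    license_id: str       # e.g. "CC-BY-4.0" or "unknown" (labels are illustrative)
    consent_basis: str    # how use was authorized, in the team's own terms
    sha256: str           # content hash, ties the record to the exact bytes
    filters_applied: list = field(default_factory=list)  # e.g. PII or dedup passes


def make_record(text, source, license_id, consent_basis, filters_applied=()):
    """Build a record at ingestion time, hashing the document content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ProvenanceRecord(source, license_id, consent_basis, digest,
                            list(filters_applied))


def to_jsonl_line(record):
    """Serialize one record as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), sort_keys=True)


rec = make_record(
    "example document text",
    source="https://example.org/doc/1",
    license_id="CC-BY-4.0",
    consent_basis="public license",
    filters_applied=["pii-scrub", "near-dup"],
)
line = to_jsonl_line(rec)
print(line)
```

Hashing the content rather than the file path means the record still identifies the document after the corpus is resharded or moved.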
Two research lines now treat the meters of this section as primary objects. On the environmental side, carbon-aware scheduling has moved from proposal to practice: systems in the lineage of Google's carbon-intelligent computing and the Carbon Explorer and ACT accounting frameworks shift flexible training and batch jobs in time and space to follow low-carbon electricity, and recent work reports double-digit percentage carbon reductions for deferrable training with no change to the model. The lifecycle view is sharpening too, with embodied carbon (the footprint of manufacturing the accelerators themselves) increasingly accounted alongside operational energy. On the governance side, the 2024 to 2025 wave of policy, including the EU AI Act's systemic-risk tier and United States reporting frameworks, fixes training-compute thresholds (commonly stated around $10^{25}$ to $10^{26}$ FLOPs) above which a model triggers mandatory evaluation and disclosure, making the FLOP count of a run a regulated quantity. The open question the field is actively working on is how to certify a run's reported cost, energy, carbon, and safety evaluation in a way an external auditor can trust, the certification problem we flagged at the close of Section 5.
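A planning script can flag threshold crossings directly from the $C \approx 6ND$ estimate. The sketch below uses the two order-of-magnitude figures quoted above as illustrative thresholds; the tier labels, the function names, and the example model sizes are assumptions, and the actual legal definitions and counting rules should be checked against the regulations themselves.

```python
# Illustrative thresholds taken from the figures quoted in the text.
THRESHOLDS_FLOPS = {
    "1e25 tier": 1e25,
    "1e26 tier": 1e26,
}


def training_flops(n_params, n_tokens):
    """Approximate total training compute, C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def crossed_thresholds(flops, thresholds=THRESHOLDS_FLOPS):
    """Names of all thresholds the run meets or exceeds, smallest first."""
    hits = [(limit, name) for name, limit in thresholds.items() if flops >= limit]
    return [name for _, name in sorted(hits)]


# Hypothetical plans: (parameters, training tokens).
plans = {
    "70B on 2T tokens": (70e9, 2e12),
    "400B on 15T tokens": (400e9, 15e12),
}
for label, (n, d) in plans.items():
    c = training_flops(n, d)
    print(f"{label}: {c:.2e} FLOPs, crosses {crossed_thresholds(c) or 'none'}")
```

Running a check like this at planning time, before any node is allocated, tells the team which reporting obligations to budget for alongside dollars and energy.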
The thesis of this book is that intelligence at scale is built by distributing work across many machines. This section adds the boundary condition: the distribution is not free, and the meters that measure its price, dollars, joules, and carbon, are first-class design constraints that sit beside the communication tax and the failure tax from Chapter 3. A frontier run is scale-out at its most ambitious, and it is precisely there that the cost of scale-out is most visible. Every efficiency primitive in Part IV, higher MFU, FP8, sharding, expert parallelism, exists in part to keep that price affordable, which is why a chapter on training the largest models has to end with the bill.
6. Chapter 19, Composed Beginner
This chapter set out to train a foundation model at scale, and the path it walked was, in effect, the whole book composed into a single run. The data was constructed, deduplicated, and tokenized using the distributed data processing of Part II. Training distributed the data, the model, and the optimizer state across thousands of accelerators using the 3D and 4D parallelism of Chapters 15 through 17, and it survived weeks of inevitable hardware failure using the elastic, fault-tolerant machinery of Chapter 18. The run was sized to the compute-optimal point, then fine-tuned and aligned into something usable, and at every step it was bounded by the cost and energy meters of this final section. A foundation model is not one technique; it is the entire discipline of scale-out AI, exercised at once and paid for in three currencies.
Training a foundation model at scale composes nearly everything this book teaches: distributed data construction, deduplication, and tokenization; 3D and 4D parallel pretraining that survives failure; compute-optimal sizing; and fine-tuning and alignment into a usable system. What makes it a single coherent engineering problem rather than a pile of techniques is the budget. Cost, energy, and carbon are driven by one shared quantity, the GPU-hours, and that quantity is useful compute divided by efficiency. Master the efficiency levers and you have mastered the bill, the footprint, and the responsibility together, because at this scale they are three readings of the same meter.
You are given a hard cap of $20M and a goal of keeping emissions under 2,000 tonnes CO$_2$e. Using the formulas in Sections 1 and 2 and the parameters in Code 19.9.2, work out the largest useful-compute budget (in PFLOP-days) you can afford under each constraint separately, at the tuned FP8 efficiency. Which constraint binds first, the money or the carbon? State one change to the plan (an efficiency lever or a placement choice) that would relax the binding constraint, and explain which meter it moves and which it leaves untouched.
Extend Code 19.9.2 with two additions. First, add a checkpoint-and-restart overhead: assume failures and restarts waste 8 percent of GPU-hours (the elastic-training tax from Chapter 18), and show its effect on all three meters. Second, add an inference-amortization view: given a deployment that serves $10^{12}$ tokens over the model's lifetime, compute the marginal cost and carbon per million served tokens for two designs, a larger model trained at the compute-optimal point versus a smaller model deliberately over-trained by 3x, and identify the token volume at which the over-trained design becomes cheaper overall.
Find a published foundation-model technical report that states either GPU-hours or total energy. Reconstruct the missing meters using the arithmetic of this section: if only GPU-hours are given, estimate energy and carbon under a stated PUE and grid intensity; if only energy is given, back out the implied GPU-hours. List every assumption you had to introduce, and discuss which one your final carbon number is most sensitive to. Tie your reasoning to the certification problem raised in the research-frontier callout: what would the report need to publish for your audit to require no assumptions at all?
1. Compute-optimal run planner with a cost and carbon budget. Build a tool that takes a target model quality (or a fixed compute budget), the compute-optimal $C \approx 6ND$ relationship, and the three-meter arithmetic of this section, and returns a recommended $(N, D)$ pair, the GPU-hours, and the projected dollars, energy, and carbon. Add hard constraints (a dollar cap and a carbon cap) and have the planner report which constraint binds and which efficiency lever or grid placement best relaxes it. Validate the cost and carbon estimates against CodeCarbon on a small proxy run.
2. Carbon-aware scheduling simulator. Using public historical grid-intensity traces for two or three regions, simulate placing a deferrable multi-day training job. Compare a naive "start now, nearest region" policy against a carbon-aware policy that shifts the job in time and space, and quantify the carbon saved, the dollar cost of any added delay or data transfer, and the energy (which should be nearly unchanged). Report the carbon-per-dollar Pareto frontier.
3. Footprint audit of three published models. Pick three foundation models whose technical reports disclose GPU-hours or energy, reconstruct all three meters for each with stated assumptions (extending Exercise 19.9.3), and produce a comparison table plus a short write-up on what each report would need to disclose for an external party to verify the numbers without assumptions. Connect your findings to the compute-governance thresholds discussed in Section 5.