"They added five hundred more of me to the cluster and the throughput barely moved. I spent most of my life waiting on a barrier, and somebody still paid full price for every idle second."
A GPU Billed at Peak, Utilized at Forty Percent
A distributed AI system is worth running only if it delivers useful work at an acceptable price in dollars, in hardware utilization, and in energy. Speedup and throughput, the metrics of the previous sections, tell you whether the system is fast; they say nothing about whether it is efficient or affordable. A job can be twice as fast and four times as expensive. It can saturate a thousand accelerators and still leave sixty percent of their compute on the floor. This section introduces the three accounting metrics that decide whether a system should exist at all: cost per unit of work, Model FLOPs Utilization, and energy per unit of work. Each one converts raw speed into a number a budget owner, a hardware planner, or a sustainability auditor can actually act on, and together they turn "is this system good?" into a question you answer with arithmetic rather than enthusiasm.
The earlier sections of this chapter measured how fast a distributed system runs: speedup against one machine (Section 5.2), throughput and tail latency (Section 5.3), and the communication-to-computation ratio that explains where the wall-clock goes (Section 5.4). Speed is necessary, but it is not the same as value. A training run that finishes in half the time on eight times the hardware costs four times as much for the same work. A serving fleet that hits its latency target while every accelerator sits two-thirds idle is burning capital and power for nothing. To decide whether a system is worth running, you need metrics that relate the useful work done to the resources consumed in doing it. There are three that matter, and this section builds each one from a measurement you already have.
The first is cost per unit of work: dollars per training run, per million tokens generated, per thousand inferences served. The second is utilization, specifically Model FLOPs Utilization (MFU), which asks what fraction of the hardware's theoretical peak compute the model actually used. The third is energy and its downstream carbon, measured as joules per token or kilowatt-hours per run, now a first-class concern both for the operating bill and for responsibility. The three are tightly coupled: low utilization wastes both dollars and joules, because an idle accelerator still draws power and still appears on the invoice. Driving utilization up is the single lever that improves all three at once.
1. Cost per Unit of Work Beginner
The most direct question anyone asks about a distributed system is what it costs to do a fixed amount of work. The amount of work has to be defined precisely, because "cost per hour" is meaningless without it; an idle cluster also costs money per hour. For training, the natural unit is the full run, or the cost to reach a target loss. For generation, it is dollars per million tokens. For classification or embedding, it is dollars per thousand inferences. In every case the recipe is the same: take the price of holding the resources for some interval, and divide by the useful work completed in that interval.
Write $R$ for the rental rate of the whole job in dollars per hour, which for a homogeneous cluster of $G$ accelerators at price $p$ each is simply $R = G \cdot p$. If the job completes $W$ units of work per hour, then the cost per unit is
$$C_{\text{unit}} = \frac{R}{W} = \frac{G \cdot p}{W}.$$
For token generation, $W$ is tokens per hour, so $C_{\text{per million}} = R \,/\, (\text{tokens per hour} / 10^6)$. The structure exposes the two ways to make work cheaper: lower the rate $R$ (cheaper hardware, spot pricing, fewer machines) or raise the work rate $W$ (better utilization, larger batches, less idle time). Crucially, adding machines raises $R$ immediately and in full, but raises $W$ only to the extent the work actually scales. If doubling $G$ less than doubles $W$, which is the normal case once communication starts to dominate, then cost per unit of work goes up even as wall-clock time goes down. This is why "faster" and "cheaper" are different axes, and why a scaling study that reports only speedup is incomplete. The cost-aware scaling models of Section 3.9 formalize exactly this tension between wall-clock and dollars.
Wall-clock speedup and cost per unit of work move in opposite directions once scaling is sublinear. Adding accelerators always multiplies the rate $R$ by exactly the factor by which the cluster grew, but multiplies the work rate $W$ by less than that whenever communication or idle time eats into the gain. The break-even point, where one more machine stops paying for itself, is almost always reached well before the point where the system stops getting faster. Reporting speedup without cost per unit of work tells the optimistic half of the story and hides the half that the budget owner cares about.
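The relation $C_{\text{unit}} = G \cdot p / W$ can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not a measurement: the price, the cluster sizes, and the work rates are hypothetical numbers chosen so that doubling the cluster raises the work rate by only 1.7x.

```python
def cost_per_unit(num_accelerators: int, price_per_hour: float, work_per_hour: float) -> float:
    """C_unit = R / W, with R = G * p for a homogeneous cluster."""
    rental_rate = num_accelerators * price_per_hour  # R, dollars per hour
    return rental_rate / work_per_hour

# Hypothetical scaling study: doubling G raises W by only 1.7x (sublinear).
price = 2.00                      # assumed dollars per accelerator-hour
small = cost_per_unit(8, price, work_per_hour=100.0)
large = cost_per_unit(16, price, work_per_hour=170.0)

speedup = 170.0 / 100.0           # wall-clock gain for a fixed amount of work
efficiency = speedup / (16 / 8)   # parallel efficiency of the larger cluster
cost_ratio = large / small        # above 1 means each unit of work got more expensive

print(f"speedup    : {speedup:.2f}x")
print(f"efficiency : {efficiency:.0%}")
print(f"cost ratio : {cost_ratio:.3f}x per unit of work")
```

At a flat per-accelerator price the cost ratio is simply the reciprocal of the parallel efficiency, so any efficiency below one hundred percent makes the larger cluster more expensive per unit of work even though it finishes sooner.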
2. Model FLOPs Utilization, and Why Big Runs Sit Far Below Peak Intermediate
Cost per unit of work tells you the price of the output. Utilization tells you how much of the hardware you paid for actually did useful work. A vendor advertises an accelerator's peak compute, for example roughly $989$ TFLOP/s of dense BF16 on a current data-center GPU, but no real workload reaches it. Model FLOPs Utilization is the honest fraction. It counts only the floating-point operations the model strictly requires, divides by the aggregate hardware peak, and reports the ratio. If $T$ is achieved throughput in tokens per second, $F$ is the model FLOPs per token, $G$ is the number of accelerators, and $P_{\text{peak}}$ is the per-accelerator peak in FLOP/s, then
$$\text{MFU} = \frac{T \cdot F}{G \cdot P_{\text{peak}}}.$$
For a dense transformer, the model FLOPs per token follows the well-known $6N$ rule: a forward and backward pass over $N$ parameters costs about $6N$ floating-point operations per token (roughly $2N$ for the forward pass and $4N$ for the backward). MFU counts only this irreducible arithmetic. A closely related metric, Hardware FLOPs Utilization (HFU), counts every FLOP the hardware actually executed, including the redundant recomputation introduced by activation checkpointing. HFU is therefore always at least as large as MFU, and the gap between them is exactly the overhead of those engineering tricks. MFU is the metric to report when you want to know how efficiently the run converts silicon into model progress, because it credits only work that moves the model forward.
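As a rough illustration of the gap between the two metrics, the sketch below computes MFU and HFU for the same measured throughput. It assumes full activation recomputation, under which the forward pass runs a second time and the hardware executes roughly $8N$ FLOPs per token instead of $6N$. That $8N$ figure, the model size, the throughput, and the cluster size are all illustrative assumptions rather than numbers from a specific run.

```python
def flops_utilization(tokens_per_second, flops_per_token, num_gpus, peak_flops_per_gpu):
    """Fraction of aggregate peak FLOP/s accounted for by the given per-token FLOP count."""
    return (tokens_per_second * flops_per_token) / (num_gpus * peak_flops_per_gpu)

num_params = 13e9                 # hypothetical 13B-parameter dense model
tokens_per_second = 4.0e5         # hypothetical measured throughput of the whole job
num_gpus = 128                    # hypothetical cluster size
peak = 9.89e14                    # 989 TFLOP/s dense BF16 per accelerator

model_flops = 6 * num_params      # 2N forward + 4N backward: what MFU counts
hardware_flops = 8 * num_params   # assumption: one extra 2N forward from full recomputation

mfu = flops_utilization(tokens_per_second, model_flops, num_gpus, peak)
hfu = flops_utilization(tokens_per_second, hardware_flops, num_gpus, peak)
print(f"MFU: {mfu:.1%}   HFU: {hfu:.1%}   recomputation overhead: {hfu - mfu:.1%} of peak")
```

Because both metrics share the same throughput and the same peak, their ratio under this assumption is fixed at $8/6$; a run that recomputes fewer activations would sit somewhere between the two.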
Well-tuned large training runs typically land somewhere between thirty and fifty percent MFU, and many production runs sit lower. The reasons are the same forces that this whole chapter has been tracking, now showing up as wasted FLOP/s rather than wasted seconds. Communication is the largest culprit: while a worker waits on an all-reduce to synchronize gradients (Chapter 4), its arithmetic units are idle but its meter is running. Memory stalls are the second: when a kernel is waiting on data to arrive from high-bandwidth memory, the compute units starve. Pipeline parallelism adds a third, the pipeline bubble, the startup and drain phases during which some stages have no microbatch to work on. Load imbalance, kernel launch overhead, and less-than-perfect overlap of communication with computation fill in the rest. Figure 5.5.1 shows where the FLOP/s of a typical run actually go.
The practical upshot is that MFU is a diagnostic, not just a scorecard. A run at fifteen percent MFU is telling you that most of your hardware budget is being spent on waiting, and the communication-to-computation analysis of Section 5.4 will usually point at the dominant culprit. Raising MFU is the same act as lowering cost per token and energy per token, because all three share the denominator of useful work done per unit of resource consumed.
3. Energy and Carbon as First-Class Metrics Intermediate
Energy used to be folded silently into the dollar cost and ignored. It no longer can be, for two reasons. The operating reason is that at fleet scale, power is a hard physical constraint: a data center has a fixed megawatt budget, and once it is full, the only way to do more work is to do it more efficiently per joule. The responsibility reason is that the carbon emitted by training and serving large models has become a reported, audited quantity that organizations are accountable for. Both reasons make energy per unit of work a metric you measure deliberately rather than infer after the fact.
The base measurement is joules per token (or joules per inference, or kilowatt-hours per training run). If the facility draws $P_{\text{fac}}$ watts while producing $T$ tokens per second, then
$$E_{\text{token}} = \frac{P_{\text{fac}}}{T} \quad \text{joules per token}, \qquad P_{\text{fac}} = G \cdot P_{\text{gpu}} \cdot \text{PUE}.$$
Here $P_{\text{gpu}}$ is the board power drawn by one accelerator under load and PUE is the data center's Power Usage Effectiveness, the ratio of total facility power to the power delivered to the computing equipment itself. A PUE of $1.0$ would mean every watt reached a chip; real data centers add cooling, power conversion, and lighting, so a good modern facility runs around $1.1$ and an older one well above $1.5$. PUE is the multiplier that turns chip power into wall-plug power, and it is why efficient cooling is part of AI efficiency. To turn joules into carbon, multiply the energy by the grid's carbon intensity in grams of CO2-equivalent per kilowatt-hour, a number that varies by region and by hour of day, which is precisely what makes carbon-aware scheduling possible.
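The joules-to-carbon conversion described in words above can be written out explicitly. The symbols $I_{\text{grid}}$ (grid carbon intensity in gCO2e per kWh) and $m_{\text{token}}$ (grams of CO2-equivalent per token) are notation introduced here for illustration only; the factor $3.6 \times 10^{6}$ is the number of joules in one kilowatt-hour.
$$m_{\text{token}} = \frac{E_{\text{token}}}{3.6 \times 10^{6}} \cdot I_{\text{grid}} \quad \text{gCO2e per token}.$$
As a worked example with hypothetical inputs, $E_{\text{token}} = 0.1$ joules and $I_{\text{grid}} = 400$ gCO2e per kWh give $m_{\text{token}} = (0.1 / 3.6 \times 10^{6}) \cdot 400 \approx 1.1 \times 10^{-5}$ grams per token, or about $11$ grams per million tokens.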
Notice that $T$, achieved throughput, sits in the denominator of both cost per token and energy per token, and in the numerator of MFU. This is the unifying thread of the section: the same idle FLOP/s that drag MFU down also raise the dollar cost and the joule cost of every token, because the accelerators draw power and accrue rent whether or not they are doing useful arithmetic. Chapter 35 connects this energy accounting to responsible-scaling practices: power capping, carbon-aware placement, and reporting obligations.
4. Computing the Three Metrics from Real Inputs Intermediate
The three metrics share so much structure that one small script computes all of them from a handful of measured inputs: the number of accelerators, their peak FLOP/s, the model size, the achieved throughput, the rental price, the board power, the PUE, and the grid carbon intensity. The code below takes a concrete scenario, a $256$-GPU job training a $7$-billion-parameter dense model, and reports MFU, cost per million tokens, energy per token, and carbon per million tokens in one pass.
# Scenario: one training job on a cluster of accelerators.
num_gpus = 256 # accelerators in the job
peak_flops_per_gpu = 9.89e14 # 989 TFLOP/s, dense BF16 peak of one H100 SXM
tokens_per_second = 2.45e6 # measured end-to-end throughput of the whole job
# Transformer training FLOPs per token: the 6N rule (fwd+bwd) for N parameters.
num_params = 7.0e9 # a 7B-parameter dense model
flops_per_token = 6 * num_params
# --- Model FLOPs Utilization (MFU) ---
achieved_flops = tokens_per_second * flops_per_token # useful FLOP/s the model did
peak_flops = num_gpus * peak_flops_per_gpu # aggregate hardware peak
mfu = achieved_flops / peak_flops
# --- Cost per million tokens ---
gpu_hourly_usd = 2.50 # blended price per accelerator-hour
cluster_hourly_usd = num_gpus * gpu_hourly_usd
tokens_per_hour = tokens_per_second * 3600.0
cost_per_million = cluster_hourly_usd / (tokens_per_hour / 1e6)
# --- Energy and carbon per token ---
gpu_watts = 700.0 # board power of one accelerator at load
pue = 1.12 # power usage effectiveness of the facility
facility_watts = num_gpus * gpu_watts * pue
joules_per_token = facility_watts / tokens_per_second
grid_g_per_kwh = 380.0 # grid carbon intensity, gCO2e per kWh
kwh_per_token = joules_per_token / 3.6e6 # 3.6e6 joules per kWh
gco2_per_million = kwh_per_token * grid_g_per_kwh * 1e6
print("MFU :", f"{mfu*100:.1f} %")
print("cluster cost per hour :", f"${cluster_hourly_usd:,.0f}")
print("cost per 1M tokens :", f"${cost_per_million:.4f}")
print("joules per token :", f"{joules_per_token:.3f}")
print("gCO2e per 1M tokens :", f"{gco2_per_million:.1f}")
MFU : 40.6 %
cluster cost per hour : $640
cost per 1M tokens : $0.0726
joules per token : 0.082
gCO2e per 1M tokens : 8.6
The output makes the coupling concrete. The cost per million tokens, the energy per token, and the MFU all read off the same throughput; improve that one number and all three numbers improve together. The script also makes it easy to play out cost-aware decisions: switch gpu_hourly_usd to a spot price and watch the cost per token fall, the lever that Chapter 33 turns into a scheduling policy, or lower grid_g_per_kwh to model running the job in a cleaner region or a greener hour, the lever behind carbon-aware scheduling.
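The two what-if levers just described can be played out side by side. The sketch below holds the scenario's throughput, cluster size, board power, and PUE fixed and varies only the hourly price and the grid carbon intensity. The function name `per_million`, the spot price of $1.00, and the cleaner-grid intensity of 50 gCO2e per kWh are hypothetical values chosen for illustration.

```python
# Fixed inputs, copied from the scenario above.
NUM_GPUS = 256
TOKENS_PER_SECOND = 2.45e6
GPU_WATTS = 700.0
PUE = 1.12

def per_million(gpu_hourly_usd: float, grid_g_per_kwh: float) -> tuple[float, float]:
    """Return (USD, gCO2e) per one million tokens at the fixed throughput above."""
    tokens_per_hour = TOKENS_PER_SECOND * 3600.0
    usd = (NUM_GPUS * gpu_hourly_usd) / (tokens_per_hour / 1e6)
    joules_per_token = (NUM_GPUS * GPU_WATTS * PUE) / TOKENS_PER_SECOND
    gco2 = joules_per_token / 3.6e6 * grid_g_per_kwh * 1e6
    return usd, gco2

scenarios = {
    "baseline":     (2.50, 380.0),
    "spot price":   (1.00, 380.0),   # hypothetical spot rate
    "cleaner grid": (2.50, 50.0),    # hypothetical low-carbon region
}
for name, (price, intensity) in scenarios.items():
    usd, gco2 = per_million(price, intensity)
    print(f"{name:<13} ${usd:.4f} per 1M tokens   {gco2:.1f} gCO2e per 1M tokens")
```

The price lever moves only the dollar figure and the grid lever moves only the carbon figure; neither changes the joules per token, which depend on throughput and facility power alone.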
Code 5.5.1 estimates facility power from a nameplate board-power figure. In a real run you measure it. NVIDIA accelerators report instantaneous board power through NVML, which pynvml exposes directly, and higher-level libraries wrap this into automatic carbon accounting so you never hand-roll the joules-to-CO2 conversion:
# pip install pynvml codecarbon
import pynvml
pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0 # measured board power, in watts
from codecarbon import EmissionsTracker
with EmissionsTracker() as tracker: # logs energy + region carbon intensity
    train_one_epoch()  # your workload here (placeholder for your own training function)
# tracker writes kWh consumed and kg CO2e emitted to emissions.csv
pynvml returns the real per-device watts that Code 5.5.1 approximated, and codecarbon handles the region-specific carbon intensity and the joules-to-CO2 conversion, writing an auditable emissions record.
Who: An inference platform engineer running a large-language-model API for an enterprise product.
Situation: The team had doubled the serving fleet to cut p99 latency, hit the latency target, and declared victory in the weekly review.
Problem: The next month's GPU invoice roughly doubled, and finance asked why the cost per million tokens had risen sharply even though the system was demonstrably faster.
Dilemma: Keep the larger fleet, which met the latency SLO at a cost the business could not sustain, or shrink it, which threatened the latency target that had justified the expansion.
Decision: Before touching fleet size, they measured MFU and found it had fallen to eleven percent: the extra replicas ran with batches too small to fill the accelerators, so most of the new hardware was idle yet fully billed and fully powered.
How: They reverted to the smaller fleet, raised the per-replica batch size and enabled continuous batching to push MFU back above thirty percent, and used the cost-per-million-token and joules-per-token formulas of this section to set the fleet size from the actual work rate rather than from latency alone.
Result: Cost per million tokens dropped back near the original level, energy per token fell in the same proportion, and the latency SLO still held, because the bottleneck had been utilization, not raw capacity.
Lesson: Faster is not cheaper. Always pair a latency or throughput win with its cost-per-unit-of-work and utilization numbers, or you will pay for idle silicon and call it progress.
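The sizing step in this case study can be sketched in a few lines. The code below is a hypothetical illustration, not the team's actual tooling: the function names, the demand figure, the per-replica throughputs, the replica size of eight accelerators, and the 1.25 headroom factor are all assumptions. It sizes the fleet from the work rate and then prices the tokens actually served.

```python
import math

def size_fleet(demand_tokens_per_s: float, replica_tokens_per_s: float, headroom: float = 1.25) -> int:
    """Smallest replica count that serves the demand with the given headroom factor."""
    return math.ceil(demand_tokens_per_s * headroom / replica_tokens_per_s)

def cost_per_million(replicas: int, gpus_per_replica: int, gpu_hourly_usd: float,
                     served_tokens_per_s: float) -> float:
    """Dollars per 1M tokens actually served; the fleet bills in full, idle or not."""
    hourly = replicas * gpus_per_replica * gpu_hourly_usd
    return hourly / (served_tokens_per_s * 3600.0 / 1e6)

demand = 60_000.0  # hypothetical peak demand, tokens per second
# Hypothetical per-replica throughput at small batches and at larger, continuous batches.
for label, replica_rate in [("small batches", 2_000.0), ("larger batches", 6_000.0)]:
    n = size_fleet(demand, replica_rate)
    usd = cost_per_million(n, gpus_per_replica=8, gpu_hourly_usd=2.50, served_tokens_per_s=demand)
    print(f"{label:<15} replicas={n:<3} cost per 1M tokens=${usd:.3f}")
```

This sketch sizes for throughput only. Larger batches also lengthen per-request latency, so a fleet sized this way still has to be checked against the latency SLO separately.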
5. Putting the Three Together Beginner
The three metrics of this section are not independent scores to optimize separately; they are three views of one relationship between useful work and the resources consumed to produce it. Cost per unit of work measures the resources in dollars, energy per unit of work measures them in joules, and MFU measures them in theoretical peak FLOP/s. Because achieved throughput sits underneath all three, the cheapest single intervention is almost always to raise utilization: a run that converts more of its purchased FLOP/s into model progress is simultaneously cheaper per token, greener per token, and a better return on the hardware. When utilization is already high and the metrics still hurt, the levers move to the rate itself: cheaper or spot-priced hardware for the dollar term, and cleaner grids or off-peak hours for the carbon term. The infrastructure chapters develop both into scheduling policy.
These accounting metrics also close a loop that the chapter opened. Speedup and throughput say whether distribution made the system faster; cost, utilization, and energy say whether it made the system better. A scaling study that omits them can recommend a configuration that is fast and ruinous. With them, "should we run this, and at what size?" becomes a decision backed by numbers. The next section turns to a subtler danger: that the numbers themselves can mislead if the benchmark is built carelessly. We move from what to measure to how to measure it without fooling ourselves in Section 5.6.
Three live threads sharpen the metrics of this section. First, MFU reporting has become a near-standard disclosure in large-model system papers, traceable to the PaLM training report (Chowdhery et al., 2022) that popularized the metric and to the Megatron-LM and MegaScale efficiency studies, the latter reporting MFU above fifty percent on more than ten thousand GPUs; the field now treats MFU as the headline efficiency number a serious training run must justify. Second, FP8 training, pushed into production by the NVIDIA Transformer Engine and DeepSeek-V3's large-scale FP8 run (2024 to 2025), raises usable peak FLOP/s and energy efficiency per token, though it shifts the burden onto numerical-stability engineering and complicates the choice of which "peak" belongs in the MFU denominator. Third, carbon-aware scheduling has matured from research into deployed practice: systems in the lineage of Carbontracker and the carbon-aware computing line shift flexible training and batch-inference work toward times and regions with lower grid carbon intensity, turning the carbon term of this section's energy formula into a control knob. We develop the responsible-scaling side of these practices in Chapter 35.
An accelerator stalled on a communication barrier is doing nothing, but it is not free. It still draws most of its board power, and it still bills at the full hourly rate. At fleet scale this is the most expensive way to do nothing that has ever existed: at the $2.50 per accelerator-hour used in Code 5.5.1, a thousand GPUs waiting one second on a barrier burn about seventy cents of rent, and a synchronous training job reaches such a barrier at every step. Utilization is the metric that turns that invisible waiting into a number on a dashboard, which is the first step toward making it stop.
A team scales a training job from $64$ to $256$ accelerators and observes wall-clock time drop from $12$ hours to $4$ hours. Using the cost-per-unit-of-work relation $C_{\text{unit}} = R / W$, and assuming a flat per-accelerator price, determine whether the run got cheaper or more expensive per unit of work, and by what factor. State the parallel efficiency implied by the speedup, and explain in one or two sentences why a review that celebrated only the wall-clock reduction would be misleading to a budget owner.
Starting from Code 5.5.1, wrap the calculation in a function of tokens_per_second and print MFU, cost per million tokens, and joules per token for achieved throughputs of $1.0\times10^6$, $2.45\times10^6$, and $3.5\times10^6$ tokens per second, holding all other inputs fixed. Confirm numerically that cost per token and energy per token are each inversely proportional to throughput while the rate and power stay constant, and explain why this means a single utilization improvement moves all three accounting metrics in the same direction.
A $512$-GPU run reports an MFU of $22\%$. Independent profiling attributes the lost compute as follows: communication waits account for $30\%$ of wall-clock time, the pipeline bubble for $12\%$, and memory stalls for $10\%$, with the small remainder being kernel-launch and overlap inefficiency. Reconcile these wall-clock fractions with the $22\%$ MFU figure, identify which single waste category is the largest target for improvement, and connect your reasoning to the communication-to-computation ratio of Section 5.4. State what intervention you would try first and what you would expect it to do to the cost per million tokens.