"I counted my megawatts as the model trained, and I counted whom it failed as the model served. Nobody had asked me to total either column. So I did, and I left the receipt on the table."
A Datacenter, Counting Its Megawatts as the Model Trains
Scale multiplies not only capability and cost but harm: a model trained on skewed data and served to a billion endpoints broadcasts every bias it learned across the whole population, and the energy, water, and carbon of training and inference grow with the fleet rather than the model. The rest of this chapter taught you to keep a distributed AI system running and secure when failure is the steady state; this final section asks the question that a working, secure system still leaves open: at this scale, what are we responsible for? Two answers dominate. The first is fairness, because distribution over non-IID, geographically uneven data can encode disparities, and a fleet amplifies them uniformly to everyone it reaches. The second is environmental cost, because the same scale-out that makes the system possible turns a per-node energy number into a population-scale total that shows up on a grid and in an emissions ledger. Both are scale problems, which is why they belong in this book and not in a footnote; responsibility scales with the fleet, and measuring it requires the same distributed aggregation the rest of the system already runs.
A distributed AI system that survives its faults, repels its attackers, and protects its training data can still do harm of a kind no fault detector flags. Two such harms scale precisely with the engineering this book celebrates. The first is bias: a model that fits a skewed distribution and is then replicated across a fleet does not make a few unfair decisions, it makes the same unfair decision everywhere at once, with the consistency that is otherwise the system's great virtue. The second is environmental cost: the energy that powers a single accelerator is negligible, but multiplied by the thousands of accelerators in a training cluster and by the billions of queries a deployed model answers, it becomes a real draw on a real electrical grid with a real carbon intensity. Neither harm is a single-machine concern. Both are scale-out concerns, born of the same multiplication that gives the system its power, and both are measurable with the distributed-aggregation machinery already at hand.
1. Bias and Fairness at Fleet Scale Intermediate
Bias in a machine learning model is not new, and the textbook account treats it as a property of one model fit to one dataset. Distribution changes the account in two ways that this book must name because they are scale effects, not single-machine effects. The first concerns where the data comes from. A distributed or federated training run, of the kind built in Chapter 14, draws its data from many sites whose distributions differ, the non-IID condition that makes federated optimization hard in the first place. When the sites are geographically skewed, when one region contributes ten times the data of another, or when whole populations are absent from the training fleet, the aggregated model inherits that skew. It fits the well-represented groups and underserves the rest, and it does so not by accident of a single bad batch but as a structural consequence of how the data was partitioned across the fleet.
The second change concerns where the model goes. A single-machine model serves a single stream of requests; a fleet-served model answers a population. When a model is replicated to a billion edge endpoints, as in the on-device deployments of Chapter 34, every bias it carries is applied uniformly to everyone the fleet reaches. Replication, the property that makes the serving fleet reliable and fast, is exactly the property that makes its bias universal. A disparity that would affect a handful of users from one model instance affects the entire underserved population from a fleet of identical instances. Scale does not dilute the harm; it broadcasts it.
The two forces that make a distributed AI system powerful each have a fairness cost. Replication, which gives the serving fleet its reliability, applies any learned bias identically to every user the fleet reaches, so a model that is slightly unfair on one box is uniformly unfair across a billion. Aggregation, which gives federated training its reach, can average away a minority's signal when one region's data dominates the combine step, so the global model fits the majority and the per-region disparity never appears in the global loss. Both costs are invisible to a metric computed on one node. Seeing them requires measuring fairness the way the system already measures everything else: per group, per shard, rolled up across the fleet.
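The aggregation effect described above can be made concrete with a toy calculation. The sketch below assumes three hypothetical regions with invented sample counts and per-region losses, and combines them with a sample-weighted mean of the kind a FedAvg-style combine step uses. It is an illustration under those assumptions, not a measurement of any real fleet.

```python
# Toy illustration: a sample-weighted global loss can look healthy
# while one under-represented region is served badly.
# All region names, sample counts, and losses are hypothetical.

regions = {
    # name: (num_samples, mean_loss_on_that_region)
    "majority-region": (100_000, 0.20),
    "mid-region": (10_000, 0.35),
    "minority-region": (1_000, 0.90),
}

def weighted_global_loss(stats):
    """Sample-weighted mean of per-region losses (FedAvg-style weighting)."""
    total = sum(n for n, _ in stats.values())
    return sum(n * loss for n, loss in stats.values()) / total

def worst_region(stats):
    """Return (name, loss) of the region with the highest local loss."""
    name = max(stats, key=lambda r: stats[r][1])
    return name, stats[name][1]

global_loss = weighted_global_loss(regions)
worst_name, worst_loss = worst_region(regions)

print(f"global (weighted) loss : {global_loss:.3f}")
print(f"worst region           : {worst_name} at {worst_loss:.3f}")
print(f"worst / global ratio   : {worst_loss / global_loss:.1f}x")
```

With these invented numbers the global loss sits close to the majority region's loss, while the smallest region's loss is several times higher. That gap is the per-region disparity a single global metric hides, and it is why the per-group, per-shard rollup matters.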
Measuring fairness at this scale is itself a distributed-aggregation problem, and that is the useful news. The fairness metrics of Chapter 5 are group rates, and group rates are sums; sums roll up across a fleet with the same collective that rolls up a gradient. Consider the demographic-parity difference, the gap between the rate at which a model produces a favorable outcome for two protected groups $A$ and $B$. With $\hat{y}=1$ denoting the favorable decision and $G$ the protected attribute, it is
$$\Delta_{\mathrm{DP}} = \bigl| \,P(\hat{y}=1 \mid G=A) - P(\hat{y}=1 \mid G=B)\, \bigr|.$$
Each probability is a positive count divided by a group total. If shard $k$ reports its local positive count $p_{g,k}$ and total $n_{g,k}$ for each group $g$, the fleet-wide rate is the all-reduced sum of the counts over the all-reduced sum of the totals,
$$P(\hat{y}=1 \mid G=g) = \frac{\sum_{k=1}^{K} p_{g,k}}{\sum_{k=1}^{K} n_{g,k}},$$
so the global fairness gap is computed from two per-group all-reduces, one for the positives and one for the totals, and nothing larger ever leaves a shard. The equal-opportunity gap is the same arithmetic restricted to the truly-positive subgroup, comparing true-positive rates rather than overall positive rates. The point is structural: the fairness audit of a billion-endpoint fleet is one collective over per-group counters, cheap enough to run continuously, and the only reason it is ever skipped is that nobody wired the counters in.
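The equal-opportunity variant mentioned above can be sketched the same way. The example assumes each shard reports, per group, a pair of hypothetical counters (true positives, actual positives). A plain Python sum stands in for the sum all-reduce a real fleet would run, and the shard counts are invented for illustration.

```python
# Equal-opportunity gap from per-shard counters.
# Each shard reports, per protected group, (true_positives, actual_positives):
# the truly-positive examples it predicted favorably, and the truly-positive
# examples it saw. All counts below are hypothetical.

shards = [
    {"A": (450, 500), "B": (300, 400)},
    {"A": (380, 400), "B": (210, 300)},
    {"A": (900, 1000), "B": (520, 800)},
]

def all_reduce_sum(shard_reports, group, index):
    """Stand-in for a sum all-reduce over one per-group counter."""
    return sum(report[group][index] for report in shard_reports)

def true_positive_rate(shard_reports, group):
    """Fleet-wide TPR: summed true positives over summed actual positives."""
    tp = all_reduce_sum(shard_reports, group, 0)
    positives = all_reduce_sum(shard_reports, group, 1)
    return tp / positives

tpr_a = true_positive_rate(shards, "A")
tpr_b = true_positive_rate(shards, "B")
eo_gap = abs(tpr_a - tpr_b)

print(f"group A TPR           : {tpr_a:.3f}")
print(f"group B TPR           : {tpr_b:.3f}")
print(f"equal-opportunity gap : {eo_gap:.3f}")
```

Only two integers per group leave each shard, so the payload of the collective does not grow with the number of requests a shard has served.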
You do not hand-roll fairness metrics any more than you hand-roll an all-reduce. Microsoft's fairlearn exposes demographic parity, equalized odds, and equal opportunity as one-line metric calls, and its MetricFrame slices any scikit-learn metric by a sensitive attribute. The distributed part is yours: gather the per-shard, per-group counts with a collective first, then feed the fleet-wide arrays in.
# pip install fairlearn
# selection_rate is provided by fairlearn.metrics, not by scikit-learn.
from fairlearn.metrics import MetricFrame, demographic_parity_difference, selection_rate

# y_true, y_pred, and group are the FLEET-WIDE arrays, reconstructed from per-shard
# all-reduced counters (one all_reduce of positives, one of totals per group).
# The fairlearn signatures require y_true, but selection rates depend only on y_pred.
dp_gap = demographic_parity_difference(y_true, y_pred, sensitive_features=group)
mf = MetricFrame(metrics=selection_rate, y_true=y_true,
                 y_pred=y_pred, sensitive_features=group)
print("per-group selection rate:\n", mf.by_group)  # the rows Figure 35.8.1 rolls up
print("demographic-parity gap :", round(dp_gap, 3))
fairlearn in two calls. The hand-written rollup of Code 35.8.1 collapses to demographic_parity_difference; what the library cannot do for you is the cross-fleet collective that produces y_pred and group in the first place, which remains the distributed-systems work this book teaches.
2. The Environmental Cost of Training and Serving Intermediate
The second scale-amplified responsibility is physical. A training run draws power, that power comes from a grid, and the grid has a carbon intensity; the product is an emissions figure that grows with the cluster and the wall-clock. The arithmetic is deliberately simple so that no one can claim it was too hard to compute. Training energy is the cluster's electrical power multiplied by the run's duration and by the datacenter's power-usage effectiveness, the overhead factor for cooling and power delivery,
$$E_{\mathrm{train}} = P_{\mathrm{compute}} \cdot t \cdot \mathrm{PUE},$$
and the carbon emitted is that energy multiplied by the grid's carbon intensity,
$$C = E \cdot I, \qquad [\,\mathrm{kg\,CO_2}\,] = [\,\mathrm{kWh}\,] \cdot [\,\mathrm{gCO_2/kWh}\,] \cdot 10^{-3}.$$
A modern datacenter PUE sits near $1.1$, so most of the energy is the accelerators themselves; the grid intensity $I$, by contrast, ranges over more than an order of magnitude between a hydro-rich grid near $45$ grams of carbon dioxide per kilowatt-hour and a coal-heavy grid above $600$. That spread is the lever this section returns to: the same job, run on a cleaner grid, emits far less, and nothing about the model changes. The training side of the ledger is a one-time cost per run, large but bounded.
Inference is the other side, and at population scale it is the side that dominates over the life of a deployment. A single query on an efficient node, the per-node figure that Chapter 22 works so hard to minimize, costs a fraction of a watt-hour. That number looks negligible until it is multiplied by the query volume of a deployed fleet. The total inference energy over a period is the per-query energy times the number of queries,
$$E_{\mathrm{infer}} = \varepsilon_{\mathrm{query}} \cdot Q,$$
and when $Q$ is billions of queries per day, a tenth of a watt-hour per query becomes hundreds of megawatt-hours per day. This is the inference-at-population-scale multiplier, and it is the exact place where the per-node efficiency of Chapter 22 stops being a single-machine concern and becomes an environmental one: a thirty-percent reduction in $\varepsilon_{\mathrm{query}}$ that would be a modest single-box win becomes, multiplied by $Q$ on a coal-heavy grid, a reduction measured in tens of thousands of tonnes of carbon per year across the fleet. Efficiency, multiplied, is sustainability.
Who: A sustainability-minded ML infrastructure lead at a company training large recommendation models nightly.
Situation: A two-week pretraining run on 1024 accelerators was scheduled, by habit, in whichever region had spare capacity, usually a coal-heavy grid near the team's headquarters.
Problem: The run emitted on the order of a hundred tonnes of carbon dioxide, and a quarterly emissions report had just made that number visible to the board.
Dilemma: Keep scheduling for lowest latency-to-team and engineering convenience, or schedule for grid cleanliness and accept that the cluster might sit in a region with slower data ingress.
Decision: They made carbon intensity a first-class scheduling signal, placing the flexible, latency-tolerant training job in a hydro-rich region and leaving only latency-critical serving near users.
How: They extended the spot-and-region scheduler of Section 33.8 to rank candidate regions by real-time grid carbon intensity, treating a clean grid the way the cost-aware scheduler already treated a cheap spot price.
Result: The same run, on the cleaner grid, emitted roughly eight tonnes instead of a hundred and seven, a reduction above ninety percent, with no change to the model and a few hours of added data-transfer time.
Lesson: For any training job whose deadline is soft, where and when it runs matters as much as how efficiently it runs. Carbon-aware scheduling is the single highest-leverage environmental decision a cluster operator controls.
3. Carbon-Aware Scheduling and Efficiency as Sustainability Intermediate
The practical example named the lever; this section makes it a policy. Because grid carbon intensity varies by more than an order of magnitude across regions and by a large factor across hours of the day as solar and wind come and go, a training job with a soft deadline can be moved in space or time to run where and when the grid is cleanest. This is carbon-aware scheduling, and it is the direct sibling of the cost-aware and spot scheduling of Section 33.8: there the scheduler ranked candidate placements by dollar price and preemption risk, here it ranks them by grams of carbon per kilowatt-hour, and the machinery is identical. A latency-tolerant batch training run is the ideal candidate, because it can tolerate the data-transfer delay of running far from the team and the temporal delay of waiting for a clean hour. A latency-critical serving fleet is the poor candidate, because it must sit near its users, which is why the environmental strategy splits along the same line as the rest of this book: move the flexible training where the grid is clean, and make the inflexible serving as efficient as possible in place.
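The split policy in the last sentence can be sketched in a few lines. The example assumes a hypothetical `Job` record and a static table of regional carbon intensities. The "mixed" region, the `place` function, and the job definitions are invented for illustration, while the energy figures reuse the training-run and daily-inference totals from Code 35.8.1. A production scheduler would use time-varying intensities and would also weigh cost and capacity, which this sketch omits.

```python
# Carbon-aware placement sketch: flexible jobs go to the cleanest grid,
# latency-critical jobs stay in their home region.
# Region names, intensities, and job definitions are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    energy_kwh: float        # estimated energy, already including PUE
    home_region: str         # region closest to users or to the team
    latency_critical: bool   # True for serving, False for batch training

def place(job, intensity_g_per_kwh):
    """Return (region, kg CO2) for one job under the split policy."""
    if job.latency_critical:
        region = job.home_region  # serving must sit near its users
    else:
        # Flexible work: pick the region with the lowest grid intensity.
        region = min(intensity_g_per_kwh, key=intensity_g_per_kwh.get)
    carbon_kg = job.energy_kwh * intensity_g_per_kwh[region] / 1000.0
    return region, carbon_kg

intensity = {"coal-heavy": 620.0, "mixed": 300.0, "hydro-rich": 45.0}  # gCO2/kWh
jobs = [
    Job("pretrain", 173_408.0, "coal-heavy", latency_critical=False),
    Job("serving", 600_000.0, "coal-heavy", latency_critical=True),
]

placements = {job.name: place(job, intensity) for job in jobs}
for name, (region, carbon_kg) in placements.items():
    print(f"{name:9s} -> {region:11s} {carbon_kg / 1000:8.2f} t CO2")
```

The training job moves to the cleanest grid while the serving job stays put, so its emissions can only be reduced by lowering its energy per query, the other half of the strategy.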
That second clause is the other half of sustainability, and it is where scale-up rejoins the argument as a labeled enabler rather than the main event. The per-node efficiency techniques of Chapter 22, quantization, KV-cache paging, efficient attention, each shave a fraction off $\varepsilon_{\mathrm{query}}$. On one box that fraction is a latency or cost improvement. Multiplied across the billions of queries of a population-scale fleet, the same fraction is an emissions reduction, which is the through-line of Chapter 24 stated in carbon rather than dollars. The two strategies compose: carbon-aware placement handles the where-and-when of the flexible training load, and per-node efficiency handles the how-much of the inflexible serving load, and a responsible operator runs both. The demo below computes every number in this argument from the formulas above so the claims rest on arithmetic, not assertion.
import numpy as np
# Part A: training carbon = power * time * PUE * grid intensity.
def training_carbon(power_kw, hours, pue, intensity_g_per_kwh):
    energy_kwh = power_kw * hours * pue # E = P_compute * t * PUE
    carbon_kg = energy_kwh * intensity_g_per_kwh / 1000.0 # C = E * I
    return energy_kwh, carbon_kg
n_gpu, per_gpu_kw = 1024, 0.45 # ~450 W per accelerator
cluster_kw = n_gpu * per_gpu_kw
hours, pue = 14 * 24, 1.12 # a two-week run, modern PUE
grids = {"coal-heavy region": 620.0, "hydro-rich region": 45.0} # gCO2 per kWh
print("== Training carbon (1024 GPUs, 2 weeks) ==")
for name, intensity in grids.items():
    energy, carbon = training_carbon(cluster_kw, hours, pue, intensity)
    print(f" {name:18s}: {energy:8.0f} kWh -> {carbon:8.1f} kg CO2 ({carbon/1000:5.2f} t)")
clean = training_carbon(cluster_kw, hours, pue, grids["hydro-rich region"])[1]
dirty = training_carbon(cluster_kw, hours, pue, grids["coal-heavy region"])[1]
print(f" carbon-aware shift saving : {dirty - clean:8.1f} kg CO2 ({(1-clean/dirty)*100:.1f}% lower)")
# Part B: inference total = per-query energy * query volume, at population scale.
print("\n== Inference at population scale ==")
per_query_wh, daily_queries = 0.30, 2_000_000_000 # Ch 22 per-node baseline, fleet volume
daily_kwh = per_query_wh * daily_queries / 1000.0 # E_infer = eps_query * Q
annual_kwh = daily_kwh * 365
annual_tonnes = annual_kwh * grids["coal-heavy region"] / 1000.0 / 1000.0
print(f" per-query energy : {per_query_wh:.2f} Wh")
print(f" daily queries : {daily_queries:,}")
print(f" annual energy : {annual_kwh/1e6:,.2f} GWh")
print(f" annual carbon : {annual_tonnes:,.0f} t CO2 (at 620 gCO2/kWh)")
savings = per_query_wh * 0.30 * daily_queries / 1000.0 * 365 * grids["coal-heavy region"] / 1e6
print(f" 30% per-node win saves: {savings:,.0f} t CO2/yr across the fleet")
# Part C: demographic-parity gap rolled up across the fleet (per-group all-reduce).
print("\n== Fairness rolled up across the fleet ==")
fleet = [ # each shard reports (positives, total) for protected groups A and B
    {"region": "shard-NA", "A": (820, 1000), "B": (610, 1000)},
    {"region": "shard-EU", "A": (790, 1000), "B": (540, 1000)},
    {"region": "shard-APAC", "A": (1760, 2000), "B": (980, 2000)},
]
posA, totA = sum(s["A"][0] for s in fleet), sum(s["A"][1] for s in fleet) # all-reduce sums
posB, totB = sum(s["B"][0] for s in fleet), sum(s["B"][1] for s in fleet)
rateA, rateB = posA / totA, posB / totB
dp_gap = abs(rateA - rateB) # demographic-parity difference
print(f" group A positive rate : {rateA:.3f} ({posA}/{totA})")
print(f" group B positive rate : {rateB:.3f} ({posB}/{totB})")
print(f" demographic-parity gap: {dp_gap:.3f}")
print(f" verdict :", "FAILS 0.10 fairness bound" if dp_gap > 0.10 else "within bound")
== Training carbon (1024 GPUs, 2 weeks) ==
coal-heavy region : 173408 kWh -> 107513.1 kg CO2 (107.51 t)
hydro-rich region : 173408 kWh -> 7803.4 kg CO2 ( 7.80 t)
carbon-aware shift saving : 99709.7 kg CO2 (92.7% lower)
== Inference at population scale ==
per-query energy : 0.30 Wh
daily queries : 2,000,000,000
annual energy : 219.00 GWh
annual carbon : 135,780 t CO2 (at 620 gCO2/kWh)
30% per-node win saves: 40,734 t CO2/yr across the fleet
== Fairness rolled up across the fleet ==
group A positive rate : 0.843 (3370/4000)
group B positive rate : 0.532 (2130/4000)
demographic-parity gap: 0.310
verdict : FAILS 0.10 fairness bound
Code 35.8.1 estimates carbon from a nameplate power figure; for a real job you would rather measure it. codecarbon reads the actual energy drawn by the CPU, GPU, and RAM during a run, looks up the local grid's carbon intensity by geolocation, and writes a per-run emissions record, turning the estimate of $E=P\cdot t\cdot\mathrm{PUE}$ into a measurement.
# pip install codecarbon
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(project_name="nightly-pretrain")
tracker.start()
train_one_epoch() # your real training step; CodeCarbon meters the hardware
emissions_kg = tracker.stop() # kg CO2, from measured energy x local grid intensity
print(f"this run emitted {emissions_kg:.3f} kg CO2")
codecarbon reads from both the hardware and the location, so the emissions number becomes as routine to log as the training loss, and carbon-aware scheduling has a real signal to act on.
The accounting that Code 35.8.1 does by hand is the subject of an active research and tooling effort. The lineage from the ML CO2 Impact calculator and the experiment-impact-tracker through MLCO2 and codecarbon has standardized per-run emissions measurement, and large-model reporting now routinely includes a carbon figure. On the scheduling side, carbon-aware systems such as Google's work on shifting flexible compute to clean hours and the broader "carbon-intelligent computing" line move load in time and space toward low-intensity grid windows, the production form of Part A's region shift. A newer thread quantifies the water footprint of datacenter cooling, not just carbon (Li et al., 2023, on the water cost of large-model training and inference), making clear that the environmental ledger has more than one column. On the fairness side, work on federated and distributed fairness studies how to enforce group-fairness constraints when no node sees the whole population and the data is non-IID, the exact setting of Chapter 14. The common message is that both responsibilities of this section are becoming measurable, schedulable quantities rather than afterthoughts.
The spine of this book is that AI at scale is the engineering of systems distributed across many machines, and that distribution multiplies what one machine can do. This section is the same multiplication, read in two other columns. The replication that gives a serving fleet its reliability multiplies any bias across the whole population; the per-node energy that Chapter 22 minimizes multiplies, across billions of queries, into a grid-scale carbon total. Every scaling argument in this book has a responsibility dual: the same $K$-fold or billion-fold factor that makes the system powerful makes its harms and its costs that large too. Designing at scale means budgeting that dual, with the same distributed measurement, the same collectives, and the same honesty about cost that the rest of the engineering already demands.
Chapter 35 followed one through-line: at scale, failure is not an exception but the steady state, and a distributed AI system must be engineered to stay correct, safe, private, and accountable while the substrate beneath it constantly breaks, and while adversaries and externalities push against it. The chapter built that case in layers. It began with fault tolerance, the discipline of keeping a computation correct when individual nodes crash, hang, or return late, extending the recovery story that runs from MapReduce re-execution (Chapter 6) through elastic training (Chapter 18). It moved to security, where the attack surface of a distributed system is every node and every link, and then to the AI-specific threats of data and model poisoning, answered by Byzantine-robust aggregation rules such as Krum and the trimmed mean that let a global model survive a minority of malicious or arbitrarily faulty workers, the transformation of fault tolerance promised in the cross-reference map. It treated privacy as a first-class constraint, carrying the secure-aggregation idea of Chapter 14 into differential privacy and the privacy-utility trade-off that distributed and federated learning must price in. It placed governance and accountability over the whole, so that a system spread across machines and organizations remains auditable and answerable. And it closed, here, with the two responsibilities that grow with the fleet rather than the model: bias, which replication broadcasts uniformly across a population, and environmental cost, which the per-node energy of Chapter 22 multiplies into a grid-scale total that carbon-aware scheduling (Section 33.8) and efficiency can cut. The unifying lesson is that scale is a multiplier in every direction at once: it multiplies capability, and it multiplies failure, attack surface, privacy exposure, bias, and carbon, so a system that scales out responsibly must budget all of them with the same distributed measurement it already runs.
These projects turn the chapter's pillars into things you can run and measure. Each is sized so that one, carried far enough, could seed a capstone (Chapter 41).
1. Implement Krum, then break it. Build a small federated-averaging loop with a handful of workers, implement the Krum aggregation rule (select the worker update closest to its neighbors), and show it tolerates a minority of poisoned gradients that plain averaging does not. Then design an adaptive attack that defeats Krum by crafting updates that sit near the honest cluster, and report the fraction of malicious workers at which robustness collapses.
2. Build a DP-SGD trainer and plot the privacy-utility curve. Add gradient clipping and calibrated Gaussian noise to an SGD trainer, track the spent privacy budget $\varepsilon$ with a standard accountant, and sweep the noise multiplier to plot test accuracy against $\varepsilon$. Identify the knee of the curve where a small additional privacy guarantee starts costing large accuracy, and discuss what that knee implies for a federated deployment.
3. Build a fleet drift-and-fairness detector. Instrument a simulated serving fleet whose shards see non-IID traffic, have each shard emit per-group prediction counters, and roll them up with a collective into a continuously updated demographic-parity and equal-opportunity gap. Add a distribution-drift signal so the monitor flags both when the fleet becomes unfair and when its input distribution shifts out from under the model.
4. Estimate and minimize training carbon. Wrap a real training run with codecarbon to measure its emissions, then build a small carbon-aware scheduler that, given hourly grid-intensity forecasts for several regions, places a soft-deadline job to minimize total carbon subject to a deadline. Report the emissions saved against a carbon-blind baseline, the production form of Output 35.8.1's region shift.
Using the three shards in Code 35.8.1 Part C, compute the demographic-parity gap separately on each shard, then compare the three local gaps to the fleet-wide gap of $0.31$. Explain how a fleet can look acceptably fair on every individual shard yet fail a fairness bound globally, and why the reverse can also happen. State the general principle this implies for where fairness metrics must be computed in a distributed serving system, and connect it to the all-reduce-of-counts structure of the metrics in Chapter 5.
Extend Code 35.8.1 Part A into a function schedule(job_hours, deadline_hours, region_intensities) where region_intensities gives, for each region, an array of hourly grid carbon intensities over the next deadline_hours. Have the function choose the region and start hour that minimize total emissions for the run while finishing before the deadline, and return the emissions saved against running immediately in the dirtiest region. Test it with at least three regions whose clean hours fall at different times, and confirm the chooser exploits both the spatial and the temporal spread the way Section 33.8 exploits price and preemption.
Using the figures in Output 35.8.1, compute how many days of population-scale inference at $620$ gCO$_2$/kWh it takes to emit as much carbon as the single coal-grid training run of $107.5$ tonnes. Then redo the comparison assuming the inference fleet runs on a $45$ gCO$_2$/kWh grid. Argue from the two results which responsibility, the one-time training cost or the recurring inference cost, dominates a model's lifetime carbon, and what that conclusion implies for where an operator should spend its first efficiency and carbon-aware-scheduling effort.