Section 5.1: Why Distribution Needs Its Own Evaluation Discipline

"They asked if I was faster on sixteen machines. I said yes. Nobody asked if I was cheaper, or correct, or still alive at midnight. I was none of those things."
A Benchmark That Reported Only the Number It Liked

Big Picture

A single accuracy number, the lone figure that summarizes a model on one machine, stops being a sufficient verdict the moment the work is spread across many machines. Distribution does not change only how fast a system runs; it opens new ways for the system to be wrong, new bills to pay, and new ways to fail while still appearing to work. Evaluating a distributed AI system therefore means measuring four things at once: that distribution preserved the result (correctness), that adding machines actually bought speed (performance), that the speed was worth its price (cost and energy), and that the system survives the failures a large cluster makes routine (reliability). This chapter builds the measurement discipline that every later part of the book relies on, and its first rule is the one this section defends: report construct-matched metrics from one configuration, and never quote numbers stitched together from different runs.

For a model trained and served on one machine, evaluation has a comfortable center of gravity: a held-out accuracy, an F1, a perplexity. You report that number, perhaps a confidence interval around it, and a reader knows roughly what they are getting. The number is portable because the machine is invisible; nothing about the result depends on how many cores were busy or how the bytes moved. That comfort is exactly what distribution takes away. Once the data is sharded, the gradient is summed over a network, the model is split across devices, and the service answers from a fleet, the question "is this system good?" fractures into several questions that a single accuracy number cannot hold. This section is about why that fracture happens and what replaces the single number: a small set of metrics that, measured together, describe a distributed system faithfully.

Figure 5.1.1: The four dimensions of evaluating a distributed AI system. A single accuracy number lives entirely inside the correctness corner; the other three corners are invisible on one machine and become first-class once the work is distributed. The chapter devotes sections to each: performance in Section 5.2 and Section 5.4, latency and goodput in Section 5.3, cost and energy in Section 5.5, and reliability throughout.

1. Why One Accuracy Number Is Not Enough Beginner

Imagine two teams that each built the same recommendation model and each report 0.91 AUC on the same test set. By the single-machine standard they are tied. Now add the facts that distribution exposes. The first team reaches that AUC on four machines in two hours at a rental cost of forty dollars, with a serving fleet that holds its latency target when one replica dies. The second reaches the same AUC on sixteen machines in ninety minutes at three hundred dollars, with a serving path that drops a tenth of its requests whenever a node restarts. The accuracy number says they are equal. Everything that matters for running the system in production says they are not. The single number was never wrong; it was simply blind to the dimensions that distribution makes real.

The blindness has a structure worth naming. A single-machine accuracy assumes the computation that produced it is fixed and reliable, so the only variable is the model's quality. Distribution makes the computation itself a variable: it can be faster or slower depending on how many machines and how they talk; it can be cheaper or more expensive depending on utilization; it can succeed or partially fail depending on what broke during the run. A verdict that reports only quality, while staying silent on speed, price, and survival, is answering a question nobody running a cluster is actually asking.

Key Insight: Distribution Turns Evaluation from a Number into a Vector

On one machine, "how good is it?" collapses to a scalar because the machine is a constant. Distribution makes the machine a variable, so the verdict becomes a short vector: correctness, performance, cost, and reliability, measured together. None of the four dominates in general; which one binds depends on the system. The discipline of this chapter is to measure all four under one configuration so the vector describes a single real system, not an average of imagined ones.
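To make the vector concrete, the two teams from the start of this section can be written down as records and compared field by field. This is only a sketch: the `EvalVector` class and its field names are hypothetical, and the first team's `drop_on_restart=0.0` is an assumed stand-in for "holds its latency target when one replica dies," which the text does not quantify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalVector:
    auc: float              # correctness
    machines: int           # performance: hardware used ...
    hours: float            # ... and wall-clock time to reach the AUC
    cost_usd: float         # cost
    drop_on_restart: float  # reliability: fraction of requests dropped when a node restarts

# Values taken from the two-team example; 0.0 for team 1 is an assumption.
team_1 = EvalVector(auc=0.91, machines=4, hours=2.0, cost_usd=40.0, drop_on_restart=0.0)
team_2 = EvalVector(auc=0.91, machines=16, hours=1.5, cost_usd=300.0, drop_on_restart=0.10)

print("same AUC:", team_1.auc == team_2.auc)
print("time ratio (team 2 / team 1):", team_2.hours / team_1.hours)
print("cost ratio (team 2 / team 1):", team_2.cost_usd / team_1.cost_usd)
print("identical verdicts:", team_1 == team_2)
```

The scalar comparison reports a tie, while the vector comparison shows the second team paying 7.5 times as much to finish in three-quarters of the time, with a reliability gap on top.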

2. Correctness: Did Distribution Preserve the Result? Intermediate

The first thing distribution can break is the answer itself, and this is the dimension most likely to be assumed away. Section 1.1 proved that data-parallel training can be exact: the gradient is an average over examples, an average decomposes over shards, and summing the per-shard partial sums and dividing by $N$ recovers the identical single-machine gradient,

$$\nabla L(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(w; x_i, y_i) = \frac{1}{N} \sum_{k=1}^{K} \Big( \sum_{i \in S_k} \nabla \ell(w; x_i, y_i) \Big),$$

where $S_1, \dots, S_K$ partition the $N$ examples across $K$ workers. That identity is the gold standard a correctness check measures against: a distributed training run should reproduce the single-machine result up to floating-point rounding, and when it does not, the gap is a bug to be hunted, not a tolerance to be accepted. But the identity holds only when the implementation honors it. Unequal shard sizes combined with a plain unweighted average, non-deterministic reduction orders that change the rounding, a dropped worker whose shard silently vanishes from the sum, or a mismatched random seed across replicas all produce a result that differs from the single-machine answer in ways that no accuracy number on a single test set will reveal. Correctness evaluation for a distributed system therefore means more than "did it score well"; it means "did distribution change the computation it was supposed to preserve." We make this concrete by always reporting the distributed result alongside a single-machine or smaller-scale reference whenever one can be computed.
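A minimal sketch of such a reference check follows, assuming a one-parameter least-squares model on synthetic data; the data, the shard sizes, and the helper names `grad_sum` and `rel_err` are illustrative and not part of the text. It compares the single-machine gradient with two distributed aggregations: the exact sum-then-divide of the identity above, and an unweighted average of per-shard mean gradients over unequal shards.

```python
import random

rng = random.Random(0)
N = 1000
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [rng.gauss(0.0, 1.0) for _ in range(N)]
w = 0.7  # scalar model weight, chosen arbitrarily for the sketch

def grad_sum(xs, ys, w):
    """Sum (not mean) of per-example gradients of 0.5 * (w*x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

def rel_err(value, reference):
    """Relative error of a value against the single-machine reference."""
    return abs(value - reference) / abs(reference)

# Single-machine reference: average over all N examples.
reference = grad_sum(xs, ys, w) / N

# Deliberately unequal shards S_1..S_4 (sizes 100, 200, 300, 400).
bounds = [0, 100, 300, 600, 1000]
shards = [(xs[a:b], ys[a:b]) for a, b in zip(bounds, bounds[1:])]

# Exact aggregation: sum the per-shard partial sums, then divide by N.
exact = sum(grad_sum(sx, sy, w) for sx, sy in shards) / N

# Buggy aggregation: unweighted average of per-shard mean gradients.
naive = sum(grad_sum(sx, sy, w) / len(sx) for sx, sy in shards) / len(shards)

print(f"exact vs reference: {rel_err(exact, reference):.2e}")
print(f"naive vs reference: {rel_err(naive, reference):.2e}")
```

The exact aggregation should agree with the reference up to rounding, while the unweighted average over unequal shards shows a gap many orders of magnitude larger; that gap is the kind of discrepancy a correctness check exists to flag.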

3. Performance, Cost, and Reliability: The Three New Axes Beginner

Beyond correctness, distribution adds three axes that simply do not exist on one machine. Performance asks whether adding machines bought proportional speed: a system on eight machines that runs only 2.6 times faster than on one has spent seven additional machines' worth of hardware to buy 1.6 additional machines' worth of speed. The metrics that capture this, speedup and efficiency, are the subject of Section 5.2, and the communication overhead that erodes them, the very tax introduced in Chapter 3, gets its own ratio in Section 5.4. Cost asks whether the speed was worth its price: more machines for a shorter time can cost more in total even when it finishes sooner, and energy is increasingly the binding currency, so Section 5.5 develops cost-per-step, utilization, and energy accounting. Reliability asks whether the system survives the failures a large cluster makes routine: at a thousand machines something is always broken, so throughput under failure (goodput) and tail latency under partial degradation, taken up in Section 5.3, become part of the verdict rather than an afterthought.
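To see why "something is always broken" at scale, consider a back-of-the-envelope sketch. It assumes machines fail independently and that each has the same probability of failing during a given window; the 0.1% per-machine figure is a hypothetical chosen for illustration, not a measurement.

```python
def p_any_failure(n_machines: int, p_single: float) -> float:
    """Probability that at least one of n independent machines fails in the window."""
    return 1.0 - (1.0 - p_single) ** n_machines

P_SINGLE = 0.001  # hypothetical per-machine failure probability per window

for n in (1, 8, 100, 1000):
    print(f"{n:>5} machines: P(at least one failure) = {p_any_failure(n, P_SINGLE):.3f}")
```

Under these assumptions, a failure that is a one-in-a-thousand event for a single machine becomes more likely than not (about 63%) across a thousand machines.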

These three axes interact, which is why they must be measured together. Pushing performance by adding machines usually worsens efficiency and raises cost; hardening reliability with replication consumes throughput; chasing the lowest cost can leave a system too lean to absorb a failure. A claim that moves one axis while staying silent on the others is not an evaluation; it is a selected highlight. The next demonstration shows exactly how such a highlight misleads.

4. A Faster Run That Got Worse Intermediate

The most common misleading claim in distributed AI is "we added machines and it got faster," stated as if speed were the whole story. It is true and almost meaningless on its own. The code below takes measured per-step wall-clock times for a training job at one, two, four, and eight nodes, where the speedup is sublinear because communication grows with the node count, and computes the three numbers that "it got faster" hides: speedup relative to one node, efficiency (speedup divided by node count), and cost per step in dollars (node count times step time times a rental rate).

base_time = 10.0          # seconds per training step on 1 node
cost_per_node_sec = 0.02  # dollars per node-second (cluster rental)

# Measured wall-clock per step; speedup is sublinear as nodes grow because
# the all-reduce communication per step grows with the worker count.
measured = {1: 10.00, 2: 7.69, 4: 5.00, 8: 3.85}

print(f"{'nodes':>6}{'step_s':>9}{'speedup':>9}{'efficiency':>12}{'cost/step($)':>14}")
for n, t in measured.items():
    speedup = base_time / t
    eff = speedup / n                 # fraction of ideal linear speedup retained
    cost = n * t * cost_per_node_sec  # node-seconds spent per step, priced
    print(f"{n:>6}{t:>9.2f}{speedup:>9.2f}{eff:>12.0%}{cost:>14.4f}")

Code 5.1.1: Turning a bare "it got faster" claim into the three numbers that qualify it. The same measured step times yield a speedup that looks good and an efficiency and cost-per-step that tell the real story.

 nodes   step_s  speedup  efficiency  cost/step($)
     1    10.00     1.00        100%        0.2000
     2     7.69     1.30         65%        0.3076
     4     5.00     2.00         50%        0.4000
     8     3.85     2.60         32%        0.6160

Output 5.1.1: Going from one node to two is genuinely faster (1.30x speedup) yet already a worse deal: efficiency falls to 65% and the cost per step rises from $0.20 to $0.31. By eight nodes the job is 2.6x faster but retains only 32% efficiency at roughly triple the cost per step.

Read the table the way a single-number report would, and the two-node row is a success: 1.30 times faster. Read all four columns and the same row is a warning. Efficiency has dropped to 65%, meaning more than a third of the cluster's capacity is being spent on coordination rather than useful work, and the cost per step has risen from twenty cents to nearly thirty-one, so the faster run is the more expensive run. By eight nodes the picture is stark: a real 2.6x speedup bought at the price of roughly two-thirds of the hardware sitting idle on communication and triple the cost per step. Nothing here is fabricated; the speedup is real. The point is that speedup alone endorsed a configuration that efficiency and cost both condemn. An evaluation that reports only the speedup column is not lying about any single number; it is misleading by omission, and omission is the failure mode this chapter is built to prevent.

Practical Example: The Speedup That Tripled the Cloud Bill

Who: A platform engineer at a video-streaming company responsible for nightly retraining of a recommendation model.

Situation: A teammate proudly demonstrated that moving the job from one node to eight cut wall-clock time by more than half and posted the speedup chart to the team channel.

Problem: The next month's cloud invoice for that job had roughly tripled, and nobody could explain the gap between "faster" and "more expensive."

Dilemma: Keep the eight-node setup because the faster finish looked good on the dashboard, or revisit it because the finance team had flagged the cost, with no shared metric to settle the argument.

Decision: The engineer recomputed the run exactly as in Output 5.1.1, adding efficiency and cost-per-step columns next to the speedup the teammate had reported, all from the same logged step times.

How: Using one configuration's logs, they showed efficiency had fallen to about a third and cost per step had tripled, then swept node counts to find the knee where added speed stopped paying for itself.

Result: The job was moved to four nodes, keeping most of the speedup at far better efficiency and cost, and the team adopted a rule that any speedup claim must ship with its efficiency and cost-per-step from the same run.

Lesson: A speedup number is a fragment of an evaluation. Co-computed efficiency and cost from the same configuration turn a flattering fragment into an honest verdict.
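The sweep in this example can be sketched as follows, reusing the measured step times from Code 5.1.1. The selection rule (take the largest node count whose efficiency stays at or above a floor) and the 50% floor itself are assumptions made for illustration; a real team would set the threshold from its own budget and deadline.

```python
measured = {1: 10.00, 2: 7.69, 4: 5.00, 8: 3.85}  # step times from Code 5.1.1
COST_PER_NODE_SEC = 0.02                          # same rental rate as Code 5.1.1
EFFICIENCY_FLOOR = 0.50                           # hypothetical policy threshold

def sweep(measured, rate):
    """Derive speedup, efficiency, and cost per step for every node count from one log."""
    base = measured[1]
    return {
        n: {"speedup": base / t, "efficiency": base / t / n, "cost_step": n * t * rate}
        for n, t in measured.items()
    }

def pick_knee(rows, floor):
    """Largest node count whose efficiency is still at or above the floor."""
    return max(n for n, r in rows.items() if r["efficiency"] >= floor)

rows = sweep(measured, COST_PER_NODE_SEC)
knee = pick_knee(rows, EFFICIENCY_FLOOR)
print(f"chosen node count: {knee}")
```

With these numbers the rule selects four nodes, matching the decision in the example; raising the floor to 60% would select two.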

5. One Configuration, Co-Computed Metrics Intermediate

The deeper lesson of Output 5.1.1 is not which column to read but where the numbers came from. Every value in that table was computed from one set of measured step times, in one pass, for one configuration. That is the central methodological rule of this chapter, and it is the rule every later part of the book relies on: report construct-matched metrics from a single configuration, co-computed in one run, and never assemble a comparison from numbers that came from different runs. The danger is subtle precisely because it can pass a careless audit. If you quote the speedup from one experiment, the accuracy from a second with a different seed, and the cost from a third on a different cluster, each individual number may be correct, yet the bundle describes no system that ever existed. A reader who checks each figure against its source will find every one "backed," and the comparison will still be invalid, because the numbers are not commensurable.
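One way to make this failure visible to an audit is to carry provenance with every reported figure and refuse to assemble a comparison whose figures disagree on it. The sketch below assumes a hypothetical `Figure` record with a `run_id` field; the class, the function name, and the run identifiers are all illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Figure:
    name: str    # e.g. "speedup"
    value: float
    run_id: str  # identifier of the run whose log produced the value

def check_single_provenance(figures):
    """Return the shared run_id, or raise if the figures come from different runs."""
    run_ids = {f.run_id for f in figures}
    if len(run_ids) != 1:
        raise ValueError(f"figures stitched from multiple runs: {sorted(run_ids)}")
    return run_ids.pop()

# Each figure is individually "backed," yet the bundle describes no real system.
stitched = [
    Figure("speedup", 2.60, run_id="run-A"),
    Figure("accuracy", 0.91, run_id="run-B"),    # different seed
    Figure("cost_step", 0.616, run_id="run-C"),  # different cluster
]

try:
    check_single_provenance(stitched)
except ValueError as err:
    print(err)
```

A number-by-number audit would pass all three figures; the provenance check rejects the bundle because the figures do not share a run.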

The remedy is mechanical and cheap: when you measure a distributed run, log the raw quantities (step times, byte counts, request outcomes, joules) for that one configuration, and derive every reported metric from that single log. Speedup, efficiency, cost per step, throughput, and goodput then share a denominator and a provenance; they describe the same machine on the same day doing the same work. This is the construct-matching principle that Section 5.6 turns into a benchmarking methodology and Section 5.7 turns into a reproducible measurement protocol you can run on a real cluster.
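A minimal sketch of that remedy, assuming a hypothetical `RunLog` record that holds only raw quantities for one configuration. The field names, the sample values, and the definition of goodput used here (successfully completed steps per second of wall-clock time) are assumptions for illustration; Section 5.3 treats goodput in full.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunLog:
    """Raw measurements from ONE configuration; no derived metrics are stored."""
    run_id: str
    nodes: int
    step_times_s: tuple     # wall-clock seconds of every attempted step
    step_ok: tuple          # True if the step completed, False if it was lost
    energy_joules: float    # total energy metered over the run
    rate_per_node_s: float  # rental price in dollars per node-second

def derive_metrics(log: RunLog) -> dict:
    """Compute every reported metric from the same log, so all share one provenance."""
    wall_s = sum(log.step_times_s)
    attempted = len(log.step_times_s)
    completed = sum(log.step_ok)
    return {
        "run_id": log.run_id,
        "throughput_steps_s": attempted / wall_s,
        "goodput_steps_s": completed / wall_s,
        "cost_per_good_step": log.nodes * wall_s * log.rate_per_node_s / completed,
        "joules_per_good_step": log.energy_joules / completed,
    }

# Hypothetical log: 4 nodes, five attempted steps, one lost to a worker restart.
log = RunLog("run-A", 4, (5.0, 5.0, 5.0, 5.0, 5.0), (True, True, False, True, True),
             energy_joules=30000.0, rate_per_node_s=0.02)
for key, value in derive_metrics(log).items():
    print(f"{key:>22}: {value}")
```

Because the function accepts a single log and nothing else, there is no way to pair one run's throughput with another run's cost; the shared provenance is enforced by the function signature rather than by discipline.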

Thesis Thread: The Evaluation Vector Travels With Every Method

This book leads with scale-out, and every scale-out method it teaches, data parallelism (Chapter 15), sharded training (Chapter 16), distributed serving (Chapter 24), is judged by the four-dimension vector this section defines, not by accuracy alone. When a later chapter claims a method "scales," it is making a falsifiable statement about speedup, efficiency, cost, and reliability measured together. The discipline introduced here is the contract those claims must satisfy.

6. What This Chapter Builds Beginner

The rest of Chapter 5 develops each dimension of Figure 5.1.1 into a measurable, reproducible practice, and Table 5.1.1 is the map. Read it as the promise of the chapter: by its end you will have a metric and a measurement protocol for every corner of the evaluation vector, plus the methodology to combine them soundly. Each later section assumes the four-dimension framing of this section and the single-configuration rule from its subsection 5, "One Configuration, Co-Computed Metrics."

Table 5.1.1: The sections of Chapter 5, mapped to the evaluation dimension each one makes measurable. Every section inherits the single-configuration, co-computed-metrics rule established here.

Section	What it makes measurable	Dimension
5.2 Speedup, Efficiency, Scalability Curves	Whether adding machines buys proportional speed	Performance
5.3 Throughput, Goodput, Tail Latency / SLOs	Useful work per second under load and failure	Performance, Reliability
5.4 Communication-to-Computation Ratio	How much time is lost moving data versus computing	Performance
5.5 Cost, Utilization, Energy Accounting	The price and watts behind the speed	Cost and energy
5.6 Benchmarking Methodology and Pitfalls	How to measure without fooling yourself	All four
5.7 Reproducible Measurement on Clusters	Getting the same numbers twice on real hardware	All four

With the four dimensions named and the single-configuration rule in place, the natural next question is how to make the performance dimension precise. What exactly is speedup, how does efficiency fall as machines are added, and what shape does the scalability curve take as the communication tax of Chapter 3 asserts itself? That is where the chapter turns next, in Section 5.2.

Research Frontier: Benchmarks That Measure More Than Accuracy (2024 to 2026)

The field's headline benchmarks have moved toward exactly the multi-dimensional verdict this section argues for. MLPerf Training and MLPerf Inference (MLCommons) report time-to-train and time-to-quality on fixed hardware, so a submission is judged by speed-at-a-quality-target rather than quality alone, and the 2024 to 2025 rounds added a dedicated power measurement track that records the energy a run actually consumed. Inference efficiency leaderboards and the LLM-serving comparisons that accompany systems like vLLM increasingly headline tokens per second per dollar and tokens per joule rather than raw throughput, reflecting that cost and energy have become first-class metrics. A parallel academic thread argues for reporting compute and carbon cost alongside accuracy in every paper, pushing "Green AI" style accounting from a footnote into the main table. We adopt the same stance for the rest of the book: a result is reported as a vector, co-computed on one configuration, never as a lone number.

Library Shortcut: Co-Computing the Metrics With One Pass Over the Logs

In Code 5.1.1 we derived speedup, efficiency, and cost by hand from one dictionary of measured times. In practice you keep the raw per-configuration measurements in a small table and let a dataframe library compute every metric in one pass, guaranteeing they share a source and a denominator:

import pandas as pd

# One row per configuration; raw measured quantities only.
df = pd.DataFrame({"nodes": [1, 2, 4, 8], "step_s": [10.00, 7.69, 5.00, 3.85]})

base = df.loc[df.nodes == 1, "step_s"].item()
df["speedup"]    = base / df.step_s
df["efficiency"] = df.speedup / df.nodes
df["cost_step"]  = df.nodes * df.step_s * 0.02   # node-seconds, priced
print(df.to_string(index=False))

Code 5.1.2: The same three metrics as Output 5.1.1, now derived in one vectorized pass from a single table of raw measurements. The roughly ten lines of manual looping collapse to four assignments, and because every metric reads from the same df, they are construct-matched by construction; the library handles the broadcasting and the shared denominator.

Fun Note: The Benchmark Olympics

There is a folk sport in distributed AI called benchmark cherry-picking: run the same job a dozen times, on a dozen slightly different setups, and report from each run only the metric where it happened to shine. Speedup from the run with the warm cache, accuracy from the run with the lucky seed, cost from the cheapest spot-instance hour. Stitched together, the numbers describe a system that would win every event at once, a champion that exists only on the slide. The single-configuration rule is the anti-doping policy: one run, one set of co-computed numbers, every event from the same athlete on the same day.

Exercise 5.1.1: Place the Metric on the Map Conceptual

For each metric, name which of the four dimensions in Figure 5.1.1 it belongs to, and state one thing it cannot tell you that another dimension would: (a) tokens served per second by an LLM endpoint; (b) the relative error between a distributed gradient and a single-machine reference gradient; (c) dollars of cloud spend per completed training step; (d) the fraction of requests still answered within the latency target while one of eight replicas is restarting. Then explain why reporting only (a) would flatter a system that is failing on (d).

Exercise 5.1.2: Extend the Honest Table Coding

Starting from Code 5.1.1, add a column for throughput in steps per second (the reciprocal of step time) and a column for cost per node-hour-equivalent of useful work, defined as cost per step divided by efficiency. Recompute all columns from the same measured dictionary so they remain construct-matched. Identify the node count that minimizes cost per step, the one that maximizes efficiency, and the one that maximizes raw speedup, and explain in two sentences why a single accuracy number could not have distinguished any of them.

Exercise 5.1.3: Find the Invalid Comparison Analysis

A report claims method B beats method A by citing: B's speedup of 6.1x (from a 16-node run on cluster X), A's speedup of 3.4x (from an 8-node run on cluster Y), B's accuracy of 0.92 (seed 7), and A's accuracy of 0.90 (seed 3). Every individual number is reproducible from its source log. Explain precisely why a number-by-number audit that confirms each figure still fails to validate the comparison, and write down the minimal set of quantities that would have to be co-computed on one configuration per method to make the claim sound. Connect your answer to the single-configuration rule from subsection 5 of this section and to the methodology of Section 5.6.