Part I: Foundations of Distributed AI
Chapter 5: Evaluating Distributed AI Systems

Throughput, Goodput, and Tail Latency / SLOs

"I served ten thousand requests a second. Nobody mentioned that nine thousand of them arrived after the user had already left."

A Replica Proud of the Wrong Number
Big Picture

A serving system that reports high throughput while violating its latency target is not fast; it is failing quickly. The metric that tells the truth is goodput: the rate of requests that complete within the latency budget, with everything slower counted as a miss rather than a success. Raw throughput counts completions; it climbs happily as you push more load through a fleet, right up to and past the point where the queues fill and tail latency explodes. Goodput counts only the completions a user actually accepted, so it rises with offered load and then collapses the moment the system saturates. This section defines throughput, goodput, and the tail-latency percentiles (p50/p95/p99) that pin down a Service Level Objective, then runs a load test that makes goodput collapse while throughput keeps climbing, the single most important shape in serving evaluation.

Section 5.2 evaluated distributed training with speedup and efficiency curves: how much faster does the job finish as we add workers? Serving is a different regime. A trained model is deployed behind a fleet of replicas that answer a stream of requests, possibly forever, and the question is no longer "how long until done" but "how many requests per second can we serve, and how quickly does each one come back?" Those two quantities, a rate and a latency, are in tension, and the discipline of this section is to measure them together rather than one at a time. We build directly on the per-node serving metrics introduced in Section 1.6 (throughput, latency, cost, reliability) and on the tail-latency analysis of Section 3.4, now turning them into a measurement procedure for a real load test.

1. Throughput Is What You Can Measure; Goodput Is What You Meant Beginner

Throughput is the simplest serving metric: the number of requests (or, for token-generating models, the number of tokens) a system completes per unit time. If a fleet returns 8,000 responses in one second, its throughput is 8,000 requests per second. The metric is honest about one thing only, the rate of completions, and it is silent about whether any of those completions arrived in time to be useful. That silence is the problem. A response to an interactive query that takes four seconds is, for most products, indistinguishable from a failure: the user has refreshed, retried, or left. Counting it as a success inflates the number that leadership reads on a dashboard while the product quietly degrades.

Goodput repairs this by attaching the latency budget to the count. Fix a latency Service Level Objective (SLO), say "99% of requests return within 200 ms." Goodput is the rate of requests that meet the SLO; every request that returns too late, errors out, or is dropped contributes to throughput but not to goodput. Formally, over a measurement window of length $T$ with completed requests indexed by $i$ and end-to-end latencies $\ell_i$, define

$$\text{throughput} = \frac{1}{T}\sum_{i} 1, \qquad \text{goodput} = \frac{1}{T}\sum_{i} \mathbf{1}[\ell_i \le \ell_{\text{SLO}}],$$

where $\ell_{\text{SLO}}$ is the latency budget and $\mathbf{1}[\cdot]$ is one when the request met it and zero otherwise. The two numbers coincide while the system is comfortably below capacity, because nearly every request meets the budget. They diverge sharply at saturation, where requests still complete (throughput holds) but complete too late to count (goodput collapses). That divergence is the entire reason goodput exists as a separate metric.

Key Insight: Goodput Is the Only Serving Number That Cannot Lie by Omission

Throughput rewards a saturated system for producing late responses, because it counts completions regardless of when they arrive. Goodput refuses that reward: a response that misses the SLO is worth exactly as much as a response that never came, namely zero. This makes goodput the honest capacity of a serving system, the rate at which it does useful work, and it is almost always lower, sometimes dramatically lower, than the throughput a benchmark will happily print. When a vendor quotes throughput without an attached latency SLO, the number is unfalsifiable: you cannot tell whether it describes a fast system or a slow one running flat out.

2. The Tail Is the Service: p50, p95, p99 Intermediate

To turn "fast enough" into a measurable SLO you need a single number that summarizes a whole distribution of latencies, and the mean is the wrong choice. Latency distributions in serving are heavy-tailed: most requests are quick, but a minority wait behind a slow neighbor, a garbage-collection pause, a cache miss, or a straggler replica, and those slow requests are exactly the ones that drive users away. The mean hides them by averaging them against the fast majority. Percentiles expose them. The p99 latency is the value below which 99% of requests fall; equivalently, one request in a hundred is slower than the p99. We report a small ladder of them: p50 (the median, the typical experience), p95 (the edge of the common case), and p99 (the tail that defines the SLO).

Tail percentiles matter more in distributed serving than anywhere else, because a single user request often fans out to many backends and waits for the slowest. If one request touches 100 shards in parallel and each shard independently has a 1% chance of being slow, the probability that at least one shard is slow, and therefore the whole request is slow, is $1 - 0.99^{100} \approx 63\%$. The fan-out turns a rare per-shard tail into a common per-request tail, which is why Section 3.4 insisted that tail latency, not average latency, is the quantity a distributed system must control. An SLO is therefore written against a tail percentile, "p99 latency under 200 ms," not against the mean, and a system is judged compliant only while that percentile stays under budget.

Fun Note: The Median User Does Not Exist

If you optimize for the p50, you are tuning the experience of a user who, by construction, has it better than half of everyone else. The interesting engineering always lives in the p99 and beyond, where the request that fanned out to a thousand shards is held hostage by the one shard that decided to garbage-collect. The tail is not an edge case you can ignore; at scale, the tail is most of your users having one bad request among many.

3. The Collapse: Why Goodput and Throughput Part Ways Intermediate

The defining behavior of a serving system under rising load is this: throughput increases monotonically until it plateaus at capacity, while goodput rises with it, peaks, and then collapses. The mechanism is queueing. Below capacity, requests rarely wait, so latencies are small and almost every request meets the SLO; throughput and goodput are nearly equal and both track the offered load. As offered load approaches the service rate, the queue stops draining between arrivals, waiting time grows without bound, and tail latency shoots past the SLO. Now the system still completes requests, indeed it completes them as fast as it physically can, so throughput stays pinned near capacity, but those completions arrive late, so goodput falls off a cliff. The two curves, equal at low load, separate violently at saturation.

The code below makes this concrete. It simulates a single replica as a first-come-first-served queue with a fixed service rate, sweeps the offered load from well below capacity to well above it, and for each load level reports throughput, goodput at a 200 ms SLO, and the p50/p95/p99 latencies. We run it and read the numbers off the real output.

import numpy as np

# A load test against a single replica modeled as an FCFS queue: one worker
# drains a queue at a fixed service rate; offered load is the arrival rate.
# We sweep arrival rate, simulate per-request end-to-end latency (wait +
# service), then report throughput, goodput at an SLO, and the tail.

rng = np.random.default_rng(7)
SLO_MS = 200.0                 # a request "counts" only if it finishes within this
MU = 500.0                     # service rate: 500 req/sec capacity of the replica
SERVICE_MS = 1000.0 / MU       # mean service time per request, in ms
WINDOW_S = 20.0                # length of each load-test window, in seconds

def run_window(lam):
    n = rng.poisson(lam * WINDOW_S)                                    # Poisson arrivals
    if n == 0:
        return 0.0, 0.0, (0.0, 0.0, 0.0)
    arrivals = np.sort(rng.uniform(0.0, WINDOW_S, n)) * 1000.0          # ms
    service = rng.exponential(SERVICE_MS, n)                           # variable work, ms
    finish = np.empty(n)
    free_at = 0.0                                                      # next idle instant
    for i in range(n):
        start = max(arrivals[i], free_at)                             # wait for the server
        finish[i] = start + service[i]
        free_at = finish[i]
    latency = finish - arrivals                                       # end-to-end, ms
    throughput = n / WINDOW_S                                         # all completions / sec
    goodput = (latency <= SLO_MS).sum() / WINDOW_S                    # SLO-meeting / sec
    p = np.percentile(latency, [50, 95, 99])
    return throughput, goodput, (p[0], p[1], p[2])

print(f"single replica: capacity={MU:.0f} req/s, SLO=p99<{SLO_MS:.0f}ms, window={WINDOW_S:.0f}s")
print(f"{'offered':>8} {'thrput':>8} {'goodput':>8} {'p50':>7} {'p95':>8} {'p99':>9}")
print("-" * 56)
for lam in [100, 200, 300, 400, 450, 480, 500, 520, 560]:
    thr, good, (p50, p95, p99) = run_window(lam)
    print(f"{lam:>8} {thr:>8.0f} {good:>8.0f} {p50:>6.0f}m {p95:>7.0f}m {p99:>8.0f}m")
Code 5.3.1: A minimal load-test harness. A single replica drains a queue at 500 requests per second; the sweep pushes offered load from 100 up to 560 and records throughput, goodput at a 200 ms SLO, and the latency tail at each level.
single replica: capacity=500 req/s, SLO=p99<200ms, window=20s
 offered   thrput  goodput     p50      p95       p99
--------------------------------------------------------
     100      101      101      2m       8m       12m
     200      202      202      2m      10m       15m
     300      298      298      4m      15m       23m
     400      398      398      7m      31m       47m
     450      455      455     14m      63m      101m
     480      483      448     64m     212m      249m
     500      506       72    367m     630m      656m
     520      518      123    524m     925m      968m
     560      560       82   1287m    2379m     2499m
Output 5.3.1: Real output. Below 450 req/s, throughput and goodput are equal and the tail stays under the SLO. At 480 the p95 has already breached 200 ms and goodput dips below throughput. At and beyond 500 (capacity), throughput holds near 500 to 560 while goodput collapses to roughly 72 to 123: the replica is completing requests as fast as it can, but almost all of them arrive too late to count.

Read the two columns side by side. Up to 450 requests per second the throughput and goodput columns are identical, because the p99 sits comfortably under the 200 ms SLO. At 480 the tail crosses the budget and the columns separate. At 500 (the capacity) and above, throughput stays pinned in the 500s while goodput craters to under 130: the system is busier than ever and useful by almost no measure. This is the collapse, and it is the reason a serving fleet must be sized and load-balanced to keep offered load in the region where the two curves still coincide. The figure below renders the same story as three curves against load.

Offered load (requests per second) Rate served saturation (capacity) SLO breached above here throughput goodput peak useful rate
Figure 5.3.1: The signature shape of serving under load. Throughput (orange) rises and plateaus at the replica's capacity. Goodput (green) tracks throughput while the latency tail stays under the SLO (dashed red line), peaks just before saturation, then collapses as the queue fills and nearly every completion arrives too late to count. The gap between the curves past the vertical saturation line is the volume of work the system does that no user accepts. The shape mirrors the 480-to-500 transition in Output 5.3.1.

4. LLM Serving Splits the SLO in Two: TTFT and Inter-Token Latency Advanced

For a request-response model, one latency number tells the whole story. Large language models break this, because a single request produces a stream of tokens over time, and a user perceives two different delays. The first is time-to-first-token (TTFT): how long after submitting the prompt before any output appears. The second is inter-token latency (ITL), sometimes reported as its reciprocal, tokens per second: once generation starts, how quickly do subsequent tokens arrive. A chat that shows its first word in 300 ms and then streams smoothly feels fast; one that shows nothing for three seconds and then dumps the answer feels broken, even if total wall-clock latency is identical. Two distinct experiences hide behind one total-latency number, so LLM serving defines two distinct SLOs.

The split has a systems cause worth naming. TTFT is dominated by the prefill phase, where the model processes the entire prompt in one compute-heavy pass; inter-token latency is governed by the decode phase, where each new token is produced one at a time, bottlenecked by memory bandwidth and by how many concurrent requests share the batch. The two phases stress different resources and respond to different optimizations, which is why a serving stack reports goodput against both SLOs separately: the fraction of requests whose TTFT meets its budget, and the fraction whose steady-state inter-token latency meets its budget. A request that streams its first token late but then keeps pace, or one that starts promptly but then stalls, fails a different SLO, and conflating them hides the actual defect. We develop the prefill/decode split, continuous batching, and disaggregated serving that target these two SLOs in Chapter 24.

Practical Example: The Chatbot That Passed Throughput and Failed Users

Who: An inference platform engineer running an LLM assistant behind a fleet of GPU replicas.

Situation: A capacity review reported the fleet sustaining 12,000 tokens per second per replica, comfortably above the planned traffic, so the team approved a higher request-admission limit.

Problem: After the limit rose, the aggregate token throughput held steady, but support tickets about "the assistant freezes before answering" climbed sharply, and the dashboards showed nothing wrong.

Dilemma: Trust the throughput number that said there was headroom and chase the complaints as client bugs, or suspect that the single throughput figure was hiding a latency failure the metric could not see.

Decision: They instrumented TTFT and inter-token latency separately and computed goodput against a "p99 TTFT under 500 ms" SLO instead of trusting aggregate tokens per second.

How: Per-request timestamps were captured at admission, first-token emission, and each subsequent token; goodput was recomputed as the fraction of requests meeting the TTFT SLO over rolling one-minute windows.

Result: Token throughput was indeed flat, but TTFT goodput had collapsed from 98% to 41%: the raised admission limit had pushed requests into a deep prefill queue, so first tokens arrived seconds late while the decode pipeline kept the token count high. Lowering the admission limit to the goodput knee restored TTFT compliance with a negligible throughput cost.

Lesson: Aggregate token throughput is the LLM analogue of raw request throughput; it can stay flat while the SLO that users feel, TTFT, is in free fall. Measure goodput against the per-phase latency objective, not the bulk rate.

5. Running an Honest Load Test Intermediate

The simulation in Code 5.3.1 teaches the shape of the collapse, but a real evaluation measures a live service, and two methodological traps routinely produce dishonest numbers. The first is the closed-loop trap. A naive load generator sends a request, waits for the response, then sends the next; when the server slows down, the generator slows down with it, so it never actually offers the load that a real population of independent users would. The arrival rate becomes a function of the server's own latency, which masks the saturation collapse entirely. An honest load test is open-loop: requests are issued on a schedule (for example, Poisson arrivals at a fixed rate) independent of how fast responses come back, exactly as Code 5.3.1 generates arrivals before simulating service. This is the difference between measuring what your users will do and measuring what your slow server permits them to do.

The second trap is coordinated omission, named by Gil Tene: when a request is delayed because the server stalled, the requests queued behind it are also delayed, but a closed-loop tester silently drops them rather than recording their inflated latencies, so the reported p99 looks far better than reality. The fix is to measure latency from each request's intended send time, not from when the load generator actually managed to send it, so that a stall contaminates every request it really delayed. A load test that ignores coordinated omission can report a passing p99 for a system that, under honest accounting, breaches its SLO by an order of magnitude. We return to these and other measurement pitfalls as a dedicated topic in Section 5.6.

Library Shortcut: Open-Loop Load Testing With wrk2 or k6

You do not write the timing harness of Code 5.3.1 by hand for a real service. Open-loop generators built to avoid coordinated omission do it for you. wrk2 issues requests at a fixed rate and corrects for coordinated omission natively; a one-line invocation reports the full latency distribution:

# Drive the endpoint at a constant 2000 req/s for 60s with 8 threads,
# 200 open connections, and print the latency percentile spectrum.
wrk2 -t8 -c200 -d60s -R2000 --latency http://service:8080/predict
#   -R2000  : OFFERED rate, held fixed regardless of server speed (open-loop)
#   --latency: print p50/p90/p99/p99.9 corrected for coordinated omission
Code 5.3.2: The whole timing-and-percentile harness of Code 5.3.1 reduces to one wrk2 command. The tool fixes the offered rate, corrects for coordinated omission, and prints the tail percentiles; modern alternatives such as k6 and Locust add scripting and let you compute goodput directly by thresholding the per-request latencies against your SLO.

6. Goodput as a Sizing and Scheduling Target Intermediate

Once goodput is the metric, fleet sizing becomes a clean question: provision enough replicas that the offered load per replica stays left of the goodput knee, the load at which the tail breaches the SLO. From Output 5.3.1, a single replica with a 200 ms SLO holds full goodput up to roughly 450 requests per second, so a fleet expecting 9,000 useful requests per second needs about 20 replicas with headroom, not the 18 that raw capacity (500 each) would suggest. Sizing to raw throughput puts the fleet exactly at the saturation cliff, where a small traffic spike tips goodput off the edge. Sizing to goodput keeps a margin between offered load and the knee, which is what reliability under bursty real traffic actually requires.

This reframing also changes what a good scheduler optimizes. A scheduler that maximizes raw throughput will happily pack a replica to 100% utilization and past the SLO, trading goodput for a bigger but useless completion count. A goodput-aware scheduler instead admits, routes, and batches requests to maximize the SLO-meeting rate, shedding or deferring load that would only produce late responses. That objective, maximize goodput rather than throughput or utilization, is the connective tissue between this section and the serving-systems chapters: Chapter 3 gives the performance models that predict the knee, and the inference chapters of Part V build the admission control, batching, and autoscaling that hold a fleet at it.

Research Frontier: Goodput-Optimized and SLO-Aware Serving (2024 to 2026)

Goodput has moved from a measurement to an explicit optimization target in recent serving research. For LLMs, DistServe (Zhong et al., OSDI 2024) disaggregates the prefill and decode phases onto separate GPU pools so that the TTFT and inter-token SLOs can be met independently, reporting large gains in goodput per GPU over collocated serving; Sarathi-Serve (Agrawal et al., OSDI 2024) introduces chunked-prefill and stall-free batching to hold inter-token latency under budget while keeping throughput high. On the scheduling side, SLO-aware schedulers such as those in the lineage of Clockwork and the more recent Llumnix (Sun et al., OSDI 2024) reschedule and migrate in-flight requests across replicas to defend tail-latency targets under load imbalance, and a growing line of work treats admission control as goodput maximization under an SLO constraint rather than throughput maximization. The common thread is the one this section argues: the quantity worth maximizing is SLO-meeting work, and a serving stack designed around goodput beats one designed around raw throughput precisely where it matters, at the edge of saturation.

We now have the honest serving metrics: throughput as the raw completion rate, goodput as the completion rate that meets a latency SLO, the p50/p95/p99 tail that defines the SLO, and the per-phase TTFT and inter-token objectives that LLMs demand. The collapse of goodput at saturation, while throughput keeps climbing, is the shape to carry into every serving evaluation. The next section asks where the time and the money go inside a distributed step by measuring the communication-to-computation ratio, the metric that explains why adding machines stops helping. That analysis begins in Section 5.4.

Exercise 5.3.1: Read the Collapse Conceptual

Using Output 5.3.1, identify the largest offered load at which goodput still equals throughput, and the smallest offered load at which goodput has fallen below half of throughput. Explain in terms of the p95 and p99 columns why the goodput begins to fall before throughput plateaus, and why a fleet sized to the raw capacity (500 req/s per replica) rather than to the goodput knee is one traffic spike away from a user-visible outage.

Exercise 5.3.2: Two SLOs, One Stream Coding

Extend Code 5.3.1 to model an LLM replica that, for each admitted request, first incurs a prefill delay (proportional to a random prompt length) and then emits 64 tokens with a per-token decode delay drawn from an exponential distribution. Compute two goodputs: one against a "TTFT under 500 ms" SLO and one against a "mean inter-token latency under 40 ms" SLO. Sweep the offered load and show that the two goodputs collapse at different load levels. Explain which phase, prefill or decode, saturates first under your parameters and what that implies for admission control.

Exercise 5.3.3: The Cost of Coordinated Omission Analysis

Consider a service whose server stalls for 1 second once per 10-second window and otherwise responds in 5 ms, under a steady 1,000 req/s open-loop offered rate. Estimate the true p99 latency when every request delayed by the stall is recorded from its intended send time. Then estimate the p99 a naive closed-loop tester would report if it simply pauses during the stall and resumes afterward, recording only the requests it actually managed to send. Quantify the gap between the two p99 values and argue why a load test that ignores coordinated omission can certify an SLO that the live system violates.