Section 3.4: Latency, Throughput, and Tail Latency

"They asked me for my average response time and I gave them a lovely number. Then a thousand of them fanned out at once and discovered I have a slow side."
A Replica That Forgot About Its p99

Big Picture

Latency is how long one request waits; throughput is how many requests finish per second; and the two are tied together by a law so simple it fits on a napkin and so unforgiving that ignoring it sinks production serving systems. A distributed AI service does not get to optimize one of these in isolation. Batching more requests onto a GPU raises throughput and raises per-request latency at the same time. Replicating a model across a fleet raises throughput but does nothing for the latency of a single slow call. And once a user request fans out to dozens of internal services, the experience the user feels is not the average of those services but the slowest of them, so the rare tail event becomes the common case. This section gives you the formal model, centered on Little's law, that turns all of this from folklore into arithmetic you can compute before you provision a single machine.

Section 1.6 introduced throughput, latency, cost, and reliability as the four quantities every distributed AI system is judged on, and gave the intuition for why they trade against one another. This section makes two of them, latency and throughput, formal. We will derive the one equation that links them, Little's law, apply it to a model-serving queue, then confront the part of the story that averages hide: the tail. In a system where a single user action triggers many internal requests, the tail latency, not the mean, is what determines whether the system feels fast. Everything here is measurement you can do on paper or in a few lines of code, and it sets up the scaling laws of Section 3.5 and the communication-cost models later in this chapter.

1. Two Numbers That Are Not the Same Number Beginner

Latency is a duration: the time from when a request arrives to when its response is complete, measured in milliseconds or seconds for one request. Throughput is a rate: the number of requests, tokens, or images the system completes per unit time, measured in requests per second or tokens per second. They answer different questions. Latency answers "how long will this one user wait?" Throughput answers "how many users can the system serve at once?" A self-driving car's perception stack lives or dies on latency; a nightly batch job that embeds a billion documents lives or dies on throughput.

The reason these get confused is that they often move together when you add hardware, and then suddenly stop. Doubling the number of model replicas roughly doubles throughput, because twice as many requests can be in flight. It does nothing for the latency of any single request, because that request still runs on exactly one replica at its own speed. Conversely, making the model itself faster (a smaller model, a better kernel) cuts latency and raises throughput at once. The art of serving is knowing which lever moves which number, and the first tool for that is a law that holds for any stable system, regardless of its internals.

Key Insight: Latency and Throughput Are Coupled, Not Interchangeable

Throughput is a property of the whole system (how much work flows through it per second); latency is a property of one request's journey through it. Adding replicas buys throughput without touching single-request latency. Speeding up the per-request path buys both. Batching, as we will see, buys throughput by spending latency. Any time someone quotes a single "performance number," ask which of the two it is, because optimizing one can silently wreck the other.

2. Little's Law: Concurrency = Arrival Rate × Latency Intermediate

Little's law is the bridge between latency and throughput, and it is remarkable because it assumes almost nothing: any system that is stable (nothing accumulates without bound) and observed long enough obeys it, no matter how requests are routed, batched, or scheduled inside. Let $L$ be the average number of requests inside the system at any instant (the concurrency, or "in-flight" count), let $\lambda$ be the average arrival rate of requests, and let $W$ be the average time a request spends in the system (its latency). Then

$$L = \lambda \, W.$$

The derivation is a counting argument. Watch the system for a long window of length $T$. Every request that is inside the system contributes time to a running total of "request-seconds." Count that total two ways. From the throughput side, $\lambda T$ requests pass through, each spending on average $W$ seconds inside, so the total request-time accumulated is $\lambda T \cdot W$. From the concurrency side, at every instant there are on average $L$ requests present, each accruing one second per second, so over $T$ seconds the total request-time is $L \cdot T$. The two counts measure the same area under the same curve, so $L\,T = \lambda\,W\,T$, and dividing by $T$ gives $L = \lambda W$. Stability is the only fine print: if arrivals outrun departures, the queue grows without bound, $W \to \infty$, and there is no steady state to average over.

Rearranged, the law is a planning tool. Since $\lambda$ (throughput in steady state, because arrivals equal departures) equals $L / W$, you can read off any one quantity from the other two. If you know how many requests your system can hold concurrently before it saturates ($L$) and how long each takes ($W$), you know the maximum throughput it can sustain. That single rearrangement, $\lambda = L / W$, is how you size a serving fleet without guessing.

Practical Example: Sizing an Embedding Service With One Equation

Who: A platform engineer standing up a text-embedding microservice for a retrieval pipeline.

Situation: A product team needs to embed 12,000 short documents per second at peak, each call hitting a GPU-backed model server.

Problem: Nobody knew how many replicas to provision, and the cloud bill scaled linearly with the guess.

Dilemma: Over-provision for safety and waste money on idle GPUs, or under-provision and watch the queue blow up at peak, sending $W$ toward infinity exactly when Little's law warns the system has left its stable regime.

Decision: They measured before provisioning. One replica processed a request in $W = 40$ ms of latency and could hold $L = 32$ requests concurrently (one in-flight batch plus queued slots) before latency degraded.

How: By Little's law, a single replica sustains $\lambda = L / W = 32 / 0.040 = 800$ requests per second at the edge of stability. To hit 12,000 per second they needed $12000 / 800 = 15$ replicas, plus headroom, so they deployed 18.

Result: Peak traffic landed at a measured concurrency near 30 per replica, just under the saturation point, with stable 40 ms latency and no runaway queue.

Lesson: You do not need a simulator to size a stateless serving fleet; you need $L$, $W$, and the discipline to leave a stability margin so the law keeps holding.

3. The Batching Trade-Off: Buying Throughput With Latency Intermediate

GPUs are throughput machines: they amortize their fixed per-call overhead over many examples processed in parallel. Feeding a model one request at a time wastes most of that parallelism, so serving systems batch, accumulating several arriving requests and running them through the model together. Batching is the single most important lever in AI serving, and it is the cleanest illustration of the latency/throughput trade-off, because it spends one to buy the other.

Model the cost of running a batch of size $B$ as a fixed overhead $c_0$ plus a marginal cost $c_1$ per request in the batch, so the batch takes $c_0 + c_1 B$ to process. Throughput is the batch size divided by the batch time,

$$\text{throughput}(B) = \frac{B}{c_0 + c_1 B} \;\xrightarrow{\;B \to \infty\;}\; \frac{1}{c_1},$$

which rises with $B$ and saturates at $1/c_1$ once the fixed overhead is fully amortized. But a request that arrives just as a batch begins filling must wait for the batch to fill and then for the whole batch to run. The larger the batch, the longer that wait, so latency grows roughly linearly with $B$. Bigger batches mean higher throughput and higher latency, always, and the right batch size is wherever throughput is "enough" and latency is still under budget, never simply the largest batch that fits in memory. The runnable demo in the next section makes this trade visible and verifies Little's law on the same simulated traffic.

Library Shortcut: Continuous Batching Instead of Hand-Tuned Static Batches

The static picture above, where a batch fills, runs, and empties, leaves the GPU idle while it waits for the slowest request in the batch to finish generating. Modern LLM servers replace it with continuous batching (also called in-flight batching), which admits and retires requests token by token so a finished sequence frees its slot immediately for a waiting one. You do not implement this yourself; an inference engine does it for you:

# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # engine owns the batching loop
params = SamplingParams(max_tokens=128)

# Hand vLLM the whole burst; it forms and reshapes batches per decode step,
# keeping GPU utilization high without a fixed batch size or a fill timeout.
prompts = ["Summarize: " + doc for doc in incoming_documents]
outputs = llm.generate(prompts, params)               # continuous batching, internally

Code 3.4.1: A few lines hand the entire request burst to vLLM, which runs continuous batching internally; the per-step batch formation, slot reuse, and paged KV-cache management that you would otherwise hand-tune are the engine's job. We unpack distributed continuous batching across nodes in Chapter 24.

4. Measuring the Distribution: p50, p95, p99, and Little's Law Intermediate

Averages lie about latency. A service can have a perfectly acceptable mean while a meaningful fraction of users wait far longer, because latency distributions are right-skewed: most requests are fast, a few are very slow, and the slow ones pull the mean up just enough to hide how bad they are. The honest way to report latency is by percentiles. The p50 (median) is the latency that half of requests beat; the p95 is beaten by 95 percent, so one request in twenty is slower; the p99 is beaten by 99 percent, so one in a hundred is slower. Service-level objectives are almost always written against a high percentile ("p99 under 200 ms"), never the mean.

The demo below simulates a serving queue under Poisson arrivals with static batching, then does two things at once on the same run: it computes the latency percentiles, and it checks Little's law numerically by comparing the measured average concurrency $L$ against the product $\lambda W$ computed from the measured arrival rate and mean latency. If the law holds, the two should match to within sampling noise.

import numpy as np

rng = np.random.default_rng(7)
LAM = 800.0          # target arrival rate, requests/sec (Poisson)
N = 200_000          # requests to simulate
C0, C1 = 0.004, 0.0009   # batch cost = C0 + C1*B seconds (fixed + per-request)
BATCH = 16           # static batch size

# Poisson arrivals: inter-arrival gaps are exponential with mean 1/LAM.
gaps = rng.exponential(1.0 / LAM, size=N)
arrival = np.cumsum(gaps)

start = np.empty(N)      # when each request's batch begins running
finish = np.empty(N)     # when its response is complete
server_free = 0.0        # single server; next time it can start a batch
for b0 in range(0, N, BATCH):
    b1 = min(b0 + BATCH, N)
    B = b1 - b0
    # The batch can start only once it is full (last arrival in) and the server is free.
    begin = max(arrival[b1 - 1], server_free)
    dur = C0 + C1 * B
    start[b0:b1] = begin
    finish[b0:b1] = begin + dur
    server_free = begin + dur

latency = finish - arrival                       # W per request: wait + service
W = latency.mean()
lam_meas = N / (arrival[-1] - arrival[0])         # measured throughput, req/sec

# Average concurrency L: integrate the in-flight count over time. Each request
# is "in system" on [arrival, finish); count starts (+1) and ends (-1), sort, sweep.
events = np.concatenate([arrival, finish])
delta = np.concatenate([np.ones(N), -np.ones(N)])
order = np.argsort(events, kind="mergesort")
events, delta = events[order], delta[order]
inflight = np.cumsum(delta)                       # number in system after each event
dt = np.diff(events)
L_meas = np.sum(inflight[:-1] * dt) / (events[-1] - events[0])   # time-average

p = np.percentile(latency * 1000, [50, 95, 99])   # in milliseconds
print(f"requests              : {N}")
print(f"p50 / p95 / p99 (ms)  : {p[0]:.2f} / {p[1]:.2f} / {p[2]:.2f}")
print(f"mean latency W (ms)   : {W*1000:.2f}")
print(f"throughput lambda     : {lam_meas:.1f} req/s")
print(f"Little's law  L = lam*W: {lam_meas*W:.3f}")
print(f"measured concurrency L : {L_meas:.3f}")
print(f"relative gap          : {abs(L_meas - lam_meas*W)/L_meas:.2e}")

Code 3.4.2: A single-server batching queue under Poisson arrivals. The same run yields the latency percentiles and an independent check of Little's law: the time-averaged concurrency $L$ is compared against $\lambda W$ from the measured rate and mean latency.

requests              : 200000
p50 / p95 / p99 (ms)  : 32.17 / 48.27 / 59.50
mean latency W (ms)   : 32.75
throughput lambda     : 800.3 req/s
Little's law  L = lam*W: 26.209
measured concurrency L : 26.207
relative gap          : 7.51e-05

Output 3.4.2: Little's law holds to a relative gap of $7.5 \times 10^{-5}$ ($L$ and $\lambda W$ agree once enough requests are averaged), confirming the counting argument. The p99 latency of 60 ms is nearly double the 32 ms median, the right-skew that averages conceal.

Two facts stand out in the output. First, Little's law is not folklore here; the measured concurrency and $\lambda W$ agree to five significant figures over the long observation window, exactly as the counting argument promised, because the simulated system is stable. Second, the p99 is nearly twice the median. A dashboard reporting only the 33 ms mean would mislead an on-call engineer into believing every user is served alike, when in fact one request in a hundred waits about 60 ms, almost double the typical wait, and as we are about to see, that slower request is the one that defines the user experience under fan-out.

Figure 3.4.1: A right-skewed serving-latency distribution, the shape produced by Code 3.4.2. The median (p50) sits near the bulk of fast requests, but the p95 and especially the p99 sit far out in the long right tail. Reporting only the mean or median understates what a real fraction of users experiences, which is why service-level objectives are written against the high percentiles marked here.

5. Why Tails Dominate Under Fan-Out: The Max of Many Advanced

Here is the fact that makes tail latency the central concern of distributed serving rather than a footnote. When a single user request fans out to many internal services and must wait for all of them to respond, the latency the user feels is the maximum of the individual latencies, not the average. And the maximum of many samples is governed by the tail of the distribution, so a rare slow response that almost never matters for one call becomes almost certain to matter when you make many calls at once.

Make it quantitative. Suppose each of $n$ independent backend calls is fast with probability $q$ (say its latency is under the SLO with probability $q = 0.99$, so each has a 1 percent chance of being slow). The probability that all $n$ are fast is $q^{n}$, so the probability that the fanned-out request is slow, dragged out by at least one straggler, is

$$P(\text{request slow}) = 1 - q^{\,n}.$$

With a 1 percent per-call slow probability ($q = 0.99$), a single call is slow 1 percent of the time, but a fan-out to $n = 100$ services is slow with probability $1 - 0.99^{100} \approx 0.63$. The event that is rare for one backend, a p99 stall, has become the typical outcome for the aggregate request. This is why a service whose mean latency looks excellent can still feel sluggish at the product level: the user sits behind a fan-out, and the fan-out surfaces the tail of every dependency at once. The remedy is to attack the tail directly (hedged requests that send a duplicate to a second replica after a short delay, tail-tolerant routing, and aggressive timeouts), techniques the distributed-inference systems of Chapter 23 build into the serving layer, and which the evaluation discipline of Chapter 5 teaches you to measure with rigor.

Key Insight: At Scale You Optimize the Tail, Not the Mean

When a request depends on $n$ backends and waits for the slowest, its latency is the max of $n$ draws, and $P(\text{slow}) = 1 - q^{\,n}$ grows fast with $n$. A 99th-percentile event in one service becomes the majority outcome after a fan-out of a hundred. The practical consequence: cutting your p99 is worth more than cutting your mean, because the p99 of a dependency is what the next layer's fan-out will repeatedly hit.

6. Pipelining: Hiding Latency Behind Throughput Intermediate

If batching trades latency for throughput, pipelining recovers throughput without paying the full latency of each stage in series. Split the work into stages (preprocess, run the model, postprocess) and let different requests occupy different stages at the same time, the way an assembly line keeps every station busy. The latency of one request is still the sum of its stage times, because that request must pass through every stage. But the throughput is set by the slowest stage alone, not by the sum, because once the pipeline is full a new request finishes every time the bottleneck stage completes.

Formally, for a pipeline of stages with times $t_1, t_2, \ldots, t_k$, one request's latency is $\sum_i t_i$, while the steady-state throughput is $1 / \max_i t_i$, the rate of the bottleneck stage. Pipelining does not make any single request faster; it overlaps requests so the system's throughput is no longer hostage to the total path length. This same idea, overlapping independent work to hide a long serial cost, returns as pipeline parallelism in large-model training (Chapter 16), where a model too big for one device is split into stages and micro-batches flow through them, and as communication/computation overlap in Chapter 4, where an all-reduce is hidden behind the backward pass.

Research Frontier: Disaggregating the LLM Latency Profile (2024 to 2026)

Large-model serving has a split personality that the latency/throughput model predicts: the prefill phase (reading the prompt) is compute-bound and high-throughput, while the decode phase (emitting tokens one at a time) is memory-bandwidth-bound and dominates per-token latency. A vigorous 2024 to 2025 research line, prefill/decode disaggregation, runs the two phases on separate pools of machines so a long prefill cannot stall another request's decode and inflate its tail. Production-grade designs in this lineage (DistServe, Splitwise, and the disaggregated path now in vLLM and SGLang) report large gains in goodput under strict tail-latency SLOs by tuning each pool independently. The metrics they optimize, time-to-first-token and inter-token latency at the p99, are exactly the tail quantities of this section, and we treat the distributed serving mechanics in Chapter 24.

Fun Note: The Tail That Ate the Average

A classic operational sketch: a service team celebrates a 50 ms mean and ships it. A week later, support tickets pile up about "slowness," yet the dashboard still reads 50 ms. The catch is that the product page makes 40 backend calls in parallel and waits for all of them, so even a tidy 1 percent p99 means roughly a third of page loads catch at least one straggler. The mean never moved; the user experience fell off a cliff. The fix was not a faster mean but a tighter p99 plus a hedge on the slowest call. The lesson sticks because the math, $1 - 0.99^{40} \approx 0.33$, is short enough to do in your head.

We now have the formal core of serving performance: latency and throughput as distinct quantities, Little's law binding them, batching as the trade between them, percentiles as the honest way to report latency, the max-of-many argument that makes tails dominate under fan-out, and pipelining as the way to hide serial latency behind throughput. The next section turns from "how fast is one configuration?" to "how does performance change as I add machines?", deriving the two laws, Amdahl's and Gustafson's, that bound the speedup any parallel system can achieve. That story begins in Section 3.5.

Exercise 3.4.1: Read the Law Backwards Conceptual

A GPU inference service measures an average of $L = 24$ requests in flight and a mean latency of $W = 60$ ms. (a) Use Little's law to compute the sustained throughput $\lambda$. (b) The team wants to double throughput. Explain, using the law, the two distinct ways to do it (change $L$ or change $W$) and what each one costs in hardware or model engineering. (c) Why does simply telling clients to send requests faster, without changing $L$ or $W$, not raise sustained throughput and instead risk leaving the stable regime?

Exercise 3.4.2: Find the Batch Sweet Spot Coding

Modify Code 3.4.2 to sweep the batch size $B$ over the values $\{1, 2, 4, 8, 16, 32, 64\}$ at a fixed arrival rate. For each $B$, record the throughput and the p99 latency. Plot p99 latency against throughput (one point per batch size). Identify the batch size beyond which throughput barely improves but p99 keeps climbing, and state the rule you would give a teammate for picking $B$ under a "p99 under 50 ms" SLO. Confirm Little's law still holds at every batch size you test.

Exercise 3.4.3: How Bad Does Fan-Out Get? Analysis

A user request fans out to $n$ independent backends and waits for the slowest. Each backend independently exceeds its SLO with probability $1 - q$. (a) Write the probability the aggregate request exceeds the SLO as a function of $n$ and $q$, and evaluate it for $q = 0.99$ at $n = 10, 50, 200$. (b) Suppose you can pay engineering effort to push each backend from a 1 percent slow rate to a 0.1 percent slow rate ($q = 0.999$). Recompute the aggregate slow probability at $n = 200$ and state the multiplicative improvement. (c) Argue from these numbers why tail-latency reduction in a single dependency has outsized value in a deeply fanned-out system, and connect this to the hedged-request idea mentioned in Section 5.