"My average response time is excellent. I would like that engraved on the wreckage, next to the one request out of a thousand that arrived after the airbag had already decided."
A Control Loop, Counting Milliseconds Until It Must Act
When an AI system has to act before a physical deadline, correctness is no longer enough; the answer must arrive in time, and "in time" is governed not by the average latency but by the worst latency the system will plausibly produce. A self-driving car, an augmented-reality headset, a robot arm, and a trading engine all share one constraint that ordinary batch and serving systems do not: there is a clock, the clock does not wait, and a correct decision delivered late is a wrong decision. This section makes the latency budget explicit, decomposes it into the stages between sensing and acting, and argues that the tail of the latency distribution (the p99 and p999), not its center, is the quantity safety depends on. Distribution is the double-edged tool throughout: pushing inference across a network adds tail latency, yet the standard cures for tail latency (replication, hedging, and edge placement) are themselves distributed-systems techniques. Section 34.1 named latency as a driver for moving compute to the edge; here we turn that driver into a budget you can decompose, measure, and engineer.
Every system in the earlier parts of this book optimized throughput: gradients per second, tokens per second, queries per second served by a fleet. A latency-critical system inverts the objective. It may serve only a handful of requests per second, but each one carries a deadline measured in milliseconds, and missing the deadline has a consequence in the physical world rather than a slightly slower dashboard. The relevant question stops being "how many requests can we serve?" and becomes "what is the latest possible moment a request could finish, and is that moment before the deadline?" That shift, from the center of the latency distribution to its far-right tail, reorganizes every design choice that follows.
The setting is the edge, because that is where deadlines and physics meet. A perception model that steers a car cannot wait for a round trip to a regional data center; an AR renderer that lags the wearer's head motion by more than a few tens of milliseconds induces discomfort; a programmable logic controller running a learned policy on a factory line samples and actuates on a fixed cycle. These systems were introduced as a class in Section 34.1, where latency was one of the four forces pushing computation off the cloud. We now treat that force quantitatively, with the evaluation vocabulary built in Chapter 5 and the performance models of Chapter 3.
1. Hard, Firm, and Soft Real Time (Beginner)
The word "real-time" is overloaded, and the distinctions matter because they dictate how aggressively you must engineer the tail. A hard real-time system treats a missed deadline as a total failure: the value of a result drops to zero, or below zero, the instant the deadline passes. An anti-lock braking controller and an airbag trigger are hard real-time; a late decision is not a degraded decision but a dangerous one. A firm real-time system also assigns zero value to a late result, but a missed deadline is merely a dropped frame rather than a catastrophe, so an occasional miss is tolerable if it is rare enough. A soft real-time system assigns declining but positive value to late results: a recommendation that arrives 200 milliseconds late is worse than a prompt one, yet still useful.
Most edge AI lives in the firm and soft regimes, with hard real-time reserved for the innermost safety loops, which are frequently kept on a small certified controller separate from the learned model. The practical consequence is a budget with a probability attached. A hard requirement reads "latency must never exceed $D$." A firm requirement reads "latency must exceed $D$ no more than once in $10^{4}$ frames," which is precisely a statement about the p99.99 of the latency distribution. This is why, for AI systems specifically, the tail percentile is the natural specification: learned models on general-purpose hardware cannot offer a true worst-case guarantee, so we engineer a percentile and bound the miss rate instead.
A latency-critical AI system is specified by a deadline and a permitted miss rate together: "p99 under 20 ms" or "no more than one frame in $10^{5}$ exceeds 30 ms." The mean latency is nearly irrelevant to this specification and can be actively misleading, because the cheap optimization that halves the mean (a faster common path) often does nothing for the rare slow path that actually breaches the deadline. Every technique in this section is aimed at the tail, and you should distrust any latency claim reported as an average.
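A minimal sketch of what checking such a specification could look like. The trace below is hypothetical (invented for illustration, not measured), and `meets_spec` is an illustrative helper name rather than part of any library. It shows a trace whose mean looks healthy while the miss-rate specification passes or fails depending on the permitted rate.

```python
import numpy as np

def meets_spec(latencies_ms, deadline_ms, max_miss_rate):
    """Check a 'deadline + permitted miss rate' spec against observed latencies.

    Returns (ok, observed_miss_rate).
    """
    latencies_ms = np.asarray(latencies_ms, dtype=float)
    miss_rate = float(np.mean(latencies_ms > deadline_ms))
    return miss_rate <= max_miss_rate, miss_rate

# Hypothetical trace: 995 fast frames at 10 ms and 5 stragglers at 50 ms.
trace = np.array([10.0] * 995 + [50.0] * 5)

mean_ms = float(trace.mean())  # 10.2 ms, which looks comfortable against 20 ms

# "p99 under 20 ms" permits a 1% miss rate; the trace misses 0.5% of the time.
ok_p99, miss = meets_spec(trace, deadline_ms=20.0, max_miss_rate=0.01)

# "p999 under 20 ms" permits only 0.1%; the same trace now fails.
ok_p999, _ = meets_spec(trace, deadline_ms=20.0, max_miss_rate=0.001)

print(f"mean={mean_ms:.1f} ms  miss={100 * miss:.2f}%  p99 ok={ok_p99}  p999 ok={ok_p999}")
```

The mean is the same in both checks; only the permitted miss rate decides whether the system is in or out of specification.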
2. The Sense-Infer-Actuate Budget (Beginner)
An edge AI system closes a loop: it senses the world, infers a decision, and actuates. The deadline applies to the whole loop, so the engineering task is to decompose the end-to-end budget into its stages, attribute a slice of the budget to each, and find the stage whose tail dominates. Write the end-to-end latency as the sum of the per-stage latencies,
$$L_{\text{e2e}} = L_{\text{sense}} + L_{\text{net,up}} + L_{\text{queue}} + L_{\text{infer}} + L_{\text{net,down}} + L_{\text{act}},$$

where $L_{\text{sense}}$ is sensor capture and preprocessing, $L_{\text{net,up}}$ and $L_{\text{net,down}}$ are network transfers if any stage runs off-device, $L_{\text{queue}}$ is time waiting for a busy accelerator, $L_{\text{infer}}$ is the model forward pass, and $L_{\text{act}}$ is command transmission and actuation. The deadline requires $L_{\text{e2e}} \le D$ with the agreed probability. Figure 34.7.1 lays out one such budget as a waterfall and contrasts an on-device path against a path that crosses the network to a fog or cloud server.
The waterfall exposes the first design lever. Moving inference to a more powerful fog or cloud server shrinks $L_{\text{infer}}$ but introduces $L_{\text{net,up}}$, $L_{\text{net,down}}$, and a server-side $L_{\text{queue}}$ that the on-device path never paid. For a tight deadline the network terms frequently dominate, which is the quantitative core of the argument in Section 34.1 that latency drives computation toward the device. Local inference is, before anything clever, the simplest tail-latency technique there is: it deletes the two network terms from the sum.
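The decomposition can be sketched as a small budget table. Every number below is a hypothetical tail allowance chosen for illustration (none is taken from Figure 34.7.1), and `budget_report` is an illustrative helper. Summing per-stage tail figures gives only a rough planning number, since percentiles do not add exactly.

```python
# Hypothetical per-stage tail allowances in milliseconds (illustrative only).
ON_DEVICE = {"sense": 4.0, "queue": 1.0, "infer": 14.0, "act": 3.0}
OFF_DEVICE = {
    "sense": 4.0,
    "net_up": 9.0,     # uplink to the fog or cloud server
    "queue": 5.0,      # server-side queueing the device path never paid
    "infer": 4.0,      # faster accelerator, smaller forward-pass time
    "net_down": 7.0,   # downlink back to the device
    "act": 3.0,
}

def budget_report(stages, deadline_ms):
    """Sum the stage allowances and name the stage that consumes the most budget."""
    total = sum(stages.values())
    dominant = max(stages, key=stages.get)
    return {"total_ms": total, "slack_ms": deadline_ms - total, "dominant": dominant}

DEADLINE_MS = 30.0
for name, stages in (("on-device", ON_DEVICE), ("off-device", OFF_DEVICE)):
    r = budget_report(stages, DEADLINE_MS)
    print(f"{name:10s} total={r['total_ms']:.0f} ms  slack={r['slack_ms']:+.0f} ms  "
          f"dominant={r['dominant']}")
```

Under these assumed numbers the off-device path cuts inference time by 10 ms yet ends up 2 ms over budget, because the two network terms together cost 16 ms.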
3. Tail Latency Is the Thing That Matters (Intermediate)
The reason the tail, rather than the mean, governs a latency-critical system is structural, and it sharpens the moment the system is distributed. Consider a request that must gather results from $n$ components before it can act: a scatter-gather over $n$ sensor shards, an ensemble of $n$ models, or a pipeline of $n$ network hops that must all complete. In the parallel cases the request finishes only when the slowest component finishes, so its latency is the maximum of $n$ latencies; in a serial pipeline the latencies add, but a single slow hop is still enough to make the whole request slow. If each component independently exceeds a threshold $t$ with probability $p$, the probability that at least one of them does, and therefore that the whole request is slow, is
$$P(\text{request slow}) = 1 - (1-p)^{n} \approx np \quad \text{for small } p.$$

A one-percent chance of a slow component is negligible on its own, but across $n = 100$ components it becomes a $1 - 0.99^{100} \approx 63\%$ chance that some component is slow, and the request waits for it. This is the tail-at-scale effect identified by Dean and Barroso: fanning a request out across many machines amplifies the rare slow case into the common case, so a system whose components each have an excellent p99 can have a terrible p99 as a whole. Distribution, the very thing that supplies the parallel compute, is what manufactures the tail. We met the same percentile vocabulary as a serving concern in Chapter 23; here the consequence is a missed physical deadline rather than a slow page.
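As a quick numerical check of the amplification formula, the sketch below evaluates $1 - (1-p)^{n}$ for a few fan-out widths, assuming independent components as the text does. The function name `p_request_slow` is illustrative.

```python
def p_request_slow(p, n):
    """Probability that at least one of n independent components is slow."""
    return 1.0 - (1.0 - p) ** n

p = 0.01  # each component is slow 1% of the time
for n in (1, 10, 100):
    exact = p_request_slow(p, n)
    approx = n * p  # the small-p approximation; it overshoots as n*p grows
    print(f"n={n:3d}  exact={100 * exact:5.1f}%  np={100 * approx:5.1f}%")
# n=  1  exact=  1.0%  np=  1.0%
# n= 10  exact=  9.6%  np= 10.0%
# n=100  exact= 63.4%  np=100.0%
```

The last row also shows where the approximation $np$ stops being useful: it is accurate only while $np$ is much smaller than one.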
Two further properties of the tail matter at the edge. Jitter, the variation in latency from one cycle to the next, is as harmful as raw latency for a control loop, because a controller tuned for a fixed sample interval becomes unstable when that interval wanders. Determinism, a small and predictable spread, is therefore often worth more than a lower mean: a model that always answers in 18 to 20 ms is more useful to a control loop than one that averages 12 ms but occasionally takes 90. Real-time operating systems, pinned cores, and disabled power-management states all buy determinism by trimming the tail, even at some cost to the average.
Who: A systems engineer on an augmented-reality team shipping a hand-tracking interaction model.
Situation: The motion-to-photon path had to stay under 20 ms or wearers reported discomfort within minutes; the team's dashboard showed a healthy 11 ms mean.
Problem: Field reports of nausea persisted despite the good average, and nobody could reproduce it on the bench.
Dilemma: Chase a lower mean by shrinking the model further, sacrificing tracking accuracy, or investigate the distribution and risk finding nothing actionable.
Decision: They instrumented the full path and looked at the p99 and p999 instead of the mean, following the evaluation discipline of Chapter 5.
How: Tracing revealed a p999 of 74 ms caused by the inference thread sharing a core with the compositor and losing the scheduler lottery a few times per second; the mean hid it completely.
Result: Pinning inference to a dedicated core and disabling dynamic frequency scaling cut the p999 to 19 ms with no change to the model, and the field reports stopped.
Lesson: The mean was never the symptom. A latency-critical system is debugged at the tail, and the cure was a determinism fix, not a speed fix.
4. Engineering the Tail: Deadlines, Hedging, Anytime Models (Intermediate)
Three families of technique cut or contain the tail, and they compose. The first is deadline-aware scheduling: every request carries its remaining time budget, the scheduler orders work by earliest deadline first, and any request whose deadline has already passed is dropped rather than computed, because a late answer has no value and only delays the next one. This turns the deadline from a hope into an input the system reasons about, and it is the firm-real-time analogue of the serving SLOs that govern queue admission in Chapter 24.
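The earliest-deadline-first policy with dropping can be sketched in a few lines. This is a minimal single-threaded illustration, not a production scheduler: `EDFQueue` and its methods are hypothetical names, and the optional `service_ms` estimate (dropping work that can no longer finish in time, not only work that is already late) is an added assumption beyond what the text describes.

```python
import heapq

class EDFQueue:
    """Earliest-deadline-first queue that drops requests which cannot meet their deadline."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so requests with equal deadlines stay FIFO

    def submit(self, deadline_ms, payload):
        heapq.heappush(self._heap, (deadline_ms, self._seq, payload))
        self._seq += 1

    def next_request(self, now_ms, service_ms=0.0):
        """Return (payload, dropped): the earliest-deadline request that can still
        finish by its deadline, plus the payloads discarded on the way.

        With service_ms=0 only requests whose deadline has already passed are dropped.
        """
        dropped = []
        while self._heap:
            deadline_ms, _, payload = heapq.heappop(self._heap)
            if now_ms + service_ms <= deadline_ms:
                return payload, dropped
            dropped.append(payload)  # a late answer has no value; do not compute it
        return None, dropped

q = EDFQueue()
q.submit(50.0, "frame-A")
q.submit(12.0, "frame-B")   # already late by the time the accelerator is free
q.submit(30.0, "frame-C")

chosen, dropped = q.next_request(now_ms=15.0, service_ms=5.0)
print(chosen, dropped)  # frame-C ['frame-B']
```

Dropping `frame-B` costs nothing, since its result would have been discarded anyway, and it lets `frame-C` start while it can still meet its own deadline.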
The second is redundant, hedged requests, the distributed cure for the distributed disease. Instead of sending one request and praying it is not the slow one, send the same request to two replicas and act on whichever returns first, cancelling the other. If a single call misses the deadline with probability $p$, the faster of two independent calls misses it only when both do, which has probability $p^{2}$, so a 2% tail becomes roughly a 0.04% tail. Hedging spends extra capacity to buy a dramatically thinner tail, and the overhead is typically well under a full doubling if the second request is delayed slightly and fired only when the first is late. It is the same insight that motivates straggler mitigation by speculative re-execution in Chapter 18, applied now to a real-time deadline.
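The racing logic behind a delayed hedge can be sketched with Python's standard `asyncio` library. This is an illustration under stated assumptions, not any RPC library's API: `hedged_call` and `fake_replica` are hypothetical names, and the replica latencies are invented so that the first attempt is the straggler.

```python
import asyncio

async def hedged_call(make_call, hedge_delay_s):
    """Issue one call; if it has not returned after hedge_delay_s, issue a second.

    Returns (result, calls_issued). The slower call is cancelled.
    """
    first = asyncio.create_task(make_call(0))
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result(), 1  # primary answered in time, so no hedge was spent

    second = asyncio.create_task(make_call(1))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellation settle
    return done.pop().result(), 2

async def fake_replica(attempt):
    # Hypothetical replicas: attempt 0 hits a straggler, attempt 1 is fast.
    await asyncio.sleep(0.5 if attempt == 0 else 0.01)
    return f"reply-from-attempt-{attempt}"

result, calls = asyncio.run(hedged_call(fake_replica, hedge_delay_s=0.05))
print(result, calls)  # reply-from-attempt-1 2
```

Because the second call is issued only after the delay expires, requests that finish quickly cost a single call, which is why the average overhead stays well below a full doubling.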
The third is the model-side guarantee: an anytime or early-exit model that always has a usable answer ready by the deadline, refining it only while time remains. An early-exit network attaches classifier heads to intermediate layers and returns from the first head confident enough, or from whichever head the clock reaches; a cascade runs a tiny model first and escalates to a larger one only if the budget allows. The contract inverts the usual one: rather than promising a fixed-quality answer at variable latency, an anytime model promises a fixed deadline at variable quality, which is exactly the contract a hard clock wants. We can now state the tail-at-scale relation and its hedged remedy together.
This section is the cleanest statement of a tension that runs through the whole book. Scattering a request across $n$ machines multiplies the chance that one is slow, to $1 - (1-p)^n$, so distribution creates tail latency; replicating a request across redundant machines and taking the first reply shrinks that chance, to $p^2$ for a hedge of two, so distribution also removes tail latency. The same primitive, sending work to more than one machine, is the problem when the answer needs all of them and the solution when it needs any of them. Every later reliability technique, from the Byzantine-robust aggregation of Chapter 35 to the deadline-driven control of robots in Section 34.8, is a choice about which side of that line a given request sits on.
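Returning to the third technique, the anytime contract (fixed deadline, variable quality) can be sketched as a cascade that keeps the best answer finished so far and stops before a refinement would overrun. Everything here is hypothetical: `anytime_predict`, the stage costs, and the `FakeClock` that stands in for real elapsed time so the example is deterministic. The per-stage cost estimates are assumed to be known in advance.

```python
import time

def anytime_predict(stages, x, deadline_s, clock=time.monotonic):
    """Run progressively better stages while the budget allows.

    stages: list of (estimated_cost_s, fn), cheapest first. The first stage always
    runs so that some answer exists. Returns (answer, stages_completed).
    """
    start = clock()
    answer, used = None, 0
    for est_cost_s, fn in stages:
        elapsed = clock() - start
        if used > 0 and elapsed + est_cost_s > deadline_s:
            break  # the next refinement would finish after the deadline
        answer = fn(x)
        used += 1
    return answer, used

class FakeClock:
    """Deterministic clock for the demo; each stage advances it by its own cost."""
    def __init__(self):
        self.t = 0.0
    def __call__(self):
        return self.t

clock = FakeClock()

def make_stage(name, cost_s):
    def run(x):
        clock.t += cost_s          # simulate the time the stage takes
        return f"{name}({x})"
    return cost_s, run

stages = [make_stage("tiny", 0.002), make_stage("medium", 0.008), make_stage("large", 0.030)]
answer, used = anytime_predict(stages, "frame-17", deadline_s=0.020, clock=clock)
print(answer, used)  # medium(frame-17) 2
```

With a 20 ms budget the cascade returns the medium model's answer after 10 ms and never starts the 30 ms model, which is the quality-for-punctuality trade the text describes.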
5. Watching Hedging Cut the Tail (Intermediate)
The argument so far is analytic; the demonstration below makes it concrete by simulation. We model per-call latency as a fast body plus a 2% heavy tail, the signature of a system where most calls are quick but a few hit a garbage-collection pause, a queued accelerator, or a congested link. We then measure three regimes against a 30 ms deadline: a single call, a scatter-gather that waits for the slowest of $n$ parallel calls (the tail-at-scale effect), and a hedge that fires two calls and takes the first to return.
```python
import numpy as np

rng = np.random.default_rng(7)
TRIALS = 200_000

# Per-call latency (ms): a fast body plus a heavy 2% tail. The tail is the
# straggler: a GC pause, a queued accelerator, a cold edge node, a slow link.
def sample_latency(size):
    body = rng.lognormal(mean=np.log(8.0), sigma=0.35, size=size)  # ~8 ms typical
    slow = rng.random(size) < 0.02                                 # 2% are stragglers
    body[slow] += rng.uniform(60.0, 140.0, size=slow.sum())        # +60..140 ms tail
    return body

pct = lambda x, q: np.percentile(x, q)

single = sample_latency(TRIALS)  # one call
print(f"single call p50={pct(single,50):6.1f} p99={pct(single,99):6.1f} p999={pct(single,99.9):6.1f}")

# Scatter-gather: wait for the SLOWEST of n calls -> tail amplification.
for n in (2, 5, 10):
    worst = np.max(sample_latency((TRIALS, n)), axis=1)
    print(f"max of n={n:<2d} calls p50={pct(worst,50):6.1f} p99={pct(worst,99):6.1f} p999={pct(worst,99.9):6.1f}")

# Hedge: fire 2 calls, take the FIRST -> tail suppression.
hedged = np.min(sample_latency((TRIALS, 2)), axis=1)
print(f"hedged (first of 2) p50={pct(hedged,50):6.1f} p99={pct(hedged,99):6.1f} p999={pct(hedged,99.9):6.1f}")

budget = 30.0  # deadline accounting
print(f"\ndeadline = {budget:.0f} ms")
print(f" single call miss rate : {100*np.mean(single > budget):5.2f} %")
print(f" max of 5 miss rate : {100*np.mean(np.max(sample_latency((TRIALS,5)),axis=1) > budget):5.2f} %")
print(f" hedged(2) miss rate : {100*np.mean(hedged > budget):5.2f} %")
```
```
single call p50=   8.1 p99= 109.4 p999= 144.0
max of n=2  calls p50=   9.8 p99= 128.6 p999= 146.6
max of n=5  calls p50=  12.2 p99= 140.6 p999= 149.6
max of n=10 calls p50=  14.3 p99= 144.5 p999= 151.4

deadline = 30 ms
 single call miss rate :  2.01 %
 max of 5 miss rate :  9.62 %
 hedged(2) miss rate :  0.04 %
```
The output is the section in miniature. The median moves by only a few milliseconds across the regimes shown (8.1 ms for a single call, 14.3 ms for the slowest of ten), so a dashboard that tracks only the center of the distribution would see little to separate them. The deadline miss rate is not equivalent at all: scatter-gather over five calls nearly quintuples it, from 2.01% to 9.62%, while a two-way hedge cuts it by a factor of fifty, to 0.04%. Because fewer than one hedged request in a thousand exceeds 30 ms, both the p99 and the p999 of the hedged path fall below the deadline. The expensive resource, an extra replica's worth of work, bought a thin tail rather than a faster average, which is the only trade that a latency-critical system is willing to make. Notice also that the hedge spends compute it usually does not need; in practice the second call is held back by a short hedging delay so that it fires only when the first is already running late, recovering most of the benefit at a fraction of the cost.
Code 34.7.1 implemented hedging by hand to expose the mechanism. In production you rarely write the racing logic yourself: modern RPC stacks expose hedging and deadline propagation as configuration. A gRPC service method can declare a hedging policy in a few lines of service config, and the Envoy proxy supports request hedging and per-try timeouts at the mesh layer, so every call inherits the behavior without touching application code:
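```python
import json

import grpc

# Placeholder address for illustration; substitute the real inference endpoint.
target = "localhost:50051"

# gRPC service config (JSON), attached at channel creation.
# Caveat: support for hedgingPolicy varies across gRPC language implementations;
# verify that the runtime in use honors it before relying on this behavior.
service_config = {
    "methodConfig": [{
        "name": [{"service": "perception.Inference"}],
        "timeout": "0.030s",              # 30 ms end-to-end deadline
        "hedgingPolicy": {
            "maxAttempts": 2,             # fire a 2nd call if the 1st is slow
            "hedgingDelay": "0.012s"      # ... but only after a 12 ms hedging delay
        }
    }]
}

channel = grpc.insecure_channel(target, options=[
    ("grpc.service_config", json.dumps(service_config)),
    ("grpc.enable_retries", 1),
])
```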
6. Where the Deadlines Bite: Applications (Beginner)
The techniques above earn their cost in four recurring settings, and naming them grounds the abstraction. Autonomous driving runs a perception-prediction-planning loop on-vehicle at frame rate; the perception model must clear its budget every frame because the planner downstream cannot act on a frame that has not arrived, and the whole stack is kept local precisely to delete the network terms of $L_{\text{e2e}}$. Augmented and virtual reality are bounded by motion-to-photon latency, the time from a head movement to the corresponding pixel update, where exceeding roughly 20 ms induces discomfort, so the rendering and tracking models run on-headset or on a tethered companion rather than in the cloud. Industrial control places learned policies and anomaly detectors inside a fixed sense-actuate cycle on a programmable controller, where jitter, not mean latency, sets the achievable control bandwidth. High-frequency systems, from algorithmic trading to real-time bidding, treat microseconds of tail as money and colocate inference next to the data feed to delete every avoidable hop.
Across all four, the same pattern holds: the deadline is physical, the tail is the enemy, and the placement of computation is chosen to control the network terms first and the compute terms second. The next section carries this directly into robotics, where the sense-infer-actuate loop is not a metaphor but the literal architecture of the machine. The deadline-driven control loops of Section 34.8 are exactly the hard and firm real-time systems this section has been budgeting for.
Latency-critical AI is an active research front on three sides. On the model side, dynamic-depth and early-exit transformers (the lineage of BranchyNet and FastBERT, extended in 2024 to 2025 toward token-level early exit for on-device LLMs) push the anytime contract into large models, returning a usable answer by the deadline and refining only while budget remains. On the systems side, deadline-aware and SLO-driven schedulers for edge inference, including work on real-time GPU partitioning and preemptible kernels, attack $L_{\text{queue}}$ so a high-priority frame can interrupt a low-priority one rather than wait behind it. On the methods side, the worst-case-execution-time community and the learning community are converging on probabilistic timing guarantees for neural networks, asking not "what is the average latency?" but "what deadline can we certify at a $10^{-6}$ miss rate?" The unifying theme is a shift from optimizing the mean to certifying the tail, the same shift this section argues every latency-critical designer must make.
There is a grim folk wisdom in real-time engineering that the median latency is the number you put on the slide and the p999 is the number that gets you paged at three in the morning. The simulation in Output 34.7.1 is a tidy illustration: every regime has a comparable, cheerful p50, and only the tail betrays which system is about to miss a deadline that matters. If a vendor quotes you an average latency for a safety-critical system, they have answered a question you did not ask.
Using the off-device path in Figure 34.7.1, identify which two stages contribute the most tail latency, and propose one placement change and one scheduling change that would bring the path back inside the 30 ms deadline without shrinking the model. For each proposed change, state which term of $L_{\text{e2e}}$ it reduces and what new cost or risk it introduces. Then explain why simply switching to a faster server (lower $L_{\text{infer}}$) does not by itself fix the breach.
Extend Code 34.7.1 with a delayed hedge: fire the second call only if the first has not returned within a delay $\tau$, and act on whichever finishes first. Sweep $\tau$ from 0 to 30 ms and, for each value, report the deadline miss rate against the 30 ms budget and the expected number of calls issued per request (a proxy for extra compute cost). Plot miss rate against cost and identify the $\tau$ that achieves under 0.1% misses at the lowest extra cost. Explain why $\tau = 0$ (always hedge) and large $\tau$ (never hedge in time) are both suboptimal.
A request fans out to $n$ shards and must wait for all of them; each shard independently exceeds the deadline with probability $p = 0.005$. Using $P(\text{slow}) = 1 - (1-p)^n$, find the largest $n$ for which the whole-request miss rate stays below 1%. Now suppose each shard can itself be hedged two ways, reducing its effective $p$ to $p^2$; recompute the largest tolerable $n$. Discuss what this says about the maximum fan-out a scatter-gather edge system can sustain, and connect your answer to the tail-at-scale reasoning and to the per-node latency budgets you would draw from Chapter 3.