Section 26.6: Fleet-Wide Monitoring and Observability

"Each of us swore we were healthy. The aggregate disagreed, and only one of us was right."
A Replica Reporting In From the Edge of the Fleet

Big Picture

You cannot manage what you cannot observe, and at fleet scale you cannot observe by collecting raw events; you observe by merging summaries. A distributed AI service runs across thousands of replicas, each answering its own slice of traffic, each with its own latency profile, error rate, and load. Knowing the fleet-wide p99 latency or error rate is not a matter of averaging dashboards by eye; it requires that every replica emit a summary that combines correctly with every other replica's summary into one number you can trust. This section builds the three layers you must watch (systems, model, and data), shows why mergeable sketches are the only honest way to aggregate percentiles and cardinalities across replicas, and turns per-replica partials into a single fleet p99 and an SLO-burn alert in pure Python. The same mergeability that made the gradient exact in Section 1.1 now makes the dashboard exact.

The previous section put experiment tracking on a distributed footing: a place to record what each training run did across many workers. Once the model is trained, registered, and serving traffic, the question flips from "what did this run produce?" to "what is the running system doing right now, everywhere, at once?" That is observability, and for a distributed AI system it is harder than for a classic web service in two specific ways. First, the thing being served is a learned function whose behavior can drift even when the code does not change, so you must watch the model and its inputs, not only the machine. Second, the answer to any fleet-wide question is scattered across thousands of replicas that each see only their own traffic, so producing a coherent global view is itself a distributed-aggregation problem. We take these in turn: the three layers worth watching, the aggregation machinery that makes a fleet view possible, distributed tracing of a single request, and alerting on the budget you promised to keep.

Figure 26.6.1: Fleet observability end to end. On the left, every replica keeps a small mergeable summary (a t-digest quantile sketch plus counters). A streaming merge unions them into the central fleet dashboard, from which an SLO-burn alert can fire. Below, a single request is traced as a chain of spans across the router, retrieval, and LLM services so the slow hop is identifiable. The footer names the three layers and the typical tool chain that carries the signals.

1. Three Layers You Must Watch Beginner

A distributed AI service fails in three different planes, and a complete observability story watches all of them. The first plane is the systems and infrastructure layer: the same signals any high-throughput service tracks, plus a few that are specific to accelerated inference. You watch latency as percentiles rather than means, because the mean hides the tail that users actually feel; for an ordinary endpoint that is p50 and p99 end to end, and for a token-streaming LLM it splits into time to first token (TTFT) and time per output token (TPOT), since a user perceives those very differently. You watch throughput as queries per second, error rate as the fraction of failed responses, GPU utilization and high-bandwidth-memory occupancy because an idle or thrashing accelerator is money burned, queue depth because a growing queue is the leading indicator of a latency cliff, and cost per thousand requests because at fleet scale a small per-request waste multiplies into the budget. These tie directly to the fleet-sizing arithmetic of Section 23.4 and to the evaluation metrics of Chapter 5; observability is where those design-time targets meet runtime reality.

The second plane is the model, or ML, layer, and it is what makes AI observability different from web observability. A deployed model is a learned function, and its behavior can degrade while every machine reports green. So you watch the distribution of predictions and their confidence, because a sudden shift in the predicted-class mix or a collapse in confidence often precedes a measurable accuracy drop. For an LLM endpoint the model-layer signals become output-quality proxies you can compute online: refusal and safety-filter rates, malformed-output and JSON-parse-failure rates, response-length and token-usage distributions, and where you have a judge or a thumbs signal, an aggregate quality score. The third plane is the data, or input, layer: the quality of what arrives before the model ever runs. You watch input feature distributions for drift, the rate of missing or malformed fields, schema violations, and for text the language mix and prompt-length distribution. The data layer is the earliest warning of the three, because bad input reliably produces bad output a few milliseconds later. Section 26.7 takes the drift detection that lives in the model and data layers and makes it a distributed algorithm in its own right; here we treat drift as one of the signals the fleet dashboard must surface.

Key Insight: A Green Fleet Can Be Serving Wrong Answers

Systems-layer health and model-layer correctness are independent. Every replica can report low latency, zero errors, and healthy GPUs while the model quietly returns degraded predictions because the input distribution shifted under it. This is the defining hazard of AI observability: the machine is a poor proxy for the service. You must instrument the learned function and its inputs as first-class signals, not infer their health from CPU graphs. A monitoring stack that watches only the systems layer will pass every check on the day your model starts being wrong.

2. The Aggregation Problem: Mergeable Summaries Intermediate

Every signal above is easy to define on one replica and surprisingly subtle to compute across a fleet. The trap is the percentile. You cannot average per-replica p99 values to get the fleet p99: the average of percentiles is not the percentile of the union, and the error is largest exactly when one replica is unhealthy, which is when you most need the truth. The naive fix, shipping every raw latency to a central node and computing the true percentile there, does not survive contact with scale; a million requests per minute across thousands of replicas is far too much data to centralize for a dashboard that refreshes every few seconds. The way out is the same idea that made MapReduce combiners work in Section 6.8: have each replica emit a small, bounded-size sketch that summarizes its data and that merges associatively with other sketches, so the fleet summary is the merge of the replica summaries.

Two sketches carry most of the load. For percentiles, a t-digest keeps a compressed set of weighted centroids over the value distribution, using more resolution in the tails (where p99 lives) than in the middle; two t-digests merge by pooling and re-compressing their centroids, and the merged digest answers any quantile $q$ to bounded relative error. For counting distinct things cheaply, such as the number of unique users or unique prompts seen, HyperLogLog estimates cardinality from a fixed-size register array; the union of two HyperLogLog sketches is the element-wise maximum of their registers, so distinct-count is mergeable too. Both are mergeable in the precise sense that the operation is associative and commutative: the order in which replicas report, and the grouping into intermediate aggregators, does not change the answer. That property is what lets a streaming aggregation pipeline (the same shape as the windowed aggregators of Section 9.9) build a fleet dashboard incrementally, folding each replica's partial into a running total without ever holding the raw stream.

Thesis Thread: Mergeability Returns, Now as the Dashboard

The exactness that opened this book was a mergeability property: the gradient is an average, an average is a sum divided by a count, and a sum regroups freely across workers, so data parallelism is exact (Section 1.1). The MapReduce combiner (Section 6.8) and the streaming aggregator (Section 9.9) generalized it to any associative reduction. Fleet observability is the same primitive wearing a third hat: counters merge by addition exactly, and quantiles and cardinalities merge approximately but with bounded error through sketches. Whenever you need a fleet-wide number, ask the mergeability question first: is the per-replica summary something I can combine without re-reading the raw data? If yes, the fleet view is cheap and correct; if no, you are about to average percentiles and lie to yourself.

3. Aggregating a Fleet From Per-Replica Partials Intermediate

The code below builds the core of a fleet aggregator from scratch. Each of two hundred replicas serves five thousand requests and keeps a single mergeable summary: a t-digest-style quantile sketch over its latencies plus request and error counters. A bad deploy has landed on about nine percent of the replicas, giving them a slow code path and elevated errors. The aggregator merges all two hundred partials into one fleet sketch, reads off the fleet p50 and p99 and error rate, evaluates an SLO-burn rule, and, crucially, drills down to the worst replicas using only the per-replica partials it already holds, never re-reading a raw latency.

import math, random

class Sketch:
    """Compressed quantile sketch over latencies plus request/error counts."""
    def __init__(self, max_centroids=64):
        self.centroids = []          # list of [mean_latency_ms, weight]
        self.max_centroids = max_centroids
        self.n = 0                   # total requests
        self.errors = 0              # error responses (HTTP 5xx / timeouts)

    def add(self, x, is_error=False):
        self.n += 1
        if is_error:
            self.errors += 1
        self.centroids.append([float(x), 1.0])
        if len(self.centroids) > self.max_centroids * 4:
            self._compress()

    def _compress(self):
        # Sort by latency, then greedily fuse neighbours into max_centroids bins
        # weighted by request count: this is the mergeable, bounded-size core.
        self.centroids.sort(key=lambda c: c[0])
        total = sum(w for _, w in self.centroids)
        step = total / self.max_centroids
        fused, acc_w, acc_xw = [], 0.0, 0.0
        for mean, w in self.centroids:
            acc_w += w
            acc_xw += mean * w
            if acc_w >= step:
                fused.append([acc_xw / acc_w, acc_w])
                acc_w, acc_xw = 0.0, 0.0
        if acc_w > 0:
            fused.append([acc_xw / acc_w, acc_w])
        self.centroids = fused

    def merge(self, other):
        # Mergeability: a fleet sketch is the union of replica sketches.
        self.centroids.extend(other.centroids)
        self.n += other.n
        self.errors += other.errors
        self._compress()
        return self

    def quantile(self, q):
        self._compress()
        cs = sorted(self.centroids, key=lambda c: c[0])
        target = q * sum(w for _, w in cs)
        run = 0.0
        for mean, w in cs:
            run += w
            if run >= target:
                return mean
        return cs[-1][0]

random.seed(7)
N_REPLICAS = 200
fleet_partials = []
bad = set(range(0, N_REPLICAS, 11))           # a bad deploy hit 19 of 200 replicas
for r in range(N_REPLICAS):
    s = Sketch()
    unhealthy = r in bad
    base = 520.0 if unhealthy else 90.0       # TTFT-like base latency in ms
    err_rate = 0.18 if unhealthy else 0.004
    for _ in range(5000):
        lat = random.lognormvariate(math.log(base), 0.45)
        s.add(lat, is_error=(random.random() < err_rate))
    fleet_partials.append(s)

# Fleet aggregation: fold every per-replica partial into ONE fleet sketch.
fleet = Sketch()
for s in fleet_partials:
    fleet.merge(s)

fleet_p99 = fleet.quantile(0.99)
fleet_err = fleet.errors / fleet.n
print(f"replicas merged        : {N_REPLICAS}")
print(f"total requests         : {fleet.n:,}")
print(f"fleet p50 latency (ms) : {fleet.quantile(0.50):7.1f}")
print(f"fleet p99 latency (ms) : {fleet_p99:7.1f}")
print(f"fleet error rate       : {fleet_err*100:6.3f}%")

# SLO-burn alert. SLO: 99% of requests under 400 ms, error budget 1%.
P99_SLO_MS, ERROR_BUDGET = 400.0, 0.01
err_burn = fleet_err / ERROR_BUDGET
print(f"\nerror burn rate        : {err_burn:6.2f}x budget")
if fleet_p99 > P99_SLO_MS or err_burn > 2.0:
    print("ALERT  fast SLO burn: paging on-call")
else:
    print("OK     within SLO; no page")

# Pinpoint the worst replicas without re-scanning raw data: each partial
# already carries its own error rate and p99.
ranked = sorted(((i, p.errors / p.n, p.quantile(0.99)) for i, p in enumerate(fleet_partials)),
                key=lambda t: t[1], reverse=True)[:3]
print("\nworst replicas (replica, err%, p99 ms):")
for i, e, p99 in ranked:
    print(f"  replica {i:3d}  err {e*100:6.2f}%  p99 {p99:7.1f}")

Code 26.6.1: A fleet aggregator built from mergeable summaries. The Sketch class fuses both the quantile centroids and the request/error counters under one associative merge, so the fleet p99 and error rate are computed by folding partials, and the drill-down to the worst replicas reuses the same partials the aggregator already holds.

replicas merged        : 200
total requests         : 1,000,000
fleet p50 latency (ms) :    97.1
fleet p99 latency (ms) :   863.4
fleet error rate       :  2.090%

error burn rate        :   2.09x budget
ALERT  fast SLO burn: paging on-call

worst replicas (replica, err%, p99 ms):
  replica  99  err  18.90%  p99  1415.9
  replica   0  err  18.68%  p99  1358.7
  replica  44  err  18.66%  p99  1613.4

Output 26.6.1: One million requests across two hundred replicas collapse into a single fleet p99 of 863 ms and a 2.09 percent error rate, both breaching the SLO, so the burn rule pages on-call. The same partials then localize the fault to the bad-deploy replicas, each running near 19 percent errors, without the aggregator ever touching a raw latency.

Two properties of this aggregator matter at fleet scale. It is bounded: each replica ships a few dozen centroids and two integers regardless of how many requests it served, so the network and the aggregator cost scale with the number of replicas, not the number of requests. And it is exact where it can be: the counters add with no approximation at all, so the fleet error rate is the true error rate; only the quantiles carry the small, bounded sketch error. The drill-down at the end is the payoff of keeping the partials around. A fleet alert tells you something is wrong; the per-replica summaries tell you where, turning a page at 3 a.m. into a one-line answer rather than a forensic dig through raw logs.

Library Shortcut: OpenTelemetry, Prometheus, and Grafana Do the Plumbing

Code 26.6.1 hand-rolls the sketch, the merge, and the alert to expose the mechanism. In production you do not write any of it. You instrument the service once with OpenTelemetry, whose SDK records latency histograms, counters, and spans and exports them over a standard protocol. A Prometheus server scrapes each replica's /metrics endpoint and stores the time series; its histogram type carries pre-bucketed observations that aggregate across replicas, and histogram_quantile() reads off a fleet p99 from the merged buckets. Grafana renders the dashboard, and Prometheus's alerting rules fire the SLO-burn page:

# Instrument once with OpenTelemetry; the SDK handles merge-friendly export.
from opentelemetry import metrics
meter = metrics.get_meter("inference")
latency = meter.create_histogram("ttft_ms", unit="ms")      # mergeable buckets
errors  = meter.create_counter("requests_errors_total")     # mergeable counter

def serve(request):
    t0 = now_ms()
    try:
        return model(request)
    except Exception:
        errors.add(1); raise
    finally:
        latency.record(now_ms() - t0)

Code 26.6.2: Roughly fifty lines of sketch, merge, and alert logic from Code 26.6.1 collapse to two metric handles and three calls. OpenTelemetry exports buckets and counters that Prometheus aggregates across the fleet, so the same fleet p99 falls out of a one-line histogram_quantile() query, and Grafana plus the Prometheus rules engine supply the dashboard and the SLO-burn alert.

4. Tracing One Request Across Many Services Intermediate

Metrics tell you the fleet p99 is 863 ms; they do not tell you which hop spent the time. A single request to a modern AI endpoint is not served by one process but by a chain of services: a router classifies and load-balances it, a retrieval service fetches context from a vector index (Section 23.4 sized that fleet), and an LLM service generates the answer. When that request is slow, the latency could be in any hop, and an end-to-end number cannot decompose the blame. Distributed tracing solves this by assigning each request a trace_id at the entry point and propagating it through every downstream call. Each service records a span: a start time, a duration, the service name, and the parent span it descended from. Stitched together by their shared trace_id, the spans form a tree that shows exactly where the time and the errors went, the chain drawn along the bottom of Figure 26.6.1.

Tracing is what turns "the system is slow" into "the retrieval span took 700 of the 863 milliseconds, and it is slow because the index shard on host 12 is rebuilding." It is also the natural home for the LLMOps additions to observability. Because the trace already follows a request through every service, it is the right place to attach the prompt and the response for later evaluation, the retrieved documents for grounding analysis, and, for an agent, the full sequence of reasoning and tool-call steps as nested spans. An agent that calls a search tool, then a calculator, then the model again becomes a readable trace tree rather than an opaque latency number, which is the only practical way to debug why an agent looped, stalled, or chose the wrong tool.

Practical Example: The p99 That Lived in the Retriever

Who: An SRE on the platform team for a retrieval-augmented assistant serving a few thousand requests per second across a multi-region fleet.

Situation: The fleet p99 latency doubled overnight while p50, error rate, and every GPU graph stayed flat and green.

Problem: The end-to-end p99 was a single number with no decomposition, and averaging per-replica p99 dashboards by eye suggested the regression was everywhere and nowhere.

Dilemma: Roll back the most recent model deploy on suspicion, which was fast but probably wrong since the model graphs were healthy, or invest an hour in tracing to localize the cause before touching anything.

Decision: They traced first, pulling a sample of slow traces filtered by the entry-point trace_id and breaking each into its router, retrieval, and LLM spans.

How: The traces showed the LLM span unchanged and the retrieval span carrying nearly all of the added latency; a merged t-digest over retrieval spans, grouped by index shard, isolated two shards mid-rebuild.

Result: They drained traffic from the two rebuilding shards, p99 returned to baseline within minutes, and the model deploy, which was innocent, was never rolled back.

Lesson: A fleet metric finds the symptom; a trace finds the cause. Without per-span attribution, the team would have rolled back a healthy model and still owned the outage.

5. Alerting on SLO Burn Advanced

Observability earns its cost only when it wakes the right person at the right moment, and the discipline that decides when is the service-level objective. An SLO is a promise expressed as a target over a window: for example, 99 percent of requests complete under 400 ms and the error rate stays under one percent over a rolling 30 days. The allowance for failure, here one percent of requests, is the error budget. The key quantity for alerting is the burn rate: how fast you are consuming that budget relative to the rate that would exhaust it exactly at the window's end. A burn rate of one means you will spend the whole budget right on schedule; a burn rate of ten means you will exhaust a month's budget in three days. Code 26.6.1 fired its page when the error burn crossed two times budget or the fleet p99 breached its target, which is a deliberately simplified single-window rule.

The reason to alert on burn rate rather than on a raw threshold is that it separates the urgent from the merely noted. A slow, low-multiple burn is a ticket to investigate during business hours; a fast, high-multiple burn is a page, because at that rate the budget is gone before anyone notices a dashboard. Mature setups combine a fast-burn rule on a short window (catch acute outages quickly) with a slow-burn rule on a long window (catch chronic, low-grade degradation that a short window would miss), so that both the sudden fire and the slow leak page appropriately and a brief blip does not. The burn rate is itself a fleet-aggregated quantity: it is computed from the merged error counter and the merged latency sketch of every replica, which is why the mergeability of Section 2 is not an aesthetic nicety but the precondition for alerting on the truth instead of on one noisy replica.

Research Frontier: Observability for LLMs and Agents (2024 to 2026)

The hardest open problem in this section is the model layer for generative systems: a span tells you the LLM took 700 ms, but not whether the answer was good. The 2024 to 2026 response has two threads. First, standardization: the OpenTelemetry project published GenAI semantic conventions that fix span and attribute names for model calls (model id, token counts, prompt and completion, tool calls), so that prompts, responses, and agent steps become traceable in a vendor-neutral way rather than in each tool's private schema. Second, quality as a streamed signal: tools in the lineage of LangSmith, Langfuse, Arize Phoenix, and OpenLLMetry log every prompt and response and run online evaluators (LLM-as-judge scores, refusal and hallucination detectors, retrieval-grounding checks) to turn output quality into a time series you can alert on, alongside the work on continuous evaluation in Chapter 5. The frontier is making those judge-based quality signals cheap and reliable enough to gate a deploy automatically, so that a drop in answer quality pages on-call as readily as a spike in latency does today.

Fun Note: The Loneliest Replica

In the fleet of Code 26.6.1, the broken replicas each swore they were fine; their own gauges read green because a replica that returns errors quickly looks fast and busy, not sick. Only the merge told the truth. There is a small lesson hiding in the joke: a replica is the worst possible judge of the fleet, because it cannot see the eighteen percent of users it is quietly failing. Self-reported health is how a thousand-machine system convinces itself nothing is wrong while a tenth of its traffic burns.

We now have the running system under watch across all three layers, aggregated faithfully into one view, traceable down to the slow span, and wired to page on SLO burn. The most important model-layer signal, the slow degradation of a learned function as the world shifts beneath it, deserves more than a line on a dashboard. The next section makes drift detection a distributed algorithm of its own, computing the divergence between a reference distribution and the live stream across the fleet with the same mergeable discipline used here. That story begins in Section 26.7.

Exercise 26.6.1: Why Not Average the Percentiles? Conceptual

Consider a fleet of ten replicas, nine of which serve every request in 100 ms and one of which serves every request in 1000 ms, all under equal load. State the true fleet p99 latency, then compute what you would get by taking the unweighted average of the ten per-replica p99 values. Explain in terms of the union of distributions why the average of percentiles is not the percentile of the union, and describe one realistic fleet scenario in which this error would cause an on-call engineer to miss a genuine outage.

Exercise 26.6.2: Add Distinct-User Counting Coding

Extend the Sketch class in Code 26.6.1 with a HyperLogLog-style register array so each replica also estimates the number of distinct users it served (hash a synthetic user id into a register and keep the running maximum of leading-zero counts). Implement the mergeable union as the element-wise maximum of two register arrays, fold it into merge, and report the fleet-wide distinct-user estimate alongside the p99 and error rate. Verify that merging in a different replica order leaves the estimate unchanged, and explain why element-wise maximum is the correct merge operation for cardinality.

Exercise 26.6.3: Design a Two-Window Burn Alert Analysis

Replace the single-window SLO rule in Code 26.6.1 with a two-rule scheme: a fast-burn rule that pages when the error budget is being consumed at fourteen times budget over a one-hour window, and a slow-burn rule that pages at three times budget over a six-hour window. For an SLO of one percent errors over 30 days, compute the absolute error rate each rule corresponds to. Then argue which rule would catch a sudden total outage of one region, which would catch a slow leak from a gradually degrading model, and why a single fixed threshold cannot serve both purposes without either paging on noise or missing the slow leak.