"Every replica swore its little slice of traffic looked completely normal. It took all of us, added up, to notice the world had moved on without us."
A Histogram That Only Saw One Shard
A model is correct only with respect to the data distribution it was trained on; when production traffic drifts away from that distribution, accuracy decays silently, and the only way to catch it is to measure the distribution of the entire live stream, which no single replica ever sees. A deployed model does not break loudly the way a crashed server does. It keeps answering, confidently, while the inputs it answers about slowly stop resembling its training data, and its accuracy erodes long before anyone files a bug. Drift detection is the instrument that turns this silent decay into a measured signal, and at scale it is inherently a distributed-aggregation problem: the relevant distribution is the fleet-wide one, assembled from mergeable summaries that each replica computes over its own shard of traffic. This section builds that instrument and wires its alarm to the retraining pipeline, closing the loop that keeps a deployed model alive.
The previous section gave us fleet-wide observability: latency, throughput, and error rates aggregated across every replica so that the serving system reports its own health as one number rather than a thousand. Those metrics tell you whether the system is running. They say nothing about whether the model is still right. A model can be fast, available, and quietly wrong, returning a prediction in eight milliseconds for an input that belongs to a world it was never trained on. Drift detection is the missing half of monitoring: instead of watching the system's vital signs, it watches the statistical relationship between the data the model sees today and the data it learned from, and raises an alarm when that relationship decays. It is the sensor that decides when the continuous-training pipeline of Section 26.4 should fire.
This deepens the streaming view from Chapter 9, where drift appeared as a property of a single online learner reacting to a changing stream. Here the model is fixed, the stream is the aggregate of every replica's traffic, and the question is not "how do I adapt one learner online?" but "how do I detect, across a whole serving fleet, that the deployed model has gone stale, and trigger a retrain before users notice?"
1. Three Kinds of Drift, and the Label-Delay Problem Beginner
Drift is not one phenomenon. It helps to keep three kinds separate, because they have different causes, different detectability, and different remedies. Data drift (also called covariate drift) is a shift in the input distribution: $P(x)$ changes while the underlying input-to-output rule stays fixed. A fraud model starts seeing transactions from a new country; a vision model starts seeing photos from a new phone's camera. Concept drift is a shift in the relationship itself: $P(y \mid x)$ changes, so the same input should now map to a different answer. Spending patterns that were benign last year are fraudulent this year; the meaning of a word shifts. Prediction drift is a shift in the model's output distribution $P(\hat{y})$: the mix of classes the model emits, or the shape of its score histogram, moves. Prediction drift is often the first observable symptom of the other two, because it is computed purely from outputs you already have.
The only direct measure of staleness is performance against ground-truth labels, and in production those labels arrive late, if at all. A loan model learns whether an applicant defaulted months later; a recommendation learns whether a click converted hours later; many predictions never get a label at all. This is the label-delay problem: true accuracy is a lagging indicator you cannot act on in time. So drift detection monitors proxies you can compute immediately from inputs and outputs alone, data drift and prediction drift, and treats a proxy alarm as an early warning that accuracy is probably eroding, well before the delayed labels confirm it.
The practical consequence is a layered strategy. You monitor the cheap, immediate, unlabeled signals continuously (input distributions, output distributions, confidence), and you reconcile them against true accuracy whenever the delayed labels finally land. A spike in data drift that is later confirmed by a drop in labeled accuracy calibrates your thresholds; a spike that is not confirmed teaches you that this particular input shift was harmless. Over time the proxies become trustworthy early sensors for a quantity you can only verify in arrears.
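The reason the delayed labels remain indispensable can be seen in a small synthetic sketch. Everything here is an illustrative assumption: a frozen toy classifier `model`, a one-dimensional input, and a true labeling rule whose cutoff moves from 0.0 to 0.5 while the inputs keep the same distribution. Under pure concept drift of this kind, the input PSI and the predicted positive rate barely move, and only labeled accuracy reveals the decay.

```python
import numpy as np

rng = np.random.default_rng(0)
EDGES = np.linspace(-4.0, 4.0, 17)  # shared bins for the single input feature

def binned(x):
    """Normalized histogram of a 1-D sample over the shared edges."""
    counts, _ = np.histogram(x, bins=EDGES)
    return counts / counts.sum()

def psi(r, c, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    r, c = np.clip(r, eps, None), np.clip(c, eps, None)
    return float(np.sum((c - r) * np.log(c / r)))

def model(x):
    """Frozen toy classifier (hypothetical): positive class when x > 0."""
    return x > 0.0

# Reference period: the true rule matches what the model learned.
x_ref = rng.normal(0.0, 1.0, 20_000)
y_ref = x_ref > 0.0

# Concept drift: same input distribution, but the true rule has moved.
x_new = rng.normal(0.0, 1.0, 20_000)
y_new = x_new > 0.5

input_psi = psi(binned(x_ref), binned(x_new))   # data-drift proxy
pos_rate_ref = model(x_ref).mean()              # prediction-drift proxy
pos_rate_new = model(x_new).mean()
acc_ref = (model(x_ref) == y_ref).mean()        # needs labels
acc_new = (model(x_new) == y_new).mean()        # needs labels

print(f"input PSI      : {input_psi:.4f}")
print(f"positive rate  : {pos_rate_ref:.3f} -> {pos_rate_new:.3f}")
print(f"labeled accuracy: {acc_ref:.3f} -> {acc_new:.3f}")
```

Both unlabeled proxies stay near their reference values in this construction, which is why the layered strategy keeps reconciling them against labeled accuracy rather than trusting them alone.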
2. Detection Methods: Measuring the Distance Between Two Distributions Intermediate
Every unlabeled drift detector reduces to the same shape: summarize a fixed reference window (the distribution the model was validated on), summarize a sliding current window of live traffic, and compute a distance between the two. When the distance crosses a threshold, you declare drift. The detectors differ only in what feature they summarize and which distance they use.
For a single numeric feature, the workhorse is the Population Stability Index (PSI), which bins both windows onto a shared grid and sums the per-bin divergence. With reference proportions $r_b$ and current proportions $c_b$ over bins $b = 1, \dots, B$,
$$\mathrm{PSI} = \sum_{b=1}^{B} (c_b - r_b)\, \ln\!\frac{c_b}{r_b}.$$
This is a symmetrized, binned cousin of the Kullback-Leibler divergence. A common rule of thumb reads $\mathrm{PSI} < 0.1$ as no material shift, $0.1 \le \mathrm{PSI} < 0.2$ as moderate, and $\mathrm{PSI} \ge 0.2$ as a population that has materially moved and warrants action. The Kolmogorov-Smirnov (KS) statistic instead takes the maximum gap between the two empirical cumulative distributions, $D = \sup_x |F_{\text{ref}}(x) - F_{\text{cur}}(x)|$, and comes with a hypothesis test. For high-dimensional features such as embeddings, the Maximum Mean Discrepancy (MMD) compares distributions in a kernel feature space without binning, which is what you reach for when the thing that drifted is a 768-dimensional vector rather than a scalar. Alongside these distribution tests, two almost-free signals deserve continuous monitoring: prediction confidence (a model growing systematically less certain is often the first whisper of drift) and class balance (the proportion of each predicted class), both of which are prediction-drift proxies computable from outputs you are already logging.
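As a minimal sketch of the KS statistic defined above, the following computes $D$ directly from two raw samples with NumPy. The function name `ks_statistic`, the sample sizes, and the 0.6 mean shift are illustrative assumptions, not values from a real system.

```python
import numpy as np

def ks_statistic(ref, cur):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = np.sort(np.asarray(ref, dtype=float))
    cur = np.sort(np.asarray(cur, dtype=float))
    # Both ECDFs are step functions, so the supremum is attained at a sample point.
    grid = np.concatenate([ref, cur])
    f_ref = np.searchsorted(ref, grid, side="right") / ref.size
    f_cur = np.searchsorted(cur, grid, side="right") / cur.size
    return float(np.max(np.abs(f_ref - f_cur)))

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5_000)
same = rng.normal(0.0, 1.0, 5_000)      # fresh draw from the reference distribution
shifted = rng.normal(0.6, 1.0, 5_000)   # mean has moved

d_same = ks_statistic(reference, same)
d_shifted = ks_statistic(reference, shifted)
print(f"D (no drift): {d_same:.4f}")
print(f"D (shifted) : {d_shifted:.4f}")
```

Unlike PSI, this version needs raw samples (or a quantile sketch that approximates the CDF) rather than bin counts. In practice `scipy.stats.ks_2samp` returns the same statistic together with a p-value.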
A retail team once wired a PSI alarm directly to an automatic retrain. It fired, correctly, every single year on Black Friday, when the input distribution genuinely did shift, then dutifully retrained the model on one freak day of traffic and shipped something worse than what it replaced. The drift was real. The response was the bug. Seasonality is not staleness, and a detector that cannot tell the difference is a very expensive smoke alarm wired to the sprinklers.
3. Why Drift Detection Is a Distributed-Aggregation Problem Intermediate
Here is the distributed twist that makes this a Part V problem rather than a textbook statistics exercise. The distribution you care about is the distribution of the whole traffic stream, across every replica in the serving fleet. No single replica sees it. A load balancer might route a new geography's traffic disproportionately to three replicas out of fifty; each of those three sees a local shift that looks like noise, while the fleet-wide shift that actually matters is invisible to all of them individually. Computing drift correctly means computing it over the union of every replica's traffic, which is exactly the aggregation pattern this book has used since Chapter 6.
The mechanism is a mergeable summary. Each replica does not ship its raw predictions to a central place; that would be a firehose. Instead each replica maintains a histogram (or a quantile sketch, the same family of mergeable structures introduced with MapReduce-style aggregation in Chapter 6) over a shared, fleet-wide set of bin edges. Because the bins are agreed in advance, the histograms are additive: the central aggregator simply sums the per-replica bin counts to obtain the exact fleet-wide histogram, then normalizes once. This rides the very same telemetry path that Section 26.6 built for latency and error metrics; drift sketches are just another mergeable metric flowing through the same pipeline. The aggregated current distribution is then compared against the stored reference, and a single fleet-wide drift score comes out.
The additive histogram you sum across replicas here is the same mergeable-summary idea that powered combiners in MapReduce (Chapter 6) and fleet metric aggregation in the previous section. Drift detection is not a new distributed primitive; it is the old aggregation primitive pointed at a new quantity. Whenever a property of the whole system must be computed from per-machine pieces without shipping the raw data, ask whether the per-machine summary is mergeable. If it is, the central computation is just a sum, and it is exact, exactly as the gradient all-reduce of Chapter 1 was exact.
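The exactness claim can be checked directly. The sketch below assumes three hypothetical replicas with deliberately uneven traffic volumes and verifies that summing their histograms reproduces the histogram of the pooled raw data bin for bin. It also shows what goes wrong if the aggregator averages already-normalized histograms instead: every replica is then weighted equally regardless of how much traffic it served.

```python
import numpy as np

rng = np.random.default_rng(2)
EDGES = np.linspace(-6.0, 6.0, 21)  # shared, fleet-wide bin edges

# Hypothetical uneven load: replica traffic volumes differ widely.
shards = [
    rng.normal(0.0, 1.0, 8_000),
    rng.normal(0.0, 1.0, 1_500),
    rng.normal(1.5, 1.0, 500),      # small shard carrying shifted traffic
]

per_replica = [np.histogram(s, bins=EDGES)[0] for s in shards]

# Merge by summing counts, then compare with binning the pooled raw data.
merged = np.sum(per_replica, axis=0)
pooled = np.histogram(np.concatenate(shards), bins=EDGES)[0]
exact = np.array_equal(merged, pooled)

# Normalize once, after the sum, to get the fleet-wide distribution.
fleet = merged / merged.sum()

# Averaging per-replica proportions is NOT the same thing under uneven load.
mean_of_normalized = np.mean([c / c.sum() for c in per_replica], axis=0)
gap = float(np.abs(fleet - mean_of_normalized).max())

print("sum of histograms == histogram of pooled data:", exact)
print(f"max bin gap, normalize-then-average vs sum-then-normalize: {gap:.4f}")
```

The design point is the order of operations: ship counts, sum them, and normalize once at the aggregator.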
4. A Fleet-Wide Drift Detector From Scratch Intermediate
The code below makes the whole pipeline concrete with nothing but NumPy. Four replicas each bin their own slice of traffic over shared edges; the aggregator sums the histograms into one fleet-wide distribution; in each monitoring window a PSI score is computed against the reference window; and a retrain trigger fires when PSI crosses $0.2$. The stream is deliberately stable for the first four windows, then drifts: the input mean slides and the spread widens, the signature of covariate drift.
import numpy as np

rng = np.random.default_rng(7)

# Fixed bin edges agreed fleet-wide so every replica's histogram is mergeable.
EDGES = np.linspace(-6.0, 6.0, 21)  # 20 bins over the score range
N_REPLICAS = 4
WINDOW = 5_000                      # predictions per replica per window
THRESHOLD = 0.2                     # PSI > 0.2 == material drift, retrain


def replica_histogram(samples):
    """One replica bins ONLY its own slice of traffic into a shared grid."""
    counts, _ = np.histogram(samples, bins=EDGES)
    return counts.astype(np.float64)


def fleet_distribution(per_replica_counts):
    """Central aggregator sums mergeable histograms, then normalizes once."""
    total = np.sum(per_replica_counts, axis=0)
    return total / total.sum()


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    r = np.clip(reference, eps, None)
    c = np.clip(current, eps, None)
    return float(np.sum((c - r) * np.log(c / r)))


# Reference window: the input score distribution the model was validated on.
ref_counts = [replica_histogram(rng.normal(0.0, 1.0, WINDOW)) for _ in range(N_REPLICAS)]
reference = fleet_distribution(ref_counts)

# Stream eight monitoring windows. The serving population drifts after window 4:
# the mean of the input scores slides and the spread widens (covariate drift).
print(f"{'window':>6} {'fleet mean':>11} {'PSI':>8} trigger")
print("-" * 38)
for w in range(8):
    if w < 4:
        mu, sigma = 0.0, 1.0  # stable regime
    else:
        mu, sigma = 0.6 + 0.25 * (w - 4), 1.0 + 0.18 * (w - 4)  # drifting regime
    per_replica = [replica_histogram(rng.normal(mu, sigma, WINDOW)) for _ in range(N_REPLICAS)]
    current = fleet_distribution(per_replica)
    score = psi(reference, current)
    fired = "RETRAIN" if score > THRESHOLD else "."
    print(f"{w:>6} {mu:>11.2f} {score:>8.4f} {fired}")
Code 26.7.1: Per-replica histograms are merged by fleet_distribution before any distance is computed, so the PSI score reflects the whole stream rather than any one replica's local view.

window  fleet mean      PSI trigger
--------------------------------------
     0        0.00   0.0011 .
     1        0.00   0.0008 .
     2        0.00   0.0018 .
     3        0.00   0.0008 .
     4        0.60   0.3682 RETRAIN
     5        0.85   0.6938 RETRAIN
     6        1.10   1.1481 RETRAIN
     7        1.35   1.6833 RETRAIN
The detector did exactly what Figure 26.7.1 promised: it stayed quiet while the fleet-wide distribution matched the reference, and it fired the moment that distribution moved, with a score that grows monotonically as the gap widens. Note that the PSI was computed on the summed histogram. An unweighted average of four per-replica PSI scores would answer a different question: each score rests on a quarter of the samples, and every replica counts equally no matter how much traffic it carries, so a shift concentrated on a few heavily loaded replicas could be diluted below the threshold. That is the local-blindness trap that motivates fleet-wide aggregation.
Code 26.7.1 is the teaching version. In production you would not hand-roll the binning, the multiple-testing correction, and the report. Evidently computes data-drift, prediction-drift, and target-drift reports across many features with per-feature tests chosen automatically; NannyML specializes in the label-delay problem, estimating model performance without labels from the confidence distribution and reconciling it when labels arrive; and river supplies streaming detectors (ADWIN, Page-Hinkley, KSWIN) that update incrementally per event for the online setting of Chapter 9. A full multi-feature drift report collapses to a few lines:
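# Written against the older Evidently Report API; newer releases have reorganized
# these import paths, so check the installed version.
# reference_df and current_df are pandas DataFrames with the same feature columns.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])  # per-feature tests, auto-chosen
report.run(reference_data=reference_df, current_data=current_df)
report.as_dict()["metrics"][0]["result"]["dataset_drift"]  # True / False fleet verdict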
5. Drift in LLM Serving: Embeddings and Quality Proxies Advanced
For a classifier, the monitored feature is a tidy numeric score. For a large language model serving free-form text, there is no single scalar to bin, and the methods generalize rather than transfer directly. The standard move is to monitor the embedding distribution of inputs and outputs: encode each prompt and each response into a vector, and track that high-dimensional distribution over windows with a binning-free distance such as MMD, or by reducing to a few summary statistics (mean cosine distance to the reference centroid, the share of inputs whose nearest reference cluster is far away). A surge of out-of-distribution prompts, a new jailbreak template, or a topic the model was never tuned for each shows up as the input embedding cloud drifting away from the reference cloud, long before any human flags a bad answer.
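A minimal sketch of the MMD comparison on embedding windows follows. It uses the biased (V-statistic) estimate of squared MMD with a Gaussian kernel. The synthetic 32-dimensional "embeddings", the bandwidth `gamma = 1 / DIM`, and the size of the injected shift are all assumptions for illustration; a real deployment would use encoder outputs and calibrate the alarm threshold, for example with a permutation test. Note that this version operates on a central sample of vectors rather than on merged per-replica summaries.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negative round-off

def mmd2(x, y, gamma):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )

rng = np.random.default_rng(3)
DIM = 32
gamma = 1.0 / DIM  # bandwidth choice is an assumption, not a recommendation

reference = rng.normal(0.0, 1.0, (400, DIM))   # stand-in for reference embeddings
same = rng.normal(0.0, 1.0, (400, DIM))        # fresh window, no drift
drifted = rng.normal(0.0, 1.0, (400, DIM))
drifted[:, :8] += 1.0                          # shift a subset of dimensions

mmd_same = mmd2(reference, same, gamma)
mmd_drifted = mmd2(reference, drifted, gamma)
print(f"MMD^2 (no drift): {mmd_same:.4f}")
print(f"MMD^2 (drifted) : {mmd_drifted:.4f}")
```

The kernel matrices are quadratic in the window size, so this is typically run on a subsample of each window rather than on every request.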
Because an LLM has no cheap ground-truth label either, output quality is tracked through quality proxies: response length distribution, refusal rate, toxicity and safety-classifier scores, retrieval-grounding scores in a RAG system, and increasingly an LLM-as-judge score sampled on a fraction of traffic. Each is a mergeable per-replica statistic that aggregates fleet-wide exactly like the histogram in Code 26.7.1, so the distributed machinery is unchanged; only the monitored quantity is richer.
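To make "mergeable per-replica statistic" concrete for quality proxies, here is a small sketch in which each replica keeps only additive counters for refusal rate and response length. The class name `ProxySummary`, its fields, and the toy observations are hypothetical; the point is that rates and means are derived after the merge, never averaged across replicas.

```python
from dataclasses import dataclass

@dataclass
class ProxySummary:
    """Additive per-replica summary of two quality proxies for one window."""
    n: int = 0              # responses seen
    refusals: int = 0       # responses flagged as refusals
    len_sum: float = 0.0    # total response length, e.g. in tokens

    def observe(self, length, refused):
        """Fold one response into the running counters."""
        self.n += 1
        self.refusals += int(refused)
        self.len_sum += length

    def merge(self, other):
        """Combine two summaries by adding their counters."""
        return ProxySummary(
            self.n + other.n,
            self.refusals + other.refusals,
            self.len_sum + other.len_sum,
        )

    def refusal_rate(self):
        return self.refusals / self.n

    def mean_length(self):
        return self.len_sum / self.n

# Two replicas with uneven traffic (toy data).
replica_a = ProxySummary()
for length, refused in [(100, False), (120, False), (80, True), (100, False)]:
    replica_a.observe(length, refused)

replica_b = ProxySummary()
for length, refused in [(200, True), (220, True)]:
    replica_b.observe(length, refused)

fleet = replica_a.merge(replica_b)
naive = (replica_a.refusal_rate() + replica_b.refusal_rate()) / 2

print(f"fleet refusal rate      : {fleet.refusal_rate():.3f}")
print(f"average of replica rates: {naive:.3f}")
print(f"fleet mean length       : {fleet.mean_length():.1f}")
```

The fleet refusal rate comes from merged counts (3 refusals out of 6 responses), whereas averaging the two per-replica rates would give a different number because the replicas served different volumes.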
The frontier is detecting decay without waiting for labels. NannyML's confidence-based performance estimation (CBPE) and its direct-loss-estimation successors (2024 to 2025) estimate a model's true accuracy from the shape of its predicted-probability distribution alone, turning the label-delay problem from a blocker into an estimate you can monitor live. For generative systems, a fast-moving line monitors embedding-space drift of prompts and responses and pairs it with automated LLM-as-judge evaluation run continuously on sampled production traffic, with open tooling such as Evidently's LLM evaluation suite and observability platforms like Arize Phoenix and LangSmith productizing it through 2025 and into 2026. The hard open problems are disentangling genuine concept drift from benign seasonality at fleet scale, and controlling the false-alarm rate when hundreds of features and prompt clusters are each tested every window. We borrow the multiple-testing discipline for that from the evaluation methodology of Chapter 5.
6. From Detection to Action: Closing the Loop Without Crying Wolf Intermediate
A drift score is only useful if it drives an action, and the action is a small state machine. Below the threshold, do nothing. Above it, raise an alert to the on-call channel and, depending on confidence, trigger the retraining pipeline that Section 26.1 and Section 26.4 built, which pulls a fresh labeled window, retrains, evaluates against a held-out set, and promotes the new model only if it actually beats the incumbent. The detector closes the loop from a deployed-and-decaying model back to a freshly trained one, which is the whole point of an MLOps pipeline: a model that maintains itself.
The dominant failure mode is the false alarm, and two causes account for most of them. The first is seasonality: traffic legitimately shifts on weekends, holidays, and promotions, and a detector that treats every shift as decay will retrain on freak data and ship regressions, as the Black Friday note warned. The remedy is to compare against a seasonally matched reference (this Monday against prior Mondays, not against an all-time average) and to require persistence, a score that stays high for several consecutive windows rather than one spike. The second is multiple testing: with hundreds of features each tested every window, some will cross any fixed threshold by pure chance, and the fleet-level false-alarm rate compounds. The remedy is a correction such as controlling the false-discovery rate across features, so the system reports the few features that genuinely moved rather than the noise floor, which is the same statistical hygiene the evaluation chapter applies to comparing systems.
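One common way to control the false-discovery rate across features is the Benjamini-Hochberg step-up procedure, sketched below. It assumes each feature's drift test yields a p-value (for instance from a KS test); the 200 features, the uniform p-values standing in for stable features, and the five small p-values standing in for genuinely drifted ones are synthetic.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of features flagged while controlling the FDR at level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Step-up test: find the largest rank k with p_(k) <= q * k / m.
    below = ranked <= q * np.arange(1, m + 1) / m
    flagged = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        flagged[order[: k + 1]] = True  # flag everything up to that rank
    return flagged

rng = np.random.default_rng(4)
N_FEATURES = 200
p_values = rng.uniform(0.0, 1.0, N_FEATURES)     # stable features: uniform p-values
p_values[:5] = [1e-6, 1e-5, 2e-5, 1e-4, 3e-4]    # five features that really moved

naive = p_values < 0.05                          # fixed per-feature threshold
corrected = benjamini_hochberg(p_values, q=0.05)

print("flagged by fixed threshold:", int(naive.sum()))
print("flagged after BH correction:", int(corrected.sum()))
print("all five real shifts kept:", bool(corrected[:5].all()))
```

The fixed threshold flags stable features purely by chance, while the corrected mask keeps the five real shifts and is never larger than the naive one.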
Who: An ML platform engineer at a payments company running a fraud classifier across a fifty-replica serving fleet.
Situation: A new market launched, and its transactions, routed by geography, landed on only a handful of replicas while the rest served the old traffic mix unchanged.
Problem: Per-replica drift dashboards looked calm; each replica's local PSI hovered in the noise, because the shifted traffic was a small fraction of any one replica's stream, and accuracy was quietly sliding on the new market.
Dilemma: Lower every replica's local threshold and drown in false alarms from ordinary per-replica noise, or aggregate fleet-wide and risk a real but localized shift being diluted in the average.
Decision: They switched from averaging per-replica scores to summing mergeable per-replica histograms into one fleet-wide distribution before computing PSI, exactly the pattern in Code 26.7.1, and segmented the report by region so a localized shift stayed visible instead of being averaged away.
How: Each replica emitted an additive histogram over shared bin edges on the existing metrics path from Section 26.6; the aggregator summed them, computed segmented PSI, required two consecutive windows over threshold, then called the retrain trigger.
Result: The fleet-wide segmented PSI for the new region crossed threshold within two windows, the retrain pulled labeled data weighted toward the new market, and the promoted model recovered accuracy there with no regression on existing regions.
Lesson: Drift is a property of the whole stream. Compute it from mergeable summaries aggregated across the fleet, segment to keep localized shifts visible, and require persistence before you act.
With drift detection in place, the serving system gains the last sense it was missing: it now knows not just whether it is running and how fast, but whether it is still right, and it can summon a fresh model when it is not. The next section turns to the question of how to roll a candidate model out safely once retraining has produced one, comparing the incumbent and the challenger on live traffic without betting users on an unproven model, through A/B testing and shadow deployment at fleet scale, in Section 26.8.
For each scenario, classify the drift as data drift, concept drift, or prediction drift, state whether an unlabeled detector could catch it, and name the proxy you would monitor: (a) a sentiment model trained before a product launch starts seeing reviews full of a new product name it has never encountered; (b) a credit model where, after a recession, applicants with previously safe profiles begin defaulting while their application features look identical to before; (c) a content classifier whose output mix suddenly tilts from ninety percent "safe" to sixty percent "safe" with no code change. Explain why the label-delay problem makes one of these three much harder to confirm than the others.
Extend Code 26.7.1 in two ways. First, alongside PSI compute a two-sample Kolmogorov-Smirnov statistic between the fleet-wide current and reference samples (you may reconstruct approximate samples from bin counts, or store raw samples per window) and print both scores side by side; compare which one fires earlier as the drift deepens. Second, replace the single-window trigger with a persistence rule that fires only when the score exceeds the threshold for two consecutive windows. Re-run and confirm the persistence rule still catches the real drift at windows 4 onward while it would suppress a single isolated spike. Explain what false alarm the persistence rule is designed to prevent.
Suppose a fleet of $K = 200$ replicas each serves $2{,}000$ predictions per minute, and you must choose how often to ship drift histograms to the central aggregator. Shipping every second gives fast detection but $200$ messages per second of telemetry; shipping every five minutes is nearly free but delays detection. Using the mergeable-histogram structure, argue why the detection latency is bounded by the shipping interval but the statistical power depends on the total sample count accumulated, not the shipping frequency. Then estimate, for a true shift large enough to give PSI $\approx 0.4$ on a full window, roughly how many total predictions you need before the score reliably clears a $0.2$ threshold, and use that to recommend a shipping interval. Tie your answer to the fleet-metric aggregation trade-offs of Section 26.6.