Section 38.7: Online Evaluation | Building Scalable AI

"An offline metric told me I was a genius. Then they shipped me to one half of a billion users, and the other half quietly explained that I was not."
An A/B Test, Splitting a Billion Users Into Two Honest Halves

Big Picture

A recommender that looks better offline is a hypothesis, not a result; the only verdict that counts is what real users do when the new model is actually serving them, and obtaining that verdict at scale is itself a distributed-systems problem. The previous sections built the recommender: sharded embeddings, distributed candidate generation, a ranking model trained across many workers, and the contextual bandits of Section 38.6 that learn online. This section answers the question that decides whether any of that work ships: does the new system make users click, watch, and return more than the old one? We measure it by randomly splitting live traffic across a distributed fleet, logging billions of interactions, aggregating outcome metrics with the same collective machinery as the rest of the book, and applying statistics carefully enough that we are not fooled by noise. Evaluation here is not an afterthought to the system; it is a distributed system in its own right, and the methodology of Chapter 5 meets the metric aggregation of Chapter 6 at full production scale.

Every prior section of this chapter improved the recommender against an offline number: a higher ranking AUC, a better held-out NDCG, a lower validation loss. Those numbers are necessary, because they let us reject bad models cheaply before any user sees them, but they are not sufficient, and the gap between them and reality is the central lesson of online evaluation. Chapter 5 developed evaluation as a methodology for distributed AI systems in the abstract; this section applies that methodology to the hardest case, a live recommender whose own outputs change the data it is later evaluated on. We begin with why offline metrics mislead, then build the live experiment that settles the question.

1. Why Offline Metrics Are Necessary but Not Sufficient Beginner

An offline metric scores a model on logged data: take yesterday's impressions, ask the new model how it would have ranked them, and compute AUC or NDCG against the clicks that actually happened. This is fast, cheap, and safe, and it correctly screens out models that are simply broken. Its limitation is structural rather than incidental. The logged data was generated by the old policy, so it contains only the items the old model chose to show; the new model's favorite item for a user may never appear in the logs at all, and a metric computed on those logs cannot reward or punish a choice it never observes. A recommender also changes user behavior: a better feed keeps users on the platform longer, which produces more impressions, which feed back into tomorrow's training data. None of that loop is visible to a static offline metric. The empirical consequence, reported across the industry, is blunt: offline AUC and NDCG correlate only loosely with online lift, and a model that wins offline routinely loses online, or moves the offline metric while leaving revenue and retention flat.

Key Insight: Offline Metrics Rank Hypotheses; Only Online Experiments Rank Outcomes

Offline AUC and NDCG are computed on data the old policy generated, so they reward agreement with past choices, not the value of new ones, and they cannot see the feedback loop in which a recommender reshapes the very behavior it is scored against. Treat a strong offline number as a license to run a live experiment, never as the experiment's result. The decision to ship is made on the online lift, with the offline metric serving only as the cheap gate that decides what is worth testing at all.

2. A/B Testing at Fleet Scale Beginner

The instrument that produces an online verdict is the randomized controlled experiment, the A/B test. We split incoming traffic into a control arm that keeps serving the old model and a treatment arm that serves the new one, hold everything else fixed, let both run on live users, and compare an outcome metric such as click-through rate, watch time, or downstream retention. Randomization is what makes the comparison causal: because users are assigned to arms by chance, the two populations are statistically identical in expectation, so any difference in the outcome is attributable to the model and not to who happened to be in which group. The challenge specific to this book is that the experiment runs across a distributed serving fleet handling millions of requests per second, and the assignment, serving, logging, and aggregation must all be distributed too.

Assignment is solved with a hash, not a coin flip, because a coin flip is stateful and a fleet of thousands of stateless servers cannot share state per request. Each server computes a deterministic hash of the user identifier together with the experiment name, maps the hash to a bucket, and serves the arm that owns that bucket. Two servers handling the same user, in the same request or a week apart, compute the same hash and therefore agree on the arm with no coordination at all. This is the same partitioning-by-hash idea introduced in Chapter 2 and used for data shards in Chapter 8, repurposed to shard users across experiment arms. Figure 38.7.1 traces the full pipeline from hashing through aggregation to the significance test.

Figure 38.7.1: The online-evaluation pipeline as a distributed system. A stateless hash of the user id and experiment name assigns each request to a bucket; control buckets serve the old model and treatment buckets the new one. Each fleet emits an interaction log; a distributed reduce (Section 6) sums clicks and impressions per arm over billions of events, and a two-proportion z-test turns the two summed counts into a lift, a p-value, and a confidence interval.

Computing the metric is where Chapter 6 returns. Click-through rate is clicks divided by impressions, and both counts are sums over an event stream too large for any one machine: a map step tags each logged interaction with its arm, and a reduce step sums clicks and impressions per arm. The metric we ultimately compare is a ratio of two distributed reductions, exactly the aggregation pattern that all-reduce performs in training (Chapter 15), now run over logs instead of gradients. The experiment platform, the service that assigns buckets, enforces that overlapping experiments do not collide, and pipes the logs into the aggregation, is one of the largest internal distributed systems any recommendation company operates.

3. Statistical Significance, Sample Size, and the Peeking Trap Intermediate

Two arms will almost never show identical click-through rates even when the models are identical, because clicks are random. The question is whether an observed difference is larger than the noise we would expect from chance alone. Let arm $c$ (control) and arm $t$ (treatment) each receive $n$ impressions with observed click rates $\hat{p}_c$ and $\hat{p}_t$. Under the null hypothesis that the true rates are equal, the standardized difference follows an approximately standard-normal distribution, giving the two-proportion z-statistic

$$ z = \frac{\hat{p}_t - \hat{p}_c}{\sqrt{\hat{p}\,(1-\hat{p})\left(\tfrac{1}{n}+\tfrac{1}{n}\right)}}, \qquad \hat{p} = \frac{\text{clicks}_c + \text{clicks}_t}{2n}, $$

where $\hat{p}$ is the pooled rate used to estimate the variance under the null. A two-sided p-value is $2\,\Phi(-|z|)$, and we declare the difference significant at level $\alpha = 0.05$ when that p-value falls below $0.05$. We report the effect itself with a 95% confidence interval for the absolute lift $\hat{p}_t - \hat{p}_c$, using the unpooled standard error,

$$ \big(\hat{p}_t - \hat{p}_c\big) \pm 1.96 \sqrt{\frac{\hat{p}_c(1-\hat{p}_c)}{n} + \frac{\hat{p}_t(1-\hat{p}_t)}{n}}. $$

Before launching, we must know whether the experiment can even detect the lift we care about. The required sample size per arm to detect an absolute difference $\delta = p_t - p_c$ at significance $\alpha$ and power $1-\beta$ is approximately

$$ n \;\approx\; \frac{\Big(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_c(1-p_c)+p_t(1-p_t)}\Big)^2}{\delta^2}, \qquad \bar{p} = \tfrac{p_c+p_t}{2}, $$

with $z_{\alpha/2} = 1.96$ and $z_\beta = 0.842$ for the conventional 5% significance and 80% power. The inverse-square dependence on $\delta$ is the brutal fact of recommendation experiments: detecting a 0.3 percentage-point lift on a 5% baseline needs on the order of $10^5$ impressions per arm, and detecting a tenth of that needs a hundred times more. This is exactly why fleet scale matters for evaluation and not only for serving: small but real improvements are only measurable because the fleet generates enough traffic to drive $\delta^2 n$ above the noise floor.

Key Insight: Peeking Inflates False Positives, So Decide the Stopping Rule Before You Look

A p-value below 0.05 is calibrated for a single test at a predetermined sample size. If you watch a live dashboard and stop the moment significance appears, you have run many tests and reported the most extreme, so your true false-positive rate is far above 5%. The demo below shows it climbing past 19% under ten naive looks on an experiment with no real effect. The fixes are to fix the sample size in advance, or to use a sequential method (alpha-spending boundaries, group-sequential designs, or always-valid confidence sequences) whose thresholds are explicitly adjusted for repeated looking.

The demonstration in Code 38.7.1 makes all four quantities concrete on simulated data. It runs one A/B test with a true 6% relative lift, computes the lift, the two-proportion z-test, and the confidence interval; reports the required sample size; and then runs an A/A experiment (identical arms) with repeated peeking to expose how the false-positive rate balloons. Output 38.7.1 shows the result.

import numpy as np
from math import sqrt, erf

def norm_sf(z):           # upper tail of standard normal; two-sided p uses 2*sf(|z|)
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

rng = np.random.default_rng(7)
p_ctrl, p_treat = 0.0500, 0.0530      # true rates: a +6% relative lift
n_per_arm = 200_000                    # impressions hashed into each bucket

# One A/B test: Bernoulli clicks over the logged impressions of each arm.
clicks_c = rng.binomial(n_per_arm, p_ctrl)
clicks_t = rng.binomial(n_per_arm, p_treat)
phat_c, phat_t = clicks_c / n_per_arm, clicks_t / n_per_arm

# Two-proportion z-test (pooled variance under the null p_c == p_t).
pool = (clicks_c + clicks_t) / (2 * n_per_arm)
se_pool = sqrt(pool * (1 - pool) * (2.0 / n_per_arm))
z = (phat_t - phat_c) / se_pool
p_value = 2 * norm_sf(abs(z))

# 95% CI for the absolute lift (unpooled standard error).
se_diff = sqrt(phat_c*(1-phat_c)/n_per_arm + phat_t*(1-phat_t)/n_per_arm)
diff = phat_t - phat_c
lo, hi = diff - 1.96*se_diff, diff + 1.96*se_diff

print(f"control CTR  : {phat_c:.5f} | treatment CTR : {phat_t:.5f}")
print(f"abs lift     : {diff:+.5f}  95% CI [{lo:+.5f}, {hi:+.5f}]  rel {diff/phat_c:+.2%}")
print(f"z = {z:.3f}  two-sided p = {p_value:.4f}")

# Required sample size per arm: alpha=0.05, power=0.80.
z_a, z_b = 1.96, 0.842
pbar, delta = (p_ctrl+p_treat)/2, p_treat - p_ctrl
n_req = ((z_a*sqrt(2*pbar*(1-pbar)) +
          z_b*sqrt(p_ctrl*(1-p_ctrl)+p_treat*(1-p_treat)))**2) / delta**2
print(f"required n/arm : {int(np.ceil(n_req)):,}")

# Peeking on an A/A test (no true effect): test at every interim look vs once.
def run_peeking(n_looks, trials=4000):
    fp_fixed = fp_peek = 0
    step = n_per_arm // n_looks
    for _ in range(trials):
        c_cum = t_cum = 0; hit = False
        for look in range(1, n_looks + 1):
            c_cum += rng.binomial(step, p_ctrl)      # both arms share p_ctrl: A/A
            t_cum += rng.binomial(step, p_ctrl)
            nn = step * look
            pl = (c_cum + t_cum) / (2*nn)
            se = sqrt(pl*(1-pl)*(2.0/nn)) + 1e-12
            zz = (t_cum/nn - c_cum/nn) / se
            if 2*norm_sf(abs(zz)) < 0.05: hit = True   # stop-if-significant
        if hit: fp_peek += 1
        if 2*norm_sf(abs(zz)) < 0.05: fp_fixed += 1     # final look only
    return fp_fixed/trials, fp_peek/trials

fixed, peek = run_peeking(n_looks=10)
print(f"A/A false positive: test once = {fixed:.3f} | peek every look = {peek:.3f}")

Code 38.7.1: A complete A/B analysis in standard-library statistics: one test with its lift, z-test, and confidence interval; the required sample size from the formula above; and a peeking experiment on identical arms that measures the inflated false-positive rate directly.

control CTR  : 0.04993 | treatment CTR : 0.05249
abs lift     : +0.00256  95% CI [+0.00119, +0.00393]  rel +5.13%
z = 3.673  two-sided p = 0.0002
required n/arm : 85,225
A/A false positive: test once = 0.047 | peek every look = 0.194

Output 38.7.1: The real 6% lift is detected (p = 0.0002) with a confidence interval that excludes zero, and the sample-size formula asks for about 85,000 impressions per arm. The A/A peeking test is the warning: a single test at the end holds its false-positive rate near the nominal 0.05, while peeking after each of ten looks falsely declares a winner 19.4% of the time on two identical models.

Library Shortcut: statsmodels Gives the z-Test and Power in One Line Each

Code 38.7.1 spells out the z-statistic and the sample-size formula to make them legible, but you would not maintain that arithmetic in production. The statsmodels library provides the two-proportion test and the power-based sample size directly, with the pooling, the two-sided p-value, and the normal approximations handled internally:

from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import samplesize_proportions_2indep_onetail
import numpy as np

# Two-proportion z-test from raw counts: (clicks, impressions) per arm.
zstat, pval = proportions_ztest(count=np.array([10498, 9986]),
                                nobs=np.array([200_000, 200_000]))
print(f"z = {zstat:.3f}  p = {pval:.4f}")

# Sample size per arm for a 5.0% -> 5.3% lift at 80% power.
n = samplesize_proportions_2indep_onetail(diff=0.003, prop2=0.050,
                                          power=0.80, alpha=0.05)
print(f"required n/arm = {int(np.ceil(n)):,}")

Code 38.7.2: The same z-test and sizing as Code 38.7.1 in two library calls. Roughly thirty lines of hand-written statistics collapse to two functions, and statsmodels also exposes confidence intervals, multiple-comparison corrections, and the sequential boundaries that the peeking discussion calls for.

4. Counterfactual and Off-Policy Evaluation Advanced

A/B testing is the gold standard, but it is expensive: every experiment spends live traffic, only a few can run at once on the same surface, and a bad treatment harms real users while it runs. We would like to estimate how a new ranking policy would perform from data the old policy already logged, without exposing anyone to it. This is off-policy, or counterfactual, evaluation, and it is the same problem the bandits of Section 38.6 face when learning from logged feedback: the logs tell us what happened under the policy that produced them, and we want an unbiased estimate of a different policy.

The workhorse estimator is inverse propensity scoring (IPS). Suppose the logging policy showed action $a_i$ to context $x_i$ with probability $\pi_0(a_i \mid x_i)$ (the propensity), and we observed reward $r_i$ (a click, a watch). For a new policy $\pi$ we want its expected reward $V(\pi)$. IPS reweights each logged event by how much more or less likely the new policy is to have taken the same action:

$$ \hat{V}_{\text{IPS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i. $$

The ratio $\pi/\pi_0$ corrects the mismatch between which actions were logged and which the new policy prefers, and the estimator is unbiased provided the logging policy gave every action a nonzero chance ($\pi_0 > 0$ wherever $\pi > 0$). Its weakness is variance: when the new policy strongly favors an action the old policy rarely showed, the weight $\pi/\pi_0$ explodes, and a single lucky reward dominates the estimate. Practical variants tame this, weight clipping, self-normalized IPS, and doubly robust estimators that blend IPS with a learned reward model, and they are the standard offline screen that decides which candidate policies are worth a live A/B slot. Computing $\hat{V}_{\text{IPS}}$ is itself a distributed reduction: a weighted sum over billions of logged events, mapped and reduced exactly as in Section 2, which is why off-policy evaluation runs on the same data platform as the metric aggregation.

Thesis Thread: Evaluation Is a Distributed Aggregation, Same as Training

Every number in this section, the per-arm click-through rate, the pooled variance, the IPS value estimate, is a sum or a ratio of sums over more events than one machine can hold. The map-then-reduce shape is identical to the gradient all-reduce of Chapter 15 and the MapReduce aggregation of Chapter 6; only the operands changed, from gradients to outcome counts. Measuring a distributed AI system at scale is not a different kind of computation from running one. The experiment platform is a distributed system whose job is to aggregate, and the discipline that keeps its answers honest is the evaluation methodology of Chapter 5.

5. Guardrails, Interleaving, and What to Trust Intermediate

A single primary metric is never the whole story. A treatment that raises click-through rate might do so by promoting clickbait that lowers long-term retention, or by serving slower responses that quietly drive heavy users away. Mature experiment platforms therefore track guardrail metrics alongside the primary one: latency, error rate, session length, revenue, unsubscribe rate, and complaint volume. The primary metric decides whether to ship; the guardrails decide whether shipping is safe, and a treatment that wins on clicks but trips a latency or retention guardrail is held back regardless. This is the online face of the multi-metric evaluation discipline from Chapter 5, and the latency and throughput guardrails are precisely the system metrics defined in Chapter 3, now read as experiment outcomes rather than capacity-planning inputs.

When the effect we want to measure is small and a full A/B test would take too long, interleaving offers a far more sensitive instrument for ranking comparisons specifically. Instead of giving each user one ranker, interleaving merges the two rankers' lists into a single list shown to every user, then attributes each click to whichever ranker contributed the clicked item. Because every user sees both rankers' items in the same session, the comparison is within-user rather than between-user, which removes most of the variance from differing user populations and can reach a reliable verdict with one to two orders of magnitude less traffic than a between-user A/B test. The cost is scope: interleaving compares two ranking functions and cannot measure system-wide outcomes like retention or revenue, so it is a fast pre-filter that narrows the candidates an A/B test then confirms.

Practical Example: The Model That Won Offline and Lost the Guardrail

Who: An ML platform engineer on the home-feed ranking team at a large video service.

Situation: A new ranking model beat the incumbent on offline NDCG by a comfortable margin and was queued to ship on the strength of that number.

Problem: The offline win did not say what users would do, and the team had been burned before by models that looked better on logged data and flatter in production.

Dilemma: Ship on the offline metric and move fast, or spend a scarce A/B slot and a week of traffic to confirm a lift that the offline number already suggested was real.

Decision: They ran the A/B test, hashing users into a 5% treatment bucket against a 5% control, with watch time as the primary metric and latency, session length, and reported-content rate as guardrails.

How: The serving fleet hashed each user id and experiment name into a bucket; a nightly MapReduce job summed clicks, watch time, and impressions per arm over roughly two billion logged events, and a two-proportion test plus a watch-time CI ran on the summed counts.

Result: Click-through rate rose 4%, matching the offline promise, but the p99 latency guardrail tripped because the new model's heavier feature lookups slowed tail requests, and session length fell. The model was held until the feature path was optimized; the offline-only ship would have degraded the experience for millions.

Lesson: The primary metric earns the ship; the guardrails grant permission. An offline win is a hypothesis, and only the live experiment, watched across every guardrail, tells you whether the hypothesis survives contact with real users.

Research Frontier: Always-Valid Inference and Trustworthy Online Experiments (2024 to 2026)

The peeking problem has driven a wave of anytime-valid inference: confidence sequences and e-values that stay valid no matter how often you look, letting practitioners watch a live dashboard and stop whenever they like without inflating error rates (the line from Howard et al. on confidence sequences through recent e-value experimentation frameworks). A second active thread reduces variance so smaller effects become detectable: CUPED-style covariate adjustment using pre-experiment data, and variance-reduction via control variates, now standard in industrial platforms. A third addresses interference, the assumption that one user's treatment does not affect another's outcome, which breaks in social and marketplace recommenders where treated users change what untreated users see; cluster-randomized and switchback designs are the current answer. Off-policy evaluation has converged on doubly robust and switch estimators that bound the IPS variance, and benchmarks such as the Open Bandit Pipeline let teams compare estimators on real logged recommendation data. The common thread is that as recommenders grow more adaptive, the evaluation system must grow at least as sophisticated to keep its verdicts honest.

Fun Note: The Twyman Corollary for Recommenders

Twyman's law says any figure that looks interesting or unusual is probably wrong. Its recommender corollary: any A/B result that shows a 40% lift is not a breakthrough, it is a logging bug, a bot-traffic leak, or a bucket that accidentally also turned on a coupon. The experienced experimenter's first response to a spectacular win is not celebration but a check of the instrumentation, and the spectacular win usually does not survive it.

With offline gating, fleet-scale A/B testing, honest significance, off-policy estimation, guardrails, and interleaving in hand, we can measure whether the distributed recommender this chapter built actually serves users better than what it replaced. The final section steps back to the end-to-end system, assembling candidate generation, ranking, the online learner, and this evaluation harness into one operating picture, in Section 38.8.

Exercise 38.7.1: Why Hashing, Not a Coin Flip Conceptual

A junior engineer proposes assigning each request to an arm by drawing a fresh random number on the server that handles it. Explain two concrete failures this causes on a distributed serving fleet: one when the same user issues two requests to two different servers, and one when a user returns a week later. Then explain why hashing the user id together with the experiment name (rather than the user id alone) lets many experiments run concurrently without their bucket assignments becoming correlated. Connect your answer to the stateless-hash partitioning of Chapter 2.

Exercise 38.7.2: Sequential Testing Done Right Coding

Extend the peeking experiment in Code 38.7.1 with a corrected sequential rule. Implement a simple alpha-spending boundary (for example, a Pocock boundary that tests each of the ten looks at a constant adjusted threshold, or a Bonferroni split $\alpha/10$ per look) and rerun the A/A simulation. Report the false-positive rate under your corrected rule and confirm it returns near the nominal 0.05. Then rerun the A/B case with the true 6% lift and report how much later (in cumulative impressions) the corrected rule declares significance compared with the naive peek, quantifying the price of validity.

Exercise 38.7.3: When IPS Blows Up Analysis

Consider an off-policy estimate where the new policy $\pi$ places probability 0.9 on an action the logging policy $\pi_0$ showed with probability 0.01. Compute the importance weight $\pi/\pi_0$ for a logged event on that action, and explain what happens to the variance of $\hat{V}_{\text{IPS}}$ when a handful of such high-weight events carry most of the rewards. Describe how weight clipping and self-normalized IPS each trade a small bias for a large variance reduction, and argue why a doubly robust estimator that adds a learned reward model is more robust when the logging propensities $\pi_0$ are themselves estimated rather than known exactly.