Part VIII: Case Studies and Capstone Projects
Chapter 38: Distributed Recommendation at Scale

Real-Time Personalization

"A click, becoming a better recommendation three seconds later. I do not retrain overnight any more; I learn between your taps."

A Behavioral Stream That Updates Its Own Beliefs
Big Picture

Real-time personalization turns recommendation into an online distributed learning loop: a stream of behavioral events (clicks, dwell, skips) updates a per-user state and a small set of model beliefs in seconds, and that fresh state feeds a low-latency serving fleet that must answer the next request before the user's attention moves on. The offline pipeline of the previous sections (batch-trained embeddings, nightly retraining, precomputed candidate lists in Section 38.5) gives the system its long-memory backbone. This section adds the short-memory reflex on top of it. We capture signals as a stream, fold the most recent actions into a session representation, decide whether to update the model online or merely stream fresh features into a static model, and confront the explore-exploit problem that every feedback-driven recommender faces: a system that always serves what currently looks best never learns whether something else is better. The discipline that resolves it, contextual bandits and Thompson sampling, is reinforcement learning wearing a recommender's clothes, and it is distributed because the learning happens over a behavioral stream while the serving happens on a fleet.

The recommender built across the earlier sections of this chapter is, at heart, a batch system. Embeddings are trained on yesterday's interaction log over a sharded parameter store, candidate sets are generated by approximate nearest-neighbor search, and a ranking model scores them. That backbone is correct and necessary, and it captures the slow-moving truth about a user: their durable tastes, the items they return to, the long arc of their preferences. What it cannot capture is the user who arrived two minutes ago in a mood the nightly model never saw. Someone shopping last week for hiking boots may today be buying a birthday gift for a child; a reader who usually opens finance articles is, this session, three clicks deep into a cooking thread. The batch model answers "who is this person in general?" Real-time personalization answers "what is this person doing right now?" and the two questions have different time constants, different data sources, and different machinery.

That machinery is the subject of Chapter 9, which built stream processing and online learning from the ground up: event time versus processing time, windowing, stateful operators, exactly-once semantics, and the contrast between training a model online and merely computing features online. This section is that chapter applied. We assume the windowed aggregations, the keyed state backends, and the watermarking are in place, and we ask the recommendation-specific question on top of them: given a live stream of one user's actions, how should the system change what it shows them, and how does it learn from whether that change worked?

1. The Signal Stream: Clicks and Dwell as First-Class Data Beginner

Real-time personalization begins by treating user behavior as a stream rather than a table. Every impression, click, scroll, dwell interval, add-to-cart, and abandonment is an event with a timestamp and a user key, emitted the moment it happens and ingested into a log such as Kafka or Pulsar. Two of these signals carry most of the within-session information. A click is an explicit positive: the user chose this item over the others on screen. Dwell time, how long the user stayed on the item, is the implicit complement that disambiguates the click: a three-second bounce and a four-minute read are both clicks, but they mean opposite things, and dwell separates them. The stream processor keys these events by user, maintains a short rolling window of the most recent actions per key, and emits an updated session state whenever a new event lands. This is precisely the keyed stateful operator pattern of Chapter 9, now carrying recommendation payloads.

The architecture, shown in Figure 38.6.1, is a loop rather than a pipeline. Events flow from the serving fleet into the stream processor; the processor updates per-user session state and, on a slower cadence, model beliefs; the refreshed state and beliefs flow back into the serving fleet, which shapes the next set of impressions, which generate the next events. The loop is what makes the system online: feedback from a recommendation re-enters the system that produced it, closing the circuit that a batch pipeline leaves open until the next nightly run.

Serving fleet rank + explore, < 50 ms / request Stream processor keyed by user, windowed state Session state + model beliefs decayed history, posteriors Next request user keeps browsing click / dwell events fold in event fresh state to fleet shapes next impressions (the loop)
Figure 38.6.1: Real-time personalization as a closed loop. The serving fleet emits behavioral events into a stream processor keyed by user; the processor folds each event into per-user session state and, on a slower cadence, model beliefs; the refreshed state flows back to the fleet to shape the next impressions. Unlike the open-ended batch pipeline of Section 38.5, feedback re-enters the system within seconds, which is what makes the learning online.
Key Insight: Two Memories, Two Time Constants

A real-time recommender runs two learners with different clocks. The batch backbone has a time constant of hours to days and encodes durable preference; the online layer has a time constant of seconds to minutes and encodes present intent. They are not competitors. The online layer is a fast correction riding on the slow backbone, and the engineering question is not "online or batch?" but "what belongs in each memory, and how do they compose at serving time?" Conflating the two (trying to make the batch model fast, or the online model deep) is the most common architectural mistake.

2. Session State: A Decaying Representation of Recent Intent Intermediate

The session representation summarizes the user's recent actions into a vector that the ranker can consume alongside the long-term user embedding. Sequential and session-based recommenders, from the early GRU4Rec recurrent models to transformer-based session encoders such as SASRec and BERT4Rec, read the ordered sequence of recent item interactions and produce a next-item prediction or a context vector. The unifying idea, independent of the encoder, is recency weighting: an action taken ten seconds ago should weigh more than one taken ten minutes ago, and the influence of any single action should fade as the session continues.

A simple and widely deployed form of this is an exponentially decayed sum of item embeddings. Let $e_{i}$ be the embedding of the item touched at session step $i$, let the steps arrive at times $t_i$, and let $s_t$ be the session state at the current time $t$. With a decay rate $\lambda > 0$, the state is

$$s_t = \sum_{i : t_i \le t} \, r_i \, e_{i} \, \exp\!\big(-\lambda (t - t_i)\big),$$

where $r_i$ is a per-action weight that encodes signal strength, for example $r_i = \min(1, \text{dwell}_i / \tau)$ so a longer dwell counts for more up to a saturation point $\tau$. The decay term $\exp(-\lambda(t - t_i))$ gives the state a half-life of $\ln 2 / \lambda$: actions older than the half-life contribute less than half their original weight. Crucially, this update is incremental. When a new event arrives, the stream operator multiplies the stored state by the elapsed-time decay and adds the new term, so the per-event cost is $O(d)$ in the embedding dimension $d$ and independent of session length. That property is what lets the update run inside a stateful stream operator at the event rate of the whole fleet, the exact stateful-operator setting of Chapter 9.

Two design choices follow from the decay form. The half-life sets the system's notion of "now": a few minutes for a news feed where intent shifts fast, longer for a shopping session that spans a deliberate research arc. And session boundaries are soft. Rather than declaring a session over by a fixed timeout, the decay lets old intent fade continuously, so a user who returns after a pause resumes with a faint echo of the earlier session rather than a hard reset. This is the same windowing-versus-decay tension that Chapter 9 framed for stream aggregation in general.

3. Online Updates Versus Streaming Features Into a Static Model Intermediate

There is a fork in the road that determines most of the system's complexity, and Chapter 9 named it precisely: streaming features into a static model is not the same as training a model online. In the first design, the model parameters are fixed between batch retrains; the stream only computes fresh feature values (the session state $s_t$, a recent click-through estimate, a "trending now" counter) and the static ranker consumes them. The model's weights never move during the day. This is the safer and more common choice, because a stale feature degrades gracefully whereas a corrupted online update can poison the model for every user at once.

In the second design, the model itself learns online: incremental gradient steps on a streaming loss adjust parameters as feedback arrives, so the system adapts not just its inputs but its decision function within the day. This is genuinely online distributed learning, and it inherits every hazard Chapter 9 catalogued: feedback loops where the model's own choices bias the data it learns from, distribution shift that arrives without warning, and the difficulty of evaluating a model that changes faster than any offline test set can track. The pragmatic resolution most production systems reach is a hybrid: stream features into a mostly-static ranker for stability, and confine the genuinely online learning to a small, well-instrumented component where the feedback is fast, the action space is narrow, and the blast radius of a bad update is bounded. That small component is almost always a bandit, which is the subject of the next two sections.

Practical Example: The Feature That Was Fresh and the Model That Was Not

Who: A personalization engineer on the home-feed team of a large content platform.

Situation: The team wanted the feed to react within a session when a user suddenly engaged with a new topic, but the ranking model retrained only every six hours.

Problem: An early attempt updated the full ranker online with streaming gradients. A burst of bot traffic on one topic shifted the shared weights, and the bad update degraded recommendations for every user for forty minutes before it was rolled back.

Dilemma: Keep the model static and lose within-session reactivity, or learn online and accept that one poisoned stream can harm the entire user base at once.

Decision: They split the responsibilities. The heavy ranker stayed static between batch retrains and consumed a streamed session-state feature (the decayed embedding of Section 2); a separate per-slot bandit, scoped to a handful of feed modules, did the only online learning.

How: The stream processor computed $s_t$ per user and a Thompson posterior per module; the static ranker read $s_t$ as an input, and the bandit chose which module to surface. A poisoned stream could now skew at most one module's exploration, not the global ranker.

Result: Within-session reactivity improved measurably while a single bad stream could no longer move the shared model. The blast radius of online learning was bounded by construction.

Lesson: Stream features into the big model for stability; reserve online learning for a small component whose worst case you can name. Freshness and safety are not opposites if you put them in different boxes.

4. The Explore-Exploit Problem Intermediate

Every feedback-driven recommender faces a dilemma that batch systems are spared. To learn whether item $A$ is better than item $B$, the system must show each of them and observe the response, but every impression spent on the worse item is an impression that could have gone to the better one. A system that always serves whatever currently looks best (pure exploitation) stops gathering evidence and never discovers that a rarely-shown item would have outperformed its incumbent; a system that shows items at random (pure exploration) learns fast but serves badly. This is the multi-armed bandit problem, and it is the same exploration-versus-exploitation trade-off that distributed reinforcement learning confronts at larger scale in Chapter 20; here it appears in its simplest and most directly useful form.

Formally, each candidate item is an arm $k$ with an unknown click-through rate $\mu_k$. At each impression the policy chooses an arm, observes a reward (a click is $1$, no click is $0$), and updates its estimate. The cost of not having served the best arm $k^\star = \arg\max_k \mu_k$ is the cumulative regret after $T$ impressions,

$$R_T = \sum_{t=1}^{T} \big(\mu_{k^\star} - \mu_{a_t}\big),$$

where $a_t$ is the arm chosen at step $t$. A random policy incurs regret that grows linearly in $T$, because it keeps paying the full gap on average forever. A good bandit policy drives regret to grow only logarithmically, $R_T = O(\log T)$, because it narrows its exploration as its estimates sharpen and eventually serves the best arm almost always. That gap between linear and logarithmic regret is the entire value proposition of principled exploration, and the demonstration in Section 5 measures it directly.

Two policies dominate practice. The upper-confidence-bound rule picks the arm with the largest optimistic estimate,

$$a_t = \arg\max_{k} \left( \hat{\mu}_k + c \sqrt{\frac{\ln t}{n_k}} \right),$$

where $\hat{\mu}_k$ is the empirical mean reward of arm $k$, $n_k$ is the number of times it has been played, and the bonus term shrinks as an arm is played more, so the policy explores under-sampled arms and exploits well-sampled ones. Thompson sampling takes a Bayesian route that is often easier to distribute: maintain a posterior over each arm's CTR and, at each step, sample one plausible CTR per arm and serve the arm with the highest sample. For binary click rewards the natural posterior is a Beta distribution, $\mu_k \sim \text{Beta}(\alpha_k, \beta_k)$, updated by

$$\alpha_k \leftarrow \alpha_k + \text{(clicks on arm } k\text{)}, \qquad \beta_k \leftarrow \beta_k + \text{(non-clicks on arm } k\text{)}.$$

The sampling step is the elegant part: arms the system is unsure about have wide posteriors and therefore sometimes draw a high sample and get served, which is exploration that arises automatically from uncertainty rather than from a hand-tuned schedule. The loop is drawn in Figure 38.6.2.

1. Sample draw CTR per arm 2. Serve highest-sampled arm 3. Observe click = 1, else 0 4. Update Beta posterior posterior sharpens each loop; exploration narrows as uncertainty falls
Figure 38.6.2: The Thompson sampling loop for one slot. Each impression draws a plausible CTR from every arm's Beta posterior, serves the arm with the highest draw, observes whether the user clicked, and updates that arm's posterior. Uncertain arms have wide posteriors and occasionally win the draw, so exploration emerges from uncertainty rather than from a fixed schedule, and it narrows automatically as the posteriors sharpen.

Real recommenders rarely use the context-free version above, because the best item depends on who is asking. The contextual bandit conditions the reward on a feature vector $x_t$ (the user's session state $s_t$, time of day, device) and learns a reward model $\mu_k(x_t)$; LinUCB and its Thompson analogue maintain a posterior over a linear reward function per arm. The context vector $x_t$ is exactly the session state of Section 2, which is why the streaming feature pipeline and the bandit are two halves of one system: the stream computes the context, the bandit acts on it.

5. A Bandit That Learns the Best Arm From Clicks Intermediate

The simulation below makes the regret claim of Section 4 concrete. Five candidate items have hidden true click-through rates; arm $3$ is best at $7.0\%$. A Thompson sampling policy with a Beta posterior per arm serves twenty thousand impressions, learning only from the simulated clicks, and we compare its cumulative regret against a policy that picks arms uniformly at random. The code is the loop of Figure 38.6.2 in twenty lines.

import numpy as np

rng = np.random.default_rng(7)

# Five candidate items ("arms"), each with a hidden true click-through rate.
true_ctr = np.array([0.030, 0.055, 0.041, 0.070, 0.022])  # arm 3 is best
K = len(true_ctr)
T = 20_000                                                 # impressions served
best = int(np.argmax(true_ctr))
best_ctr = true_ctr[best]

def run_thompson():
    # Beta(alpha, beta) posterior on each arm's CTR; start at the uniform prior.
    a = np.ones(K); b = np.ones(K)
    regret = 0.0
    cum_regret = np.empty(T)
    for t in range(T):
        theta = rng.beta(a, b)          # sample a plausible CTR per arm
        arm = int(np.argmax(theta))     # serve the arm that looks best this draw
        reward = 1 if rng.random() < true_ctr[arm] else 0   # did the user click?
        a[arm] += reward                # update the posterior with feedback
        b[arm] += 1 - reward
        regret += best_ctr - true_ctr[arm]
        cum_regret[t] = regret
    return a, b, cum_regret

def run_random():
    regret = 0.0
    cum_regret = np.empty(T)
    for t in range(T):
        arm = rng.integers(K)
        regret += best_ctr - true_ctr[arm]
        cum_regret[t] = regret
    return cum_regret

a, b, ts_regret = run_thompson()
rand_regret = run_random()

post_mean = a / (a + b)
ts_pulls = (a + b - 2).astype(int)        # plays per arm (priors subtracted)

print("arm true_ctr  posterior_mean  pulls")
for k in range(K):
    mark = "  <- best" if k == best else ""
    print(f" {k}   {true_ctr[k]:.3f}      {post_mean[k]:.3f}        {ts_pulls[k]:6d}{mark}")
print()
print(f"impressions served            : {T}")
print(f"share of traffic to best arm  : {ts_pulls[best] / T:.1%}")
print(f"Thompson cumulative regret    : {ts_regret[-1]:.1f} expected clicks lost")
print(f"random-policy cumulative regret: {rand_regret[-1]:.1f} expected clicks lost")
print(f"regret reduction vs random    : {1 - ts_regret[-1] / rand_regret[-1]:.1%}")
Code 38.6.1: A Thompson sampling bandit over five items with different click-through rates, learning purely from simulated clicks. Each impression samples one CTR per arm from its Beta posterior, serves the highest, and folds the observed reward back into the posterior, exactly the four-step loop of Figure 38.6.2.
arm true_ctr  posterior_mean  pulls
 0   0.030      0.016           125
 1   0.055      0.055          1502
 2   0.041      0.043           554
 3   0.070      0.072         17542  <- best
 4   0.022      0.032           277

impressions served            : 20000
share of traffic to best arm  : 87.7%
Thompson cumulative regret    : 56.9 expected clicks lost
random-policy cumulative regret: 529.3 expected clicks lost
regret reduction vs random    : 89.3%
Output 38.6.1: Thompson sampling routed $87.7\%$ of impressions to the true best arm and lost only $56.9$ expected clicks to exploration, against $529.3$ for the random policy: an $89.3\%$ reduction in cumulative regret. The posterior means converged toward the true rates on the well-played arms, while the rarely-played poor arms kept loose estimates, which is exactly correct: there is no value in pinning down the CTR of an item you have already learned not to serve.

The numbers tell the Section 4 story exactly. The bandit did not waste impressions confirming what it already knew; it spent its exploration budget early, narrowed it as the posteriors sharpened, and converged to the best arm without ever being told which arm that was. The random policy, by contrast, paid the full average gap on every step, and its regret grew linearly to nearly ten times the bandit's. Scaled to a fleet serving billions of impressions, that gap is the difference between a recommender that compounds what it learns and one that relearns nothing.

Library Shortcut: Vowpal Wabbit and River Do the Bookkeeping

Code 38.6.1 maintained the Beta posteriors by hand to expose the mechanism. In production you reach for a library that handles contextual features, off-policy evaluation, and the streaming update loop. Vowpal Wabbit exposes contextual bandits as a first-class learner, and the river online-learning package wraps the same idea in a few lines:

from river import bandit, proba

# A Thompson-sampling bandit over the five candidate items, click rewards.
policy = bandit.ThompsonSampling(
    reward_obj=proba.Beta(),      # Beta posterior per arm, as in Code 38.6.1
    n_arms=5,
)

arm = policy.pull(range(5))       # choose which item to show this impression
# ... serve item `arm`, observe whether the user clicked ...
policy.update(arm, reward=1.0)    # fold the click back into the posterior
Code 38.6.2: The same Thompson sampling logic as Code 38.6.1, now three calls (pull, serve, update) against river. The library owns the posterior bookkeeping, the contextual feature handling, and the streaming-update loop; you supply only the reward signal and the arm set.

The bandit's exploration is a small but real cost: by construction it shows some users an item that is probably not the best, which is a deliberate sacrifice of immediate reward to buy information. Section 5's $56.9$ lost clicks were exactly that sacrifice, and a well-run system measures it. The metrics that quantify online recommendation quality, click-through rate, regret, and counterfactual lift estimated by off-policy evaluation, are the subject of Chapter 5, and they are what tells you whether an exploration budget is paying for itself.

6. The Freshness and Consistency Cost Advanced

Real-time personalization is not free, and its price is paid in freshness and consistency rather than in raw compute. Every link in the loop of Figure 38.6.1 adds latency between a user's action and the system's reaction: the event must travel from the serving node into the log, the stream processor must window and fold it, the updated state must propagate back to the serving fleet, and the next request must read it. The end-to-end freshness lag is the sum of these, and it competes directly with the per-request latency budget, because the serving node cannot wait for the state to refresh before answering. The system is therefore always acting on slightly stale state, and the engineering goal is to bound that staleness, not to eliminate it.

Consistency is the deeper cost. The session state and the bandit posteriors are mutable state shared between the stream processor that writes them and the fleet of serving replicas that read them, and that fleet is large. Pushing every posterior update synchronously to every replica would serialize the loop and destroy the latency budget, so production systems read from an eventually consistent store: a replica may serve from a posterior that is a few seconds behind the true one. For a bandit this is usually benign, because a slightly stale posterior still explores and exploits sensibly, and the next update corrects it. But it means two replicas can briefly disagree about which arm looks best, and a user who refreshes can see the recommendation flicker. The familiar tension between strong consistency and low latency, introduced in Chapter 2, returns here in recommendation dress: the system trades a controlled amount of staleness for the throughput to serve the whole fleet.

Thesis Thread: Personalization Is Online Distributed Learning Over a Stream

Real-time personalization is the case study's clearest instance of the book's thesis applied to learning rather than to training. The intelligence is distributed in space, a serving fleet of replicas reading shared state, and in time, an online loop that learns from a behavioral stream second by second. The stream processor of Chapter 9 supplies the substrate, the bandit of Chapter 20's reinforcement-learning lineage supplies the decision rule, and the consistency trade-off of Chapter 2 supplies the price. None of these is new; what is new is that they compose into a single low-latency loop where learning and serving are the same system seen at two time scales.

It is worth contrasting this centralized-stream design with the on-device alternative. The architecture here keeps behavior on a server-side stream and learns over the whole population at once, which maximizes signal but concentrates every user's clicks in one place. Federated and on-device personalization, developed in Chapter 14, makes the opposite trade: the model adapts to the user on their own device and only privacy-preserving updates leave it, so the freshness loop is local and the raw behavioral stream never crosses the network. The choice between them is a choice about where personalization lives, and it is governed by latency, privacy regulation, and how much the system can learn from one user in isolation versus from everyone at once.

Research Frontier: Generative and Lifelong Sequential Recommenders (2024 to 2026)

Session-based recommendation is being rewritten by the sequence-model revolution. Generative retrieval approaches such as Google's TIGER (Rajput et al., 2023) represent each item as a sequence of discrete semantic identifiers and let a transformer generate the next item's identifier autoregressively, collapsing candidate generation and ranking into one decoder. Meta's HSTU and the broader "generative recommenders" line (Zhai et al., 2024) reformulate ranking as a sequential transduction problem over the user's lifelong interaction stream, reporting large gains at industrial scale and pushing the session encoder toward true long-context modeling. In parallel, the exploration machinery of this section is moving from context-free bandits toward neural and deep-contextual bandits that share representation with the ranker, and toward off-policy and offline reinforcement learning that can reuse logged interaction streams without paying the full online-exploration cost. The throughline is that the real-time loop of Figure 38.6.1 is becoming a sequence model that learns continually, which sharpens rather than removes the freshness and consistency questions of Section 6.

Real-time personalization closes the gap between what a recommender knew last night and what its user is doing this second. It captures behavior as a stream, folds recent actions into a decaying session state, learns from feedback through a bandit that explores without starving exploitation, and pays for all of it in a bounded amount of staleness. The next section turns from the personalization loop to the infrastructure that must carry it, the serving fleet itself, in Section 38.7.

Exercise 38.6.1: Two Memories, One Decision Conceptual

Section 1 argues that a real-time recommender runs a slow batch backbone and a fast online layer with different time constants. For each of the following, state which memory should own the signal and why: (a) a user's three-year history of purchased categories; (b) the fact that the user opened four cooking articles in the last two minutes; (c) a brand-new item uploaded an hour ago with no interaction history; (d) a seasonal spike in winter-coat demand. For case (c), explain why the explore-exploit problem of Section 4 is the natural tool, and what regret the system pays to learn the new item's click-through rate.

Exercise 38.6.2: Measure the Cost of Exploration Coding

Extend Code 38.6.1 with an epsilon-greedy policy (with probability $\epsilon$ serve a uniformly random arm, otherwise serve the empirical-best arm) and compare its cumulative regret to Thompson sampling for $\epsilon \in \{0.01, 0.05, 0.2\}$. Plot regret against impressions for all four policies (three epsilon values plus Thompson). Explain why a fixed $\epsilon$ produces regret that grows linearly even after the best arm is found, whereas Thompson sampling's regret bends toward logarithmic, and what that implies about hand-tuned exploration schedules in a system that runs for months.

Exercise 38.6.3: The Freshness Budget Analysis

Suppose the loop of Figure 38.6.1 has these stages: event transport to the log ($20$ ms), stream windowing and state fold ($150$ ms), propagation of refreshed state to the serving fleet ($300$ ms through an eventually consistent store), and a serving-side read ($5$ ms). Compute the end-to-end freshness lag between a user's click and the first request that can act on it. If the user's median time between actions is $4$ seconds, what fraction of their actions are served on state that already reflects their previous action? Now argue, using the consistency discussion of Section 6, why driving the $300$ ms propagation toward zero by pushing every update synchronously to every replica would likely make the system slower overall, not faster.