Section 38.9: Project Extension | Building Scalable AI

"I was a tidy matrix factorization over a thousand movies. Then someone handed me the billion-item catalog, and I had to learn that I would never again fit on one machine."
A Toy Recommender, Dreaming of the Billion-Item Catalog

Big Picture

This chapter taught the production recommender as a federation of distributed services; this final section hands it back to you as a buildable project that grows a single-machine matrix-factorization baseline into that federation one stage at a time, with a measurable milestone at every step. You will start where every recommender starts, a single process that factorizes a user-item interaction matrix on a public dataset and retrieves the top items by brute-force scan. Then you will scale each stage out in the order the chapter introduced it: shard the embedding table when it outgrows one machine's memory (Section 38.2), build sharded approximate-nearest-neighbor retrieval (Section 38.3), train a distributed ranker (Section 38.4), stand up a feature store with point-in-time joins (Section 38.5), add a bandit for real-time personalization (Section 38.6), and finish with an A/B harness (Section 38.7) wired into the full service architecture (Section 38.8). The point is not the dataset; it is that one project, carried end to end, is a production recommender in miniature that touches the model, data, inference, and coordination axes of Section 1.2. What makes this case study distinctive among the book's is that the model itself, the embedding table, is the resource too big for one machine, so distribution is forced not by data volume alone but by the parameters.

A recommender read passively teaches the shape of a retrieve-then-rank funnel; a recommender rebuilt teaches why each stage of that funnel had to be distributed. The eight sections before this one walked the system as a finished artifact: a sharded embedding table feeding a sharded candidate retriever, a distributed ranker scoring the candidates, a feature store serving point-in-time-correct features, a bandit layer adapting in real time, and an A/B harness proving the whole thing lifts a business metric. This section inverts that posture. It gives you a staged construction plan that begins with a baseline small enough to fit one laptop and honest enough to measure against, then removes one single-machine assumption per stage. Each stage draws on a specific earlier section and a specific earlier chapter, so the project doubles as a guided tour back through the book: when you shard the embedding table you are applying the parameter-server and sharded-embedding machinery of Chapter 11, when you build the retriever you are applying the distributed vector search of Chapter 25, when you train the ranker you are applying the data parallelism of Chapter 15, and when you add the bandit you are applying the online-learning and streaming machinery of Chapter 9. The discipline that makes the project worth its time is the one Section 1.1 opened with: distribute a stage only when a ceiling forces it, and prove with a number that the distribution helped.

Figure 38.9.1: The staged build. The baseline at the top runs matrix factorization and brute-force retrieval on one machine; each numbered stage below scales one part of the recommender out, drawing on the section and chapter that own that mechanism, and attaches a measurable milestone (capacity, recall, ranking quality, correctness, online lift, tail latency and cost). Carried to the end, the six stages exercise the model, data, inference, and cluster-coordination axes of Section 1.2, which is why this single project is a production recommender in miniature.

1. The Baseline You Scale Out From Beginner

Every honest scale-out project begins with a single-machine baseline, for two reasons. The first is correctness: the distributed retriever must return essentially the same items as the brute-force baseline, and you cannot check that against a baseline you never built. The second is measurement: speedup, recall, and lift are all defined relative to a reference, and the reference is the one-machine system. Your baseline is a single process over a public interaction dataset at MovieLens scale (a few thousand users, a few thousand items, a few hundred thousand ratings) that learns a low-rank factorization $R \approx P Q^{\top}$, where $P$ holds a $K$-dimensional embedding per user and $Q$ a $K$-dimensional embedding per item, then retrieves the top items for a user by scoring every item and keeping the largest dot products. This is the two-tower model in its smallest honest form, and it fits one machine comfortably, which is exactly the point: it is the thing the rest of the project will outgrow.

The code below is that baseline compressed to one dependency-free file. It generates synthetic implicit interactions from latent user and item tastes, trains the factorization with a Bayesian-personalized-ranking objective, retrieves the exact top items per user, and then immediately shows the second scale-out move, replacing the brute-force scan with a clustered approximate-nearest-neighbor search (Section 38.3) that probes only a few item clusters, and measures how much recall it keeps against the exact retrieval. That recall-against-exact comparison, the retrieval analogue of the gradient identity in Section 1.1, is the invariant your project tracks at every retrieval stage.

import math, random

random.seed(0)
N_USERS, N_ITEMS, DIM = 200, 500, 16

def rand_vec(seed, dim):              # deterministic latent vector from a string seed
    r = random.Random(seed)
    return [r.gauss(0, 1) for _ in range(dim)]

user_lat = {u: rand_vec(f"u{u}", DIM) for u in range(N_USERS)}
item_lat = {i: rand_vec(f"i{i}", DIM) for i in range(N_ITEMS)}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Synthetic implicit feedback: each user "engaged" with its true top-10 items.
interactions = {}
for u in range(N_USERS):
    s = [dot(user_lat[u], item_lat[i]) for i in range(N_ITEMS)]
    order = sorted(range(N_ITEMS), key=lambda i: s[i], reverse=True)
    interactions[u] = set(order[:10])

# Stage A: single-machine matrix factorization trained by BPR (two-tower).
K, lr, reg, EPOCHS = 8, 0.05, 0.01, 40
P = {u: [random.gauss(0, 0.1) for _ in range(K)] for u in range(N_USERS)}
Q = {i: [random.gauss(0, 0.1) for _ in range(K)] for i in range(N_ITEMS)}
pairs = [(u, i) for u in range(N_USERS) for i in interactions[u]]
for _ in range(EPOCHS):
    random.shuffle(pairs)
    for u, i in pairs:
        j = random.randrange(N_ITEMS)                 # sample a negative item
        while j in interactions[u]:
            j = random.randrange(N_ITEMS)
        pu, qi, qj = P[u], Q[i], Q[j]
        g = 1.0 / (1.0 + math.exp(dot(pu, qi) - dot(pu, qj)))   # BPR gradient weight
        for k in range(K):
            pu[k] += lr * (g * (qi[k] - qj[k]) - reg * pu[k])
            qi[k] += lr * (g * pu[k] - reg * qi[k])
            qj[k] += lr * (-g * pu[k] - reg * qj[k])

# Stage B: exact top-k retrieval (the brute-force baseline).
def exact_topk(u, k):
    return sorted(range(N_ITEMS), key=lambda i: dot(P[u], Q[i]), reverse=True)[:k]

# Stage C: sharded ANN-style retrieval. Cluster items, probe the nearest clusters.
NCLUST, NPROBE = 16, 6
centroids = [rand_vec(f"c{c}", K) for c in range(NCLUST)]
for _ in range(3):                                    # a few k-means refinement passes
    asg = {c: [] for c in range(NCLUST)}
    for i in range(N_ITEMS):
        asg[max(range(NCLUST), key=lambda c: dot(centroids[c], Q[i]))].append(i)
    for c in range(NCLUST):
        if asg[c]:
            centroids[c] = [sum(Q[i][k] for i in asg[c]) / len(asg[c]) for k in range(K)]
clusters = {c: [] for c in range(NCLUST)}
for i in range(N_ITEMS):
    clusters[max(range(NCLUST), key=lambda c: dot(centroids[c], Q[i]))].append(i)

def ann_topk(u, k):                                   # scan only the probed shards
    probe = sorted(range(NCLUST), key=lambda c: dot(centroids[c], P[u]), reverse=True)[:NPROBE]
    cand = [i for c in probe for i in clusters[c]]
    return sorted(cand, key=lambda i: dot(P[u], Q[i]), reverse=True)[:k]

# Evaluation: MF recall vs ground truth, and ANN recall vs the exact retrieval.
def recall(pred, truth):
    return len(set(pred) & set(truth)) / len(truth)

KK, mf_recall, ann_vs_exact, scanned = 10, 0.0, 0.0, 0
for u in range(N_USERS):
    ex, an = exact_topk(u, KK), ann_topk(u, KK)
    mf_recall += recall(ex, interactions[u])
    ann_vs_exact += recall(an, ex)
    probe = sorted(range(NCLUST), key=lambda c: dot(centroids[c], P[u]), reverse=True)[:NPROBE]
    scanned += sum(len(clusters[c]) for c in probe)
mf_recall /= N_USERS; ann_vs_exact /= N_USERS
frac = scanned / (N_USERS * N_ITEMS)

print(f"users / items / latent K        : {N_USERS} / {N_ITEMS} / {K}")
print(f"MF recall@{KK} vs ground truth    : {mf_recall:.3f}")
print(f"ANN recall@{KK} vs exact retrieval: {ann_vs_exact:.3f}")
print(f"fraction of catalog scanned      : {frac:.3f}  (nprobe={NPROBE}/{NCLUST})")
print(f"retrieval speedup vs exact scan  : {1.0 / frac:.1f}x")

Code 38.9.1: The recommender-in-miniature baseline and its first scale-out step. Stage A trains a matrix factorization; Stage B retrieves the exact top items by brute force; Stage C clusters the item embeddings and probes only the nearest clusters, the toy form of the sharded approximate-nearest-neighbor index of Section 38.3. The two reported recalls separate model quality (does the factorization find the right items?) from retrieval quality (does the shard-and-probe scheme keep them?).

users / items / latent K        : 200 / 500 / 8
MF recall@10 vs ground truth    : 0.441
ANN recall@10 vs exact retrieval: 0.961
fraction of catalog scanned      : 0.416  (nprobe=6/16)
retrieval speedup vs exact scan  : 2.4x

Output 38.9.1: The clustered retrieval keeps $96.1\%$ of the exact top-ten while scanning only $42\%$ of the catalog, a $2.4\times$ retrieval speedup. The separation of the two recalls is the lesson: the factorization's own recall against ground truth ($0.441$ here, limited by the tiny latent rank) is a model question for the ranker stage, while the near-unit ANN-versus-exact recall is the retrieval invariant your project must hold as you shard the index across machines.

Key Insight: Separate the Two Recalls, or You Will Debug the Wrong Stage

A recommender funnel has two distinct quality numbers and confusing them wastes weeks. The first is whether the model put the right items into reach at all, measured as recall of the exact top-$k$ against ground-truth engagement; it is a property of the embeddings and the loss, and you improve it by training a better model (Stage 3). The second is whether the retrieval mechanism keeps the items the model already ranked highly, measured as recall of the approximate retrieval against the exact retrieval; it is a property of the index, and you improve it by probing more shards or refining the partition (Stage 2). Output 38.9.1 reports both side by side precisely so you never attribute a retrieval regression to the model, or a weak model to the index. Every milestone in this project names which of the two it governs.

2. Staging the Scale-Out, Milestone by Milestone Intermediate

With the baseline in hand, you scale the recommender out one stage at a time, in the order of Figure 38.9.1, never advancing until the current stage hits its milestone. The discipline of one-stage-at-a-time matters because it isolates cause and effect: when recall drops or tail latency spikes, exactly one thing changed, and you know which section's mechanism to inspect. Table 38.9.1 is the project plan. Each row names the stage, the binding ceiling that forces the distribution, the section and chapter that supply the mechanism, and the quantitative milestone that tells you the stage is done.

Table 38.9.1: The staged build plan. Scale out top to bottom; do not advance until the milestone is met. Each stage removes one single-machine assumption and draws its mechanism from the named section and chapter.

Stage	Binding ceiling	Mechanism (section, chapter)	Milestone to hit
1. Shard the embedding table	Parameters exceed one node's memory	Section 38.2, Ch 11	Catalog larger than any single node holds; lookups served from shards
2. Sharded ANN retrieval	Candidate scan too slow over the full catalog	Section 38.3, Ch 25	Recall@$k$ within 1 point of exact at target latency
3. Distributed ranker	Ranking model and data exceed one GPU	Section 38.4, Ch 15	Ranking AUC at scale matched within 0.005 of single-node
4. Feature store, point-in-time joins	Features must be leakage-free across the fleet	Section 38.5, Ch 9	Zero train/serve skew; no future leakage in any join
5. Real-time bandit	Static model cannot adapt within a session	Section 38.6, Ch 9	Online reward lift over the static ranker, significant
6. A/B harness + service architecture	Request volume and decision validity	Section 38.7, Section 38.8	p99 latency under SLO; cost per 1000 requests bounded

The milestones are quantitative on purpose. "Sharded the embedding table" is not a milestone; "the table now spans four nodes and per-key lookups return in under a millisecond at the p99" is. Stage 1 is a capacity stage, judged by the catalog size it unlocks once the table no longer fits one box, the embeddings-at-scale move that Chapter 11 owns. Stage 2 is a quality-under-latency stage, where the move from exact scan to approximate nearest neighbor (Chapter 12, Chapter 25) trades a sliver of recall for a large latency win, and the milestone forbids trading away more than a point. Stage 3 is a model-quality stage, where data-parallel training (Chapter 15) must reproduce the single-node ranking AUC. Stage 4 is a correctness stage: a feature store with point-in-time joins exists to forbid the most common silent bug in recommendation, training on a feature value that was not yet known at serving time. Stage 5 is an online-lift stage, where a bandit (Chapter 9) earns its place only if it raises reward over the static ranker by a statistically significant margin. Stage 6 closes the loop with the A/B harness and the service architecture, proving the fully distributed system meets its tail-latency SLO at a defensible cost per thousand requests, the evaluation discipline of Chapter 5 applied to a live recommender.

Practical Example: The Capstone That Grew One Stage at a Time

Who: A graduate student building this exact project as a term capstone on a four-node lab cluster plus two cloud GPUs.

Situation: The single-machine matrix-factorization baseline over a MovieLens-scale dataset trained in under a minute and retrieved acceptable recommendations.

Problem: The assignment required a billion-item-style design, but jumping straight to a fully distributed system produced something slower than the baseline and impossible to debug.

Dilemma: Rebuild everything distributed in one leap, fast to write but a black box when it underperformed, or scale one stage at a time against milestones, slower to start but debuggable throughout.

Decision: The staged plan of Table 38.9.1. Profiling showed candidate retrieval, not training, dominated serving latency, so the student sharded the embedding table and built the approximate-nearest-neighbor retriever first, and stopped at each milestone before touching the next stage.

How: The table was sharded across four nodes; retrieval moved to a clustered index with recall held within half a point of exact; the ranker was trained data-parallel across the two GPUs with AUC matched to the single-node run; the feature store enforced point-in-time joins; a Thompson-sampling bandit re-ranked the top candidates; and an A/B harness compared the bandit arm to the static ranker.

Result: The final pipeline matched the baseline's offline ranking quality, served retrieval roughly three times faster, showed a small but significant online reward lift from the bandit, and met its p99 budget at a cost per thousand requests the student could defend line by line.

Lesson: Profile to find the binding stage, scale that one first, and never advance past an unmet milestone. A staged build is the only kind whose final numbers you can trust.

3. The Numbers Your Project Must Hit Intermediate

A recommendation project lives or dies by whether its claimed wins are real, so each milestone is a quantity you compute, not a feeling. Retrieval quality is recall at $k$, the fraction of the relevant items that appear in the returned top $k$,

$$\text{recall@}k = \frac{|\,\text{top-}k \cap \text{relevant}\,|}{|\,\text{relevant}\,|}, \qquad \text{AUC} = \Pr\big[\,s(\text{positive}) > s(\text{negative})\,\big],$$

where the ranking AUC is the probability that the ranker scores a clicked item above a non-clicked one, the standard pairwise quality metric for the Stage-3 ranker. For Stage 2, the milestone pins approximate recall against exact recall to within a few points, exactly the gap Output 38.9.1 measured at $0.961$ before any latency pressure; the retrieval speedup is the reciprocal of the catalog fraction you scan, so probing $42\%$ of clusters buys the $2.4\times$ of that output. For Stage 1, the relevant number is sizing the sharded embedding table, the parameter ceiling that defines this case study. A catalog of $N$ items with a $K$-dimensional float32 embedding needs $4 N K$ bytes of item-tower parameters, and the shard count is $\lceil 4 N K / M \rceil$ for a per-node memory budget $M$. Concretely, $N = 10^9$ items at $K = 128$ is $4 \times 10^9 \times 128 = 512$ gigabytes of item embeddings alone, so with $M = 64$ gigabytes per node you need at least $\lceil 512 / 64 \rceil = 8$ shards before any replication, and the user tower adds its own table on top, which is the stage-1 capacity argument from Section 38.2 made arithmetic. The service-level milestones are tail latency and unit cost: p99 latency must sit under the serving SLO at the target queries per second, and cost per thousand requests must stay within budget, the two numbers that decide whether the system ships. Compute all of these targets before you build, so each milestone is a prediction you test rather than a result you rationalize.

One stage has a milestone that is not a number but a guarantee. The feature store of Stage 4 exists to make point-in-time joins correct: when you assemble the training row for an interaction at time $t$, every feature value in that row must be the value that was known at time $t$, never a value computed afterward. Violating this is training on the future, the leakage bug that makes an offline AUC look spectacular and an online deployment collapse. The milestone is therefore binary, zero leakage, and you verify it by replaying the feature timeline and asserting that no join reads a value with a timestamp later than the label it is paired with, the streaming-correctness discipline of Chapter 9 applied to the most expensive silent failure in recommendation.

Library Shortcut: Each Stage Is a Few Lines in a Production Tool

The hand-rolled factorization and clustered retrieval in Code 38.9.1 are for understanding; in the real project each stage maps to a framework that handles the distribution for you. Code 38.9.2 names that mapping, turning the staged plan into a near-deployment plan exactly as the design checklist did in Section 1.8. TorchRec shards the embedding table across GPUs with a single planner call, and FAISS turns the from-scratch cluster-and-probe of Code 38.9.1 into an inverted-file index with one constructor:

# Each staged milestone -> the production tool that owns it.
STACK = {
    "shard embeddings":   "TorchRec           # sharded embedding tables across GPUs",
    "sharded retrieval":  "FAISS-IVF / ScaNN  # ANN shards, scatter-gather merge",
    "distributed ranker": "torch DDP / TorchRec DLRM  # data-parallel ranker train",
    "feature store (PIT)":"Feast / Tecton     # point-in-time-correct feature joins",
    "real-time bandit":   "Vowpal Wabbit / River  # online contextual bandit updates",
    "A/B + serving":      "Ray Serve + a stats harness  # replicate, route, evaluate",
}
for stage, tool in STACK.items():
    print(f"{stage:20s}-> {tool}")

Code 38.9.2: The six staged milestones mapped to the production tool that owns each. A finished Table 38.9.1 row selects a key, and the value is the framework the corresponding chapter teaches; the from-scratch table and retrieval of Code 38.9.1 collapse to a TorchRec sharding plan and a FAISS inverted-file index behind a Ray Serve deployment.

4. Extension Challenges Worth the Cluster Advanced

Once the six stages hit their milestones you have a working distributed recommender, and the project becomes a platform for the harder questions the chapter only gestured at. Each extension below adds one capability that a real billion-item system needs, and each reaches into a different part of the book, so finishing them turns the capstone from a funnel into a system. Add multi-objective ranking, where the ranker no longer optimizes click probability alone but a weighted blend of engagement, dwell time, and a long-term-value proxy, and measure the trade-off frontier as you sweep the objective weights; the milestone is a Pareto curve, not a single AUC. Tackle cold-start, the problem of users and items with no interaction history and therefore no learned embedding, by adding content features (a side tower over item metadata) that let the model place a brand-new item near similar ones; the milestone is recall on a held-out set of items the model never saw during training. Add a diversity-and-fairness pass over the final ranking, so that a single popular category cannot monopolize the slate and so that exposure is allocated across item providers within a stated bound; the milestone is a diversity or exposure metric held at a target while recall stays within budget.

Two further extensions push the candidate generator and the trust boundary. Add graph candidates, generating a slice of the retrieval set from a user-item interaction graph rather than embedding similarity alone, the distributed-graph-ML machinery of Chapter 13; co-visitation and random-walk candidates often surface long-tail items that pure embedding retrieval misses, and the milestone is a long-tail-coverage lift. If your recommender serves multiple tenants or markets and you aggregate signals across them, add the reliability and privacy guarantees of Part VII so that one corrupted upstream signal cannot poison the slate and no single user's interactions leak through a cross-tenant aggregate, the robust-aggregation and differential-privacy ideas of Chapter 35. Each extension is a small, bounded change to a working system, which is exactly the posture in which distributed-systems concepts are learned best: against a baseline you can measure, in a funnel you already understand.

Research Frontier: Where Distributed Recommendation Is Heading (2024 to 2026)

The extensions above track live research lines, so a capstone that implements them is working at the current edge. Embedding-table scaling has moved past static sharding toward learned and compressed representations: methods in the lineage of compositional and hash embeddings, and the Meta DLRM and TorchRec ecosystem, push trillion-parameter tables onto manageable fleets, and quantized or product-keyed tables shrink the per-node footprint that Stage 1 sizes. On the retrieval side, learned candidate generation is displacing pure two-tower similarity: generative-retrieval models that decode item identifiers directly (the TIGER and semantic-ID lineage) and graph-augmented candidate sources are being pushed to web-scale catalogs. On the modeling side, recommendation is converging with sequence modeling, where transformer-style user-history encoders (the SASRec and HSTU lineage from Meta's generative-recommender work) treat the interaction stream as a sequence and report large offline and online gains, which reframes the Stage-3 ranker as a distributed sequence-model training problem closer to Chapter 19 than to classical ranking. Finally, real-time personalization is moving from session bandits toward full streaming model updates, where the feature store of Stage 4 and the online learner of Stage 5 merge into a continuously trained system, the frontier where the streaming machinery of Chapter 9 meets recommendation at scale.

5. Chapter Summary and What You Built Beginner

This section closes Chapter 38, so it is worth stating the through-line the whole chapter built. We began with the problem definition (Section 38.1): recommend from a catalog so large that the model itself, the embedding table mapping every user and item to a vector, does not fit on one machine. From there the chapter walked the retrieve-then-rank funnel stage by stage, and every stage was the same move applied to a different part of the system, partition the work across machines, move only what must be moved between them, and recombine the result correctly. Sharding the embedding table (Section 38.2) distributed the model itself, applying the sharded-embedding and parameter-server machinery of Chapter 11 to the parameters that no node could hold. Sharded candidate retrieval (Section 38.3) turned a query into the scatter-gather approximate-nearest-neighbor pattern of Chapter 25. The distributed ranker (Section 38.4) applied the data parallelism of Chapter 15 to score candidates at scale. The feature store with point-in-time joins (Section 38.5) and the real-time bandit (Section 38.6) brought the streaming-correctness and online-learning machinery of Chapter 9 to bear, and the A/B harness (Section 38.7) and service architecture (Section 38.8) proved, with the evaluation discipline of Chapter 5, that the distributed system lifts a real metric without breaking its latency budget. The chapter is, end to end, one distributed system whose distinguishing feature is that the model, not just the data, is too big for one machine.

Thesis Thread: When the Model Is the Thing That Will Not Fit

The book's spine is that AI at scale is the engineering of systems whose data, computation, models, inference, and decisions are distributed across many machines, and that each distribution is forced by a ceiling, not chosen for elegance. Most case studies in this part hit the data ceiling first; distributed recommendation is the clearest demonstration of the model ceiling, because the embedding table, a parameter for every user and every item in a billion-item catalog, is itself terabytes that no accelerator can hold. That single fact reorganizes the whole system: the table must be sharded (the model axis), candidates must be retrieved from a sharded index (the inference axis), the ranker must be trained data-parallel over interactions no node can store (the data axis), and the fleet that serves it all must be coordinated under a tail-latency budget (the coordination axis). The staged project in this section is the thesis made buildable: you start with a recommender that fits one laptop and watch the parameters, not the data, be the first thing to force it onto a cluster. That is the bridge into the remaining case studies, each of which leans hard on a different subset of these same axes.

Key Takeaway: Chapter 38 as a Buildable System

Distributed recommendation is not six unrelated tricks; it is one funnel in which every stage is the same partition-move-recombine move applied to a different resource, and the resource that forces distribution first is the model. (1) Shard the embedding table so the catalog exceeds any single node's memory. (2) Sharded approximate-nearest-neighbor retrieval to fetch candidates while holding recall. (3) Data-parallel the ranker so it matches single-node AUC at scale. (4) A feature store with point-in-time joins so training never reads the future. (5) A real-time bandit so the system adapts within a session for a measured online lift. (6) An A/B harness and service architecture so the distributed system meets its p99 SLO at a bounded cost per thousand requests. Built in this order against milestones, the funnel exercises the model, data, inference, and coordination axes of distribution, which is why it is a production recommender in miniature.

Project Ideas: Build the Recommender, Then Push It

Each idea is sized so that carrying it through the staged plan of Table 38.9.1 becomes a capstone in the sense of Chapter 41. Core build: start from the Code 38.9.1 baseline on a public dataset (MovieLens, or a larger public interaction log) and scale all six stages out on a small cluster or cloud spot instances, recording the recall@$k$, ranking AUC, p99 latency, online lift, and cost-per-1000-requests milestone at each stage; the deliverable is a writeup in which every number is measured against the baseline. Multi-objective ranking: blend engagement, dwell time, and a long-term-value proxy in the Stage-3 ranker and report the Pareto frontier as you sweep the weights. Cold-start: add a content side tower over item metadata and report recall on items the model never saw in training. Diversity and fairness: add a slate-level diversity or provider-exposure constraint and hold the metric at target while recall stays within budget. Graph candidates: generate part of the retrieval set from a user-item interaction graph (Chapter 13) and report the long-tail-coverage lift. Trust and privacy (multi-tenant): add Byzantine-robust signal aggregation and differentially private cross-tenant statistics (Chapter 35) so one corrupt signal cannot poison the slate and no user's interactions leak through an aggregate.

Exercise 38.9.1: Name the Ceiling, Stage by Stage Conceptual

For each of the six stages in Table 38.9.1, state the single binding ceiling (data, model, or throughput) that forces the distribution, and the axis from Section 1.2 it maps to. Then explain why distributed recommendation hits the model ceiling before the data ceiling, in contrast to the web-scale RAG case study of Chapter 36, by reasoning about what the embedding table stores. Finally, identify the one stage whose milestone is a binary correctness guarantee rather than a quantity, and explain why violating it makes an offline metric lie.

Exercise 38.9.2: Extend the Miniature Recommender Coding

Starting from Code 38.9.1, (a) sweep NPROBE from 1 to 16 and plot ANN recall-versus-exact against the fraction of the catalog scanned, then state the smallest NPROBE that meets a recall-held milestone of $0.95$ and the retrieval speedup it buys. (b) Add a simple content side tower for cold-start: hold out 50 items from training, give each a content vector that is the average of its cluster's trained embeddings, and report recall on those held-out items against a baseline that gives them random embeddings. (c) Add a second objective to the score by mixing the dot product with a popularity penalty $-\lambda \log(1 + \text{count}_i)$, sweep $\lambda$, and show how the trade-off between recall and the average popularity of the recommended items moves.

Exercise 38.9.3: Size the Table and Bound the Cost Analysis

A target catalog has $N = 2 \times 10^9$ items, each embedded to a $K = 128$-dimensional float32 vector, plus $5 \times 10^8$ users embedded the same way. (a) Compute the total embedding-table volume and the minimum shard count for a per-node memory budget of $M = 80$ gigabytes, before any replication, separating the item tower from the user tower. (b) The serving funnel retrieves $1000$ candidates and ranks them; if retrieval probes $1\%$ of the sharded index and the ranker scores all $1000$ candidates at $50$ microseconds each, estimate the per-request compute time and the queries per second one ranker replica sustains. (c) Given a target of $50{,}000$ queries per second and a cloud price you assume for a ranker replica, compute the number of replicas and the cost per thousand requests, and state which stage-6 milestone in Table 38.9.1 this number governs.