Section 38.1: Problem Definition | Building Scalable AI

"You scrolled, and I had thirty milliseconds. In that blink I asked a hundred machines what you might like, watched a billion items shrink to twenty, and handed you a row of suggestions as if I had been thinking about you all along."
A Recommendation, Assembled in Eighty Milliseconds From a Hundred Machines

Big Picture

A web-scale recommender is the one system in this book where the model itself, not just the data or the request volume, is the thing that does not fit on a machine. Its embedding tables hold a learned vector for every item in a catalog of billions and every user in a population of hundreds of millions, and those tables run to terabytes that no single accelerator can store. On top of that immovable model sits a serving path that must turn a billion candidate items into a ranked handful within tens of milliseconds, hundreds of thousands of times per second. This chapter takes one such recommender apart end to end. This opening section states the problem precisely, fixes the scale numbers that make a single machine impossible, decomposes the multi-stage pipeline into the components the rest of the chapter builds, and maps each component onto the six axes of distribution from Chapter 1. Everything that follows is engineering against the requirements written down here.

A recommender system answers a question that has no query text: given who you are and what you are looking at right now, which few items, out of a catalog of billions, should fill this row of the screen. On a laptop, over a few thousand products and a few hundred users, this is a weekend project: learn a small matrix factorization, score every item for every user, and sort. The interesting version, and the one this chapter studies, is the version that powers a large marketplace, video platform, or social feed. When the catalog is measured in billions of items, the audience in hundreds of millions of users, and the response budget in tens of milliseconds, every component that was trivial on the laptop becomes a distributed system in its own right. The purpose of this case study is to show how the machinery built in Parts III through VII assembles into one coherent product, and where the seams between those parts actually fall.

We will treat the system as a concrete target rather than an abstraction. Throughout the chapter, the running specification is a recommendation service with a catalog on the order of two billion items, an active user population in the hundreds of millions, a serving path that must return a ranked slate within a latency a person never notices, a peak throughput measured in hundreds of thousands of requests per second, and a freshness requirement that lets a click recorded seconds ago shape the next recommendation. Those five pressures, catalog and user scale, latency, throughput, freshness of signals, and cost, are the requirements; the rest of the chapter is the design that satisfies them.

1. The Requirements: Five Numbers That Rule Out One Machine Beginner

A problem definition is only useful if it commits to numbers, because the numbers decide which ceilings bind and therefore which axes of distribution we must reach for. We fix five requirements and carry them through the entire chapter. The first is scale: the system learns and serves embeddings for a catalog of $N_{\text{items}} \approx 2 \times 10^{9}$ items and $N_{\text{users}} \approx 4 \times 10^{8}$ users, plus categorical features (the advertiser, the query, the device, the context) whose vocabularies add billions more rows. The second is latency: the end-to-end serving budget, from request to ranked slate, sits on the order of tens of milliseconds, because a recommendation that arrives after the page has rendered is a recommendation nobody saw. The third is throughput: the service must sustain hundreds of thousands of requests per second at peak, every one of which fans out across the retrieval and ranking fleet. The fourth is freshness: a user who just clicked, skipped, or purchased should see that signal reflected within seconds, so the system is never a nightly batch but a continuously updated flow. The fifth is cost: the design must hit the first four at a price an operator would pay, which immediately disqualifies any approach that scores the whole catalog per request or holds a private copy of the embedding tables on every server.

These five requirements interact, and the interaction is the whole subject. Freshness fights cost, because applying every click to the embedding tables in real time burns memory bandwidth and network traffic. Latency fights catalog size, because a larger catalog means more candidates to consider per request. Throughput fights latency, because every concurrent request competes for the same retrieval and ranking hardware. A single machine fails all five at once and for independent reasons: it cannot store the embedding tables, cannot scan billions of candidates inside the budget, cannot apply the update stream, and cannot serve the request volume. The next section makes that failure quantitative.

Key Insight: The Model, Not the Data, Is What Overflows Here

In the RAG case study of Chapter 36 the corpus was the giant; the generator fit on a few accelerators. A recommender inverts this. Its deep network is small, often a few hundred megabytes of dense weights, but its embedding tables, one learned vector per id across catalog, users, and every categorical feature, are terabytes. The binding ceiling is parameter memory, which is why this case study leans hardest on the distribute-the-model axis and on the sharded parameter servers of Chapter 11. Read a recommender's requirements and the first ceiling you hit is almost always the size of the embedding table, not the size of the dataset.

2. Why a Single Machine Is Not Even Close Beginner

The argument for distribution here is not rhetorical; it falls out of arithmetic on the five requirements. Take the embedding tables first. Every id the model knows about, an item, a user, an advertiser, a query token, gets a learned vector of dimension $d$ stored at $b$ bytes per dimension. The number of rows across all tables and the table size in bytes are

$$R = N_{\text{items}} + N_{\text{users}} + N_{\text{feat}}, \qquad S_{\text{emb}} = R \cdot d \cdot b,$$

where $N_{\text{feat}}$ collects the categorical-feature vocabularies. With billions of rows, a dimension $d = 256$, and $b = 4$ bytes per dimension in single precision, $S_{\text{emb}}$ runs to several terabytes, one to two orders of magnitude past the memory of any single node. The number of parameter-server shards the table must be split into is the table size divided by the memory each node can devote to it,

$$K_{\text{shard}} = \left\lceil \frac{S_{\text{emb}}}{M_{\text{node}}} \right\rceil,$$

which for a node offering tens of gigabytes of fast memory to embeddings lands in the dozens to low hundreds. The serving path is no kinder. It cannot score all $N_{\text{items}}$ candidates per request, so it runs a funnel that shrinks the catalog through successive stages, retrieval then ranking then final selection, with a multiplicative reduction

$$\rho = \prod_{j} \frac{n_{j}}{n_{j+1}},$$

that takes billions of items down to the few tens actually shown. Even the surviving candidates obey a concurrency wall. The query path follows Little's law: the number of requests in flight at any instant is the arrival rate times how long each one stays in the system,

$$C = \lambda \cdot L,$$

so a throughput $\lambda$ of hundreds of thousands of requests per second at a latency $L$ of tens of milliseconds keeps thousands of requests being served concurrently, which no single server's compute or memory can hold. Figure 38.1.1 shows the multi-stage pipeline and the axis each stage loads; Code 38.1.1 puts real numbers behind every one of these formulas.

Figure 38.1.1: The multi-stage distributed recommender as a dataflow. The sharded embedding tables (top) hold terabytes of learned vectors that never fit on one node and serve both training and the online path. The serving path (middle) embeds the user and context, retrieves about ten thousand candidates from a billion-item catalog by approximate nearest neighbor, ranks the survivors down to a few hundred, and selects a final slate of roughly twenty, all within the latency budget. The streaming layer (lower band) turns clicks and views seconds old into fresh features and embedding updates, and the online evaluation loop (bottom) measures the served slates. Each stage is labeled with its approximate scale (from Code 38.1.1) and the axis of distribution it loads most heavily.

The diagram fixes the vocabulary for the chapter. The terabyte-scale embedding tables at the top are the model that cannot live on one machine and that every other stage reads from. The middle row is the per-request serving funnel that runs under the latency SLO. The lower band is the streaming feature and freshness layer that keeps the model current; the bottom strip is the online evaluation that keeps the whole thing honest. The numbers on each box come from the same back-of-envelope model, which we now run with real arithmetic so that no estimate in this chapter is a guess.

import math

# --- Catalog and user scale (the five requirements) ---
N_items   = 2_000_000_000    # catalog size (billions of items)
N_users   = 400_000_000      # active users (hundreds of millions)
N_feat    = 4_000_000_000    # extra categorical-feature ids (ad, query, context)
dim       = 256              # embedding dimension d
bytes_id  = 4                # fp32 per dimension, b
qps       = 300_000          # peak queries per second, lambda
latency   = 0.030            # end-to-end SLO budget, seconds (tens of ms)
mem_node  = 64               # GB of fast memory a node devotes to embeddings

# --- Embedding table size: rows R = items + users + features ---
R       = N_items + N_users + N_feat
S_emb   = R * dim * bytes_id                    # table size in bytes
emb_tb  = S_emb / 1e12
shards  = math.ceil(S_emb / (mem_node * 1e9))   # K_shard parameter-server shards

# --- Candidate funnel: catalog -> retrieved -> ranked -> shown ---
funnel  = [N_items, 10_000, 500, 20]
rho     = [funnel[j] / funnel[j + 1] for j in range(len(funnel) - 1)]

# --- Concurrency from Little's law: C = lambda * L ---
C = qps * latency

# --- Cost per 1000 requests at an assumed serving-fleet rate ---
fleet_nodes   = 1200
node_per_hour = 3.50                              # USD per node-hour
cost_per_1k   = (fleet_nodes * node_per_hour) / (qps * 3600) * 1000

print(f"embedding rows R (all tables): {R:,}")
print(f"embedding table S_emb        : {emb_tb:,.1f} TB")
print(f"parameter-server shards      : {shards:,}")
print(f"candidate funnel             : {' -> '.join(f'{x:,}' for x in funnel)}")
print(f"per-stage reduction factors  : {' , '.join(f'{r:,.0f}x' for r in rho)}")
print(f"queries in flight C=QPS*L    : {C:,.0f}")
print(f"cost per 1000 requests       : ${cost_per_1k:.3f}")

Code 38.1.1: The five-requirement back-of-envelope model for the running specification. Each printed quantity corresponds to one formula in this section: the row count $R$, the table size $S_{\text{emb}}$, the shard count $K_{\text{shard}}$, the funnel sizes and their reduction factors $\rho$, the Little's-law concurrency $C$, and a cost-per-thousand-requests estimate.

embedding rows R (all tables): 6,400,000,000
embedding table S_emb        : 6.6 TB
parameter-server shards      : 103
candidate funnel             : 2,000,000,000 -> 10,000 -> 500 -> 20
per-stage reduction factors  : 200,000x , 20x , 25x
queries in flight C=QPS*L    : 9,000
cost per 1000 requests       : $0.004

Output 38.1.1: Real output. The embedding tables alone are $6.6$ terabytes and demand on the order of one hundred shards; the serving funnel must shrink the catalog by a factor of two hundred thousand at retrieval before ranking even begins; and the query path keeps roughly nine thousand requests in flight at once. Every number here is a wall the single machine hits, and each maps to a different axis of distribution.

The output settles the question. The embedding tables are $6.6$ terabytes, which alone exceeds the memory of any single accelerator by one to two orders of magnitude and forces the table across roughly one hundred shards before training or serving can begin. The serving funnel must reduce two billion candidates to twenty, a total compression of a hundred million to one, with the first and largest cut, a factor of two hundred thousand, taken at retrieval rather than by brute-force scoring. The query path keeps thousands of requests in flight simultaneously. None of these is a marginal overshoot that a faster chip would absorb; each is one to several orders of magnitude beyond a single node, and each lands on a different stage of the pipeline. That separation is exactly what makes the six-axis decomposition the right tool.

Thesis Thread: The Model Itself Is Too Big for One Machine

Chapter 1 named three pressures that force distribution: data, model, and throughput. Most case studies in this part lean on data and throughput. The recommender is the book's cleanest instance of the second pressure, the one where the model overflows. Its $6.6$-terabyte embedding table cannot be replicated onto a serving node the way a few-hundred-megabyte network can; it must be sharded across machines that the serving path queries over the network, exactly the sharded-parameter pattern introduced in Chapter 11 and generalized to model shards in Chapter 16. When you reach the capstone in Chapter 41 and must defend a distribution axis, this is the worked example for "the parameters do not fit": scale-out is not an optimization on the recommender, it is the only form in which the recommender can exist.

3. Decomposing the Pipeline Onto the Six Axes Intermediate

The six axes of distribution from Section 1.1, distribute data, distribute training, distribute the model, distribute inference, coordinate the cluster, and distribute intelligence, give us a map onto which every stage of Figure 38.1.1 places cleanly. Mapping the stages is the planning act of the whole chapter: it tells us which earlier part of the book owns each stage and which later section of this chapter develops it. Table 38.1.1 is that map, and it doubles as the chapter's table of contents.

Table 38.1.1: Each pipeline stage assigned to the axis of distribution it loads most heavily, the earlier chapter that owns the underlying machinery, and the later section of this chapter that builds it.

Pipeline stage	Primary axis	Owning earlier chapter	Built in this chapter
Learn the embeddings	Distribute the model	Ch 11 (parameter servers, sharded embeddings)	Section 38.2
Candidate generation / retrieval	Coordinate the cluster	Ch 25 (ANN, vector search)	Section 38.3
Ranking	Distribute inference	Ch 23, Ch 24	Section 38.4
Feature store	Distribute data	Ch 8 (storage and loading)	Section 38.5
Real-time personalization	Distribute data	Ch 9 (stream processing)	Section 38.6
Online evaluation	Distribute intelligence	Ch 5 (evaluation)	Section 38.7
End-to-end architecture	Coordinate the cluster	Ch 23	Section 38.8
Capstone build	All axes	Ch 41	Section 38.9

Reading the table top to bottom traces the dataflow of Figure 38.1.1, and reading the third column traces the path back through the book. Learning the embeddings is the act of distributing a model too large for one node: the tables are sharded across parameter servers that workers push gradients to and pull current vectors from, the push-pull pattern of Chapter 11, here scaled to billions of rows. Candidate generation is approximate nearest-neighbor retrieval over a sharded vector index, the scatter-gather across shards that Chapter 25 owns, and its tight latency makes it primarily a cluster-coordination problem. Ranking is distributed inference: a deep scoring model replicated across the fleet and run on the few hundred candidates the funnel let through, drawing on the inference-systems and serving economics of Chapter 23 and Chapter 24. The feature store is distributed data, the storage and loading machinery of Chapter 8 specialized to serve features at both training and request time. Real-time personalization is stream processing, the online-AI material of Chapter 9 that turns a click seconds old into a fresh feature. Evaluation closes the loop with the methodology of Chapter 5. No stage is new machinery; the contribution of this chapter is the assembly and the seams.

Practical Example: The Recommender That Outgrew Its Server's Memory

Who: A machine learning engineer at a growing video platform owning the home-feed model.

Situation: The recommender ran as a single service: one process loaded the whole embedding table into a high-memory server, scored a few million candidate videos per request, and returned a ranked feed.

Problem: The catalog crossed a hundred million items, the embedding table outgrew the largest single server's memory, and per-request scoring of the full catalog pushed tail latency past the budget.

Dilemma: Rent an ever-larger memory-heavy machine to hold the table and keep brute-force scoring, a scale-up move that postponed the wall by one order of magnitude and left throughput capped at one box, or re-architect into sharded embeddings with a retrieve-then-rank funnel, the structure of Figure 38.1.1.

Decision: They re-architected, sharding the embedding table across parameter servers, replacing full-catalog scoring with approximate-nearest-neighbor retrieval, and replicating a smaller ranking model behind the retrieval stage.

How: They ran the back-of-envelope model of Code 38.1.1 on their own catalog size and QPS first, which told them the table needed dozens of shards and the funnel had to cut the catalog by five orders of magnitude before ranking, and they sized the cluster from those numbers rather than from guesses.

Result: The feed met its latency SLO at the larger catalog, and because the embedding table was sharded and the funnel replaced the full scan, growing the catalog further meant adding shards and retrieval replicas without touching the ranking code.

Lesson: When the embedding table, not the request volume, is the first ceiling, no single machine is the answer. Sharding the model and replacing the full scan with a funnel is the move, and the arithmetic of Section 2 tells you how far to take each.

4. The Training Path Versus the Serving Path Intermediate

One structural decision organizes the rest of the chapter, and it is visible in Figure 38.1.1 as the difference between the streaming layer that updates the tables and the serving funnel that reads them. The training path and the serving path have opposite cost profiles and must be engineered against opposite metrics. The training path is throughput-bound and latency-tolerant: it streams interaction events into the sharded embedding tables and the ranking model, it is measured in events processed per second and in how quickly a signal propagates, and a gradient applied a few seconds late costs almost nothing. The serving path is latency-bound and throughput-replicated: each request must finish inside the tens-of-milliseconds budget, it is measured in tail latency and in how many concurrent requests the fleet sustains, and being even a hundred milliseconds slow costs an impression. Conflating these two paths is a common design error, because the techniques that make training cheap, large asynchronous batches and relaxed deadlines, are exactly the techniques that ruin the serving path.

The two paths meet at two shared artifacts: the embedding tables and the feature store. The training path writes them; the serving path reads them. Freshness is precisely the rate at which updates flow from the streaming layer into the tables and features that the serving path is already querying, which is why neither can be a static snapshot but must be live, incrementally updated structures, a point Section 38.5 and Section 38.6 take up in detail. The concurrency the serving path must hold, the $C = \lambda L$ from Output 38.1.1, is the budget that Section 38.4 spends on replicated ranking. Keeping the two paths logically separate while letting them share exactly those two artifacts is the architectural spine of the chapter, and Section 38.8 draws the full picture.

Library Shortcut: The Laptop Recommender That Hides Every Distributed Problem

It is worth seeing the system this chapter does not study, because it shows exactly which problems scale removes the luxury of ignoring. A few lines of a modern library give a working recommender on a small catalog:

# pip install implicit scipy
import implicit
from scipy.sparse import csr_matrix

# user x item interaction matrix, small enough to live in RAM
interactions = csr_matrix(load_clicks())          # one machine, one process
model = implicit.als.AlternatingLeastSquares(factors=128)
model.fit(interactions)                            # learns ALL embeddings in memory
recs = model.recommend(user_id, interactions[user_id], N=20)  # scores every item

Code 38.1.2: A complete single-machine recommender in a handful of lines. It silently assumes the entire embedding table fits in RAM, training touches every interaction at once, and scoring can scan the whole catalog per request. Each of those assumptions is one of the requirements from Section 1, and each fails at web scale; the rest of this chapter is what these lines become when the table is terabytes, the events never stop arriving, and the catalog cannot be scanned per request.

5. What This Chapter Builds, Section by Section Beginner

With the requirements fixed, the scale proven, and the axes mapped, the path through the chapter is set. Section 38.2 learns the sharded embeddings, the terabyte-scale model that the parameter servers of Chapter 11 hold across machines. Section 38.3 builds candidate generation, the approximate-nearest-neighbor retrieval that takes the $200{,}000\times$ first cut of the funnel from Output 38.1.1. Section 38.4 ranks the survivors with a replicated deep scorer, spending the concurrency budget $C$ on distributed inference. Section 38.5 builds the feature store that both paths read and write. Section 38.6 adds real-time personalization, turning the click stream into fresh signals within seconds. Section 38.7 closes the loop with online evaluation, A/B testing, interleaving, and counterfactual estimates that judge served slates. Section 38.8 assembles the end-to-end architecture, and Section 38.9 turns it into a buildable project. Each section opens with the slice of Figure 38.1.1 it owns and the requirement from Section 1 it must satisfy.

Research Frontier: Where Web-Scale Recommendation Is Moving (2024 to 2026)

The boundary of this problem is shifting on every axis at once. On the model side, generative and sequence-based recommenders such as Meta's HSTU and the wider "generative recommenders" line (Zhai et al., 2024) recast ranking as next-token prediction over interaction histories, trading hand-built features for scale and pushing embedding tables larger still, which only sharpens the distribute-the-model pressure of Section 2. Semantic-id and learned-tokenizer approaches in the TIGER lineage replace billion-row id tables with compact generative codes, attacking the $S_{\text{emb}}$ ceiling directly. On the retrieval side, learned and disk-resident approximate-nearest-neighbor indices in the DiskANN and SPANN lineage pack billions of vectors per node and shrink the shard count $K_{\text{shard}}$ that Section 2 computed. On the serving side, the funnel itself is under revision as single-stage retrieve-and-rank models blur the boundary between candidate generation and ranking. The constant across all of it is the five-requirement tension of Section 1: every advance is judged by whether it improves one of scale, latency, throughput, freshness, or cost without surrendering the other four.

Fun Note: A Hundred Million to One, Twenty Times a Second in Your Pocket

Output 38.1.1 says the funnel compresses two billion candidates to twenty, a hundred-million-to-one reduction, and does it inside a budget of tens of milliseconds. Per item shown, the system has considered roughly a hundred million it chose not to show, and it repeats that feat hundreds of thousands of times a second. The screen you scroll betrays none of it: a calm little row of suggestions, behind which a hundred machines just held a very fast, very large election and agreed on twenty winners before your thumb finished moving.

Exercise 38.1.1: Read the Requirements, Name the Ceilings Conceptual

For each of the five requirements in Section 1 (catalog and user scale, latency, throughput, freshness, cost), name the single pipeline stage in Figure 38.1.1 that it pressures hardest, the axis of distribution from Table 38.1.1 that stage sits on, and one concrete failure that occurs if that stage is left on a single machine. Then identify which two of the five requirements are in the most direct tension with each other and explain why satisfying one cheaply makes the other harder. Contrast your answer with the RAG case study of Chapter 36: which requirement binds first there, and why does the recommender's first ceiling differ?

Exercise 38.1.2: Resize the System Coding

Modify Code 38.1.1 to model a catalog five times larger ($N_{\text{items}} = 10^{10}$) and an embedding dimension of $512$ instead of $256$. Report the new table size $S_{\text{emb}}$ and shard count $K_{\text{shard}}$, then compute how many shards are saved by storing the table in $b = 1$ byte (int8) instead of $4$ bytes. Next, add a freshness model: assume the click stream arrives at $500{,}000$ events per second and each event updates a handful of embedding rows; estimate the write throughput in gigabytes per second that the sharded tables must absorb, and discuss whether that load is spread evenly across the $K_{\text{shard}}$ shards or concentrated on the most popular items.

Exercise 38.1.3: The Concurrency Budget Across the Funnel Analysis

Using Little's law $C = \lambda L$ from Section 2, suppose the serving path must sustain $\lambda = 250{,}000$ requests per second at a target $p50$ latency of $0.025$ seconds, of which retrieval consumes $0.010$ seconds and ranking consumes the rest. Estimate the number of concurrent requests in each of the two stages separately, and argue from those two numbers which stage dominates the fleet sizing. Then discuss how the answer changes if a long tail makes the $p99$ latency four times the $p50$, and connect your reasoning to the replicated-inference treatment forthcoming in Section 38.4 and the serving economics of Chapter 24.