Section 38.4: Distributed Ranking Models

"They hand me a thousand candidates and a deadline measured in milliseconds. I score every one of them precisely, in a single breath, before the user finishes blinking."
A Ranking Model, Scoring a Thousand Candidates Before the User Blinks

Big Picture

Ranking is where a recommender spends its real compute: a deep model runs one forward pass per candidate per request, so the few thousand candidates that survived retrieval each get scored precisely instead of approximately. That single design choice forces two distributed problems at once. The model must be trained across many workers on click logs too large for one machine, with embedding tables too large for one accelerator, which makes the ranker a hybrid of the data parallelism of Chapter 15 and the sharded embeddings of Chapter 11. And it must be served under a strict latency budget, scoring an entire candidate set in one batched forward across a serving fleet, which is the distributed inference of Chapter 23. This section builds the ranking stage end to end, from its loss to its FLOP budget to a runnable scorer, and shows why both training and serving are distributed for the same reason: the work no longer fits on one box.

The retrieval stage of Section 38.3 did its job by being cheap: it reduced a catalog of millions to a candidate set of a few thousand using approximate nearest-neighbor lookups over sharded embedding tables, spending almost nothing per item. Ranking now does the opposite. It takes those few thousand survivors and scores each one with a heavy deep model that reads dense features, sparse categorical features, and the user's interaction history, producing a calibrated probability that the user will engage. The retrieval stage could afford to be approximate because it only had to avoid discarding good items; the ranking stage cannot, because its scores decide the exact order the user sees. Precision here is bought with compute, and compute at this volume is distributed compute.

This division of labor, a cheap recall-oriented retriever feeding a precise compute-heavy ranker, is the spine of every large-scale recommender, and it is why the two stages have such different distributed footprints. Retrieval distributes an index; ranking distributes a model and its training. We spend this section on the second half, because the ranking model is where the recommender's accuracy is won and where most of its serving cost is paid.

1. Why Ranking Is Compute-Heavy Beginner

The defining cost of ranking is structural: the model evaluates one forward pass per candidate per request. If retrieval delivers $C$ candidates and the ranking network costs $F$ floating-point operations for a single scored item, then the per-request ranking cost is

$$\text{FLOPs}_{\text{request}} = C \cdot F,$$

and the fleet must sustain this for every request that arrives. With $C$ in the low thousands and $F$ in the tens of millions for a realistic deep ranker, a single request already costs tens of billions of operations, and a feed serving tens of thousands of requests per second multiplies that into the petaFLOP-per-second regime. No single accelerator delivers that throughput under a latency budget, so the scoring work is replicated across a serving fleet exactly as the throughput pressure of Chapter 1 predicts. Ranking is not compute-heavy by accident; it is compute-heavy by contract, because precision per candidate is the product it sells.

This is the cleanest contrast with retrieval in the whole pipeline. Retrieval's cost grows with the catalog but is sublinear in it, because an approximate index touches only a tiny fraction of items per query. Ranking's cost grows linearly with the candidate set and with the model's depth, and there is no approximation to hide behind, because every candidate must be scored to be ordered. The lever a system designer pulls is therefore the product $C \cdot F$: shrink the candidate set, shrink the model, or buy more machines. Most production systems tune all three at once.

Key Insight: Ranking Cost Is the Product of Candidate Count and Model Size

A recommender's serving bill is dominated by $C \cdot F$, the candidate count times the per-item model cost, because ranking runs a full forward pass for every candidate on every request. Retrieval is cheap because it is sublinear in the catalog; ranking is expensive because it is linear in the candidates and grows with model depth. Every serving-cost decision (how many candidates to admit, how deep a ranker to deploy, how large a fleet to provision) is a negotiation over this one product, and the latency SLO is the constraint that keeps it honest.

2. The Anatomy of a Deep Ranking Model Intermediate

Deep rankers have converged on a recognizable shape. The wide-and-deep design pairs a linear "wide" component that memorizes feature crosses with a deep component that generalizes; the DLRM family (Deep Learning Recommendation Model) makes the split explicit by routing sparse categorical features through embedding tables and dense numerical features through a multilayer perceptron, then combining them through an explicit feature-interaction layer before a final scoring head. Deep and Cross Networks (DCN) add a cross layer that learns bounded-degree feature interactions efficiently, and attention layers let the model weigh a user's interaction history so that a recent purchase counts more than a stale one. What unites these designs is the two-track structure: an embedding track for high-cardinality sparse features and a dense track for continuous features, fused and scored.

That two-track structure is exactly what makes the ranker a hybrid distributed model, and it is why the architecture diagram and the parallelism diagram are the same picture. The embedding tables hold one learned vector per category value, and with billions of category values (user identifiers, item identifiers, fine-grained context) these tables reach hundreds of gigabytes, far past the memory of one accelerator, so they are sharded across machines as in Chapter 11 and as the retrieval embeddings of Section 38.2 already were. The dense MLP, by contrast, is small enough to replicate on every worker. Figure 38.4.1 shows the resulting DLRM-style flow and labels which part is model-parallel and which is data-parallel.

Figure 38.4.1: A DLRM-style ranking model and its hybrid parallelism. Sparse categorical features (left) index into embedding tables sharded across machines (model-parallel, as in Chapter 11); dense numerical features (right) flow through a small multilayer perceptron replicated on every worker (data-parallel, as in Chapter 15). The two tracks meet at a feature-interaction layer, and a top MLP emits one click probability per candidate. The same picture describes both how the model is built and how it is split across machines.

3. Training the Ranker: Hybrid Parallelism Over Click Logs Intermediate

The ranker learns from click logs: a stream of impressions, each labeled by whether the user engaged. The natural objective is the logistic (binary cross-entropy) loss over $N$ logged impressions, where $\hat{p}_i = \sigma(z_i)$ is the model's predicted click probability and $y_i \in \{0,1\}$ the observed label,

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i)\,\Big], \qquad \hat{p}_i = \frac{1}{1 + e^{-z_i}}.$$

This is an average over examples, and as the gradient identity of Section 1.1 established, an average decomposes exactly across workers. That is what licenses data-parallel training of the dense part of the ranker: shard the click logs across $K$ workers, let each compute the gradient of the loss on its shard, and all-reduce the gradients so every replica takes the same step. The optimization machinery for this synchronization, including the staleness and communication trade-offs, is the subject of Chapter 10.

The embedding tables complicate the clean picture, and this is the heart of why a recommender ranker is a hybrid. The dense MLP is replicated and updated by all-reduce, but the embedding tables are too large to replicate, so they are sharded and updated where they live. A single training step therefore mixes two collectives: an all-to-all that routes each worker's batch of sparse lookups to the shard that holds them and routes the resulting vectors back (the embedding exchange), and an all-reduce that synchronizes the replicated dense gradients (the parameter exchange). This combination, model-parallel embeddings plus data-parallel dense layers, is precisely the DLRM training pattern, and it draws on the sharded-parameter machinery of Chapter 16 for the embedding half and the gradient all-reduce of Chapter 15 for the dense half.

Thesis Thread: One Model, Two Axes of Distribution at Once

The ranker is the book's clearest example of a single model distributed along two axes simultaneously. Its embedding tables live on the model-parallel axis, sharded because they exceed one device's memory; its dense layers live on the data-parallel axis, replicated and synchronized by all-reduce. Neither axis alone suffices, and the two collectives (all-to-all for embeddings, all-reduce for dense gradients) run in the same training step. When Chapter 1 promised that real systems live where several axes of distribution meet, the deep ranker is the promise made concrete.

Library Shortcut: TorchRec Shards the Embedding Tables for You

Writing the embedding all-to-all and the dense all-reduce by hand is error-prone. TorchRec, PyTorch's recommendation library, expresses a DLRM-style ranker and shards its embedding tables across devices automatically, choosing per-table sharding (table-wise, row-wise, or column-wise) from a cost model, while ordinary DistributedDataParallel handles the dense layers:

# pip install torchrec ; launch with torchrun --nproc_per_node=K
import torch
from torchrec import EmbeddingBagCollection
from torchrec.distributed import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner

ebc = EmbeddingBagCollection(tables=ranking_tables, device=torch.device("meta"))
plan = EmbeddingShardingPlanner().plan(ebc, [sharder])   # cost-based shard choice
model = DistributedModelParallel(module=ranking_model, plan=plan)  # embeddings sharded,
                                                                   # dense replicated

Code 38.4.1: A DLRM-style ranker sharded with TorchRec. The planner picks a sharding scheme per embedding table and DistributedModelParallel wires up the all-to-all for sparse features automatically; the hundreds of lines of manual collective scheduling collapse to a planner call and a wrapper, and the library owns the all-to-all and all-reduce overlap.

4. Serving: Score the Whole Candidate Set in One Batched Pass Intermediate

At serving time the ranker faces the latency SLO. A request arrives with its $C$ candidates, and the model must score all of them and return an order within a budget often well under a hundred milliseconds. The decisive optimization is batching: rather than score candidates one at a time, the serving node stacks the entire candidate set into a single tensor and runs one forward pass, turning $C$ small matrix-vector products into one large matrix-matrix product that saturates the accelerator. Because every candidate for a given request shares the same user features, this batch is natural and tight, and it is the per-request analogue of the throughput-oriented batching that Chapter 23 develops for inference fleets and that Chapter 24 pushes to its limit for LLM serving.

The fleet dimension is the same throughput story as everywhere else in the book. One serving node sustains some number of requests per second within the SLO; the offered load exceeds that; therefore the ranker is replicated across a fleet behind a load balancer, each replica holding the dense model and reaching a sharded embedding service for the sparse lookups. The embedding lookup is itself a distributed call, which is why serving latency couples the ranker to the same sharded tables that Section 38.2 introduced and that Chapter 11 built. The code below trains a tiny logistic CTR ranker on synthetic click data, measures held-out ranking quality, and then performs the key serving move: scoring an entire candidate set in one batched forward pass.

import numpy as np

rng = np.random.default_rng(7)

# ---- Synthetic click logs -------------------------------------------------
# Each impression has d dense features; a hidden linear scorer plus noise
# decides whether the user clicked (label 1) or not (label 0). The base CTR
# is deliberately low, as it is in production feeds.
N, d = 40_000, 16
X = rng.standard_normal((N, d))
w_star = rng.standard_normal(d) * 0.6
bias_star = -2.2                                  # pushes the base rate down
logits = X @ w_star + bias_star
p = 1.0 / (1.0 + np.exp(-logits))
y = (rng.random(N) < p).astype(np.float64)

# Train / held-out split.
n_tr = 32_000
Xtr, ytr = X[:n_tr], y[:n_tr]
Xte, yte = X[n_tr:], y[n_tr:]

# ---- Tiny logistic CTR ranker, trained by mini-batch SGD ------------------
w = np.zeros(d)
b = 0.0
lr, epochs, bs = 0.3, 8, 256
for _ in range(epochs):
    perm = rng.permutation(n_tr)
    for i in range(0, n_tr, bs):
        idx = perm[i:i + bs]
        xb, yb = Xtr[idx], ytr[idx]
        pb = 1.0 / (1.0 + np.exp(-(xb @ w + b)))   # forward pass
        g = pb - yb                                # logistic gradient
        w -= lr * (xb.T @ g) / len(idx)
        b -= lr * g.mean()

def predict(Xm):
    return 1.0 / (1.0 + np.exp(-(Xm @ w + b)))

# ---- Held-out ranking metrics: AUC and normalized entropy -----------------
def auc(scores, labels):
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def normalized_entropy(scores, labels):
    eps = 1e-12
    scores = np.clip(scores, eps, 1 - eps)
    ll = -(labels * np.log(scores) + (1 - labels) * np.log(1 - scores)).mean()
    base = labels.mean()
    base_ll = -(base * np.log(base) + (1 - base) * np.log(1 - base))
    return ll / base_ll                            # < 1 means better than base rate

pte = predict(Xte)
print("held-out base CTR    :", f"{yte.mean():.4f}")
print("held-out AUC         :", f"{auc(pte, yte):.4f}")
print("normalized entropy   :", f"{normalized_entropy(pte, yte):.4f}")

# ---- Batched scoring: rank one user's candidate set in ONE forward pass ----
n_cand = 2000
Xc = rng.standard_normal((n_cand, d))
scores = predict(Xc)                               # single matmul over all candidates
top = np.argsort(-scores)[:5]
print("candidates scored    :", n_cand)
print("top-5 candidate ids  :", top.tolist())
print("top-5 scores         :", [f"{scores[j]:.3f}" for j in top])

Code 38.4.2: A logistic CTR ranker trained by mini-batch SGD on synthetic click logs, evaluated by AUC and normalized entropy on a held-out set, then used to score a 2000-candidate set in one batched forward pass. The predict call on Xc is a single matrix multiply over the whole candidate set, the serving move that turns $C$ separate scorings into one accelerator-friendly batch.

held-out base CTR    : 0.2305
held-out AUC         : 0.8942
normalized entropy   : 0.6114
candidates scored    : 2000
top-5 candidate ids  : [1664, 961, 213, 218, 1209]
top-5 scores         : ['0.999', '0.994', '0.990', '0.990', '0.989']

Output 38.4.2: The trained ranker reaches an AUC of $0.894$ on the held-out logs, well above the $0.5$ of random ordering, and a normalized entropy of $0.611$, comfortably below $1.0$, meaning it predicts clicks far better than always guessing the base rate. The final lines show the serving move: 2000 candidates scored in one batched pass, with the top five surfaced in rank order.

5. Measuring a Ranker: AUC and Normalized Entropy Intermediate

A ranker is judged on two distinct questions, and the metrics of Chapter 5 give one for each. The first is ordering quality: does the model place items the user will click above items the user will not? Area under the ROC curve (AUC) answers this directly, as the probability that a randomly chosen clicked item is scored above a randomly chosen non-clicked item,

$$\text{AUC} = \Pr\big(\hat{p}(x^{+}) > \hat{p}(x^{-})\big),$$

so $0.5$ is random ordering and $1.0$ is perfect ranking. AUC is the right top-line metric for ranking precisely because it ignores the absolute scale of the scores and cares only about their order, which is what determines the feed the user sees. The held-out AUC of $0.894$ in Output 38.4.2 confirms the tiny ranker has learned a useful order.

The second question is calibration: are the predicted probabilities themselves trustworthy, so that a score of $0.1$ really means a one-in-ten chance of a click? This matters because downstream decisions (ad pricing, blending multiple objectives, capping low-value impressions) consume the probabilities, not just the order. Normalized entropy (NE), the standard recommender calibration metric, divides the model's average logistic loss by the loss of a constant predictor that always outputs the empirical base CTR $\bar{p}$,

$$\text{NE} = \frac{-\frac{1}{N}\sum_{i}\big[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\big]}{-\big[\bar{p}\log\bar{p} + (1-\bar{p})\log(1-\bar{p})\big]},$$

so $\text{NE} = 1$ means the model is no better than predicting the base rate and $\text{NE} < 1$ means it adds genuine predictive value. The normalized entropy of $0.611$ in Output 38.4.2 says the ranker cuts the base-rate loss by nearly forty percent. AUC and NE are complementary: a model can order well yet be poorly calibrated, and a deployed ranker is expected to win on both.

6. Multi-Objective Ranking Advanced

Real feeds do not optimize click-through rate alone. A click is a weak signal of value: it can be earned by clickbait, and maximizing it can hollow out the experience. Production rankers therefore predict several objectives at once (the probability of a click, the expected dwell time, the chance of a meaningful interaction such as a share or purchase) and a diversity term that discourages a monotonous feed of near-identical items. A common formulation scores each candidate by a weighted combination of the per-objective predictions,

$$\text{score}(x) = \sum_{j} \alpha_j \cdot \hat{p}_j(x) + \beta \cdot \text{diversity}(x),$$

where the $\alpha_j$ weight the engagement objectives and $\beta$ rewards adding variety to the slate already assembled. Architecturally this is cheap to add, because the heads share the expensive embedding and interaction layers and only the small top of the network forks per objective, so the dominant $C \cdot F$ cost is paid once and amortized across all objectives. The distributed footprint is unchanged: the same sharded embeddings, the same data-parallel dense layers, the same batched serving pass, now emitting a vector of predictions per candidate instead of a scalar.

Practical Example: The Feed That Optimized Clicks and Lost Its Users

Who: A ranking team at a content-feed company.

Situation: Their single-objective CTR ranker had been tuned for months and click-through rate was at an all-time high.

Problem: Daily active users and average session length were quietly declining even as clicks rose; the model had learned to surface high-CTR clickbait that users regretted opening.

Dilemma: Keep the clean, well-calibrated single objective that the whole pipeline was built around, or add dwell-time and diversity objectives that complicate training, evaluation, and the weighting policy.

Decision: They moved to a multi-objective ranker, adding a dwell-time head and a slate-diversity term while keeping the CTR head as one signal among several.

How: The shared embedding and interaction layers stayed; they forked a small per-objective top MLP for each head and tuned the blend weights $\alpha_j$ and $\beta$ on held-out engagement, leaving the sharded-embedding plus data-parallel-dense training and the batched serving pass untouched.

Result: Raw click-through rate dipped slightly, but dwell time, session length, and seven-day retention all rose, and the diversity term ended the runs of near-duplicate items.

Lesson: A ranker optimizes what you measure. Clicks are a proxy; when the proxy and the goal diverge, add the objectives that capture the goal, and note that multi-objective ranking costs almost nothing extra to serve because the heavy layers are shared.

Research Frontier: Generative and Sequence-Native Ranking (2024 to 2026)

The pointwise score-each-candidate ranker is being challenged by sequence-native and generative designs. Meta's HSTU and the broader "generative recommenders" line (Zhai et al., 2024) recast ranking and retrieval as autoregressive sequence modeling over a user's interaction history, reporting large gains and, notably, compute that scales with model size in the way that has driven progress in language models, which reframes the recommender as a scaling-law system rather than a feature-engineering one. In parallel, semantic-ID and generative-retrieval methods (the TIGER lineage) blur the retrieval and ranking stages by decoding item identifiers directly. These designs intensify the distributed problem this section described: the models are larger, the embedding and sequence tables are heavier, and the training is more compute-bound, which pushes recommenders further onto the hybrid model-parallel and data-parallel infrastructure of Chapter 16 rather than away from it.

Fun Note: A Forward Pass Per Candidate, Every Time

It is worth pausing on the arithmetic of indifference. The ranker runs its full deep model on the item you will scroll past without a glance, with exactly the same effort it spends on the one you will tap. It cannot know which is which until it has scored both, so it scores everything precisely and then throws almost all of that work away. The candidate that placed sixth got the same careful forward pass as the candidate that placed first; ranking is the art of doing a great deal of compute to discover what to ignore.

The ranking stage hands its ordered slate back to the pipeline, where the next section assembles the final feed under business rules and serving constraints. We have seen the ranker as the recommender's compute center: a deep model trained with hybrid model-parallel and data-parallel collectives over click logs, and served as one batched forward pass per request under a latency SLO. The thread that ties it to the rest of the book is that none of this distribution was chosen for elegance; each piece was forced by a ceiling, the embedding tables by memory, the training by data volume, the serving by throughput, exactly as Chapter 1 promised every distributed AI system would be.

Exercise 38.4.1: Where the Ranking FLOPs Go Conceptual

A feed admits $C = 2000$ candidates per request, the ranking network costs $F = 4 \times 10^{7}$ FLOPs per scored candidate, and the service handles $20{,}000$ requests per second. Using $\text{FLOPs}_{\text{request}} = C \cdot F$, compute the per-request and aggregate per-second FLOP load. Then explain, in terms of the product $C \cdot F$, two qualitatively different ways to halve the serving cost and one reason a team might prefer shrinking $C$ over shrinking $F$ even though both cut the bill equally.

Exercise 38.4.2: AUC Up, Calibration Down Coding

Starting from Code 38.4.2, multiply the trained weight vector and bias by a constant factor of $3$ before predicting on the held-out set (a monotone rescaling of the logits). Recompute AUC and normalized entropy. Explain why AUC is unchanged while normalized entropy worsens, and connect this to Section 5's claim that ordering quality and calibration are distinct properties a deployed ranker must satisfy together.

Exercise 38.4.3: Which Collective for Which Parameter Analysis

A DLRM-style ranker has $300$ GB of embedding tables and a $200$ MB dense MLP, trained on $K = 32$ workers. State which part is updated by all-to-all and which by all-reduce, and why replicating the embeddings (as the dense layers are replicated) is infeasible while sharding the dense MLP would be pointless. Estimate, to an order of magnitude, the per-step bytes moved by the dense all-reduce, and argue whether the embedding exchange or the dense synchronization is more likely to dominate communication time. Tie your reasoning to the hybrid parallelism of Chapter 16.