Section 38.8: System Architecture | Building Scalable AI

"A request, touching twelve services and returning before you notice. I carry no opinions of my own; I just make sure the slow one never gets to speak last."
A Latency Budget That Has Seen Every Stage Run Late

Big Picture

A production recommender is not a model; it is a distributed system in which every piece built earlier in this chapter becomes a service, and those services are wired into a single request path that must return ranked items within a fixed latency budget no matter how much of the system is, at this very moment, partly broken. The embedding tables of Section 38.2, the candidate generation of Section 38.3, the ranking model of Section 38.4, the feature store of Section 38.5, and the real-time personalization of Section 38.6 do not run as one program. They run as separate fleets, on separate machines, scaled independently, talking over the network, and the architecture is the discipline of making that crowd of fleets answer one user in tens of milliseconds. This section assembles the pieces, hands each stage a slice of the service-level objective (SLO), sizes the fleets from the request rate, and shows what happens when one stage spends more than its allotted slice.

Every prior section of this chapter solved one problem in isolation: how to shard a billion-row embedding table, how to retrieve a few hundred candidates from a billion items, how to fetch features without a per-request database storm, how to score candidates with a deep model, how to apply diversity and policy rules to the final list. Each was correct on its own bench. The architecture is what happens when you must run all of them, for one request, at peak traffic, while a network link is flapping and a ranking replica is restarting. The unifying constraint is a number the product team gives you and will not negotiate: the recommendation must arrive within, say, 120 milliseconds at the 99th percentile, because beyond that the slot on the page is filled with a fallback and the user has already scrolled. Everything below is organized around spending that 120 milliseconds wisely across machines that each want more of it than they can have.

Figure 38.8.1: The full distributed recommender. The online serving loop (top, blue) carries one request left to right through gateway, retrieval (which itself queries the ANN index and the embedding parameter-server fleet of Section 38.2), feature fetch, ranking inference, and re-ranking, returning a ranked list within the SLO. The offline training loop (bottom, orange dashed) consumes the logs that serving emits, runs streaming feature pipelines and the training fleet, validates and canaries new models in the registry, and publishes fresh artifacts back up into the live services. The two loops run at completely different rates: the serving loop closes in milliseconds, the training loop in hours.

1. The Request Path: One User, Five Fleets Beginner

Follow a single recommendation request through Figure 38.8.1. It arrives at the gateway, which authenticates the user, attaches the request context (device, locale, session), and routes to a recommender endpoint. The gateway calls the retrieval service, which generates candidates: it looks up the user's embedding from the parameter-server fleet, runs an approximate-nearest-neighbor query against the sharded index, and returns a few hundred candidate item identifiers from a catalog of billions. Those candidates carry only identifiers, so the next stage, the feature fetch, reads the dense and sparse features for the user and for every candidate from the online feature store in one batched call. The enriched candidates go to the ranking service, a fleet of accelerator-backed inference servers that scores each candidate with the deep model from Section 38.4 and returns scores. Finally the re-rank stage applies diversity, freshness, and business-policy rules, trims to the page size, and the gateway returns the list.

Each arrow in that path is a network hop between fleets that scale independently, and the request is only as fast as the slowest stage it must wait for. Retrieval queries the embedding fleet built in Chapter 11 and the ANN index built in Chapter 25; ranking runs on the distributed inference substrate of Chapter 23. The architecture does not invent these; it composes them, and composition is where the latency goes.

Key Insight: The Architecture Is the Composition, Not the Components

Every stage in the request path was a finished, benchmarked artifact before this section. None of them is the recommender. The recommender is the wiring: the order of the calls, the budget each one is allowed to spend, the fan-out that lets retrieval and feature fetch overlap, and the fallback that fires when one of them is late. A system can have a state-of-the-art ranking model and a state-of-the-art index and still miss its SLO entirely because the composition leaks time at every hop. Architecting is budgeting and wiring, and it is a distinct skill from building any single stage.

2. Latency Budgeting: Slicing the SLO Across Stages Intermediate

The serving loop runs in sequence: retrieval must finish before feature fetch can name the candidates, feature fetch before ranking can score them, ranking before re-rank can reorder them. When stages run in series, their latencies add, so the end-to-end latency is the sum of the per-stage latencies,

$$T_{\text{e2e}} \;=\; \sum_{s=1}^{S} T_s \;\le\; T_{\text{SLO}},$$

and the architect's job is to hand each stage a budget $B_s$ whose sum leaves headroom under the SLO,

$$\sum_{s=1}^{S} B_s \;=\; T_{\text{SLO}} - T_{\text{slack}}, \qquad T_s \le B_s \;\;\text{for every } s.$$

The slack term $T_{\text{slack}}$ is not waste; it absorbs the gateway overhead, the network hops between fleets, serialization, and the small variance that every stage carries. Because the product cares about the tail, the budgets are written against a high percentile (p99), not the mean, which ties this section directly to the tail-latency analysis of Section 34.7: a stage whose mean is comfortably inside budget can still blow the SLO if its p99 is three times its mean, which is the normal state of affairs for a stage that fans out to many shards and waits for the slowest. A budget is a promise about a percentile, and the percentile is where fan-out punishes you.

Budgets also bound the optimization you are allowed to do. The ranking model can be made more accurate by scoring more candidates or by using a larger network, but only up to the point where its p99 still fits $B_{\text{rank}}$. This is the lever that connects architecture to model design: the per-node efficiency techniques of Chapter 22 (quantization, batching, KV-cache discipline) exist precisely to buy a fatter model inside a fixed time slice. The budget is the contract; per-node efficiency is how you afford a better model under it.

3. Service Topology and Capacity from the Request Rate Intermediate

Latency sizes the depth of the pipeline; throughput sizes its width. Given a peak request rate of $\lambda$ requests per second and a stage whose service time is $T_s$ seconds, the average number of requests in flight at that stage follows Little's law,

$$L_s \;=\; \lambda \, T_s,$$

and the number of replicas the fleet needs is the request rate divided by what one replica can sustain, with a safety factor for the tail and for failures,

$$R_s \;=\; \left\lceil \frac{\lambda}{\mu_s} \right\rceil, \qquad \mu_s = \text{requests per second one replica serves.}$$

The stages have wildly different $\mu_s$. A gateway replica forwards thousands of requests per second; an ANN shard serves a few thousand; a GPU ranking replica, scoring hundreds of candidates through a deep network, serves only a few hundred. So the ranking fleet is by far the widest and most expensive, and the total cost is the sum over fleets of replica count times unit price,

$$C \;=\; \sum_{s=1}^{S} R_s \, c_s,$$

which is dominated by the accelerator-backed ranking fleet. This is why so much of the architecture's economics live in the ranking stage, and why Chapter 33 scheduling and bin-packing of those GPU replicas matters: a 10% improvement in ranking throughput per replica is a 10% cut in the largest line of the bill. Autoscaling tracks $\lambda$ over the day, adding replicas before the morning peak and retiring them at night, while a cache in front of the heaviest stages (popular users, popular candidate sets) lowers the effective $\lambda$ that reaches them.

Library Shortcut: The Fleets Are Autoscaled Deployments, Not Bespoke Daemons

You do not hand-roll the replica management, health checks, and rolling updates for five fleets. Each stage is a Kubernetes Deployment with a HorizontalPodAutoscaler, and the per-stage target is a custom metric (queue depth or p99 latency, not just CPU). A few lines of manifest replace the hundreds of lines a bespoke supervisor would need:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ranking-fleet }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: ranking }
  minReplicas: 40
  maxReplicas: 200                 # ranking is the widest, most expensive fleet
  metrics:
    - type: Pods                   # scale on p99 latency, not CPU, to defend the SLO
      pods:
        metric: { name: inference_p99_ms }
        target: { type: AverageValue, averageValue: "40" }

Code 38.8.1: One autoscaler manifest per fleet. The control plane (the cluster scheduling of Chapter 33) handles placement, restarts, and rolling model updates; the architect supplies only the SLO-derived target metric and the replica bounds from $R_s = \lceil \lambda / \mu_s \rceil$.

4. Two Loops: Offline Training and Online Serving Intermediate

The orange path in Figure 38.8.1 is a second, much slower loop that the request path never blocks on. Serving emits a log of impressions and clicks; streaming pipelines (the online AI of Chapter 9) turn those events into training features and write the freshest of them straight into the online feature store so that serving sees up-to-date signals within seconds. On a slower cadence, a training fleet retrains the embedding tables and the ranking model, the model registry validates the candidate against held-out metrics and runs a canary against a slice of live traffic, and only then are the new artifacts (the index, the embedding weights, the ranker checkpoint) published into the live services. That publish step is a hot swap: the serving fleets load the new artifact behind a version flag and cut over without dropping requests, which is exactly the model-deployment discipline of Chapter 26 applied to a recommender.

Keeping the two loops separate is what lets the system be both fresh and fast. The online loop is optimized for tail latency and never waits on training; the offline loop is optimized for model quality and throughput and never touches the request path. The only coupling is the artifact handoff and the log stream, and both are asynchronous. When the loops are accidentally coupled, for example when a feature is computed live from a database the training job also hammers, the architecture loses its independence and a training-side spike leaks into serving latency. The separation is a design rule, not an accident.

Thesis Thread: Every Earlier Section, Now a Coordinated Service

This is the chapter's payoff and a clean instance of the book's spine. The embedding parameter servers (Section 38.2, scaled out from Chapter 11), the ANN retrieval fleet (Section 38.3, from Chapter 25), the distributed ranking inference (Section 38.4, from Chapter 23), the feature store (Section 38.5), and the real-time personalization (Section 38.6) do not become one big program. They stay distributed, each on its own fleet, and the architecture coordinates them to answer one user in tens of milliseconds. Scale-out is not a phase the system grows out of; it is the permanent shape of the running system, and the request path is the proof that many machines can behave as one coherent recommender.

5. Reliability: Graceful Degradation Under the SLO Advanced

At the scale of five fleets and hundreds of replicas, something is always partly broken: a ranking replica is restarting, an ANN shard is slow, the feature store is shedding load. The architecture answers the SLO not by assuming everything works but by deciding, per stage, what to return when it does not. The reliability engineering of Chapter 35 gives the vocabulary: timeouts, retries with budget, circuit breakers, and fallbacks. Each stage carries a hard deadline equal to its budget $B_s$; if the stage does not answer by its deadline, the request does not wait, it degrades. If ranking times out, return the candidates in retrieval order (a worse but valid list) rather than nothing. If the feature store times out, score with cached or default features. If retrieval times out, serve a precomputed popular-items list for the user's segment. A degraded answer inside the SLO beats a perfect answer that arrives after the slot is gone.

This fallback ladder is what makes the additive latency model of Section 2 safe in practice: no single slow stage can drag the request past the SLO, because each stage is capped at its budget and a miss converts to a fallback rather than to unbounded waiting. The tail-latency techniques of Section 34.7 (hedged requests, tied requests, sending a duplicate to a second replica and taking the first answer) attack the same problem from the other side, shrinking the tail so the fallback fires rarely. Together, budgets plus fallbacks plus hedging turn a crowd of independently-failing fleets into a service with a latency contract it actually keeps.

6. The Budget in Numbers Intermediate

The runnable demo below makes Sections 2, 3, and 5 concrete. It takes an SLO of 120 ms and a peak rate of 50,000 requests per second, assigns each of four stages a budget $B_s$ and a measured p99 $T_s$, and checks the additive constraint $\sum_s T_s \le T_{\text{SLO}}$ from Section 2. It then sizes each fleet with $R_s = \lceil \lambda / \mu_s \rceil$ and reports the in-flight count $L_s = \lambda T_s$ from Section 3, exposing the ranking fleet as the widest. Finally it degrades the ranking stage from 40 ms to 70 ms (a cold cache, a heavier model) and shows the same additive sum now blows the SLO, which is exactly the condition the fallback ladder of Section 5 must catch.

import math

SLO_MS = 120.0          # end-to-end p99 latency target
QPS    = 50_000         # peak requests per second the system must absorb

# (stage, budget_ms, measured_p99_ms, per_replica_throughput_qps)
stages = [
    ("gateway + re-rank",  10.0,  8.0, 5000),
    ("retrieval (ANN)",    35.0, 31.0, 1200),
    ("feature fetch",      25.0, 22.0, 2500),
    ("ranking inference",  45.0, 40.0,  400),
]

budget_total   = sum(b for _, b, _, _ in stages)
measured_total = sum(m for _, _, m, _ in stages)

print(f"SLO (end-to-end p99)      : {SLO_MS:.0f} ms")
print(f"sum of stage budgets      : {budget_total:.0f} ms   (slack {SLO_MS - budget_total:+.0f} ms)")
print(f"measured end-to-end p99   : {measured_total:.0f} ms   -> "
      f"{'WITHIN' if measured_total <= SLO_MS else 'BLOWS'} SLO")
print()
print(f"{'stage':<20}{'budget':>8}{'meas':>7}{'in-flight':>11}{'replicas':>10}")
print("-" * 56)
for name, b, m, cap in stages:
    in_flight = QPS * (m / 1000.0)            # Little's law: L = lambda * W
    replicas  = math.ceil(QPS / cap)          # R = ceil(lambda / mu)
    print(f"{name:<20}{b:>6.0f}ms{m:>5.0f}ms{in_flight:>11.0f}{replicas:>10}")

total_replicas = sum(math.ceil(QPS / cap) for _, _, _, cap in stages)
print("-" * 56)
print(f"{'total fleet':<20}{'':>13}{'':>11}{total_replicas:>10}")
print()

# One stage regresses: ranking p99 drifts 40 -> 70 ms (cold cache, larger model)
deg = [(n, b, (70.0 if n == 'ranking inference' else m), c) for n, b, m, c in stages]
deg_total = sum(m for _, _, m, _ in deg)
print("Ranking p99 drifts 40 -> 70 ms:")
print(f"  new end-to-end p99      : {deg_total:.0f} ms   -> "
      f"{'WITHIN' if deg_total <= SLO_MS else 'BLOWS'} SLO by {deg_total - SLO_MS:+.0f} ms")

Code 38.8.2: A latency-budget and capacity calculator for the request path. It checks the additive SLO constraint, sizes each fleet from the peak request rate, and demonstrates that a single slow stage breaks the end-to-end budget even though every other stage is well inside its own.

SLO (end-to-end p99)      : 120 ms
sum of stage budgets      : 115 ms   (slack +5 ms)
measured end-to-end p99   : 101 ms   -> WITHIN SLO

stage                 budget   meas  in-flight  replicas
--------------------------------------------------------
gateway + re-rank       10ms    8ms        400        10
retrieval (ANN)         35ms   31ms       1550        42
feature fetch           25ms   22ms       1100        20
ranking inference       45ms   40ms       2000       125
--------------------------------------------------------
total fleet                                        197

Ranking p99 drifts 40 -> 70 ms:
  new end-to-end p99      : 131 ms   -> BLOWS SLO by +11 ms

Output 38.8.2: At the planned operating point the path returns in 101 ms with 5 ms of designed slack, and the ranking fleet (125 replicas) dwarfs the others, confirming where the cost lives. When ranking alone drifts from 40 ms to 70 ms, the additive sum jumps to 131 ms and blows the 120 ms SLO by 11 ms, even though gateway, retrieval, and feature fetch never moved. This is precisely the moment the fallback ladder of Section 5 must fire.

Practical Example: The Slow Shard That Took Down the Page

Who: An ML platform engineer on the home-feed team of a large e-commerce site.

Situation: The recommender held a 120 ms p99 SLO comfortably at 96 ms for months, exactly as Code 38.8.2 plans for.

Problem: After a routine index rebuild, p99 jumped past 120 ms during the evening peak and the home feed started rendering the static fallback for a noticeable fraction of users.

Dilemma: Add ranking replicas immediately (treat it as a throughput problem) or trace which stage's tail had moved (treat it as a latency problem), with revenue bleeding either way.

Decision: They traced first, because the additive model says a single stage's p99 is enough to blow the sum, and throughput-driven autoscaling would not fix a tail caused by one slow shard.

How: Per-stage tracing showed retrieval p99 had climbed from 31 ms to 58 ms while its mean was unchanged; one ANN shard, rebuilt with a bad parameter, was the slowest replica that every fan-out query had to wait for.

Result: Enabling hedged requests on retrieval (Section 34.7) cut its p99 back to 34 ms within minutes, and a corrected rebuild of the bad shard closed it permanently; no replicas were added.

Lesson: Under an additive SLO the tail lives in one stage at a time. Trace the percentile per stage before you scale, because a latency problem and a throughput problem call for opposite remedies.

Research Frontier: Learned and SLO-Aware Cascades (2024 to 2026)

The fixed-budget cascade of this section is being made adaptive. Recent work on recommendation serving treats the retrieve-then-rank pipeline as a learned cascade whose depth and candidate count are chosen per request to hit the latency target, spending more compute on hard or high-value requests and short-circuiting easy ones; this builds on the early-exit and cascade-ranking lineage and on SLO-aware serving schedulers. A parallel line studies generative and large-model recommenders, where a single transformer subsumes retrieval and ranking (the generative-recommendation direction of work such as Meta's HSTU / generative recommenders, 2024), which collapses several stages of Figure 38.8.1 into one fleet and shifts the entire latency budget onto the per-node efficiency techniques of Chapter 24. The open question for the architect is the same one this section poses: how to keep a hard tail-latency contract while the model inside the budget grows.

7. Reading the Whole System as One Picture Beginner

Step back from Figure 38.8.1 and the recommender is legible as a single object: a fast loop and a slow loop sharing two thin connections, a log stream down and an artifact handoff up. The fast loop spends a fixed budget across a fixed sequence of fleets and degrades rather than waits; the slow loop turns the fast loop's exhaust (its logs) into the fast loop's fuel (its models) on a cadence the fast loop never feels. Every component is something this chapter already built, and every cross-chapter dependency, the collectives of Chapter 4 inside training, the parameter servers of Chapter 11, the retrieval of Chapter 25, the serving of Chapters 23 and 24, the scheduling of Chapter 33, the reliability of Chapter 35, is here as a service with a budget and a fallback. The final section hands this architecture back to you as a buildable project, growing a single-machine baseline into the full federation of services one stage at a time, with a measurable milestone at each step. That continues in Section 38.9.

Exercise 38.8.1: Redraw the Budget Conceptual

The product team tightens the SLO from 120 ms to 90 ms. Using the additive model of Section 2 and the stage budgets in Code 38.8.2, propose a new budget assignment $\{B_s\}$ that fits 90 ms with a sensible slack term. State explicitly which stage you cut the most and justify it from where the time and the risk actually sit (consider that ranking is both the largest budget and the most accuracy-sensitive). Then name one architectural change (not a budget edit) that would let you reclaim time at that stage without losing ranking quality, and tie it to a specific earlier chapter.

Exercise 38.8.2: Add a Cache Tier Coding

Extend Code 38.8.2 with a cache in front of the ranking stage that serves a hit in 3 ms and has hit rate $h$. The effective request rate reaching the ranking fleet becomes $\lambda(1-h)$, and the stage's contribution to end-to-end p99 becomes a blend of the 3 ms hit path and the 40 ms miss path. For $h \in \{0.3, 0.5, 0.7\}$, recompute the ranking fleet's replica count $R_{\text{rank}} = \lceil \lambda(1-h)/\mu_{\text{rank}} \rceil$ and the blended p99, and report at which hit rate the system would still hold the SLO if the miss-path ranking latency drifted to 70 ms. Print a small table.

Exercise 38.8.3: Cost of the Tail Analysis

Using $R_s = \lceil \lambda/\mu_s \rceil$ and the cost model $C = \sum_s R_s c_s$ from Section 3, assume a gateway replica costs \$0.05/hr, an ANN replica \$0.20/hr, a feature-store replica \$0.15/hr, and a GPU ranking replica \$2.00/hr. Compute the total hourly fleet cost at the operating point of Output 38.8.2. Now suppose you must defend the p99 against a 1.5x tail by over-provisioning every fleet by 50% (a common headroom rule). Recompute the cost and report which single fleet drives the increase. Argue, from these numbers, why investments in ranking per-replica throughput (Chapter 22) pay back faster than the same effort spent on any other stage.