Section 9.8: Distributed Real-Time Inference Pipelines

"They asked me to score every event the instant it arrived. Then they asked me to do it twice as fast. I asked them, politely, to send the events twice as slowly."
A Model Operator Holding the Line on Latency

Big Picture

A real-time inference pipeline is a distributed system that turns a never-ending stream of events into a never-ending stream of decisions, and its hardest problems are not the model but the plumbing around it: where the model runs, what happens when it cannot keep up, and whether an action fires exactly once. The model is a single step in a longer flow: events arrive on a log, features are joined in, a prediction is computed, and an action is emitted to a downstream topic or service. Each stage runs on its own machines, and the whole chain must answer within a latency budget while events keep coming whether or not anyone is ready for them. This section assembles the chain from the pieces built earlier in the chapter, then confronts the three failures that distinguish a pipeline that scales from one that merely runs: a model slower than the stream, a duplicated or missing action, and a budget blown by one slow hop.

The previous sections gave us the ingredients. Section 9.5 put events on a partitioned, replayable log so they survive a consumer crash. Section 9.7 showed how to compute features online so a model sees fresh signal rather than yesterday's batch aggregate. What remains is to place a model in the middle of that flow and let its predictions drive action while the events are still warm. A fraud score is worthless an hour after the transaction cleared; a recommendation is worthless after the user has left the page. Real-time inference is the discipline of scoring events fast enough that the decision still matters, across machines that fail independently and a load that never politely pauses.

Figure 9.8.1: The streaming inference pipeline. Events flow left to right from the input topic through feature enrichment, a bounded queue, model scoring, and out to an action topic. The dashed red path is the backpressure signal: when the queue fills because the model cannot keep pace, the pressure propagates back to the source, which slows down rather than discarding events. Whether the scoring step lives inside the stream operator or in an external service is the coupling decision of Section 1.

1. Embed the Model or Call a Service (Beginner)

The first design choice is where the model actually runs relative to the stream. In the embedded pattern, the model weights are loaded inside the stream-processing operator itself, so scoring is an ordinary in-process function call on the same machine that already holds the event. In the service pattern, the operator makes a network call to a separate model-serving deployment (a microservice fronted by a load balancer), which holds the weights and may scale independently. The two patterns trade the same currency that the whole book trades: latency against coupling.

Embedding is fast because there is no network hop and no serialization; the prediction is a function call away, and the per-event latency is just the model's compute time. The cost is coupling. The model now ships, scales, and fails together with the stream job; updating the model means redeploying the pipeline, and a large model multiplies the memory of every operator instance. Calling a service decouples the two: the serving fleet scales to its own load, the model is updated without touching the pipeline, and a GPU-backed model is shared across many stream consumers instead of replicated into each. The cost is a network round trip on the critical path and a second system that can fail or slow independently. That serving fleet, with its batching, replication, and autoscaling, is a subject in its own right; we build it in Chapter 23 and specialize it for large language models in Chapter 24. Here we only need the trade-off, because it decides the shape of the pipeline.

Key Insight: Embed for Latency, Serve for Independence

Embed the model in the stream operator when the model is small, the latency budget is tight, and the model changes rarely: you pay no network hop and the pipeline is one deployable unit. Call an external serving service when the model is large or GPU-bound, when it is shared across many consumers, or when it must be updated independently of the pipeline: you pay a round trip but gain independent scaling and a single place to roll the model forward. The choice is not aesthetic; it is set by which resource binds, exactly as the ceiling argument of Section 1.1 sets every other distribution decision in this book.
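The two placements can be kept behind one interface, so the operator logic does not change when the scoring step moves. What follows is a minimal sketch under stated assumptions, not a reference design: `Scorer`, `EmbeddedScorer`, `ServiceScorer`, `handle_event`, and the `transport` callable are hypothetical names introduced here, the linear model stands in for any small model, and `transport` stands in for whatever client a real serving fleet would expose.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


class Scorer(Protocol):
    """Anything the stream operator can ask for a prediction."""

    def score(self, features: Sequence[float]) -> float: ...


@dataclass
class EmbeddedScorer:
    """Embedded pattern: the weights live in the operator's own process."""

    weights: Sequence[float]

    def score(self, features: Sequence[float]) -> float:
        # Plain in-process call: no network hop, no serialization.
        return sum(w * x for w, x in zip(self.weights, features))


@dataclass
class ServiceScorer:
    """Service pattern: the weights live in a separate serving deployment."""

    # Hypothetical client: takes (features, timeout_s) and returns a score.
    transport: Callable[[Sequence[float], float], float]
    timeout_s: float = 0.020

    def score(self, features: Sequence[float]) -> float:
        # One network round trip on the critical path; a timeout or outage
        # in the serving fleet surfaces here as an exception.
        return self.transport(features, self.timeout_s)


def handle_event(features: Sequence[float], scorer: Scorer) -> str:
    """Operator logic is identical for both placements."""
    return "freeze" if scorer.score(features) > 0.5 else "allow"


embedded = EmbeddedScorer(weights=[0.5, 0.25])
remote = ServiceScorer(transport=lambda feats, timeout_s: 0.1)  # fake client
print(handle_event([1.0, 1.0], embedded))  # embedded score is 0.75
print(handle_event([1.0, 1.0], remote))    # fake client always returns 0.1
```

Only the construction of the scorer differs between the two patterns. The latency and coupling consequences described above sit entirely inside the two `score` methods.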

2. Backpressure: When the Model Is Slower Than the Stream (Intermediate)

A stream does not wait. Events arrive at whatever rate the world produces them, and that rate is not under your control. The model, by contrast, has a fixed throughput: it can score only so many events per second. When the arrival rate exceeds the service rate, the gap has to go somewhere, and there are only three places it can go. It can pile up in a buffer (which grows without bound until memory runs out), it can be discarded (which silently loses events and corrupts every downstream count), or it can push back on the source so the source slows down. The third option is backpressure, and it is the only one that is both bounded and lossless.

The arithmetic is unforgiving. If events arrive at rate $\lambda$ and the model serves them at rate $\mu$, then whenever $\lambda > \mu$ the queue depth grows on average at $\lambda - \mu$ per unit time. A buffer of capacity $C$ overflows after roughly $C / (\lambda - \mu)$ seconds of sustained overload, no matter how large $C$ is. You cannot buffer your way out of a rate mismatch; you can only delay the reckoning. Backpressure converts that doomed race into a stable control loop: as the queue fills toward a high-water mark, the consumer signals the source to slow its emission, and as the queue drains below a low-water mark, the source is released back to full speed. The queue oscillates inside its bounds and nothing is lost.
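The overflow formula is easy to check numerically. The sketch below is an illustration only: the function name `seconds_to_overflow` and the rates (500 events/s arriving, 250 events/s served) are assumptions chosen for the example. It shows that multiplying the buffer by ten multiplies the time to overflow by ten and changes nothing else.

```python
def seconds_to_overflow(capacity, arrival_rate, service_rate):
    """Time until a buffer of `capacity` events fills under sustained overload.

    Rates are in events per second. Returns infinity when the model keeps up.
    """
    surplus = arrival_rate - service_rate      # lambda - mu
    if surplus <= 0:
        return float("inf")
    return capacity / surplus                  # C / (lambda - mu)


# Illustrative rates: the source offers 500 events/s, the model serves 250.
for capacity in (200, 2_000, 20_000):
    print(capacity, seconds_to_overflow(capacity, 500.0, 250.0))
# 200 events last 0.8 s, 2,000 last 8 s, 20,000 last 80 s.
```

A hundredfold larger buffer buys 80 seconds instead of 0.8: a longer delay before the same failure, which is the sense in which buffering cannot fix a rate mismatch.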

Fun Note: The Checkout Line Knows This Already

Backpressure is the supermarket managing its checkout line. When one cashier (the model) cannot keep up and the line (the queue) snakes into the frozen aisle, the store does not start throwing groceries in the bin. It opens another lane or, failing that, the line itself slows people down at the door. A bounded queue with a throttle is the same idea with fewer impulse purchases.

The simulation below builds exactly this loop in pure Python: a producer whose peak rate is twice the model's service rate, a bounded queue, and a hysteresis controller that throttles the producer between a high- and low-water mark. We watch the queue grow, the throttle engage, and the drop count stay at zero. The model in the loop is a stand-in (any scoring function would do); the point is the control behavior of the pipeline around it.

import collections
import random

random.seed(7)

# --- fixed system parameters (one "tick" = 1 millisecond of virtual time) ---
TICKS = 4000                 # total virtual milliseconds simulated
QUEUE_CAP = 200              # hard buffer ceiling (events)
HIGH_WATER = 150             # throttle the producer at/above this depth
LOW_WATER = 50               # release the throttle at/below this depth
MODEL_MS = 4.0               # the model scores one event every 4 ms => 250 ev/s

# Peak demand (0.50/ms = 500 ev/s) is TWICE the model's 250 ev/s service rate:
# the pipeline is structurally overloaded and MUST apply backpressure.
PEAK_RATE = 0.50
THROTTLED_RATE = 0.15        # rate the producer falls back to under backpressure

queue = collections.deque()
throttled = False
next_model_free = 0.0        # virtual time when the model can take the next event
produced = scored = dropped = 0
peak_depth = backpressure_ticks = 0
samples = []

for tick in range(TICKS):
    depth = len(queue)
    if depth >= HIGH_WATER:          # hysteresis controller: two thresholds
        throttled = True
    elif depth <= LOW_WATER:
        throttled = False
    if throttled:
        backpressure_ticks += 1

    rate = THROTTLED_RATE if throttled else PEAK_RATE   # source obeys the signal
    if random.random() < rate:
        produced += 1
        if len(queue) < QUEUE_CAP:
            queue.append(tick)
        else:
            dropped += 1                                # last-resort drop only

    if tick >= next_model_free and queue:               # model scores one event
        queue.popleft()
        scored += 1
        next_model_free = tick + MODEL_MS

    peak_depth = max(peak_depth, len(queue))
    if tick % 250 == 0:
        samples.append((tick, len(queue), throttled))

print("model service rate     :", f"{1000.0 / MODEL_MS:.0f} events/s")
print("producer peak rate     :", f"{PEAK_RATE * 1000:.0f} events/s  (2x the model)")
print()
print("tick    queue_depth   producer_state")
for t, d, thr in samples:
    print(f"{t:>5}   {d:>5} {'#' * (d // 5):<32} {'THROTTLED' if thr else 'full-rate'}")
print()
print("events produced        :", produced)
print("events scored          :", scored)
print("events dropped (buffer) :", dropped)
print("peak queue depth       :", f"{peak_depth}  (cap {QUEUE_CAP})")
print("ticks under backpressure:",
      f"{backpressure_ticks} / {TICKS}  ({100.0 * backpressure_ticks / TICKS:.0f}% of the run)")
print("drop rate              :", f"{100.0 * dropped / produced:.2f}% of produced events")

Code 9.8.1: A streaming inference loop with a bounded queue and a hysteresis backpressure controller. The producer emits at twice the model's service rate until the queue reaches the high-water mark, at which point it throttles; it speeds back up only after the queue drains to the low-water mark.

model service rate     : 250 events/s
producer peak rate     : 500 events/s  (2x the model)

tick    queue_depth   producer_state
    0       0                                  full-rate
  250      71 ##############                   full-rate
  500     134 ##########################       full-rate
  750     129 #########################        THROTTLED
 1000     106 #####################            THROTTLED
 1250      90 ##################               THROTTLED
 1500      56 ###########                      THROTTLED
 1750      95 ###################              full-rate
 2000     145 #############################    THROTTLED
 2250     110 ######################           THROTTLED
 2500      88 #################                THROTTLED
 2750      66 #############                    THROTTLED
 3000      77 ###############                  full-rate
 3250     141 ############################     full-rate
 3500     127 #########################        THROTTLED
 3750      95 ###################              THROTTLED

events produced        : 1078
events scored          : 1000
events dropped (buffer) : 0
peak queue depth       : 151  (cap 200)
ticks under backpressure: 2676 / 4000  (67% of the run)
drop rate              : 0.00% of produced events

Output 9.8.1: The queue rises toward the high-water mark of 150, the controller throttles the producer (active 67% of the run), and the depth oscillates between the two marks. Peak depth reaches 151, comfortably inside the cap of 200, and not a single event is dropped despite a source that wanted to run at twice the model's rate.

The trace tells the whole story. The queue climbs while the producer runs full-rate, reaches 150, and the controller engages; the model drains the backlog until the depth falls to 50 and the producer is released, then the cycle repeats. The system spends two thirds of its time under backpressure because the offered load genuinely exceeds capacity, yet the drop rate is zero: the source absorbed the mismatch by slowing down, exactly as Figure 9.8.1 promises. A pipeline without this loop would have grown the queue without bound and either crashed on memory or begun discarding events the moment the buffer filled.
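The two-thirds figure can be cross-checked without randomness. The sketch below replaces the random arrivals of Code 9.8.1 with a deterministic fluid approximation, in which the queue depth moves in straight lines between the two water marks. The function name `fluid_throttled_fraction` is hypothetical; the rates and thresholds are the ones in Code 9.8.1, expressed in events per millisecond.

```python
def fluid_throttled_fraction(horizon_ms, service, peak, throttled_rate, high, low):
    """Fraction of the horizon spent throttled in a deterministic fluid model.

    Rates are in events per millisecond and must satisfy
    throttled_rate < service < peak, otherwise the loop never cycles.
    """
    if not throttled_rate < service < peak:
        raise ValueError("need throttled_rate < service < peak")
    t = depth = throttled_ms = 0.0
    throttled = False
    while True:
        slope = (throttled_rate if throttled else peak) - service
        target = low if throttled else high
        phase_ms = (target - depth) / slope      # time to reach the next mark
        remaining = horizon_ms - t
        if throttled:
            throttled_ms += min(phase_ms, remaining)
        if phase_ms >= remaining:                # horizon ends inside this phase
            return throttled_ms / horizon_ms
        t += phase_ms
        depth = target
        throttled = not throttled


# Parameters of Code 9.8.1: 250 ev/s service, 500 ev/s peak, 150 ev/s throttled.
fraction = fluid_throttled_fraction(4000, 0.25, 0.50, 0.15, high=150, low=50)
print(f"{fraction:.0%}")   # about 65%
```

In this approximation the queue needs 600 ms to climb from empty to 150, then alternates 1000 ms of throttled draining (net 0.10 events/ms) with 400 ms of full-rate filling (net 0.25 events/ms). Over 4000 ms that gives about 65% throttled, close to the 67% measured in Output 9.8.1; the long-run duty cycle is 1000 / 1400, roughly 71%.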

Library Shortcut: Flink and Reactive Streams Give You Backpressure for Free

The hysteresis loop in Code 9.8.1 is about forty lines of explicit queue bookkeeping. Production stream processors implement it as a property of the runtime, so you never write it. Apache Flink propagates backpressure automatically through its network stack: when a slow operator's input buffers fill, the credit-based flow-control protocol stops upstream operators from sending, and the pressure walks all the way back to the Kafka source connector, which simply polls more slowly. The same guarantee appears in the Reactive Streams standard (the request(n) demand signal) used by Akka Streams and Project Reactor. A pull-based Python consumer shows the same behavior in miniature:

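# Faust (a Python stream processor in the Kafka ecosystem). The async loop
# only pulls the next event when the model coroutine is ready for it, so an
# overloaded model naturally stops consuming and the consumer stops advancing
# its Kafka offset. Backpressure is the absence of a faster pull, not extra code.
import faust

# Illustrative app id, broker address, and topic names; enrich() and model are
# assumed to be defined elsewhere in the job.
app = faust.App("scoring-pipeline", broker="kafka://localhost:9092")
transactions_topic = app.topic("transactions")
actions_topic = app.topic("actions")

@app.agent(transactions_topic)
async def score(stream):
    async for event in stream:          # pulls one event only when ready
        features = await enrich(event)  # online feature join (Section 9.7)
        prediction = model.score(features)
        await actions_topic.send(value=prediction)   # emit the decision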
Code 9.8.2: The same bounded-consumer behavior as Output 9.8.1 expressed in a few lines of a real stream processor. The framework's pull-based loop and the Kafka consumer's offset together provide the backpressure; the explicit queue, water marks, and throttle of Code 9.8.1 collapse into the runtime.
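The idea that backpressure can live in the runtime, with no explicit water marks, can also be seen with nothing but the Python standard library. The sketch below is an illustration and not how Flink or Faust are implemented: a bounded `asyncio.Queue` suspends the producer whenever the buffer is full, so the source slows to the model's pace on its own. The function names, event count, queue size, and the 1 ms stand-in for model compute are assumptions made for the example.

```python
import asyncio


async def producer(queue, n_events):
    """Source: tries to emit as fast as it can."""
    for event_id in range(n_events):
        await queue.put(event_id)     # suspends here while the queue is full
    await queue.put(None)             # sentinel: no more events


async def consumer(queue, service_s, depths):
    """Model: scores one event at a time, recording the queue depth it sees."""
    scored = 0
    while True:
        event = await queue.get()
        if event is None:
            return scored
        await asyncio.sleep(service_s)   # stand-in for model compute
        scored += 1
        depths.append(queue.qsize())


async def run_pipeline(n_events=100, maxsize=10, service_s=0.001):
    queue = asyncio.Queue(maxsize=maxsize)   # the bounded buffer
    depths = []
    producer_task = asyncio.create_task(producer(queue, n_events))
    scored = await consumer(queue, service_s, depths)
    await producer_task
    return scored, max(depths)


scored, peak = asyncio.run(run_pipeline())
print("scored:", scored, "peak depth:", peak)
```

Every event is scored and the observed depth never exceeds `maxsize`, because the only way for the producer to add an event to a full queue is to wait for the consumer to remove one.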

3. Delivery Guarantees: Exactly-Once Versus At-Least-Once Actions (Advanced)

An inference pipeline does not just compute predictions; it emits actions, and actions have consequences in the world. The delivery guarantee on that emission is therefore a product decision, not a systems detail. Consider the three guarantees a streaming system can offer for the action it fires per event. At-most-once means an action may be skipped but never repeated: tolerable for a best-effort recommendation, catastrophic for a fraud block, because a skipped block lets the fraud through. At-least-once means an action may be repeated but never skipped: every fraud is caught, but a customer may receive the same alert twice or a card may be declined twice on a retry. Exactly-once means each event drives precisely one action, which is what everyone wants and what costs the most to guarantee.

The reason exactly-once is expensive is failure. A pipeline that crashes after scoring an event but before recording that it acted will, on recovery, replay the event from the log (Section 9.5) and is now at risk of acting twice. At-least-once is the natural default of a replayable log plus retries; you get it almost for free, and you pay for it in duplicates. Exactly-once requires that the act of emitting and the act of recording progress commit together atomically, so a crash can never leave one done without the other. Kafka's transactional producer offers this as a read-process-write transaction, and Flink achieves it with a two-phase commit tied to its checkpoints. The cheaper and more common engineering answer is to keep at-least-once delivery and make the action idempotent: tag each action with the event's unique id and have the downstream consumer ignore an id it has already applied, so a duplicate emission collapses into a single effect.
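The crash-and-replay hazard and the idempotent remedy can be traced in a few lines. The sketch below is a toy model under stated assumptions, not Kafka's or Flink's protocol: `IdempotentSink` and `run_at_least_once` are hypothetical names, the log is a Python list, the amount threshold stands in for a model, and the "crash" is simulated by returning before the offset is committed.

```python
class IdempotentSink:
    """Downstream service that applies each event id at most once."""

    def __init__(self):
        self.applied_ids = set()
        self.effects = []            # actions that actually took effect

    def apply(self, event_id, action):
        if event_id in self.applied_ids:
            return False             # duplicate emission collapses to a no-op
        self.applied_ids.add(event_id)
        self.effects.append((event_id, action))
        return True


def run_at_least_once(events, sink, start_offset=0, crash_at=None):
    """Emit an action, then commit the offset; return the last committed offset.

    A crash between the two steps leaves the offset uncommitted, so a restart
    from the returned offset re-emits that event's action.
    """
    committed = start_offset
    for offset in range(start_offset, len(events)):
        event_id, amount = events[offset]
        action = "freeze" if amount > 1000 else "allow"   # stand-in for the model
        sink.apply(event_id, action)                      # step 1: emit
        if offset == crash_at:
            return committed                              # crash before step 2
        committed = offset + 1                            # step 2: commit progress
    return committed


events = [("tx-1", 50), ("tx-2", 5000), ("tx-3", 20)]
sink = IdempotentSink()
committed = run_at_least_once(events, sink, crash_at=1)              # dies after emitting tx-2
committed = run_at_least_once(events, sink, start_offset=committed)  # replay re-emits tx-2
print(committed, sink.effects)
```

The second run emits the action for `tx-2` again, but the sink has already recorded that id, so the card is frozen once. Delivery stayed at-least-once; only the effect became exactly-once.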

Practical Example: The Duplicate Fraud Alert

Who: A platform engineer on the payments team at a digital bank.

Situation: A real-time pipeline scored each card transaction for fraud and, above a threshold, emitted a "freeze card and text the customer" action to a downstream service.

Problem: The stream job ran at-least-once. After a routine operator restart it replayed a few hundred buffered transactions, and a batch of customers received the same fraud text twice and, worse, some cards were frozen, unfrozen by support, then frozen again.

Dilemma: Move the whole pipeline to exactly-once with Kafka transactions, paying a throughput hit and a more fragile commit path, or keep at-least-once and defend against duplicates at the action boundary.

Decision: They kept at-least-once and made the action idempotent, because the duplicate, not the miss, was the failure mode that actually hurt, and idempotency was a localized change rather than a pipeline-wide protocol upgrade.

How: Each action carried the transaction's unique id as an idempotency key; the card-action service recorded applied keys for a 24-hour window and dropped any repeat. A replayed event still emitted an action, but the second action became a no-op.

Result: Duplicate texts and double-freezes went to zero across the next three restarts, with no measurable throughput cost, while every genuine fraud signal still fired because the underlying delivery stayed at-least-once.

Lesson: Decide which error hurts more, the duplicate or the miss, then buy the cheapest guarantee that prevents it. Idempotent actions over at-least-once delivery give exactly-once effects without paying for exactly-once delivery.
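The 24-hour key window in the example keeps the deduplication store from growing forever. One way such a store might look is sketched below, with the caveat that the team's actual implementation is not described in the text: `WindowedKeyStore` and `first_time` are hypothetical names, the store is an in-memory dictionary where a production service would more likely use a shared database or cache, and the clock is injectable so the expiry can be exercised without waiting a day.

```python
import time
from collections import OrderedDict


class WindowedKeyStore:
    """Remembers idempotency keys for a fixed window, then forgets them."""

    def __init__(self, window_s=24 * 3600, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._seen = OrderedDict()    # key -> time first seen, oldest first

    def first_time(self, key):
        """Return True if the key is new within the window, and record it."""
        now = self.clock()
        # Evict keys older than the window; insertion order is time order.
        while self._seen:
            oldest_key, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.window_s:
                break
            self._seen.popitem(last=False)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True


# Fake clock so the window can be crossed instantly in this illustration.
fake_now = [0.0]
store = WindowedKeyStore(window_s=10, clock=lambda: fake_now[0])
first = store.first_time("tx-42")      # new key: apply the action
repeat = store.first_time("tx-42")     # replay inside the window: drop it
fake_now[0] = 11.0
expired = store.first_time("tx-42")    # window has passed: key was forgotten
print(first, repeat, expired)
```

The window is a trade-off: it must be longer than the furthest back a restart can replay, or a late duplicate will slip through after its key has expired.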

4. Holding the End-to-End Latency Budget (Intermediate)

A real-time pipeline lives or dies by a service-level objective stated end to end, usually as a tail percentile: "99% of events produce an action within 50 milliseconds of arrival." That budget is spent across every hop in Figure 9.8.1, and the hops add. If $T$ is the time from event arrival to action emission, then

$$T = t_{\text{queue}} + t_{\text{feature}} + t_{\text{model}} + t_{\text{emit}},$$

where $t_{\text{queue}}$ is time spent waiting in buffers, $t_{\text{feature}}$ is the online feature join, $t_{\text{model}}$ is scoring, and $t_{\text{emit}}$ is writing the action downstream. The objective is on the tail of $T$, not its mean, and tails compose punishingly: a stage that is fast at the median but occasionally slow (a feature lookup that misses cache, a model that hits a garbage-collection pause) inflates the tail of the sum even when every mean looks healthy. This is why backpressure matters for the budget and not only for stability: an unbounded queue turns $t_{\text{queue}}$ into the dominant, ever-growing term, and the SLO is violated long before memory runs out. The methods for measuring and attributing tail latency across a distributed pipeline are the subject of Chapter 5, and the cost models that predict each term come from Chapter 3.

The practical consequence is that the embed-versus-serve choice from Section 1 is partly a latency-budget choice. An external service adds a network round trip to $t_{\text{model}}$, often a few milliseconds at the median but tens of milliseconds at the tail under load, which can be the difference between meeting and missing a tight budget. Request batching on the serving side, which Chapter 23 develops, trades a little median latency for much higher throughput, and whether that trade fits depends entirely on where your budget binds.
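How tails compose across hops is easiest to see by sampling. The sketch below is a toy Monte Carlo under loudly hypothetical assumptions: three independent stages, each with a steady base cost, about 1 ms of exponential jitter, and a rare slow spike. None of the numbers come from a real system. It compares the 99th percentile of the end-to-end time $T$ with the per-stage 99th percentiles.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def sample_stage(rng, base_ms, spike_ms, spike_prob):
    """One latency draw: steady base cost, ordinary jitter, rare slow spike."""
    jitter = rng.expovariate(1.0)                       # mean 1 ms of noise
    spike = spike_ms if rng.random() < spike_prob else 0.0
    return base_ms + jitter + spike


# Hypothetical stages: (base ms, spike ms, spike probability).
STAGES = {
    "feature": (3.0, 30.0, 0.008),   # occasional cache miss
    "model": (10.0, 40.0, 0.008),    # occasional garbage-collection pause
    "emit": (2.0, 25.0, 0.008),      # occasional slow downstream write
}

rng = random.Random(7)
N = 50_000
per_stage = {
    name: [sample_stage(rng, *cfg) for _ in range(N)] for name, cfg in STAGES.items()
}
totals = [sum(samples[i] for samples in per_stage.values()) for i in range(N)]

stage_p99 = {name: percentile(samples, 99) for name, samples in per_stage.items()}
sum_of_p99 = sum(stage_p99.values())
total_p99 = percentile(totals, 99)
share_over = sum(t > sum_of_p99 for t in totals) / N

for name, samples in per_stage.items():
    print(f"{name:8s} mean {sum(samples) / N:6.2f} ms   p99 {stage_p99[name]:6.2f} ms")
print(f"sum of stage p99s : {sum_of_p99:6.2f} ms")
print(f"p99 of total T    : {total_p99:6.2f} ms")
print(f"events over the sum of stage p99s: {share_over:.2%}")
```

With these numbers each stage spikes on fewer than 1% of events, so no stage's own 99th percentile shows its spike, while about 2.4% of events hit a spike in some stage. The 99th percentile of $T$ therefore sits above the sum of the per-stage figures even though every stage looks healthy in isolation, which is why the objective has to be measured end to end.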

Research Frontier: Streaming Inference for Large Language Models (2024 to 2026)

The patterns of this section are being stretched by models whose per-event cost is far higher than a fraud scorer's. Two threads stand out. First, streaming and online RAG: rather than retrieving against a static index, recent systems keep the retrieval corpus itself fed by a live stream so that a language model answers over documents that arrived seconds ago, which turns the index-update path into a stream-processing problem with its own backpressure and freshness budget (StreamingRAG and related 2024 to 2025 work). Second, token-streaming inference under continuous batching: serving stacks such as vLLM and its descendants treat a flight of in-progress generations as a stream of token-level events, dynamically admitting and evicting requests to hold a latency SLO while keeping the accelerator busy, which is backpressure and admission control applied at the granularity of a single decoded token. Both lines take the embed-or-serve, backpressure, and tail-budget questions raised here and re-ask them where one "event" can cost a full forward pass; we return to them with the serving machinery in Chapter 24 and the retrieval machinery in Chapter 25.

We now have a complete real-time pipeline: a model placed by the embed-or-serve trade-off, a bounded queue with backpressure so a slow model cannot blow the buffer, a delivery guarantee chosen by which error hurts, and a latency budget tracked across every hop. What we have not yet asked is whether the model is still right. A pipeline can be fast, lossless, and exactly-once while quietly scoring against a world that has drifted out from under it. Detecting that drift across a fleet of streaming consumers is where Section 9.9 goes next.

Exercise 9.8.1: Embed or Serve (Conceptual)

For each pipeline, decide whether you would embed the model in the stream operator or call an external serving service, and name the binding constraint that decides it: (a) a gradient-boosted fraud scorer of a few megabytes that must answer within 10 milliseconds and changes once a quarter; (b) a 7-billion-parameter language model that classifies support tickets, shared by four different stream jobs, retrained weekly; (c) an edge pipeline on a delivery robot with no reliable network. Explain what would go wrong if you made the opposite choice in each case.

Exercise 9.8.2: Tune the Backpressure Loop (Coding)

Starting from Code 9.8.1, run three experiments and report the peak queue depth, drop count, and percentage of time under backpressure for each. First, raise HIGH_WATER to 195 (just below the cap) and lower LOW_WATER to 5: explain why the wide hysteresis band reduces throttle switching but risks more drops. Second, set both water marks equal (for example both at 100) and explain the rapid on-off "chattering" you observe. Third, make the model faster than the source (MODEL_MS = 1.0) and confirm the controller never throttles. Keep MODEL_MS a whole number of ticks: the loop advances in 1 ms steps, so a fractional value such as 1.5 is effectively rounded up to 2 ms, which only matches the 500 events/s source rather than beating it. State, from these runs, the relationship between the water-mark gap and controller stability.

Exercise 9.8.3: Cost the Latency Budget (Analysis)

A pipeline must emit an action within a 50 millisecond budget at the 99th percentile. Measurements give per-stage 99th-percentile times of $t_{\text{feature}} = 8$ ms, $t_{\text{model}} = 22$ ms (embedded) or $35$ ms (external service, including the network round trip), and $t_{\text{emit}} = 4$ ms, with $t_{\text{queue}}$ held near zero by backpressure. Using $T = t_{\text{queue}} + t_{\text{feature}} + t_{\text{model}} + t_{\text{emit}}$, determine whether each scoring option meets the budget, and how much tail headroom each leaves. Then argue why summing per-stage 99th percentiles is a defensible planning estimate of the 99th percentile of $T$: usually pessimistic, because independent stages rarely hit their worst cases on the same event, but not a guaranteed upper bound, because rare spikes in different stages can each affect fewer than 1% of events yet together affect more than 1%. State what you would measure to tighten the estimate (link your reasoning to the tail-latency methods of Chapter 5).