"They asked me for yesterday's answer cheaply or today's answer instantly, and seemed genuinely surprised that I could not deliver both at once."
A Pipeline Negotiating Its Latency Budget
The same model can be deployed in four very different ways, and the choice is governed by a single trade-off: how fresh the answer must be versus how cheaply you can produce it at high volume. Run the work in large overnight jobs and you get enormous throughput per dollar but answers that are hours stale; run it one request at a time under a strict deadline and you get instant answers but pay dearly for each one. Between those poles sit streaming, which processes events continuously seconds after they arrive, and online learning, which updates the model itself on every event. These are not four different systems competing for the title of "best"; they are four operating points on one curve, and a serious distributed AI system usually runs several of them at once. This section names the four modes, places them on the freshness-latency axis, and points to the chapters where each is built in full.
In the previous sections we established the thesis (distribute the essential work), the proof that one central form of it is exact (the gradient identity of Section 1.1), and the six axes that organize the book. Those sections asked where the work is split across machines. This section asks a different and equally structural question: when does the work run relative to the data arriving and the answer being needed? A recommendation model trained once a night, refreshed on a streaming feature pipeline every few seconds, and queried interactively at serving time is the same statistical object wearing three different deployment costumes. Choosing the costume well is the difference between a system that is fast, fresh, and affordable and one that is two of those at the expense of the third.
1. The Four Modes, Ordered by Latency Beginner
Batch processing runs the work in large jobs over an accumulated body of data: read a day's worth of logs, retrain or rescore everything, write the results, and stop. Because the per-job overhead (launching workers, reading from storage, scheduling) is amortized over millions of records, batch achieves the highest throughput per dollar of any mode, and the distributed-data machinery of Chapter 6 exists precisely to make these jobs scale across a cluster. The price is freshness: a batch answer reflects the world as it was when the job started, which may be hours or a full day ago.
Streaming processing never stops. Events flow in continuously through a log such as Apache Kafka, and a processor such as Apache Flink or Spark Structured Streaming applies the same computation to each small window of events seconds after they arrive. Streaming trades a little of batch's per-record efficiency (it cannot amortize over a whole day at once) for freshness measured in seconds rather than hours, which is exactly what a fraud feature or a trending-topic counter needs. This is the home of the streaming feature store, and the book develops it in Chapter 9.
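To make "the same computation applied to each small window" concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) time window that counts topic mentions, in the spirit of the trending-topic counter mentioned above. It is a toy under stated assumptions: the function name `tumbling_window_counts`, the five-second window, and the sample events are all illustrative, and it reads from an in-memory list rather than a live log. A real processor such as Flink would emit each window as it closes and would also handle late and out-of-order events.

```python
from collections import Counter, defaultdict

def tumbling_window_counts(events, window_s=5.0):
    """Count topic mentions per fixed, non-overlapping time window.

    events: iterable of (timestamp_seconds, topic) pairs.
    Returns {window_start_seconds: Counter({topic: count})}.
    """
    windows = defaultdict(Counter)
    for ts, topic in events:
        # Every timestamp maps to the start of the window that contains it.
        window_start = (ts // window_s) * window_s
        windows[window_start][topic] += 1
    return dict(windows)

# Hypothetical events: (arrival time in seconds, topic mentioned).
events = [(0.4, "ai"), (1.9, "ai"), (4.2, "kafka"), (5.1, "ai"), (9.8, "flink")]

for start, counts in sorted(tumbling_window_counts(events).items()):
    print(start, dict(counts))
```

The per-window work (incrementing a counter) is the same whether the window holds five seconds of events or a full day; only the window length, and therefore the freshness of the count, changes.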
Online learning goes one step further: the model parameters themselves change on every event. Where batch and streaming hold the model fixed and only the inputs flow, an online system folds each new labeled example straight into the weights, so the model at time $t+1$ already reflects the example seen at time $t$. This keeps the model continuously adapted to a drifting distribution, at the cost of new failure modes (a single bad batch can corrupt the live model) that Chapter 9 treats alongside streaming, since the two almost always run on the same pipeline.
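The phrase "the model at time $t+1$ already reflects the example seen at time $t$" can be sketched with a per-event gradient step on a small linear model. Everything here is an illustrative assumption rather than the method any later chapter prescribes: the squared-error loss, the learning rate, the helper name `online_sgd_step`, and the synthetic noise-free stream generated from `true_w`.

```python
import numpy as np

def online_sgd_step(w, x, y, lr=0.05):
    """Fold one labeled example (x, y) into the weights of a linear model.

    Uses the squared-error loss 0.5 * (w @ x - y) ** 2, whose gradient
    with respect to w is (w @ x - y) * x.
    """
    error = w @ x - y
    return w - lr * error * x

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])    # hypothetical "world" generating the stream
w = np.zeros(3)                        # the live model starts knowing nothing

for t in range(2_000):                 # each iteration is one arriving event
    x = rng.standard_normal(3)
    y = true_w @ x                     # the label observed for this event
    w = online_sgd_step(w, x, y)       # weights at t+1 already reflect event t

print(np.round(w, 3))
```

There is no separate training job in this loop: the weights used to score the next event are whatever the previous event left behind. That is also why one corrupted example, or a run of them, reaches the live model immediately.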
Interactive serving is request-response under a tight latency budget. A user, or another service, sends one input and waits for the answer, and the entire deadline (often tens to hundreds of milliseconds, or the time to first token for a language model) is the product. Here throughput per dollar is sacrificed deliberately: you keep capacity warm and idle so that any single request is answered now, not when it is convenient. Interactive serving of large models is the subject of Chapter 23 and, for language models specifically, Chapter 24.
The four modes are not a menu of unrelated systems; they are positions on one trade-off. Moving toward fresher data and lower latency (right along Figure 1.5.1) means processing smaller groups of events more often, which strips away the amortization that makes large jobs cheap. Batch amortizes over a whole day and is cheapest per record but stalest; interactive amortizes over nothing and is freshest but most expensive per answer. There is no operating point that is simultaneously the freshest, the lowest latency, and the cheapest, so every deployment decision is a deliberate choice of where on this curve a given part of the workload should sit.
2. Why Smaller Groups Cost More: The Amortization Curve Intermediate
The trade-off in Figure 1.5.1 is not a matter of taste; it falls out of a simple cost model. Suppose processing a group of $b$ events carries a fixed overhead $c_0$ (scheduling, reading from storage, launching the model) plus a marginal cost $c_1$ per event. The cost per event is then
$$\text{cost per event} = \frac{c_0 + c_1 b}{b} = \frac{c_0}{b} + c_1.$$

As the group size $b$ grows, the fixed overhead $c_0$ is spread thinner and the per-event cost falls toward the floor $c_1$. Batch pushes $b$ into the millions and sits near that floor. Interactive serving runs at $b = 1$, where the full overhead $c_0$ is paid on every single answer. Streaming and micro-batching choose an intermediate $b$ to buy back most of the amortization while keeping latency bounded, because a larger group also means waiting longer to fill it. The code below measures this curve directly by scoring the same million-event dataset three ways (the interactive case times a sample of two thousand single-event calls) and reporting throughput and per-event latency for each.
```python
import time

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000                        # total events to score
w = rng.standard_normal(16)          # a tiny linear model
X = rng.standard_normal((N, 16))     # the event features

def score(rows):                     # the per-event work, identical in every mode
    return rows @ w

# BATCH: one big matrix multiply over all N events at once.
t0 = time.perf_counter()
_ = score(X)
batch_s = time.perf_counter() - t0

# MICRO-BATCH (streaming): process in windows of 10k events.
t0 = time.perf_counter()
for i in range(0, N, 10_000):
    _ = score(X[i:i+10_000])
stream_s = time.perf_counter() - t0

# INTERACTIVE: one event at a time, the latency a caller actually waits.
t0 = time.perf_counter()
for i in range(2_000):               # sample 2k single-event calls
    _ = score(X[i:i+1])
inter_s = (time.perf_counter() - t0) / 2_000

print(f"{'mode':<14}{'throughput (events/s)':>24}{'per-event latency (us)':>26}")
print(f"{'batch':<14}{N/batch_s:>24,.0f}{batch_s/N*1e6:>26.2f}")
print(f"{'streaming':<14}{N/stream_s:>24,.0f}{stream_s/N*1e6:>26.2f}")
print(f"{'interactive':<14}{1/inter_s:>24,.0f}{inter_s*1e6:>26.2f}")
```
Code 1.5.1: The per-event work (`score`) is identical; only the group size $b$ changes, from all $N$ at once (batch), to windows of ten thousand (streaming), to one at a time (interactive).

Output 1.5.1:

```
mode             throughput (events/s)    per-event latency (us)
batch                       65,915,668                      0.02
streaming                   61,711,634                      0.02
interactive                    783,239                      1.28
```
The numbers make the amortization curve concrete. Batch and streaming, with large $b$, sit near the marginal-cost floor and reach tens of millions of events per second. The interactive mode, at $b = 1$, pays the fixed overhead on every call and its throughput collapses by nearly two orders of magnitude, even though the arithmetic per event is identical. This is the entire reason interactive serving is expensive: you are not doing harder work per event, you are doing it in the least amortizable way possible, because a waiting caller cannot be told to come back once a million friends have queued up behind them. Real serving systems claw back some of this loss with dynamic batching, grouping concurrent requests on the fly, which is one of the central techniques of Chapter 24.
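As a rough sanity check, the cost model $c_0 / b + c_1$ can be solved for its two unknowns using only the batch and interactive rows of the printed output, then asked to predict the streaming row. This is a back-of-the-envelope sketch, not a fit: it treats time per event as the "cost", uses throughputs copied from a single run on unspecified hardware, and the variable names are illustrative.

```python
N = 1_000_000

# (group size b, throughput in events/s) copied from the output above.
measured = {
    "batch": (N, 65_915_668),
    "streaming": (10_000, 61_711_634),
    "interactive": (1, 783_239),
}

def per_event_ns(throughput):
    """Convert events per second into nanoseconds per event."""
    return 1e9 / throughput

# Two equations, two unknowns: c0 / b + c1 = t(b) at b = 1 and at b = N.
t_one = per_event_ns(measured["interactive"][1])   # c0 + c1
t_all = per_event_ns(measured["batch"][1])         # c0 / N + c1
c0 = (t_one - t_all) / (1 - 1 / N)
c1 = t_one - c0

# Use the solved model to predict the row that was held out.
b_stream, thr_stream = measured["streaming"]
predicted = c0 / b_stream + c1
observed = per_event_ns(thr_stream)

print(f"c0 ~ {c0:.1f} ns per call, c1 ~ {c1:.2f} ns per event")
print(f"streaming: predicted {predicted:.2f} ns, observed {observed:.2f} ns")
```

On these numbers the fixed overhead comes out at roughly 1.3 microseconds per call and the marginal cost at about 15 nanoseconds per event. The predicted streaming cost lands within about a nanosecond of the measured one. That agreement is suggestive rather than conclusive, since a single timing run is noisy.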
A batch job is a catering kitchen: it cooks five hundred identical meals at dawn, dirt cheap per plate, and you eat whatever was decided last night. Interactive serving is a barista pulling one espresso while you watch: fresh, exactly what you asked for, and priced accordingly. Streaming is the pastry case refilled every few minutes. Online learning is the barista quietly adjusting the grind after every cup based on how the last customer reacted. Same coffee, four business models.
3. The Modes Compose Inside One System Beginner
It would be a mistake to read Figure 1.5.1 as four boxes you choose between. A production AI system almost always runs several modes at once, each on the part of the workload that fits it. Table 1.5.1 lays out a single recommendation service to make the point: the same underlying model is trained in batch, kept fresh in streaming, optionally nudged online, and queried interactively, and each row sits at a different point on the freshness-cost curve for a good reason.
| Activity | Mode | Freshness | Where the book builds it |
|---|---|---|---|
| Nightly full model retrain | Batch | Hours to a day | Chapter 6 |
| Per-user feature updates from clicks | Streaming | Seconds | Chapter 9 |
| Live adaptation to a viral item | Online | Per event | Chapter 9 |
| Ranking a page on request | Interactive | Milliseconds | Chapter 23 |
The composition is the design. The expensive, throughput-hungry full retrain runs once a night in batch because nobody needs last-second weights for the bulk of the model. The features that must reflect a click from ten seconds ago ride a streaming pipeline. A genuinely fast-moving signal, a suddenly viral item, may justify an online update so the model adapts within minutes rather than waiting for the next night. And the final ranking, which a user is actively waiting on, runs interactively under a tight budget. Each mode addresses a different freshness requirement, and forcing the whole system into any single mode would be wasteful at one end and too stale at the other. Recognizing this layered structure, and the distributed machinery each layer needs, is what the rest of the book equips you to do.
None of these four modes escapes scale-out. Batch over a petabyte is the partitioned MapReduce and Spark machinery of Part II. Streaming over millions of events per second is a partitioned, fault-tolerant dataflow across a cluster, also Part II. Online learning distributes the model-update step the way Section 1.1 distributed the gradient. Interactive serving of a model too large for one accelerator is distributed inference, the whole of Part V. The processing mode chooses when the work runs; the six axes of Section 1.2 still govern how it is split across machines. The two questions are orthogonal, and a real system answers both.
4. Latency, Freshness, and the Cost of Waiting Intermediate
It is worth separating two ideas that the axis in Figure 1.5.1 deliberately bundles, because they can diverge. Latency is how long a caller waits for one answer once they ask. Freshness (or its inverse, staleness) is how old the data and model behind that answer are. Interactive serving has low latency but can still serve a stale model if that model was last trained in last night's batch job; the request is answered in milliseconds, yet the weights are a day old. Streaming attacks freshness without necessarily attacking serving latency: it keeps the features current but the request itself may still be a separate interactive call. Keeping the two notions distinct lets you diagnose a complaint precisely. "The answer was slow" is a latency problem; "the answer was instant but wrong because it ignored what just happened" is a freshness problem, and they have entirely different remedies.
There is also a hard floor on freshness set by physics and queueing, not by laziness. To process events in groups of $b$, a streaming system must wait for $b$ events to arrive, so at an arrival rate of $\lambda$ events per second the window alone adds on the order of $b / \lambda$ seconds of delay before any processing begins. Shrinking $b$ toward one removes that wait but, as Output 1.5.1 showed, surrenders the amortization that kept the work cheap. This is the freshness-versus-throughput trade-off restated in time rather than dollars, and it is why "just make it real-time" is never free: every second of freshness you demand is paid for in efficiency you give up. Section 1.6 turns latency, throughput, cost, and reliability into the four quantities you will measure for the rest of the book.
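The two pressures on $b$ can be put side by side in a few lines. The sketch below tabulates the window-fill delay $b / \lambda$ next to the amortized cost $c_0 / b + c_1$ for a handful of group sizes. The arrival rate and both cost constants are assumed values chosen only for illustration, not measurements, and the helper names are hypothetical.

```python
def window_fill_delay(b, arrival_rate):
    """Seconds needed for b events to arrive at a steady rate (events/s)."""
    return b / arrival_rate

def cost_per_event(b, c0, c1):
    """Amortized cost model: fixed overhead c0 shared by b events, plus c1 each."""
    return c0 / b + c1

ARRIVAL_RATE = 2_000.0     # events per second (assumed)
C0_US, C1_US = 1.0, 0.02   # microseconds (assumed, not measured)

print(f"{'b':>8}{'fill delay (s)':>18}{'cost/event (us)':>18}")
for b in (1, 20, 2_000, 200_000):
    delay = window_fill_delay(b, ARRIVAL_RATE)
    cost = cost_per_event(b, C0_US, C1_US)
    print(f"{b:>8,}{delay:>18.4f}{cost:>18.4f}")
```

Under these assumptions the delay grows linearly in $b$ while the cost falls hyperbolically toward $c_1$. Most of the savings are therefore captured long before the window becomes large enough to hurt freshness.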
Who: A staff engineer on the risk team at a payments company.
Situation: A card-fraud classifier served scores in under 20 milliseconds, comfortably inside the checkout budget, and the latency dashboards were flawless.
Problem: Fraud losses crept up anyway. The model was interactive and fast, but its features (counts of recent transactions per card) were refreshed only by a nightly batch job, so a card compromised at 9 a.m. looked pristine until the next night's run.
Dilemma: Move the whole feature pipeline to interactive recomputation on every request, simple but far too slow inside a 20-millisecond budget, or stand up a streaming feature pipeline that kept the counts seconds-fresh while leaving the fast interactive scorer untouched.
Decision: They separated latency from freshness. The interactive scorer stayed as is; a Kafka-plus-Flink streaming job updated the per-card features within seconds of each transaction, writing them to a feature store the scorer read at request time.
How: Transaction events flowed into Kafka, a streaming job maintained windowed per-card aggregates, and the serving path did a single fast lookup instead of recomputing anything, so serving latency was unchanged.
Result: Feature staleness fell from up to 24 hours to a few seconds, fraud on a compromised card was caught within minutes rather than the next day, and the serving latency budget was never touched because the expensive freshness work happened off the request path.
Lesson: Low latency does not imply fresh data. Put the freshness work in a streaming pipeline and keep the interactive path lean; the two trade-offs are solved on different layers, exactly as Table 1.5.1 lays out.
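The split in this case study, with freshness work on the streaming side and a single cheap lookup on the serving side, can be sketched with an in-memory stand-in for the feature store. This is a toy under stated assumptions: the class name `CardFeatureStore`, the one-hour sliding window, and the card identifier are hypothetical, and a real deployment would run the ingest side in a stream processor and keep the aggregates in an external store.

```python
from collections import defaultdict, deque

class CardFeatureStore:
    """In-memory stand-in for a feature store maintained by a streaming job."""

    def __init__(self, window_s=3600.0):
        self.window_s = window_s
        self._events = defaultdict(deque)   # card_id -> timestamps in the window
        self._updated_at = {}               # card_id -> time of the last update

    def ingest(self, card_id, ts):
        """Streaming side: called once per transaction, off the request path."""
        q = self._events[card_id]
        q.append(ts)
        self._evict(q, ts)
        self._updated_at[card_id] = ts

    def lookup(self, card_id, now):
        """Serving side: return (recent_txn_count, staleness_seconds)."""
        q = self._events.get(card_id)
        if q is None:
            return 0, None                  # card never seen by the stream
        self._evict(q, now)                 # drop events older than the window
        return len(q), now - self._updated_at[card_id]

    def _evict(self, q, now):
        while q and q[0] <= now - self.window_s:
            q.popleft()

store = CardFeatureStore(window_s=3600.0)
for ts in (0.0, 10.0, 20.0):                # three transactions on one card
    store.ingest("card-42", ts)

count, staleness = store.lookup("card-42", now=25.0)
print(count, staleness)                     # prints: 3 5.0
```

The lookup returns two separate numbers on purpose. The count is the feature the scorer consumes. The staleness is how old that feature is, which is the quantity the latency dashboards in the case study could not see.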
5. Where Each Mode Lives in the Book Beginner
The four modes are not just a conceptual frame; each anchors a substantial part of the book, and naming the destinations now lets later chapters assume this vocabulary. Batch processing at cluster scale is built from the MapReduce model up in Chapter 6, the foundation that Spark then accelerates. Streaming and online learning share a home in Chapter 9, because in practice a streaming dataflow and a per-event model update run on the same pipeline and face the same questions of windowing, state, and fault tolerance. Interactive serving splits by model size: general distributed inference systems are Chapter 23, and the specialized world of serving large language models, with its token-by-token latency and dynamic batching, is Chapter 24. The library shortcut below shows just how little code stands between you and a real streaming consumer, the entry point to that whole arc.
Building a streaming processor from scratch means handling partitioned topics, offsets, windowing, and recovery, the machinery Chapter 9 unpacks. A modern client collapses the read loop to a handful of lines; the consumer group, partition assignment, and offset tracking are handled for you, so each event is scored seconds after it is produced:
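```python
import json
from kafka import KafkaConsumer           # pip install kafka-python

# Illustrative stand-ins (not part of kafka-python); swap in your own code.
def features_for(event):
    return event.get("features", [])      # hypothetical field on the event
def score(features):
    return sum(features)                  # stand-in for the trained model
def emit(event_id, risk):
    print(event_id, risk)                 # stand-in for a downstream sink

consumer = KafkaConsumer(
    "transactions",                       # the topic of incoming events
    bootstrap_servers="broker:9092",
    group_id="fraud-scorer",              # the library manages partition assignment
    value_deserializer=lambda b: json.loads(b),
)

for msg in consumer:                      # blocks until the next event arrives
    event = msg.value
    risk = score(features_for(event))     # the same model, now scoring a live stream
    emit(event["id"], risk)               # seconds after the event was produced
```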
The `for` loop over `consumer` hides partition assignment, offset commits, and rebalancing on failure; turning batch scoring into streaming is mostly a change of when the loop runs, not what it computes.

The freshness-latency frontier is moving fast in three visible directions. First, streaming feature stores have become standard infrastructure: open systems such as Feast and Chronon (Airbnb, 2024) aim to compute the same feature in both a batch backfill and a streaming path and keep the two consistent, which targets the train-serve skew that has long plagued feature pipelines like the one behind the fraud model above. Second, retrieval-augmented generation is going real-time: rather than serving a language model over an index rebuilt nightly, work from 2024 to 2026 pushes toward continuously updated vector indexes and streaming ingestion so that a model can answer questions about events from minutes ago, blurring the line between the streaming layer and the interactive layer. Third, researchers are revisiting online and continual learning for foundation models, asking how to fold fresh data into very large models without the catastrophic forgetting that makes naive per-event updates dangerous. We meet the serving side of these systems in Chapter 24 and the streaming side in Chapter 9; the throughline is that "fresh" and "fast" are being pushed together by better engineering of the layers in between.
With the processing modes named and placed, we have answered when the work runs. What remains is to make the trade-offs measurable. Every claim in this section ("cheaper per record," "lower latency," "fresher") has so far been a quantity we waved at rather than measured. The next section fixes that by defining throughput, latency, cost, and reliability precisely, so that the freshness-versus-throughput curve of Figure 1.5.1 becomes something you can put numbers on for any system you build. That treatment begins in Section 1.6.
For each workload, name the most appropriate primary mode (batch, streaming, online, or interactive) and justify it in one sentence using the freshness-versus-throughput trade-off: (a) generating monthly accounting reports over a year of transactions; (b) maintaining a live count of active viewers per video; (c) a code-completion model that must respond before the developer's next keystroke; (d) a spam filter that should adapt within minutes as a new campaign appears. Then state, for any one of them, a second mode the full system would likely also run and why.
Extend Code 1.5.1 to sweep the group size $b$ across $\{1, 10, 100, 1000, 10000, 100000\}$, scoring the same $N$ events at each $b$, and plot throughput against $b$ on a log-x axis. Fit the simple model $\text{cost per event} = c_0 / b + c_1$ from Section 2 to your measured times and report the estimated fixed overhead $c_0$ and marginal cost $c_1$. At which $b$ does throughput reach ninety percent of its batch maximum? Explain what that crossover means for choosing a streaming window size.
A streaming system processes events in windows of $b$ events and receives events at a steady rate of $\lambda = 5000$ events per second. Using the window-delay estimate $b / \lambda$ from Section 4, compute the minimum freshness delay added by windowing for $b \in \{1, 50, 500, 5000\}$. Combine this with the per-event cost model $c_0 / b + c_1$ to argue that there is an intermediate $b$ that is both reasonably fresh and reasonably cheap, and explain why neither extreme ($b = 1$ nor $b = N$) is the right choice for a latency-sensitive but high-volume feature pipeline.