Part II: Distributed Data Processing for AI
Chapter 9: Stream Processing and Online AI

Batch vs Stream Processing

"I spent all night reducing the perfect daily report. By the time I finished, the day it described had already happened to someone else."

A Batch Job That Missed Its Window
Big Picture

Every chapter so far has treated data as a finite pile you process to completion; this chapter treats it as an endless river you must act on while it flows. Batch processing assumes the input is bounded: you wait until all of it has arrived, run a computation over the whole thing, and emit a result that is correct but already a little old. Stream processing assumes the input never ends: events arrive one after another, forever, and the system updates its answer continuously, trading the comfort of seeing everything for the freshness of acting now. For AI this is not a stylistic choice. Fraud must be caught during the swipe, not in tomorrow's report; features must reflect what the user did a second ago; a model whose accuracy is sliding must be noticed while it slides. This section draws the line between the two modes, shows the trade-off that governs every choice between them, and introduces the Lambda and Kappa architectures that the rest of the chapter builds on.

The previous three chapters processed data that sat still. MapReduce in Chapter 6 read an entire dataset from disk, shuffled it, and reduced it. Spark in Chapter 7 built a dataframe over a fixed set of files. The storage and loading systems of Chapter 8 fed a finite corpus into a training loop. In all three, the dataset had a first record and a last record, and the job was done when the last record was processed. That assumption, a bounded input you can read to the end, is the defining property of batch processing, and it is enormously convenient: you can sort, you can join, you can compute exact global aggregates, because nothing more is ever coming.

Real systems, though, sit inside a world that does not stop producing data. Card swipes, sensor readings, clickstream events, log lines, and model predictions arrive at all hours and never reach a last record. You can pretend such a feed is bounded by slicing it into yesterday's file and running a batch job overnight, and for decades that is exactly what people did. The cost of the pretense is staleness: the answer you compute describes a world that has already moved on. Stream processing removes the pretense. It accepts the input as unbounded and processes each event close to the moment it arrives, keeping a continuously fresh answer at the price of never being able to see the whole dataset at once.

Batch: bounded input, wait, then one stale answer the whole dataset, collected first batch job result correct, hours old long wall-clock delay before any output Stream: unbounded input, continuous fresh answers time, never ending events arrive one at a time, forever processor bounded state answer always now state updated per event, then the event is forgotten
Figure 9.1.1: The two processing modes. Above, a batch job collects the entire bounded dataset, runs once, and emits a correct result that is already hours old. Below, a stream processor consumes an unbounded sequence of events, updates a small bounded state on each arrival, and keeps an answer that reflects the present. The dashed loop marks the defining move of streaming: state is updated incrementally and the raw event is discarded, so memory never grows with the length of the stream.

1. Bounded and Unbounded Inputs Beginner

The cleanest way to separate the two modes is by the shape of their input, not by how fast they run. A batch job consumes a bounded dataset: a collection with a known size that you can, in principle, read from the first record to the last. Because the end exists, the job can compute things that require seeing everything, such as an exact count, a global sort, or a join against the full table, and it can report a single final answer when it reaches the last record. A stream job consumes an unbounded sequence: events that keep arriving with no last record in sight. Because the end never comes, the job can never wait for "all the data"; it must produce useful output from a prefix of the stream and keep revising that output as more of the stream arrives.

This recalls the processing-mode spectrum introduced in Section 1.5, where we placed offline batch, micro-batch, and true streaming along a single axis of how much input a system waits for before it acts. Batch sits at one extreme, waiting for everything; record-at-a-time streaming sits at the other, waiting for nothing; micro-batching lives in between, gathering a few seconds of events into a tiny bounded chunk and running a small batch over it. Seen this way, batch and stream are not two unrelated technologies but the endpoints of a continuum, and most production systems pick a point on it rather than a side.

A useful mental test is the question "what does this system do when the next record never comes?" A batch job hangs forever, because it is waiting for an end that will not arrive. A stream job is unbothered, because it never intended to wait; it has already emitted answers for everything it has seen and is simply idle until the next event. That difference, whether the absence of more data is a completion or a pause, is the whole distinction in one sentence.

Key Insight: The Input's Boundedness, Not Its Speed, Defines the Mode

Batch versus stream is not "slow versus fast." A batch job over a small file can finish in milliseconds; a stream pipeline can lag minutes behind under load. The real distinction is whether the input is bounded, so the system can wait for the end and see everything once, or unbounded, so the system must act on a prefix and keep revising. Every design consequence (exact versus continuously updated results, growing versus bounded state, one final answer versus an answer that is always current) follows from that single property.

2. Why AI Needs Streaming Beginner

An AI system can tolerate stale data exactly until the moment a decision depends on what just happened. Many training pipelines are perfectly happy in batch: a foundation model trained once over a fixed corpus does not care whether the corpus is a day old. The pressure toward streaming comes from the serving and monitoring side of the system, where freshness is not a luxury but the entire point. Five recurring needs push AI workloads off the batch model and onto the stream.

The first is real-time features. A recommendation or ranking model is only as good as the signals it sees, and the most predictive signals are often the freshest: the item you viewed ten seconds ago, the merchant you just paid, the page you are on right now. Computing these features in a nightly batch makes them useless by serving time, so they must be maintained as running aggregates over an event stream, a topic Section 9.7 develops in full. The second is online inference: a deployed model that scores each incoming event as it arrives, with a latency budget measured in milliseconds, is itself a stream consumer, and Section 9.8 builds exactly such a pipeline. The third is fraud and anomaly detection, where the value of a decision decays to near zero within seconds; a fraud verdict delivered tomorrow protects no one. The fourth is monitoring: prediction latency, throughput, and error rates are themselves unbounded streams that an operations team must watch live, not reconstruct after an outage. The fifth, and the one most particular to machine learning, is concept-drift response: the statistical relationship a model learned can shift under it, and noticing that shift requires comparing a live stream of recent data against the training distribution continuously, which is why Section 9.9 treats drift as a streaming-monitoring problem.

Practical Example: The Fraud Model That Was Always One Day Too Late

Who: A data engineer on the risk team at a digital payments company.

Situation: Card-fraud scoring ran as a nightly Spark batch that recomputed each account's spending features from the full transaction history and rescored every pending charge.

Problem: Fraud rings learned the rhythm. They would open an account, run a burst of charges within a single day, and cash out before the next night's batch ever recomputed the account's risk features.

Dilemma: Run the batch more often, say hourly, which raised cluster cost steeply and still left a one-hour blind spot, or move the feature computation to a stream that updated each account's running aggregates on every transaction.

Decision: They moved to streaming, because the binding requirement was freshness measured in seconds, and no batch cadence they could afford would reach it.

How: Transactions already flowed through a distributed log (the subject of Section 9.5); they attached a stream processor that kept per-account running counts and sums in bounded state and scored each charge against features current to the millisecond.

Result: The median time from a suspicious charge to a fraud verdict fell from roughly eighteen hours to under two seconds, and the burst-and-cash-out pattern stopped working because the features moved as fast as the fraud did.

Lesson: When the value of a decision decays in seconds, no batch cadence is fast enough; the input must be treated as the unbounded stream it always was.

3. The Same Data, Two Ways: A Running Demo Intermediate

The structural difference between the modes is easiest to feel by processing one dataset both ways. The code below takes a short log of transaction amounts and computes their mean twice. The batch version waits for every event, sums the whole list, and divides once. The stream version processes the events one at a time, keeping only a running count and a running sum, and emits a fresh estimate after every single arrival. The point is not which number is "right," since they agree at the end, but when each answer is available and how much memory each approach holds.

import random

random.seed(7)
# A finite log of transaction amounts arriving in order.
N = 12
events = [round(random.uniform(5, 500), 2) for _ in range(N)]

# --- BATCH: wait for every event, then compute one aggregate. ---
def batch_mean(all_events):
    total = sum(all_events)          # touches all N at once
    return total / len(all_events)

# --- STREAM: process one event at a time, keep a bounded running state. ---
class RunningMean:
    """O(1) state: count and sum. Never stores the history."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
    def update(self, x):
        self.count += 1
        self.total += x
        return self.total / self.count   # answer available after EVERY event

print("STREAM: a fresh estimate after each arrival (state = 2 numbers)")
rm = RunningMean()
for i, x in enumerate(events, 1):
    est = rm.update(x)
    print(f"  t={i:2d}  amount={x:6.2f}  running_mean={est:7.2f}")

print()
print("BATCH: one answer, only after all", N, "events have landed")
print(f"  batch_mean = {batch_mean(events):7.2f}")
print("agreement (stream final vs batch):",
      f"{abs(rm.total/rm.count - batch_mean(events)):.2e}")
Code 9.1.1: One dataset, two processing modes. The batch path calls batch_mean once over the full list; the stream path feeds events into RunningMean one at a time, emitting an updated answer per event while holding only two numbers of state.
STREAM: a fresh estimate after each arrival (state = 2 numbers)
  t= 1  amount=165.30  running_mean= 165.30
  t= 2  amount= 79.67  running_mean= 122.49
  t= 3  amount=327.21  running_mean= 190.73
  t= 4  amount= 40.86  running_mean= 153.26
  t= 5  amount=270.26  running_mean= 176.66
  t= 6  amount=186.02  running_mean= 178.22
  t= 7  amount= 33.71  running_mean= 157.58
  t= 8  amount=256.18  running_mean= 169.90
  t= 9  amount= 23.56  running_mean= 153.64
  t=10  amount=219.65  running_mean= 160.24
  t=11  amount= 39.58  running_mean= 149.27
  t=12  amount= 49.90  running_mean= 140.99

BATCH: one answer, only after all 12 events have landed
  batch_mean =  140.99
agreement (stream final vs batch): 2.84e-14
Output 9.1.1: The stream produced eleven usable answers before the batch produced any, and its final estimate matches the batch result to floating-point rounding ($2.84 \times 10^{-14}$). Crucially, the stream's state stayed at two numbers regardless of how many events flowed through it.

Two properties of streaming jump out of Output 9.1.1. First, freshness: at every timestep $t$ the stream already has an answer for the first $t$ events, while the batch has nothing until event twelve lands. If transactions kept arriving for a year, the stream would have a current mean at every instant and the batch would still be waiting. Second, bounded state: the running mean is computable from a count and a sum, so the processor's memory does not grow with the length of the stream. This is the streaming designer's central craft, finding a small piece of state that summarizes an unbounded history. A mean needs two numbers; a maximum needs one; quantiles and distinct-element counts need clever approximate sketches because their exact state would grow without bound. The aggregate $\bar{x}_t = \frac{1}{t}\sum_{i=1}^{t} x_i$ updates as $\bar{x}_t = \bar{x}_{t-1} + (x_t - \bar{x}_{t-1})/t$, a one-line recurrence that never revisits an old event. Where state cannot be bounded by a clean summary, it is bounded instead by a window, the structure Section 9.2 introduces next.

Library Shortcut: Frameworks Maintain Streaming State for You

Code 9.1.1 hand-rolled the running aggregate and the per-event loop. Production stream processors give you the same incremental aggregate declaratively, and they add the hard parts we ignored: distributing the state across machines, checkpointing it so a crash does not lose the count, and handling events that arrive out of order. In PySpark Structured Streaming the entire batch-or-stream choice collapses to which reader you call, and the aggregation code is identical to its batch form:

# The SAME aggregation, run as an unbounded stream instead of a batch.
from pyspark.sql.functions import avg

stream = (spark.readStream                 # readStream, not read: input is unbounded
          .format("kafka")
          .option("subscribe", "transactions")
          .load())

(stream.groupBy("account_id")
       .agg(avg("amount"))                  # incremental running mean, maintained for you
       .writeStream
       .outputMode("update")               # emit only the rows whose answer changed
       .start())
Code 9.1.2: The running mean of Code 9.1.1 as a few lines of Structured Streaming. The framework handles the incremental state update, distributes it by account_id, checkpoints it for fault tolerance, and emits only changed rows; Section 9.6 unpacks this engine and contrasts it with Flink.

4. The Core Trade-Off: Freshness Against Cost and Simplicity Intermediate

Streaming is not strictly better than batch, and treating it as an upgrade is a common and expensive mistake. The two modes sit on opposite ends of a single trade-off. Batch buys throughput, cost-efficiency, and simplicity: reading a large dataset sequentially and processing it in bulk is the most efficient way to move bytes, the cluster runs only when the job runs, and the programming model is the familiar one where you see all the data and nothing changes under you. Its price is latency and staleness: the answer describes the world as of the last time the job ran. Streaming buys low latency and freshness: the answer tracks the present moment. Its price is that the cluster must run continuously, the per-event overhead makes raw throughput lower than a bulk scan, and the programming model is harder, because you must reason about partial inputs, out-of-order events, late arrivals, and state that survives crashes.

We can make the latency side precise. Treat result staleness as the age of the data reflected in the current answer. For a batch job that runs every $T$ seconds and takes $P$ seconds to process, the freshest result is already $P$ seconds old the instant it lands, and just before the next run it is $T + P$ seconds old, so the staleness averages about $T/2 + P$. Shrinking $T$ (running more often) reduces staleness but raises cost, and you cannot push $T$ below $P$ without runs overlapping. A stream processor instead carries a roughly constant per-event latency $\ell$, the time from an event arriving to the answer updating, independent of how long the stream has been running. The choice between the modes is the choice between paying continuously for a small constant $\ell$ and paying periodically for a staleness of $T/2 + P$ that you can only shrink by spending more.

Because this trade-off is real and neither end dominates, mature systems frequently want both at once: the cheap exact history that batch computes well, and the fresh approximate present that streaming computes well. That desire is what gives rise to the two reference architectures of the chapter, which is the subject of the next section.

Fun Note: The Lambda That Outlived Its Name

The Lambda architecture is named for the Greek letter, whose two diverging strokes were meant to evoke the batch path and the speed path splitting from a shared input and rejoining at the serving layer. Engineers maintaining one have been known to argue that the more fitting image is the two code paths slowly diverging in behavior over the years until a number in the dashboard depends on which branch you ask. The Kappa architecture, proposed partly to escape exactly that fate, is named for the letter that comes next.

5. Lambda and Kappa Architectures Advanced

The Lambda architecture answers "do I want batch or stream?" with "both." Incoming events fan out into two paths. A batch layer stores the complete immutable history and periodically recomputes accurate, comprehensive views over all of it; this path is slow but exact and easy to reason about. A speed layer processes the same events as a stream to produce low-latency approximate views covering only the recent window the batch has not yet caught up to. A serving layer merges the two, answering each query from the authoritative batch view plus the fresh speed-layer increment. Lambda delivers both freshness and eventual exactness, and its cost is duplication: the same aggregation logic is implemented twice, once in a batch engine and once in a streaming engine, and keeping the two definitions in lockstep as requirements change is a permanent maintenance tax.

The Kappa architecture removes the duplication by removing the batch layer. Its claim is that if your event log is durable and replayable, you do not need a separate batch path at all: there is only the stream. Routine processing runs over the live stream as usual, and whenever you need to recompute history (because the logic changed or a bug is fixed) you replay the retained log from the beginning through the same streaming code. Batch becomes a special case of streaming, namely streaming over recorded history rather than the live tail, so a single codebase serves both needs. Kappa is simpler to maintain and conceptually clean, and it leans entirely on the durable, replayable distributed log that Section 9.5 builds; its limit is that some genuinely batch-shaped jobs, large global joins and sorts over the full dataset, are awkward to express as a replayed stream. Figure 9.1.2 contrasts the two.

Lambda: batch path + speed path events batch layer exact, slow speed layer fresh, approx serving merge views Kappa: one replayable stream events replayable log stream processor replay history through the same code
Figure 9.1.2: The two reference architectures. Lambda (left) runs a batch layer and a speed layer in parallel and merges their views at serving time, paying duplicated logic for both exactness and freshness. Kappa (right) keeps only a durable replayable log and a single stream-processing path; recomputing history means replaying the log through the same code, so one codebase covers both the live tail and the recorded past.
Research Frontier: The Streaming-First Lakehouse (2024 to 2026)

The industry trend since 2024 has been to collapse the batch and stream divide at the storage layer rather than the compute layer, pushing Kappa's "one path" idea into the table format itself. Transactional table formats (Apache Iceberg, Delta Lake, Apache Hudi) now expose the same dataset as both a batch table and an incrementally consumable stream, so a single declarative query runs in either mode; Spark's Project Lightspeed and Apache Flink 1.18+ unified batch-and-stream execution push the same convergence on the engine side, and systems such as Apache Paimon and RisingWave market themselves explicitly as streaming-first lakehouses. For machine learning the most active line is the streaming feature store (Feast, Tecton, Chronon), which guarantees that the feature a model sees online is computed by the identical logic that produced its training data offline, eliminating the train-serve skew that Lambda's duplicated paths invite. The unifying research question is whether a single execution model can deliver batch's throughput and streaming's freshness without forcing the engineer to choose, and the 2024 to 2026 stacks are betting that it can.

6. What This Chapter Builds Beginner

The rest of the chapter turns the contrast drawn here into working machinery, one layer at a time. Section 9.2 defines events, streams, and the windows that bound state when no clean running summary exists. Section 9.3 separates the time an event happened from the time the system saw it, the event-time versus processing-time distinction that makes streaming subtle, and Section 9.4 introduces watermarks, the mechanism that decides when a window is safe to close despite late arrivals. Section 9.5 builds the Kafka-style distributed log that both architectures lean on, and Section 9.6 compares the two dominant engines, Spark Structured Streaming and Flink. The chapter then turns to AI proper: Section 9.7 on online feature computation, Section 9.8 on distributed real-time inference pipelines, and Section 9.9 on detecting concept drift and monitoring deployed models as a streaming problem.

Thesis Thread: Streaming Is Distribution Across Time, Not Just Machines

Every earlier part of this book distributed work across space, splitting a dataset, a gradient, or a model across many machines at one instant. Streaming adds a second axis: it distributes the work across time, processing an unbounded input incrementally so that no single moment ever holds the whole thing. The two axes compose, because a real stream processor is also sharded across machines, each owning a slice of the keyspace, so the bounded per-key state of Code 9.1.1 is itself partitioned exactly as the embeddings of Chapter 8 were. Online learning and continual model updates, which keep a model fresh against a moving world, are the natural endpoint of this idea, and federated learning in Part III pushes it all the way out to data that never stops arriving on devices the central system never sees.

The thread to hold onto is the one Figure 9.1.1 drew: batch waits for a bounded whole and answers once; streaming acts on an unbounded prefix and answers continuously, bounding its memory by a summary or a window. Everything that follows, event time, watermarks, logs, engines, online features, and drift, is the engineering required to make that continuous answer correct, fault-tolerant, and fast at scale.

Exercise 9.1.1: Bounded, or Just Pretending? Conceptual

For each workload, decide whether its input is genuinely bounded or an unbounded stream being sliced into batches, and state the staleness the slicing imposes: (a) retraining a recommendation model every night on the last 90 days of clicks; (b) computing each user's "items viewed in the last 5 minutes" feature for live ranking; (c) generating a quarterly financial report from a closed ledger; (d) alerting an on-call engineer when prediction error rate crosses a threshold. For any you called "pretending," name the freshness requirement that would force a true streaming design.

Exercise 9.1.2: Bound the State Coding

Extend the RunningMean class in Code 9.1.1 to also track, with bounded state, the running maximum and the running variance of the stream (use Welford's online algorithm for the variance so you never store past events). Then add a method that reports the mean over only the last $k$ events, and explain why this last quantity, unlike the others, cannot be maintained in $O(1)$ state without either a $k$-length buffer or an approximation. Connect your answer to why Section 9.2 needs windows.

Exercise 9.1.3: Price the Freshness Analysis

A batch feature job takes $P = 20$ minutes to run and is scheduled every $T$ minutes. Using the staleness estimate $T/2 + P$ from Section 4, find the smallest average staleness you can achieve and the $T$ that achieves it, given that runs may not overlap. Now suppose a streaming version holds a constant per-event latency of $\ell = 2$ seconds but requires the cluster to run continuously at a cost of \$8 per hour, while the batch cluster costs \$8 per hour only while a run is executing. At what run frequency does the batch approach cost as much as the stream, and what staleness does it deliver at that point? State the conditions under which streaming is the clearly correct choice.