"In the morning I answer one customer in nine milliseconds and feel important. At night I answer nine hundred million of them in my sleep and nobody thanks me. Same weights, opposite life."
A Model Serving Two Shifts
The same trained model serves two completely different workloads, and the fleet must be tuned in opposite directions for each. Online serving answers a stream of live requests under a strict latency budget, so the fleet keeps batches tiny and headroom plentiful, optimizing for tail latency and elasticity. Offline batch inference scores a fixed, enormous dataset where no one is waiting, so the fleet packs items into the largest batches the hardware will hold, runs at maximum utilization, and rents the cheapest preemptible capacity it can find. Identical weights, identical math, yet in the worked example of this section the throughput differs by more than an order of magnitude and the cost per item by about two, purely because of how the work is grouped and where it runs. This section shows why the two regimes pull apart, how a single fleet can run both at once through priority queues and idle-capacity reuse, and how large-scale batch inference becomes a distributed data job that maps a model over sharded inputs.
Chapter 22 established the per-node serving economics: how a single accelerator batches requests, pages a KV cache, and turns floating-point throughput into answered queries. We carry those numbers up to the fleet now, and the first thing the fleet forces us to confront is that "serving the model" is not one task but two. A request that a user is waiting for and a corpus that a nightly job is grinding through place opposite demands on the same hardware. The processing-mode distinction we drew in Section 1.5, between work that must answer as it arrives and work that can be accumulated and swept in bulk, returns here as the central design axis of an inference fleet. Get the regime wrong and you either blow the latency budget or burn money scoring a backlog one item at a time.
1. Two Regimes, One Set of Weights Beginner
Online (real-time) serving handles requests that arrive continuously and unpredictably, each attached to a caller who is waiting for the answer. A search ranker, a chat assistant, a fraud check at the point of sale: the defining constraint is a latency service-level objective (SLO), usually stated as a tail percentile such as "99 percent of requests answered within 200 milliseconds." Because a human or a downstream service is blocked on the result, the fleet cannot wait to accumulate a big batch; it must run whatever has arrived, now. Small batches mean each accelerator is rarely full, so online serving deliberately trades hardware efficiency for responsiveness, and it keeps spare replicas idle so that a sudden burst of traffic does not push the tail latency past the SLO.
Offline (batch) inference is the opposite situation. You hold a fixed, often enormous dataset and you must run the model over every item: embed an entire document corpus for a retrieval index, label a backlog of millions of images, score a held-out set for an evaluation, or precompute recommendations for every user overnight. Nobody is waiting on any individual result; the only things that matter are total throughput and total cost. So the fleet does everything online serving refuses to do. It forms the largest batches the device memory allows, runs every accelerator at full utilization, and, because the job can be paused and resumed, runs on the cheapest preemptible or spot capacity available, the same economics we exploited for elastic training in Section 18.6.
Online and offline inference run identical weights through identical math, yet their optimal system configurations are mirror images. Online optimizes the tail of the latency distribution and pays for elasticity with idle headroom and tiny batches. Offline optimizes throughput and cost by filling every batch and saturating every device on the cheapest interruptible hardware. The knob that separates them, batch size, also sets which resource you are spending: small batches spend money to buy latency, large batches spend latency to buy throughput. A fleet that confuses the two either misses its SLO or scores a corpus at a hundred times the necessary cost.
2. The Cost of a Grouping Decision Intermediate
The two regimes diverge because every forward call carries a fixed overhead that is paid once per call regardless of how many items ride inside it: request handling, kernel launch, host-to-device dispatch, and framework bookkeeping. Let $o$ be that fixed per-call overhead in seconds, let $c$ be the marginal compute time each item adds, and let $b$ be the batch size. The wall-clock time for one forward call is $o + b\,c$, so the time attributed to each item is
$$t_{\text{item}}(b) = \frac{o + b\,c}{b} = \frac{o}{b} + c.$$
At batch size one, every item pays the full overhead $o$; as $b$ grows, the $o/b$ term collapses toward zero and the per-item time approaches the irreducible compute floor $c$. Multiply by the rented price of the hardware and you get the quantity that decides an offline budget, the cost per million items,
$$\text{cost}_{1\text{M}}(b) = 10^{6} \cdot t_{\text{item}}(b) \cdot \frac{p}{3600},$$
where $p$ is the dollars-per-hour price of the instance. Two levers move this number: the batch size $b$ (which shrinks $t_{\text{item}}$ through amortization) and the price $p$ (which the offline job lowers by accepting preemptible capacity). Online serving cannot pull either lever, because a large $b$ means waiting for a batch to fill and a preemptible instance means a request can vanish mid-flight. The code below makes both terms concrete on one model.
The script runs the same small multilayer perceptron in both regimes. Online uses batch size one and reports the per-request latency; offline uses a large batch and reports the amortized per-item time, then prices each on its own hardware tier. The fixed per-call overhead $o$ and the per-item compute $c$ are made explicit constants and the wall-clock of a call is modeled as $o + c\,b$, so the amortization is visible and the printed numbers are exact rather than at the mercy of one machine's clock; the real matmul still runs to show the answer is identical in both regimes.
import numpy as np
rng = np.random.default_rng(0)
D, H, C = 256, 512, 16 # input dim, hidden, classes: a small MLP "model"
W1 = (rng.standard_normal((D, H)) / np.sqrt(D)).astype(np.float32)
W2 = (rng.standard_normal((H, C)) / np.sqrt(H)).astype(np.float32)
# A forward call costs a FIXED per-call overhead O_MS (request handling, kernel
# launch, host/device dispatch, framework bookkeeping) plus a small marginal
# compute C_MS per item. We model the cost analytically so the result is exact
# and reproducible, and still run the real matmul to show the answer is identical.
O_MS, C_MS = 6.0, 0.18 # fixed per-call overhead (o) and per-item compute (c)
def score(batch):                        # the real forward pass (the answer)
    h = np.maximum(batch @ W1, 0.0)      # ReLU hidden layer
    return h @ W2                        # logits
def call_ms(b): return O_MS + C_MS * b # modeled wall-clock of one call: o + c*b
DOLLARS_PER_HOUR_ONLINE = 3.00 # on-demand, reserved for low tail latency
DOLLARS_PER_HOUR_BATCH = 0.90 # preemptible / spot, ~70% cheaper
LATENCY_SLO_MS = 50.0 # online tail-latency budget per request
def measure(batch_size):
    data = rng.standard_normal((batch_size, D)).astype(np.float32)
    score(data)                                  # run it; the logits are the answer
    per_batch_ms = call_ms(batch_size)           # modeled time of the forward call
    return per_batch_ms / batch_size / 1000.0, per_batch_ms   # per-item s, per-batch ms
def cost_per_million(per_item_s, price_per_hr):
    return 1e6 * per_item_s * (price_per_hr / 3600.0)
online_pi, online_ms = measure(batch_size=1) # one request at a time
batch_pi, batch_ms = measure(batch_size=512) # large offline batch
online_iph = 3600.0 / online_pi; batch_iph = 3600.0 / batch_pi
online_cpm = cost_per_million(online_pi, DOLLARS_PER_HOUR_ONLINE)
batch_cpm = cost_per_million(batch_pi, DOLLARS_PER_HOUR_BATCH)
print("Same model (256->512->16 MLP), one accelerator, two serving regimes\n")
hdr = f"{'regime':<9}{'batch':>7}{'per-item ms':>13}{'items/hour':>15}{'$/1M items':>13}{'batch ms':>11}"
print(hdr); print("-" * len(hdr))
for name, bs, pi, iph, cpm, bms in [
    ("online", 1, online_pi, online_iph, online_cpm, online_ms),
    ("offline", 512, batch_pi, batch_iph, batch_cpm, batch_ms),
]:
    print(f"{name:<9}{bs:>7}{pi*1e3:>13.4f}{iph:>15,.0f}{cpm:>13.4f}{bms:>11.2f}")
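# Check the SLO instead of hard-coding the verdict, so the message stays true if constants change.
slo_status = "within budget" if online_ms <= LATENCY_SLO_MS else "over budget"
print(f"\nonline per-request latency : {online_ms:6.2f} ms (SLO {LATENCY_SLO_MS:.0f} ms -> {slo_status})")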
print(f"throughput gain offline : {batch_iph/online_iph:6.1f}x more items/hour")
print(f"cost gain offline : {online_cpm/batch_cpm:6.1f}x cheaper per million items")
The online row uses batch size one and reports its per-request latency against the SLO; the offline row uses a large batch on cheaper preemptible pricing and reports the amortized cost per million items. Only the grouping and the price differ.
Same model (256->512->16 MLP), one accelerator, two serving regimes

regime     batch  per-item ms     items/hour   $/1M items   batch ms
--------------------------------------------------------------------
online         1       6.1800        582,524       5.1500       6.18
offline      512       0.1917     18,777,506       0.0479      98.16

online per-request latency :   6.18 ms (SLO 50 ms -> within budget)
throughput gain offline :   32.2x more items/hour
cost gain offline :  107.4x cheaper per million items
The numbers make the trade concrete. The online regime pays the full $6$ millisecond overhead on every single request, for a per-request time of $6.18$ milliseconds. That is fine because a person is waiting and the budget is generous, but it means the work costs about five dollars per million items. The offline regime spreads that same overhead across $512$ items, which cuts the per-item time by a factor of about $32$, and then takes a further discount by running on spot hardware, landing near five cents per million. Nothing about the model changed; only the decision of how to group the work and where to run it. This is the same exactness-versus-cost story from Section 1.1, seen from the serving side: the answer each item receives is identical in both regimes, but the system cost of producing it is not.
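The two gains can be checked by hand from the constants in the script ($o = 6.0$ ms, $c = 0.18$ ms, prices of $3.00$ and $0.90$ dollars per hour). The arithmetic below is only a restatement of the printed output, not a new result. It separates the part of the cost gap that comes from amortization from the part that comes from the cheaper hardware tier:

$$\begin{aligned}
t_{\text{item}}(1) &= \frac{6.0}{1} + 0.18 = 6.18\ \text{ms}, \\
t_{\text{item}}(512) &= \frac{6.0}{512} + 0.18 \approx 0.0117 + 0.18 = 0.1917\ \text{ms}, \\
\text{throughput gain} &= \frac{t_{\text{item}}(1)}{t_{\text{item}}(512)} = \frac{6.18}{0.1917} \approx 32.23, \\
\text{cost gain} &= \frac{t_{\text{item}}(1)\,p_{\text{online}}}{t_{\text{item}}(512)\,p_{\text{batch}}} \approx 32.23 \times \frac{3.00}{0.90} \approx 107.4.
\end{aligned}$$

So roughly a factor of $32$ of the cost gap comes from batching and the remaining factor of about $3.3$ from the price tier. At $b = 512$ the residual overhead is about $0.012$ ms per item, so the per-item time is already within about 7 percent of the compute floor $c$.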
Who: A platform engineer at a document-management company building a semantic search feature.
Situation: A backlog of 400 million stored documents had to be embedded once to populate a vector index before search could launch.
Problem: The team reused the production online embedding endpoint, sending documents one at a time through the same low-latency service that served live queries.
Dilemma: Keep the simple path and let the one-at-a-time job run for weeks on expensive reserved GPUs, or build a separate offline pipeline that batches aggressively on cheap preemptible capacity but cannot reuse the existing endpoint.
Decision: They built the offline pipeline, because the backlog was a fixed dataset with no latency requirement, exactly the case where batch economics dominate.
How: They sharded the document table, ran a batch-inference job that formed batches of several hundred documents per forward call, and placed the workers on spot instances that checkpointed progress per shard so a preemption only lost the current shard.
Result: The job finished in under two days instead of an estimated five weeks, at a small fraction of the original projected cost, while the online endpoint kept serving live queries undisturbed, mirroring the gap in Output 23.3.1.
Lesson: A one-time scoring job over a fixed corpus is an offline workload, never an online one. Pushing it through a latency-optimized endpoint pays the per-request overhead and the on-demand price on every item, the two costs batch inference exists to avoid.
3. Batch Inference as a Distributed Data Job Intermediate
Once the dataset is large enough that one machine cannot sweep it in reasonable time, offline inference stops being a serving problem and becomes a distributed data-processing problem. The pattern is the map step you already know: partition the input into shards, run the same model over each shard in parallel on many workers, and write the per-item outputs back to storage. The model is the map function, the corpus is the partitioned input, and there is no reduce step unless you are aggregating, which makes batch inference one of the cleanest embarrassingly parallel jobs in the book. The input side is itself a distributed-data concern: the sharded reading, the columnar formats, and the shuffle-free partitioning come straight from the DataFrame machinery of Chapter 7, now feeding a model instead of an aggregation.
Framing batch inference this way buys two things. First, fault tolerance is nearly free: because each shard is independent and outputs are written per shard, a preempted worker forces the re-execution of only its current shard. This is the same re-execution model that made MapReduce robust in Chapter 6, and it is precisely what makes running on cheap interruptible capacity safe. Second, throughput scales by adding workers, with no communication between them, so the job has none of the all-reduce tax that limits distributed training. The bottleneck is rarely the model; it is feeding data to the accelerators fast enough, which is why a batch-inference job is tuned by overlapping data loading with compute and by sizing the batch to keep every device saturated.
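The per-shard re-execution idea can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions, not a production pipeline: `score` is a stand-in for the model, `run_job` and its `preempt_on_shard` test hook are hypothetical names, and the "checkpoint" is simply the existence of a shard's published output file.

```python
import json
import os
import tempfile

import numpy as np


def score(batch):
    """Stand-in for the model: any deterministic function of the batch."""
    return batch.sum(axis=1)


def run_job(shards, out_dir, batch_size=4, preempt_on_shard=None):
    """Map `score` over each shard, skipping shards whose output already exists.

    `preempt_on_shard` is a hypothetical test hook: the worker is "reclaimed"
    while processing that shard, before its output is published.
    Returns True once every shard has been written.
    """
    for shard_id, shard in enumerate(shards):
        out_path = os.path.join(out_dir, f"shard-{shard_id:05d}.json")
        if os.path.exists(out_path):            # checkpoint: shard already done
            continue
        outputs = []
        for start in range(0, len(shard), batch_size):    # manual batching
            outputs.extend(score(shard[start:start + batch_size]).tolist())
        if shard_id == preempt_on_shard:
            return False                         # interruption: this shard's work is lost
        tmp_path = out_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(outputs, f)
        os.replace(tmp_path, out_path)           # publish the shard output atomically
    return True


rng = np.random.default_rng(0)
shards = [rng.standard_normal((10, 8)) for _ in range(5)]

out_dir = tempfile.mkdtemp()
first_ok = run_job(shards, out_dir, preempt_on_shard=2)   # interrupted on shard 2
files_after_preemption = sorted(os.listdir(out_dir))
second_ok = run_job(shards, out_dir)                       # resume: shards 0 and 1 are skipped
files_after_resume = sorted(os.listdir(out_dir))

recovered = []
for name in files_after_resume:
    with open(os.path.join(out_dir, name)) as f:
        recovered.extend(json.load(f))
expected = np.concatenate([score(s) for s in shards])
print(first_ok, len(files_after_preemption), second_ok, len(files_after_resume))
print(np.allclose(recovered, expected))
```

Writing to a temporary file and renaming it means a shard is either fully published or absent, so a resumed run never sees a half-written output. The price of this simplicity is that the interrupted shard is recomputed from its start, which is why shard size bounds the work a preemption can waste.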
The from-scratch view treats batch inference as a hand-written shard loop with manual batching, checkpointing, and worker placement. A modern batch-inference framework collapses that to a streaming map over a dataset, with the sharding, batch formation, accelerator placement, and preemption recovery handled internally. Ray Data expresses the entire job as one map_batches call:
import ray, numpy as np
class Embedder:                                  # the model is loaded once per worker
    def __init__(self):
        # load_model() is a placeholder for your own weight-loading function
        self.model = load_model()                # heavy weights, paid once, not per batch
    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = self.model(batch["text"])   # run on a whole batch
        return batch
ds = ray.data.read_parquet("s3://corpus/")       # sharded, streamed input
out = ds.map_batches(                            # the model IS the map fn
    Embedder,
    batch_size=512,                              # large offline batch
    concurrency=64,                              # 64 parallel GPU workers
    num_gpus=1,                                  # one GPU per worker
)
out.write_parquet("s3://embeddings/")            # per-shard outputs
The whole job is a single map_batches call; the framework streams shards through 64 workers, forms large batches, and re-runs only the shards an interruption lost. Spark offers the same shape through mapInPandas or a pandas UDF.
4. Sharing One Fleet: Priorities and Idle Capacity Advanced
Running separate fleets for online and offline work is simple but wasteful, because the online fleet is provisioned for peak traffic and sits half-idle the rest of the day. The hybrid patterns reclaim that waste by letting both workloads share the same accelerators, with a scheduler deciding who runs when. The first pattern is a priority queue: latency-critical online requests are admitted ahead of best-effort batch work on the same devices, so the batch job soaks up whatever capacity the online traffic is not using and yields it instantly when a live request arrives. The batch work runs as a low-priority background tenant, preemptible by the foreground service, which is exactly the disposition that lets it tolerate interruption.
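The priority-queue pattern can be illustrated with a toy single-device admission queue. This is a sketch under simplifying assumptions, not a real cluster scheduler: the class name `PriorityScheduler`, the two priority constants, and the work-item names are all hypothetical, and the device yields to online work only between forward calls, not in the middle of one.

```python
import heapq
import itertools

ONLINE, BATCH = 0, 1     # lower number dequeues first (assumed convention)


class PriorityScheduler:
    """Toy single-device queue: online work is always admitted before batch work."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()    # tie-breaker keeps FIFO order within a class

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def next_call(self):
        """Return the next unit of work to run, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name


sched = PriorityScheduler()
for i in range(3):
    sched.submit(BATCH, f"batch-chunk-{i}")     # backlog waiting for idle capacity

order = [sched.next_call()]                     # device is idle, so a batch chunk runs
sched.submit(ONLINE, "online-req-A")            # live requests arrive during that call
sched.submit(ONLINE, "online-req-B")
while (name := sched.next_call()) is not None:
    order.append(name)

print(order)
```

The batch chunks that were queued first still run last once live requests appear, which is the "soak up idle capacity, yield on arrival" behavior in miniature. Because the yield happens only at call boundaries, the length of one batch call sets how long an arriving request can be blocked.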
The second pattern is temporal: online traffic follows a daily cycle, so the idle troughs, nights and weekends for a consumer service, are exactly when a large offline backlog can run on the otherwise-empty online fleet at no extra hardware cost. This turns the online fleet's elasticity headroom, the spare replicas it keeps for spikes, into useful offline throughput whenever the spikes are not happening. Both patterns depend on the scheduler being able to preempt and resume the batch job cleanly, which is why the distributed-data framing of Section 3 matters: a job built from independent, checkpointed shards can be paused and moved without losing work. The cluster-scheduling machinery that arbitrates these priorities, gang scheduling and preemption, is the subject of Chapter 3's performance models and is developed fully when we reach cluster infrastructure later in the book.
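To get a feel for how much capacity the temporal pattern can reclaim, the following back-of-the-envelope sketch sums the idle replica-hours under a daily traffic curve. Every number in it is hypothetical and chosen only to have a plausible day/night shape; substitute a measured utilization trace to get a real estimate.

```python
FLEET_REPLICAS = 40      # hypothetical fleet size, provisioned for peak traffic

# Hypothetical count of replicas busy with online traffic in each hour of one day.
busy_by_hour = [6, 5, 4, 4, 5, 8, 14, 22, 30, 34, 36, 38,
                38, 37, 36, 35, 34, 36, 38, 36, 30, 22, 14, 9]

# Replicas a preemptible batch tenant could borrow in each hour.
idle_by_hour = [FLEET_REPLICAS - busy for busy in busy_by_hour]

idle_replica_hours = sum(idle_by_hour)
total_replica_hours = FLEET_REPLICAS * len(busy_by_hour)
reclaimable_share = idle_replica_hours / total_replica_hours

print(f"idle replica-hours per day : {idle_replica_hours} of {total_replica_hours}")
print(f"share available to backfill: {reclaimable_share:.3f}")
```

In this made-up trace about two fifths of the replica-hours already paid for are idle, concentrated in the overnight trough. The estimate assumes the batch tenant can be evicted the moment online demand rises, which is the preemptibility requirement stated above.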
A consumer chat service at 3 a.m. is a stadium with the lights on and three people in the seats. The cheapest "new" GPUs a company can find are often the ones it already rents and leaves warming the data center overnight. Backfilling that idle online fleet with a batch job is less an optimization than a refusal to pay twice for silence.
There is a failure mode worth naming. If the batch tenant is not truly preemptible, or if its batches are so large that a single forward call holds the device for hundreds of milliseconds, then an online request arriving mid-batch waits behind it and the tail latency spikes. The remedy is to cap the batch tenant's maximum batch size on a shared device so that no single call blocks the accelerator longer than the online SLO can absorb. Sharing a fleet is therefore not free; it trades a little batch efficiency for the right to reuse idle capacity, and the cap is the price of admission. Multi-tenant GPU serving, where several models and tenants share devices under isolation guarantees, is the dedicated subject of Chapter 9's online-processing patterns and the next section's autoscaling controls.
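Under the call-time model $o + b\,c$ from Section 2, the cap follows from requiring that one batch call fit inside whatever slice of the SLO the operator is willing to lose to head-of-line blocking. The sketch below reuses the constants of Code 23.3.1; the helper name `max_shared_batch` and the policy of giving the batch tenant half of the SLO are assumptions made for illustration.

```python
def max_shared_batch(o_ms, c_ms, block_budget_ms):
    """Largest batch size b with o + c*b <= block_budget_ms (0 if none fits)."""
    b = int((block_budget_ms - o_ms) / c_ms) if block_budget_ms > o_ms else 0
    while b > 0 and o_ms + c_ms * b > block_budget_ms:   # guard against float rounding
        b -= 1
    return b


O_MS, C_MS = 6.0, 0.18       # per-call overhead and per-item compute from Code 23.3.1
SLO_MS = 50.0                # online tail-latency budget from Code 23.3.1
BLOCK_FRACTION = 0.5         # hypothetical policy: blocking may use half of the SLO

cap = max_shared_batch(O_MS, C_MS, SLO_MS * BLOCK_FRACTION)
per_item_capped = (O_MS + C_MS * cap) / cap       # ms per item at the capped batch size
per_item_full = (O_MS + C_MS * 512) / 512         # ms per item at the dedicated batch size

print(f"batch cap on shared device : {cap}")
print(f"per-item time at cap       : {per_item_capped:.4f} ms")
print(f"per-item time at b=512     : {per_item_full:.4f} ms")
print(f"efficiency penalty         : {per_item_capped / per_item_full:.2f}x")
```

With these constants the cap is far below the 512 used on dedicated hardware, and the per-item time rises by roughly a quarter. That quantifies, for this toy model, the "little batch efficiency" traded for the right to share the device.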
The online-versus-batch split has become a first-class product feature for large language models. Since 2024 the major providers ship asynchronous batch APIs (OpenAI's Batch API, Anthropic's Message Batches, Google's Vertex batch prediction) that accept a bulk submission of requests, process them offline with a turnaround on the order of a day rather than seconds, and charge roughly half the synchronous price, the productized form of the cost gap in Output 23.3.1. On the open-inference side, engines such as vLLM and SGLang expose offline batch entry points that pair continuous batching with prefix and KV-cache reuse, so that a corpus of prompts sharing a long system preamble is scored far below its naive cost, and frameworks like Ray Data integrate vLLM directly as a batch-inference operator over sharded datasets. A parallel research thread targets throughput-optimal offline serving specifically, scheduling thousands of concurrent sequences to keep every accelerator saturated when no latency constraint applies, the regime where the only objective is items per dollar. The shared lesson is that batch inference is no longer an afterthought of an online stack; it is a separately engineered system with its own APIs, schedulers, and price.
We now have both regimes, the equation that separates them, the distributed-data framing that scales the offline one, and the hybrid patterns that let a single fleet serve both. The remaining question is how the fleet decides, moment to moment, how many replicas to keep running as live traffic rises and falls and as a batch queue drains. That control problem, driving replica count from GPU utilization and queue depth, is where Section 23.4 takes us next.
For each task, decide whether it is fundamentally online or offline, and name the one property that decides it: (a) a chat assistant answering a user typing in a browser; (b) precomputing tomorrow's recommended items for all 50 million users overnight; (c) a content-moderation model that must flag a livestream within two seconds; (d) re-embedding an entire 200 million document corpus after upgrading the embedding model. For each offline case, state which cheaper hardware tier from Section 1 you would run it on and why a preemption would not corrupt the result.
Extend Code 23.3.1 to sweep the batch size $b$ over the values 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and plot or print the per-item time $t_{\text{item}}(b)$ and the cost per million items for each. Using the model $t_{\text{item}}(b) = o/b + c$ from Section 2, fit $o$ and $c$ from two of your measurements and predict the per-item time at $b = 1024$; compare the prediction to a measured run. Then identify the batch size beyond which doubling $b$ cuts the per-item cost by less than 5 percent, and explain why an online fleet bound by a 50 millisecond SLO would still refuse to use it.
A shared fleet serves online requests under a 100 millisecond tail-latency SLO and backfills a low-priority batch job on the same GPUs. The batch job's forward call takes $0.2$ milliseconds per item plus a fixed $6$ millisecond overhead. If an online request can arrive at any instant and must wait for the current batch call to finish before it runs, derive the largest batch size the batch tenant may use so that this head-of-line wait never consumes more than 40 percent of the online SLO. State how the answer changes if the device supports preemption at the granularity of a single forward call, and connect your reasoning to the preemptible-shard design of Section 3.