"They optimized me until I served five times the traffic, then bought a fifth as many of me. I am not sure whether to feel proud or redundant."
A Serving Node That Did the Math on Its Own Replacements
Everything in this chapter exists to set two numbers for one node: how much memory the model needs (does it fit, and on how many GPUs) and how many tokens or requests per second that node can serve within its latency budget. Those two numbers are the multiplicand of distributed serving. A fleet is just many nodes, so its size is the required throughput divided by the per-node throughput, and its cost is that node count times the per-node cost. This means every per-node lever from Sections 22.2 through 22.8 (quantization, pruning, distillation, paged KV cache, FlashAttention, continuous batching, speculative decoding, and compilation) is also a fleet lever: each one that raises per-node throughput or shrinks the per-node footprint divides the whole fleet, and its cost, by the same factor. This closing section ties the per-node toolbox into a single performance model and hands that model to the distributed serving chapters that follow.
The previous sections each pulled on one lever in isolation. Quantization (Section 22.2) cut the bytes per weight; pruning and distillation (Sections 22.3 and 22.4) cut the work per token; the paged KV cache (Section 22.5) and FlashAttention (Section 22.6) cut the memory and the kernel cost of attention; continuous batching and speculative decoding (Section 22.7) kept the accelerator busy; compilation (Section 22.8) fused the graph. We treated them one at a time so the mechanism of each was clear. A deployed system does not experience them one at a time. It experiences their combined effect on a single node, expressed as a footprint and a throughput, and then multiplies that node across a fleet. The job of this section is to make that multiplication explicit, because it is the entire reason a chapter on single-node efficiency belongs in a book about scale-out.
1. The Per-Node Performance Model Beginner
Collapse the whole chapter into a single box with three outputs. Give that box a model and a target latency budget, and it returns the per-node memory footprint (which fixes how many GPUs one node must have), the per-node throughput (requests or tokens per second the node sustains while honoring the latency budget), and the per-node cost (the dollars-per-hour of the hardware that node occupies). The first two are the characterizing numbers; the third follows from the footprint, because the GPU count determines the hardware bill. The optimizations of Sections 22.2 through 22.8 are the knobs that move those three outputs. Nothing in the distributed serving chapters changes what one node can do; they change how many nodes you arrange and how requests flow between them. So the per-node box must be characterized first, and characterizing it is precisely what this chapter has been doing.
Figure 22.9.1 draws the box and the two arrows that leave it. The memory footprint feeds a "does it fit, on how many GPUs" decision; the throughput and latency feed the fleet-size and cost calculation. Read the diagram left to right: per-node levers go in, two characterizing numbers come out, and those numbers drive the fleet arithmetic on the right.
The fleet node-count is the required throughput divided by per-node throughput. So a per-node optimization that doubles throughput halves the node count, and one that shrinks the footprint enough to drop a node from four GPUs to two halves the hardware each node occupies. Because the same factor flows straight through the division, a 5x improvement on one node is a 5x smaller, 5x cheaper fleet. Single-node efficiency is never the main event in this book, but it sets the constant that every multiplication in Part V starts from.
2. From One Node to a Fleet, in One Equation Intermediate
Let the deployment have to sustain a target throughput of $Q$ requests per second while honoring a latency objective. Suppose one optimized node sustains $q$ requests per second within that latency budget, and occupies hardware that costs $c$ dollars per hour. Then the fleet needs
$$N \;=\; \left\lceil \frac{Q}{q} \right\rceil \quad\text{nodes}, \qquad C \;=\; N \cdot c \quad\text{dollars per hour}.$$
The ceiling reflects that you cannot run a fractional node. The structure is deliberately simple, because the simplicity is the point: per-node throughput $q$ sits in the denominator, so any lever that raises $q$ lowers $N$ proportionally, and any lever that lets the same node serve the load on fewer GPUs lowers $c$. The latency budget is not decoration; it sets how large a batch a node may run, which in turn caps $q$. A node that ignores latency could batch enormously and report a huge $q$, but it would blow the service-level objective, so the honest $q$ is the throughput achievable at the largest batch that still meets the budget. This is the same throughput-versus-latency tension that Section 22.7 navigated with continuous batching, now read as a constraint on the fleet denominator.
The cost models of Chapter 5, Section 5.5 turn $C$ into the cost-per-query and cost-per-token figures that a serving team actually budgets against, and the performance models of Chapter 3, Section 3.9 warn that $N$ is a floor: real fleets add nodes for redundancy, headroom, and the load-balancing imperfections that this clean division ignores.
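The "honest $q$" rule can be sketched in a few lines. The batch sizes, throughputs, and P99 latencies below are hypothetical placeholders rather than measurements from any system, and `honest_q` and `fleet_nodes` are names introduced only for this illustration; substitute a curve measured on your own node.

```python
import math

# Hypothetical (batch size, requests/sec, P99 latency in ms) points for one node.
# Illustrative placeholders only, not measurements.
CURVE = [
    (1, 4.0, 120.0),
    (8, 26.0, 180.0),
    (16, 44.0, 260.0),
    (32, 68.0, 420.0),
    (64, 90.0, 800.0),
]

def honest_q(curve, p99_budget_ms):
    """Return (batch, q) for the largest batch whose P99 stays within budget."""
    feasible = [(b, q) for b, q, p99 in curve if p99 <= p99_budget_ms]
    if not feasible:
        raise ValueError("no batch size meets the latency budget")
    return max(feasible)  # tuples compare by batch size first

def fleet_nodes(target_qps, q):
    """N = ceil(Q / q)."""
    return math.ceil(target_qps / q)

for budget in (300.0, 500.0, 1000.0):
    batch, q = honest_q(CURVE, budget)
    print(f"P99 budget {budget:.0f} ms -> batch {batch}, q = {q:.0f} QPS, "
          f"N = {fleet_nodes(4000.0, q)} nodes")
```

With these made-up points, tightening the budget from 1000 ms to 300 ms roughly doubles the node count (45 to 91) with no change to the hardware or the software stack, which is the sense in which the latency objective sits inside the denominator.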
This book leads with scale-out, and Section 1.1 framed single-node efficiency as the per-node baseline that distribution multiplies. Here that promise comes due. The chapter's entire toolbox resolves into one $q$ and one $c$, and the distributed serving system of Chapters 23 through 26 is the machinery that multiplies them: replication for throughput, sharding for footprint, routing and load balancing to keep every node near its $q$. When you meet a serving architecture later, factor it back to this equation and ask which term it is moving. The per-node levers move $q$ and $c$; the distributed levers move how cleanly $N$ nodes approach $N \cdot q$.
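One hedged way to write down "how cleanly $N$ nodes approach $N \cdot q$" is with a fleet efficiency $\eta \in (0, 1]$: the fraction of its $q$ that an average node actually delivers under real routing and bursty traffic. The symbol $\eta$ and the numbers in the example are introduced here purely for illustration and are not used elsewhere in the chapter.

$$N_{\eta} \;=\; \left\lceil \frac{Q}{\eta \, q} \right\rceil, \qquad \text{for example } Q = 4000,\; q = 100,\; \eta = 0.8 \;\Rightarrow\; N_{\eta} = \left\lceil \frac{4000}{80} \right\rceil = 50, \;\text{ against }\; N = 40 \text{ at } \eta = 1.$$

Read this way, the per-node levers raise $q$ and the distributed levers push $\eta$ toward 1.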
3. A Worked Fleet-Sizing Example Intermediate
Make the equation concrete. A team must serve a 70-billion-parameter model at a target of 4000 requests per second under a fixed latency objective, on 8-GPU nodes that cost twelve dollars per hour. In fp16 the weights alone are about 140 GB, which does not fit on one 80 GB GPU, so the unoptimized baseline needs four GPUs just to hold the model plus its working KV cache, and a static batch limits it to roughly 18 requests per second per node. The calculator below starts from that baseline and applies the chapter's optimizations as a cumulative stack, recomputing the per-node footprint and throughput at each step, then dividing the target by the new per-node throughput to get the node count and cost. Each optimization is modeled as a multiplicative effect on memory and on throughput; the numbers are representative of the published ranges, not guarantees, and the exercises invite you to substitute your own measured factors.
import math

# Target service-level objective for the deployment.
TARGET_QPS = 4000.0        # requests per second the fleet must sustain
COST_PER_NODE_HR = 12.0    # $/hour for one 8-GPU node (illustrative price)

# Baseline per-node numbers BEFORE any optimization, for a 70B model in fp16.
# Weights (~140 GB) plus a fat KV cache do not fit on one 80 GB GPU, so the
# baseline needs 4 GPUs to hold the model, and a static batch caps throughput.
# (The 4 is set by hand; gpus_needed(1.0) below returns 3 on capacity alone.)
baseline = dict(gpus_per_node=4, throughput_qps=18.0, label="fp16, static batch, eager")

# Each optimization multiplies (memory footprint, throughput).
# mem_factor < 1 shrinks the footprint (fewer GPUs); tput_factor > 1 raises QPS.
optimizations = [
    ("INT8 quantization (22.2)",        0.55, 1.35),  # half the weight bytes, faster matmuls
    ("Paged KV cache (22.5)",           0.65, 1.40),  # reclaim fragmented KV memory -> bigger batch
    ("FlashAttention (22.6)",           1.00, 1.25),  # same memory, fused attention kernel
    ("Continuous batching (22.7)",      1.00, 1.80),  # keep the batch full as requests finish
    ("torch.compile / TensorRT (22.8)", 1.00, 1.20),  # fused graph, less launch overhead
]

GPU_MEM_GB = 80.0       # one accelerator's HBM
MODEL_FP16_GB = 140.0   # 70B params x 2 bytes
KV_OVERHEAD_GB = 40.0   # working KV-cache headroom at the baseline batch

def gpus_needed(mem_factor):
    """GPUs required to hold the scaled (weights + KV) footprint."""
    footprint = (MODEL_FP16_GB + KV_OVERHEAD_GB) * mem_factor
    return max(1, math.ceil(footprint / GPU_MEM_GB))

def fleet(throughput_qps):
    """Node count and hourly cost; cost uses a flat per-node price."""
    nodes = math.ceil(TARGET_QPS / throughput_qps)   # N = ceil(Q / q)
    return nodes, nodes * COST_PER_NODE_HR           # C = N * c

b_nodes, b_cost = fleet(baseline["throughput_qps"])
print(f"Target: {TARGET_QPS:.0f} QPS | node price: ${COST_PER_NODE_HR:.0f}/hr\n")
print(f"{'Stack state':<34}{'GPUs/node':>10}{'QPS/node':>10}{'nodes':>8}{'$/hr':>9}{'fleet x':>9}")
print("-" * 80)
print(f"{baseline['label']:<34}{baseline['gpus_per_node']:>10}"
      f"{baseline['throughput_qps']:>10.1f}{b_nodes:>8}{b_cost:>9.0f}{1.0:>9.2f}")

mem_factor, tput = 1.0, baseline["throughput_qps"]
for name, mf, tf in optimizations:                   # apply the stack cumulatively
    mem_factor *= mf
    tput *= tf
    gpus = gpus_needed(mem_factor)
    nodes, cost = fleet(tput)
    print(f"+ {name:<32}{gpus:>10}{tput:>10.1f}{nodes:>8}{cost:>9.0f}{cost / b_cost:>9.2f}")
print("-" * 80)

f_nodes, f_cost = fleet(tput)
print(f"\nPer-node throughput rose {tput / baseline['throughput_qps']:.1f}x "
      f"and memory shrank to {mem_factor:.2f}x of baseline.")
print(f"Fleet: {b_nodes} -> {f_nodes} nodes ({b_nodes / f_nodes:.1f}x fewer); "
      f"cost: ${b_cost:.0f}/hr -> ${f_cost:.0f}/hr ({b_cost / f_cost:.1f}x cheaper).")
Target: 4000 QPS | node price: $12/hr

Stack state                        GPUs/node  QPS/node   nodes     $/hr  fleet x
--------------------------------------------------------------------------------
fp16, static batch, eager                  4      18.0     223     2676     1.00
+ INT8 quantization (22.2)                 2      24.3     165     1980     0.74
+ Paged KV cache (22.5)                    1      34.0     118     1416     0.53
+ FlashAttention (22.6)                    1      42.5      95     1140     0.43
+ Continuous batching (22.7)               1      76.5      53      636     0.24
+ torch.compile / TensorRT (22.8)          1      91.9      44      528     0.20
--------------------------------------------------------------------------------

Per-node throughput rose 5.1x and memory shrank to 0.36x of baseline.
Fleet: 223 -> 44 nodes (5.1x fewer); cost: $2676/hr -> $528/hr (5.1x cheaper).
The shape of the result is the lesson. Throughput levers (continuous batching, compilation, FlashAttention) divide the node count directly through the denominator $q$, and in this stack quantization and the paged KV cache raise $q$ as well, because the memory they free admits a larger batch. Footprint levers also do something subtler but just as valuable: by letting the model fit on one GPU instead of four, they cut the hardware each replica occupies, and with it the per-node cost $c$. Code 22.9.1 holds $c$ fixed at twelve dollars per node, so the 5.1x in Output 22.9.1 is the throughput ratio alone. The drop from four GPUs to one is a further saving that the dollars-per-hour column does not show, and you collect it only by provisioning smaller nodes or packing more replicas onto each one. A reader who has internalized only this paragraph has the chapter's thesis: optimize the node, and the fleet shrinks with it.
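One way to see how much the footprint lever is worth is to re-price the first and last rows of Output 22.9.1 per GPU rather than per node. The sketch below assumes, purely for illustration, that hardware is billed at \$1.50 per GPU-hour (the twelve-dollar node price divided by its eight GPUs) and that each model replica pays only for the GPUs it occupies. `PRICE_PER_GPU_HR` and `fleet_cost` are names introduced here and are not part of Code 22.9.1.

```python
import math

TARGET_QPS = 4000.0
PRICE_PER_GPU_HR = 12.0 / 8   # assumption: the $12/hr 8-GPU node, billed per GPU

def fleet_cost(q, gpus_per_replica):
    """Replica count and $/hr when a replica pays only for the GPUs it occupies."""
    replicas = math.ceil(TARGET_QPS / q)        # N = ceil(Q / q)
    c = gpus_per_replica * PRICE_PER_GPU_HR     # per-replica cost c
    return replicas, replicas * c               # C = N * c

# First and last rows of Output 22.9.1: (QPS per replica, GPUs per replica).
base_n, base_cost = fleet_cost(18.0, 4)
final_q = 18.0 * 1.35 * 1.40 * 1.25 * 1.80 * 1.20   # cumulative throughput factors
final_n, final_cost = fleet_cost(final_q, 1)

print(f"baseline: {base_n} replicas x 4 GPUs = ${base_cost:.0f}/hr")
print(f"final:    {final_n} replicas x 1 GPU  = ${final_cost:.0f}/hr")
print(f"replica count: {base_n / final_n:.1f}x fewer; "
      f"cost with footprint priced in: {base_cost / final_cost:.1f}x cheaper")
```

Under this per-GPU pricing assumption the replica count still falls 5.1x, but the bill falls about 20x, because the fourfold drop in GPUs per replica multiplies the throughput gain. Under flat per-node pricing that extra factor appears only if the freed GPUs are actually put to work.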
Who: An inference platform engineer at a consumer chat startup about to reserve a year of GPU capacity.
Situation: Product forecast 4000 requests per second at launch with a 70B model and a strict first-token latency objective.
Problem: A naive fp16, static-batch deployment penciled out to roughly 223 eight-GPU nodes, a capacity reservation the company could not afford.
Dilemma: Sign the large reservation to be safe and bleed cash, or spend two weeks optimizing the node first and risk missing the launch date with an unproven serving stack.
Decision: They optimized the node before sizing the fleet, building the per-node stack of INT8 quantization, paged KV cache, FlashAttention, continuous batching, and compilation, then ran the calculator of Code 22.9.1 on measured numbers.
How: Each lever was benchmarked in isolation to get a real throughput and footprint factor, the factors were fed into the model, and the resulting per-node $q$ and GPU count were validated under a load test at the latency objective.
Result: The fleet estimate fell to about 44 nodes, the GPUs-per-node dropped from four to one, and the reservation cost about a fifth of the original, comfortably within budget, exactly the 5.1x of Output 22.9.1.
Lesson: Size the fleet after optimizing the node, never before. The per-node number you reserve against is the most expensive constant in the whole deployment, and this chapter is how you shrink it.
4. The Handoff to Distributed Serving Beginner
The clean division $N = \lceil Q / q \rceil$ is a starting point, not the finished system, and naming what it omits is the bridge into the rest of Part V. It assumes every node runs at its full $q$, but real traffic is bursty and uneven, so a load balancer and a request router are needed to keep nodes near their capacity; that is the subject of Chapter 23. It assumes the model fits on one node, but the largest models must be split across nodes with tensor and pipeline parallelism, and their KV caches managed across machines, which Chapter 24 develops by multiplying the very paged-KV economics of Section 22.5 across the fleet. It ignores the retrieval and vector-search tier that production systems bolt on, covered in Chapter 25, and the autoscaling, deployment, and monitoring that keep $N$ matched to a load that changes by the hour, the province of Chapter 26. In every one of those chapters the per-node box you characterized here is the unit being multiplied, replicated, sharded, or routed to. We can build the distributed serving system now precisely because we can describe one node with two numbers.
Code 22.9.1 asked you to supply the per-node throughput by hand. In practice you measure it with a benchmark harness rather than guess. The vLLM serving engine, which already bundles paged attention and continuous batching, ships a serving benchmark that drives a running node at a chosen request rate and reports the achieved requests per second, tokens per second, and latency percentiles. Sweep the request rate upward until the latency objective is just met, and the throughput at that point is the $q$ you plug in:
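# Terminal 1: start one serving node (this command stays in the foreground).
vllm serve meta-llama/Llama-3.1-70B-Instruct --quantization fp8 \
    --tensor-parallel-size 1 --max-num-seqs 256
# Terminal 2: once the server is up, drive it at a fixed request rate.
# Recent vLLM releases expose the benchmark as `vllm bench serve`; older ones
# ship the same script as benchmarks/benchmark_serving.py in the vLLM repository.
vllm bench serve \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --request-rate 80 --num-prompts 2000  # prints request throughput and latency percentiles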
The clean per-node $q$ is being refined faster than this chapter can fix it. Prefill-decode disaggregation, introduced at scale by DistServe (Zhong et al., 2024) and Splitwise (Patel et al., 2024), runs the compute-bound prefill phase and the memory-bound decode phase on separate node pools, so a fleet now has two per-node numbers to multiply rather than one, and sizing means balancing the two pools. The notion of goodput, the throughput that actually meets per-request latency objectives rather than raw tokens per second, is replacing naive QPS as the denominator that matters, a framing sharpened in the DistServe line of work. Prefix and KV-cache sharing across requests and nodes (the lineage of vLLM's automatic prefix caching and Mooncake's KV-centric architecture, 2024) further breaks the assumption that each request costs the same $1/q$, since cached prefixes make some requests nearly free. The throughline for fleet sizing is that the single denominator of Section 2 is splitting into a small vector of phase-specific, SLO-aware capacities, and Chapter 24 sizes a fleet against exactly that.
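A minimal sketch of what "two per-node numbers" means for sizing, under loudly stated assumptions: every request has the same prompt and generation length, each pool's capacity is a single tokens-per-second figure already measured at the latency objective, and moving the KV cache between pools costs nothing. The function name, capacities, and token counts are all hypothetical. This is not the method of DistServe or Splitwise, only the arithmetic their setting implies.

```python
import math

def size_pools(qps, prompt_tokens, gen_tokens,
               prefill_tok_per_s, decode_tok_per_s):
    """Size the prefill and decode pools independently from token demand."""
    prefill_demand = qps * prompt_tokens   # prompt tokens/sec to process
    decode_demand = qps * gen_tokens       # generated tokens/sec to emit
    n_prefill = math.ceil(prefill_demand / prefill_tok_per_s)
    n_decode = math.ceil(decode_demand / decode_tok_per_s)
    return n_prefill, n_decode

# Hypothetical per-node capacities (tokens/sec within the latency objective).
PREFILL_CAP = 200_000.0
DECODE_CAP = 20_000.0

for prompt_len in (500, 2000, 8000):
    p, d = size_pools(4000.0, prompt_len, 250, PREFILL_CAP, DECODE_CAP)
    print(f"prompt {prompt_len:>5} tok -> {p:>4} prefill + {d:>3} decode nodes")
```

With these placeholder capacities the decode pool stays at 50 nodes while the prefill pool grows from 10 to 160 as prompts lengthen from 500 to 8000 tokens, so the balance between the pools is set by the workload mix and not by either per-node number alone.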
There is a satisfying irony in a scale-out book ending its scale-up chapter by deleting machines. Every optimization in Output 22.9.1 made a node so much better at its job that the fleet needed 179 fewer of them. The cheapest distributed system is often the one whose nodes were optimized hard enough that you needed far fewer to build it.
Chapter 22 assembled the single-node inference toolbox: quantize the weights and activations to cut bytes and speed matmuls (22.2), prune and distill to cut the work per token (22.3, 22.4), use a paged KV cache to stop attention memory from fragmenting (22.5), use FlashAttention to fuse the attention kernel (22.6), use continuous batching and speculative decoding to keep the accelerator full (22.7), and compile the graph with torch.compile, ONNX, or TensorRT to remove launch overhead (22.8). Every one of these lowers the per-node memory footprint or raises per-node throughput, which is to say it lowers the per-node cost $c$ or raises the per-node throughput $q$. Because the fleet is $N = \lceil Q / q \rceil$ nodes at cost $N \cdot c$, each lever divides the whole fleet and its bill by the same factor: in the worked example, a 5.1x better node became a 5.1x smaller, 5.1x cheaper fleet. That is the one sentence to carry into Part V: optimize the node, and the fleet shrinks with it. Now that one node is fully characterized, Chapters 23 through 26 build the distributed serving system that multiplies it.
Using only the fleet equation $N = \lceil Q / q \rceil$ and $C = N \cdot c$, explain in words why a footprint optimization that drops a node from four GPUs to one can save more money than a throughput optimization that doubles $q$, even though only the throughput optimization changes $N$. Then describe a deployment in which the opposite is true, where doubling $q$ saves more than halving the per-node GPU count. State which term each optimization moves.
Modify Code 22.9.1 so the optimizations are applied in a different order, and confirm that the final fleet size is unchanged because the per-node factors are multiplicative. Then break that invariance: make one optimization's throughput factor depend on whether quantization was already applied (for example, continuous batching helps more once the model fits on one GPU), and show that order now matters. Report the best and worst orderings and the spread in final fleet cost.
The calculator treats $q$ as a single number, but in reality larger batches raise throughput while worsening tail latency. Suppose per-node throughput grows with batch size $b$ as $q(b) = q_0 \cdot b / (1 + 0.05 b)$ but the P99 latency grows linearly in $b$, and the objective caps P99 at a value that permits at most $b = 24$. Compute the honest $q$ at that batch cap, redo the fleet sizing, and discuss how a tighter latency objective (a smaller permitted $b$) raises the node count even with every optimization in place. Connect your answer to the throughput-latency tension of Section 22.7 and the cost models of Section 5.5.
1. A per-node throughput model and SLO-driven fleet sizer. Extend Code 22.9.1 into a small tool that takes a model size, a GPU memory, a measured throughput-versus-batch curve, and a latency objective, then outputs the largest batch that meets the objective, the resulting per-node $q$ and GPU count, and the fleet node-count and cost for a target QPS. Validate the per-node numbers against a real benchmark (vLLM's harness from Code 22.9.2) and report how far the model's prediction is from measurement.
2. The optimization-stack ablation, on real hardware. Take one open model and measure the per-node throughput and footprint after each lever of this chapter is added (quantization, paged KV, FlashAttention, continuous batching, compilation). Feed the measured factors into the fleet sizer and produce the real version of Output 22.9.1 for your hardware, including which levers compounded and which interfered. Quantify the gap between the multiplicative estimate and the measured combined effect.
3. Disaggregated fleet sizer. Following the research frontier, split the model into a per-node prefill capacity and a per-node decode capacity, and size two node pools against a workload with a given prompt-to-generation ratio. Show how the optimal split between prefill and decode nodes shifts as prompts get longer, and compare the total fleet cost against the single-pool sizing of Code 22.9.1.