"They asked me to serve a billion requests. I am one replica. I did the only sensible thing: I asked them to clone me, and then I asked each clone to be a little less wasteful."
A Replica That Knows It Is Not Alone
A deployed model does not run on one machine; it runs on a fleet of many identical replicas, and the cost of that fleet is, to a first approximation, the cost of one node multiplied by the number of nodes. That single multiplication is why a book about scaling out opens its serving part with a chapter about scaling up. This chapter is the labeled per-node prerequisite: everything in it is single-node efficiency, included only because distributed serving (Chapters 23 to 26) multiplies whatever each node does, good or bad, by the replica count. Halve a node's memory and you may fit it on a cheaper accelerator; double a node's throughput and you may need half as many replicas. You cannot size, cost, or reason about the distributed serving systems that follow without first pinning down the unit economics of one node, and that is the entire job of Chapter 22.
Part IV taught how to train a large model by spreading the work across many machines. Training, however, is a one-time (or periodic) capital expense. Serving is the bill that arrives every hour for as long as the model is in production, and for any successful model that recurring bill dwarfs the cost of the training run that produced it. The shift from Part IV to Part V is the shift from "how do we build this model across a cluster?" to "how do we answer a continuous stream of requests from a fleet, cheaply and within a latency budget, forever?" The fleet is the subject of the next four chapters. This one chapter steps back to the single replica, because the fleet is built out of replicas, and a fleet of wasteful replicas is just waste at scale.
We need to be honest about what this chapter is and is not. It is scale-up: quantization, pruning, distillation, KV-cache management, efficient attention, single-node batching, and compilation are all techniques for making one machine do more with less. None of them distribute anything. They earn their place in a scale-out book for exactly one reason, stated in the next section and repeated whenever it matters: the per-node number they improve is the number the fleet multiplies. The distributed serving methods (replication and routing in Chapter 23, disaggregated and tensor-parallel LLM serving in Chapter 24) are the main event of Part V. This chapter is their prerequisite, not their rival.
1. The Fleet Is Replicas, and Cost Is a Product (Beginner)
Start from the deployment picture. A model that must answer many requests per second cannot do so from a single accelerator, because one accelerator delivers only a bounded throughput: the same throughput ceiling we named in Chapter 1 as the third pressure that forces distribution. The remedy for a throughput ceiling is replication: stand up many identical copies of the model behind a load balancer, and route each incoming request to whichever replica is free. That fleet of replicas is precisely what Chapter 23 builds. For this section, the only fact we need is that the fleet is made of copies, and the copies are the same single-node program this chapter optimizes.
Let one replica sustain a throughput of $r$ queries per second and rent for a price of $c$ dollars per node-hour. To serve a target load of $\Lambda$ queries per second, leaving a utilization headroom $u \in (0,1)$ so that tail latency does not explode near saturation, the fleet needs
$$N \;=\; \left\lceil \frac{\Lambda}{u \, r} \right\rceil \quad\text{replicas,} \qquad\text{at an hourly cost of}\qquad C \;=\; N \cdot c.$$
The whole argument of this chapter lives in that pair of formulas. The node count $N$ falls as the per-node throughput $r$ rises; the fleet cost $C$ is a straight product of the node count and the per-node price $c$. Substituting the first into the second makes the dependence explicit:
$$C \;=\; \left\lceil \frac{\Lambda}{u\,r} \right\rceil \cdot c \;\;\approx\;\; \frac{\Lambda}{u} \cdot \frac{c}{r}.$$
Read the right-hand approximation slowly, because it is the thesis of the chapter in one fraction. The target load $\Lambda$ and the headroom $u$ are set by the product's traffic and latency requirements, not by you. What you control as an efficiency engineer is the ratio $c/r$: the cost per node divided by the throughput per node, in other words the cost per query of a single replica. Every technique in Chapter 22 attacks that one ratio. Double the throughput $r$ and you halve the cost. Cut the per-node price $c$ (often by shrinking the model so it fits on a smaller accelerator) and you cut the cost in the same proportion. Because the fleet cost is a product, a per-node factor is a fleet-wide factor.
Fleet cost is approximately $\frac{\Lambda}{u}\cdot\frac{c}{r}$, a target load you do not control multiplied by a per-node cost-per-query you do. Any single-node improvement that halves $c$ or doubles $r$ halves the entire fleet's hourly bill, across every replica at once. This is the only reason a chapter of scale-up tricks belongs in a scale-out book: distribution multiplies the per-node number, so improving the per-node number is the highest-leverage thing you can do before you ever distribute. Get the unit economics of one node right first; then replicate.
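As a quick numerical check on the approximation, the following minimal sketch compares the exact ceiling formula with $\frac{\Lambda}{u}\cdot\frac{c}{r}$. The function names and the sample values (12,000 qps of load, 85 qps per node, 3.10 dollars per node-hour, 70% headroom) are illustrative assumptions, not measurements.

```python
import math

def fleet_cost_exact(load_qps, headroom, node_qps, node_price_hr):
    """Exact hourly cost: C = ceil(Lambda / (u * r)) * c."""
    nodes = math.ceil(load_qps / (headroom * node_qps))
    return nodes, nodes * node_price_hr

def fleet_cost_approx(load_qps, headroom, node_qps, node_price_hr):
    """Approximation without the ceiling: C ~ (Lambda / u) * (c / r)."""
    return (load_qps / headroom) * (node_price_hr / node_qps)

# Illustrative values (assumptions for this sketch only).
load_qps, headroom, node_qps, node_price_hr = 12_000, 0.70, 85.0, 3.10

nodes, exact = fleet_cost_exact(load_qps, headroom, node_qps, node_price_hr)
approx = fleet_cost_approx(load_qps, headroom, node_qps, node_price_hr)
gap = (exact - approx) / exact  # relative cost added by rounding up to whole nodes

print(f"exact : {nodes} nodes, ${exact:.2f}/hr")
print(f"approx: ${approx:.2f}/hr (ceiling adds {100 * gap:.2f}%)")

# Doubling r and halving c each cut the approximate cost in half.
half_by_r = fleet_cost_approx(load_qps, headroom, 2 * node_qps, node_price_hr)
half_by_c = fleet_cost_approx(load_qps, headroom, node_qps, node_price_hr / 2)
print(f"2x r  : ${half_by_r:.2f}/hr   c/2: ${half_by_c:.2f}/hr")
```

Rounding up adds less than one node's worth of cost, so the relative error of the approximation is below $1/N$; for a fleet of hundreds of replicas it is well under one percent.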
2. One Optimized Node, Multiplied (Beginner)
The multiplication is easier to see than to say. Figure 22.1.1 shows one replica, then the same replica copied across a serving fleet behind a load balancer. The point of the figure is that the fleet does not invent new efficiency; it inherits, copy by copy, whatever efficiency the single node has. An improvement applied to the node on the left propagates to every box on the right at no extra design cost, which is what makes per-node work the cheapest lever in the whole serving stack.
This is also why the chapter is ordered the way it is, with the single-node levers first and fleet sizing last. The single-node levers (quantization in Section 22.2, pruning in Section 22.3, distillation in Section 22.4, KV-cache management in Section 22.5, efficient attention in Section 22.6, single-node batching and speculative decoding in Section 22.7, and compilation in Section 22.8) each move the ratio $c/r$ for one replica. Section 22.9 then folds those per-node numbers into the fleet-sizing arithmetic of this section, turning measured single-node throughput and memory into a node count and a monthly bill. The thread runs from one node to the whole fleet and back, which is exactly the order in which a serving system is actually costed.
3. Most of These Tricks Attack Memory Traffic (Intermediate)
There is a unifying reason the per-node levers look the way they do, and it comes from the roofline model of Section 3.7. The roofline says a kernel is bound either by compute (floating-point operations per second) or by memory bandwidth (bytes moved per second), whichever ceiling it hits first, and the dividing line is the arithmetic intensity: the number of operations performed per byte read from memory. Autoregressive decoding, the dominant cost of LLM inference, generates one token at a time and so reuses each loaded weight only across a small batch. Its arithmetic intensity is low, which puts it firmly on the memory-bandwidth side of the roofline. The replica spends most of its time waiting for weights and the key-value cache to stream out of high-bandwidth memory, not waiting for the arithmetic units.
Once you see that single-node inference is usually memory-bound, the catalogue of techniques in this chapter stops looking like a grab bag and starts looking like a single strategy applied many ways: move fewer bytes. Quantization (Section 22.2) shrinks each weight from sixteen bits to eight or four, cutting the bytes streamed per token. Pruning and sparsity (Section 22.3) remove weights so there are fewer bytes to move at all. Distillation (Section 22.4) replaces a large model with a smaller one whose entire weight footprint is smaller. KV-cache management and paged attention (Section 22.5) attack the other big memory consumer, the growing per-request cache, and efficient attention (Section 22.6) reduces the memory traffic of the attention kernel itself. Even batching (Section 22.7) is a memory-traffic argument: serving more requests per weight load raises the arithmetic intensity and pushes the kernel back toward the compute roofline where the hardware is actually fast. Almost every lever in Chapter 22 is, at bottom, a way to feed a memory-bound machine fewer bytes.
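The "move fewer bytes" argument can be made concrete with a small roofline estimate. The sketch below is a simplification under stated assumptions: weights are streamed once per decode step and shared by the whole batch, KV-cache and activation traffic are ignored, and the accelerator figures (`PEAK_FLOPS`, `MEM_BW`) and model size (`PARAMS`) are hypothetical round numbers rather than the specification of any real device.

```python
def decode_step(params, bytes_per_weight, batch, peak_flops, mem_bw):
    """Roofline estimate for one decode step over a batch.

    Assumes one pass over the weights per step, shared by every request
    in the batch; KV-cache and activation traffic are ignored.
    """
    flops = 2 * params * batch                # ~2 operations per parameter per token
    bytes_moved = params * bytes_per_weight   # one pass over the weights
    intensity = flops / bytes_moved           # operations per byte
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bw
    step_time = max(compute_time, memory_time)
    bound = "memory" if memory_time >= compute_time else "compute"
    return intensity, batch / step_time, bound

# Hypothetical accelerator and model (assumed, not measured).
PEAK_FLOPS = 1.0e15   # operations per second
MEM_BW = 3.0e12       # bytes per second
PARAMS = 8e9          # model parameters

for bytes_per_weight, batch in [(2.0, 1), (0.5, 1), (2.0, 64), (0.5, 64)]:
    ai, tok_s, bound = decode_step(PARAMS, bytes_per_weight, batch,
                                   PEAK_FLOPS, MEM_BW)
    print(f"{8 * bytes_per_weight:4.0f}-bit  batch {batch:3d}  "
          f"intensity {ai:6.1f} op/B  ~{tok_s:9.0f} tok/s  {bound}-bound")
```

Under these assumed numbers the ridge point is `PEAK_FLOPS / MEM_BW`, roughly 333 operations per byte, and all four configurations sit to its left. Quantization raises the estimated token rate by shrinking the bytes per step, and batching raises it by amortizing each weight load over more tokens. A real replica also streams the KV cache, which grows with the batch, so measured gains will be smaller than this estimate.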
The book's spine is that distribution multiplies per-node behavior, so the per-node number is worth getting right before you replicate. We met this thread first as the cost of the all-reduce in data-parallel training (Chapter 1); here it returns on the serving side, where the per-node cost-per-query $c/r$ is multiplied by the replica count $N$ to give the fleet bill. The same KV-cache and throughput economics you pin down on one node in this chapter reappear, multiplied across the serving fleet, when Chapter 24 sizes a distributed LLM service. Whenever a serving chapter later quotes a fleet cost, ask which single-node number it multiplied; the answer is almost always the ratio in this section.
4. From One Node to a Monthly Bill (Intermediate)
The arithmetic of Section 1 is worth making concrete, because the magnitudes are what motivate the rest of the chapter. The program below takes a target load and the unit economics of one node, then sizes the fleet three ways: a baseline node, a node whose throughput a single-node trick has doubled, and a node whose memory a single-node trick has halved (so it fits a cheaper accelerator class at a lower per-node price). It is deliberately plain Python that uses only the standard library, because the point is the multiplication, not the modeling. It reports node counts and the hourly and monthly fleet cost for each case.
import math
# Target serving workload and the unit economics of ONE node.
target_qps = 12_000 # queries per second the deployed model must sustain
node_qps = 85.0 # queries per second one replica delivers (baseline)
node_dollar_hr = 3.10 # rented cost of one accelerator-node per hour
util_ceiling = 0.70 # never run a replica past 70% to keep tail latency sane
def fleet(node_qps, node_dollar_hr):
    effective = node_qps * util_ceiling        # usable qps after headroom
    nodes = math.ceil(target_qps / effective)
    cost_hr = nodes * node_dollar_hr           # C = N * c
    return nodes, cost_hr
# Baseline node.
n0, c0 = fleet(node_qps, node_dollar_hr)
# Per-node lever A: a single-node trick (quantization + better batching)
# doubles throughput per replica at the SAME hourly node price.
n_thr, c_thr = fleet(node_qps * 2.0, node_dollar_hr)
# Per-node lever B: a 2x memory reduction lets each node hold the model on a
# cheaper/smaller accelerator class, cutting the per-node price by ~35%.
n_mem, c_mem = fleet(node_qps, node_dollar_hr * 0.65)
print(f"target load : {target_qps:,} qps")
print(f"baseline : {n0:4d} nodes ${c0:8.2f}/hr ${c0*24*30:12,.0f}/month")
print(f"2x thrpt : {n_thr:4d} nodes ${c_thr:8.2f}/hr ${c_thr*24*30:12,.0f}/month "
      f"-> {100*(1-c_thr/c0):4.1f}% cheaper, {n0-n_thr} fewer nodes")
print(f"2x memory : {n_mem:4d} nodes ${c_mem:8.2f}/hr ${c_mem*24*30:12,.0f}/month "
      f"-> {100*(1-c_mem/c0):4.1f}% cheaper, same node count, smaller box")
Code 22.1.1: The fleet helper applies $N = \lceil \Lambda / (u\,r) \rceil$ and $C = N \cdot c$ from Section 1; the three calls show how a single per-node lever (throughput or memory) propagates to the node count and the monthly bill.
Output 22.1.1:
target load : 12,000 qps
baseline : 202 nodes $ 626.20/hr $ 450,864/month
2x thrpt : 101 nodes $ 313.10/hr $ 225,432/month -> 50.0% cheaper, 101 fewer nodes
2x memory : 202 nodes $ 407.03/hr $ 293,062/month -> 35.0% cheaper, same node count, smaller box
The numbers in Output 22.1.1 are the reason this chapter exists. A near-half-million-dollar monthly serving bill is not unusual for a popular model, and a single-node throughput trick that costs an engineer a week to apply removes more than two hundred thousand dollars a month, every month, for as long as the model is served. No amount of clever request routing in Chapter 23 can match the leverage of halving the cost-per-query of the node being routed to, because routing rearranges the load across replicas whereas per-node efficiency changes how many replicas the load needs in the first place. That is the precise sense in which one node's efficiency determines fleet cost.
Code 22.1.1 assumes a per-node throughput $r$; in practice you do not guess it, you measure it on the real model and hardware. Modern inference servers bundle most of this chapter's single-node levers (paged KV cache, continuous batching, quantized weight loading) behind one engine, so you can read off the throughput that goes into the fleet formula:
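# pip install vllm
from vllm import LLM, SamplingParams
# Placeholder workload (an assumption); substitute a sample of real production prompts.
prompts = ["Explain the roofline model in two sentences."] * 256
# quantization="awq" expects AWQ-quantized weights, so `model` should point at an
# AWQ export of this checkpoint rather than the original 16-bit weights.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          quantization="awq",            # per-node lever: 4-bit weights (Section 22.2)
          gpu_memory_utilization=0.90,   # paged KV cache packs more concurrent requests
          max_num_seqs=256)              # continuous batching raises arithmetic intensity
out = llm.generate(prompts, SamplingParams(max_tokens=128))
# the engine's measured tokens/sec, divided by tokens-per-query, IS the r in N = ceil(Lambda / (u*r))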
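To close the loop between a measurement and the fleet formula, the minimal sketch below converts a token rate into the per-node throughput $r$ and then into a replica count. The helper names and the measured figure of 10,880 tokens per second are hypothetical placeholders chosen for illustration, not benchmark results; the average output length of 128 tokens is likewise an assumption.

```python
import math

def node_qps_from_tokens(tokens_per_sec, tokens_per_query):
    """Convert a measured generation rate into queries per second (r)."""
    return tokens_per_sec / tokens_per_query

def replicas_needed(target_qps, node_qps, util_ceiling):
    """N = ceil(Lambda / (u * r))."""
    return math.ceil(target_qps / (util_ceiling * node_qps))

# Hypothetical measurement on one replica (placeholder values).
measured_tokens_per_sec = 10_880.0   # sustained output tokens per second
avg_tokens_per_query = 128           # assumed average output length per query

r = node_qps_from_tokens(measured_tokens_per_sec, avg_tokens_per_query)
n = replicas_needed(target_qps=12_000, node_qps=r, util_ceiling=0.70)
print(f"r = {r:.1f} qps per node -> {n} replicas")
```

The conversion is only as good as the workload behind it: a token rate measured on short prompts will overstate $r$ for long-context traffic, so the measurement is best taken on a representative sample of production requests.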
Who: A site reliability engineer on the inference platform team at a consumer chat company.
Situation: A popular assistant model served roughly 12,000 queries per second from a fleet of about 200 rented accelerator nodes, costing close to half a million dollars a month.
Problem: Finance flagged the serving line item as the single largest infrastructure cost, and traffic was still growing, so the fleet was on track to double within two quarters.
Dilemma: Negotiate a volume discount on more of the same nodes (scale out further, linear in cost), or invest engineering time in making each node cheaper per query (scale up the unit economics), which is slower to ship but multiplies across every replica.
Decision: They went after the per-node number first, because the fleet cost is a product and the per-node factor is the one term they fully controlled.
How: They applied 4-bit weight quantization plus continuous batching on one replica, measured a clean doubling of sustained per-node throughput with no measurable quality regression on their evaluation set, then rolled the identical image to every replica.
Result: The fleet dropped from about 200 nodes to about 100, and the monthly bill fell from roughly $451,000 to roughly $225,000, the 50% reduction Output 22.1.1 predicts, with the routing and load balancer untouched.
Lesson: When the fleet is a product, optimize the factor you control. A single-node throughput change ships once and pays out across every replica, every hour, which is leverage no amount of request rerouting can match.
5. What This Chapter Is, and Where the Distributed Story Resumes (Beginner)
It is worth saying plainly, one more time, what role this chapter plays. Chapter 22 is the only place in the book where single-node efficiency is the subject rather than a supporting detail, and it is here strictly as a prerequisite. The reader who skips it will not be able to follow the fleet-sizing arguments of Chapter 23 or the disaggregated-serving cost models of Chapter 24, because those chapters take the per-node throughput and memory footprint as given inputs and reason about how to replicate, route, and partition around them. You have to know the unit before you can reason about the multiple. The distributed serving story, the actual main event of Part V, resumes the moment this chapter closes; Chapter 22 is the on-ramp, not the destination.
The remaining sections of this chapter work through the per-node levers one at a time, always reporting the change in throughput or memory that the fleet formula of Section 1 will multiply. They are arranged from the most universally applicable (quantization) through the model-structural (pruning, distillation), the LLM-specific memory levers (KV cache, efficient attention), the scheduling levers (batching, speculative decoding), and finally compilation, before Section 22.9 reassembles everything into a fleet-sizing procedure. Keep the picture from Figure 22.1.1 in mind throughout: you are optimizing one node, and the fleet inherits the result.
The per-node levers that the rest of this chapter unpacks are an active research area, because shrinking $c/r$ on one node is the cheapest path to a cheaper fleet. On the throughput side, paged attention and continuous batching from vLLM (Kwon et al., 2023) became the de facto serving baseline, and 2024 to 2025 work on prefill-decode disaggregation (DistServe, Zhong et al., 2024) and chunked prefill (Sarathi-Serve, Agrawal et al., 2024) pushed per-node goodput further. On the memory side, the weight-only quantization line (GPTQ, AWQ) was joined by 4-bit KV-cache quantization and by FP8 serving on current accelerators, each cutting the bytes streamed per token that the roofline of Section 3.7 identifies as the binding constraint. Speculative decoding with self-drafting (Medusa, Cai et al., 2024) raises tokens-per-step on a single node without a separate draft model. The common thread is that every one of these per-node advances is reported as a throughput-per-dollar gain, $r/c$, which is the reciprocal of the cost-per-query $c/r$ that Chapter 24 multiplies across the fleet.
There is a mild irony at the start of a scale-out part. The single highest-leverage cost reduction available to a serving team is usually not a clever distributed algorithm; it is a quiet single-node change, quantization or a better batch scheduler, that ships as a new container image and never touches the topology. The distributed-systems engineers get to keep their interesting routing problems; they just get to solve them for half as many boxes. Scaling out works best when each thing you scale out is already lean.
With the unit economics established (fleet cost is the per-node cost-per-query, multiplied), we can now go after that per-node number in earnest. The first and most broadly useful lever is reducing the precision of the weights, which shrinks both the memory footprint and the bytes streamed per token. That is the subject of the next section, Section 22.2, on quantization.
Using the approximation $C \approx \frac{\Lambda}{u}\cdot\frac{c}{r}$ from Section 1, explain in words why a single-node trick that doubles throughput $r$ and a single-node trick that halves the per-node price $c$ have the same effect on fleet cost, yet differ in their effect on the node count $N$. Then state which of the two you would prefer if your accelerator supply were constrained (you cannot easily rent more nodes), and which you would prefer if your accelerators were plentiful but expensive. Justify each choice from the two formulas alone.
Extend Code 22.1.1 with a fourth scenario that applies both levers at once: a node whose throughput is doubled and whose per-node price is cut by 35%. Report its node count and monthly bill, and compute the combined percentage saving against the baseline. Verify numerically that the two factors compose multiplicatively on cost (not additively), and explain from $C = N \cdot c$ why the savings multiply. Then add a fifth scenario in which the throughput gain is only 1.4x (a more typical quantization result) and discuss how the fleet decision changes when the per-node win is real but modest.
Section 3 argued that single-node autoregressive decoding is usually memory-bound. Suppose a replica must stream $W$ bytes of weights plus a per-request KV cache to generate each token, its accelerator has a memory bandwidth of $B$ bytes per second and a peak compute of $F$ floating-point operations per second, and generating one token costs $2P$ operations for a model of $P$ parameters. Write the arithmetic intensity of single-token decoding at batch size one, and use the roofline of Section 3.7 to argue whether bandwidth or compute binds. Then explain, in terms of the roofline, why raising the batch size (Section 22.7) and quantizing the weights (Section 22.2) act on different terms of the arithmetic intensity (operations performed versus bytes moved), and why both can still raise throughput per dollar.