"They asked the load balancer how many replicas the fleet needs. The load balancer asked one node how fast it serves a token. One node said it was busy moving the KV cache around and would get back to everyone shortly."
A GPU Profiling Its Own Memory Bandwidth
This chapter is the book's one explicitly labeled per-node, scale-up prerequisite. It is included only because distributed serving multiplies per-node behavior across the fleet: it is the calibration step before the distributed serving chapters of Part V, not the main event. The thesis of this book leads with scale-out, with distribution, coordination, and the spreading of work across many machines, and it treats single-node efficiency as borrowed, scoped, and clearly marked. This chapter is where that single-node material lives, gathered into one place so the rest of Part V never has to stop and explain what a KV cache is or why a quantized model serves more requests per dollar. The reason a distribution-first book pauses on one machine is arithmetic: a distributed inference system is a multiplier, and what it multiplies is the behavior of a single node. The replica count that Chapter 23 autoscales, the routing decisions a batch-aware load balancer makes, and the per-token cost that shows up on the monthly bill are all functions of the per-node latency, throughput, and memory footprint developed here, multiplied across the fleet. Get the unit wrong and you do not pay for the mistake once; you pay for it on every replica, in every region, for as long as the service runs. So the work of this chapter is to measure and improve one number at a time on one machine: how many tokens per second a single accelerator generates, how many concurrent sequences its KV cache can hold, how much smaller a quantized, pruned, or distilled model is, how much faster a fused attention kernel runs, and how a compiler closes the gap between the operations you wrote and the operations the hardware wants. The final section then turns back toward distribution, taking those measured per-node numbers and doing the fleet arithmetic with them. Everything between the framing and that closing translation is scale-up, on purpose and labeled as such, because you cannot size a fleet until you have costed a single node.
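The multiplier argument can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: `NodeProfile`, `replicas_needed`, `monthly_cost`, the 70% utilization target, and every throughput and price figure are assumptions introduced for this example, not measurements from the chapter.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    """Measured behavior of one serving node (all figures here are hypothetical)."""
    tokens_per_second: float  # sustained generation throughput of one node
    cost_per_hour: float      # hourly price of one node, in dollars


def replicas_needed(peak_demand_tps: float, node: NodeProfile,
                    target_utilization: float = 0.7) -> int:
    """Replicas needed to serve peak demand while each node runs at a
    fraction of its measured throughput, leaving headroom for bursts."""
    usable_tps = node.tokens_per_second * target_utilization
    return math.ceil(peak_demand_tps / usable_tps)


def monthly_cost(replicas: int, node: NodeProfile, hours: float = 730.0) -> float:
    """Fleet cost for one month of continuous operation."""
    return replicas * node.cost_per_hour * hours


# Same hardware price, but the second node has been optimized to serve
# 1.5x the tokens per second (hypothetical numbers).
baseline = NodeProfile(tokens_per_second=2000.0, cost_per_hour=4.0)
improved = NodeProfile(tokens_per_second=3000.0, cost_per_hour=4.0)

for name, node in [("baseline", baseline), ("improved", improved)]:
    n = replicas_needed(100_000.0, node)
    print(f"{name}: {n} replicas, ${monthly_cost(n, node):,.0f} per month")
```

Under these assumed figures the per-node improvement is paid back on every replica: the fleet shrinks by the same factor the node sped up, and so does the bill.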
Chapter Overview
This chapter opens Part V, the part of the book about serving models across a fleet, and it opens it on a single machine. The earlier parts spread one training run across many accelerators; now the book turns from building models to serving them, and serving has a unit cost that distribution multiplies. The nine sections develop that unit cost end to end. They are scale-up throughout, by design and by label, because the distributed serving chapters that follow assume you already know what a node does with a request and only need to reason about how many nodes to run and how to route between them.
The sections fall into three movements. The first names the stakes: Section 22.1 shows why one node's efficiency determines fleet cost, establishing latency, throughput, and memory footprint as the per-node numbers that every later replica inherits. The second movement is the toolkit of model compression and serving optimization, the techniques that move those numbers on a single accelerator: Sections 22.2 through 22.4 shrink the model itself through quantization, pruning and sparsity, and knowledge distillation; Sections 22.5 and 22.6 attack the memory and compute of attention through the KV cache with paged attention and through FlashAttention and efficient attention kernels; and Sections 22.7 and 22.8 raise single-node utilization through continuous batching and speculative decoding and through compilation and kernel optimization. The third movement closes the loop: Section 22.9 takes the measured per-node profile and turns it into fleet sizing, translating tokens per second and memory per sequence into a replica count and a cost, which is exactly the input the distributed serving chapters consume.
Read in order, the nine sections take you from "fleet cost is per-node cost multiplied" to a worked path for moving every per-node number that matters: fit the model in less memory with quantization, prune and distill it smaller, manage the KV cache so a node holds more concurrent sequences, run attention with a kernel that respects the memory hierarchy, keep the accelerator busy with continuous batching and speculative decoding, compile the graph so the hardware runs it well, and finally convert the improved profile into how many replicas the fleet needs and what they will cost. The argument is cumulative and it points outward: every technique here is justified by the fleet that will multiply it, and the chapter ends by handing that fleet arithmetic to the rest of Part V.
Prerequisites
This chapter assumes the performance vocabulary the book has already built. From Chapter 3: Scalability and Performance Models you carry the roofline model and the distinction between compute-bound and memory-bound execution. The central fact of inference, that generating one token at a time is bound by memory bandwidth rather than arithmetic, is a roofline statement, and most of this chapter is a campaign to move work off the memory-bound side of that roofline. From the large-model training chapters, Chapter 16: Model, Pipeline, and Sharded Parallelism and Chapter 19: Training Foundation Models at Scale, you carry a sense of the scale of modern models, the transformer architecture, the attention mechanism, and the practical fact that weights and activations do not fit comfortably in one accelerator's memory, which is precisely why quantization, the KV cache, and paged attention matter. Beyond that, the chapter assumes comfort with Python, familiarity with the transformer and its attention computation, and a working sense of what an accelerator's memory and compute resources are. No prior experience with inference serving or model compression is needed; Section 22.1 builds the per-node cost framing from the ground up before any specific technique appears.
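The roofline statement about token-by-token generation can be checked on the back of an envelope. The sketch below assumes a simplified model in which each generated token requires reading every weight from accelerator memory once, and it ignores KV cache reads, activations, and kernel overheads. The function names, the 7-billion-parameter model, and the 2 TB/s bandwidth figure are hypothetical choices for illustration.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Memory occupied by the model weights at a given bit-width."""
    return num_params * bits_per_weight / 8


def decode_tokens_per_second_bound(num_params: float, bits_per_weight: int,
                                   mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when each token requires
    streaming all weights from memory once (simplifying assumption)."""
    return mem_bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_weight)


PARAMS = 7e9        # hypothetical 7-billion-parameter model
BANDWIDTH = 2e12    # hypothetical 2 TB/s of accelerator memory bandwidth

for bits in (16, 8, 4):
    bound = decode_tokens_per_second_bound(PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit weights: at most {bound:.0f} tokens/s per sequence")
```

In this simplified model the bound depends only on bytes moved, not on arithmetic throughput, which is why halving the bits per weight doubles the ceiling.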
Learning Objectives
- Explain why a single node's inference efficiency, its latency, throughput, and memory footprint, determines the cost of an entire serving fleet, and why distributed serving is a multiplier on per-node behavior.
- Distinguish post-training quantization from quantization-aware training, and reason about how methods such as GPTQ, AWQ, SmoothQuant, and LLM.int8() reduce a model's memory footprint while preserving most of its accuracy.
- Describe how pruning and sparsity remove parameters or computation from a model, and how one-shot post-training methods such as SparseGPT and Wanda apply at the scale of large language models.
- Articulate knowledge distillation as training a small student model to match a large teacher, and identify when a distilled model is the right per-node economy.
- Explain how the KV cache grows with sequence length and concurrency, why it bounds how many requests one node can hold, and how paged attention manages it without fragmentation.
- Trace how FlashAttention and related kernels restructure the attention computation to respect the memory hierarchy, reducing traffic to off-chip memory so that a memory-bound operation runs faster.
- Describe continuous batching and speculative decoding as techniques that raise single-node utilization and throughput without adding hardware.
- Reason about compilation and kernel optimization, from ONNX and TensorRT to torch.compile, as ways to close the gap between the operations you wrote and the operations the hardware runs efficiently.
- Convert a measured per-node profile into fleet sizing, translating tokens per second and memory per sequence into a replica count and an operating cost.
If you keep one thing from this chapter, keep this: a distributed serving fleet is a multiplier on the behavior of one node, so the entire job of this labeled scale-up prerequisite is to measure and improve the per-node unit cost, its latency, throughput, and memory footprint, that every replica in Part V will inherit and that every dollar of the bill will repeat. Read forward, the sections build that unit cost in the order an engineer actually attacks it: first the framing that names per-node cost as fleet cost, then the model-compression techniques that shrink the weights, then the KV-cache and attention-kernel techniques that manage memory and compute, then the batching and compilation techniques that raise utilization, and finally the translation that turns the improved profile into a replica count. Read as a question, the chapter asks of any node about to be replicated across a fleet: how big is the model in memory, how many concurrent sequences can its KV cache hold, how many tokens per second does it generate, how busy is the accelerator kept, and what does one unit of that service cost before you multiply it. The roadmap below walks the nine sections that answer it, and the last one hands the answer to the distributed serving chapters that come next.
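Two of the questions above, how big the model is in memory and how many concurrent sequences the KV cache can hold, reduce to simple arithmetic once the model shape is known. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token, and it ignores activations, workspace buffers, and fragmentation. `ModelShape`, the shape values, and the memory sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Shape parameters that determine KV cache size (hypothetical values below)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    kv_bytes_per_element: int  # 2 for 16-bit keys and values


def kv_bytes_per_token(m: ModelShape) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector
    for every KV head in every layer."""
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.kv_bytes_per_element


def max_concurrent_sequences(hbm_bytes: int, weight_bytes: int,
                             m: ModelShape, seq_len: int) -> int:
    """Sequences of length seq_len that fit in the memory left after weights."""
    free = hbm_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // (kv_bytes_per_token(m) * seq_len)


shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128, kv_bytes_per_element=2)
per_token = kv_bytes_per_token(shape)
slots = max_concurrent_sequences(
    hbm_bytes=80_000_000_000,      # hypothetical accelerator memory
    weight_bytes=14_000_000_000,   # hypothetical 16-bit weights of a 7B model
    m=shape,
    seq_len=4096,
)
print(f"KV cache per token: {per_token} bytes; concurrent 4096-token sequences: {slots}")
```

The same arithmetic shows why shrinking the weights matters twice over: every byte freed from the model becomes KV cache capacity, and therefore concurrency.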
Chapter Roadmap
- 22.1 Why One Node's Efficiency Determines Fleet Cost Establishes per-node latency, throughput, and memory footprint as the unit economics a distributed fleet multiplies, so that an inefficient node is a mistake repeated on every replica and billed many times over.
- 22.2 Quantization Reduces a model's memory footprint by representing weights and activations in fewer bits, from post-training and quantization-aware approaches through GPTQ, AWQ, SmoothQuant, and LLM.int8().
- 22.3 Pruning and Sparsity Removes parameters or whole structures from a model to shrink memory and computation, including one-shot post-training methods such as SparseGPT and Wanda that scale to large language models.
- 22.4 Knowledge Distillation Trains a small student model to match a large teacher, producing a cheaper-to-serve node when a compressed model can carry the same task quality.
- 22.5 KV Cache and Paged Attention Explains how the per-token key-value cache grows with sequence length and concurrency to bound how many requests a node holds, and how paged attention manages it without fragmentation.
- 22.6 FlashAttention and Efficient Attention Restructures the attention computation to respect the memory hierarchy, replacing a memory-bound operation with a faster, IO-aware kernel on a single accelerator.
- 22.7 Continuous Batching and Speculative Decoding Raises single-node utilization and throughput by batching requests at the iteration level and by drafting multiple tokens at once, without adding hardware.
- 22.8 Compilation and Kernel Optimization Closes the gap between the operations you wrote and the operations the hardware runs efficiently, using ONNX, TensorRT, and torch.compile for graph capture, operator fusion, and kernel generation.
- 22.9 From Per-Node Numbers to Fleet Sizing Closes the chapter by converting the measured per-node profile into a replica count and an operating cost, handing the fleet arithmetic to the distributed serving chapters that follow.
Read the nine sections in order and you will hold a worked profile of one serving node and the techniques that improve it: Section 22.1 names per-node cost as fleet cost; Sections 22.2, 22.3, and 22.4 shrink the model with quantization, pruning, and distillation; Sections 22.5 and 22.6 manage the memory and compute of attention through the KV cache and FlashAttention; Sections 22.7 and 22.8 raise utilization with continuous batching, speculative decoding, and compilation; and Section 22.9 turns the improved profile into a replica count and a bill. The thread to watch is the roofline of Chapter 3 reappearing under every technique: token-by-token generation lives on the memory-bound side of the roofline, and quantization, paged attention, FlashAttention, and batching are all moves to do more useful work per byte of memory traffic on a single accelerator.
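"More useful work per byte of memory traffic" is the roofline's arithmetic intensity, and a rough sketch shows how batching and quantization move it. The model below is a simplification: it counts only weight traffic, assuming each weight is read once per decoding step and used in one multiply-add (two floating-point operations) for every sequence in the batch. The function names and the hardware figures are hypothetical.

```python
def decode_intensity(batch_size: int, bits_per_weight: int) -> float:
    """FLOPs per byte of weight traffic in one decoding step: each weight is
    read once and used in one multiply-add (2 FLOPs) per sequence in the batch."""
    bytes_per_weight = bits_per_weight / 8
    return 2.0 * batch_size / bytes_per_weight


def ridge_point(peak_flops_per_s: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity at which the roofline turns from memory-bound
    to compute-bound."""
    return peak_flops_per_s / mem_bandwidth_bytes_per_s


def is_memory_bound(batch_size: int, bits_per_weight: int,
                    peak_flops_per_s: float, mem_bandwidth_bytes_per_s: float) -> bool:
    """True when the step sits on the memory-bound side of the roofline."""
    return decode_intensity(batch_size, bits_per_weight) < ridge_point(
        peak_flops_per_s, mem_bandwidth_bytes_per_s)


PEAK = 1e15       # hypothetical 1 PFLOP/s of peak compute
BANDWIDTH = 2e12  # hypothetical 2 TB/s of memory bandwidth

for batch in (1, 64, 512):
    side = "memory" if is_memory_bound(batch, 16, PEAK, BANDWIDTH) else "compute"
    print(f"batch {batch:>3}, 16-bit weights: "
          f"{decode_intensity(batch, 16):.0f} FLOPs/byte, {side}-bound")
```

Under these assumptions a single sequence performs about one operation per byte moved, far below the ridge point, and only a large batch or narrower weights brings the step toward the compute-bound side.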
What's Next?
This chapter is a single-node prologue, and it exists to be multiplied. Having measured what one machine costs to serve, the book now returns to its real subject and spreads that machine across a fleet. Chapter 23: Distributed Inference Systems opens the distributed half of Part V, and it asks the questions a single node cannot answer: how many replicas of the node you just profiled the fleet needs, how a batch-aware load balancer routes requests across them, how the system autoscales on GPU utilization and queue depth, how it loads large models and survives cold starts, and how it stays available when a replica fails. Every one of those decisions takes the per-node numbers from this chapter, the tokens per second, the memory per sequence, the concurrency a node can hold, as its input, and reasons about coordination on top of them. Read it next, and watch the chapter you just finished become a single term in a much larger product: where Chapter 22 asked what one machine costs to serve, Chapter 23 asks how to run a thousand of them as one service.
Bibliography & Further Reading
Quantization
Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv:2210.17323, 2022. arxiv.org/abs/2210.17323
The one-shot post-training method that quantizes large language models to low bit-width with second-order error correction, a core reference for the post-training quantization of Section 22.2.
Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Han, S. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978, 2023. arxiv.org/abs/2306.00978
The activation-aware scheme that protects the most salient weights during quantization, one of the production-grade methods surveyed in Section 22.2.
Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., Han, S. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." arXiv:2211.10438, ICML 2023. arxiv.org/abs/2211.10438
The method that migrates quantization difficulty from activations to weights so both can be quantized to eight bits, addressing the activation-outlier problem of Section 22.2.
Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." arXiv:2208.07339, NeurIPS 2022. arxiv.org/abs/2208.07339
The mixed-precision decomposition that isolates outlier features so the bulk of a transformer runs in eight-bit integers without accuracy loss, foundational to the quantization discussion of Section 22.2.
Pruning and Sparsity
Frantar, E., Alistarh, D. "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." arXiv:2301.00774, ICML 2023. arxiv.org/abs/2301.00774
The one-shot pruning method that removes half or more of the weights from models with over a hundred billion parameters without retraining, the central reference for Section 22.3.
Sun, M., Liu, Z., Bair, A., Kolter, J. Z. "A Simple and Effective Pruning Approach for Large Language Models (Wanda)." arXiv:2306.11695, ICLR 2024. arxiv.org/abs/2306.11695
The pruning criterion that combines weight magnitude with input activation norm to prune without weight updates, a lightweight counterpart to SparseGPT in Section 22.3.
Knowledge Distillation
Hinton, G., Vinyals, O., Dean, J. "Distilling the Knowledge in a Neural Network." arXiv:1503.02531, NeurIPS 2014 Deep Learning Workshop. arxiv.org/abs/1503.02531
The paper that introduced training a small student to match a large teacher's softened outputs, the foundation of the distillation technique in Section 22.4.
KV Cache and Attention
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., Stoica, I. "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180, SOSP 2023. arxiv.org/abs/2309.06180
The vLLM paper that manages the KV cache in non-contiguous pages to all but eliminate memory fragmentation and raise concurrency, the direct reference for the paged attention of Section 22.5.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., Ré, C. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." arXiv:2205.14135, NeurIPS 2022. arxiv.org/abs/2205.14135
The IO-aware kernel that computes exact attention without materializing the full score matrix, the central method of Section 22.6.
Dao, T. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691, 2023. arxiv.org/abs/2307.08691
The follow-up that improves work partitioning and parallelism to push attention closer to peak hardware throughput, extending the kernel discussion of Section 22.6.
Batching and Decoding
Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., Chun, B.-G. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022. usenix.org/conference/osdi22/presentation/yu
The system that introduced iteration-level (continuous) batching for generative transformers, the throughput technique developed in Section 22.7.
Leviathan, Y., Kalman, M., Matias, Y. "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192, ICML 2023. arxiv.org/abs/2211.17192
The method that drafts multiple tokens with a small model and verifies them in parallel with the large one, the latency technique paired with continuous batching in Section 22.7.