Part V: Distributed Inference and Serving

"Training got to finish overnight. I have to answer before the user finishes reading the question, and there are eleven thousand of them, and they all arrived in the same millisecond."
A Serving Replica With No Time To Spare

Big Picture

Training a model is a project with a deadline; serving it is a system that never sleeps, and the inference axis is where distribution stops being a batch optimization and becomes a permanent, latency-bound, always-on obligation. Part V takes the inference axis named in Chapter 1 and distributes it across a fleet. It opens with the one piece of deliberate scale-up the book permits: making a single node answer efficiently, because a fleet of inefficient nodes is just an expensive way to be slow. Then it multiplies that node outward across the four problems a real serving stack must solve: routing and batching across replicas, sharding a model too large for any one accelerator, searching a billion-vector index, and operating the whole thing reliably as it changes underneath you. By the end you will be able to take a trained model and turn it into a service that meets a latency target, holds under load, and survives the day-to-day churn of a production fleet.

Part Overview

Every part before this one ended when the model was trained. Part V begins exactly there, at the moment the weights are frozen and the real work, answering requests forever, starts. Inference is a different regime from training in every way that matters to a distributed system. Training is throughput-bound and can run overnight; serving is latency-bound and must answer now. Training sees one fixed dataset; serving sees an unbounded, bursty, adversarial stream of requests it cannot reorder at will. Training fails and restarts from a checkpoint; serving fails in front of a user. The inference axis, sketched as one cell of the design space in Chapter 1, becomes in this part a full engineering discipline with its own metrics, its own failure modes, and its own reasons to spread work across machines.

The framing of this part is deliberate and worth stating plainly. Chapter 22 is the one explicitly labeled scale-up chapter in the entire book, the per-node efficiency prerequisite: quantization, KV-cache paging, batching, and attention kernels that make a single accelerator answer as many tokens per second as physics allows. The book leads with scale-out everywhere else, but it concedes one truth here: you cannot distribute your way out of a node that wastes its own silicon, so the per-node economics must be settled before the fleet is built on top of them. Chapters 23 through 26 then multiply that efficient node across the fleet, taking each thing one node does well and asking how a thousand of them do it together: how requests are routed and batched across replicas, how a model too large for one device is split across many during serving, how a retrieval index spanning billions of vectors is sharded and queried under a latency budget, and how the entire serving fleet is deployed, monitored, and updated without going dark.

A recurring thread from earlier parts comes due here. The per-node KV-cache economics that Chapter 22 works out for a single accelerator return, multiplied across the serving fleet, in Chapter 24: the cost of a cached token is a per-node fact, but the cost of a million concurrent conversations is a fleet-scheduling fact, and the two are the same arithmetic at different scales. The model-sharding ideas from Part IV reappear too, now serving a single forward pass under a tail-latency constraint rather than driving a training step toward convergence. Read this part as the point where the book turns from making models exist to making them respond, and where every primitive from the earlier parts is re-evaluated against the unforgiving clock of a live request.
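The "same arithmetic at different scales" can be made concrete with a back-of-the-envelope sketch. The per-token formula is the standard one for a transformer KV cache (one key and one value tensor per layer). Everything else is an assumption chosen for illustration: the model shape, the 2,048-token average conversation, and the 40 GiB of accelerator memory left for cache on each node are hypothetical, and the sketch ignores paging overhead, prefix sharing, and cache quantization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, used only for illustration."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int  # 2 for fp16 / bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Per-node fact: each layer stores one key and one value vector
    # of n_kv_heads * head_dim elements for every cached token.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def nodes_for_cache(m: ModelShape, conversations: int,
                    tokens_per_conversation: int,
                    kv_budget_bytes_per_node: int) -> int:
    # Fleet fact: the same per-token cost, multiplied by every live
    # token in every conversation, divided by what one node can hold.
    total = conversations * tokens_per_conversation * kv_bytes_per_token(m)
    return -(-total // kv_budget_bytes_per_node)  # integer ceiling


shape = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
per_token = kv_bytes_per_token(shape)        # 131072 bytes = 128 KiB per token
fleet = nodes_for_cache(shape,
                        conversations=1_000_000,
                        tokens_per_conversation=2048,
                        kv_budget_bytes_per_node=40 * 2**30)
print(per_token, fleet)                      # 131072 6250
```

Under these assumed numbers, 128 KiB per token looks negligible on one node, yet a million concurrent conversations need thousands of nodes for cache alone. Nothing in the formula changed between the two lines; only the multiplier did.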

The Inference Axis in One Idea

If you keep one idea from this part, keep this: serving is training's economics inverted. You optimize for the latency of one request inside a flood of them, not the throughput of one fixed job. That inversion is why Part V earns its single scale-up chapter (the per-node efficiency of Chapter 22 sets a latency floor that no amount of distribution can get beneath) and why Chapters 23 through 26 are about multiplying an already-efficient node rather than rescuing an inefficient one. The fleet does not fix a slow node; it scales a fast one. Settle the node first, then distribute it, then operate the result: that is the arc of these five chapters, and it is the order in which any real serving stack must be built.
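A toy queueing model shows why the fleet scales a fast node but cannot rescue a slow one. The sketch below is an assumption, not a model the later chapters commit to: it treats each replica as an independent M/M/1 queue fed by random routing, so mean time in system is 1 / (mu - lambda), where mu is one replica's service rate and lambda is its share of the arrivals. The function name and the traffic numbers are hypothetical.

```python
def mean_latency_s(arrival_rps: float, replicas: int,
                   service_time_s: float) -> float:
    """Mean request latency under a simplified M/M/1-per-replica model.

    Assumes Poisson arrivals split evenly at random across replicas and
    exponentially distributed service times. Real serving stacks batch
    and route more cleverly; this only illustrates the shape of the curve.
    """
    mu = 1.0 / service_time_s          # requests/second one replica can serve
    lam = arrival_rps / replicas       # requests/second arriving at one replica
    if lam >= mu:
        return float("inf")            # overloaded: the queue grows without bound
    return 1.0 / (mu - lam)            # mean wait plus service time


# Hypothetical load: 1,000 requests/second against a 50 ms node.
for n in (40, 100, 1_000, 10_000):
    print(n, mean_latency_s(1_000, n, 0.05))

# Same load, but the node itself made twice as fast (25 ms).
print(100, mean_latency_s(1_000, 100, 0.025))
```

Adding replicas drains the queue, so latency falls toward the 50 ms service time but never beneath it. Halving the per-node service time is the only change in this model that moves the floor itself.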

Part Roadmap

22. Per-Node Inference Efficiency: A Prerequisite. The book's one labeled scale-up chapter: quantization, KV-cache paging, continuous batching, and attention kernels that make a single accelerator answer as fast as its silicon allows, the floor every fleet is built on.
23. Distributed Inference Systems. The efficient node multiplied across replicas: request routing, load balancing, dynamic batching, autoscaling, and the tail-latency arithmetic that decides whether a fleet meets its service level.
24. Distributed LLM Serving. Serving models too large for one device and conversations too many for one cache: tensor and pipeline parallel inference, prefill-decode disaggregation, and the per-node KV-cache economics of Chapter 22 multiplied across the fleet.
25. Distributed Retrieval and Vector Search. Searching billions of vectors under a latency budget: sharded approximate-nearest-neighbor indexes, distributed query fan-out and merge, and the retrieval substrate that feeds every RAG and recommendation system.
26. MLOps for Distributed AI. Operating the whole serving fleet over time: deployment and rollout, canarying and rollback, monitoring and drift detection, and the lifecycle machinery that keeps a distributed model service alive as it changes underneath you.

Read the five chapters in order and the inference axis becomes a buildable stack: Chapter 22 settles the single node, Chapter 23 multiplies it into a load-balanced fleet, Chapter 24 handles the models and caches too large to fit in one box, Chapter 25 attaches the retrieval substrate that grounds them, and Chapter 26 keeps the result running in production. The order is the dependency order: each chapter assumes the layer below it already works, which is exactly why the scale-up prerequisite has to come first.

What's Next?

Part V distributed the inference axis and ended with a fleet that answers requests, scales under load, and survives its own deployments. Every system in it, though, still does one thing: it computes an output from an input on command. Part VI: Distributed AI and Multi-Agent Systems distributes the last and most demanding axis, intelligence itself, where the units are no longer replicas serving a shared model but autonomous agents that perceive, decide, negotiate, and act, each with its own goals and only partial knowledge of the others. The routing and coordination machinery of this part returns there in a sharper form: a load balancer assigns work to interchangeable replicas, but a multi-agent system must coordinate parties that may compete, mislead, or fail independently. Read Part VI as the move from serving one mind many times to orchestrating many minds at once.