"A user asked one question. I read three documents, called two tools, and quietly spawned four smaller copies of myself to chase the loose ends. They are still working. I have already answered, and I will never fully know what they found."
An Agent, Spawning Helpers It Will Never Fully Supervise
An LLM-powered agentic application is the place where every axis of this book runs at once and where the model itself becomes the most expensive moving part. A single user request no longer maps to a single model call; it unfolds into a multi-step plan that retrieves passages from a corpus too large for one machine, invokes tools, and spawns sub-agents whose calls fan out and multiply, all while the model that powers each step is a multi-accelerator service answering thousands of concurrent streams under a latency budget a person will actually wait through. This chapter is the capstone case study: it assembles retrieval (Chapter 36 and Chapter 25), distributed LLM serving (Chapter 24), and agent orchestration (Chapter 32) into one product. This opening section fixes the requirements, proves with arithmetic that the token volume and fan-out demand a serving fleet, decomposes the system into the stages the rest of the chapter builds, and maps each stage onto the six axes of distribution from Chapter 1.
An agent, in the sense this chapter uses the word, is a program that wraps a language model in a loop: it reads a goal, decides on an action, takes that action through a tool or a retrieval call, observes the result, and repeats until it judges the goal met. Where a retrieval-augmented question answers itself in one model call conditioned on retrieved text, an agent answers a task in many calls, each conditioned on the accumulated history of what it has already done. The version that fits in a notebook, a single loop calling a hosted model API over a handful of documents, hides every problem this chapter exists to confront. The interesting version is a production assistant or an autonomous task agent that serves a real user population, reasons over a large enterprise or web corpus, and must do so within a cost envelope and a latency budget that an operator would actually sign. The purpose of this case study is to show how the machinery built across Parts II through VII assembles into that product, and where the seams between agent, retriever, and serving fleet actually fall.
We treat the system as a concrete target. The running specification is a production agentic application serving on the order of two hundred thousand daily active users, each running several multi-step task sessions per day, where every session reasons across roughly a dozen steps, retrieves from a large document corpus at most steps, and occasionally fans out into sub-agents that pursue subtasks in parallel. The model behind each step is a large language model that cannot be served from one accelerator. Four pressures, the multi-step token volume, the sub-agent fan-out, the concurrency under a per-step latency budget, and the cost per task, are the requirements; the rest of the chapter is the design that satisfies them.
1. The Requirements: An Agent Is Not One Model Call Beginner
A problem definition earns its keep only when it commits to numbers, because the numbers decide which ceilings bind and therefore which axes of distribution we must reach for. We fix five requirements and carry them through the entire chapter. The first is corpus scale: the agent retrieves over a corpus on the order of tens of millions of documents, large enough that the index must be sharded as in Chapter 25, but the corpus is no longer the headline cost. The second is concurrent users: the system serves roughly two hundred thousand daily active users running several task sessions each, which sets the request arrival rate. The third is the multi-step latency budget: a task is not one call but a chain of about a dozen reasoning and tool-use steps, and the user waits for the whole chain, so each step must finish within a fraction of a second to keep the end-to-end response inside the time a person tolerates. The fourth is token cost: every step consumes prompt tokens (the growing context plus retrieved passages) and emits output tokens, and the bill is the product of steps, tokens per step, and price, summed over every call including those of spawned sub-agents. The fifth is reliability: a single failed step must not abort the whole task, because a twelve-step plan that fails one step in twenty still fails nearly half of all tasks unless steps can be retried and isolated.
These requirements interact, and the interaction is the whole subject. Multi-step reasoning multiplies token cost, because each step re-reads an ever-larger history. Sub-agent fan-out multiplies it again, because one branching step becomes several calls. Concurrency fights latency, because every in-flight step competes for the same serving fleet. Reliability fights latency, because retries add steps. A single machine fails every requirement at once and for independent reasons: it cannot hold the sharded index, cannot serve the model, cannot sustain the concurrent step volume, and cannot survive its own failures. The next section makes that failure quantitative.
A one-shot question costs one model call. An agent costs the number of steps in its plan times the number of sub-agents each step spawns, and each of those calls carries a prompt that grows as the agent accumulates history. The two multipliers, depth (steps per task) and width (sub-agents per branching step), turn a modest per-call cost into a token bill and a concurrency load that no single server absorbs. Reading a specification means locating these two multipliers first, because they convert a tolerable single-call system into one that is forced onto a fleet.
2. Why a Single Machine Is Not Even Close Beginner
The argument for distribution falls out of arithmetic on the requirements, not from rhetoric. Start with the work in one task. If a task runs $s$ reasoning steps, a fraction $f$ of which branch into $g$ sub-agents, the effective number of language-model calls per task is
$$c_{\text{task}} = s \,\bigl(1 + f\,g\bigr),$$because every step is one call and each branching step adds $g$ more. With $s = 12$ steps, a branch fraction $f = 0.3$, and a fan-out $g = 3$, a single task already costs roughly twenty-three model calls rather than one. Multiply by the task volume and the per-call tokens to get the daily token bill. If $u$ users each run $t$ tasks per day, the calls and tokens per day are
$$\text{calls/day} = u\,t\,c_{\text{task}}, \qquad \text{tokens/day} = \text{calls/day}\,\bigl(p_{\text{in}} + p_{\text{out}}\bigr),$$where $p_{\text{in}}$ and $p_{\text{out}}$ are input and output tokens per call. The dollar cost is the same product weighted by per-token prices, the multi-step agent cost
$$\text{cost/day} = \text{calls/day}\,\bigl(p_{\text{in}}\,\pi_{\text{in}} + p_{\text{out}}\,\pi_{\text{out}}\bigr),$$with $\pi_{\text{in}}$ and $\pi_{\text{out}}$ the input and output token prices. Finally the serving load. The average rate of model calls is the calls per day spread over a day, and at peak it is several times higher; Little's law then fixes how many calls are in flight at once,
$$\lambda = \frac{\text{calls/day}}{86400}, \qquad C = \lambda_{\text{peak}}\, L,$$where $L$ is the wall-clock latency of one step and $C$ is the number of concurrent decode streams the fleet must hold. Figure 40.1.1 shows the request unfolding from user to fleet; Code 40.1.1 puts real numbers behind every one of these formulas.
The diagram fixes the vocabulary for the chapter. The orchestrator on the left holds one task together across its many steps; the sharded index and tools are what each step reaches out to; the sub-agent fleet is where one task becomes many cooperating reasoners; and the serving fleet on the right is the multi-accelerator service that every step of every agent ultimately calls. The numbers on the boxes come from a back-of-envelope model, which we now run with real arithmetic so that no estimate in this chapter is a guess.
import math
# Workload requirements for the running agentic specification.
users_dau = 200_000 # daily active users (requirement: concurrent users)
sessions_per_user = 4 # task sessions per active user per day
agent_steps = 12 # mean reasoning/tool steps per task (multi-step)
subagent_fanout = 3 # sub-agents spawned per step that branches
branch_fraction = 0.30 # fraction of steps that fan out to sub-agents
tokens_in_per_step = 6_000 # prompt tokens per LLM call (context + retrieved passages)
tokens_out_per_step= 700 # generated tokens per LLM call
price_in = 3.0 / 1e6 # $ per input token
price_out = 15.0 / 1e6 # $ per output token
peak_factor = 8.0 # peak QPS over the daily average
step_latency_s = 1.4 # mean wall-clock latency of one LLM call
# Effective LLM calls per task once sub-agent fan-out is counted: c_task = s (1 + f g).
calls_per_task = agent_steps * (1 + branch_fraction * subagent_fanout)
tasks_per_day = users_dau * sessions_per_user
calls_per_day = tasks_per_day * calls_per_task
# Token volume per day (input + output across every call).
tokens_in_day = calls_per_day * tokens_in_per_step
tokens_out_day = calls_per_day * tokens_out_per_step
tokens_day = tokens_in_day + tokens_out_day
# Cost: multi-step agent token cost = calls * (tokens_in * price_in + tokens_out * price_out).
cost_in_day = tokens_in_day * price_in
cost_out_day = tokens_out_day * price_out
cost_day = cost_in_day + cost_out_day
# Concurrency: average call rate lambda, then Little's law C = lambda_peak * L at peak.
avg_qps = calls_per_day / 86_400
peak_qps = avg_qps * peak_factor
conc_peak = peak_qps * step_latency_s
print(f"effective LLM calls / task : {calls_per_task:,.1f}")
print(f"tasks / day : {tasks_per_day:,}")
print(f"LLM calls / day : {calls_per_day:,.0f}")
print(f"tokens / day (in+out) : {tokens_day/1e9:,.1f} billion")
print(f"token cost / day : ${cost_day:,.0f}")
print(f"average LLM-call QPS : {avg_qps:,.0f}")
print(f"peak LLM-call QPS (x{peak_factor:.0f}) : {peak_qps:,.0f}")
print(f"concurrent calls C (peak) : {conc_peak:,.0f}")
# A single serving replica sustains a bounded number of concurrent decode streams.
streams_per_replica = 48
replicas_peak = math.ceil(conc_peak / streams_per_replica)
print(f"serving replicas at peak : {replicas_peak:,} "
f"(at {streams_per_replica} streams each)")
effective LLM calls / task : 22.8
tasks / day : 800,000
LLM calls / day : 18,240,000
tokens / day (in+out) : 122.2 billion
token cost / day : $519,840
average LLM-call QPS : 211
peak LLM-call QPS (x8) : 1,689
concurrent calls C (peak) : 2,364
serving replicas at peak : 50 (at 48 streams each)
The output settles the question. The fan-out multiplier alone turns one task into almost twenty-three calls, and the daily token volume crosses a hundred billion, which no single accelerator can either serve or afford. The peak concurrency of more than two thousand simultaneous decode streams forces the model onto a replicated, sharded fleet of dozens of nodes, exactly the distributed-serving regime of Chapter 24. And the half-million-dollar daily bill makes cost control a first-class design axis rather than an afterthought. None of these is a marginal overshoot a faster chip would absorb; each is one to three orders of magnitude beyond a single node, and each falls on a different stage of the pipeline. That separation is what makes the six-axis decomposition the right tool.
3. Decomposing the System Onto the Six Axes Intermediate
The six axes of distribution from Section 1.1, distribute data, distribute training, distribute the model, distribute inference, coordinate the cluster, and distribute intelligence, give us a map onto which every stage of Figure 40.1.1 places cleanly. This case study loads five of the six heavily; training is the one axis it consumes only indirectly, because the application assembles models that earlier parts trained rather than training them anew. Mapping the stages is the planning act of the whole chapter: it tells us which earlier part of the book owns each stage and which later section of this chapter develops it. Table 40.1.1 is that map, and it doubles as the chapter's table of contents.
| Stage | Primary axis | Owning earlier chapter | Built in this chapter |
|---|---|---|---|
| Document processing | Distribute data | Ch 6, Ch 7 | Section 40.2 |
| Embedding the corpus | Distribute inference | Ch 15, Ch 36 | Section 40.3 |
| Sharded vector search | Distribute the model | Ch 25 | Section 40.4 |
| RAG at scale | Distribute inference | Ch 36 | Section 40.5 |
| Agent orchestration | Distribute intelligence | Ch 32 | Section 40.6 |
| Model serving (vLLM) | Distribute inference / the model | Ch 24 | Section 40.7 |
| Cost control | Coordinate the cluster | Ch 26 | Section 40.8 |
| Evaluation | Distribute intelligence / coordinate | Ch 5 | Section 40.9 |
Reading the table top to bottom traces the dataflow of Figure 40.1.1, and reading the third column traces the path back through the book. Document processing is pure data distribution, the MapReduce and Spark jobs of Chapter 6 and Chapter 7 applied to the corpus the agent will retrieve over. Embedding is distributed inference on a fixed corpus, borrowing the encoder and batching techniques of Chapter 15 and the web-scale pipeline of Chapter 36. Sharding the vector store is splitting a structure too large for one node, the partitioning logic of Chapter 25. RAG at scale is the retrieval-and-condition loop that Chapter 36 built end to end, here demoted from the whole product to a single tool the agent may call many times. Agent orchestration, the plan-act-observe loop and the sub-agent fan-out, is the distributed-intelligence material of Chapter 32. Model serving is the distributed LLM serving of Chapter 24, replicated and sharded to hold the concurrency of Output 40.1.1. Cost control is the cluster-coordination and MLOps discipline of Chapter 26 applied to the token bill. Evaluation closes the loop. No stage is new machinery; the contribution of this chapter is the assembly and the seams.
Earlier case studies distributed data, the model, or the serving fleet. This one reaches the last axis the book names: it distributes the intelligence. A single user goal is decomposed by an orchestrator into steps, those steps spawn sub-agents that reason in parallel, and each reasoning act is itself served by a fleet of cooperating model replicas. The product exists only because thinking has been spread across many machines that must coordinate to act as one mind, the scale-out moment in its fullest form. When you reach the capstone in Chapter 41, this is the worked example of the sixth axis: read the requirements, find the depth and width multipliers, assign each ceiling to an axis, and defend the assignment with the arithmetic of Code 40.1.1. Scale-out here is not one optimization but the shape of cognition the product is built from.
4. The Build Path Versus the Agentic Query Path Intermediate
One structural decision organizes the rest of the chapter, and it is the same split that organized the RAG case study, now with a different center of gravity. The offline build path (document processing, embedding, sharding the index) is throughput-bound and latency-tolerant: it runs over the corpus, is measured in accelerator-hours and dollars, and a document indexed a few minutes late costs nothing. The online agentic query path (orchestration, retrieval, generation across many steps) is latency-bound and throughput-replicated: each task must finish its whole multi-step chain inside the budget, so it is measured in end-to-end tail latency and in how many concurrent steps the fleet sustains. What is new here, and what separates this chapter from Chapter 36, is that the query path is no longer one retrieval and one generation; it is a loop whose length and fan-out are decided at runtime by the agent itself, which makes its cost and latency variable rather than fixed.
That runtime variability is the heart of the chapter. In a one-shot RAG system the cost of a query is known before it runs; in an agentic system the agent decides how many steps to take and how many sub-agents to spawn, so the token bill and the concurrency load are themselves outcomes of the model's behavior, not fixed inputs. This is why cost control (Section 40.8) becomes its own stage rather than a footnote: bounding step counts, caching repeated retrievals, and choosing a smaller model for routine steps are the levers that keep the variable cost of Output 40.1.1 from running away. The two paths meet at two shared artifacts, the sharded index that the build path produces and the query path retrieves from, and the serving fleet that every step of every agent calls. Keeping the paths logically separate while letting them share exactly those two artifacts is the architectural spine of the chapter.
Who: A platform engineer at an enterprise software company building an internal assistant that answers employee questions over the company's document corpus.
Situation: A prototype agent worked in a notebook: one reasoning loop, a hosted model API, a small local index, perfect demos to the leadership team.
Problem: Rolled out to a few thousand employees, the agent's reasoning loop averaged a dozen steps and frequently spawned sub-agents to check related documents, and the first monthly invoice was an order of magnitude over the budget.
Dilemma: Cap the agent to a single step and lose the multi-step reasoning that made it useful, or keep the reasoning and re-architect the serving and cost layers before the rollout widened company-wide.
Decision: They re-architected, running the back-of-envelope model of Code 40.1.1 on their real step counts and fan-out first, which revealed that the token bill was driven by re-reading a growing context at every step and by unbounded sub-agent spawning.
How: They moved to a self-hosted serving fleet with continuous batching, capped sub-agent fan-out, cached repeated retrievals, and routed routine steps to a smaller model, the exact stages of Sections 40.7 and 40.8.
Result: Per-task cost fell several-fold while end-to-end latency stayed inside the budget, and because the build and agentic paths were separated, widening to the whole company meant adding serving replicas without touching the orchestration logic.
Lesson: An agent prototype and an agent product are different systems, and the difference is the multiplier in $c_{\text{task}}$. Doing the arithmetic of Section 2 before the rollout is what turns the demo into something that survives its own invoice.
It is worth seeing the system this chapter does not study, because it shows exactly which problems scale removes the luxury of ignoring. A few lines of a modern agent framework give a working multi-step agent on a small corpus:
# pip install langchain langchain-community faiss-cpu
from langchain.agents import initialize_agent, Tool
from langchain_community.vectorstores import FAISS
retriever = FAISS.from_texts(documents, emb).as_retriever() # whole index in RAM
tools = [Tool("search", retriever.invoke, "search the corpus")]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
answer = agent.run(task) # loops, retrieves, and calls the model, all on one box
5. What This Chapter Builds, Section by Section Beginner
With the requirements fixed, the scale proven, and the axes mapped, the path through the chapter is set. Section 40.2 builds the distributed document-processing pipeline that produces the corpus the agent retrieves over. Section 40.3 embeds that corpus across a fleet, the inference pass that turns documents into searchable vectors. Section 40.4 shards those vectors into the searchable index, the artifact the agentic path consumes. Section 40.5 assembles retrieval and generation into RAG at scale, now reframed as a tool the agent invokes rather than the whole product. Section 40.6 builds the agent orchestration: the plan-act-observe loop, tool use, and the sub-agent fan-out that distributes the intelligence. Section 40.7 stands up the replicated, sharded model-serving fleet with vLLM that holds the concurrency of Output 40.1.1. Section 40.8 brings the variable token cost under control. Section 40.9 evaluates the whole system for task success, faithfulness, latency, and cost, and Section 40.10 turns it into a project you build yourself. Each section opens with the slice of Figure 40.1.1 it owns and the requirement from Section 1 it must satisfy.
The boundary of this problem is shifting on every axis at once. On the orchestration axis, multi-agent frameworks in the lineage of AutoGen, LangGraph, CrewAI, and OpenAI's Swarm are formalizing how a planner agent decomposes a task and supervises a fleet of workers, and recent studies probe when added agents actually help versus when they merely multiply cost. On the serving axis, agent-aware schedulers exploit the fact that an agent's steps share a long, slowly growing prefix, so prefix-caching and disaggregated prefill-versus-decode serving (in the vLLM and SGLang lineage) cut the dominant input-token cost that Output 40.1.1 exposes. On the reasoning axis, long-context and tool-use models are reopening how much retrieval a step needs, yet retrieval keeps winning on cost and faithfulness at corpus scale, which is why the pipeline of Figure 40.1.1 endures. The constant across all of it is the five-requirement tension of Section 1: every advance is judged by whether it improves one of corpus scale, concurrency, multi-step latency, token cost, or reliability without surrendering the others, and the agent's runtime control over its own step count makes that trade sharper here than anywhere else in the book.
The epigraph is only a small exaggeration. In Output 40.1.1 the average task spawns enough sub-agents that, across a day, the system launches millions of short-lived reasoners that retrieve, decide, and report back, most of which finish their work and vanish before the parent has assembled the final answer. No human supervises them; the orchestrator counts on the arithmetic, not on watching each one. Distributing the intelligence did not make the thinking smaller, it made the waiting survivable, the same bargain that has underwritten every chapter of this book, now applied to cognition itself.
For each of the five requirements in Section 1 (corpus scale, concurrent users, multi-step latency, token cost, reliability), name the single stage in Figure 40.1.1 that it pressures hardest, the axis of distribution from Table 40.1.1 that stage sits on, and one concrete failure that occurs if that stage is left on a single machine. Then identify which two of the five requirements are in the most direct tension with each other in an agentic system specifically, and explain why the agent's runtime control over its own step count makes that tension worse than it is in the one-shot RAG system of Chapter 36.
Modify Code 40.1.1 to model an agent that reasons more deeply and branches more widely: set the mean step count to $20$, the branch fraction to $0.5$, and the sub-agent fan-out to $4$. Report the new effective calls per task $c_{\text{task}}$, the daily token volume, the daily token cost, and the peak concurrency $C$. Then add a cost lever: assume routine steps (say $70\%$ of all calls) can be routed to a model priced at one fifth of the rates used here, and recompute the daily cost. Report the cost reduction as a percentage and explain which term in the cost formula the lever actually shrinks.
Using Little's law $C = \lambda_{\text{peak}} L$ from Section 2, suppose the serving fleet must sustain a peak of $\lambda_{\text{peak}} = 1{,}700$ model calls per second at a per-step latency of $L = 1.4$ seconds. Estimate the concurrent decode streams $C$, and if one serving replica sustains $48$ concurrent streams, the replica count. Now argue how the answer changes if prefix-caching cuts the effective per-step latency to $0.6$ seconds by reusing the shared context across an agent's steps, and connect your reasoning to the replicated-serving treatment forthcoming in Section 40.7 and the LLM-serving economics of Chapter 24.