Section 40.1: Problem Definition | Building Scalable AI

"A user asked one question. I read three documents, called two tools, and quietly spawned four smaller copies of myself to chase the loose ends. They are still working. I have already answered, and I will never fully know what they found."
An Agent, Spawning Helpers It Will Never Fully Supervise

Big Picture

An LLM-powered agentic application is the place where every axis of this book runs at once and where the model itself becomes the most expensive moving part. A single user request no longer maps to a single model call; it unfolds into a multi-step plan that retrieves passages from a corpus too large for one machine, invokes tools, and spawns sub-agents whose calls fan out and multiply, all while the model that powers each step is a multi-accelerator service answering thousands of concurrent streams under a latency budget a person will actually wait through. This chapter is the capstone case study: it assembles retrieval (Chapter 36 and Chapter 25), distributed LLM serving (Chapter 24), and agent orchestration (Chapter 32) into one product. This opening section fixes the requirements, proves with arithmetic that the token volume and fan-out demand a serving fleet, decomposes the system into the stages the rest of the chapter builds, and maps each stage onto the six axes of distribution from Chapter 1.

An agent, in the sense this chapter uses the word, is a program that wraps a language model in a loop: it reads a goal, decides on an action, takes that action through a tool or a retrieval call, observes the result, and repeats until it judges the goal met. Where a retrieval-augmented question answers itself in one model call conditioned on retrieved text, an agent answers a task in many calls, each conditioned on the accumulated history of what it has already done. The version that fits in a notebook, a single loop calling a hosted model API over a handful of documents, hides every problem this chapter exists to confront. The interesting version is a production assistant or an autonomous task agent that serves a real user population, reasons over a large enterprise or web corpus, and must do so within a cost envelope and a latency budget that an operator would actually sign. The purpose of this case study is to show how the machinery built across Parts II through VII assembles into that product, and where the seams between agent, retriever, and serving fleet actually fall.

We treat the system as a concrete target. The running specification is a production agentic application serving on the order of two hundred thousand daily active users, each running several multi-step task sessions per day, where every session reasons across roughly a dozen steps, retrieves from a large document corpus at most steps, and occasionally fans out into sub-agents that pursue subtasks in parallel. The model behind each step is a large language model that cannot be served from one accelerator. Four pressures, the multi-step token volume, the sub-agent fan-out, the concurrency under a per-step latency budget, and the cost per task, are the requirements; the rest of the chapter is the design that satisfies them.

1. The Requirements: An Agent Is Not One Model Call Beginner

A problem definition earns its keep only when it commits to numbers, because the numbers decide which ceilings bind and therefore which axes of distribution we must reach for. We fix five requirements and carry them through the entire chapter. The first is corpus scale: the agent retrieves over a corpus on the order of tens of millions of documents, large enough that the index must be sharded as in Chapter 25, but the corpus is no longer the headline cost. The second is concurrent users: the system serves roughly two hundred thousand daily active users running several task sessions each, which sets the request arrival rate. The third is the multi-step latency budget: a task is not one call but a chain of about a dozen reasoning and tool-use steps, and the user waits for the whole chain, so each step must finish within a fraction of a second to keep the end-to-end response inside the time a person tolerates. The fourth is token cost: every step consumes prompt tokens (the growing context plus retrieved passages) and emits output tokens, and the bill is the product of steps, tokens per step, and price, summed over every call including those of spawned sub-agents. The fifth is reliability: a single failed step must not abort the whole task, because a twelve-step plan that fails one step in twenty still fails nearly half of all tasks unless steps can be retried and isolated.

These requirements interact, and the interaction is the whole subject. Multi-step reasoning multiplies token cost, because each step re-reads an ever-larger history. Sub-agent fan-out multiplies it again, because one branching step becomes several calls. Concurrency fights latency, because every in-flight step competes for the same serving fleet. Reliability fights latency, because retries add steps. A single machine fails every requirement at once and for independent reasons: it cannot hold the sharded index, cannot serve the model, cannot sustain the concurrent step volume, and cannot survive its own failures. The next section makes that failure quantitative.

Key Insight: The Agent Multiplies Every Cost by Its Step Count and Its Fan-Out

A one-shot question costs one model call. An agent costs the number of steps in its plan times the number of sub-agents each step spawns, and each of those calls carries a prompt that grows as the agent accumulates history. The two multipliers, depth (steps per task) and width (sub-agents per branching step), turn a modest per-call cost into a token bill and a concurrency load that no single server absorbs. Reading a specification means locating these two multipliers first, because they convert a tolerable single-call system into one that is forced onto a fleet.

2. Why a Single Machine Is Not Even Close Beginner

The argument for distribution falls out of arithmetic on the requirements, not from rhetoric. Start with the work in one task. If a task runs $s$ reasoning steps, a fraction $f$ of which branch into $g$ sub-agents, the effective number of language-model calls per task is

$$c_{\text{task}} = s \,\bigl(1 + f\,g\bigr),$$

because every step is one call and each branching step adds $g$ more. With $s = 12$ steps, a branch fraction $f = 0.3$, and a fan-out $g = 3$, a single task already costs roughly twenty-three model calls rather than one. Multiply by the task volume and the per-call tokens to get the daily token bill. If $u$ users each run $t$ tasks per day, the calls and tokens per day are

$$\text{calls/day} = u\,t\,c_{\text{task}}, \qquad \text{tokens/day} = \text{calls/day}\,\bigl(p_{\text{in}} + p_{\text{out}}\bigr),$$

where $p_{\text{in}}$ and $p_{\text{out}}$ are input and output tokens per call. The dollar cost is the same product weighted by per-token prices, the multi-step agent cost

$$\text{cost/day} = \text{calls/day}\,\bigl(p_{\text{in}}\,\pi_{\text{in}} + p_{\text{out}}\,\pi_{\text{out}}\bigr),$$

with $\pi_{\text{in}}$ and $\pi_{\text{out}}$ the input and output token prices. Finally the serving load. The average rate of model calls is the calls per day spread over a day, and at peak it is several times higher; Little's law then fixes how many calls are in flight at once,

$$\lambda = \frac{\text{calls/day}}{86400}, \qquad C = \lambda_{\text{peak}}\, L,$$

where $L$ is the wall-clock latency of one step and $C$ is the number of concurrent decode streams the fleet must hold. Figure 40.1.1 shows the request unfolding from user to fleet; Code 40.1.1 puts real numbers behind every one of these formulas.

Figure 40.1.1: The distributed agentic application as a request-unfolding architecture. A user task enters the agent orchestrator (top), which runs a multi-step plan-act-observe loop; on each step it retrieves from a sharded vector index, calls external tools, or spawns sub-agents (left, green) that run their own loops. Every reasoning step of every agent and sub-agent issues a call to the replicated, sharded LLM serving fleet (right), where a router batches the concurrent streams across many model replicas. The orange arrows trace those calls; each region is labeled with the axis of distribution it loads most heavily, and the peak numbers come from Code 40.1.1.

The diagram fixes the vocabulary for the chapter. The orchestrator on the left holds one task together across its many steps; the sharded index and tools are what each step reaches out to; the sub-agent fleet is where one task becomes many cooperating reasoners; and the serving fleet on the right is the multi-accelerator service that every step of every agent ultimately calls. The numbers on the boxes come from a back-of-envelope model, which we now run with real arithmetic so that no estimate in this chapter is a guess.

import math

# Workload requirements for the running agentic specification.
users_dau          = 200_000       # daily active users (requirement: concurrent users)
sessions_per_user  = 4             # task sessions per active user per day
agent_steps        = 12            # mean reasoning/tool steps per task (multi-step)
subagent_fanout    = 3             # sub-agents spawned per step that branches
branch_fraction    = 0.30          # fraction of steps that fan out to sub-agents
tokens_in_per_step = 6_000         # prompt tokens per LLM call (context + retrieved passages)
tokens_out_per_step= 700           # generated tokens per LLM call
price_in           = 3.0 / 1e6     # $ per input token
price_out          = 15.0 / 1e6    # $ per output token
peak_factor        = 8.0           # peak QPS over the daily average
step_latency_s     = 1.4           # mean wall-clock latency of one LLM call

# Effective LLM calls per task once sub-agent fan-out is counted: c_task = s (1 + f g).
calls_per_task = agent_steps * (1 + branch_fraction * subagent_fanout)
tasks_per_day  = users_dau * sessions_per_user
calls_per_day  = tasks_per_day * calls_per_task

# Token volume per day (input + output across every call).
tokens_in_day  = calls_per_day * tokens_in_per_step
tokens_out_day = calls_per_day * tokens_out_per_step
tokens_day     = tokens_in_day + tokens_out_day

# Cost: multi-step agent token cost = calls * (tokens_in * price_in + tokens_out * price_out).
cost_in_day  = tokens_in_day  * price_in
cost_out_day = tokens_out_day * price_out
cost_day     = cost_in_day + cost_out_day

# Concurrency: average call rate lambda, then Little's law C = lambda_peak * L at peak.
avg_qps   = calls_per_day / 86_400
peak_qps  = avg_qps * peak_factor
conc_peak = peak_qps * step_latency_s

print(f"effective LLM calls / task    : {calls_per_task:,.1f}")
print(f"tasks / day                   : {tasks_per_day:,}")
print(f"LLM calls / day               : {calls_per_day:,.0f}")
print(f"tokens / day (in+out)         : {tokens_day/1e9:,.1f} billion")
print(f"token cost / day              : ${cost_day:,.0f}")
print(f"average LLM-call QPS          : {avg_qps:,.0f}")
print(f"peak LLM-call QPS (x{peak_factor:.0f})       : {peak_qps:,.0f}")
print(f"concurrent calls C (peak)     : {conc_peak:,.0f}")

# A single serving replica sustains a bounded number of concurrent decode streams.
streams_per_replica = 48
replicas_peak = math.ceil(conc_peak / streams_per_replica)
print(f"serving replicas at peak      : {replicas_peak:,} "
      f"(at {streams_per_replica} streams each)")

Code 40.1.1: The five-requirement back-of-envelope model for the running specification. Each printed quantity corresponds to one formula in this section: the effective calls per task $c_{\text{task}}$, the daily calls and tokens, the multi-step token cost, the call rate $\lambda$, the Little's-law concurrency $C$, and the serving-replica count it forces.

effective LLM calls / task    : 22.8
tasks / day                   : 800,000
LLM calls / day               : 18,240,000
tokens / day (in+out)         : 122.2 billion
token cost / day              : $519,840
average LLM-call QPS          : 211
peak LLM-call QPS (x8)       : 1,689
concurrent calls C (peak)     : 2,364
serving replicas at peak      : 50 (at 48 streams each)

Output 40.1.1: Real output. One task is nearly twenty-three model calls, not one; the system makes over eighteen million calls and burns more than a hundred and twenty billion tokens a day at roughly half a million dollars; and at peak it holds over two thousand decode streams at once, forcing on the order of fifty model replicas. Each number is a wall the single machine hits, and each lands on a different axis of distribution.

The output settles the question. The fan-out multiplier alone turns one task into almost twenty-three calls, and the daily token volume crosses a hundred billion, which no single accelerator can either serve or afford. The peak concurrency of more than two thousand simultaneous decode streams forces the model onto a replicated, sharded fleet of dozens of nodes, exactly the distributed-serving regime of Chapter 24. And the half-million-dollar daily bill makes cost control a first-class design axis rather than an afterthought. None of these is a marginal overshoot a faster chip would absorb; each is one to three orders of magnitude beyond a single node, and each falls on a different stage of the pipeline. That separation is what makes the six-axis decomposition the right tool.

3. Decomposing the System Onto the Six Axes Intermediate

The six axes of distribution from Section 1.1, distribute data, distribute training, distribute the model, distribute inference, coordinate the cluster, and distribute intelligence, give us a map onto which every stage of Figure 40.1.1 places cleanly. This case study loads five of the six heavily; training is the one axis it consumes only indirectly, because the application assembles models that earlier parts trained rather than training them anew. Mapping the stages is the planning act of the whole chapter: it tells us which earlier part of the book owns each stage and which later section of this chapter develops it. Table 40.1.1 is that map, and it doubles as the chapter's table of contents.

Table 40.1.1: Each stage of the agentic application assigned to the axis of distribution it loads most heavily, the earlier chapter that owns the underlying machinery, and the later section of this chapter that builds it.

Stage	Primary axis	Owning earlier chapter	Built in this chapter
Document processing	Distribute data	Ch 6, Ch 7	Section 40.2
Embedding the corpus	Distribute inference	Ch 15, Ch 36	Section 40.3
Sharded vector search	Distribute the model	Ch 25	Section 40.4
RAG at scale	Distribute inference	Ch 36	Section 40.5
Agent orchestration	Distribute intelligence	Ch 32	Section 40.6
Model serving (vLLM)	Distribute inference / the model	Ch 24	Section 40.7
Cost control	Coordinate the cluster	Ch 26	Section 40.8
Evaluation	Distribute intelligence / coordinate	Ch 5	Section 40.9

Reading the table top to bottom traces the dataflow of Figure 40.1.1, and reading the third column traces the path back through the book. Document processing is pure data distribution, the MapReduce and Spark jobs of Chapter 6 and Chapter 7 applied to the corpus the agent will retrieve over. Embedding is distributed inference on a fixed corpus, borrowing the encoder and batching techniques of Chapter 15 and the web-scale pipeline of Chapter 36. Sharding the vector store is splitting a structure too large for one node, the partitioning logic of Chapter 25. RAG at scale is the retrieval-and-condition loop that Chapter 36 built end to end, here demoted from the whole product to a single tool the agent may call many times. Agent orchestration, the plan-act-observe loop and the sub-agent fan-out, is the distributed-intelligence material of Chapter 32. Model serving is the distributed LLM serving of Chapter 24, replicated and sharded to hold the concurrency of Output 40.1.1. Cost control is the cluster-coordination and MLOps discipline of Chapter 26 applied to the token bill. Evaluation closes the loop. No stage is new machinery; the contribution of this chapter is the assembly and the seams.

Thesis Thread: Intelligence Itself, Distributed Across a Fleet

Earlier case studies distributed data, the model, or the serving fleet. This one reaches the last axis the book names: it distributes the intelligence. A single user goal is decomposed by an orchestrator into steps, those steps spawn sub-agents that reason in parallel, and each reasoning act is itself served by a fleet of cooperating model replicas. The product exists only because thinking has been spread across many machines that must coordinate to act as one mind, the scale-out moment in its fullest form. When you reach the capstone in Chapter 41, this is the worked example of the sixth axis: read the requirements, find the depth and width multipliers, assign each ceiling to an axis, and defend the assignment with the arithmetic of Code 40.1.1. Scale-out here is not one optimization but the shape of cognition the product is built from.

4. The Build Path Versus the Agentic Query Path Intermediate

One structural decision organizes the rest of the chapter, and it is the same split that organized the RAG case study, now with a different center of gravity. The offline build path (document processing, embedding, sharding the index) is throughput-bound and latency-tolerant: it runs over the corpus, is measured in accelerator-hours and dollars, and a document indexed a few minutes late costs nothing. The online agentic query path (orchestration, retrieval, generation across many steps) is latency-bound and throughput-replicated: each task must finish its whole multi-step chain inside the budget, so it is measured in end-to-end tail latency and in how many concurrent steps the fleet sustains. What is new here, and what separates this chapter from Chapter 36, is that the query path is no longer one retrieval and one generation; it is a loop whose length and fan-out are decided at runtime by the agent itself, which makes its cost and latency variable rather than fixed.

That runtime variability is the heart of the chapter. In a one-shot RAG system the cost of a query is known before it runs; in an agentic system the agent decides how many steps to take and how many sub-agents to spawn, so the token bill and the concurrency load are themselves outcomes of the model's behavior, not fixed inputs. This is why cost control (Section 40.8) becomes its own stage rather than a footnote: bounding step counts, caching repeated retrievals, and choosing a smaller model for routine steps are the levers that keep the variable cost of Output 40.1.1 from running away. The two paths meet at two shared artifacts, the sharded index that the build path produces and the query path retrieves from, and the serving fleet that every step of every agent calls. Keeping the paths logically separate while letting them share exactly those two artifacts is the architectural spine of the chapter.

Practical Example: The Agent Demo Whose Bill Arrived Before the Product Did

Who: A platform engineer at an enterprise software company building an internal assistant that answers employee questions over the company's document corpus.

Situation: A prototype agent worked in a notebook: one reasoning loop, a hosted model API, a small local index, perfect demos to the leadership team.

Problem: Rolled out to a few thousand employees, the agent's reasoning loop averaged a dozen steps and frequently spawned sub-agents to check related documents, and the first monthly invoice was an order of magnitude over the budget.

Dilemma: Cap the agent to a single step and lose the multi-step reasoning that made it useful, or keep the reasoning and re-architect the serving and cost layers before the rollout widened company-wide.

Decision: They re-architected, running the back-of-envelope model of Code 40.1.1 on their real step counts and fan-out first, which revealed that the token bill was driven by re-reading a growing context at every step and by unbounded sub-agent spawning.

How: They moved to a self-hosted serving fleet with continuous batching, capped sub-agent fan-out, cached repeated retrievals, and routed routine steps to a smaller model, the exact stages of Sections 40.7 and 40.8.

Result: Per-task cost fell several-fold while end-to-end latency stayed inside the budget, and because the build and agentic paths were separated, widening to the whole company meant adding serving replicas without touching the orchestration logic.

Lesson: An agent prototype and an agent product are different systems, and the difference is the multiplier in $c_{\text{task}}$. Doing the arithmetic of Section 2 before the rollout is what turns the demo into something that survives its own invoice.

Library Shortcut: The Notebook Agent That Hides Every Distributed Problem

It is worth seeing the system this chapter does not study, because it shows exactly which problems scale removes the luxury of ignoring. A few lines of a modern agent framework give a working multi-step agent on a small corpus:

# pip install langchain langchain-community faiss-cpu
from langchain.agents import initialize_agent, Tool
from langchain_community.vectorstores import FAISS

retriever = FAISS.from_texts(documents, emb).as_retriever()   # whole index in RAM
tools = [Tool("search", retriever.invoke, "search the corpus")]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
answer = agent.run(task)        # loops, retrieves, and calls the model, all on one box

Code 40.1.2: A complete single-machine agent in a handful of lines. It silently assumes the index fits in RAM, the model is a hosted call that always returns, the reasoning loop is cheap because no one counts its steps, and one process can serve every user. Each assumption is one of the five requirements from Section 1, and each fails in production; the rest of this chapter is what these lines become when none of them hold.

5. What This Chapter Builds, Section by Section Beginner

With the requirements fixed, the scale proven, and the axes mapped, the path through the chapter is set. Section 40.2 builds the distributed document-processing pipeline that produces the corpus the agent retrieves over. Section 40.3 embeds that corpus across a fleet, the inference pass that turns documents into searchable vectors. Section 40.4 shards those vectors into the searchable index, the artifact the agentic path consumes. Section 40.5 assembles retrieval and generation into RAG at scale, now reframed as a tool the agent invokes rather than the whole product. Section 40.6 builds the agent orchestration: the plan-act-observe loop, tool use, and the sub-agent fan-out that distributes the intelligence. Section 40.7 stands up the replicated, sharded model-serving fleet with vLLM that holds the concurrency of Output 40.1.1. Section 40.8 brings the variable token cost under control. Section 40.9 evaluates the whole system for task success, faithfulness, latency, and cost, and Section 40.10 turns it into a project you build yourself. Each section opens with the slice of Figure 40.1.1 it owns and the requirement from Section 1 it must satisfy.

Research Frontier: Where Distributed Agentic Systems Are Moving (2024 to 2026)

The boundary of this problem is shifting on every axis at once. On the orchestration axis, multi-agent frameworks in the lineage of AutoGen, LangGraph, CrewAI, and OpenAI's Swarm are formalizing how a planner agent decomposes a task and supervises a fleet of workers, and recent studies probe when added agents actually help versus when they merely multiply cost. On the serving axis, agent-aware schedulers exploit the fact that an agent's steps share a long, slowly growing prefix, so prefix-caching and disaggregated prefill-versus-decode serving (in the vLLM and SGLang lineage) cut the dominant input-token cost that Output 40.1.1 exposes. On the reasoning axis, long-context and tool-use models are reopening how much retrieval a step needs, yet retrieval keeps winning on cost and faithfulness at corpus scale, which is why the pipeline of Figure 40.1.1 endures. The constant across all of it is the five-requirement tension of Section 1: every advance is judged by whether it improves one of corpus scale, concurrency, multi-step latency, token cost, or reliability without surrendering the others, and the agent's runtime control over its own step count makes that trade sharper here than anywhere else in the book.

Fun Note: The Helpers It Will Never Meet

The epigraph is only a small exaggeration. In Output 40.1.1 the average task spawns enough sub-agents that, across a day, the system launches millions of short-lived reasoners that retrieve, decide, and report back, most of which finish their work and vanish before the parent has assembled the final answer. No human supervises them; the orchestrator counts on the arithmetic, not on watching each one. Distributing the intelligence did not make the thinking smaller, it made the waiting survivable, the same bargain that has underwritten every chapter of this book, now applied to cognition itself.

Exercise 40.1.1: Read the Requirements, Name the Ceilings Conceptual

For each of the five requirements in Section 1 (corpus scale, concurrent users, multi-step latency, token cost, reliability), name the single stage in Figure 40.1.1 that it pressures hardest, the axis of distribution from Table 40.1.1 that stage sits on, and one concrete failure that occurs if that stage is left on a single machine. Then identify which two of the five requirements are in the most direct tension with each other in an agentic system specifically, and explain why the agent's runtime control over its own step count makes that tension worse than it is in the one-shot RAG system of Chapter 36.

Exercise 40.1.2: Resize the Agent Coding

Modify Code 40.1.1 to model an agent that reasons more deeply and branches more widely: set the mean step count to $20$, the branch fraction to $0.5$, and the sub-agent fan-out to $4$. Report the new effective calls per task $c_{\text{task}}$, the daily token volume, the daily token cost, and the peak concurrency $C$. Then add a cost lever: assume routine steps (say $70\%$ of all calls) can be routed to a model priced at one fifth of the rates used here, and recompute the daily cost. Report the cost reduction as a percentage and explain which term in the cost formula the lever actually shrinks.

Exercise 40.1.3: The Concurrency Budget of a Reasoning Loop Analysis

Using Little's law $C = \lambda_{\text{peak}} L$ from Section 2, suppose the serving fleet must sustain a peak of $\lambda_{\text{peak}} = 1{,}700$ model calls per second at a per-step latency of $L = 1.4$ seconds. Estimate the concurrent decode streams $C$, and if one serving replica sustains $48$ concurrent streams, the replica count. Now argue how the answer changes if prefix-caching cuts the effective per-step latency to $0.6$ seconds by reusing the shared context across an agent's steps, and connect your reasoning to the replicated-serving treatment forthcoming in Section 40.7 and the LLM-serving economics of Chapter 24.