"I am an Orchestrator, holding the plan while five workers run ahead. Two have already finished, one is retrying a flaky tool, one is blocked on a lock, and one has wandered off to call an API I never approved. This is, somehow, the calm part of my job."
An Orchestrator, Holding the Plan While Five Workers Run Ahead
A production agentic application is not one model answering one prompt; it is a fleet of cooperating agents and tool services coordinated like a distributed workflow, where the unit of work is a reasoning step and the unit of communication is a tool call. A single language model can plan, but it cannot retrieve a billion documents, run untrusted code in isolation, query a payments API, and do all three at once without help. The moment the work exceeds what one model call can do alone, the application splits into a planner that decides what to do, executors that act, tool services that the executors call as remote procedures, and an orchestration engine that keeps the whole thing durable, retryable, and consistent. This section takes the orchestration machinery built in Chapter 32 and applies it to a real agentic product, showing how distributing intelligence across many agents follows the same split-move-recombine discipline as distributing a gradient.
In the previous section we built the retrieval layer that an agent reaches into when it needs grounded facts. Now we step up a level and ask how the agent uses that retrieval, alongside code execution and external APIs, inside a multi-step reasoning loop, and how that loop becomes a distributed system once one agent is no longer enough. The progression mirrors the rest of the book. A single language model call is the scale-up baseline: capable, self-contained, and bounded by what fits in one context window and one forward pass. Once a task needs many steps, many tools, and work that can proceed in parallel, the application scales out into a coordinated set of agents and services. The discipline for doing that well is distributed agent orchestration, and it is the sixth axis of distribution from Chapter 1: distributing intelligence itself.
1. The Agent Loop: Plan, Act, Observe, Repeat Beginner
The smallest unit of agentic behavior is a loop, not a single call. A language model is given a goal and a set of available tools; it reasons about what to do next, emits an action (a tool call with arguments), receives the tool's result as an observation, and feeds that observation back into its context to decide the next action. This reason, act, observe cycle, popularized as the ReAct pattern, repeats until the model decides the goal is met and emits a final answer. The loop is what turns a static predictor into something that can gather information it did not start with, correct course when a tool returns an error, and chain several operations toward a goal no single call could reach.
Each turn of the loop is cheap to describe and expensive to run. The reasoning step is one LLM forward pass, served from the fleet of Section 40.7. The action step blocks on an external service whose latency the agent does not control: a vector search over the index of Section 40.4, a sandboxed code execution, or a third-party API. Because the model must see each observation before choosing the next action, the turns of a single agent's loop are inherently sequential, and the wall-clock time of the loop is the sum of its per-turn latencies. That sequential spine is exactly the part that orchestration tries to keep short, and the independent work hanging off it is exactly the part that orchestration tries to run in parallel.
We can put a number on how long such a loop runs. If the agent takes a random number of turns $T$ to reach its goal, and each turn that has not yet finished succeeds with probability $q$, then $T$ is geometric and the expected number of turns to completion is
$$\mathbb{E}[T] = \frac{1}{q}, \qquad \text{critical-path latency} \approx \mathbb{E}[T]\,\big(\ell_{\text{reason}} + \ell_{\text{act}}\big),$$where $\ell_{\text{reason}}$ is the LLM latency per turn and $\ell_{\text{act}}$ is the tool latency per turn. The formula is deliberately blunt, but it captures the operational truth: a loop that needs ten turns at half a second each is a five-second user wait, and the only ways to shrink it are to reduce the number of turns (better planning) or to reduce per-turn latency (faster tools and a faster model). When the per-turn tool work is itself splittable across machines, a third lever appears, and that lever is the subject of the rest of this section.
The action step of the agent loop is a remote procedure call. The arguments happen to be chosen by a language model rather than hand-written code, but everything downstream is ordinary distributed-systems engineering: the call crosses a network boundary, it can time out, it can fail and need a retry, it can run concurrently with other calls, and its result must be marshaled back into the caller's state. Treating tool calls as RPCs, with the timeouts, idempotency keys, and circuit breakers that RPCs have always needed, is what separates a demo agent from a production one. The model picks the calls; the orchestration engine makes them survivable.
2. Tools as Distributed Services Beginner
Function calling, the mechanism by which a model emits a structured request to invoke a named tool with typed arguments, is the interface through which an agent reaches the rest of the system. Each tool is a service with its own latency, failure profile, and scaling story. The retrieval tool of Section 40.4 is a sharded vector index that answers in tens of milliseconds and scales by adding index replicas. A code-execution tool is a pool of sandboxed workers, each an isolated container that runs untrusted model-written code and is torn down afterward; it scales by adding sandbox capacity and is the slowest and riskiest tool in the kit. An external API, a weather service or a payments gateway, is a tool the application does not own at all, with rate limits and an availability budget set by someone else.
Because the tools are independent services, they fail and scale independently, and the orchestration layer must hold each to its own contract. A retrieval timeout should not abort a plan that could proceed with a cached result; a payments call must never be retried blindly, because a duplicate charge is worse than a failure; a code sandbox that hangs must be killed and its turn re-planned rather than waited on forever. These are the coordination and failure concerns of Chapter 2, now wearing the costume of an agent's toolbox. The agent chooses which tool to call; the engineering choice is how each tool behaves when it is slow, absent, or returns garbage.
Writing the parse-the-tool-call, dispatch, append-the-observation machinery by hand is tedious and error-prone. Modern SDKs expose function calling as a typed loop: you declare the tools, hand the model the goal, and the SDK surfaces each requested call for you to execute and return. The skeleton below is the entire ReAct spine of Figure 40.6.1.
tools = [retrieve_tool, run_code_tool, call_api_tool] # each a typed schema
messages = [{"role": "user", "content": goal}]
while True:
reply = model.run(messages, tools=tools) # the Reason step
if reply.stop_reason != "tool_use": # model judged it Done
break
for call in reply.tool_calls: # the Act step (may be many)
result = dispatch(call.name, call.arguments) # an RPC to a tool service
messages.append(tool_result(call.id, result)) # the Observe step
reply.tool_calls holds several calls at once, the dispatch can run them in parallel, which Section 4 turns into the supervisor fan-out.3. Planner and Executor: Fanning Work Out to Role-Specialized Agents Intermediate
One agent looping over tools handles a surprising amount, but it serializes everything, and many goals decompose into independent pieces. A research request might need five sources summarized, a coding task might touch four files, a data pipeline might validate six tables. The production pattern, built in depth in Chapter 32, splits the single agent into roles: a planner (or supervisor) decomposes the goal into sub-tasks and decides their dependencies, and a set of executor agents, each possibly specialized for a role such as researcher, coder, or critic, carry out the sub-tasks. The supervisor then gathers the executors' results and either finishes or plans another round. This is a map-then-reduce over reasoning, with the supervisor playing coordinator and the executors playing workers, the same shape as the all-reduce of Chapter 1 but with intelligence rather than gradients as the payload.
Whether this fan-out actually saves time depends on how much of the plan is genuinely independent. If a fraction $p$ of the total work can run in parallel across $W$ worker agents and the remaining fraction $1-p$ is the irreducibly sequential planning spine, the speedup is bounded by Amdahl's law from Chapter 3,
$$S(W) = \frac{1}{(1-p) + \dfrac{p}{W}}, \qquad \lim_{W \to \infty} S(W) = \frac{1}{1-p}.$$The ceiling is sobering: even with infinitely many workers, a plan that is $15\%$ sequential can never run more than about $6.7$ times faster than serial execution. The lesson for agent design is the same as for parallel training: the win comes from making the parallel fraction large (decompose aggressively into independent sub-tasks) and the sequential spine short (fewer supervisor rounds), not from throwing more workers at a plan that is mostly serial. For independent sub-tasks with per-task latencies $\ell_1, \dots, \ell_n$, sequential execution costs $\sum_i \ell_i$ while parallel execution costs only $\max_i \ell_i$, so the speedup is the sum-over-max ratio, and a single slow sub-task (a hung sandbox) caps the gain exactly as a straggler caps a synchronous training step.
4. The Orchestration Engine: Durable Execution, Retries, and State Intermediate
Fan-out is easy to draw and hard to run reliably, because the moment work is spread across many agents and tool calls, partial failure becomes the normal case. A worker crashes mid-plan, a tool times out, the process serving the supervisor is preempted. A toy script loses everything; a production orchestration engine does not, because it treats the plan as durable state. Each completed sub-task is checkpointed, so a restart resumes from the last good step rather than re-running the whole plan; each tool call carries a retry policy and an idempotency guarantee, so a transient failure is retried without duplicating side effects; and the engine tracks which sub-tasks are ready (dependencies satisfied), running, or done, scheduling the ready ones onto available workers. These are precisely the fault-tolerance and coordination primitives of Chapter 2, applied to a workflow whose steps happen to be acts of reasoning.
The runtime substrate underneath is a distributed-execution framework. Chapter 33 develops Ray as exactly this kind of substrate: it places each agent and tool worker as a task or actor across the cluster, moves the results between them, and restarts the ones that die, so that the orchestration logic can speak in terms of sub-tasks and dependencies while the framework handles process placement, fault recovery, and state transport. The agentic application thus inherits the cluster-scheduling story of the rest of the book: agents are scheduled work, tool services are scaled pools, and the engine is the coordinator that keeps a fleet of cooperating reasoners behaving like one coherent system.
The demonstration below makes the fan-out payoff concrete. It models a plan of independent sub-tasks, each a tool call with its own latency, and compares the plan two ways: run sequentially through one worker, its critical path is the sum of the latencies; fanned out to a pool of workers as a supervisor would dispatch them, its critical path is the slowest single sub-task. It reports the ideal fan-out speedup against the Amdahl ceiling, so the gap between independent-task ideal and the ceiling a partly-serial plan can reach is visible.
import random
random.seed(7)
# A plan of N independent subtasks the planner discovered. Latencies vary
# because tools differ (a vector search is fast, a code sandbox is slow).
# Each subtask is a tool call (retrieval, code exec, or an external API),
# and `secs` is its wall-clock latency, blocking on I/O rather than on CPU.
N = 8
plan = [(f"tool_{i}", round(random.uniform(0.05, 0.30), 3)) for i in range(N)]
W = N # one worker per independent subtask
# Sequential execution runs one subtask after another, so its critical path is
# the SUM of the latencies. Fanning the independent subtasks out to W workers
# runs them at once, so the parallel critical path is the MAX (the straggler).
serial_path = sum(secs for _, secs in plan)
parallel_path = max(secs for _, secs in plan)
ideal_speedup = serial_path / parallel_path # sum-over-max for independent work
# Amdahl-style ceiling: a real plan is part parallelizable (the fan-out) and
# part serial (the planner's reason->act->observe spine that cannot be skipped).
p = 0.85 # parallelizable fraction
amdahl_speedup = 1.0 / ((1.0 - p) + p / W) # theoretical ceiling at W workers
print(f"subtasks N : {N}")
print(f"workers W : {W}")
print(f"sum of subtask latency : {serial_path:.3f} s (critical path if serial)")
print(f"slowest single subtask : {parallel_path:.3f} s (critical path if parallel)")
print(f"ideal fan-out speedup : {ideal_speedup:.2f}x (independent sub-tasks)")
print(f"Amdahl ceiling (p={p}) : {amdahl_speedup:.2f}x at W={W} workers")
subtasks N : 8
workers W : 8
sum of subtask latency : 1.066 s (critical path if serial)
slowest single subtask : 0.213 s (critical path if parallel)
ideal fan-out speedup : 5.00x (independent sub-tasks)
Amdahl ceiling (p=0.85) : 3.90x at W=8 workers
Who: A platform engineer at a company shipping an LLM research assistant that answers analyst questions over internal documents and live data feeds.
Situation: A single ReAct agent answered each question by looping through retrieval, code execution, and API calls one tool at a time, often taking eight to twelve turns.
Problem: Median latency was forty seconds and the worst cases timed out, because the agent serialized sub-tasks that did not depend on each other, and any single tool failure restarted the whole loop.
Dilemma: Keep the simple single-agent loop and accept the latency, or move to a planner-executor design that runs independent sub-tasks in parallel but introduces a supervisor, a durable state store, and per-tool retry policies.
Decision: They adopted the planner-executor pattern on a durable orchestration engine, because profiling showed most questions decomposed into three to six independent retrieval and analysis sub-tasks.
How: A supervisor agent decomposed each question and fanned the independent sub-tasks out to a pool of worker agents on a Ray cluster; each tool call got a timeout, a bounded retry, and an idempotency key, and completed sub-tasks were checkpointed.
Result: Median latency fell from forty seconds to twelve, the timeout rate dropped by an order of magnitude because a single failed tool now retried in place instead of restarting the plan, and throughput rose because workers no longer idled waiting on each other.
Lesson: The win came from decomposing aggressively (raising the parallel fraction) and making each tool call independently survivable, not from a smarter model. Orchestration, not the LLM, was the bottleneck.
5. Shared Memory and Interoperation Protocols Advanced
Parallel workers are only useful if they can share what they learn. A researcher worker that finds a key fact must make it visible to the critic worker that checks the final answer, and the supervisor must see every worker's result to plan the next round. Agentic systems therefore need shared memory: a state store the agents read from and write to, ranging from a simple shared scratchpad in the supervisor's context, to a blackboard that workers post intermediate findings on, to a persistent vector store of accumulated knowledge that is itself the retrieval index of Section 40.4. This is the shared-state coordination of Chapter 32, and it carries the same hazards as any distributed shared state: concurrent writes need ordering, stale reads mislead, and unbounded growth of the shared context blows the token budget. The engine that owns the shared memory must enforce consistency exactly as the coordination layer of Chapter 2 demands.
For agents and tools to interoperate at all, they must agree on protocols. Two are consolidating the field. The Model Context Protocol (MCP) standardizes how an agent connects to tools and data sources: a tool server advertises its functions and resources over a uniform interface, and any MCP-speaking agent can call them without bespoke glue, turning the tool layer of Section 2 into a pluggable, discoverable ecosystem. Agent-to-Agent (A2A) protocols standardize the complementary direction: how one agent delegates a task to another, exchanges intermediate state, and reports results, so that a supervisor can fan work out to workers that may run in different processes, on different machines, or even at different organizations. Together MCP and A2A do for agentic systems what the communication primitives of Chapter 4 did for parallel training: they define the interfaces over which distributed intelligence is exchanged, so that the orchestration engine can treat agents and tools as interchangeable, network-addressable participants.
Chapter 1 named six axes of distribution and put "distribute intelligence" last, as the most speculative. This section is that axis built as a shipping product. The supervisor fanning sub-tasks out to workers is the same split-move-recombine move as the gradient all-reduce of Chapter 1; the durable engine is the fault tolerance of Chapter 2; the Amdahl ceiling on fan-out is the performance model of Chapter 3; MCP and A2A are the communication primitives of Chapter 4. An agentic application is a distributed system whose payload is reasoning, and every tool the book built for distributing computation applies, unchanged in spirit, to distributing intelligence.
The protocols that let agents and tools interoperate are crystallizing in real time. The Model Context Protocol, introduced by Anthropic in late 2024, has been adopted across multiple vendors as the de facto standard for connecting agents to tools and data, and a growing registry of MCP servers turns external services into plug-in tools. In parallel, agent-to-agent specifications (Google's A2A, proposed in 2025, among others) target cross-agent delegation and are converging on shared notions of agent cards, task lifecycles, and streaming results. On the orchestration side, durable-execution engines (Temporal-style workflows, and agent frameworks such as LangGraph that compile an agent graph onto a checkpointed runtime) are bringing decades of workflow-reliability engineering to multi-agent systems. The open research questions are sharp: how to bound the cost of ever-larger shared context, how to verify that a fanned-out plan's sub-results compose correctly, and how to secure a fleet in which agents call tools and other agents they did not write. We meet the orchestration internals in full in Chapter 32.
The slow worker that caps a synchronous all-reduce in Chapter 1 reappears here as the hung code sandbox that a supervisor waits on while three other workers sit idle, finished, twiddling their tokens. Same villain, new hat. The fix is the same too: bound its wait, speculate a replacement, and never let one slow participant set the pace for the whole fleet.
We have now turned a single language-model call into a fleet of cooperating agents and tool services: a loop that reasons and acts, tool calls dispatched as distributed RPCs, a supervisor that fans independent work out to parallel workers, a durable engine that keeps the plan alive through failure, shared memory that lets the workers learn from each other, and protocols that let any agent call any tool. Every one of those agents and tool calls ultimately bottoms out in a language-model forward pass, served from a distributed inference fleet. That fleet, the engine every agent in this section depends on, is the subject of Section 40.7.
For each goal, decide whether the work is mostly a sequential agent loop (Section 1) or mostly a parallel supervisor fan-out (Section 3), and estimate the parallelizable fraction $p$ you would assume: (a) "summarize these twelve uploaded PDFs into one brief"; (b) "debug this failing test by reading the stack trace, then the implicated file, then proposing a fix"; (c) "for each of our forty product pages, fetch its current price from the live API and flag mispricings". For the one you judge most parallel, use Amdahl's law to give the speedup ceiling as $W \to \infty$, and explain what part of the plan keeps it from being infinite.
Extend Code 40.6.2 so that one sub-task is a straggler (latency ten times the others) and one sub-task fails on its first attempt and must be retried once before succeeding (model the retry as adding that sub-task's latency a second time to its own path). Recompute the parallel critical path now and compare it to the no-straggler baseline. Then add a per-task timeout that abandons any sub-task exceeding a threshold and re-plans it as a fresh task, and report how the timeout changes the worst-case parallel path. Explain which line of the orchestration engine in Section 4 each of your additions corresponds to.
A supervisor can either run $n$ sub-tasks as one parallel fan-out, or run them as $n$ sequential turns of its own loop. Each fan-out adds a fixed supervisor overhead of $c$ seconds (decompose plus gather) on top of the parallel sub-task latency $\max_i \ell_i$, while the sequential path pays $\sum_i \ell_i$ with no fan-out overhead. Using $\mathbb{E}[T] = 1/q$ from Section 1 for the number of supervisor rounds, derive the condition on $c$, $n$, and the latency distribution under which fanning out is faster than sequential execution. State the regime (small $n$, similar latencies, large overhead) in which a planner is right to keep the work sequential, and connect your answer to the communication-versus-computation trade-off of Chapter 3.