Part VIII: Case Studies and Capstone Projects
Chapter 40: Distributed LLM and Agentic Applications

Distributed Agent Orchestration

"I am an Orchestrator, holding the plan while five workers run ahead. Two have already finished, one is retrying a flaky tool, one is blocked on a lock, and one has wandered off to call an API I never approved. This is, somehow, the calm part of my job."

An Orchestrator, Holding the Plan While Five Workers Run Ahead
Big Picture

A production agentic application is not one model answering one prompt; it is a fleet of cooperating agents and tool services coordinated like a distributed workflow, where the unit of work is a reasoning step and the unit of communication is a tool call. A single language model can plan, but it cannot retrieve a billion documents, run untrusted code in isolation, query a payments API, and do all three at once without help. The moment the work exceeds what one model call can do alone, the application splits into a planner that decides what to do, executors that act, tool services that the executors call as remote procedures, and an orchestration engine that keeps the whole thing durable, retryable, and consistent. This section takes the orchestration machinery built in Chapter 32 and applies it to a real agentic product, showing how distributing intelligence across many agents follows the same split-move-recombine discipline as distributing a gradient.

In the previous section we built the retrieval layer that an agent reaches into when it needs grounded facts. Now we step up a level and ask how the agent uses that retrieval, alongside code execution and external APIs, inside a multi-step reasoning loop, and how that loop becomes a distributed system once one agent is no longer enough. The progression mirrors the rest of the book. A single language model call is the scale-up baseline: capable, self-contained, and bounded by what fits in one context window and one forward pass. Once a task needs many steps, many tools, and work that can proceed in parallel, the application scales out into a coordinated set of agents and services. The discipline for doing that well is distributed agent orchestration, and it is the sixth axis of distribution from Chapter 1: distributing intelligence itself.

1. The Agent Loop: Plan, Act, Observe, Repeat Beginner

The smallest unit of agentic behavior is a loop, not a single call. A language model is given a goal and a set of available tools; it reasons about what to do next, emits an action (a tool call with arguments), receives the tool's result as an observation, and feeds that observation back into its context to decide the next action. This reason, act, observe cycle, popularized as the ReAct pattern, repeats until the model decides the goal is met and emits a final answer. The loop is what turns a static predictor into something that can gather information it did not start with, correct course when a tool returns an error, and chain several operations toward a goal no single call could reach.

Each turn of the loop is cheap to describe and expensive to run. The reasoning step is one LLM forward pass, served from the fleet of Section 40.7. The action step blocks on an external service whose latency the agent does not control: a vector search over the index of Section 40.4, a sandboxed code execution, or a third-party API. Because the model must see each observation before choosing the next action, the turns of a single agent's loop are inherently sequential, and the wall-clock time of the loop is the sum of its per-turn latencies. That sequential spine is exactly the part that orchestration tries to keep short, and the independent work hanging off it is exactly the part that orchestration tries to run in parallel.

Reason LLM forward pass Act emit tool call Retrieval (40.4) vector search Code sandbox isolated exec External API RPC over network Observe tool result Done? final answer
Figure 40.6.1: One agent's ReAct loop. The Reason step is an LLM forward pass; the Act step emits a tool call dispatched to retrieval, a code sandbox, or an external API; the Observe step feeds the result back. The cycle repeats until the model judges the goal met and exits with a final answer. The turns are sequential because each Reason needs the previous Observe.

We can put a number on how long such a loop runs. If the agent takes a random number of turns $T$ to reach its goal, and each turn that has not yet finished succeeds with probability $q$, then $T$ is geometric and the expected number of turns to completion is

$$\mathbb{E}[T] = \frac{1}{q}, \qquad \text{critical-path latency} \approx \mathbb{E}[T]\,\big(\ell_{\text{reason}} + \ell_{\text{act}}\big),$$

where $\ell_{\text{reason}}$ is the LLM latency per turn and $\ell_{\text{act}}$ is the tool latency per turn. The formula is deliberately blunt, but it captures the operational truth: a loop that needs ten turns at half a second each is a five-second user wait, and the only ways to shrink it are to reduce the number of turns (better planning) or to reduce per-turn latency (faster tools and a faster model). When the per-turn tool work is itself splittable across machines, a third lever appears, and that lever is the subject of the rest of this section.

Key Insight: A Tool Call Is a Distributed RPC With a Natural-Language Caller

The action step of the agent loop is a remote procedure call. The arguments happen to be chosen by a language model rather than hand-written code, but everything downstream is ordinary distributed-systems engineering: the call crosses a network boundary, it can time out, it can fail and need a retry, it can run concurrently with other calls, and its result must be marshaled back into the caller's state. Treating tool calls as RPCs, with the timeouts, idempotency keys, and circuit breakers that RPCs have always needed, is what separates a demo agent from a production one. The model picks the calls; the orchestration engine makes them survivable.

2. Tools as Distributed Services Beginner

Function calling, the mechanism by which a model emits a structured request to invoke a named tool with typed arguments, is the interface through which an agent reaches the rest of the system. Each tool is a service with its own latency, failure profile, and scaling story. The retrieval tool of Section 40.4 is a sharded vector index that answers in tens of milliseconds and scales by adding index replicas. A code-execution tool is a pool of sandboxed workers, each an isolated container that runs untrusted model-written code and is torn down afterward; it scales by adding sandbox capacity and is the slowest and riskiest tool in the kit. An external API, a weather service or a payments gateway, is a tool the application does not own at all, with rate limits and an availability budget set by someone else.

Because the tools are independent services, they fail and scale independently, and the orchestration layer must hold each to its own contract. A retrieval timeout should not abort a plan that could proceed with a cached result; a payments call must never be retried blindly, because a duplicate charge is worse than a failure; a code sandbox that hangs must be killed and its turn re-planned rather than waited on forever. These are the coordination and failure concerns of Chapter 2, now wearing the costume of an agent's toolbox. The agent chooses which tool to call; the engineering choice is how each tool behaves when it is slow, absent, or returns garbage.

Library Shortcut: A Tool-Calling Loop in a Dozen Lines

Writing the parse-the-tool-call, dispatch, append-the-observation machinery by hand is tedious and error-prone. Modern SDKs expose function calling as a typed loop: you declare the tools, hand the model the goal, and the SDK surfaces each requested call for you to execute and return. The skeleton below is the entire ReAct spine of Figure 40.6.1.

tools = [retrieve_tool, run_code_tool, call_api_tool]   # each a typed schema

messages = [{"role": "user", "content": goal}]
while True:
    reply = model.run(messages, tools=tools)            # the Reason step
    if reply.stop_reason != "tool_use":                 # model judged it Done
        break
    for call in reply.tool_calls:                       # the Act step (may be many)
        result = dispatch(call.name, call.arguments)    # an RPC to a tool service
        messages.append(tool_result(call.id, result))   # the Observe step
Code 40.6.1: The agent loop as exposed by a function-calling SDK. The roughly forty lines of hand-rolled parsing and state threading collapse to this loop; the SDK handles the schema validation, the tool-call extraction, and the message bookkeeping. When reply.tool_calls holds several calls at once, the dispatch can run them in parallel, which Section 4 turns into the supervisor fan-out.

3. Planner and Executor: Fanning Work Out to Role-Specialized Agents Intermediate

One agent looping over tools handles a surprising amount, but it serializes everything, and many goals decompose into independent pieces. A research request might need five sources summarized, a coding task might touch four files, a data pipeline might validate six tables. The production pattern, built in depth in Chapter 32, splits the single agent into roles: a planner (or supervisor) decomposes the goal into sub-tasks and decides their dependencies, and a set of executor agents, each possibly specialized for a role such as researcher, coder, or critic, carry out the sub-tasks. The supervisor then gathers the executors' results and either finishes or plans another round. This is a map-then-reduce over reasoning, with the supervisor playing coordinator and the executors playing workers, the same shape as the all-reduce of Chapter 1 but with intelligence rather than gradients as the payload.

Supervisor / Planner decompose, dispatch, gather Worker: Researcher sub-task A Worker: Coder sub-task B Worker: Analyst sub-task C Worker: Critic sub-task D retrieval code sandbox data API retrieval Gather / Reduce
Figure 40.6.2: Supervisor fan-out. The planner decomposes the goal into independent sub-tasks A through D, dispatches each to a role-specialized worker agent, and each worker runs its own ReAct loop calling its own tool services. Results gather into a reduce step that returns to the supervisor, which decides whether to finish or plan another round. The fan-out is where parallelism lives.

Whether this fan-out actually saves time depends on how much of the plan is genuinely independent. If a fraction $p$ of the total work can run in parallel across $W$ worker agents and the remaining fraction $1-p$ is the irreducibly sequential planning spine, the speedup is bounded by Amdahl's law from Chapter 3,

$$S(W) = \frac{1}{(1-p) + \dfrac{p}{W}}, \qquad \lim_{W \to \infty} S(W) = \frac{1}{1-p}.$$

The ceiling is sobering: even with infinitely many workers, a plan that is $15\%$ sequential can never run more than about $6.7$ times faster than serial execution. The lesson for agent design is the same as for parallel training: the win comes from making the parallel fraction large (decompose aggressively into independent sub-tasks) and the sequential spine short (fewer supervisor rounds), not from throwing more workers at a plan that is mostly serial. For independent sub-tasks with per-task latencies $\ell_1, \dots, \ell_n$, sequential execution costs $\sum_i \ell_i$ while parallel execution costs only $\max_i \ell_i$, so the speedup is the sum-over-max ratio, and a single slow sub-task (a hung sandbox) caps the gain exactly as a straggler caps a synchronous training step.

4. The Orchestration Engine: Durable Execution, Retries, and State Intermediate

Fan-out is easy to draw and hard to run reliably, because the moment work is spread across many agents and tool calls, partial failure becomes the normal case. A worker crashes mid-plan, a tool times out, the process serving the supervisor is preempted. A toy script loses everything; a production orchestration engine does not, because it treats the plan as durable state. Each completed sub-task is checkpointed, so a restart resumes from the last good step rather than re-running the whole plan; each tool call carries a retry policy and an idempotency guarantee, so a transient failure is retried without duplicating side effects; and the engine tracks which sub-tasks are ready (dependencies satisfied), running, or done, scheduling the ready ones onto available workers. These are precisely the fault-tolerance and coordination primitives of Chapter 2, applied to a workflow whose steps happen to be acts of reasoning.

The runtime substrate underneath is a distributed-execution framework. Chapter 33 develops Ray as exactly this kind of substrate: it places each agent and tool worker as a task or actor across the cluster, moves the results between them, and restarts the ones that die, so that the orchestration logic can speak in terms of sub-tasks and dependencies while the framework handles process placement, fault recovery, and state transport. The agentic application thus inherits the cluster-scheduling story of the rest of the book: agents are scheduled work, tool services are scaled pools, and the engine is the coordinator that keeps a fleet of cooperating reasoners behaving like one coherent system.

The demonstration below makes the fan-out payoff concrete. It models a plan of independent sub-tasks, each a tool call with its own latency, and compares the plan two ways: run sequentially through one worker, its critical path is the sum of the latencies; fanned out to a pool of workers as a supervisor would dispatch them, its critical path is the slowest single sub-task. It reports the ideal fan-out speedup against the Amdahl ceiling, so the gap between independent-task ideal and the ceiling a partly-serial plan can reach is visible.

import random

random.seed(7)

# A plan of N independent subtasks the planner discovered. Latencies vary
# because tools differ (a vector search is fast, a code sandbox is slow).
# Each subtask is a tool call (retrieval, code exec, or an external API),
# and `secs` is its wall-clock latency, blocking on I/O rather than on CPU.
N = 8
plan = [(f"tool_{i}", round(random.uniform(0.05, 0.30), 3)) for i in range(N)]
W = N                                          # one worker per independent subtask

# Sequential execution runs one subtask after another, so its critical path is
# the SUM of the latencies. Fanning the independent subtasks out to W workers
# runs them at once, so the parallel critical path is the MAX (the straggler).
serial_path   = sum(secs for _, secs in plan)
parallel_path = max(secs for _, secs in plan)
ideal_speedup = serial_path / parallel_path    # sum-over-max for independent work

# Amdahl-style ceiling: a real plan is part parallelizable (the fan-out) and
# part serial (the planner's reason->act->observe spine that cannot be skipped).
p = 0.85                                        # parallelizable fraction
amdahl_speedup = 1.0 / ((1.0 - p) + p / W)      # theoretical ceiling at W workers

print(f"subtasks N              : {N}")
print(f"workers  W              : {W}")
print(f"sum of subtask latency  : {serial_path:.3f} s  (critical path if serial)")
print(f"slowest single subtask  : {parallel_path:.3f} s  (critical path if parallel)")
print(f"ideal fan-out speedup    : {ideal_speedup:.2f}x  (independent sub-tasks)")
print(f"Amdahl ceiling (p={p})  : {amdahl_speedup:.2f}x  at W={W} workers")
Code 40.6.2: A planner fanning $N$ independent sub-tasks out to $W$ parallel workers versus running them sequentially. The two critical paths follow from the latency model directly: the serial path is the sum of the sub-task latencies, the parallel path is the slowest single sub-task, and their ratio is the ideal fan-out speedup.
subtasks N              : 8
workers  W              : 8
sum of subtask latency  : 1.066 s  (critical path if serial)
slowest single subtask  : 0.213 s  (critical path if parallel)
ideal fan-out speedup    : 5.00x  (independent sub-tasks)
Amdahl ceiling (p=0.85)  : 3.90x  at W=8 workers
Output 40.6.2: The fully parallel plan finishes in the time of its slowest sub-task ($0.213$ s) rather than the sum of all eight ($1.066$ s), an ideal $5.00\times$ speedup. Because these sub-tasks are fully independent the ratio exceeds the $3.90\times$ Amdahl ceiling computed for a plan that is only $85\%$ parallelizable: the gap is exactly the cost of the sequential planning spine that a real supervisor cannot avoid.
Practical Example: The Research Assistant That Stopped Timing Out

Who: A platform engineer at a company shipping an LLM research assistant that answers analyst questions over internal documents and live data feeds.

Situation: A single ReAct agent answered each question by looping through retrieval, code execution, and API calls one tool at a time, often taking eight to twelve turns.

Problem: Median latency was forty seconds and the worst cases timed out, because the agent serialized sub-tasks that did not depend on each other, and any single tool failure restarted the whole loop.

Dilemma: Keep the simple single-agent loop and accept the latency, or move to a planner-executor design that runs independent sub-tasks in parallel but introduces a supervisor, a durable state store, and per-tool retry policies.

Decision: They adopted the planner-executor pattern on a durable orchestration engine, because profiling showed most questions decomposed into three to six independent retrieval and analysis sub-tasks.

How: A supervisor agent decomposed each question and fanned the independent sub-tasks out to a pool of worker agents on a Ray cluster; each tool call got a timeout, a bounded retry, and an idempotency key, and completed sub-tasks were checkpointed.

Result: Median latency fell from forty seconds to twelve, the timeout rate dropped by an order of magnitude because a single failed tool now retried in place instead of restarting the plan, and throughput rose because workers no longer idled waiting on each other.

Lesson: The win came from decomposing aggressively (raising the parallel fraction) and making each tool call independently survivable, not from a smarter model. Orchestration, not the LLM, was the bottleneck.

5. Shared Memory and Interoperation Protocols Advanced

Parallel workers are only useful if they can share what they learn. A researcher worker that finds a key fact must make it visible to the critic worker that checks the final answer, and the supervisor must see every worker's result to plan the next round. Agentic systems therefore need shared memory: a state store the agents read from and write to, ranging from a simple shared scratchpad in the supervisor's context, to a blackboard that workers post intermediate findings on, to a persistent vector store of accumulated knowledge that is itself the retrieval index of Section 40.4. This is the shared-state coordination of Chapter 32, and it carries the same hazards as any distributed shared state: concurrent writes need ordering, stale reads mislead, and unbounded growth of the shared context blows the token budget. The engine that owns the shared memory must enforce consistency exactly as the coordination layer of Chapter 2 demands.

For agents and tools to interoperate at all, they must agree on protocols. Two are consolidating the field. The Model Context Protocol (MCP) standardizes how an agent connects to tools and data sources: a tool server advertises its functions and resources over a uniform interface, and any MCP-speaking agent can call them without bespoke glue, turning the tool layer of Section 2 into a pluggable, discoverable ecosystem. Agent-to-Agent (A2A) protocols standardize the complementary direction: how one agent delegates a task to another, exchanges intermediate state, and reports results, so that a supervisor can fan work out to workers that may run in different processes, on different machines, or even at different organizations. Together MCP and A2A do for agentic systems what the communication primitives of Chapter 4 did for parallel training: they define the interfaces over which distributed intelligence is exchanged, so that the orchestration engine can treat agents and tools as interchangeable, network-addressable participants.

Thesis Thread: The Sixth Axis, Made Concrete

Chapter 1 named six axes of distribution and put "distribute intelligence" last, as the most speculative. This section is that axis built as a shipping product. The supervisor fanning sub-tasks out to workers is the same split-move-recombine move as the gradient all-reduce of Chapter 1; the durable engine is the fault tolerance of Chapter 2; the Amdahl ceiling on fan-out is the performance model of Chapter 3; MCP and A2A are the communication primitives of Chapter 4. An agentic application is a distributed system whose payload is reasoning, and every tool the book built for distributing computation applies, unchanged in spirit, to distributing intelligence.

Research Frontier: Standardizing Agent Interoperation (2024 to 2026)

The protocols that let agents and tools interoperate are crystallizing in real time. The Model Context Protocol, introduced by Anthropic in late 2024, has been adopted across multiple vendors as the de facto standard for connecting agents to tools and data, and a growing registry of MCP servers turns external services into plug-in tools. In parallel, agent-to-agent specifications (Google's A2A, proposed in 2025, among others) target cross-agent delegation and are converging on shared notions of agent cards, task lifecycles, and streaming results. On the orchestration side, durable-execution engines (Temporal-style workflows, and agent frameworks such as LangGraph that compile an agent graph onto a checkpointed runtime) are bringing decades of workflow-reliability engineering to multi-agent systems. The open research questions are sharp: how to bound the cost of ever-larger shared context, how to verify that a fanned-out plan's sub-results compose correctly, and how to secure a fleet in which agents call tools and other agents they did not write. We meet the orchestration internals in full in Chapter 32.

Fun Note: The Straggler Has a New Costume

The slow worker that caps a synchronous all-reduce in Chapter 1 reappears here as the hung code sandbox that a supervisor waits on while three other workers sit idle, finished, twiddling their tokens. Same villain, new hat. The fix is the same too: bound its wait, speculate a replacement, and never let one slow participant set the pace for the whole fleet.

We have now turned a single language-model call into a fleet of cooperating agents and tool services: a loop that reasons and acts, tool calls dispatched as distributed RPCs, a supervisor that fans independent work out to parallel workers, a durable engine that keeps the plan alive through failure, shared memory that lets the workers learn from each other, and protocols that let any agent call any tool. Every one of those agents and tool calls ultimately bottoms out in a language-model forward pass, served from a distributed inference fleet. That fleet, the engine every agent in this section depends on, is the subject of Section 40.7.

Exercise 40.6.1: Sequential Spine or Parallel Fan-Out? Conceptual

For each goal, decide whether the work is mostly a sequential agent loop (Section 1) or mostly a parallel supervisor fan-out (Section 3), and estimate the parallelizable fraction $p$ you would assume: (a) "summarize these twelve uploaded PDFs into one brief"; (b) "debug this failing test by reading the stack trace, then the implicated file, then proposing a fix"; (c) "for each of our forty product pages, fetch its current price from the live API and flag mispricings". For the one you judge most parallel, use Amdahl's law to give the speedup ceiling as $W \to \infty$, and explain what part of the plan keeps it from being infinite.

Exercise 40.6.2: Add a Straggler and a Retry Coding

Extend Code 40.6.2 so that one sub-task is a straggler (latency ten times the others) and one sub-task fails on its first attempt and must be retried once before succeeding (model the retry as adding that sub-task's latency a second time to its own path). Recompute the parallel critical path now and compare it to the no-straggler baseline. Then add a per-task timeout that abandons any sub-task exceeding a threshold and re-plans it as a fresh task, and report how the timeout changes the worst-case parallel path. Explain which line of the orchestration engine in Section 4 each of your additions corresponds to.

Exercise 40.6.3: Cost of a Round Versus a Fan-Out Analysis

A supervisor can either run $n$ sub-tasks as one parallel fan-out, or run them as $n$ sequential turns of its own loop. Each fan-out adds a fixed supervisor overhead of $c$ seconds (decompose plus gather) on top of the parallel sub-task latency $\max_i \ell_i$, while the sequential path pays $\sum_i \ell_i$ with no fan-out overhead. Using $\mathbb{E}[T] = 1/q$ from Section 1 for the number of supervisor rounds, derive the condition on $c$, $n$, and the latency distribution under which fanning out is faster than sequential execution. State the regime (small $n$, similar latencies, large overhead) in which a planner is right to keep the work sequential, and connect your answer to the communication-versus-computation trade-off of Chapter 3.