Section 32.4: Parallel and Distributed Multi-Agent Workflows

"I dispatched five researchers at once, told them to report back, and then stood at the barrier doing absolutely nothing until the slowest one returned. That wait, it turns out, was the whole job."
An Aggregator Agent at a Fan-In Barrier

Big Picture

A multi-agent workflow is a directed acyclic graph of LLM calls, and orchestrating it is the same scheduling problem as orchestrating a data pipeline: lay out the dependencies, run independent steps concurrently, synchronize at the joins, and accept that the longest serial chain bounds how fast the whole thing can finish. The previous section split one task across role-specialized agents; this section is about the shape of the computation that connects them. Four shapes recur: a sequential chain where each agent feeds the next, a parallel fan-out/fan-in where independent subtasks run at once and an aggregator combines them, conditional routing where a router agent dispatches to the right specialist, and loops that iterate until a condition holds. Each agent call is an expensive LLM request, so the choice of shape is a direct negotiation among latency, cost, and quality. We build a fan-out/fan-in workflow from scratch, measure the real wall-clock speedup of running the agents concurrently, and show that the speedup stops exactly where the critical path predicts.

In the previous section we gave each agent a role: a planner that decomposes the task, executors that carry out the steps, a critic that checks the result. Roles answer "who does what". This section answers "in what order, and what can happen at the same time". The answer is a graph. The planner's decomposition is not a flat list; it is a dependency structure in which some subtasks must wait for others and some are completely independent. Drawing that structure as a directed acyclic graph (DAG) of agent steps turns a vague picture of agents talking to each other into a concrete object we can schedule, parallelize, and reason about quantitatively. This is the genuinely distributed core of the chapter, because the moment two agent calls are independent, they can and should run on two machines at once.

The framing is not a metaphor borrowed loosely from systems work; it is the identical problem. A Spark job is a DAG of stages where each stage runs in parallel across partitions and shuffles synchronize the boundaries (Section 7.4); an ML training pipeline is a DAG of steps that a scheduler executes respecting data dependencies (Section 26.2). Swap the nodes from "transform a partition" to "call an LLM agent" and every scheduling lesson carries over. What changes is the unit cost: a node here is a multi-second, multi-cent network request to a model, not a microsecond of vectorized arithmetic, which makes the parallelism worth chasing and the critical path worth shortening.

Figure 32.4.1: The four workflow topologies of multi-agent systems, each a fragment of a DAG of LLM calls. A sequential chain passes output forward; a parallel fan-out/fan-in runs independent agents at once and joins at an aggregator (the critical path, in red, is the plan step plus a single agent plus the aggregator, not the sum of all three agents); a router dispatches by content to one specialist; a loop iterates until a quality condition holds. The fan-out is the source of speedup and the focus of this section.

1. The Agent Workflow as a Computation Graph Beginner

Model the workflow as a DAG $G = (V, E)$. Each node $v \in V$ is one agent step, almost always one LLM call: an input prompt plus the model's response. Each edge $(u, v) \in E$ means node $v$ consumes node $u$'s output, so $v$ cannot start until $u$ finishes. The graph is acyclic for the planning portion, which is what lets us talk about a well-defined order; loops, when present, are bounded iterations layered on top, and we treat them separately in Section 2. Once the workflow is a DAG, every question we care about has a precise answer. Which steps can run at once? Any set of nodes with no path between them. When must we wait? At any node whose inputs come from more than one predecessor. How fast can the whole thing finish, given unlimited machines? The length of the longest path from a source to a sink, which we make formal below.

The four topologies in Figure 32.4.1 are the building blocks from which real workflows are assembled. A sequential chain is a path: agent one's output is agent two's input, a pipeline where each stage refines the last (extract, then summarize, then translate). A parallel fan-out/fan-in is a node that branches into several independent children which later rejoin at a single successor: a planner spawns three researchers, each investigates a different subtopic with no dependence on the others, and an aggregator waits for all three and writes the synthesis. This is the map-reduce of agents, and the resemblance is exact: the fan-out is the map (independent work per shard), the aggregator is the reduce (combine the partial results), and the barrier between them is the same shuffle boundary that Section 6.2 builds the entire MapReduce model around.

Thesis Thread: Map-Reduce Returns, Now Over Agents

The fan-out/fan-in pattern is map-reduce wearing new clothes. In Section 6.2 the map phase ran a pure function over each input shard in parallel and the reduce phase combined per-key partials; here the "map" is an independent LLM agent over each subtopic and the "reduce" is an aggregator agent that synthesizes their outputs. The same primitive that organized terabyte log processing now organizes a panel of researchers. The constant that changed is unit cost: a map task was microseconds of arithmetic, an agent task is seconds and cents of model inference, which is exactly why moving the independent work off the serial path matters so much more here. When you meet a fan-out of agents, you are looking at map-reduce scaled out one more time, from data records to reasoning steps.

The third topology is conditional routing: a lightweight router agent inspects the input and dispatches it to the appropriate specialist, so a billing question goes to the billing agent and a code question goes to the code agent. Only the chosen branch executes, which keeps cost down and lets each specialist carry a focused prompt and toolset. This is the agent-level cousin of the content-aware request routing that serving systems already do: a model server that sends each request to the replica holding the right cached prefix or the right expert shard is solving the same dispatch-by-content problem we met in distributed inference (Section 23.2). The fourth topology, the loop, runs an agent (or a small sub-DAG) repeatedly until a condition is met: draft, check, revise, check again, stop when the checker is satisfied or a step budget is exhausted. We will see loops in their fullest form as debate and reflection in Section 32.5; here they matter as the one structure that deliberately reintroduces serial dependence, and therefore lengthens the critical path on purpose.

2. Why Parallelism Pays, and What Bounds It Intermediate

Independent agent calls are embarrassingly parallel. When three researchers investigate three unrelated subtopics, running them one after another is a choice, not a requirement, and it is almost always the wrong choice. Each call is dominated by latency: the round trip to the model plus the token-by-token generation, typically seconds. If the three calls each take time $t$, running them serially costs $3t$ while running them concurrently costs about $t$, and the saving grows linearly with the fan-out width. Because each call is also an expensive, billable request, the structure of the graph decides not just how fast the workflow finishes but how much it costs and, through the number of agents and steps you can afford, how good the answer gets. More agents and more iterations generally raise quality, and they always raise cost and (along the serial chain) latency. That three-way tension among latency, cost, and quality is the scaling-efficiency theme of Chapter 3 reappearing at the agent layer: adding workers helps only up to the point where the part you cannot parallelize dominates.

That point has a name. Let the workflow have a portion that must run serially (the planning step, the final aggregation, any loop) taking total time $T_s$, and a portion that is perfectly parallel across $K$ agents taking $T_p$ when run on one machine. The wall-clock time on $K$ machines is

$$T(K) = T_s + \frac{T_p}{K}, \qquad \text{speedup}\;\; S(K) = \frac{T_s + T_p}{T_s + T_p/K} \;\xrightarrow[K \to \infty]{}\; \frac{T_s + T_p}{T_s}.$$

This is Amdahl's law (Section 3.5), and for agent workflows it has a vivid interpretation: the serial fraction is the critical path of the DAG, the longest chain of dependent agent calls from input to output. No amount of parallelism shortens it, because each call on it waits for the previous one's output. In the fan-out of Figure 32.4.1, the critical path is plan, then one researcher, then aggregate, three calls, regardless of whether there are three researchers or thirty. Widening the fan-out adds quality and cost but cannot push the latency below that three-call floor. Narrowing the critical path, by contrast (a faster planner, a cheaper aggregation, fewer loop iterations), lowers the floor for everyone. This is why the critical path, not the agent count, is the number to watch.

Key Insight: The Critical Path Is the Latency, the Graph Is the Cost

Two different quantities of a multi-agent workflow are set by two different features of its DAG. End-to-end latency is governed by the critical path, the longest chain of dependent LLM calls; parallel fan-out hides everything off that path but can never shorten the path itself. Total cost is governed by the whole graph, every node bills regardless of when it runs, so a wide fan-out that finishes fast is cheap in time and expensive in money. Optimize latency by shortening the critical path (fewer serial steps, faster models on the path, bounded loops); optimize cost by pruning nodes (conditional routing so only one branch runs, caching, smaller models off the path). Confusing the two leads to the classic mistake of adding agents to "speed things up" and getting a bigger bill at the same latency.

A synchronization point makes this concrete. The aggregator in a fan-in is a barrier: it cannot begin until every parallel agent has returned, so its start time is the maximum of the agents' finish times, not the average. One slow agent, a straggler, holds the barrier and therefore the whole workflow, exactly the straggler problem that haunts synchronous data parallelism (Section 3.5). The remedies are familiar in spirit: cap each agent with a timeout, return partial results when one branch fails, or speculatively re-issue a lagging call. An orchestrator that schedules a DAG of LLM calls is, structurally, the same scheduler that runs a Spark stage graph or an MLOps pipeline (Section 26.2); it tracks which nodes are ready, dispatches the ready ones to free workers, and blocks barriers until their inputs arrive.

3. Measuring the Speedup From Scratch Intermediate

The claim that fan-out parallelism collapses a workflow to its critical path is easy to state and easy to test. The code below builds a small fan-out/fan-in workflow with eight independent researcher agents and one aggregator. The LLM call is stubbed by a function that sleeps for a fixed latency, which is faithful in the way that matters: a real agent call is latency-bound and spends almost all its wall-clock waiting on the network and the model, so concurrent calls overlap their waits. We run the identical DAG two ways, sequentially and with a thread pool, and report the measured speedup against the critical-path floor of two call-latencies (one researcher layer plus the aggregator).

import time
from concurrent.futures import ThreadPoolExecutor

LLM_LATENCY = 0.30  # seconds per stubbed LLM call (I/O-bound, releases the GIL)

def llm_call(prompt):
    """Stub LLM request: latency-dominated, returns a short string."""
    time.sleep(LLM_LATENCY)
    return f"notes[{prompt}]"

def researcher(subtopic):
    return llm_call(subtopic)

def aggregator(notes):
    return llm_call("summarize(" + "+".join(notes) + ")")

SUBTOPICS = [f"subtopic-{i}" for i in range(8)]

def run_sequential():
    notes = [researcher(s) for s in SUBTOPICS]   # one researcher at a time
    return aggregator(notes)                       # then the fan-in

def run_parallel():
    with ThreadPoolExecutor(max_workers=len(SUBTOPICS)) as pool:
        notes = list(pool.map(researcher, SUBTOPICS))  # fan-out, concurrent
    return aggregator(notes)                            # fan-in barrier

def timed(fn):
    t0 = time.perf_counter(); fn(); return time.perf_counter() - t0

n = len(SUBTOPICS)
seq, par = timed(run_sequential), timed(run_parallel)
serial_floor = 2 * LLM_LATENCY   # fan-out layer + aggregator on the critical path
print(f"agents (fan-out width)   : {n}")
print(f"sequential wall-clock    : {seq:.2f} s")
print(f"parallel wall-clock      : {par:.2f} s")
print(f"speedup                  : {seq / par:.2f}x")
print(f"critical-path floor      : {serial_floor:.2f} s  (fan-out layer + aggregator)")
print(f"parallel within          : {par - serial_floor:+.2f} s of the floor")

Code 32.4.1: A fan-out/fan-in agent workflow built with nothing but the standard library. The researchers run concurrently in a thread pool (the fan-out), then the aggregator runs once on all their outputs (the fan-in barrier). The same DAG is timed serially and in parallel.

agents (fan-out width)   : 8
sequential wall-clock    : 2.70 s
parallel wall-clock      : 0.60 s
speedup                  : 4.48x
critical-path floor      : 0.60 s  (fan-out layer + aggregator)
parallel within          : +0.00 s of the floor

Output 32.4.1: Sequential execution costs nine call-latencies (eight researchers plus the aggregator); parallel execution costs exactly the two-call critical path. The parallel run lands within rounding of the floor, and the speedup stops there, precisely as Amdahl's law predicts for a serial fraction of two calls out of nine.

The sequential run takes nine call-latencies, $9 \times 0.30 = 2.70$ seconds, because nothing overlaps. The parallel run takes the critical-path floor of $0.60$ seconds, the eight researchers collapsing into a single $0.30$-second layer plus the $0.30$-second aggregator, and it does not go below that no matter how many threads we give it. The measured $4.48\times$ speedup is exactly $T(1)/T(K)$ for this DAG: $2.70 / 0.60 = 4.5$. Note what would happen if we doubled the fan-out to sixteen researchers: the sequential time would climb toward seventeen call-latencies while the parallel time would stay pinned at two, so the speedup would grow, but the parallel latency would not improve at all, because the critical path is unchanged. That is the entire trade-off in one experiment. The aggregator's dependence on every researcher is the barrier; the floor of $0.60$ seconds is the Amdahl limit; widening the fan-out buys throughput and quality, never lower latency.

Fun Note: The Committee That Met Sequentially

Picture a research committee where, instead of everyone reading their assigned chapter over the same weekend, member two refuses to start until member one has finished, member three until member two is done, and so on down a table of eight. The report is identical to the parallel version; it just arrives eight weekends late. Running agents serially when their tasks are independent is exactly this committee. The thread pool in Code 32.4.1 is the simple instruction the committee never received: read at the same time, then meet once.

Library Shortcut: LangGraph Schedules the DAG and the Fan-Out for You

Code 32.4.1 wired the fan-out and the barrier by hand. In a real system you declare the graph and let an orchestration framework schedule it, run independent nodes concurrently, enforce the join, and handle retries and timeouts. LangGraph models the workflow as an explicit graph of nodes and edges; nodes with no dependency between them are dispatched in parallel and a downstream node waits on all of its inputs automatically:

from langgraph.graph import StateGraph, START, END

g = StateGraph(dict)
for i in range(8):                                   # eight independent researchers
    g.add_node(f"research_{i}", make_researcher(i))
    g.add_edge(START, f"research_{i}")               # fan-out: all start together
    g.add_edge(f"research_{i}", "aggregate")         # fan-in: aggregate waits for all
g.add_node("aggregate", aggregator_node)
g.add_edge("aggregate", END)

app = g.compile()                                    # scheduler runs the DAG, parallel where it can
result = app.invoke({"topic": "scale-out AI"})

Code 32.4.2: The same fan-out/fan-in as Output 32.4.1, declared as a LangGraph DAG. The roughly fifteen lines of manual thread-pool and barrier logic collapse to edge declarations; the framework handles concurrent dispatch of the independent researchers, the join at aggregate, and the retry and timeout policy that a hand-rolled pool would force you to write.

Practical Example: The Market-Research Brief That Outgrew Its Chain

Who: An applied-AI engineer building an automated competitor-analysis tool at a consulting firm.

Situation: The tool produced a brief by chaining seven agents: profile each of six competitors in turn, then write a comparison. Each agent call took roughly twelve seconds.

Problem: A single brief took about ninety seconds end to end, and analysts ran dozens a day, so the latency made the tool feel sluggish and capped how many briefs the team could request.

Dilemma: Keep the simple sequential chain that was easy to debug, or restructure the six competitor profiles as a parallel fan-out, which meant adding a synchronization barrier, timeout handling for a slow profile, and partial-result logic if one competitor's data was missing.

Decision: They restructured to fan-out/fan-in, because the six competitor profiles were genuinely independent (no profile needed another) and only the final comparison depended on all of them.

How: They declared the six profilers as parallel nodes feeding one comparison node in an orchestration graph, set a fifteen-second per-agent timeout, and configured the comparison node to proceed with whatever profiles returned.

Result: End-to-end latency fell from about ninety seconds to roughly twenty-four, the critical path of one profiler plus the comparison, matching the two-call floor of Output 32.4.1. Total token cost was unchanged (the same seven calls still ran); only the wall-clock dropped, because the six profiles now overlapped.

Lesson: When subtasks are independent, the chain is leaving latency on the table for free. Fan-out does not reduce cost (every node still bills), but it collapses latency to the critical path, and the critical path is what the user feels.

4. Composing the Topologies Into Real Workflows Advanced

Real systems nest these shapes. A customer-support workflow might begin with conditional routing (classify the ticket, dispatch to a specialist), expand into a fan-out (the specialist spawns parallel lookups against billing, shipping, and account-history tools), collapse at a fan-in (an agent synthesizes the lookups into a draft reply), and finish with a loop (a critic checks the draft for policy compliance and sends it back for revision until it passes). The composite is still a DAG with bounded loops, and its critical path is still the longest chain of dependent calls: route, one lookup, synthesize, and however many critic iterations the loop runs. Every design decision is a move on that path. Routing keeps the fan-out narrow by sending only relevant work; the fan-out hides the lookups behind their slowest member; the loop, deliberately, lengthens the path in exchange for quality, which is why bounding its iterations is not optional.

The orchestrator's job across all of this is unchanged from the data-pipeline schedulers of Part II: maintain the readiness of each node, dispatch ready nodes to available workers, respect barriers, and bound the loops. What is new at the agent layer is that nodes can fail in soft ways a Spark task cannot. An agent may return a malformed answer, hallucinate a tool call, or simply produce something the downstream node cannot use, so the orchestrator needs validation and retry at the node level, not only crash recovery. That richer failure model, and the question of how agents check each other's work, is the natural next subject: the loop topology, taken seriously, becomes debate, critique, and reflection across agents, which is where Section 32.5 goes.

Research Frontier: Graph-Structured Agent Orchestration (2024 to 2026)

The move from ad-hoc agent chains to explicit, schedulable graphs is the defining shift of this period. LangGraph (2024) made the workflow a first-class stateful graph with conditional edges, parallel branches, and persistence, giving orchestration the same DAG vocabulary that data systems have used for years. OpenAI's Swarm and its successor Agents SDK (2024 to 2025) formalized handoffs, the conditional-routing topology, as a primitive. Microsoft's AutoGen and the surrounding work on multi-agent conversation patterns push on the fan-out and debate structures, while research systems such as DSPy compile declarative agent programs into optimized call graphs. A live question is automatic critical-path reduction: given a declared agent DAG, can the runtime reorder, batch, parallelize, and cache calls to minimize latency and cost without the developer hand-tuning the graph, the agent-orchestration analogue of the query optimizers that already plan Spark and SQL DAGs. We return to the engines that execute these graphs at scale in Section 32.8.

Exercise 32.4.1: Draw the Critical Path Conceptual

A travel-planning workflow has these agent steps with dependencies: a planner (no input deps); three parallel searchers for flights, hotels, and activities (each depends only on the planner); a budget agent that depends on all three searchers; and a writer that depends on the budget agent. Each call takes one unit of time. (a) Draw the DAG. (b) Identify the critical path and give its length. (c) State the wall-clock time on unlimited machines and on a single machine, and compute the maximum possible speedup. (d) Explain why adding a fourth searcher (say, for restaurants, also depending only on the planner) raises the cost but not the latency.

Exercise 32.4.2: Add a Straggler and a Timeout Coding

Modify Code 32.4.1 so that one of the eight researchers is a straggler whose latency is five times the others (give that single call a longer sleep). First measure the parallel wall-clock and confirm that the fan-in barrier is now governed by the straggler, not the fast agents. Then wrap each researcher call with a timeout (using concurrent.futures futures and result(timeout=...)), have the aggregator proceed with whatever returned in time, and measure the new wall-clock. Report the latency-versus-completeness trade-off: how much faster did the timeout make the workflow, and how many researcher results were dropped?

Exercise 32.4.3: When Does Fan-Out Stop Helping? Analysis

Using the Amdahl form $T(K) = T_s + T_p/K$, model a workflow whose serial portion is a planner plus an aggregator (two call-latencies) and whose parallel portion is $K$ researcher calls of one latency each, so $T_p = K$ before parallelization. (a) Write $T(K)$ and the speedup $S(K)$ as functions of $K$, and show that $S(K)$ grows without bound while $T(K)$ converges. (b) The aggregator's own latency grows with $K$ because it must read $K$ results; model it as $T_s = 1 + cK$ for a small constant $c$ and find the $K$ that minimizes $T(K)$. (c) Interpret the result: explain why, once the aggregator's read cost is counted, an unboundedly wide fan-out eventually increases latency, and connect this to the straggler and barrier discussion of Section 2.