"Two of us reached the doorway at the same instant, agreed loudly that the other should go first, and stood there agreeing until the building closed."
An Agent Stuck Waiting on a Lock
Coordination is the art of organizing the actions of many agents so their individual choices add up to a coherent joint action rather than a pile-up. Agents in a shared world have interdependent actions: they compete for the same corridor, the same database row, the same GPU, the same subtask. Left to choose independently, they collide, duplicate work, and miss the joint moves that only succeed if several agents act together. Coordination supplies the structure that prevents the collisions and unlocks the synergies, whether that structure is a fixed role, an agreed-upon convention, a merged plan, or a graph that says exactly which agents must agree with which. The catch, which this section makes quantitative, is that every bit of coordination costs communication and computation, and past a point that cost overruns the benefit. The skill is to coordinate exactly as much as the interdependence demands and not one message more.
In Section 27.7 we treated coordination as one capability among several that a distributed-AI system needs. Here we make it the whole subject, at the level of a multi-agent system: a set of autonomous agents, each with its own goals and its own view of the world, that must nonetheless act in one shared environment. The previous section gave the agents a way to talk (Section 29.4). Talking is necessary but not sufficient. A team that exchanges messages perfectly can still stall in a doorway, re-fetch the same web page eight times, or send every robot to the same charging station. Coordination is what turns the ability to communicate into the ability to act well together.
The coordination problem has a sharp definition. Each agent $i$ chooses an action $a_i$ from its own set $A_i$, and the agents collectively realize a joint action $a = (a_1, \dots, a_n)$ drawn from the product space $A_1 \times \cdots \times A_n$. Their actions are interdependent: the value of agent $i$'s choice depends on what the others choose. When the dependence is negative, independent choices conflict, two robots routed through one-wide corridor, two workers claiming the same task. When the dependence is positive, independent choices miss a synergy, a heavy object that lifts only if two agents push at once. Coordination is any mechanism that steers the agents toward a good joint action without requiring a central mind to enumerate the exponentially large product space.
1. Why Independent Choice Fails Beginner
Consider the cleanest failure: a narrow corridor that admits one agent at a time, with agents wanting to cross from both ends. Each agent, choosing for itself, sees that stepping forward is locally optimal, so all of them step forward, and the corridor jams. No agent did anything irrational; the joint outcome is bad precisely because nothing reconciled the individual choices. This is the multi-agent system version of a resource contention problem, and it generalizes directly to the cases that matter in a distributed-AI deployment: two training jobs grabbing the same idle GPU, two crawler agents fetching the same shard of the web, two language-model agents in a team both deciding they are the one who should answer the user.
The damage from missing coordination comes in three recognizable forms. Collisions are direct conflicts over an exclusive resource, where one agent's action invalidates another's. Duplicated work is the softer waste where several agents independently do the same thing, each unaware the others are doing it, so the team pays $n$ times for one result. Missed synergies are the opportunities lost because a jointly valuable action required simultaneous commitment that nobody arranged. A coordination mechanism is judged by how well it suppresses the first two and captures the third, against the communication it spends doing so.
It is tempting to equate coordination with messaging, but they are different layers. Communication moves bits; coordination decides actions. The proof is that some of the strongest coordination mechanisms send no messages at all: a shared convention (drive on the right) or a fixed role assignment lets agents pick compatible actions purely from common knowledge, with zero runtime communication. When you design a multi-agent system, ask first how much the agents' actions actually depend on each other. That interdependence, not the size of the message bus, sets how much coordination you truly need.
2. Four Mechanisms, From Cheapest to Costliest Beginner
The coordination mechanisms of Chapter 27 reappear here organized by what they cost at runtime. The cheapest spend no messages because they front-load the agreement; the costliest negotiate continuously. We treat four, in increasing order of runtime communication.
Organizational roles and hierarchies assign each agent a standing position that constrains its choices in advance. A "scout" never claims a task a "worker" should take; a coordinator agent owns the right to break ties. Roles cut the joint action space by construction: once agent $i$ holds a role, large parts of $A_i$ are simply off the table, so fewer conflicts are even reachable. The cost is paid at design time and on reconfiguration, not on every action.
Norms and conventions are shared rules of behavior that every agent can apply unilaterally. "When two agents meet head-on, the one with the lower ID yields" resolves the corridor with no messages exchanged at the moment of conflict, because both agents already know the rule and both can evaluate it from information they already hold. Conventions are the cheapest possible coordination, communication-free at runtime, which is exactly why they are everywhere in practice, from network back-off schedules to the tie-breaking rules inside a consensus protocol.
Coordination via planning has the agents construct individual plans and then merge them, detecting where the plans conflict over shared resources or timing and resolving those conflicts before execution. This is distributed planning, and the resolution can reorder, delay, or substitute actions so that the merged plan is conflict-free. It spends real computation and several rounds of communication, but it handles interdependence that no fixed rule anticipated.
Coordination graphs sit between the rigid and the fully general. They exploit the fact that in most systems an agent's payoff depends on only a few neighbors, not on everyone. We make this precise next, because it is the structure that links this section back to the distributed constraint optimization of Section 27.6 and forward to multi-agent reinforcement learning in Chapter 30.
3. Coordination Graphs: Factoring the Joint Action Intermediate
The joint action space $A_1 \times \cdots \times A_n$ has size exponential in $n$, so optimizing over it directly is hopeless for any real team. The rescue is a structural observation: the team's total value rarely depends on all agents jointly. It usually decomposes into a sum of local terms, each involving only a small group of agents whose actions actually interact. Write the global payoff as
$$Q(a_1, \dots, a_n) = \sum_{(i,j) \in E} Q_{ij}(a_i, a_j),$$where $E$ is the set of edges in a coordination graph: a vertex per agent, an edge between any two agents whose actions interact. An edge between two robots that share a corridor carries a term that is strongly negative when both enter at once; agents with no edge between them can choose independently without any loss. This is the identical factorization that turns a distributed constraint optimization problem into a graph in Section 27.6, and it is the same structure that lets value functions factor across agents in cooperative MARL (Chapter 30).
The payoff comes from how the factorization changes the cost of finding a good joint action. Instead of searching $|A|^n$ combinations, we optimize the sum over the graph, and the work scales with the graph's structure (its treewidth) rather than with $n$. For the sparse graphs that real teams produce, where each agent touches a handful of neighbors, this is the difference between intractable and instant. Figure 29.5.1 shows the idea: four agents, two of which contend for a corridor, and a coordination mechanism that resolves the single conflicting edge while leaving the independent agents alone.
4. A Corridor, Coordinated and Not Intermediate
The demonstration below pits an uncoordinated team against a coordinated one on the corridor problem. Several agents repeatedly try to cross a one-wide corridor. In the uncoordinated run, each agent enters whenever it wants; when two enter at once, they collide, both fail, and the team earns nothing for that step. In the coordinated run, a coordination graph identifies which pairs of agents contend, and a convention (the lower-indexed agent of any contending pair goes first) resolves each conflict edge with no negotiation. The code reports collisions, successful crossings, and total reward for both.
import random
random.seed(7)
N_AGENTS, N_STEPS = 6, 200
# Each step, a random subset of agents wants the single-width corridor.
# A crossing earns +1; a collision (two or more entering at once) earns 0 for all involved.
def demand(step):
# Deterministic-but-varied subset of agents wanting to cross this step.
rng = random.Random(step * 2654435761 % (2**32))
return [i for i in range(N_AGENTS) if rng.random() < 0.45]
def run_uncoordinated():
collisions = successes = reward = 0
for s in range(N_STEPS):
want = demand(s)
if len(want) == 0:
continue
if len(want) == 1: # lone agent crosses cleanly
successes += 1; reward += 1
else: # everyone steps forward, all fail
collisions += 1
return collisions, successes, reward
def run_coordinated():
# Coordination graph: any two agents that both want the corridor share a
# conflict edge this step. Convention: the lowest-index contender goes now,
# the rest defer to the next step (no messages, just the shared rule).
pending = [] # agents who deferred from earlier steps
collisions = successes = reward = 0
for s in range(N_STEPS):
want = sorted(set(demand(s)) | set(pending))
pending = []
if not want:
continue
winner = want[0] # lower ID yields: lowest index crosses
successes += 1; reward += 1
pending = want[1:] # the rest wait one step, no collision
return collisions, successes, reward
for name, fn in [("uncoordinated", run_uncoordinated), ("coordinated", run_coordinated)]:
c, su, r = fn()
print(f"{name:>14} : collisions={c:4d} crossings={su:4d} total_reward={r:4d}")
uncoordinated : collisions= 162 crossings= 32 total_reward= 32
coordinated : collisions= 0 crossings= 200 total_reward= 200
The result is stark because the corridor is a pure-conflict resource: every collision is a total loss, and the convention turns each one into a clean crossing plus a one-step deferral. The coordinated team's reward is higher not because its agents are smarter but because their choices were reconciled before they clashed. Notice what the coordination cost was here: zero runtime communication, because the lower-ID rule is common knowledge every agent can evaluate alone. That is the best case. The next section is about the cases where coordination is not free.
The opposite of the collision is just as real and twice as embarrassing. Replace "everyone steps forward" with "everyone yields," and the corridor empties out while both agents wait for the other to go, the doorway shuffle from the epigraph. A convention has to break the symmetry in one direction; a rule that says "yield to the other" without a tie-breaker produces livelock, where the agents keep politely deferring forever. The lower-ID rule works precisely because it is asymmetric: somebody is always designated to move.
5. The Coordination Overhead and Its Ceiling Advanced
More coordination improves the joint outcome, up to a point, and then it stops paying. Each message exchanged, each round of plan merging, each conflict resolved, consumes time and bandwidth that the agents could have spent acting. When the team is small or the interdependence is dense, that cost is worth it. When the team is large and the dependence is sparse, a protocol that makes every agent talk to every other turns coordination from a help into the bottleneck. The communication of an all-to-all coordination round grows like $n^2$ in the number of agents, while the useful work per agent stays flat, so beyond some team size the round dominates.
This is the same ceiling that governs every distributed system in this book, and it is worth stating in the language of Section 3.5. Coordination is the serial, non-parallelizable fraction of the team's work: the part that cannot be sped up by adding agents because it is precisely the part where the agents must wait for each other. If a fraction $s$ of the total effort is irreducibly coordination, then by Amdahl's law the team's speedup from $n$ agents is bounded by
$$\text{speedup}(n) = \frac{1}{s + \frac{1 - s}{n}} \;\xrightarrow[n \to \infty]{}\; \frac{1}{s},$$a hard ceiling of $1/s$ no matter how many agents you add. A team that spends ten percent of its effort coordinating can never go more than ten times faster than a single agent, however large it grows. This is why the coordination-graph factorization matters so much: by replacing dense all-to-all agreement with a sparse set of local edges, it shrinks $s$, which raises the ceiling. The design goal is not "more coordination" but "exactly the coordination the interdependence requires," which is the smallest $s$ that still prevents the collisions.
The book's central tension, that distribution buys throughput but charges communication, arrives here in its multi-agent form. In data-parallel training the tax was the all-reduce (Chapter 30 will multiply it across learning agents); among agents the tax is the coordination round. The remedy is structurally identical: exploit sparsity. Just as a sparse communication topology keeps gradient sync cheap, a sparse coordination graph keeps joint-action selection cheap. Whenever a multi-agent design proposes that "all agents agree on everything," read it as an $n^2$ cost and a low Amdahl ceiling, and ask which edges of the coordination graph are actually present.
Who: A platform engineer running a customer-support team of five large-language-model agents behind one chat endpoint.
Situation: Each user message was broadcast to all five agents so the "best" answer could be chosen, and the system felt thorough in testing.
Problem: In production every message triggered five full model calls; four were discarded, so the team paid five times the token cost and frequently posted two near-identical replies when a race let two agents respond before the selector fired.
Dilemma: Add a central coordinator that locks each message to one agent (simple, but a single point of contention and latency), or impose a convention that needs no coordinator at all.
Decision: They chose a convention, the cheapest mechanism from Section 2: hash each conversation ID to one agent, so every agent can compute locally and with no messages which conversations are its own.
How: A one-line assignment, owner = hash(conversation_id) % n_agents, replaced the broadcast. An agent answers only if it is the owner, exactly the corridor convention from Code 29.5.1 applied to conversations instead of a corridor.
Result: Token cost dropped to one fifth, duplicate replies vanished, and median latency improved because four redundant model calls per message disappeared. The coordination overhead fell to zero runtime messages.
Lesson: Before reaching for a coordinator or a negotiation protocol, check whether a communication-free convention resolves the interdependence. For an embarrassingly partitionable workload, it usually does, and it has no Amdahl ceiling to speak of.
In Code 29.5.1 we wrote the coordination graph and its convention by hand. In a cooperative multi-agent reinforcement-learning stack, the factorization $Q = \sum_{(i,j) \in E} Q_{ij}$ is a first-class object. Value-factorization frameworks in the PyMARL family (QMIX, and explicit coordination-graph methods such as DCG, Deep Coordination Graphs) represent the edges and run the message-passing maximization over the graph for you, so you specify which agents interact and the library handles the per-edge payoff tables and the max-plus optimization. For orchestrating teams of LLM or tool-using agents, Ray and similar actor frameworks give you the conversation-ownership routing of the practical example as a built-in placement or hashing primitive, so the convention is a configuration choice, not code. The from-scratch loop above is roughly a dozen lines; the library version is the edge list plus a one-line solver call, and the framework supplies the max-plus inference and the distributed actor scheduling.
6. Distributed Planning and Plan Merging Advanced
When no fixed convention anticipates the conflict, agents fall back on coordination via planning. Each agent builds an individual plan, a sequence of actions to reach its own goal, and the team merges these plans into one that no longer conflicts. Merging proceeds by detecting interactions: two plans conflict if they claim the same resource at the same time, or if one's precondition is undone by the other's effect. The merger then resolves each detected conflict by inserting an ordering constraint (agent $i$ acts before agent $j$), a delay, or a substitution, until the combined plan is consistent. The corridor convention of Code 29.5.1 is the degenerate, communication-free case of exactly this: the lower-ID rule is a pre-agreed ordering constraint that needs no runtime detection because the conflict is known in advance.
Plan merging is more powerful and more expensive than a convention. It can coordinate interdependences that no designer foresaw, but it spends computation to detect conflicts and rounds of communication to agree on resolutions, and in the worst case the detection is itself combinatorial. The practical systems keep it tractable the same way the coordination graph does: they only check for conflicts among agents whose plans touch shared resources, which is again a sparse-graph structure rather than an all-pairs comparison. Keeping an LLM agent team coherent uses the same discipline, detecting when two agents' planned tool calls would write the same file or claim the same budget, and serializing only those, a pattern that the agent-orchestration treatment of Chapter 32 builds on directly.
The 2024 to 2026 surge in multi-agent LLM systems has revived classical coordination theory under new names. Frameworks such as AutoGen, CrewAI, LangGraph, and OpenAI's Swarm (all 2024 to 2025) make role assignment and handoff conventions the primary coordination mechanism: a "manager" or "router" agent assigns subtasks and a handoff rule decides which agent speaks next, exactly the organizational-roles and conventions of Section 2. A parallel research line measures the failure modes empirically; the "Why Do Multi-Agent LLM Systems Fail?" analysis (Cemri et al., 2025) traces a large share of failures to coordination breakdowns, duplicated work, conflicting actions, and agents ignoring each other's outputs, rather than to any single agent's reasoning error. A third thread asks when the coordination overhead is worth it at all, finding that for many tasks a single well-prompted agent matches a multi-agent team at a fraction of the token cost, the Amdahl ceiling of Section 5 showing up as a token budget. The open problem is principled coordination-graph structure for LLM teams: learning which agents actually need to agree, so the expensive all-to-all "everyone reads everything" default can be replaced by a sparse, justified set of edges.
7. Putting the Mechanisms in Order Intermediate
Table 29.5.1 lines up the four mechanisms by their runtime communication cost, the kind of interdependence each handles, and where else in the book the same structure appears. The ordering is the practical decision procedure: reach for the cheapest mechanism that covers your interdependence, and escalate only when it does not.
| Mechanism | Runtime communication | Handles | Same structure appears in |
|---|---|---|---|
| Roles & hierarchies | None (set at design / reconfig) | Standing division of labor | Coalition structure (Section 29.7) |
| Norms & conventions | None (common knowledge) | Foreseeable, symmetric conflicts | Consensus tie-breaking (Section 29.9) |
| Coordination graphs | Local, per-edge messages | Sparse pairwise interactions | DCOP (Section 27.6), MARL (Ch 30) |
| Plan merging | Several rounds, conflict-driven | Unforeseen interdependence | Negotiation (Section 29.6) |
The escalation ladder also explains the shape of the rest of this chapter. When a convention or a coordination graph cannot fix the action assignment, because the agents have conflicting goals rather than merely conflicting actions, coordination gives way to negotiation, the subject of Section 29.6: agents bargain over who does what and at what price. Coordination assumes the agents want the team to succeed and only need their actions reconciled; negotiation drops that assumption and lets agents pursue their own interests, which is why it needs the game-theoretic machinery the next section brings to bear.
For each scenario, state whether the dominant cost of missing coordination is a collision, duplicated work, or a missed synergy, and name the cheapest mechanism from Table 29.5.1 that resolves it: (a) eight web-crawler agents that each independently fetch the most popular URL first; (b) two warehouse robots that must jointly lift a shelf no single robot can move; (c) a fleet of delivery drones that all route through the same airspace corridor at rush hour. For each, explain why the next-cheaper mechanism in the table would fail to handle it.
Extend Code 29.5.1 so that resolving each contested step costs the coordinated team one communication round, and a round takes a fixed time that is charged against the team's reward (reward per step becomes $1 - c$ when a round fires, for a cost $c$ you choose). Sweep the per-agent demand probability from 0.1 to 0.9 and plot coordinated versus uncoordinated total reward. Identify the demand level at which the coordination cost $c$ makes the coordinated team no better than the uncoordinated one, and relate that crossover to the Amdahl ceiling $1/s$ of Section 5 by estimating the serial fraction $s$ your costed protocol induces.
Consider $n$ agents arranged so that the coordination graph is (a) a single line (each agent shares an edge only with its two neighbors), (b) a star (one hub agent shares an edge with all others), and (c) a complete graph (every pair shares an edge). For each, give the number of edges as a function of $n$, and argue how the cost of selecting the optimal joint action over the graph scales. Explain why the complete graph reproduces the $n^2$ all-to-all cost of Section 5 while the line stays linear, and connect this to why a sparse coordination graph raises the Amdahl ceiling. You may reference the distributed-constraint structure of Section 27.6.