Part VI: Distributed AI and Multi-Agent Systems
Chapter 29: Multi-Agent Systems

Coordination

"Two of us reached the doorway at the same instant, agreed loudly that the other should go first, and stood there agreeing until the building closed."

An Agent Stuck Waiting on a Lock
Big Picture

Coordination is the art of organizing the actions of many agents so their individual choices add up to a coherent joint action rather than a pile-up. Agents in a shared world have interdependent actions: they compete for the same corridor, the same database row, the same GPU, the same subtask. Left to choose independently, they collide, duplicate work, and miss the joint moves that only succeed if several agents act together. Coordination supplies the structure that prevents the collisions and unlocks the synergies, whether that structure is a fixed role, an agreed-upon convention, a merged plan, or a graph that says exactly which agents must agree with which. The catch, which this section makes quantitative, is that every bit of coordination costs communication and computation, and past a point that cost overruns the benefit. The skill is to coordinate exactly as much as the interdependence demands and not one message more.

In Section 27.7 we treated coordination as one capability among several that a distributed-AI system needs. Here we make it the whole subject, at the level of a multi-agent system: a set of autonomous agents, each with its own goals and its own view of the world, that must nonetheless act in one shared environment. The previous section gave the agents a way to talk (Section 29.4). Talking is necessary but not sufficient. A team that exchanges messages perfectly can still stall in a doorway, re-fetch the same web page eight times, or send every robot to the same charging station. Coordination is what turns the ability to communicate into the ability to act well together.

The coordination problem has a sharp definition. Each agent $i$ chooses an action $a_i$ from its own set $A_i$, and the agents collectively realize a joint action $a = (a_1, \dots, a_n)$ drawn from the product space $A_1 \times \cdots \times A_n$. Their actions are interdependent: the value of agent $i$'s choice depends on what the others choose. When the dependence is negative, independent choices conflict, two robots routed through one-wide corridor, two workers claiming the same task. When the dependence is positive, independent choices miss a synergy, a heavy object that lifts only if two agents push at once. Coordination is any mechanism that steers the agents toward a good joint action without requiring a central mind to enumerate the exponentially large product space.

1. Why Independent Choice Fails Beginner

Consider the cleanest failure: a narrow corridor that admits one agent at a time, with agents wanting to cross from both ends. Each agent, choosing for itself, sees that stepping forward is locally optimal, so all of them step forward, and the corridor jams. No agent did anything irrational; the joint outcome is bad precisely because nothing reconciled the individual choices. This is the multi-agent system version of a resource contention problem, and it generalizes directly to the cases that matter in a distributed-AI deployment: two training jobs grabbing the same idle GPU, two crawler agents fetching the same shard of the web, two language-model agents in a team both deciding they are the one who should answer the user.

The damage from missing coordination comes in three recognizable forms. Collisions are direct conflicts over an exclusive resource, where one agent's action invalidates another's. Duplicated work is the softer waste where several agents independently do the same thing, each unaware the others are doing it, so the team pays $n$ times for one result. Missed synergies are the opportunities lost because a jointly valuable action required simultaneous commitment that nobody arranged. A coordination mechanism is judged by how well it suppresses the first two and captures the third, against the communication it spends doing so.

Key Insight: Coordination Manages Interdependence, Not Communication

It is tempting to equate coordination with messaging, but they are different layers. Communication moves bits; coordination decides actions. The proof is that some of the strongest coordination mechanisms send no messages at all: a shared convention (drive on the right) or a fixed role assignment lets agents pick compatible actions purely from common knowledge, with zero runtime communication. When you design a multi-agent system, ask first how much the agents' actions actually depend on each other. That interdependence, not the size of the message bus, sets how much coordination you truly need.

2. Four Mechanisms, From Cheapest to Costliest Beginner

The coordination mechanisms of Chapter 27 reappear here organized by what they cost at runtime. The cheapest spend no messages because they front-load the agreement; the costliest negotiate continuously. We treat four, in increasing order of runtime communication.

Organizational roles and hierarchies assign each agent a standing position that constrains its choices in advance. A "scout" never claims a task a "worker" should take; a coordinator agent owns the right to break ties. Roles cut the joint action space by construction: once agent $i$ holds a role, large parts of $A_i$ are simply off the table, so fewer conflicts are even reachable. The cost is paid at design time and on reconfiguration, not on every action.

Norms and conventions are shared rules of behavior that every agent can apply unilaterally. "When two agents meet head-on, the one with the lower ID yields" resolves the corridor with no messages exchanged at the moment of conflict, because both agents already know the rule and both can evaluate it from information they already hold. Conventions are the cheapest possible coordination, communication-free at runtime, which is exactly why they are everywhere in practice, from network back-off schedules to the tie-breaking rules inside a consensus protocol.

Coordination via planning has the agents construct individual plans and then merge them, detecting where the plans conflict over shared resources or timing and resolving those conflicts before execution. This is distributed planning, and the resolution can reorder, delay, or substitute actions so that the merged plan is conflict-free. It spends real computation and several rounds of communication, but it handles interdependence that no fixed rule anticipated.

Coordination graphs sit between the rigid and the fully general. They exploit the fact that in most systems an agent's payoff depends on only a few neighbors, not on everyone. We make this precise next, because it is the structure that links this section back to the distributed constraint optimization of Section 27.6 and forward to multi-agent reinforcement learning in Chapter 30.

3. Coordination Graphs: Factoring the Joint Action Intermediate

The joint action space $A_1 \times \cdots \times A_n$ has size exponential in $n$, so optimizing over it directly is hopeless for any real team. The rescue is a structural observation: the team's total value rarely depends on all agents jointly. It usually decomposes into a sum of local terms, each involving only a small group of agents whose actions actually interact. Write the global payoff as

$$Q(a_1, \dots, a_n) = \sum_{(i,j) \in E} Q_{ij}(a_i, a_j),$$

where $E$ is the set of edges in a coordination graph: a vertex per agent, an edge between any two agents whose actions interact. An edge between two robots that share a corridor carries a term that is strongly negative when both enter at once; agents with no edge between them can choose independently without any loss. This is the identical factorization that turns a distributed constraint optimization problem into a graph in Section 27.6, and it is the same structure that lets value functions factor across agents in cooperative MARL (Chapter 30).

The payoff comes from how the factorization changes the cost of finding a good joint action. Instead of searching $|A|^n$ combinations, we optimize the sum over the graph, and the work scales with the graph's structure (its treewidth) rather than with $n$. For the sparse graphs that real teams produce, where each agent touches a handful of neighbors, this is the difference between intractable and instant. Figure 29.5.1 shows the idea: four agents, two of which contend for a corridor, and a coordination mechanism that resolves the single conflicting edge while leaving the independent agents alone.

Coordination graph: who must agree with whom shared corridor (conflict edge Q₁₂) Agent 1 Agent 2 no edge: independent choices Agent 3 Agent 4 resolve After a coordination mechanism Order assigned on the conflict edge Agent 1 enters first, Agent 2 waits one step (lower ID yields, a convention) Agents 3 and 4 act in parallel untouched: their actions never interacted conflict-free joint action, no global search
Figure 29.5.1: A coordination graph factors the joint action of four agents. Agents 1 and 2 share a corridor, so an edge carries the conflict term $Q_{12}$; agents 3 and 4 have no edge and choose independently. A coordination mechanism (here a lower-ID-yields convention) resolves only the single conflict edge, ordering agents 1 and 2 while leaving 3 and 4 to act in parallel. The cost scales with the graph, not with the $|A|^4$ joint space.

4. A Corridor, Coordinated and Not Intermediate

The demonstration below pits an uncoordinated team against a coordinated one on the corridor problem. Several agents repeatedly try to cross a one-wide corridor. In the uncoordinated run, each agent enters whenever it wants; when two enter at once, they collide, both fail, and the team earns nothing for that step. In the coordinated run, a coordination graph identifies which pairs of agents contend, and a convention (the lower-indexed agent of any contending pair goes first) resolves each conflict edge with no negotiation. The code reports collisions, successful crossings, and total reward for both.

import random

random.seed(7)
N_AGENTS, N_STEPS = 6, 200
# Each step, a random subset of agents wants the single-width corridor.
# A crossing earns +1; a collision (two or more entering at once) earns 0 for all involved.

def demand(step):
    # Deterministic-but-varied subset of agents wanting to cross this step.
    rng = random.Random(step * 2654435761 % (2**32))
    return [i for i in range(N_AGENTS) if rng.random() < 0.45]

def run_uncoordinated():
    collisions = successes = reward = 0
    for s in range(N_STEPS):
        want = demand(s)
        if len(want) == 0:
            continue
        if len(want) == 1:                       # lone agent crosses cleanly
            successes += 1; reward += 1
        else:                                    # everyone steps forward, all fail
            collisions += 1
    return collisions, successes, reward

def run_coordinated():
    # Coordination graph: any two agents that both want the corridor share a
    # conflict edge this step. Convention: the lowest-index contender goes now,
    # the rest defer to the next step (no messages, just the shared rule).
    pending = []                                 # agents who deferred from earlier steps
    collisions = successes = reward = 0
    for s in range(N_STEPS):
        want = sorted(set(demand(s)) | set(pending))
        pending = []
        if not want:
            continue
        winner = want[0]                         # lower ID yields: lowest index crosses
        successes += 1; reward += 1
        pending = want[1:]                       # the rest wait one step, no collision
    return collisions, successes, reward

for name, fn in [("uncoordinated", run_uncoordinated), ("coordinated", run_coordinated)]:
    c, su, r = fn()
    print(f"{name:>14} : collisions={c:4d}  crossings={su:4d}  total_reward={r:4d}")
Code 29.5.1: Two teams on the one-wide corridor. The uncoordinated team lets everyone step forward and loses every contested step to a collision; the coordinated team applies a coordination-graph convention (lowest-index contender crosses, the rest defer one step) and converts every contested step into a clean crossing, with zero runtime messages.
 uncoordinated : collisions= 162  crossings=  32  total_reward=  32
   coordinated : collisions=   0  crossings= 200  total_reward= 200
Output 29.5.1: The uncoordinated team collides on 162 of the contested steps and crosses only when a single agent happens to want the corridor alone, earning 32. The coordinated team never collides and serves the deferred backlog, earning 200, a 6.25x improvement bought entirely with a shared convention and no messages.

The result is stark because the corridor is a pure-conflict resource: every collision is a total loss, and the convention turns each one into a clean crossing plus a one-step deferral. The coordinated team's reward is higher not because its agents are smarter but because their choices were reconciled before they clashed. Notice what the coordination cost was here: zero runtime communication, because the lower-ID rule is common knowledge every agent can evaluate alone. That is the best case. The next section is about the cases where coordination is not free.

Fun Note: The Politeness Deadlock

The opposite of the collision is just as real and twice as embarrassing. Replace "everyone steps forward" with "everyone yields," and the corridor empties out while both agents wait for the other to go, the doorway shuffle from the epigraph. A convention has to break the symmetry in one direction; a rule that says "yield to the other" without a tie-breaker produces livelock, where the agents keep politely deferring forever. The lower-ID rule works precisely because it is asymmetric: somebody is always designated to move.

5. The Coordination Overhead and Its Ceiling Advanced

More coordination improves the joint outcome, up to a point, and then it stops paying. Each message exchanged, each round of plan merging, each conflict resolved, consumes time and bandwidth that the agents could have spent acting. When the team is small or the interdependence is dense, that cost is worth it. When the team is large and the dependence is sparse, a protocol that makes every agent talk to every other turns coordination from a help into the bottleneck. The communication of an all-to-all coordination round grows like $n^2$ in the number of agents, while the useful work per agent stays flat, so beyond some team size the round dominates.

This is the same ceiling that governs every distributed system in this book, and it is worth stating in the language of Section 3.5. Coordination is the serial, non-parallelizable fraction of the team's work: the part that cannot be sped up by adding agents because it is precisely the part where the agents must wait for each other. If a fraction $s$ of the total effort is irreducibly coordination, then by Amdahl's law the team's speedup from $n$ agents is bounded by

$$\text{speedup}(n) = \frac{1}{s + \frac{1 - s}{n}} \;\xrightarrow[n \to \infty]{}\; \frac{1}{s},$$

a hard ceiling of $1/s$ no matter how many agents you add. A team that spends ten percent of its effort coordinating can never go more than ten times faster than a single agent, however large it grows. This is why the coordination-graph factorization matters so much: by replacing dense all-to-all agreement with a sparse set of local edges, it shrinks $s$, which raises the ceiling. The design goal is not "more coordination" but "exactly the coordination the interdependence requires," which is the smallest $s$ that still prevents the collisions.

Thesis Thread: Coordination Is the Serial Fraction, Scaled Out

The book's central tension, that distribution buys throughput but charges communication, arrives here in its multi-agent form. In data-parallel training the tax was the all-reduce (Chapter 30 will multiply it across learning agents); among agents the tax is the coordination round. The remedy is structurally identical: exploit sparsity. Just as a sparse communication topology keeps gradient sync cheap, a sparse coordination graph keeps joint-action selection cheap. Whenever a multi-agent design proposes that "all agents agree on everything," read it as an $n^2$ cost and a low Amdahl ceiling, and ask which edges of the coordination graph are actually present.

Practical Example: The Agent Team That Answered Every Question Five Times

Who: A platform engineer running a customer-support team of five large-language-model agents behind one chat endpoint.

Situation: Each user message was broadcast to all five agents so the "best" answer could be chosen, and the system felt thorough in testing.

Problem: In production every message triggered five full model calls; four were discarded, so the team paid five times the token cost and frequently posted two near-identical replies when a race let two agents respond before the selector fired.

Dilemma: Add a central coordinator that locks each message to one agent (simple, but a single point of contention and latency), or impose a convention that needs no coordinator at all.

Decision: They chose a convention, the cheapest mechanism from Section 2: hash each conversation ID to one agent, so every agent can compute locally and with no messages which conversations are its own.

How: A one-line assignment, owner = hash(conversation_id) % n_agents, replaced the broadcast. An agent answers only if it is the owner, exactly the corridor convention from Code 29.5.1 applied to conversations instead of a corridor.

Result: Token cost dropped to one fifth, duplicate replies vanished, and median latency improved because four redundant model calls per message disappeared. The coordination overhead fell to zero runtime messages.

Lesson: Before reaching for a coordinator or a negotiation protocol, check whether a communication-free convention resolves the interdependence. For an embarrassingly partitionable workload, it usually does, and it has no Amdahl ceiling to speak of.

Library Shortcut: Coordination Graphs in PyMARL and Ray

In Code 29.5.1 we wrote the coordination graph and its convention by hand. In a cooperative multi-agent reinforcement-learning stack, the factorization $Q = \sum_{(i,j) \in E} Q_{ij}$ is a first-class object. Value-factorization frameworks in the PyMARL family (QMIX, and explicit coordination-graph methods such as DCG, Deep Coordination Graphs) represent the edges and run the message-passing maximization over the graph for you, so you specify which agents interact and the library handles the per-edge payoff tables and the max-plus optimization. For orchestrating teams of LLM or tool-using agents, Ray and similar actor frameworks give you the conversation-ownership routing of the practical example as a built-in placement or hashing primitive, so the convention is a configuration choice, not code. The from-scratch loop above is roughly a dozen lines; the library version is the edge list plus a one-line solver call, and the framework supplies the max-plus inference and the distributed actor scheduling.

6. Distributed Planning and Plan Merging Advanced

When no fixed convention anticipates the conflict, agents fall back on coordination via planning. Each agent builds an individual plan, a sequence of actions to reach its own goal, and the team merges these plans into one that no longer conflicts. Merging proceeds by detecting interactions: two plans conflict if they claim the same resource at the same time, or if one's precondition is undone by the other's effect. The merger then resolves each detected conflict by inserting an ordering constraint (agent $i$ acts before agent $j$), a delay, or a substitution, until the combined plan is consistent. The corridor convention of Code 29.5.1 is the degenerate, communication-free case of exactly this: the lower-ID rule is a pre-agreed ordering constraint that needs no runtime detection because the conflict is known in advance.

Plan merging is more powerful and more expensive than a convention. It can coordinate interdependences that no designer foresaw, but it spends computation to detect conflicts and rounds of communication to agree on resolutions, and in the worst case the detection is itself combinatorial. The practical systems keep it tractable the same way the coordination graph does: they only check for conflicts among agents whose plans touch shared resources, which is again a sparse-graph structure rather than an all-pairs comparison. Keeping an LLM agent team coherent uses the same discipline, detecting when two agents' planned tool calls would write the same file or claim the same budget, and serializing only those, a pattern that the agent-orchestration treatment of Chapter 32 builds on directly.

Research Frontier: Coordinating Teams of Language-Model Agents (2024 to 2026)

The 2024 to 2026 surge in multi-agent LLM systems has revived classical coordination theory under new names. Frameworks such as AutoGen, CrewAI, LangGraph, and OpenAI's Swarm (all 2024 to 2025) make role assignment and handoff conventions the primary coordination mechanism: a "manager" or "router" agent assigns subtasks and a handoff rule decides which agent speaks next, exactly the organizational-roles and conventions of Section 2. A parallel research line measures the failure modes empirically; the "Why Do Multi-Agent LLM Systems Fail?" analysis (Cemri et al., 2025) traces a large share of failures to coordination breakdowns, duplicated work, conflicting actions, and agents ignoring each other's outputs, rather than to any single agent's reasoning error. A third thread asks when the coordination overhead is worth it at all, finding that for many tasks a single well-prompted agent matches a multi-agent team at a fraction of the token cost, the Amdahl ceiling of Section 5 showing up as a token budget. The open problem is principled coordination-graph structure for LLM teams: learning which agents actually need to agree, so the expensive all-to-all "everyone reads everything" default can be replaced by a sparse, justified set of edges.

7. Putting the Mechanisms in Order Intermediate

Table 29.5.1 lines up the four mechanisms by their runtime communication cost, the kind of interdependence each handles, and where else in the book the same structure appears. The ordering is the practical decision procedure: reach for the cheapest mechanism that covers your interdependence, and escalate only when it does not.

Table 29.5.1: Coordination mechanisms ordered by runtime communication cost. Cheaper mechanisms front-load agreement at design time; costlier ones negotiate during execution. Pick the cheapest that covers the interdependence.
MechanismRuntime communicationHandlesSame structure appears in
Roles & hierarchiesNone (set at design / reconfig)Standing division of laborCoalition structure (Section 29.7)
Norms & conventionsNone (common knowledge)Foreseeable, symmetric conflictsConsensus tie-breaking (Section 29.9)
Coordination graphsLocal, per-edge messagesSparse pairwise interactionsDCOP (Section 27.6), MARL (Ch 30)
Plan mergingSeveral rounds, conflict-drivenUnforeseen interdependenceNegotiation (Section 29.6)

The escalation ladder also explains the shape of the rest of this chapter. When a convention or a coordination graph cannot fix the action assignment, because the agents have conflicting goals rather than merely conflicting actions, coordination gives way to negotiation, the subject of Section 29.6: agents bargain over who does what and at what price. Coordination assumes the agents want the team to succeed and only need their actions reconciled; negotiation drops that assumption and lets agents pursue their own interests, which is why it needs the game-theoretic machinery the next section brings to bear.

Exercise 29.5.1: Name the Failure and the Mechanism Conceptual

For each scenario, state whether the dominant cost of missing coordination is a collision, duplicated work, or a missed synergy, and name the cheapest mechanism from Table 29.5.1 that resolves it: (a) eight web-crawler agents that each independently fetch the most popular URL first; (b) two warehouse robots that must jointly lift a shelf no single robot can move; (c) a fleet of delivery drones that all route through the same airspace corridor at rush hour. For each, explain why the next-cheaper mechanism in the table would fail to handle it.

Exercise 29.5.2: Push the Corridor Past Its Ceiling Coding

Extend Code 29.5.1 so that resolving each contested step costs the coordinated team one communication round, and a round takes a fixed time that is charged against the team's reward (reward per step becomes $1 - c$ when a round fires, for a cost $c$ you choose). Sweep the per-agent demand probability from 0.1 to 0.9 and plot coordinated versus uncoordinated total reward. Identify the demand level at which the coordination cost $c$ makes the coordinated team no better than the uncoordinated one, and relate that crossover to the Amdahl ceiling $1/s$ of Section 5 by estimating the serial fraction $s$ your costed protocol induces.

Exercise 29.5.3: Treewidth and the Cost of Agreement Analysis

Consider $n$ agents arranged so that the coordination graph is (a) a single line (each agent shares an edge only with its two neighbors), (b) a star (one hub agent shares an edge with all others), and (c) a complete graph (every pair shares an edge). For each, give the number of edges as a function of $n$, and argue how the cost of selecting the optimal joint action over the graph scales. Explain why the complete graph reproduces the $n^2$ all-to-all cost of Section 5 while the line stays linear, and connect this to why a sparse coordination graph raises the Amdahl ceiling. You may reference the distributed-constraint structure of Section 27.6.