Part VI: Distributed AI and Multi-Agent Systems
Chapter 27: Distributed Artificial Intelligence

History of Distributed Artificial Intelligence

"They told me intelligence was a single thread of reasoning. Then they gave me a sensor network the size of a country and asked who, exactly, was supposed to be thinking."

An Agent Stuck Waiting on a Lock
Big Picture

The previous five parts distributed everything an AI system does except the one thing that makes it intelligent; this part distributes the reasoning itself across many autonomous agents. We have spread the data across machines, spread the training across workers, spread the model across devices, and spread inference across a serving fleet. In each case a single coherent answer was reassembled at the end. Distributed Artificial Intelligence asks a harder question: what happens when there is no single reasoner to reassemble, when decision-making itself is split across many agents that each perceive part of the world, hold part of the knowledge, and pursue goals that may or may not align? This is not a new question. It is the classical field of DAI, worked out in the 1980s and 1990s, and its two branches, distributed problem solving and multi-agent systems, are exactly the foundations that today's multi-agent LLM systems are rediscovering. This section opens the part by placing intelligence on the map of distribution axes and tracing where the field came from.

By this point in the book the pattern is familiar. Some essential activity of an AI system outgrows one machine, so we partition it, move information between the pieces, and recombine the result. Data, training, the model, and inference have each had their turn. There is one activity left, and it is the one we usually mean when we say a system is intelligent: deciding what to do. For most of the book that decision has lived in one place, a single model emitting a single prediction or action. Part VI is about what happens when the deciding is itself distributed, when many agents each reason locally and the system's behavior emerges from their interaction rather than from any one of them. Distributed Artificial Intelligence, abbreviated DAI, is the discipline that studied this long before deep learning existed, and its concepts are the vocabulary the rest of this part speaks.

The six axes of distribution Distribute data (Part II) Distribute training (Part III) Distribute the model (Part IV) Distribute inference (Part V) Coordinate the cluster (I, VII) Distribute intelligence (Part VI) Distributed AI (1980s onward) Distributed problem solving one problem, many cooperating solvers (27.2) Multi-agent systems autonomous agents, own goals, interacting (Ch 29)
Figure 27.1.1: Part VI turns to the last axis of distribution, intelligence itself, highlighted among the six axes introduced in Section 1.2. That axis is the subject of the classical field of Distributed AI, which splits into two historical branches: distributed problem solving, where one problem is decomposed across cooperating solvers (Section 27.2), and multi-agent systems, where autonomous agents with their own goals interact (Chapter 29).

1. Why Intelligence Became the Last Axis Beginner

The six axes of distribution from Section 1.2 gave the book its map, and five of them have now been walked. The first four spread a mechanical activity (processing bytes, computing gradients, holding parameters, answering requests) and the fifth keeps the cluster that runs them healthy. In every one of those cases a controller stands above the workers and stitches their partial results into a single answer; the worker is a means, not an end. Figure 27.1.1 marks the sixth axis, distributing intelligence, in a different color for a reason: here the pieces are not interchangeable workers serving one objective but agents, each with its own view of the world and, sometimes, its own purpose. There may be no controller at all.

Two distinct pressures push intelligence off a single reasoner, and they mirror the data and model pressures of Section 1.1. The first is that some problems are inherently distributed: a sensor network spans a country, a supply chain spans many companies that will not share their internal data, a fleet of robots is physically scattered and cannot fit in one process. No amount of buying a bigger machine collapses these into one reasoner, because the distribution is in the world, not in our hardware budget. The second pressure is the familiar one of size: some reasoning problems are simply too large for any single solver to hold, so they must be decomposed across cooperating solvers much as a dataset is sharded. DAI arose to study both, and the two pressures gave it its two branches.

Key Insight: Distributing Intelligence Is Different from Distributing Work

The first five axes distribute a mechanical activity under a controller that recombines the pieces into one answer; the worker has no goals of its own. Distributing intelligence removes that guarantee. An agent perceives only part of the world, knows only part of what is true, and may pursue a goal that conflicts with its neighbor's. The system's behavior emerges from interaction rather than from a controller's recombination step. That shift, from coordinating workers to coordinating agents, is what makes Part VI a different kind of distribution problem and why it needs its own classical theory.

2. The Two Branches of Classical DAI Beginner

When the field organized itself in the 1980s it divided along exactly the two pressures of the previous section, and the division still structures this chapter. The first branch, distributed problem solving, starts from a single well-defined problem and asks how to split it across solvers that cooperate by construction; they were built by one designer to chip away at one goal, and the research questions are about decomposition, sub-result sharing, and recombination. Classical results in this branch include the blackboard architecture (Section 27.4), where solvers post partial findings to a shared workspace, and the Contract-Net protocol (Section 27.5), where a manager announces a task and solvers bid to take it on. The whole of Section 27.2 is devoted to this branch, and the runnable demo at the end of this section is its smallest honest instance.

The second branch, multi-agent systems, drops the assumption of a single designer and a shared goal. Here each agent is autonomous: it has its own objectives, makes its own decisions, and may have been built by someone else entirely. The questions become whether agents will cooperate at all, how they negotiate when their interests diverge, and what coordination is possible without a central authority. This branch is the subject of Chapter 29, and because self-interested agents need a theory of strategic interaction, it borrows the equilibrium and mechanism concepts that Chapter 28 develops first. Table 27.1.1 lays the two branches side by side so the rest of the part can refer back to a single contrast.

Table 27.1.1: The two classical branches of Distributed AI, the assumption that separates them, and where this book develops each.
PropertyDistributed problem solvingMulti-agent systems
Starting pointOne problem to decomposeMany pre-existing agents
Agent goalsShared by constructionEach agent's own, possibly conflicting
Who designed themOne designerPossibly different designers
Central questionHow to split and recombineHow to coordinate and negotiate
Coordination byDecomposition and protocolsNegotiation, incentives, equilibria
Developed inSection 27.2, Sections 27.4 to 27.7Chapter 29, with Chapter 28
Fun Note: The Field That Named Itself Twice

For a decade the same researchers used "Distributed AI" and "multi-agent systems" almost interchangeably, then spent a good deal of conference time arguing over which was the superset of which. The settled convention is the one in Table 27.1.1: DAI is the umbrella, and multi-agent systems is one branch under it. The argument was not wasted; it forced the field to be precise about whether agents share a goal, which turned out to be the single most consequential modeling decision you make.

3. The Same Concerns, Now at the Level of Reasoning Intermediate

What makes DAI belong in this book, rather than in a separate course on agents, is that distributing intelligence raises the exact concerns the rest of the book has been managing, only now at the level of reasoning. Agents are distributed components. They must communicate, because no agent perceives the whole world. They must coordinate, because uncoordinated action wastes effort or causes conflict. They must reach some consistency about shared facts, because two agents acting on contradictory beliefs will work against each other. And they must tolerate failure, because in any real deployment some agent will crash, lag, or send a stale message. These are the same four concerns, communication, coordination, consistency, and failure, that Chapter 2 introduced for distributed systems in general.

The mapping is close enough to be useful as a working intuition. The message passing between agents is the communication layer of Chapter 4, now carrying beliefs and bids instead of gradients. The agreement an agent group needs on shared state is a consistency problem with the same flavor as the consensus of Chapter 2, which is why Section 27.8 treats distributed knowledge and belief explicitly. A coordinator that disappears mid-negotiation is the lost-coordinator failure mode that fault tolerance has addressed since Chapter 6. DAI is not a departure from the book's thesis; it is the thesis applied to the most abstract activity a system performs, and that is why the classical foundations transfer so cleanly.

Thesis Thread: The Last Activity to Be Distributed

The spine of this book is that every essential activity of an AI system can be spread across machines that communicate and coordinate to act as one. Part VI completes the arc by distributing the activity we held back the longest, deciding what to do. The four scale-out concerns return unchanged in name and only changed in content: communication now moves beliefs and proposals, coordination now resolves goals instead of scheduling kernels, consistency now concerns shared knowledge, and failure now means an agent rather than a worker drops out. When the agents are modern LLMs, these same four concerns are what Chapter 32 must engineer, which is the strongest evidence that DAI's questions were the right ones all along.

4. A Distributed Solution That Equals the Centralized One Intermediate

The cleanest way to feel the distributed-problem-solving branch is to build its smallest honest instance and check that decomposition followed by recombination reproduces the centralized answer, just as the gradient identity in Section 1.1 did for training. Consider a search problem: among a large catalog of $M$ candidate items, each carrying a scalar cost $c_i$, find the cheapest. A centralized solver scans all $M$ items. A distributed solver splits the catalog into $K$ disjoint parts, hands one part to each agent, lets every agent solve its own sub-problem in isolation, and then a recombiner selects the global best from just the $K$ local winners. The structure is exactly the $\min$ over a partition,

$$\min_{1 \le i \le M} c_i = \min_{1 \le k \le K}\; \Big(\min_{i \in S_k} c_i\Big),$$

where $S_1, \dots, S_K$ are the disjoint shards. The outer $\min$ is the recombination step, and it touches only $K$ numbers, not $M$. The code below builds this with $M = 10^6$ items and $K = 6$ agents, runs both solvers, and checks that they return the identical item.

import numpy as np

rng = np.random.default_rng(7)

# The problem: find the cheapest item in a large catalog of M candidates,
# where item i has scalar cost c[i].
M, K = 1_000_000, 6            # items, agents
cost = rng.standard_normal(M)  # the cost of each candidate item

# Centralized solver: one reasoner scans the whole catalog.
central_idx = int(np.argmin(cost))
central_val = float(cost[central_idx])

# Distributed problem solving: decompose the catalog into K disjoint parts,
# each agent solves its sub-problem in isolation and returns ONLY a summary
# (its local-best index and value); a recombiner selects the global winner.
parts = np.array_split(np.arange(M), K)
local_best = []                                  # one summary per agent
for a, idxs in enumerate(parts):
    sub = cost[idxs]
    j = int(np.argmin(sub))                      # agent solves its sub-problem
    local_best.append((int(idxs[j]), float(sub[j]), len(idxs)))

# Recombination: the coordinator compares K summaries, not M items.
winner = min(local_best, key=lambda t: t[1])
dist_idx, dist_val = winner[0], winner[1]

print("items M                 :", M)
print("agents K                :", K)
print("items the coordinator saw:", K, "(local summaries, not", M, ")")
print("centralized argmin index:", central_idx)
print("distributed argmin index:", dist_idx)
print("answers identical?      :", central_idx == dist_idx and central_val == dist_val)
print("centralized min value   :", f"{central_val:.6f}")
print("distributed min value   :", f"{dist_val:.6f}")
Code 27.1.1: Distributed problem solving from first principles. Each of the $K$ agents sees only its shard of the catalog and returns a one-line summary; the coordinator selects the global best from those summaries alone, never touching the full catalog.
items M                 : 1000000
agents K                : 6
items the coordinator saw: 6 (local summaries, not 1000000 )
centralized argmin index: 334838
distributed argmin index: 334838
answers identical?      : True
centralized min value   : -4.586801
distributed min value   : -4.586801
Output 27.1.1: The distributed solution matches the centralized one exactly: both name item 334838 as the cheapest. The coordinator examined six agent summaries rather than a million items, and the recombination step lost nothing.

The two solvers return the identical item, 334838, and the coordinator reached that answer by comparing six summaries instead of a million costs. This is distributed problem solving in miniature: a single goal, decomposed into independent sub-problems, recombined by an operation that touches only the per-agent results. It is exact here for the same structural reason data parallelism was exact in Code 1.1.1, because $\min$ over a partition equals the partitioned $\min$, just as a sum over a partition equals the partitioned sum. The branch gets hard, as Section 27.2 shows, precisely when the sub-problems are not independent and an agent's local answer depends on what its neighbors found, which is where coordination protocols earn their keep.

Library Shortcut: Ray Turns the Loop into Distributed Agents

In Code 27.1.1 the agents were a Python loop in one process, so the decomposition and recombination were obvious but not actually distributed. Ray makes each agent a real remote worker, possibly on a different machine, with one decorator and one get; the framework handles serialization, scheduling, and collecting the results:

import ray, numpy as np
ray.init()

@ray.remote
def solve_subproblem(cost_shard):          # one agent, runs anywhere in the cluster
    j = int(np.argmin(cost_shard))
    return j, float(cost_shard[j])

shards = np.array_split(cost, K)
futures = [solve_subproblem.remote(s) for s in shards]   # K agents dispatched
local_best = ray.get(futures)                            # collect K summaries
winner = min(local_best, key=lambda t: t[1])             # recombination step
Code 27.1.2: The same decompose-and-recombine pattern as Output 27.1.1, now genuinely distributed across a cluster. Ray replaces the manual loop and result collection; the agent-orchestration frameworks of Chapter 32 generalize this to LLM agents that exchange messages, not just return values.
Practical Example: The Outage Map No Single Server Could Hold

Who: A platform engineer at a regional electric utility building a grid-fault locator.

Situation: Thousands of smart meters across the service area each report local voltage anomalies, and operators needed the single most likely fault location within seconds of an event.

Problem: Streaming every meter's full readings to one central reasoner saturated the uplink and left the central server scanning millions of samples per query, missing the seconds-level deadline.

Dilemma: Centralize all raw data into one powerful solver, simple to reason about but bandwidth-bound and slow, or decompose the search so each substation's local controller solves its own region and reports only a summary, faster and lighter but requiring a recombination protocol.

Decision: They decomposed the problem, because the data was inherently distributed across substations that already had local compute, exactly the first pressure from Section 1.

How: Each substation controller ran the local search over its own meters and emitted only its best candidate fault and a confidence score; a lightweight coordinator selected the global best from those summaries, the structure of Code 27.1.1 scaled to real hardware.

Result: Uplink traffic fell by orders of magnitude, the coordinator compared dozens of summaries instead of millions of samples, and fault localization landed inside the deadline with the same answer the centralized version would have produced.

Lesson: When the data is already distributed and the recombination is cheap, distributed problem solving is not a compromise; it is both faster and exact, as long as the sub-problems decompose cleanly.

5. Why the Classical Foundations Matter Now Advanced

For two decades DAI was a respected but somewhat quiet field, because building an agent that could reason flexibly about an open world was itself unsolved; multi-agent theory was elegant but the agents themselves were brittle. Large language models changed that overnight. An LLM is, for the first time, a general-purpose reasoner you can instantiate many times, give different roles and tools, and let converse. Suddenly the practical bottleneck is no longer the individual agent but the coordination among many, which is precisely the problem DAI spent the 1980s and 1990s formalizing. A modern system of LLM agents that decompose a task, post partial results to shared memory, bid for sub-tasks, and reconcile conflicting conclusions is running the blackboard architecture, the Contract-Net protocol, and distributed belief revision under new names.

This is why a part on multi-agent systems opens with history rather than with the latest framework. The frameworks will change; the underlying problems, who does what, how partial results combine, what to do when agents disagree or vanish, are the same ones DAI named. Chapter 32 builds modern orchestration on these foundations, and the rest of Part VI fills in the theory between here and there. The throughline of the whole book holds at this final axis: agents are distributed components that must communicate and coordinate, so the scale-out concerns reappear, and the classical answers are the place to start.

Research Frontier: LLM Multi-Agent Systems Revive DAI (2024 to 2026)

The clearest sign that classical DAI was ahead of its time is how directly current work reinstantiates it. Multi-agent LLM frameworks such as AutoGen (Wu et al., 2024) and router-based crews coordinate role-specialized agents through structured conversation, a direct descendant of agent communication languages and the Contract-Net protocol. Surveys of LLM-based multi-agent systems (Guo et al., 2024) explicitly map modern agent designs back to the distributed-problem-solving and multi-agent-systems branches of Figure 27.1.1. Studies of emergent cooperation and debate among LLM agents, including work showing that multi-agent debate improves factuality and reasoning (Du et al., 2023), are empirical tests of coordination hypotheses the field posed decades ago. The open questions, scaling the number of agents without coordination collapse, reaching consistency over shared belief, and tolerating agents that fail or behave adversarially, are the questions Sections 27.7 and 27.8 first frame in their classical form before Chapter 32 engineers them.

We now have the part's map: intelligence is the sixth and last axis of distribution, the classical field that studied it is DAI, and DAI splits into distributed problem solving and multi-agent systems. We have seen the smaller branch reproduce a centralized answer exactly, and we have seen why these foundations are suddenly the working vocabulary of modern agent systems. The rest of Chapter 27 develops the distributed-problem-solving branch in depth, beginning with how one problem is decomposed across cooperating solvers in Section 27.2.

Exercise 27.1.1: Which Branch, and Why Conceptual

For each scenario, decide whether it belongs to distributed problem solving or to multi-agent systems as defined in Table 27.1.1, and justify your choice by identifying whether the agents share a goal and a designer: (a) a team of warehouse robots built by one company to pick orders together as fast as possible; (b) autonomous bidding agents from competing advertisers in a real-time ad auction; (c) a search engine that splits a query across many index shards and merges the top results; (d) several airlines' scheduling systems negotiating gate assignments at a shared airport. For any case that is genuinely on the boundary, say what extra fact would settle it.

Exercise 27.1.2: When Decomposition Stops Being Exact Coding

Code 27.1.1 was exact because the sub-problems were independent: an agent's best item did not depend on any other agent's shard. Modify it so the global objective couples the shards, for example by requiring the cheapest pair of items drawn from two different agents. Implement a centralized solver and a distributed one where each agent may report more than just its single best item, and find the smallest amount of per-agent summary information that still lets the coordinator recover the exact global answer. Explain why one summary per agent is no longer enough, and connect your finding to why coupled sub-problems need the coordination protocols of Section 27.2.

Exercise 27.1.3: The Cost of Recombination Analysis

In Code 27.1.1 the centralized solver inspects $M$ items while the distributed solver has each of $K$ agents inspect $M/K$ items in parallel and the coordinator inspect $K$ summaries. Write the work the coordinator does as a function of $K$, and argue that the recombination step is negligible only while $K \ll M$. Now suppose recombination itself were expensive, say the coordinator had to compare every pair of agent summaries at cost $O(K^2)$. Find the value of $K$ that minimizes total wall-clock time under a simple model where per-agent work is $M/K$ and recombination is $K^2$, and relate the existence of an optimal $K$ to the "distribution can hurt" lesson of Chapter 3.