"Each of us could see one corner of the room. It took eleven rounds of polite disagreement before we agreed there was an elephant."
A Sensor Node That Only Saw a Leg
Distributed problem solving is what happens when one problem is too big, too spread out, or too sensor-local to be solved on a single machine, so it is cut into subproblems handed to many cooperating solvers that share partial results until their local answers fit together into one global answer. Unlike the self-interested agents of Chapter 29, these solvers share a common goal; the difficulty is not conflicting incentives but conflicting partial conclusions. A solution that is locally consistent on one node may contradict another node's local solution, so the solvers must exchange tentative results, detect the clashes, and revise. That makes distributed problem solving iterative and communication-bound, which ties it straight back to the communication-cost theme of Chapter 4. The two hard steps, decomposition into subproblems and synthesis of a global solution, are the spine of this section and the recurring pattern of the whole book.
Some problems are distributed before anyone decides to distribute them. A network of acoustic sensors monitoring a forest sees, at each node, only the sounds near that node; no single node hears the whole scene, yet the question (where is the moving vehicle, and on what heading?) is global. A team of warehouse robots each perceives a slice of the floor; the question (what is the safe, conflict-free set of routes right now?) spans the whole floor. In cases like these the input is already partitioned across machines by physics or geography, and gathering everything to one machine is either impossible, too slow, or too expensive. The discipline of solving such a problem in place, with cooperating solvers that pool partial conclusions rather than raw data, is distributed problem solving, and it is the oldest idea in distributed AI.
1. One Problem, Many Solvers, One Goal Beginner
It is worth fixing the boundary of the topic precisely, because the next chapters relax exactly the assumptions we make here. In distributed problem solving every solver is cooperative: the solvers are parts of one designed system, they share a single objective, and none of them gains by misleading the others. The classic illustration is a distributed sensor network for vehicle tracking. Each node has a directional microphone or a small radar and can estimate a vehicle's position only within its own coverage area, with its own noise and blind spots. No node can name the global track. Yet if the nodes share their local estimates with their neighbors, a node that sees a vehicle entering its area from the west can be cross-checked against the node to its west that just reported the vehicle leaving, and the fused interpretation is far better than any single node's guess.
This is genuinely different from the self-interested setting of Chapter 29, where agents may have private goals, negotiate, and strategize, and where the hard question is incentive design. Here the agents already agree on what a good global answer looks like; the only obstacle is that each one computed its piece without seeing the others' pieces, so the pieces have to be reconciled. The reconciliation is the work, and it is never free, because every piece of evidence one solver needs from another has to travel over a link.
2. The Two Hard Steps: Decomposition and Synthesis Beginner
Every distributed problem solving method is organized around two operations that bookend the local solving. The first is task decomposition: cutting the global problem into subproblems that can be assigned to solvers. A good decomposition produces subproblems that are roughly balanced in difficulty, that map naturally onto where the data already lives, and (the hard part) that interact as little as possible, because every interaction between two subproblems becomes communication between two solvers. The second is result synthesis: combining the local solutions into a global one. When the subproblems are truly independent this is a trivial concatenation, but when they share variables or constraints, synthesis must resolve the places where the local solutions disagree.
The trouble is that interesting problems do not decompose cleanly. Subproblems interact: a route chosen by one robot blocks a corridor another robot wanted, a color one map region picks forbids that color for its neighbors, a vehicle one sensor places at a coordinate must be the same vehicle the next sensor sees a moment later. Because of these interactions a solver cannot finish its subproblem in isolation; the locally optimal piece may be globally inconsistent. So the solvers must communicate intermediate, still-tentative results, discover the conflicts, and revise, possibly many times. This is why distributed problem solving is iterative rather than one-shot, and why its cost is dominated by rounds of communication rather than by local computation.
If subproblems did not interact, distributed problem solving would be embarrassingly parallel: decompose, solve, concatenate, done. The entire difficulty (and the entire communication bill) comes from the coupling between subproblems. The skill is therefore to decompose so that coupling is minimized, and to design an exchange protocol so that each round of sharing partial results removes the maximum number of conflicts. A decomposition that cuts through a dense cluster of interactions creates a system that talks far more than it computes, which is the distributed-AI version of the communication tax quantified in Chapter 3.
3. Functionally-Accurate, Cooperative Solving Intermediate
The foundational answer to "how should the solvers exchange?" is the functionally-accurate/cooperative (FA/C) approach, introduced by Lesser and Corkill for exactly the distributed-sensor setting above. Its insight is to abandon the idea that each solver must produce a complete, correct, final answer to its subproblem before sharing anything. Instead solvers exchange tentative, partial, possibly-inconsistent results as they go. Each solver treats incoming partials as additional, soft evidence, folds them into its own reasoning, and emits revised partials in turn. The system is "functionally accurate" because it reaches a correct global result even though no individual message is guaranteed correct, and "cooperative" because the solvers volunteer their best current guesses to help others converge rather than guarding them until certain.
Formally, picture the global problem as a set of variables with constraints, where variable $x_i$ is owned by solver $i$ and each constraint couples a few variables held by different solvers. A solution is a joint assignment that satisfies every constraint. Define the global cost as the number of violated constraints,
$$C(\mathbf{x}) = \sum_{(i,j)\,\in\,E} \mathbf{1}\!\left[\,\text{constraint between } x_i \text{ and } x_j \text{ is violated}\,\right],$$where $E$ is the set of interacting variable pairs. No solver can evaluate $C$ directly, because it spans the whole system. But solver $i$ can evaluate its local cost, the violations involving only its own variable and the neighbor values it has been told, and act to reduce that. The FA/C bet, which holds for a wide class of problems, is that if every solver keeps reducing its local cost while broadcasting its current value to the neighbors it interacts with, the global cost $C$ falls to zero. The exchange of tentative partial results is precisely the mechanism that keeps each solver's local view fresh enough for its local moves to be globally productive.
The decompose-solve-synthesize shape of this section is the same shape the book has used since data parallelism in Chapter 1: split the work across machines, do the local part, combine. What changes here is the combine step. In data-parallel training the combine was a single exact all-reduce of partial gradients; here the combine is iterative, because the partial results interact and one pass cannot resolve them. Distributed problem solving is what the book's central pattern becomes when the subproblems are coupled rather than independent: the synthesis stops being one collective and becomes a loop of collectives, each shrinking the disagreement a little more.
4. A Distributed Constraint Problem, Solved by Cooperation Intermediate
The cleanest runnable instance of all of this is distributed map coloring, a constraint problem small enough to read yet faithful to the structure of sensor fusion: each agent holds one piece (one region), the pieces interact through shared constraints (adjacent regions must differ), and no agent sees the whole map. The code below builds five region-agents. Each agent knows only its own neighbors, starts from a random tentative color, and in every round broadcasts its current color to its neighbors (the exchange of partial results), then, if it is in conflict, revises toward the locally best color with some probability. The probabilistic activation is the standard trick that stops neighboring agents from flipping in perfect lockstep forever; it is the distributed stochastic algorithm familiar from distributed constraint optimization, which Chapter 27 revisits in depth in its section on DCOP. We count the communication rounds and verify the synthesized global coloring.
import random
# The shared problem (no agent ever sees this whole structure).
# Adjacency of a small map: which regions touch which.
ADJ = {
"A": ["B", "C"],
"B": ["A", "C", "D"],
"C": ["A", "B", "D", "E"],
"D": ["B", "C", "E"],
"E": ["C", "D"],
}
COLORS = ["R", "G", "B"]
random.seed(2)
class Agent:
"""Owns ONE region. Knows only its own neighbors, not the rest of the
map. Holds a tentative color and the partial results it has received."""
def __init__(self, region, neighbors):
self.region = region
self.neighbors = neighbors # local view only
self.color = random.choice(COLORS) # tentative guess
self.inbox = {} # neighbor -> reported color
def conflicts(self):
"""How many local constraints my current color violates."""
return sum(1 for n in self.neighbors
if self.inbox.get(n) == self.color)
def revise(self, activation):
"""If I violate a local constraint, with probability `activation`
move to the local color that minimizes conflicts with what I have
heard. Randomized activation breaks the symmetry that makes
neighbors flip in lockstep. Returns True if I changed."""
if self.conflicts() == 0:
return False
if random.random() > activation:
return False # stay put this round
taken = [self.inbox.get(n) for n in self.neighbors]
best = min(COLORS, key=lambda c: taken.count(c))
changed = best != self.color
self.color = best
return changed
# Build the solver society: one agent per region, each with a local view.
agents = {r: Agent(r, ADJ[r]) for r in ADJ}
print("Initial tentative colors :", {r: a.color for r, a in agents.items()})
total_msgs = 0
for rnd in range(1, 21):
# Communication phase: every agent shares its tentative partial result.
for r, a in agents.items():
for n in a.neighbors:
agents[n].inbox[r] = a.color # send my color to neighbor
total_msgs += 1
# Conflicts visible AFTER exchange, BEFORE this round's revision.
conflicts = sum(a.conflicts() for a in agents.values()) // 2
# Local solve phase: conflicted agents revise against received partials.
changed = sum(agents[r].revise(activation=0.7) for r in agents)
print(f"round {rnd:2d}: {dict((r, a.color) for r, a in agents.items())} "
f"| edge-conflicts={conflicts} | revised={changed}")
if conflicts == 0 and changed == 0:
print(f"\nConverged after {rnd} rounds, "
f"{total_msgs} partial-result messages exchanged.")
break
# Synthesis: collect local colors into the global solution and verify it.
solution = {r: a.color for r, a in agents.items()}
ok = all(solution[u] != solution[v] for u in ADJ for v in ADJ[u])
print("Synthesized global coloring :", solution)
print("Globally consistent :", ok)
Agent sees only its own neighbors, exchanges its tentative color every round (the FA/C partial-result share), and revises locally; the global coloring is synthesized and checked only at the end. No agent ever holds the whole map.Initial tentative colors : {'A': 'R', 'B': 'R', 'C': 'R', 'D': 'G', 'E': 'R'}
round 1: {'A': 'R', 'B': 'B', 'C': 'B', 'D': 'G', 'E': 'B'} | edge-conflicts=4 | revised=3
round 2: {'A': 'R', 'B': 'R', 'C': 'R', 'D': 'G', 'E': 'R'} | edge-conflicts=2 | revised=3
round 3: {'A': 'G', 'B': 'B', 'C': 'R', 'D': 'G', 'E': 'R'} | edge-conflicts=4 | revised=2
round 4: {'A': 'G', 'B': 'B', 'C': 'R', 'D': 'G', 'E': 'B'} | edge-conflicts=1 | revised=1
round 5: {'A': 'G', 'B': 'B', 'C': 'R', 'D': 'G', 'E': 'B'} | edge-conflicts=0 | revised=0
Converged after 5 rounds, 70 partial-result messages exchanged.
Synthesized global coloring : {'A': 'G', 'B': 'B', 'C': 'R', 'D': 'G', 'E': 'B'}
Globally consistent : True
Two features of Output 27.2.1 are the whole lesson in miniature. First, convergence is not monotone: the conflict count climbs before it falls, because each agent acts on a snapshot of its neighbors that is already one round out of date, so well-meant local fixes can briefly create new global clashes. This is the price of having no agent that sees everything. Second, the work is counted in rounds and messages, not in arithmetic: five rounds and seventy messages settled a problem whose local computation was trivial. Scale the map to thousands of regions on thousands of machines and the message count, not the coloring logic, is what determines whether the system is practical, which is exactly why Chapter 4 treats the cost of communication as a first-class quantity.
Code 27.2.1 hand-rolls the agents, their inboxes, the round loop, and the revision rule. A distributed-constraint toolkit such as pydcop expresses the same problem declaratively and ships several cooperative algorithms (DSA, MGM, Max-Sum, DPOP) that handle agent messaging, scheduling, and convergence bookkeeping for you:
# pip install pydcop ; then: dcop.py solve --algo dsa map_coloring.yaml
# map_coloring.yaml declares one variable per region over domain [R, G, B]
# and an "all_diff"-style constraint on each adjacent pair. pydcop builds the
# agents, runs the chosen distributed algorithm, and reports the assignment.
from pydcop.dcop.dcop import DCOP
from pydcop.dcop.objects import Variable, AgentDef
from pydcop.dcop.relations import constraint_from_str
# ... define variables/constraints, pick algo="dsa", and call the solver:
# pydcop solve --algo dsa --timeout 5 map_coloring.yaml
pydcop. Roughly forty lines of agent, inbox, and round-loop scaffolding collapse to a problem declaration plus an algorithm name; the toolkit handles message passing, activation scheduling, and the convergence check that we wrote by hand.5. The MapReduce Parallel and the Modern Agent Parallel Intermediate
The decompose-solve-synthesize shape should feel familiar, because it is the shape of MapReduce. The map phase decomposes the input and applies a local function to each shard; the reduce phase synthesizes the per-shard results into a global one. The difference is exactly the interaction question of Section 2. MapReduce is built for problems whose subproblems do not interact within a phase: every mapper is independent, every reducer is independent, and the only communication is the single shuffle between the two phases. Distributed problem solving is what you need when that assumption fails, when the "mappers" must talk to each other repeatedly because their pieces are coupled. Seen this way, MapReduce is the degenerate, one-round case of distributed problem solving, the case where the synthesis is a single reduce because the partial results never conflict.
The pattern has returned, scaled out and modernized, in multi-agent LLM workflows. A complex task (write a researched report, fix a bug across a large codebase, plan a multi-step trip) is decomposed across role-specialized agents: a planner, several workers each owning a sub-task, a critic, an integrator. The workers solve their sub-tasks with their own local context, exchange intermediate results (a draft section, a candidate patch, a partial itinerary), and an integrator synthesizes the global deliverable, often looping when the pieces do not cohere. This is FA/C with language models in the solver role: tentative partial results flow between agents, each agent refines on what it receives, and the system is functionally accurate even though no single agent produced a complete correct answer alone. The orchestration machinery that makes this reliable at scale is the subject of Chapter 32.
The old story of blind men examining an elephant, one finding a snake at the trunk, another a wall at the side, a third a rope at the tail, is usually told as a fable about partial knowledge and stubbornness. Read it as a systems engineer and it is a perfect specification of a distributed problem solving failure: correct local conclusions, zero exchange of partial results, no synthesis step, and therefore no global answer. Add one FA/C round of "here is what I feel, what do you feel?" and the society reconstructs the elephant. The moral the fable forgot is that the bug was not blindness; it was the missing communication protocol.
6. When It Helps, and When It Is Just Overhead Advanced
Distributed problem solving earns its complexity in a specific regime, and it is worth naming the boundary so it is not reached for reflexively. It helps when the data or sensing is genuinely distributed and cannot be centralized cheaply (sensor networks, robot teams, federated sites under privacy constraints), when the problem is large enough that local solving across many machines is faster than one machine grinding through the whole thing, and when the interactions between subproblems are sparse, so that a few rounds of partial-result exchange resolve most conflicts. In that regime the communication is modest and the parallelism pays.
It is pure overhead in the opposite regime. If all the data could be gathered to one machine quickly and that machine could solve the problem outright, a centralized solver is simpler, exact, and usually faster, because it pays no communication tax and never revises on stale views. If the subproblems interact densely, every solver ends up talking to nearly every other solver every round, the message count explodes, and the system spends more time exchanging tentative results than it would have spent solving the problem centrally. The honest design question, echoing the one this book asks of every distribution decision, is whether the coupling is sparse enough that cooperation converges in few rounds; when it is not, the right move is to centralize, and the trade-off between centralized, decentralized, and hybrid designs is exactly what Section 27.3 takes up next.
Who: A systems engineer building an acoustic vehicle-tracking network for a wildlife reserve.
Situation: Two hundred battery-powered sensor nodes are scattered over rough terrain; each hears vehicles only within a few hundred meters and runs on a tiny radio budget.
Problem: The reserve needs a single continuous track for each intruding vehicle, but no node sees more than a fragment of any track, and the noise floor makes single-node estimates unreliable.
Dilemma: Stream every node's raw audio to a base station and solve the tracking centrally, simple but far over the radio energy budget and useless when the backhaul link drops, or solve it in the field with each node sharing only compact partial estimates with its few neighbors.
Decision: They solved it in the field with an FA/C scheme, because the interactions are sparse (a node only needs the estimates of physically adjacent nodes) and the raw audio was far too expensive to centralize.
How: Each node computed a local position-and-heading estimate, broadcast it to neighbors, fused incoming neighbor estimates with its own, and revised; adjacent nodes reconciled hand-offs as a vehicle crossed coverage boundaries, exactly the revise-on-partial-results loop of Code 27.2.1.
Result: The network produced coherent global tracks while moving a few hundred bytes per detection between neighbors instead of streaming gigabytes of audio, and it kept tracking through base-station outages because no central solver was on the critical path.
Lesson: When sensing is distributed and interactions are sparse, exchanging partial conclusions beats centralizing raw data; the win is measured in communication avoided, not arithmetic performed.
Classical FA/C has been rediscovered inside multi-agent LLM systems, and 2024 to 2026 work is busy mapping the old theory onto the new solvers. Multi-agent debate and role-decomposed pipelines (planner, solvers, critic, integrator) report gains on reasoning and code tasks over a single model, which is FA/C with tentative partial answers exchanged in natural language; frameworks such as AutoGen, CrewAI, and LangGraph have made the decompose-exchange-synthesize loop a standard engineering pattern. At the same time, careful studies (for example, the 2024 analyses of why multi-agent LLM systems fail) find that the gains are fragile: agents amplify each other's errors, exchange redundant or contradictory partials, and run up large token and latency bills, the modern echo of the communication-overhead regime of Section 6. The open questions are exactly the classic ones in new clothing: how to decompose a task so the sub-agents interact sparsely, how to fuse contradictory partial results without a single agent that sees everything, and when a single strong model is simply the cheaper, more accurate centralized solver. We pick these threads up with the orchestration machinery in Chapter 32.
We now have the founding pattern of distributed AI: cut one problem into coupled subproblems, hand them to cooperative solvers, exchange tentative partial results until the local solutions stop contradicting each other, and synthesize the global answer. The pattern's cost is rounds of communication, its benefit is solving problems whose data is already spread out, and its boundary is the density of the interactions. The natural next question is architectural: should the coordination that makes this work live in one place, in no place, or somewhere in between? That is the centralized-versus-decentralized-versus-hybrid question, and it is the subject of Section 27.3.
For each problem, describe a natural decomposition into subproblems for cooperating solvers, and state whether the resulting subproblems interact sparsely or densely: (a) labeling every road segment in a city map as one-way or two-way so that the network stays fully connected; (b) scheduling final exams for a university so no student has two exams at once; (c) computing the average temperature reported by a field of weather sensors. For each, say whether distributed problem solving is the right tool or whether a centralized solver (or a single MapReduce-style pass) would be simpler, and justify the answer using the interaction-density argument of Section 6.
Extend Code 27.2.1 in two ways. First, replace the five-region map with a larger random graph (for example, 30 regions where each pair is adjacent with probability 0.2) and reduce the palette to two colors so that some instances are unsolvable; instrument the loop to report whether it converged, how many rounds it took, and the final conflict count when it stalls. Second, sweep the activation probability from 0.1 to 1.0 and plot rounds-to-convergence against it. Explain why very low activation converges slowly and why activation of exactly 1.0 (every conflicted agent always moves) can fail to converge at all, connecting your observation to the lockstep-oscillation problem the randomized activation was introduced to fix.
Consider a problem with $n$ solvers in which each round every solver sends one message to each of its $k$ neighbors, the system converges in $r$ rounds, and one message costs $\alpha$ seconds of latency plus $\beta$ seconds per byte for a message of $b$ bytes. Write the total communication time of the distributed solve. Now suppose centralizing the problem requires shipping $S$ bytes of raw data to one machine over a link of bandwidth $B$ bytes per second, after which a central solver finishes in $T_c$ seconds. Give the inequality on $r$, $k$, and the message size under which the distributed solve is faster than centralizing, and use it to argue, in one paragraph, why sparse interactions (small $k$) and few rounds (small $r$) are the precise conditions under which distributed problem solving wins. Relate your inequality to the $\alpha$-$\beta$ communication model of Chapter 3.