Part VI: Distributed AI and Multi-Agent Systems
Chapter 29: Multi-Agent Systems

Agent Architectures

"They asked me to plan three steps ahead. I am still planning the first one, and the world has already moved twice."

A Deliberative Agent on a Deadline

Big Picture

An agent architecture is the internal wiring that turns percepts into actions, and the choice of wiring sits on a spectrum from reflex to reasoning that decides everything else about the agent: how fast it responds, how far ahead it sees, and what kinds of coordination it can join. At one end, a reactive agent maps what it senses straight to what it does, with no model of the world and no plan; it is fast and hard to break but cannot look ahead. At the other end, a deliberative agent maintains an explicit model of the world and searches over it for a good course of action; it can plan, negotiate, and reason about other agents, but it is slower and more fragile. Real systems, including the large language model agents now driving distributed applications, are hybrids: a fast reactive layer underneath a slower deliberative one. This section builds each architecture from its primitives and shows, in runnable code, the trade-off that makes the hybrid the practical norm.

In Section 29.1 we defined an agent as anything that perceives its environment and acts on it to pursue an objective. That definition is silent on the most consequential design question: what goes on between perceiving and acting? A thermostat and a chess engine are both agents, yet one responds in microseconds with a single comparison while the other may search millions of positions before it moves. The space of answers to "what goes in the middle" is the space of agent architectures, and it is not a grab-bag of unrelated designs but a continuum organized by one axis: how much internal modeling and lookahead the agent does before it commits to an action. Naming the points on that axis, and understanding what each one buys and costs, is the work of this section, and it sets up everything that follows, because the coordination an agent can perform with its peers (Sections 29.5 through 29.9) is bounded by the architecture inside it.

[Figure: three panels. Reactive: Percept → if-then behaviour rules (stimulus to response) → Action; no world model, no lookahead, fast. Deliberative (BDI): Percept → Beliefs (world model), Desires (goals), Intentions (committed plan) → Planner / search → Action; explicit model, plans ahead, slower. Hybrid (layered): Percept → Deliberative layer (plan, negotiate, set goals) above Reactive layer (fast reflexes, safety) → Action; goals flow down, reflexes act now.]
Figure 29.2.1: The architecture spectrum. The reactive agent (left) maps a percept straight to an action through if-then rules with no internal model. The deliberative agent (centre) follows the BDI loop: percepts update Beliefs, goals are its Desires, the planner commits to Intentions, and only then does it act. The hybrid (right) places a fast reactive layer beneath a slow deliberative layer, so reflexes fire immediately while plans set the goals the reflexes serve. The dashed arc marks the layer interface where the deliberative layer can pre-empt or be pre-empted by the reactive one.

1. Reactive Architectures: Sense and Act Beginner

The simplest agent does not think; it reacts. A reactive architecture maps the current percept directly to an action through a fixed set of condition-action rules, with no internal representation of the world beyond what the sensors report right now. Formally, the agent is a function $\pi: \mathcal{P} \to \mathcal{A}$ from the percept space to the action space, evaluated fresh on every cycle. There is no memory of past states, no model of how the world will evolve, and no search over possible futures. A wall-following robot that turns left whenever its right sensor goes quiet is a reactive agent; so is a network load balancer that forwards each request to the least-loaded server it can currently see.
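To make the function view concrete, here is a minimal sketch of a reactive thermostat written as ordered condition-action rules. The percept field, the setpoint, the band, and the action names are all illustrative assumptions; the only point is that `policy` is a pure function of the current percept, with no stored state.

```python
from typing import Callable, NamedTuple

class Percept(NamedTuple):
    temperature_c: float   # the current sensor reading and nothing else

Rule = tuple[Callable[[Percept], bool], str]

SETPOINT_C, BAND_C = 20.0, 0.5   # illustrative values, not from the text

# Condition-action rules, checked in order; the first match wins.
RULES: list[Rule] = [
    (lambda p: p.temperature_c < SETPOINT_C - BAND_C, "heat_on"),
    (lambda p: p.temperature_c > SETPOINT_C + BAND_C, "heat_off"),
    (lambda p: True, "hold"),    # default rule always matches
]

def policy(percept: Percept) -> str:
    """pi: P -> A. Stateless: the same percept always yields the same action."""
    for condition, action in RULES:
        if condition(percept):
            return action
    raise RuntimeError("no rule matched")

print([policy(Percept(t)) for t in (18.0, 20.2, 23.0)])
# → ['heat_on', 'hold', 'heat_off']
```

Because nothing persists between calls, the agent can be restarted at any moment without losing anything, which is the robustness the definition implies.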

The canonical formulation is Brooks's subsumption architecture, which builds an agent as a stack of simple behaviors, each a tight sense-act loop, with higher layers able to suppress or override lower ones. A mobile robot might have a bottom layer that avoids obstacles, a middle layer that wanders, and a top layer that heads for a goal; each runs continuously and the conflict between them is resolved by suppression, not by a central planner. The radical claim of subsumption was that complex, robust behavior can emerge from layered reflexes with no world model at all, an idea that became the intellectual basis for the swarm agents we study in Chapter 31, where thousands of reflex agents produce collective behavior no single one represents.
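The layering can be sketched as simple priority arbitration. This is a deliberate simplification and not Brooks's original wiring of suppressor and inhibitor nodes: each behavior here either proposes an action or stays silent, and the highest layer that speaks suppresses those beneath it. The percept keys and action names are hypothetical.

```python
from typing import Callable, Optional

Percept = dict                                  # raw sensor readings for this cycle
Behavior = Callable[[Percept], Optional[str]]   # an action, or None to stay silent

def avoid_obstacles(p: Percept) -> Optional[str]:
    # Bottom layer: always produces a command, so the robot is never without one.
    return "turn_left" if p["obstacle_ahead"] else "forward"

def wander(p: Percept) -> Optional[str]:
    # Middle layer: nudges the heading occasionally, only when the way is clear.
    if p["obstacle_ahead"] or not p["wander_tick"]:
        return None
    return "veer_right"

def head_for_goal(p: Percept) -> Optional[str]:
    # Top layer: steers toward the goal when it is sensed and the way is clear.
    if p["obstacle_ahead"] or p["goal_bearing"] is None:
        return None
    return f"steer_{p['goal_bearing']}"

LAYERS: list[Behavior] = [head_for_goal, wander, avoid_obstacles]  # highest first

def arbitrate(p: Percept) -> str:
    """Return the action of the highest layer that has something to say."""
    for behavior in LAYERS:
        action = behavior(p)
        if action is not None:
            return action      # this layer suppresses everything beneath it
    raise RuntimeError("bottom layer must always return an action")

print(arbitrate({"obstacle_ahead": False, "wander_tick": False, "goal_bearing": "east"}))
# → steer_east
print(arbitrate({"obstacle_ahead": True, "wander_tick": True, "goal_bearing": "east"}))
# → turn_left
```

No layer consults a shared world model or a plan; the behavior of the whole comes from which reflex happens to be active on each cycle.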

Key Insight: Reactive Agents Trade Foresight for Speed and Robustness

A reactive agent re-decides from scratch every cycle using only the present percept, so it is fast (one rule lookup, not a search) and robust (a stale internal model cannot mislead it, because there is no internal model). The price is myopia: with no representation of the future, it cannot avoid a trap it could see coming, cannot pursue a goal that requires temporarily moving away from it, and cannot reason about what another agent will do next. Reactivity is the right architecture exactly when the environment changes faster than the agent could plan over it, or when correctness depends on responding now rather than responding optimally.

2. Deliberative Architectures and the BDI Model Intermediate

A deliberative agent does the opposite of a reactive one: it maintains an explicit, symbolic model of the world and chooses its actions by reasoning over that model, typically by searching for a sequence of actions that achieves a goal. Where the reactive agent asks "what does my rule say to do right now?", the deliberative agent asks "what is the state of the world, what do I want, and what sequence of actions gets me from here to there?" This is the architecture of classical planning, of the chess engine, and of any agent that must look several moves ahead.

The most influential deliberative architecture is the Belief-Desire-Intention model, or BDI, drawn from a philosophical account of practical reasoning and made concrete for agents by Rao and Georgeff. BDI factors an agent's internal state into three components. Beliefs are the agent's model of the world, updated from percepts and possibly wrong. Desires are the states of affairs the agent would like to bring about, its goals, which may conflict. Intentions are the desires the agent has committed to and is actively planning to achieve; an intention is a goal plus the resources and persistence to pursue it. The BDI control loop runs continuously: revise beliefs from new percepts, generate options (candidate desires), filter them into intentions given current commitments, and execute the plan for the chosen intention, reconsidering when beliefs change enough to matter. The genius of the model is the notion of commitment: an intention persists so the agent does not endlessly re-deliberate, but it is dropped when beliefs show it has become impossible or pointless, so the agent is not blindly stubborn.
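The control loop just described can be sketched in a few dozen lines. This is a toy under stated assumptions, not a faithful BDI interpreter: a "plan" is reduced to a single action name, options are filtered by a fixed priority, and an intention is dropped only when it is believed achieved or impossible. The class names, belief keys, and the delivery-robot scenario are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Beliefs = dict

@dataclass
class Desire:
    name: str
    priority: int
    achieved: Callable[[Beliefs], bool]
    possible: Callable[[Beliefs], bool]
    action: str                      # stand-in for a real plan: one action name

@dataclass
class BDIAgent:
    desires: list
    beliefs: Beliefs = field(default_factory=dict)
    intention: Optional[Desire] = None

    def step(self, percept: Beliefs) -> str:
        self.beliefs.update(percept)                       # 1. revise beliefs
        cur = self.intention
        if cur is not None and (cur.achieved(self.beliefs)
                                or not cur.possible(self.beliefs)):
            self.intention = None                          # drop: done or hopeless
        if self.intention is None:
            options = [d for d in self.desires             # 2. generate options
                       if d.possible(self.beliefs) and not d.achieved(self.beliefs)]
            if not options:
                return "idle"
            self.intention = max(options, key=lambda d: d.priority)  # 3. commit
        return self.intention.action                       # 4. execute

recharge = Desire("recharge", 2,
                  achieved=lambda b: b["battery"] >= 80,
                  possible=lambda b: b["charger_online"],
                  action="go_to_charger")
deliver = Desire("deliver", 1,
                 achieved=lambda b: b["delivered"],
                 possible=lambda b: True,
                 action="carry_parcel")

agent = BDIAgent([recharge, deliver])
percepts = [
    {"battery": 30, "charger_online": True, "delivered": False},
    {"charger_online": False},   # recharge becomes impossible: intention dropped
    {"charger_online": True},    # recharge is possible again, but deliver persists
    {"delivered": True},         # deliver achieved: agent re-deliberates
]
actions = [agent.step(p) for p in percepts]
print(actions)
# → ['go_to_charger', 'carry_parcel', 'carry_parcel', 'go_to_charger']
```

The third step is the one to watch: the higher-priority desire is available again, yet the agent keeps carrying the parcel because it has committed to that intention and nothing in its beliefs says the commitment has become pointless.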

Deliberation buys foresight. Because the agent has a model, it can simulate the consequences of actions before taking them, avoid traps a reactive agent would walk into, and pursue goals that require moving away from them first. It can also reason about other agents as part of its world model, which is the capability that makes negotiation and explicit coordination possible (we build on exactly this in Section 29.6). The cost is the mirror image of reactivity's benefit: planning takes time and compute, and a deliberative agent acting on a stale or wrong model can plan confidently toward disaster. The deeper the lookahead, the worse both costs get.

3. The Reactive-Deliberative Trade-off, Measured Intermediate

The trade-off between reacting and deliberating is easiest to feel when you watch both architectures attempt the same task. The code below puts a reactive agent and a deliberative agent on the same gridworld: walk from the top-left corner to the bottom-right corner of a $7 \times 7$ grid threaded with walls. The reactive agent uses a single greedy rule: step to whichever free neighbor is closest to the goal in Manhattan distance, with a tiny memory of the last few cells so it does not oscillate in place. The deliberative agent first builds a model of the whole grid and runs $A^{*}$ search, using the same Manhattan distance as its heuristic, to find a provably shortest path, then walks it.

import heapq

# 0 = free, 1 = wall. Start at (0,0), goal at (6,6). The shortest route runs along
# the top row and down the right column. A greedy agent whose first step goes down
# is funnelled into the winding corridor through rows 2 and 4 instead.
GRID = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
ROWS, COLS = len(GRID), len(GRID[0])
START, GOAL = (0, 0), (6, 6)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def free(cell):
    r, c = cell
    return 0 <= r < ROWS and 0 <= c < COLS and GRID[r][c] == 0

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Reactive: percept -> action. No map, no plan. Step greedily toward the goal,
# with a short memory so it does not bounce between two cells forever.
def reactive_agent():
    pos, path, recent, steps = START, [START], {START}, 0
    while pos != GOAL and steps < 200:
        steps += 1
        cands = [n for n in (tuple(map(sum, zip(pos, m))) for m in MOVES) if free(n)]
        fresh = [n for n in cands if n not in recent]
        pos = min(fresh or cands, key=lambda n: manhattan(n, GOAL))
        path.append(pos); recent.add(pos)
        if len(recent) > 4:
            recent.discard(path[-5])
    return path, steps

# Deliberative: build a world model and plan a shortest path with A* up front.
def deliberative_agent():
    frontier = [(manhattan(START, GOAL), 0, START)]
    came_from, cost, expanded = {START: None}, {START: 0}, 0
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        expanded += 1
        if cur == GOAL:
            break
        for m in MOVES:
            nxt = tuple(map(sum, zip(cur, m)))
            if free(nxt) and (nxt not in cost or g + 1 < cost[nxt]):
                cost[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + manhattan(nxt, GOAL), g + 1, nxt))
                came_from[nxt] = cur
    path, node = [], GOAL
    while node is not None:
        path.append(node); node = came_from[node]
    return path[::-1], expanded

r_path, r_steps = reactive_agent()
d_path, d_expanded = deliberative_agent()
opt = len(d_path) - 1
print("task: walk a 7x7 gridworld from (0,0) to (6,6)\n")
print("REACTIVE agent (percept -> action, no plan)")
print(f"  reached goal      : {r_path[-1] == GOAL}")
print(f"  moves taken       : {r_steps}")
print(f"  path length       : {len(r_path) - 1} cells\n")
print("DELIBERATIVE agent (world model + A* plan)")
print(f"  path length       : {opt} cells (provably shortest)")
print(f"  nodes expanded    : {d_expanded} (planning cost, paid up front)\n")
print(f"shortest path is {opt} cells; reactive used {r_steps} moves "
      f"({r_steps / opt:.2f}x the optimum)")
Code 29.2.1: Two architectures on one task. The reactive agent commits to an action from the present percept alone; the deliberative agent searches a model of the entire grid before its first step. Both reach the goal, but only one is guaranteed to take the shortest route.
task: walk a 7x7 gridworld from (0,0) to (6,6)

REACTIVE agent (percept -> action, no plan)
  reached goal      : True
  moves taken       : 20
  path length       : 20 cells

DELIBERATIVE agent (world model + A* plan)
  path length       : 12 cells (provably shortest)
  nodes expanded    : 21 (planning cost, paid up front)

shortest path is 12 cells; reactive used 20 moves (1.67x the optimum)
Output 29.2.1: The deliberative agent finds the 12-cell shortest path along the top row and down the right column; the reactive agent reaches the goal in 20 moves, 1.67 times the optimum, because its first greedy step (a tie between down and right, broken by move order) takes it down the left column and into the winding corridor through rows 2 and 4, which it must then follow to the end. The deliberative agent paid for that quality by expanding 21 search nodes before taking a single step.

The two numbers tell the whole story. The reactive agent never expanded a search node; each of its moves was one cheap rule evaluation, so on a per-decision basis it is far faster and it would keep working even if the walls shifted under it mid-run. But its greedy local rule has no way to know that the corridor it enters from the left column winds back and forth across the grid, so it pays with a path that is two thirds longer than it needed to be. The deliberative agent spent real work, 21 node expansions, before it moved at all, and that investment bought a provably optimal route. On a tiny grid the planning cost is trivial; multiply the state space by a few orders of magnitude and the deliberative agent can still be planning while the reactive one has already finished. Neither architecture is universally better, which is exactly why the next section refuses to choose.

Fun Note: The Agent That Optimized Itself Into a Corner

Greedy reactive agents have a comic failure mode that anyone who has used a phone map in a dense city has felt: the route that points most directly at the destination is the one that dead-ends at a river. The agent in Code 29.2.1 falls for a version of the same trick, marching along row 2 toward the goal until a wall turns it around and sends it back across the grid. The fix is not a smarter reflex; it is a map. That is the entire argument for deliberation in one gridworld detour.

4. Hybrid Layered Architectures: Why the Middle Wins Intermediate

Faced with a trade-off where each end has a fatal weakness, practical designers refuse the choice and build hybrids. A hybrid, or layered, architecture stacks a reactive layer and a deliberative layer so that each does what it is good at. The reactive layer runs in a tight, fast loop and owns anything that must happen immediately: avoid the obstacle, respect the safety limit, keep the robot upright. The deliberative layer runs more slowly, maintains the world model, and plans toward goals, handing those goals down to shape what the reactive layer is reaching for. When the world demands an instant response, reflexes win; when there is time to think, the planner sets the agenda. This layered structure is common in deployed autonomous systems, from self-driving stacks to warehouse robots, because it gets the foresight of deliberation without surrendering the millisecond reflexes that keep the agent safe.

The interface between the layers is where the design effort goes. A common pattern is three tiers: a reactive skill layer at the bottom, a sequencing layer in the middle that strings skills into behaviors, and a deliberative planning layer on top. The planner cannot micromanage the reflexes (that would reintroduce the latency it was meant to escape), and the reflexes cannot ignore the plan (that would discard the foresight). The art is letting goals flow down and letting urgent percepts pre-empt upward, which is precisely the dashed interface in Figure 29.2.1.
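A two-layer version of this interface can be sketched on a one-dimensional corridor. The sketch omits the middle sequencing tier for brevity, and every name in it (`deliberative_layer`, `reactive_layer`, `plan_every`, the percept keys) is a hypothetical choice made for illustration. The slow layer only sets the goal; the fast layer runs every tick and lets a safety reflex pre-empt that goal.

```python
def deliberative_layer(beliefs):
    """Slow layer: choose the goal the reflexes should reach for (planner stand-in)."""
    return beliefs["pending_goals"][0] if beliefs["pending_goals"] else None

def reactive_layer(percept, goal):
    """Fast layer: runs every tick; the safety reflex pre-empts the goal."""
    if percept["blocked"]:
        return "stop"
    if goal is None or percept["pos"] == goal:
        return "hold"
    return "right" if goal > percept["pos"] else "left"

def run(ticks, blocked_at, start=0, goals=(3,), plan_every=4):
    pos, goal, log = start, None, []
    beliefs = {"pending_goals": list(goals)}
    for t in range(ticks):
        if t % plan_every == 0:                  # deliberation runs at a lower rate
            goal = deliberative_layer(beliefs)
        percept = {"pos": pos, "blocked": t in blocked_at}
        action = reactive_layer(percept, goal)   # reflexes run on every tick
        pos += {"right": 1, "left": -1}.get(action, 0)
        log.append(action)
    return pos, log

final_pos, log = run(ticks=5, blocked_at={1})
print(final_pos, log)
# → 3 ['right', 'stop', 'right', 'right', 'hold']
```

The obstacle at tick 1 costs one tick and no replanning: the reflex stops the agent, and on the next tick it resumes toward the goal the planner set earlier.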

Thesis Thread: Architecture Bounds the Coordination an Agent Can Join

This book's spine is that intelligence at scale is distributed across many machines, and the multi-agent chapters are where decision-making itself is the thing being distributed (the sixth axis from Section 1.1). The architecture inside each agent sets the ceiling on how those agents can coordinate. A purely reactive agent can only coordinate implicitly, through the marks it leaves in a shared environment, which is exactly the stigmergy that powers swarms in Chapter 31. A deliberative agent, because it models other agents, can coordinate explicitly: it can negotiate, bid in an auction, and reason about commitments, the machinery of Section 29.6 onward. Choosing an architecture is therefore choosing what kind of distributed intelligence you can build on top of it.

5. The Modern LLM Agent Is a Deliberative Loop Advanced

The agents now wiring together distributed applications, the ones built around a large language model, are deliberative agents in the BDI lineage, even though their builders rarely use that vocabulary. The mapping is close. The LLM agent perceives by assembling a context: the user's request, observations from previous steps, and retrieved memory. It plans by reasoning in natural language, either with chain-of-thought prompting, which works through the problem step by step, or with ReAct, which interleaves each reasoning thought with an action. It acts by calling a tool: a search engine, a code interpreter, an API. It then observes the tool's result, folds it back into the context, and repeats the loop until the goal is met. The model's beliefs are its context window plus external memory; its desires are the objective in the prompt; its intentions are the plan it is currently executing.

[Figure: cycle diagram. Perceive (build context) → Plan (reason, ReAct) → Act (call a tool) → Observe (read result), with a Memory store holding beliefs across steps; loop until goal reached.]
Figure 29.2.2: The LLM agent as a deliberative cycle. Perceive (assemble context) feeds Plan (reason about the next step, often in the ReAct thought-then-action style), which feeds Act (invoke a tool), which feeds Observe (read the result). The observation updates a Memory store that carries beliefs across iterations, and the loop repeats until the objective is satisfied. This is the BDI loop of Section 2 rendered in tokens and tool calls.

Two structural choices recur in these systems. The first is the planner-executor split, a hybrid in everything but name: one component (often a stronger model with more reasoning budget) decomposes the goal into a plan, and a separate, cheaper executor carries out each step, calling tools and reporting back. The planner deliberates; the executor reacts. When a single agent's planner-executor split is lifted to many agents, with a planner agent dispatching subtasks to a fleet of executor agents, you have a distributed orchestration problem, which is the subject of Chapter 32. The second choice is memory architecture: because the context window is finite, a serious agent keeps short-term working memory in the prompt and long-term memory in an external store it retrieves from, the same retrieval machinery we built for vector search in Chapter 25, now serving an agent's beliefs instead of a user's query.
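The planner-executor split can be sketched with plain functions standing in for the two models. Everything here is an assumption made for illustration: the fixed two-step plan, the `{prev}` placeholder convention for passing one step's result to the next, and the toy tools. A real planner would be a model call that returns a plan in some structured format.

```python
from typing import Callable

Tool = Callable[[str], str]

def plan(objective: str) -> list[tuple[str, str]]:
    """Planner stand-in: decompose the objective into (tool_name, argument) steps."""
    return [("search", objective), ("summarize", "{prev}")]

def execute(steps: list[tuple[str, str]], tools: dict[str, Tool]) -> list[str]:
    """Executor: no deliberation, just run each step and report the result."""
    results: list[str] = []
    prev = ""
    for tool_name, arg in steps:
        prev = tools[tool_name](arg.replace("{prev}", prev))
        results.append(prev)
    return results

# Toy tools; real ones would call a search API or another model.
tools = {
    "search": lambda q: f"3 documents about {q}",
    "summarize": lambda text: f"summary of [{text}]",
}

report = execute(plan("agent architectures"), tools)
print(report[-1])
# → summary of [3 documents about agent architectures]
```

Keeping `plan` and `execute` behind separate functions is what lets the two run on different models, or, once the executor is replicated, on different machines.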

Research Frontier: Reasoning, Acting, and Remembering in LLM Agents (2024 to 2026)

The architecture of LLM agents is moving fast. ReAct (Yao et al., 2023) established the interleaving of reasoning traces and tool actions that almost every framework now copies, and Reflexion added a verbal self-critique step so an agent revises its plan from its own failures. Planning has split into a planner-executor division of labor, with plan-and-solve and tree-of-thoughts style search letting a deliberative planner explore several lines before committing, an explicit return of the lookahead from Section 2. The hardest open problem is memory: agents need beliefs that persist coherently across long horizons, and 2024 to 2026 work on agent memory architectures, from explicit memory streams that score entries by recency, importance, and relevance, to learned summarization and retrieval policies over an external store, is the current frontier. The throughline is that the field is rediscovering, in the language of prompts and tools, the belief-desire-intention structure that deliberative agent research named decades ago, now scaled across the serving fleets of Chapter 24.
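A memory stream that scores entries by recency, importance, and relevance can be sketched as follows. The scoring formula is an assumption chosen for simplicity: an unweighted sum of an exponential recency decay, a stored importance value, and a toy word-overlap relevance in place of embedding similarity. The names `MemoryEntry`, `score`, `retrieve`, and `half_life` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    text: str
    created_at: float    # seconds, on the same clock as `now`
    importance: float    # in [0, 1], assigned when the entry is written

def relevance(query: str, text: str) -> float:
    """Toy relevance: Jaccard overlap of lowercase word sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def score(entry: MemoryEntry, query: str, now: float, half_life: float = 3600.0) -> float:
    recency = 0.5 ** ((now - entry.created_at) / half_life)   # halves every half_life
    return recency + entry.importance + relevance(query, entry.text)

def retrieve(store: list[MemoryEntry], query: str, now: float, k: int = 2) -> list[MemoryEntry]:
    """Return the k highest-scoring entries to place in the context window."""
    return sorted(store, key=lambda e: score(e, query, now), reverse=True)[:k]

store = [
    MemoryEntry("user prefers metric units", created_at=0.0, importance=0.9),
    MemoryEntry("weather tool returned rain in Oslo", created_at=3000.0, importance=0.3),
    MemoryEntry("greeted the user", created_at=3500.0, importance=0.1),
]
top = retrieve(store, "what units does the user prefer", now=3600.0)
print([e.text for e in top])
# → ['user prefers metric units', 'greeted the user']
```

The oldest entry still ranks first because its importance and relevance outweigh its decayed recency, which is the behavior a three-factor score is meant to produce.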

Library Shortcut: A ReAct Agent Loop in a Dozen Lines

The perceive-plan-act-observe loop of Figure 29.2.2 is what agent frameworks such as LangChain, LlamaIndex, and the OpenAI and Anthropic tool-use SDKs package for you. You declare the tools and the objective; the framework runs the deliberative cycle, parses the model's chosen action, executes the tool, feeds the observation back, and stops when the model emits a final answer. The hand-rolled loop below shows the shape the library implements:

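# Sketch of the loop a ReAct agent framework runs for you. Assumed interfaces:
# llm(text) returns an object with .is_final, .answer, .tool, and .args;
# tools maps tool names to callables.
def prompt(memory, tools):
    # PERCEIVE: assemble the context from the transcript and the tool list
    return "\n".join(memory) + "\nTools: " + ", ".join(tools)

def run_agent(llm, tools, objective, max_steps=8):
    memory = [f"Goal: {objective}"]            # beliefs, as a running transcript
    for _ in range(max_steps):
        thought = llm(prompt(memory, tools))   # PLAN: reason about the next step
        if thought.is_final:                   # goal reached -> stop deliberating
            return thought.answer
        result = tools[thought.tool](thought.args)   # ACT: call the chosen tool
        memory.append(f"Action: {thought.tool}({thought.args})")  # record what was tried
        memory.append(f"Observation: {result}")      # OBSERVE: fold result back in
    return "step budget exhausted"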
Code 29.2.2: The deliberative LLM-agent loop in skeleton form. A production framework replaces this with robust parsing, retries, tool schemas, and an external memory store, collapsing roughly a hundred lines of plumbing to a single agent.run(objective) call while preserving exactly this perceive-plan-act-observe structure.
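To watch the loop run end to end without a model or an SDK, the sketch below replaces the LLM with a scripted stub. The `Step` class, its field names, `scripted_llm`, and the `multiply` tool are all hypothetical stand-ins; no real framework API is being shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    """What the stubbed model returns each turn; field names are illustrative."""
    text: str
    tool: Optional[str] = None
    args: str = ""
    answer: Optional[str] = None

    @property
    def is_final(self) -> bool:
        return self.answer is not None

def scripted_llm(transcript: str) -> Step:
    """Stand-in for a model call: decides from what the transcript already holds."""
    if "Observation:" not in transcript:
        return Step("I need the product of 6 and 7.", tool="multiply", args="6,7")
    observed = transcript.rsplit("Observation: ", 1)[1].strip()
    return Step("I have the result.", answer=f"The answer is {observed}.")

def multiply(args: str) -> str:
    a, b = (int(x) for x in args.split(","))
    return str(a * b)

def run_agent(llm, tools, objective, max_steps=8):
    memory = [f"Goal: {objective}"]                       # beliefs as a transcript
    for _ in range(max_steps):
        step = llm("\n".join(memory))                     # perceive + plan
        if step.is_final:
            return step.answer, memory
        memory.append(f"Thought: {step.text}")
        memory.append(f"Action: {step.tool}({step.args})")
        memory.append(f"Observation: {tools[step.tool](step.args)}")  # act + observe
    return "step budget exhausted", memory

answer, trace = run_agent(scripted_llm, {"multiply": multiply}, "What is 6 times 7?")
print(answer)
# → The answer is 42.
```

The returned `trace` holds the goal followed by one thought, action, and observation, which is the transcript a real model would see on its second turn.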

Practical Example: Choosing an Architecture for a Warehouse Picking Robot

Who: A robotics engineer at a fulfilment-center automation company.

Situation: A fleet of mobile robots must navigate a warehouse floor shared with human workers, picking items and routing them to packing stations.

Problem: A purely deliberative stack that replanned the whole route on every change reacted too slowly to a worker stepping into an aisle; a purely reactive stack never reached distant shelves efficiently and deadlocked at congested intersections.

Dilemma: Plan everything centrally for optimal routes but risk a collision when a human appears in the 200 milliseconds the planner is still thinking, or react locally for safety but accept inefficient, sometimes stuck, paths.

Decision: They built a hybrid layered architecture, putting a reactive collision-avoidance and speed-limit layer in a 50-millisecond loop beneath a deliberative route planner that ran a few times a second.

How: The planner produced a route as a goal stream; the reactive layer followed it but could override any command instantly to stop or swerve for a human, then resumed toward the same goal once the path cleared.

Result: Collisions with workers fell to zero in testing while throughput stayed within a few percent of the all-deliberative optimum, because reflexes handled the rare urgent case and the planner handled the common efficient one.

Lesson: When safety needs reflexes and efficiency needs foresight, do not choose; layer. The hybrid wins precisely because the two failure modes (too slow, too myopic) have non-overlapping cures.

6. Architecture as a Coordination Decision Intermediate

We can now close the loop back to the chapter's purpose. The architecture inside an agent is not a private implementation detail; it determines what the agent can do with other agents. A reactive agent has no model of its peers, so it cannot negotiate, cannot make or honor an explicit agreement, and cannot reason about what another agent intends. It can still coordinate, but only implicitly, by sensing and responding to the shared state others have changed, which is the indirect coordination that produces flocking and ant-trail optimization in Chapter 31. A deliberative agent, because its world model can include models of other agents, can coordinate explicitly: it can communicate intentions (Section 29.4), reach agreements through negotiation (Section 29.6), and form coalitions (Section 29.7). The game-theoretic reasoning we set up in Chapter 28 presupposes exactly this deliberative capacity, because reasoning about equilibria requires modeling what rational others will do.

So the first design question for any multi-agent system is not "how will the agents coordinate?" but "what is inside each agent?", because the answer fixes the coordination mechanisms available. A swarm of cheap reactive agents and a society of deliberative negotiators are different systems with different physics, and the difference is decided by the architecture this section has mapped. The next section, Section 29.3, turns from the inside of one agent to the environments many agents share, the medium through which all of this coordination, implicit or explicit, actually happens.

Exercise 29.2.1: Place the Agent on the Spectrum Conceptual

For each system, classify its architecture as reactive, deliberative, or hybrid, and justify the call by naming the world model (if any) and the lookahead (if any): (a) a TCP congestion-control algorithm that adjusts its sending rate from observed packet loss; (b) a logistics planner that schedules a week of deliveries by searching over routes; (c) a self-driving car's full stack; (d) a single thermostat. For each, state one coordination behavior the architecture cannot support and explain why, connecting your answer to Section 6.

Exercise 29.2.2: Make the Reactive Agent Smarter, or Not Coding

Starting from Code 29.2.1, give the reactive agent a one-cell lookahead: before stepping, it may peek at the free neighbors of each candidate and break ties by their distance to the goal, but it still keeps no global map. Measure whether this closes the gap to the deliberative agent's 12-cell path on the given grid, then design a new wall layout where even this improved reactive rule is badly suboptimal. Explain what property of the environment defeats any fixed-depth reactive rule, and why only a full world model removes the gap entirely.

Exercise 29.2.3: The Planning-Latency Budget Analysis

Suppose a deliberative agent's $A^{*}$ search expands a number of nodes that grows roughly as $b^{D}$ for branching factor $b$ and solution depth $D$, and each expansion costs $t$ seconds, while a reactive step costs a fixed $c$ seconds with $c \ll t$. The environment changes meaningfully every $\Delta$ seconds. Derive the condition on $\Delta$, $b$, $D$, $t$, and $c$ under which the deliberative agent's plan is already stale by the time it finishes planning. Argue from your condition why a hybrid architecture, deliberating only over a coarse abstraction while reacting at full rate, can stay useful in environments where pure deliberation cannot, and relate this to the warehouse robot's choice of loop rates in the practical example.