Section 39.7: Multi-Agent Reinforcement Learning

"I rehearsed the whole formation in a simulator where I could see every drone at once. On the real flight I can see only my three neighbors, and somehow the rehearsal still holds."
A Policy, Learning to Cooperate With Copies of Itself

Big Picture

A swarm behavior worth deploying cannot be hand-coded for every contingency, so we learn it; but a learned policy that needs a global view to act cannot fly on a drone that sees only its neighbors, so we train with global information and execute with local information. The earlier sections of this chapter built the swarm from the bottom up: the distributed control loops of Section 39.6, the local sensing and consensus that let robots agree without a central authority, and the flocking rules that turn local interactions into global formations. Those rules were designed by hand. This section replaces the designer with a learner. We apply the multi-agent reinforcement learning of Chapter 30 to the concrete robotics setting, and the single idea that makes the result deployable is centralized training with decentralized execution: a critic that sees the whole swarm in simulation shapes policies that, once trained, run on each robot using only what that robot can sense. The training itself is a distributed-systems problem, thousands of simulated swarms stepping in parallel on the actor-learner infrastructure of Chapter 20, which is why this section belongs in a case study and not in the theory chapter that derived the algorithms.

Chapter 30 developed multi-agent reinforcement learning as theory: Markov games, the non-stationarity that arises when every agent learns at once, the value-decomposition and actor-critic families that tame it. This section does not re-derive that machinery; it assumes it and asks the engineering question a robotics team actually faces. We have a swarm of physical robots, each with a narrow sensor footprint and an onboard computer too small to model the whole group. We want behaviors, coordinated search, formation flight, cooperative transport, that no one wants to specify rule by rule. How do we learn a policy that each robot can run alone, and how do we train it at the scale that learning demands? The answer threads two ideas that recur throughout the book: train where information is cheap and abundant (a simulator), execute where it is scarce (the robot), and parallelize the training across many machines because reinforcement learning is desperately sample-hungry.

1. The Robotics Setting as a Decentralized Markov Game (Intermediate)

To learn swarm behavior we first say precisely what kind of problem it is. A team of $n$ robots interacting with a shared world, each seeing only part of it, is a decentralized partially observable Markov decision process, a Dec-POMDP, the cooperative special case of the Markov games studied in Chapter 30. It is specified by the tuple

$$\langle\, \mathcal{S},\; \{\mathcal{A}_i\}_{i=1}^{n},\; P,\; r,\; \{\Omega_i\}_{i=1}^{n},\; O,\; \gamma \,\rangle,$$

where $\mathcal{S}$ is the global state of the swarm and its environment, $\mathcal{A}_i$ is robot $i$'s action set (a velocity command, a thrust vector), and the joint action $\mathbf{a} = (a_1,\dots,a_n)$ drives the transition kernel $P(s' \mid s, \mathbf{a})$. The reward $r(s,\mathbf{a})$ is shared by the whole team, which is what makes the task cooperative: there is one number, and every robot is measured against it. Crucially, robot $i$ never sees $s$; it receives only an observation $o_i \in \Omega_i$ drawn from $O(o_i \mid s, i)$, its own local view. The objective is a set of decentralized policies $\pi_i(a_i \mid \tau_i)$, each a function of robot $i$'s own observation history $\tau_i$, that jointly maximize the discounted team return $\mathbb{E}\!\left[\sum_{t} \gamma^t\, r(s_t, \mathbf{a}_t)\right]$. The defining constraint, the one that shapes every algorithm in this section, is that each $\pi_i$ may depend only on $\tau_i$, because at deployment that is all robot $i$ has.
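To make the tuple concrete, here is a minimal sketch of a toy cooperative Dec-POMDP in NumPy. The environment, its name `LineSwarm`, the sensing radius, and the reward are all hypothetical and chosen only for illustration: the global state $s$ is the vector of robot positions on a line, each observation $o_i$ contains only the offsets to neighbors within sensing range, and one shared reward is returned for the whole team.

```python
import numpy as np

class LineSwarm:
    """Toy cooperative Dec-POMDP (hypothetical): n robots on a line try to spread out."""

    def __init__(self, n=4, sense_radius=1.5, seed=0):
        self.n = n
        self.sense_radius = sense_radius
        self.rng = np.random.default_rng(seed)
        self.state = np.zeros(n)                      # global state s (never shown to a robot)

    def reset(self):
        self.state = self.rng.uniform(0.0, 5.0, size=self.n)
        return self.observe()

    def observe(self):
        # o_i: offsets to robots within sensing range; everything else is masked to 0.
        obs = []
        for i in range(self.n):
            offsets = self.state - self.state[i]
            visible = np.abs(offsets) <= self.sense_radius
            obs.append(np.where(visible, offsets, 0.0))
        return obs

    def step(self, actions):
        # Joint action a = (a_1, ..., a_n): one clipped velocity command per robot.
        a = np.clip(np.asarray(actions, dtype=float), -1.0, 1.0)
        self.state = self.state + 0.1 * a             # deterministic transition P(s' | s, a)
        gaps = np.abs(self.state[:, None] - self.state[None, :])
        np.fill_diagonal(gaps, np.inf)
        reward = float(gaps.min())                    # ONE shared team reward r(s, a)
        return self.observe(), reward

env = LineSwarm(n=4, seed=0)
observations = env.reset()
observations, team_reward = env.step([0.5, -0.5, 0.0, 1.0])
print(len(observations), team_reward)
```

The reward is computed from the full position vector, which no single robot sees; any policy that consumes `observations[i]` alone respects the decentralization constraint by construction.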

This formalism makes the central tension explicit. The thing we want to optimize, team return, depends on the global state and the joint action. The thing each robot can condition on, its local observation, is a shadow of that state. If we insist on optimizing only with local information, learning is slow and unstable; if we train with global information we may learn a policy that cannot run on the robot. The resolution is to use global information only while learning, a structure named in the next section.

2. Centralized Training, Decentralized Execution (Intermediate)

The organizing principle of modern cooperative MARL, and the reason learned swarms are deployable at all, is centralized training with decentralized execution, abbreviated CTDE. During training we run in a simulator, where the global state, every robot's observation, and every robot's action are all visible to the learning algorithm at once; we exploit that visibility to compute well-informed learning signals. At execution we discard the global view entirely: each robot carries only its own policy network and feeds it only its own local observation. Figure 39.7.1 contrasts the two phases. The asymmetry is deliberate. It is the simulation-to-reality split that Section 39.8 takes up, and its execution half obeys the same local-information constraint that the distributed control of Section 39.6 already accepted out of hardware necessity.

Figure 39.7.1: Centralized training with decentralized execution. During training (left) a single critic consumes the global state and every robot's action to produce a learning signal that updates all actors; this is cheap because training runs in a simulator that already holds the global state. At execution (right) the critic is gone: each actor network runs on its own robot and is fed only that robot's local observation $o_i$. The policies learned under the critic's guidance remain valid because they were always functions of local observations alone.

CTDE works because the critic is a training-time scaffold, not part of the deployed agent. The policy $\pi_i$ is, from the first gradient step, a function of robot $i$'s local information only (its observation $o_i$, or its history $\tau_i$ when the policy carries memory); the centralized critic merely supplies a better estimate of how good robot $i$'s action was, given what everyone else did. Once training ends, the scaffold is removed and the policies stand on their own. The two algorithm families below differ only in the form that the centralized training signal takes: a decomposed value function, or a centralized critic feeding decentralized actors.

Key Insight: The Critic Is a Simulator-Only Luxury, the Policy Is the Deliverable

Everything that needs the global state, the centralized critic, the joint-action value, the privileged observations, lives only in the training simulator, where global state is free because the simulator computes it. The robot ships with the policy network alone. This is why a learned swarm can be honest about its deployment constraint while still being trained with rich information: the constraint binds the policy's inputs, not the trainer's. When you evaluate a MARL method for robotics, the first question is always "what does the policy condition on at execution?", never "what did the critic see during training".

3. Value Decomposition: VDN and QMIX (Advanced)

The first CTDE family addresses cooperative value-based learning. We want each robot to choose its action greedily from a small per-robot utility $Q_i(\tau_i, a_i)$ that depends only on local information, yet we have only one team reward to learn from. Value decomposition resolves this by positing that the joint action value factorizes through the per-robot utilities. The simplest form, the value decomposition network of Sunehag et al. (2018), takes the factorization to be a sum:

$$Q_{\text{tot}}(\boldsymbol{\tau}, \mathbf{a}) \;=\; \sum_{i=1}^{n} Q_i(\tau_i, a_i).$$

Only $Q_{\text{tot}}$ is ever trained against the shared reward, with an ordinary temporal-difference loss on the team return; the per-robot $Q_i$ emerge as the additive pieces. The payoff is the decentralization guarantee: because the sum is increasing in each $Q_i$, maximizing $Q_{\text{tot}}$ over the joint action is the same as each robot independently maximizing its own $Q_i$. That equivalence, $\arg\max_{\mathbf{a}} Q_{\text{tot}} = (\arg\max_{a_1} Q_1, \dots, \arg\max_{a_n} Q_n)$, is precisely the property that makes the greedy policy decentralizable.
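The equivalence is easy to check numerically. The sketch below uses randomly drawn, hypothetical utilities for three robots with four actions each, and compares each robot's local greedy choice against a brute-force search over every joint action of the additive $Q_{\text{tot}}$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Hypothetical per-robot utilities Q_i(tau_i, .) for one fixed set of histories.
q = rng.normal(size=(n_agents, n_actions))

# Decentralized greedy choice: each robot looks only at its own row.
local_greedy = tuple(int(a) for a in q.argmax(axis=1))

# Centralized greedy choice: brute-force search over all joint actions.
def q_tot(joint):
    return sum(q[i, a] for i, a in enumerate(joint))

joint_greedy = max(itertools.product(range(n_actions), repeat=n_agents), key=q_tot)

print("local greedy :", local_greedy)
print("joint greedy :", joint_greedy)
```

The brute-force search visits $4^3 = 64$ joint actions here and grows exponentially with the number of robots, while the local argmax costs one pass per robot; the factorization is what lets the greedy policy scale.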

QMIX (Rashid et al., 2018) keeps that property while relaxing the rigid sum. It mixes the per-robot utilities through a network whose weights are constrained to be non-negative, so $Q_{\text{tot}}$ is a monotone function of each $Q_i$:

$$Q_{\text{tot}}(\boldsymbol{\tau}, \mathbf{a}) \;=\; f_{\text{mix}}\big(Q_1(\tau_1,a_1), \dots, Q_n(\tau_n,a_n);\, s\big), \qquad \frac{\partial Q_{\text{tot}}}{\partial Q_i} \;\ge\; 0 \ \ \forall i.$$

The mixing network may itself depend on the global state $s$ (available in the simulator), which lets the team value respond to context that no single robot observes, while the monotonicity constraint preserves the same greedy-decentralization equivalence VDN enjoys. The richer mixer represents cooperative value functions that a plain sum cannot, at the cost of a constrained architecture; both share the property that only local utilities are needed to act. The runnable demonstration in Section 6 isolates the additive case so the credit-assignment mechanism is visible in a handful of lines.
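A minimal sketch of a monotone mixer follows. It is a simplification under stated assumptions, not the published architecture in full: the hypernetworks here are single linear maps from the state, all parameter names are hypothetical, and the weights are random rather than trained. What it does preserve is the key construction, taking absolute values of the state-generated weights so that $\partial Q_{\text{tot}} / \partial Q_i \ge 0$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions, state_dim, hidden = 3, 4, 5, 8

# Hypothetical hypernetwork parameters: they map the global state s to mixer weights.
W1_hyper = rng.normal(size=(state_dim, n_agents * hidden))
b1_hyper = rng.normal(size=(state_dim, hidden))
w2_hyper = rng.normal(size=(state_dim, hidden))
b2_hyper = rng.normal(size=state_dim)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mix(q_values, s):
    """Q_tot = w2 . elu(q W1 + b1) + b2, with W1 and w2 forced non-negative."""
    W1 = np.abs(s @ W1_hyper).reshape(n_agents, hidden)   # non-negative mixing weights
    b1 = s @ b1_hyper                                     # biases are unconstrained
    w2 = np.abs(s @ w2_hyper)
    b2 = s @ b2_hyper
    return float(elu(q_values @ W1 + b1) @ w2 + b2)

s = rng.normal(size=state_dim)                 # global state, simulator-only
q = rng.normal(size=(n_agents, n_actions))     # hypothetical per-robot utilities

def q_tot(joint):
    chosen = np.array([q[i, a] for i, a in enumerate(joint)])
    return mix(chosen, s)

local_greedy = tuple(int(a) for a in q.argmax(axis=1))
joint_greedy = max(itertools.product(range(n_actions), repeat=n_agents), key=q_tot)

# Monotonicity spot check: raising one robot's utility never lowers Q_tot.
base = q[np.arange(n_agents), list(local_greedy)]
bumped = base.copy()
bumped[0] += 0.1
print(local_greedy, joint_greedy, mix(bumped, s) >= mix(base, s))
```

Note that the state `s` enters only through the mixer's weights and biases, so it shapes how utilities combine without ever becoming an input to a robot's own $Q_i$.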

4. Multi-Agent Actor-Critic: MADDPG and MAPPO (Advanced)

The second CTDE family keeps explicit policies and is the natural choice for the continuous action spaces of robotics, where a robot emits a velocity or a thrust rather than picking from a discrete menu. Here each robot has an actor $\pi_i(a_i \mid o_i; \theta_i)$ trained by policy gradient, but the gradient uses a centralized critic that sees the global state and the joint action. For robot $i$ the policy-gradient update takes the form

$$\nabla_{\theta_i} J(\theta_i) \;=\; \mathbb{E}\!\left[\, \nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\; A_i\big(s, a_1, \dots, a_n\big) \right],$$

where the advantage $A_i$ is computed from a critic $Q(s, a_1, \dots, a_n)$ or $V(s)$ that is conditioned on global information. Because the critic knows what the other robots did, the value it reports to robot $i$ is not corrupted by their unseen choices; the environment looks stationary from the critic's vantage even though it does not from any single actor's. This is the mechanism by which the centralized critic of multi-agent deep deterministic policy gradient, MADDPG (Lowe et al., 2017), stabilizes learning, and multi-agent PPO, MAPPO (Yu et al., 2022), applies the same recipe with a clipped on-policy objective and a shared critic, reaching strong results across cooperative benchmarks with surprisingly little tuning. At execution the critic is dropped and only $\pi_i(a_i \mid o_i)$ runs on robot $i$, exactly the asymmetry of Figure 39.7.1.
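The split between what the critic sees and what the actor sees can be sketched in a few lines. The example below is an illustration under assumptions, not an implementation of MADDPG or MAPPO: the actor is a hypothetical linear-softmax policy, the critic values $V(s)$ are placeholder numbers standing in for a trained network over the global state, and the advantage is a plain one-step temporal-difference estimate.

```python
import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    """One-step advantage A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` come from a centralized critic V(s) and have length T + 1
    (the last entry is the bootstrap value).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def actor_gradient(theta, obs, actions, adv):
    """Policy-gradient estimate for ONE robot with a linear-softmax actor.

    theta: (obs_dim, n_actions); obs: (T, obs_dim) LOCAL observations only;
    actions: (T,) ints; adv: (T,) advantages supplied by the centralized critic.
    """
    probs = softmax(obs @ theta)                   # pi_i(. | o_i)
    onehot = np.eye(theta.shape[1])[actions]
    grad_logits = onehot - probs                   # d log pi_i(a_i | o_i) / d logits
    return obs.T @ (grad_logits * adv[:, None]) / len(actions)

rng = np.random.default_rng(0)
T, obs_dim, n_actions = 3, 4, 2
theta_i = np.zeros((obs_dim, n_actions))
local_obs = rng.normal(size=(T, obs_dim))          # what robot i sensed
actions_i = np.array([0, 1, 1])                    # what robot i did

# Placeholder critic outputs V(s_0..s_3); in training these come from a network on s.
adv = td_advantage(rewards=[1.0, 0.0, 2.0], values=[0.5, 0.5, 1.0, 0.0], gamma=0.5)
grad_i = actor_gradient(theta_i, local_obs, actions_i, adv)
print(adv, grad_i.shape)
```

Global information reaches the actor only as the scalar weights `adv`; the gradient is taken through a network whose sole input is `local_obs`, which is why the critic can be deleted at deployment.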

Library Shortcut: PettingZoo Gives You the Dec-POMDP, the Algorithm Library Gives You CTDE

Writing a multi-agent environment loop, observation routing, per-agent action spaces, shared-reward bookkeeping, by hand is tedious and error-prone. PettingZoo standardizes the Dec-POMDP interface so that swapping environments or algorithms costs a few lines, and it pairs with CTDE implementations in libraries such as Tianshou, MARLlib, or EPyMARL so you rarely implement QMIX or MAPPO yourself:

# pip install "pettingzoo[mpe]"   (the mpe extra installs pygame, which the MPE tasks need)
from pettingzoo.mpe import simple_spread_v3   # cooperative N-agent coverage task

env = simple_spread_v3.parallel_env(N=3, local_ratio=0.0, max_cycles=25)  # 3 agents; local_ratio=0.0 makes the reward purely shared
observations, infos = env.reset(seed=0)

while env.agents:                              # all agents act each step (parallel API)
    actions = {a: env.action_space(a).sample() for a in env.agents}  # replace with policies
    observations, rewards, terms, truncs, infos = env.step(actions)
    # observations[a] is agent a's LOCAL view; rewards[a] is the shared team reward
env.close()

Code 39.7.1: The PettingZoo parallel API expresses a Dec-POMDP directly: each step returns a per-agent dictionary of local observations and the shared reward, the exact signature CTDE algorithms consume. Replacing the random actions with trained per-agent policies, and plugging the environment into a QMIX or MAPPO trainer from a MARL library, turns a roughly two-hundred-line algorithm into an import and a configuration block.

5. Non-Stationarity, Credit Assignment, and Emergent Communication (Advanced)

Three difficulties make multi-robot learning harder than single-agent learning, and CTDE is best understood as the answer to the first two. The first is non-stationarity. From any one robot's point of view, the environment includes the other robots, and those robots are themselves learning, so the transition and reward dynamics that robot $i$ experiences drift as its teammates change. A policy that was a best response yesterday is stale today. Independent learners, each running a single-agent algorithm and treating the others as part of the world, chase this moving target and frequently fail to coordinate. The centralized critic dissolves the problem during training: conditioned on the joint action, the world is stationary again, because the part that was moving, the other agents' choices, is now an input the critic observes.
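A few lines of arithmetic make the moving target visible. The sketch below uses the same reward matrix that appears later in Code 39.7.2; the two partner policies are hypothetical stand-ins for a teammate early in training (uniform exploration) and late in training (mostly greedy on action $0$).

```python
import numpy as np

# Shared reward: rows are robot 0's action, columns are robot 1's action.
R = np.array([[ 10.0, -10.0, -10.0],
              [-10.0,   2.0,   0.0],
              [-10.0,   0.0,   2.0]])

partner_early = np.array([1/3, 1/3, 1/3])     # teammate still exploring uniformly
partner_late = np.array([0.90, 0.05, 0.05])   # teammate has mostly settled on action 0

# What an independent learner estimates: expected reward of ITS action,
# averaged over whatever the partner currently does.
value_early = R @ partner_early
value_late = R @ partner_late
print("robot 0's action values, early partner:", value_early.round(2))
print("robot 0's action values, late partner :", value_late.round(2))

# What a centralized critic estimates: the value of the JOINT action,
# which does not depend on the partner's policy at all.
print("joint value of (0, 0):", R[0, 0])
```

Against the exploring partner, action $0$ is robot 0's worst choice (about $-3.33$); against the settled partner it is by far the best ($8.0$). The entry `R[0, 0]` is $10$ in both cases, which is the sense in which conditioning on the joint action restores stationarity.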

The second difficulty is credit assignment across agents. One shared reward must be apportioned among $n$ contributors, and a robot cannot tell from the team reward alone whether its own action helped or whether a teammate carried the episode. Value decomposition answers this structurally, by learning per-robot utilities that sum or mix to the team value, so that each $Q_i$ is updated only at the action robot $i$ actually took and, over many joint actions, comes to reflect that robot's contribution. Counterfactual methods such as COMA make the apportionment explicit by comparing the value of the joint action actually taken against what robot $i$'s alternative actions would have been worth with its teammates' actions held fixed. The demonstration in Section 6 shows credit assignment failing for independent learners and succeeding for a decomposed value, on a coordination game small enough to read.
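The counterfactual comparison can be written down directly for a two-robot game. The sketch below is a simplified illustration in the spirit of COMA, not the full algorithm: it assumes the joint action value is available as a table (here the reward matrix of Code 39.7.2 stands in for a learned critic), and the function name and uniform policies are hypothetical.

```python
import numpy as np

def counterfactual_advantage(q_joint, policy_i, joint_action, agent):
    """A_i = Q(a_i, a_-i) - sum_{a'} pi_i(a') * Q(a', a_-i), for two agents.

    The baseline marginalizes over robot i's own action while the
    teammate's action is held fixed at what it actually did.
    """
    a0, a1 = joint_action
    if agent == 0:
        q_slice, taken = q_joint[:, a1], a0    # vary robot 0, freeze robot 1
    else:
        q_slice, taken = q_joint[a0, :], a1    # vary robot 1, freeze robot 0
    baseline = float(policy_i @ q_slice)
    return float(q_slice[taken] - baseline)

# Stand-in for a learned joint critic: the shared reward table itself.
Q = np.array([[ 10.0, -10.0, -10.0],
              [-10.0,   2.0,   0.0],
              [-10.0,   0.0,   2.0]])
uniform = np.full(3, 1/3)

adv_good_0 = counterfactual_advantage(Q, uniform, (0, 0), agent=0)
adv_bad_0 = counterfactual_advantage(Q, uniform, (0, 1), agent=0)
adv_bad_1 = counterfactual_advantage(Q, uniform, (0, 1), agent=1)
print(round(adv_good_0, 3), round(adv_bad_0, 3), round(adv_bad_1, 3))
```

Each robot gets its own number from the same shared table: at the joint action $(0, 1)$ robot 0 is charged about $-7.33$ and robot 1 about $-6.67$, because each is compared against its own alternatives given what the other did.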

The third, and the one that turns a collection of policies into something resembling a society, is learned communication. When robots may exchange a few bits or a short message vector, MARL can learn not only what to do but what to say, discovering a protocol that no engineer specified. Differentiable inter-agent learning passes gradients through the communication channel so that the act of sending a message is trained end to end against the team objective; the emergent codes are often compact and task-specific. This connects directly to the emergent collective behavior of Chapter 31: there the coordination rules were designed, here they, and the signals that drive them, are discovered.

6. A Cooperative Game Where Decomposition Wins (Intermediate)

The contrast between independent learners and value decomposition is sharpest on a tiny cooperative game, where we can watch the credit-assignment mechanism work without any neural network in the way. Two robots each choose one of three actions; the team receives a single shared reward that depends on the joint action. The reward matrix has a high-payoff optimum at the joint action $(0,0)$ flanked by miscoordination penalties, plus a safe but mediocre corner. Independent Q-learners, each averaging the reward over whatever the partner happens to be doing, see action $0$ punished while the partner still explores and retreat to the safe corner. The value-decomposition learner trains one shared temporal-difference error through the additive factorization $Q_{\text{tot}} = Q_1 + Q_2$ of Section 3, and discounts the negative surprise of a single unlucky penalty, so the team climbs to the joint optimum. Code 39.7.2 implements both learners in pure NumPy.

import numpy as np

# Two-robot cooperative coordination game; ONE shared team reward per joint action.
# Optimum 10 at (0,0) is flanked by miscoordination penalties; (1,1)/(2,2) are safe.
R = np.array([[ 10.0, -10.0, -10.0],
              [-10.0,   2.0,   0.0],
              [-10.0,   0.0,   2.0]])
A, EPISODES, ALPHA, BETA = 3, 8000, 0.05, 0.005   # BETA: smaller learning rate for negative TD errors
rng = np.random.default_rng(0)

def eps(t):
    return max(0.1, 1.0 - t / 4000.0)             # exploration schedule

def run_independent():
    q0, q1 = np.zeros(A), np.zeros(A)             # each agent: its OWN local utility
    for t in range(EPISODES):
        e = eps(t)
        a0 = rng.integers(A) if rng.random() < e else int(np.argmax(q0))
        a1 = rng.integers(A) if rng.random() < e else int(np.argmax(q1))
        r = R[a0, a1]
        q0[a0] += ALPHA * (r - q0[a0])            # each treats the other as the world
        q1[a1] += ALPHA * (r - q1[a1])
    a0, a1 = int(np.argmax(q0)), int(np.argmax(q1))
    return a0, a1, R[a0, a1]

def run_vdn():
    q0, q1 = np.zeros(A), np.zeros(A)
    for t in range(EPISODES):
        e = eps(t)
        a0 = rng.integers(A) if rng.random() < e else int(np.argmax(q0))
        a1 = rng.integers(A) if rng.random() < e else int(np.argmax(q1))
        r = R[a0, a1]
        td = r - (q0[a0] + q1[a1])                # shared TD error on Q_tot = Q1 + Q2
        lr = ALPHA if td > 0 else BETA            # optimism: smaller step on negative errors
        q0[a0] += lr * td                         # same error credited to both utilities,
        q1[a1] += lr * td                         # so argmax-per-agent = argmax-of-sum
    a0, a1 = int(np.argmax(q0)), int(np.argmax(q1))
    return a0, a1, R[a0, a1]

print("optimal joint reward :", R.max(), "at joint action (0, 0)\n")
ni, vd = [], []
for seed in range(20):
    rng = np.random.default_rng(seed); ni.append(run_independent()[2])
    rng = np.random.default_rng(seed); vd.append(run_vdn()[2])
rng = np.random.default_rng(0); a0, a1, r = run_independent()
print(f"independent learners : joint action ({a0}, {a1}) -> reward {r:.1f}")
print(f"   mean over 20 seeds : {np.mean(ni):.2f}"
      f"   optimum hit {np.mean(np.array(ni) == R.max()) * 100:.0f}% of seeds\n")
rng = np.random.default_rng(0); a0, a1, r = run_vdn()
print(f"value decomposition  : joint action ({a0}, {a1}) -> reward {r:.1f}")
print(f"   mean over 20 seeds : {np.mean(vd):.2f}"
      f"   optimum hit {np.mean(np.array(vd) == R.max()) * 100:.0f}% of seeds")

Code 39.7.2: Independent Q-learners versus an additive value-decomposition learner on a cooperative coordination game. The learners differ in two ways: the decomposed learner trains a single shared temporal-difference error through $Q_{\text{tot}} = Q_1 + Q_2$, and it applies the smaller learning rate BETA to negative errors, an optimistic, hysteretic-style addition rather than part of VDN itself. The independent learners update two separate utilities against the raw reward with a single learning rate.

optimal joint reward : 10.0 at joint action (0, 0)

independent learners : joint action (0, 0) -> reward 10.0
   mean over 20 seeds : 4.40   optimum hit 30% of seeds

value decomposition  : joint action (0, 0) -> reward 10.0
   mean over 20 seeds : 10.00   optimum hit 100% of seeds

Output 39.7.2: Independent learners reach the cooperative optimum on only 30% of seeds (mean team reward 4.40), settling for the safe corner the rest of the time. The value-decomposition learner reaches the optimum on every seed (mean 10.00). The single displayed independent run happens to succeed at seed 0; the twenty-seed average is the honest summary, and it shows the decomposed value solving a credit-assignment problem that defeats the independent baseline.

The lesson scales from this three-by-three matrix to a real swarm. Shared rewards plus learners that ignore each other produce miscoordination; a CTDE structure that assigns credit through a joint value, whether the additive sum here, the QMIX mixer of Section 3, or the centralized critic of Section 4, converts the same shared signal into coordinated behavior. What this toy omits, and what the next subsection supplies, is the scale: real swarm policies need millions of episodes, and those episodes must be generated in parallel.

7. Training Distributed Across Many Simulated Swarms (Intermediate)

Reinforcement learning is sample-hungry, and multi-agent reinforcement learning is more so, because the joint behavior space the team must explore grows with the number of robots. A learned swarm policy is trained not on one simulated swarm but on hundreds or thousands of them stepping in parallel, and this is where the distributed RL infrastructure of Chapter 20 becomes the engine of the case study. The actor-learner architecture introduced there carries over directly: many actor workers, each running one or more simulated swarms, generate trajectories and stream them to one or a few learners that update the shared policy and critic, then broadcast the new weights back. The synchronous-versus-asynchronous trade-off of Chapter 20, sync for reproducible gradients, async for throughput when actors straggle, is the same trade-off here, now applied to swarm rollouts.
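The synchronous variant of this loop can be sketched without any cluster. In the example below, which is an illustration under assumptions and not the infrastructure of Chapter 20, four hypothetical actor workers each roll out a batch of random joint actions on the coordination game of Code 39.7.2 and compute the gradient of the loss $\tfrac{1}{2}\,\text{mean}(\delta^2)$, with $\delta = r - (Q_1 + Q_2)$, on their own shard; the learner averages the four gradients and applies one update.

```python
import numpy as np

R = np.array([[ 10.0, -10.0, -10.0],
              [-10.0,   2.0,   0.0],
              [-10.0,   0.0,   2.0]])

def rollout(rng, batch):
    """One actor worker: step `batch` independent copies of the game with random actions."""
    a0 = rng.integers(3, size=batch)
    a1 = rng.integers(3, size=batch)
    return a0, a1, R[a0, a1]

def vdn_td_gradient(q0, q1, a0, a1, r):
    """Gradient of 0.5 * mean(td^2) with td = r - (q0[a0] + q1[a1])."""
    td = r - (q0[a0] + q1[a1])
    g0, g1 = np.zeros_like(q0), np.zeros_like(q1)
    np.add.at(g0, a0, -td)                 # accumulate per-action contributions
    np.add.at(g1, a1, -td)
    return g0 / len(r), g1 / len(r)

q0, q1 = np.zeros(3), np.zeros(3)          # shared parameters held by the learner

# Actors: four workers, equal batch sizes, independent random streams.
shards = [rollout(np.random.default_rng(seed), batch=64) for seed in range(4)]
worker_grads = [vdn_td_gradient(q0, q1, *shard) for shard in shards]

# Learner: synchronous aggregation is a plain average of the workers' gradients.
g0_avg = np.mean([g[0] for g in worker_grads], axis=0)
g1_avg = np.mean([g[1] for g in worker_grads], axis=0)

# Reference: the gradient one machine would compute on the concatenated batch.
a0_all, a1_all, r_all = (np.concatenate(parts) for parts in zip(*shards))
g0_ref, g1_ref = vdn_td_gradient(q0, q1, a0_all, a1_all, r_all)

learning_rate = 0.05
q0_new, q1_new = q0 - learning_rate * g0_avg, q1 - learning_rate * g1_avg
print(np.allclose(g0_avg, g0_ref), np.allclose(g1_avg, g1_ref))
```

With equal shard sizes the averaged gradient matches the single-machine gradient on the pooled data, which is why synchronous training is reproducible. An asynchronous learner would instead apply each worker's gradient as it arrives, possibly computed from stale weights.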

Two scale-out facts make MARL training tractable. First, modern GPU-resident simulators step thousands of independent environments at once on a single accelerator, so the rollout fan-out of Chapter 20 happens partly inside one device and partly across a cluster of them. Second, parameter sharing, giving all robots in a swarm one policy network with the robot's index or role as an input, collapses $n$ policies into one set of weights. That shrinks the learner's job, and it can let a swarm trained at one size transfer to another, provided the input encoding does not hard-wire the swarm size. The training computation is therefore distributed along two of the six axes of Chapter 1 at once, the training axis (parallel rollouts and gradient aggregation) and, when the policy is large, the model axis, while the learned artifact executes at the edge, decentralized, on each robot.
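Parameter sharing is simple to express in code. The sketch below is a minimal illustration with hypothetical names and an untrained linear-softmax actor: one weight matrix serves every robot, and each robot's input is its local observation concatenated with a one-hot index so that robots can still specialize.

```python
import numpy as np

rng = np.random.default_rng(0)
n_robots, obs_dim, n_actions = 5, 6, 3

# ONE set of weights for the whole swarm (hypothetical linear-softmax actor).
theta = rng.normal(scale=0.1, size=(obs_dim + n_robots, n_actions))

def shared_policy(theta, local_obs):
    """local_obs: (n_robots, obs_dim). Returns (n_robots, n_actions) action probabilities."""
    ids = np.eye(local_obs.shape[0])                  # one-hot robot index per row
    x = np.concatenate([local_obs, ids], axis=1)      # input = [o_i, index(i)]
    logits = x @ theta
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

local_obs = rng.normal(size=(n_robots, obs_dim))      # each row is one robot's own view
probs = shared_policy(theta, local_obs)
actions = np.array([rng.choice(n_actions, p=row) for row in probs])
print(probs.shape, theta.size, actions)
```

The parameter count is `theta.size` regardless of how many robots act, and each row of `probs` depends only on that robot's own observation and index. A fixed-length one-hot index does tie the weights to a maximum swarm size; dropping the index, or replacing it with a role feature, is one way to keep the policy size-agnostic.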

Thesis Thread: Train Distributed, Execute Decentralized

This section is the chapter's clearest instance of the book's spine. Training is a scale-out problem: thousands of simulated swarms run in parallel across many machines on the actor-learner infrastructure of Chapter 20, with the same synchronous-versus-asynchronous and gradient-aggregation choices that govern every distributed training method in the book. Execution is a decentralized problem: the single artifact that training produces, a policy conditioned on local observations, runs independently on every robot with no central coordinator. The centralized critic is the bridge, a training-time use of global information that buys a deployable, communication-light policy. Distribute where information and compute are cheap; decentralize where they are scarce.

Practical Example: The Warehouse Robot Fleet That Stopped Colliding

Who: A robotics team operating a fleet of autonomous mobile robots in a fulfillment warehouse.

Situation: Hand-tuned path-planning rules handled open aisles well but jammed at intersections, where robots deadlocked waiting for each other under a hand-coded priority scheme.

Problem: Each robot sees only a few meters of neighbors through onboard sensors; no robot has a global map of the fleet, and adding a central traffic controller created a single point of failure the operations team had rejected.

Dilemma: Keep extending the rule set, which grew more brittle with every new intersection layout, or learn an intersection-negotiation policy, which needed coordinated training the team had never run.

Decision: They trained a single shared policy with MAPPO under CTDE: a centralized critic saw the full warehouse state in simulation, while each robot's actor conditioned only on its local sensor view and a few neighbor messages.

How: They ran two thousand simulated warehouses in parallel on a GPU cluster using the actor-learner setup of Chapter 20, with parameter sharing so one network drove every robot, then deployed the actor to the real fleet unchanged.

Result: Intersection deadlocks fell sharply and throughput rose, with no central controller in the loop; each robot ran the same lightweight policy on its onboard computer.

Lesson: The global view that made coordination learnable lived only in the simulator's critic. The deployed system stayed decentralized, which is what the hardware and the reliability requirement demanded.

Research Frontier: Massively Parallel Swarm Learning (2024 to 2026)

The frontier of learned swarms is being pushed by simulation throughput. GPU-resident multi-agent simulators in the lineage of Isaac Gym and its successors step tens of thousands of robot environments simultaneously on a single accelerator, compressing what once took a cluster-week into hours and making large drone-swarm policies trainable end to end. On the algorithm side, MAPPO and its descendants remain strong, surprisingly simple baselines across cooperative benchmarks, while work on scalable and mean-field MARL targets swarms of hundreds to thousands of agents where pairwise reasoning is infeasible. A parallel thread asks what swarm policies should condition on: graph-neural-network and attention-based actors that aggregate a variable set of neighbors let one trained policy transfer across swarm sizes, and learned communication is being revisited as a way to coordinate under tight bandwidth. The open problem that this chapter's next section confronts is the reality gap: a policy trained across thousands of simulated swarms must still fly on hardware, which is the sim-to-real transfer of Section 39.8.

8. From Learned Policy to Flying Hardware (Beginner)

We now have a learned swarm policy: a single network, conditioned only on a robot's local observation, trained under a centralized critic across many parallel simulated swarms, and shown on a small game to solve the credit-assignment failure that defeats independent learners. Two threads remain open. The policy was trained in simulation, and a simulator is never the world; closing that gap is the sim-to-real problem of Section 39.8, which takes the artifact this section produced and asks what it takes to make it survive contact with real sensors, real dynamics, and real wind. And the coordination the policy learned is a learned cousin of the designed swarm rules of Chapter 31; the next section's transfer techniques are what let the learned version leave the simulator and join the designed version in the air.

Exercise 39.7.1: Why the Critic May See What the Actor May Not (Conceptual)

A teammate argues that CTDE is cheating: if the centralized critic can see the global state, the learned policies must secretly depend on it too, and so the swarm will fail the moment that state is unavailable at deployment. Explain precisely why this objection is wrong. In your answer, identify which quantity in the policy-gradient update of Section 4 conditions on global information and which conditions only on local information, and state what each robot actually loads and feeds to its network at execution. Then describe one concrete failure that would occur if an engineer mistakenly let the actor read a critic-only feature during training.

Exercise 39.7.2: Make Independent Learning Fail Harder (Coding)

Starting from Code 39.7.2, deepen the miscoordination penalties (for example, set the off-diagonal entries in the first row and column to $-30$) and rerun. Report how the independent learners' optimum-hit rate over the twenty seeds changes and explain the mechanism in terms of the moving-target dynamics of Section 5. Then test the robustness of the value-decomposition learner by varying the penalty-discount rate BETA from $0.0$ up to ALPHA; find the largest BETA at which the decomposed learner still reaches the optimum on every seed, and explain what role the optimism is playing.

Exercise 39.7.3: Size the Rollout Fleet for a Swarm Policy (Analysis)

Suppose training a drone-swarm policy needs $2 \times 10^{9}$ environment steps to converge, and a single GPU-resident simulator steps $4{,}096$ parallel swarm environments at $300$ steps per second per environment. Estimate the wall-clock time to gather the required samples on one accelerator, then on a cluster of $16$ such accelerators under the actor-learner architecture of Chapter 20, assuming rollout throughput scales linearly with accelerators. State one reason the learner, not the rollout, may become the bottleneck as you add accelerators, and connect it to the synchronous-versus-asynchronous trade-off of Chapter 20.