Part VI: Distributed AI and Multi-Agent Systems
Chapter 30: Multi-Agent Reinforcement Learning

Policy Gradient Methods in MARL

"I learned to act on what I alone could see, but I was graded by a critic who watched everyone. That is the only reason my gradient ever pointed anywhere useful."

An Actor Who Trusts the Centralized Critic
Big Picture

The dominant way to train many agents that act on partial, local observations is to give each one a decentralized actor and train all of them with a centralized critic that sees the joint state and joint action. Value-decomposition methods (Section 30.6) factor a single team value and shine in discrete, cooperative tasks; the policy-gradient branch covered here learns explicit policies and reaches the continuous-action and mixed cooperative-competitive settings that standard value factorization does not. The recurring trick is the same throughout the actor-critic family: an actor that must run on its own observation at execution time is trained against a critic that, only during training, is allowed to peek at everything. That asymmetry is what turns a moving, non-stationary multi-agent target into a stationary one the policy gradient can climb, and this section demonstrates the effect with a from-scratch run before naming the three landmark algorithms (MADDPG, COMA, MAPPO) that productionize it.

In the previous section we factored one team value across agents and recovered per-agent greedy actions from it. That route is powerful but narrow: it assumes a cooperative team with a single shared reward and, in its standard form, discrete actions chosen by an arg-max. A great deal of multi-agent reinforcement learning lives outside those assumptions. Robots and vehicles act in continuous spaces, where an arg-max over actions cannot be found by enumeration. Markets, games, and negotiation tasks (the Markov games of Section 30.2, grounded in the equilibria of Chapter 28) mix cooperation with competition, so there is no single value to decompose. For all of these we learn the policy directly, and the policy-gradient methods of this section are how MARL does it at scale.

The challenge is that a naive multi-agent policy gradient is unstable. Each agent improves its own policy while every other agent is simultaneously changing theirs, so from any one agent's vantage point the environment is non-stationary: the same action yields different returns from one update to the next, not because the world changed but because the other agents did. The policy-gradient estimate then carries the variance of all the other agents' randomness, and the update wanders. The centralized critic is the cure, and it is worth seeing exactly why before we name the algorithms that use it.

1. The Multi-Agent Policy Gradient (Intermediate)

Write the policy of agent $i$ as $\pi_{\theta_i}(a_i \mid o_i)$, a distribution over that agent's action $a_i$ conditioned only on its own local observation $o_i$. This is the decentralized actor: at execution time it needs nothing but what agent $i$ can see. Let $\mathbf{a} = (a_1, \dots, a_n)$ be the joint action and $\mathbf{x}$ the joint state. The team's objective is the expected return $J(\theta) = \mathbb{E}\!\left[\sum_t \gamma^t r_t\right]$, and the policy gradient for agent $i$ takes the actor-critic form

$$\nabla_{\theta_i} J = \mathbb{E}_{\mathbf{x}, \mathbf{a}}\Big[\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\; A_i(\mathbf{x}, \mathbf{a}) \,\Big], \qquad A_i(\mathbf{x}, \mathbf{a}) = Q_i(\mathbf{x}, \mathbf{a}) - b_i(\mathbf{x}).$$

The score function $\nabla_{\theta_i} \log \pi_{\theta_i}$ depends only on agent $i$'s own policy, exactly as in single-agent REINFORCE. Everything multi-agent is concentrated in the advantage $A_i$, and specifically in its critic $Q_i(\mathbf{x}, \mathbf{a})$: a centralized critic conditions on the joint state and the joint action of all agents. This is the move that makes the whole method work. Because $Q_i$ sees what every other agent did, the value it assigns to a transition no longer fluctuates when the other agents shift their policies; the critic absorbs that dependence instead of leaving it in the gradient. An independent critic that conditions only on $o_i$ and $a_i$ cannot do this: it must average over everyone else's behavior, so its estimate is a high-variance, drifting target. The difference between those two choices is the entire content of this section, and the next subsection measures it.
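The role of the baseline in the formula above can be checked numerically. The sketch below is illustrative only: the one-dimensional Gaussian policy, the hypothetical `q_value` function, and the single teammate are assumptions chosen for brevity, not part of the chapter's setup. It compares three choices of $b_i$, none of which depends on agent $i$'s own sampled action, and shows that they agree on the gradient's mean while differing in its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n_samples = 0.2, 0.5, 400_000   # hypothetical Gaussian policy for agent i

def q_value(a_i, a_other):
    # Hypothetical return: agent i wants to hit 1.0 and to agree with a teammate.
    return -(a_i - 1.0) ** 2 - (a_i - a_other) ** 2

a_i = mu + sigma * rng.standard_normal(n_samples)   # agent i's sampled actions
a_other = rng.standard_normal(n_samples)            # teammate's actions, outside i's control
score = (a_i - mu) / sigma ** 2                     # d/d mu of log N(a_i | mu, sigma^2)
q = q_value(a_i, a_other)

# Per-sample gradient terms score * (Q - b) for three baselines b.
estimates = {
    "no baseline": score * q,
    "constant baseline": score * (q - q.mean()),
    # Sees the teammate's action, but not agent i's own sampled action.
    "joint-aware baseline": score * (q - q_value(mu, a_other)),
}
for name, g in estimates.items():
    print(f"{name:22s} mean={g.mean():.3f}  var={g.var():.2f}")
```

For this toy return the exact gradient is $-2(\mu - 1) - 2\mu = 1.2$, and all three sample means should land near it. Only the spread changes, and the baseline that is allowed to see the teammate's action gives the tightest estimate of the three.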

[Figure 30.7.1 diagram: decentralized actors 1 to n, each π(aᵢ | oᵢ) seeing only its own oᵢ, with optional parameter sharing (one network θ serving homogeneous agents 1..n); the joint state x and all actions aᵢ flow up to a training-only centralized critic Q(x, a₁, a₂, ..., aₙ), whose advantage Aᵢ trains each actor's policy gradient.]
Figure 30.7.1: Centralized training with decentralized execution for the policy-gradient family. Each actor (green) reads only its own observation $o_i$ and is the only part that runs at execution time. During training, the joint state and every agent's action flow up to a single centralized critic (orange), which returns an advantage $A_i$ that drives each actor's gradient. The dashed green box marks optional parameter sharing, where one actor network serves all homogeneous agents. The orange dashed arrows (critic to actors) exist only during training and are removed at deployment.

Figure 30.7.1 is the template for every method in this section. The actors are the only thing that survives to execution, which is why this is the actor-critic instance of centralized training with decentralized execution (CTDE) introduced in Section 30.5. The critic is training scaffolding: it can be as large and as global as the training cluster allows, because it is discarded before deployment.
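The split between what runs at execution time and what exists only during training can be made concrete with a small skeleton. This is a minimal sketch under stated assumptions: the class names `LocalActor` and `CentralCritic`, the linear parameterizations, and the choice to use the concatenated observations as the joint state are all hypothetical and chosen only to show the interfaces.

```python
import numpy as np

class LocalActor:
    """Decentralized actor: maps one agent's local observation to its action."""
    def __init__(self, obs_dim, act_dim, rng):
        self.W = 0.1 * rng.standard_normal((act_dim, obs_dim))

    def act(self, obs):
        return np.tanh(self.W @ obs)            # bounded continuous action

class CentralCritic:
    """Training-only critic: scores the joint state together with every action."""
    def __init__(self, state_dim, n_agents, act_dim, rng):
        self.w = 0.1 * rng.standard_normal(state_dim + n_agents * act_dim)

    def value(self, joint_state, joint_action):
        features = np.concatenate([joint_state, joint_action.ravel()])
        return float(self.w @ features)

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 3, 4, 2
actors = [LocalActor(obs_dim, act_dim, rng) for _ in range(n_agents)]
critic = CentralCritic(n_agents * obs_dim, n_agents, act_dim, rng)

observations = rng.standard_normal((n_agents, obs_dim))
# Each actor sees only its own row of observations.
joint_action = np.stack([a.act(o) for a, o in zip(actors, observations)])
# Assumption: the joint state is the concatenation of all local observations.
q = critic.value(observations.ravel(), joint_action)    # used during training only

deployed = actors    # the critic is discarded; execution needs local observations only
```

Note that nothing in `LocalActor.act` touches another agent's data, so the `deployed` list can run on separate machines with no communication.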

2. Why a Centralized Critic Stabilizes the Gradient (Intermediate)

The claim is that conditioning the critic on the joint action lowers the variance of the policy-gradient estimate, and that lower variance is what turns unstable independent learning into stable training. We can demonstrate this from scratch without any deep-learning framework. The setup below is a one-step cooperative task we will call rendezvous: $n$ agents each emit a scalar action, and the team is rewarded both for matching a common target and for agreeing with one another, so each agent's best action genuinely depends on the others. We train Gaussian policies with a REINFORCE-style gradient and change one thing only: whether the baseline subtracted from the return is a joint-action value (the centralized critic) or a per-agent running average that cannot see the joint action (the independent critic). We then measure the variance of the policy-gradient estimate itself, which is precisely the quantity the critic is supposed to reduce.

import numpy as np

n_agents, target, coupling = 4, 0.7, 1.0      # cooperative "rendezvous" task

def team_reward(actions):                      # one shared scalar for the team
    miss  = np.sum((actions - target) ** 2)                       # hit the target
    inter = coupling * np.sum((actions[:, None] - actions[None, :]) ** 2) / 2
    return -(miss + inter)                     # also reward agreeing with peers

sigma, lr, batch, steps, n_seeds = 0.25, 0.02, 32, 400, 30

def run(use_central_critic):
    finals, gnoise = np.zeros(n_seeds), np.zeros(n_seeds)
    for seed in range(n_seeds):
        r  = np.random.default_rng(1000 + seed)
        mu = r.standard_normal(n_agents) * 0.5            # policy means, start off-target
        b_indep, gv_sum = np.zeros(n_agents), 0.0
        for t in range(steps):
            A = mu[None, :] + sigma * r.standard_normal((batch, n_agents))   # rollouts
            Rteam = np.array([team_reward(a) for a in A])
            if use_central_critic:
                # Centralized stand-in: the baseline is the mean return of the current
                # batch of JOINT rollouts, so it tracks the team's current joint policy.
                adv = Rteam - Rteam.mean()
                advantages = np.repeat(adv[:, None], n_agents, axis=1)
            else:
                # Per-agent stand-in: each baseline is a slow running average of past
                # returns, so it lags whenever the other agents' policies shift.
                advantages = Rteam[:, None] - b_indep[None, :]
                b_indep += 0.02 * (Rteam.mean() - b_indep)
            g = advantages * (A - mu[None, :]) / (sigma ** 2)   # Gaussian-mean PG terms
            gv_sum += g.var(axis=0).mean()                      # gradient-noise scale
            mu += lr * g.mean(axis=0)
        finals[seed], gnoise[seed] = team_reward(mu), gv_sum / steps
    return finals, gnoise

cc_f, cc_g = run(True)
ic_f, ic_g = run(False)
print("cooperative rendezvous, n_agents =", n_agents, ", seeds =", n_seeds)   # header line of Output 30.7.1
print()
print("                          central critic   independent critic")
print("policy-gradient variance : %14.4f %18.4f" % (cc_g.mean(), ic_g.mean()))
print("final team reward mean   : %14.4f %18.4f" % (cc_f.mean(), ic_f.mean()))
print("final team reward std    : %14.4f %18.4f" % (cc_f.std(),  ic_f.std()))
print()
print("gradient-variance ratio (independent / central) : %.2f x" % (ic_g.mean()/cc_g.mean()))
print("outcome-variance  ratio (independent / central) : %.2f x" % ((ic_f.std()**2)/(cc_f.std()**2)))
Code 30.7.1: A multi-agent actor-critic in pure NumPy. The two branches share every line except the baseline: the centralized critic subtracts a joint-action value, the independent critic subtracts a per-agent running average. The gradient-noise scale gv_sum accumulates the variance of the per-sample policy-gradient estimate, the quantity the critic exists to shrink.
cooperative rendezvous, n_agents = 4 , seeds = 30

                          central critic   independent critic
policy-gradient variance :        18.8079            29.3406
final team reward mean   :        -0.0127            -0.0137
final team reward std    :         0.0174             0.0196

gradient-variance ratio (independent / central) : 1.56 x
outcome-variance  ratio (independent / central) : 1.26 x
Output 30.7.1: The centralized critic cuts the policy-gradient variance by a factor of $1.56$ and the run-to-run outcome variance by $1.26$, while reaching a slightly better mean team reward. The same actors, the same task, the same number of updates: the only change is whether the critic was allowed to see the joint action.

Output 30.7.1 makes the mechanism concrete. With the joint action visible, the baseline tracks the actual return closely and the advantage is dominated by the part of the return that agent $i$'s own action controls; the variance contributed by the other agents has been subtracted away. With only a local view, the baseline lags behind the other agents' shifting behavior, so that variance stays in the gradient and the policy update is noisier and converges to a slightly worse, more scattered solution. The effect is modest at four agents and is expected to grow with the number of agents and the strength of their coupling (Exercise 30.7.2 asks you to measure this), which is exactly the regime where independent learning is known to fail. We connect this directly to the non-stationarity problem in Section 30.9; for now the takeaway is that the centralized critic is a variance-reduction device, and variance reduction is what makes the multi-agent policy gradient trainable.
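A short calculation suggests why a lagging baseline costs variance in the scalar Gaussian setting of Code 30.7.1. It is a sketch under simplifying assumptions that are not in the original text: we look at one agent's mean $\mu$, write its action as $a = \mu + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$, let $b^\star$ be the baseline a well-informed critic would supply, and treat the lag as a constant offset $\delta$. The per-sample gradient term is $g(b) = (R - b)\,\varepsilon / \sigma$, so

$$g(b^\star + \delta) = g(b^\star) - \frac{\delta\,\varepsilon}{\sigma}, \qquad \operatorname{Var}\big[g(b^\star + \delta)\big] = \operatorname{Var}\big[g(b^\star)\big] - \frac{2\delta}{\sigma^2}\,\mathbb{E}\big[(R - b^\star)\,\varepsilon^2\big] + \frac{\delta^2}{\sigma^2}.$$

The last term grows quadratically in the lag. With $\sigma = 0.25$ as in Code 30.7.1 it equals $16\,\delta^2$, so once the offset is large enough for the quadratic term to dominate the linear one, a baseline that trails the true mean return inflates the gradient variance quickly.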

Key Insight: The Critic Sees Everything So the Actor Does Not Have To

A decentralized actor must act on a local observation, which means it cannot, by itself, tell whether a bad return was its own fault or a teammate's. Conditioning the critic on the joint state and joint action lets it answer that question on the actor's behalf: the advantage it returns has the other agents' contribution already accounted for, so the policy gradient points at what this agent should change rather than at the noise everyone else injected. You pay for a global critic only at training time, when you control the whole cluster and can afford it; you keep the cheap, local actor at execution time, when you cannot.

3. MADDPG, COMA, and MAPPO (Intermediate)

Three algorithms turn the template of Figure 30.7.1 into trainable systems, and they differ mainly in how the critic is built and how the policy is updated. Table 30.7.1 places them side by side; the prose then draws out what each contributes.

Table 30.7.1: The three landmark centralized-critic policy-gradient methods. All share decentralized actors and a centralized critic; they differ in the actor update, the critic's form, and the settings they target.
| Method | Actor update | Centralized critic | Best-fit setting |
| --- | --- | --- | --- |
| MADDPG | Deterministic policy gradient (off-policy) | Per-agent $Q_i(\mathbf{x}, a_1, \dots, a_n)$ | Continuous actions; cooperative, competitive, or mixed |
| COMA | Stochastic policy gradient (on-policy) | Shared counterfactual $Q(\mathbf{x}, \mathbf{a})$ with a per-agent baseline | Cooperative, discrete; credit assignment |
| MAPPO | PPO clipped objective (on-policy) | Shared (or central) state value $V(\mathbf{x})$ | Cooperative; a strong, simple baseline |

MADDPG (multi-agent deep deterministic policy gradient) gives each agent a deterministic actor $\mu_{\theta_i}(o_i)$ and its own centralized critic $Q_i(\mathbf{x}, a_1, \dots, a_n)$ conditioned on every agent's action. Because each agent carries its own critic, MADDPG does not assume a shared reward: agent $i$'s critic can be trained on agent $i$'s own reward, which is what lets the method handle competitive and mixed games as naturally as cooperative ones, the first algorithm in this family to do so. The deterministic policy gradient is off-policy, so MADDPG can reuse a replay buffer and is correspondingly sample-efficient, at the cost of the exploration care that deterministic policies always demand (typically noise added to the actor's output during data collection).
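The actor update at the heart of MADDPG is the chain rule through the centralized critic: the gradient for actor $i$ is the batch mean of $\partial Q_i / \partial a_i$ times $\partial \mu_{\theta_i} / \partial \theta_i$. The sketch below illustrates only that step and makes several assumptions that are not the published algorithm: linear actors, a hand-written quadratic `critic_q` standing in for a learned critic (it ignores $\mathbf{x}$ and is shared by all agents), every agent's action recomputed from its current actor instead of replayed, and no target networks or exploration noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, batch, target = 3, 4, 64, 0.7

W = 0.1 * rng.standard_normal((n_agents, obs_dim))      # actor i: a_i = W[i] @ o_i
obs = rng.standard_normal((batch, n_agents, obs_dim))   # replayed local observations

def joint_actions(W, obs):
    # Every agent acts from its own observation only -> shape (batch, n_agents).
    return np.einsum("id,bid->bi", W, obs)

def critic_q(actions):
    # Hypothetical stand-in for a learned centralized critic Q_i(x, a_1..a_n).
    miss = np.sum((actions - target) ** 2, axis=-1)
    spread = np.sum((actions - actions.mean(axis=-1, keepdims=True)) ** 2, axis=-1)
    return -(miss + spread)

def critic_grad_wrt_action(actions, i):
    # Analytic dQ/da_i of the stand-in critic, evaluated at the joint action.
    return -2.0 * (actions[:, i] - target) - 2.0 * (actions[:, i] - actions.mean(axis=-1))

def actor_gradient(W, obs, i):
    # Deterministic policy gradient: batch mean of dQ/da_i * da_i/dW_i,
    # where da_i/dW_i = o_i for this linear actor.
    a = joint_actions(W, obs)
    return (critic_grad_wrt_action(a, i)[:, None] * obs[:, i, :]).mean(axis=0)

q_before = critic_q(joint_actions(W, obs)).mean()
for _ in range(200):
    grads = np.stack([actor_gradient(W, obs, i) for i in range(n_agents)])
    W = W + 0.05 * grads                     # simultaneous ascent step for all actors
q_after = critic_q(joint_actions(W, obs)).mean()
print(f"mean critic value: {q_before:.3f} -> {q_after:.3f}")
```

Each actor's gradient uses only its own observation on the actor side, yet the critic side sees the whole joint action, which is how the coupling between agents reaches the update.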

COMA (counterfactual multi-agent policy gradients) keeps a single shared critic but subtracts a counterfactual baseline that marginalizes out agent $i$'s own action while holding the other agents' actions fixed. It asks how much better the team did with the action agent $i$ actually took than it would have done, on average, had agent $i$ resampled its action from its current policy. That baseline is a credit-assignment device, the cleanest answer in this family to the question of which agent earned the shared reward, and we give it its own full treatment in the next section. We mention it here so the family is complete; the counterfactual idea is the natural bridge from policy gradients to Section 30.8.
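For a discrete action set the counterfactual advantage is a few lines of arithmetic. The sketch below assumes the critic has already been queried for every alternative action of agent $i$ with the teammates' actions held fixed; the function name `coma_advantage` and the numbers are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def coma_advantage(q_row, pi_i, taken):
    """Counterfactual advantage for agent i.

    q_row[k] is the critic's value Q(x, (a_-i, a_i = k)) with the other agents'
    actions held at what they actually did; pi_i is agent i's current policy.
    """
    baseline = np.dot(pi_i, q_row)       # marginalize agent i's own action
    return q_row[taken] - baseline

q_row = np.array([1.0, 3.0, 2.0])        # hypothetical critic outputs for 3 actions
pi_i = np.array([0.2, 0.5, 0.3])         # hypothetical policy probabilities
adv = coma_advantage(q_row, pi_i, taken=1)

# Averaged over agent i's own policy, the advantage is zero by construction.
expected_adv = sum(pi_i[k] * coma_advantage(q_row, pi_i, k) for k in range(3))
```

Here the baseline is $0.2 \cdot 1 + 0.5 \cdot 3 + 0.3 \cdot 2 = 2.3$, so taking action 1 earns an advantage of $0.7$. Because the baseline does not depend on the action agent $i$ actually took, subtracting it leaves the expected gradient unchanged.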

MAPPO (multi-agent proximal policy optimization) is the simplest of the three and, in practice, frequently the strongest. It runs ordinary PPO on each agent's actor, with its clipped surrogate objective and on-policy updates, and supplies the advantage from a centralized value function $V(\mathbf{x})$ that conditions on the joint state. Often a single critic network is shared across all agents. The 2022 study by Yu and colleagues showed that this unglamorous recipe, tuned carefully, matches or beats the value-decomposition methods of Section 30.6 on the standard cooperative benchmarks, which reset the field's sense of how strong a well-implemented baseline can be. When you start a new cooperative MARL project, MAPPO is the method to beat before reaching for anything more elaborate.
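The two ingredients named above, a centralized value function supplying the advantage and PPO's clipped surrogate applied per agent, can be sketched in NumPy. This is a simplified illustration under assumptions: the one-step TD advantage stands in for the generalized advantage estimation that PPO implementations commonly use, the helper names are hypothetical, and the value and probability arrays are made-up numbers.

```python
import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    # One-step advantage from a centralized value function:
    # r_t + gamma * V(x_{t+1}) - V(x_t). `values` has one more entry than
    # `rewards` (the value of the state after the last step).
    return rewards + gamma * values[1:] - values[:-1]

def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    # PPO objective: mean of min(ratio * A, clip(ratio, 1 - clip, 1 + clip) * A).
    ratio = np.exp(logp_new - logp_old)
    clipped_ratio = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return np.minimum(ratio * advantage, clipped_ratio * advantage).mean()

rewards = np.array([1.0, 0.0, 2.0])                   # shared team reward, T = 3
central_values = np.array([0.5, 0.4, 1.0, 0.0])       # hypothetical V(x_0..x_3)
adv = td_advantage(rewards, central_values)           # one advantage per time step

# Per-agent action log-probabilities, shape (T, n_agents), under old and new policies.
logp_old = np.log(np.full((3, 2), 0.25))
logp_new = np.log(np.array([[0.25, 0.50], [0.30, 0.20], [0.25, 0.25]]))

# Every agent reuses the same centralized advantage with its own probability ratio.
per_agent_loss = [-clipped_surrogate(logp_new[:, i], logp_old[:, i], adv)
                  for i in range(2)]
```

The second agent doubled the probability of its first action, but the clip caps that step's contribution at $1.2$ times the advantage, which is the mechanism that keeps each on-policy update close to the data-collecting policy.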

Fun Note: The Baseline That Embarrassed the Leaderboard

For several years the cooperative-MARL leaderboards were a contest among ever more intricate value-decomposition architectures. Then a careful re-implementation of plain PPO with a shared centralized value function, given the same hyperparameter-tuning budget everyone else had quietly been using, walked onto the same benchmarks and matched or beat most of them. The lesson the field took from MAPPO was less about a new algorithm and more about a recurring hazard: a strong, simple baseline that nobody bothered to tune properly can hide for years behind complexity that was never actually necessary.

4. Parameter Sharing Across Homogeneous Agents (Beginner)

When agents are interchangeable (a swarm of identical drones, a team of identical foragers, a fleet of identical market makers), there is no reason to learn a separate actor network for each. Parameter sharing uses one network $\pi_\theta$ for all $n$ agents, distinguishing them only by their observation and an agent-identity feature appended to the input. This is the dashed green box in Figure 30.7.1. Its benefits are large and twofold. First, sample efficiency: every agent's experience trains the one shared network, so $n$ agents collect $n$ times the data for a single set of weights. Second, scaling: the parameter count stays essentially flat as agents are added (at most the input weights for the identity feature grow), which is what lets a single shared actor drive a swarm of hundreds (the regime of Chapter 31) without the model size exploding.

Sharing is not always right. It bakes in the assumption that agents should behave identically given the same observation, which is false when agents have genuinely different roles (a defender and a striker, a leader and followers) or different action spaces. The honest rule is to share within a class of homogeneous agents and keep separate networks across classes; many real systems carry a small number of shared networks, one per role, rather than either extreme. The centralized critic, by contrast, is almost always a single network regardless of sharing, because there is only one joint state to evaluate.
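A small sketch makes the sharing mechanics and the parameter arithmetic concrete. The one-hot identity feature, the one-hidden-layer MLP, and all sizes below are assumptions picked for illustration; other identity encodings (a scalar index, a learned embedding) are equally plausible.

```python
import numpy as np

def actor_input(obs, agent_idx, n_agents):
    # Append a one-hot agent-identity feature to the local observation.
    one_hot = np.zeros(n_agents)
    one_hot[agent_idx] = 1.0
    return np.concatenate([obs, one_hot])

def mlp_param_count(in_dim, hidden, out_dim):
    # Weights and biases of a one-hidden-layer MLP.
    return in_dim * hidden + hidden + hidden * out_dim + out_dim

obs_dim, hidden, act_dim = 10, 64, 2
counts = {}
for n in (4, 100):
    shared = mlp_param_count(obs_dim + n, hidden, act_dim)      # one network for all agents
    separate = n * mlp_param_count(obs_dim, hidden, act_dim)    # one network per agent
    counts[n] = (shared, separate)
    print(f"n={n:3d}  shared={shared:6d}  separate={separate:6d}")

# One shared set of weights serving every agent (hypothetical random initialization).
rng = np.random.default_rng(0)
n_agents = 4
W1 = 0.1 * rng.standard_normal((hidden, obs_dim + n_agents))
b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((act_dim, hidden))
b2 = np.zeros(act_dim)

def shared_actor(obs, agent_idx):
    h = np.tanh(W1 @ actor_input(obs, agent_idx, n_agents) + b1)
    return W2 @ h + b2

same_obs = rng.standard_normal(obs_dim)
a0, a1 = shared_actor(same_obs, 0), shared_actor(same_obs, 1)   # differ only via identity
```

With a one-hot identity the shared network's input layer gains `hidden` weights per added agent, which is small next to the cost of a whole extra network. The identity feature is also what lets two agents with the same observation still act differently, a partial escape from the identical-behavior assumption discussed above.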

Library Shortcut: RLlib and MARLlib Do the Actor-Critic Plumbing

Code 30.7.1 hand-wrote the rollout loop, the baseline, and the gradient. Production MARL frameworks express a full MAPPO setup, with a parameter-shared actor and a centralized critic, as a configuration rather than a training loop. Ray RLlib's multi-agent API maps agents to policies through a policy-mapping function, so mapping every homogeneous agent to the same policy gives parameter sharing, and a centralized critic can be layered on top of that API. MARLlib (built on RLlib) ships MADDPG, COMA, MAPPO, and the value-decomposition baselines behind a single uniform interface so you can swap algorithms by name:

# MARLlib: a centralized-critic MAPPO run with parameter sharing, in a few lines
from marllib import marl

env = marl.make_env(environment_name="mpe", map_name="simple_spread")   # cooperative MPE
mappo = marl.algos.mappo(hyperparam_source="mpe")                       # PPO actors + central critic
model = marl.build_model(env, mappo, {"core_arch": "mlp", "encode_layer": "128-128"})
mappo.fit(env, model, share_policy="all", stop={"timesteps_total": 1_000_000})
#                       ^ "all" = full parameter sharing across homogeneous agents
Code 30.7.2: The centralized-critic pattern of Code 30.7.1, now as a full MAPPO run in four lines of configuration instead of a hand-written loop. share_policy="all" turns on parameter sharing, and the framework supplies the centralized critic, the PPO clipping, the rollout workers, and the distributed actor-learner execution that Chapter 20 unpacks.

5. Scaling These Methods Across a Cluster (Advanced)

The actor-critic methods above are algorithms; running them on anything larger than a toy needs the distributed reinforcement-learning machinery of Part IV. The centralized critic does not change the systems picture so much as add to it. The actor-learner architecture of Chapter 20, many parallel actors generating joint-action rollouts while a learner updates the policies, carries straight over, except that each rollout now records the joint observation and joint action so the centralized critic can be trained on it. The synchronous-versus-asynchronous tension from that chapter returns here too: MAPPO is on-policy and prefers synchronous rollouts, while MADDPG is off-policy and tolerates a distributed replay buffer fed by asynchronous actors.
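What "each rollout now records the joint observation and joint action" means in data terms can be sketched with a small record type and buffer. The names `JointTransition` and `JointReplayBuffer` and the field layout are hypothetical; a real distributed buffer would add sharding, prioritization, and serialization, none of which is shown here.

```python
from collections import deque
from dataclasses import dataclass
import random

import numpy as np

@dataclass
class JointTransition:
    joint_obs: np.ndarray        # (n_agents, obs_dim): every agent's local observation
    joint_action: np.ndarray     # (n_agents, act_dim): every agent's action
    rewards: np.ndarray          # (n_agents,): per-agent rewards (identical if shared)
    next_joint_obs: np.ndarray   # (n_agents, obs_dim)
    done: bool

class JointReplayBuffer:
    """Fixed-capacity buffer of joint transitions for training a centralized critic."""
    def __init__(self, capacity, seed=0):
        self._items = deque(maxlen=capacity)   # oldest transitions fall off the front
        self._rng = random.Random(seed)

    def add(self, transition):
        self._items.append(transition)

    def __len__(self):
        return len(self._items)

    def sample(self, batch_size):
        batch = self._rng.sample(list(self._items), batch_size)
        return {
            "joint_obs": np.stack([t.joint_obs for t in batch]),
            "joint_action": np.stack([t.joint_action for t in batch]),
            "rewards": np.stack([t.rewards for t in batch]),
            "next_joint_obs": np.stack([t.next_joint_obs for t in batch]),
            "done": np.array([t.done for t in batch]),
        }

rng = np.random.default_rng(0)
buffer = JointReplayBuffer(capacity=5)
for step in range(10):                         # stand-in for asynchronous actors
    buffer.add(JointTransition(
        joint_obs=rng.standard_normal((3, 4)),
        joint_action=rng.standard_normal((3, 2)),
        rewards=np.full(3, float(step)),
        next_joint_obs=rng.standard_normal((3, 4)),
        done=(step == 9),
    ))
batch = buffer.sample(4)
```

An off-policy learner such as MADDPG would sample from a buffer like this one, while an on-policy learner such as MAPPO would consume the same record type in freshly collected batches and then discard them.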

The same infrastructure that scales single-agent reinforcement learning is what scales MARL, and it is also what scales reinforcement learning from human feedback. The RLHF systems of Chapter 19 are, structurally, a distributed actor-critic loop: a policy model is the actor, a reward model plays the role of the critic, and the same rollout-and-update infrastructure drives the training. A practitioner who has built a distributed RLHF pipeline already owns most of the systems needed to run distributed MARL, which is one reason the two fields increasingly borrow each other's tooling.

Thesis Thread: The Actor-Learner Loop, Now Multi-Agent

The distributed actor-learner loop introduced for single-agent reinforcement learning in Chapter 20 returns here as the engine of multi-agent policy gradients, just as the cross-reference arc of this book promised: distributed RL infrastructure is introduced in Chapter 20, deepened into distributed MARL training in this chapter, and transformed into a multi-robot swarm in the case study of Part VI's closing chapters. The only structural addition the centralized critic makes is that every rollout must log the joint action, so the same collectives that synchronize a single-agent learner now also synchronize a critic that reasons over all agents at once.

Practical Example: Coordinating a Warehouse Robot Fleet

Who: A robotics team training a fleet of identical mobile robots to route packages through a shared warehouse floor.

Situation: Each robot can see only its own local neighborhood and must avoid collisions while keeping aisles clear, a cooperative continuous-control task with strong coupling between nearby robots.

Problem: Independent learners trained fine in isolation but became unstable as the fleet grew, because each robot treated the others as unpredictable moving obstacles and its policy-gradient updates oscillated.

Dilemma: Add a centralized critic that conditions on the whole floor's joint state and joint action, which stabilizes training but needs every rollout to carry global information, or stay fully decentralized and accept the instability and slow convergence.

Decision: They moved to a MADDPG-style setup with a centralized critic during training and a single shared actor across the homogeneous robots, keeping execution fully local so a deployed robot still acts on its own sensors alone.

How: They logged the joint observation and action in each rollout, trained the critic on the global view, and shared one actor network across the fleet so adding robots did not grow the model, mirroring the structure of Code 30.7.1 at production scale.

Result: Training converged at fleet sizes where independent learning had diverged, the shared actor let them scale from a dozen to over a hundred robots without retraining a new network per robot, and execution stayed decentralized, so no robot needed a live global feed on the floor.

Lesson: Pay for the global view only at training time. A centralized critic plus a shared actor buys stability and scale precisely when coupling between agents is what breaks independent learning.

6. The Research Frontier (Advanced)

Centralized-critic policy gradients remain an active research area, and the most consequential recent shift is the realization, crystallized by MAPPO, that simple methods carefully tuned are extraordinarily hard to beat. That has redirected effort from inventing new architectures toward understanding when and why the centralized critic helps, and toward pushing these methods into much larger and more open-ended settings.

Research Frontier: From Game Benchmarks to LLM Agents (2024 to 2026)

Two threads dominate. First, the MAPPO result (Yu et al., 2022) has been stress-tested and largely held up: follow-up work through 2024 and 2025 confirms that a well-implemented shared-critic PPO is a state-of-the-art cooperative baseline across the StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo suites, and the research question has shifted from beating it to characterizing the conditions under which a centralized critic provably reduces variance over an independent one. Second, and more disruptively, the same actor-critic policy-gradient machinery now trains teams of large language model agents: multi-agent RLHF and multi-agent fine-tuning treat several cooperating or debating LLMs as agents with a shared or centralized critic (a reward model conditioned on the joint transcript), so that the methods of this section are being lifted from grid worlds and robots onto language agents that negotiate, debate, and write code together. The distributed RLHF infrastructure of Chapter 19 is precisely what makes that lift feasible at the scale of billion-parameter actors.

We now have the policy-gradient branch of CTDE in hand: decentralized actors, a centralized critic that conditions on the joint action to cut variance, the three landmark algorithms that productionize it, and the parameter sharing and distributed infrastructure that scale it. Two questions remain open. COMA's counterfactual baseline hinted at the first: when a single shared reward arrives, which agent actually earned it? That is the credit-assignment problem, and Section 30.8 takes it up directly. The second, why the centralized critic was needed at all, is the non-stationarity of Section 30.9.

Exercise 30.7.1: When Does the Critic Earn Its Keep? (Conceptual)

The variance reduction in Output 30.7.1 was modest at four loosely coupled agents. Argue from the structure of the multi-agent policy gradient in Section 1 how the gap between the centralized and independent critic should change as (a) the number of agents $n$ grows, and (b) the coupling between agents strengthens. Then name one regime in which an independent critic is actually the better engineering choice despite its higher variance, and justify it in terms of what the centralized critic costs at training time.

Exercise 30.7.2: Scale the Coupling and Measure the Gap (Coding)

Modify Code 30.7.1 to sweep the number of agents $n \in \{2, 4, 8, 16\}$ and the coupling strength (the variable coupling in the code) over $\{0.25, 1.0, 4.0\}$, and for each combination report the gradient-variance ratio (independent over central). Plot or tabulate the result and confirm the prediction you made in Exercise 30.7.1. Then add a third branch in which the independent critic is allowed to see the joint state $\mathbf{x}$ but still not the other agents' actions, and report where it lands between the two extremes. Explain what this third branch tells you about which part of the joint information matters most.

Exercise 30.7.3: Choose the Algorithm (Analysis)

For each scenario, choose MADDPG, COMA, or MAPPO from Table 30.7.1 and defend the choice in two or three sentences: (a) a price-setting game between competing sellers with continuous price actions and per-seller profit rewards; (b) a fully cooperative team of discrete-action agents where you need to know which agent's action mattered for the shared reward; (c) a new cooperative continuous-control benchmark where you want the strongest simple baseline before investing in anything custom. State explicitly which feature of each method (off-policy replay, counterfactual baseline, PPO clipping) drove your decision.