Section 30.1: From Reinforcement Learning to MARL

"I finally learned the optimal policy. Then my opponent learned one too, and mine stopped being optimal. We have been doing this to each other for ten thousand episodes."
A Q-Table That Refuses to Converge

Big Picture

Multi-agent reinforcement learning is what happens when the agents of a multi-agent system stop being hand-coded and instead learn their policies by trial and error, each one optimizing its own reward inside a world that contains the other learners. Single-agent reinforcement learning rests on a quiet assumption: the environment is fixed, so the same action in the same state has the same expected consequence every time. The moment a second learning agent shares that environment, the assumption breaks, because the best thing to do now depends on what the others are currently doing, and they are changing too. This chapter is about learning to act well when the ground keeps moving. This first section states the shift precisely, separates it from the systems work of scaling a single learner, and demonstrates with a tiny runnable game that naive independent learning can chase its own tail forever.

The previous three chapters built the multi-agent picture from the top down. Chapter 27 framed distributed artificial intelligence as many loci of reasoning rather than one. Chapter 28 gave us the language of payoffs, best responses, and equilibria for agents whose interests may align or clash. Chapter 29 assembled agents into systems with communication, roles, and coordination protocols, but the policies in those systems were largely specified by a designer. The question this chapter answers is the natural next one: what if the agents do not come with policies, and must discover them through experience? That is multi-agent reinforcement learning, MARL for short, and it sits exactly at the intersection of game theory and reinforcement learning. It inherits the equilibria of the former and the trial-and-error machinery of the latter, and the friction between the two is where the subject becomes hard and interesting.

1. A One-Paragraph Recap of Single-Agent RL Beginner

Reinforcement learning studies an agent that learns how to act by interacting with an environment, with no labeled examples telling it the right move, only a scalar reward signal it tries to maximize over time. The standard formalism is the Markov decision process (MDP): at each step the agent observes a state $s$, chooses an action $a$ from a policy $\pi(a \mid s)$, receives a reward $r$, and the environment transitions to a new state $s'$ drawn from a fixed transition kernel $P(s' \mid s, a)$. The agent's objective is the expected discounted return, the reward it can expect to accumulate from now on,

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad J(\pi) = \mathbb{E}_{\pi}\!\left[\, G_0 \,\right], \qquad 0 \le \gamma < 1,$$

where the discount factor $\gamma$ weights near rewards more heavily than distant ones. Algorithms such as Q-learning learn an action-value function $Q(s,a)$ estimating the return for taking $a$ in $s$ and acting well thereafter, and the policy follows by acting greedily, or nearly so, with respect to $Q$. The infrastructure for running this at scale, the actor and learner processes that generate and consume experience by the billion, is the subject of Chapter 20. What matters here is the load-bearing word in the formalism: the kernel $P$ is fixed. Convergence guarantees for Q-learning depend on it. Pull that assumption and the theory wobbles.

2. Adding a Second Learner Breaks the Ground Beginner

Now put two learning agents in the same world. From agent 1's point of view, the environment it interacts with is no longer just the physics of the task; it includes agent 2, whose actions help determine the reward agent 1 receives and the state it lands in. If agent 2's policy were frozen, agent 1 would simply face a slightly more complicated but still fixed MDP, and ordinary reinforcement learning would apply. But agent 2 is not frozen; it is learning, which means its policy is drifting from one episode to the next as it adapts to agent 1. So the effective transition and reward dynamics that agent 1 experiences are changing over time, driven by the other agent's learning. The environment, from any single agent's perspective, is non-stationary. The fixed kernel $P$ that single-agent reinforcement learning assumes simply does not exist when the other agents are themselves in motion.

Figure 30.1.1 draws the contrast directly. On the left is the familiar single-agent loop: one agent, one environment, a stationary box that returns the same distribution of outcomes for the same action forever. On the right is the multi-agent loop, where each agent's "environment" is a composite that contains the other learning agents, so the arrows that feed agent 1 pass through agent 2's changing policy, and vice versa. Nothing in the right-hand picture stands still.

Figure 30.1.1: Single-agent reinforcement learning (left) is a closed loop between one agent and a stationary environment whose kernel never changes. In MARL (right), each agent's effective environment contains the other learning agent, so the dynamics agent 1 sees depend on agent 2's current policy and shift as agent 2 learns. The orange arrows are actions, the blue arrows are states and rewards, and the dashed box marks the shared dynamics that neither agent controls alone.

Key Insight: The Other Agents Are Your Environment, and Your Environment Is Learning

Single-agent reinforcement learning works because the environment is a fixed, if unknown, object you can probe until you understand it. In MARL the environment includes the other agents, and they are improving against you at the same time you improve against them. There is no fixed target to converge to; the best response keeps moving because everyone's best response keeps moving. Every difficulty in this chapter, unstable training, cycling policies, the need for centralized critics, traces back to this one fact.

3. Watching Naive Independent Learning Fail Intermediate

The most tempting way to do MARL is to ignore the problem: give each agent its own ordinary single-agent learner and let them loose, treating the others as part of the environment. This is the independent-learners approach, and Section 30.4 studies it properly. Here we use it only to make non-stationarity visible. The setting is a repeated two-action mismatch game, the reinforcement-learning cousin of matching pennies: the row agent is rewarded when the two agents choose the same action, the column agent is rewarded when they differ, and the two rewards always sum to zero. There is no pair of pure actions both agents can be happy with at once; the only stable solution is for each to randomize uniformly. Two independent tabular Q-learners chase this game below, each updating its own action-values while blind to the other's update rule.

import numpy as np

# Two independent tabular Q-learners playing a repeated 2x2 game with NO
# stationary best response: the "matching pennies" payoff. Row wants to match,
# Column wants to mismatch. Each agent treats the other as part of a fixed
# environment and runs vanilla Q-learning over its own two actions.
rng = np.random.default_rng(7)

# Row payoff R[a_row, a_col]; Column payoff is -R (zero-sum, mismatch game).
R = np.array([[+1.0, -1.0],
              [-1.0, +1.0]])

alpha, eps, T = 0.1, 0.1, 20000
Q_row = np.zeros(2)   # action-value of "play action a" for the Row learner
Q_col = np.zeros(2)   # action-value for the Column learner

def pick(Q):
    if rng.random() < eps:
        return rng.integers(2)
    return int(np.argmax(Q))

flips = 0             # how often Row's greedy preference changes over time
prev_arg = int(np.argmax(Q_row))
snapshots = []        # Row's preferred action sampled every 4000 steps

for t in range(T):
    a_row = pick(Q_row)
    a_col = pick(Q_col)
    r_row = R[a_row, a_col]
    r_col = -r_row
    # Each agent bootstraps on its OWN best value, blind to the other's update.
    Q_row[a_row] += alpha * (r_row + 0.0 - Q_row[a_row])
    Q_col[a_col] += alpha * (r_col + 0.0 - Q_col[a_col])
    cur = int(np.argmax(Q_row))
    if cur != prev_arg:
        flips += 1
        prev_arg = cur
    if (t + 1) % 4000 == 0:
        snapshots.append((int(np.argmax(Q_row)), int(np.argmax(Q_col))))

print("Row Q-values         :", np.round(Q_row, 3))
print("Col Q-values         :", np.round(Q_col, 3))
print("times Row's greedy action FLIPPED:", flips)
print("(preferred Row,Col) every 4000 steps:", snapshots)
print("a single best response never stabilizes: each agent chases the other")

Code 30.1.1: Two independent Q-learners in a repeated zero-sum mismatch game. Each agent runs ordinary single-agent Q-learning over its two actions and treats the opponent as part of the environment; nothing in either update acknowledges that the environment is itself a learning agent.

Row Q-values         : [-0.016 -0.356]
Col Q-values         : [-0.35   0.123]
times Row's greedy action FLIPPED: 1172
(preferred Row,Col) every 4000 steps: [(0, 0), (0, 0), (0, 0), (1, 1), (0, 1)]
a single best response never stabilizes: each agent chases the other

Output 30.1.1: The greedy preference of the row agent flips more than a thousand times over twenty thousand steps, and the snapshot of who-prefers-what keeps changing. The learners never settle on a fixed pair of actions, because for this game no fixed pair is stable.

Read the output as a failure to converge, and read that failure as a feature of the problem rather than a bug in the code. The row agent's greedy action flipped 1172 times; the periodic snapshots show the preferred-action pair wandering with no fixed point. This is exactly what non-stationarity does to a single-agent algorithm: each time the row agent tilts toward an action, the column agent learns to exploit that tilt, which moves the reward landscape under the row agent and pushes it the other way, around and around. The same independent Q-learners, dropped into a stationary single-agent maze, would converge cleanly. The cycling here comes entirely from the other learner being part of the environment. The mathematically honest equilibrium, randomize fifty-fifty, is something greedy value learning cannot represent, which is the first hint that MARL will demand ideas from Chapter 28 that single-agent reinforcement learning never needed.

Thesis Thread: Non-Stationarity Is the Spine of This Chapter

The cycling in Output 30.1.1 is not an artifact of one toy game; it is the central obstacle that organizes every section that follows. Markov games (Section 30.2) give us a formalism that admits many learners at once. Centralized training with decentralized execution (Section 30.5) fights non-stationarity by letting a critic see all agents during training. Value decomposition and multi-agent policy gradients (Sections 30.6 and 30.7) are stabilization strategies in disguise. Section 30.9 confronts non-stationarity head-on as a named subject. Whenever a MARL method looks more complicated than its single-agent ancestor, ask what part of the moving-target problem it is trying to pin down; the answer is almost always a response to the loop you just watched fail to converge.

4. MARL Algorithms Versus Distributed RL Infrastructure Intermediate

It is easy to confuse this chapter with Chapter 20, because both put the words "distributed" and "reinforcement learning" near each other, and the distinction is worth stating sharply. Chapter 20 was about the systems of scaling a single learning agent: how to spread thousands of actor processes across machines so they generate experience fast enough to feed one learner, how to ship that experience over the network, and how to keep the learner's parameters fresh on every actor. There is one policy being learned; the distribution is purely a throughput device, and the math is ordinary single-agent reinforcement learning unchanged. This chapter is about the algorithms of many learning agents whose policies genuinely differ and interact. The challenge is not how to move experience around quickly but what each agent should learn at all when the others are learning too. Table 30.1.1 lays the two side by side.

Table 30.1.1: Distributed RL infrastructure (Chapter 20) scales one learner across machines; MARL (this chapter) studies many interacting learners. They compose: a large MARL run uses Chapter 20's machinery to scale each agent's training.

Aspect	Distributed RL infrastructure (Ch 20)	Multi-agent RL (Ch 30)
Number of policies learned	One	Many, one per agent
Why it is distributed	Throughput: generate experience faster	The problem itself has many actors
Core difficulty	Moving data, staleness, fault tolerance	Non-stationarity, equilibria, credit
Environment stationarity	Stationary (a normal MDP)	Non-stationary from each agent's view
The intellectual tools	Systems: actors, learners, queues	Game theory plus reinforcement learning

The two are not rivals but layers, and a serious MARL deployment uses both. The algorithms of this chapter decide what each agent learns; the infrastructure of Chapter 20 makes each agent's learning fast enough to be practical, which is exactly why Section 30.10 on distributed MARL training reaches back to Chapter 20 explicitly. Keeping the layers distinct, what to learn versus how to run it at scale, prevents a category error that otherwise muddies the whole subject.

5. The Three Settings, and Why MARL Earns Its Keep Beginner

MARL problems sort into three settings by how the agents' rewards relate, and the distinction shapes which algorithm makes sense. In the cooperative setting the agents share a single team reward and succeed or fail together, as in a fleet of warehouse robots fulfilling one order stream. In the competitive setting the agents' rewards oppose, most cleanly in a zero-sum game like the mismatch game above or two-player Go, where one agent's gain is exactly another's loss. The mixed setting, the most common in practice, has agents whose interests partly align and partly conflict, like cars at a merge that all want to get through yet each want to go first. Section 30.3 develops these three settings carefully; for now it is enough to know that "multi-agent" does not mean one thing, and the reward structure is the first question to ask about any MARL problem.

Why learn these policies at all, rather than hand-design them as the agents of Chapter 29 often were? Because coordination and competition at any real scale are too subtle to specify by hand. No engineer can write down the play that beats a world-champion Go player, the lane-changing etiquette that maximizes throughput at a busy interchange, the bidding strategy that a market-making agent should use against thousands of adaptive counterparties, or the division of labor that lets a swarm of robots cover a building fastest. These are emergent behaviors that arise from agents optimizing against each other and against a shared task, and reinforcement learning is our best tool for discovering them. MARL is how we get coordination and competition we could not have written down, which is exactly why it underlies game-playing systems, multi-robot teams, market agents, and traffic control.

Practical Example: The Traffic Signals That Learned to Cooperate

Who: A transport-systems team at a city authority piloting adaptive signal control across a congested grid of intersections.

Situation: Each intersection ran a fixed-time signal plan tuned years earlier; during incidents and surges the plans were badly mismatched to demand, and queues spilled back across the grid.

Problem: A single optimizer controlling all intersections at once was intractable, because the joint action space grows exponentially with the number of signals, yet treating each intersection in isolation ignored that one light's timing changes the traffic the next light sees.

Dilemma: Hand-tune a coordinated fixed plan that is robust but never adapts, or let each intersection learn its own policy, which adapts but risks the cycling instability of Output 30.1.1 because every signal is part of every neighbor's environment.

Decision: They modeled it as a mixed-setting MARL problem, one learning agent per intersection sharing a city-wide delay reward, rather than one giant agent or many oblivious ones.

How: Each agent learned from its local queue and phase state, training used a critic that could see neighboring intersections to tame non-stationarity (the centralized-training idea of Section 30.5), and execution stayed fully decentralized so each signal acted on local observations alone.

Result: Average vehicle delay on the pilot corridor fell meaningfully against the fixed-time baseline, with the largest gains during the irregular surges the static plans handled worst, and the signals settled into stable coordinated waves rather than fighting each other.

Lesson: When the coordination is too intricate to hand-design and every controller sits inside its neighbors' environment, MARL with a stabilized training scheme buys behavior that neither isolated learners nor a single monolithic optimizer could deliver.

6. The Road Through This Chapter Beginner

The rest of the chapter builds the toolkit for learning under the moving target we just exposed. Section 30.2 introduces Markov games, the formalism that generalizes the MDP to many agents and gives non-stationarity a precise home. Section 30.3 works through the cooperative, competitive, and mixed settings. Section 30.4 studies independent learners in full, both their surprising successes and the failure you just saw. Section 30.5 introduces centralized training with decentralized execution, the dominant paradigm for taming non-stationarity. Sections 30.6 and 30.7 cover value decomposition (VDN, QMIX) and multi-agent policy gradients (MADDPG, MAPPO). Section 30.8 tackles credit assignment, the problem of knowing which agent earned the shared reward. Section 30.9 confronts non-stationarity as a named subject. Section 30.10 returns to Chapter 20 to scale all of this across machines.

Library Shortcut: PettingZoo and RLlib Handle the Multi-Agent Plumbing

Code 30.1.1 hand-rolled a two-agent game loop so you could see every update. Real MARL environments and trainers are standardized so you do not. PettingZoo is the multi-agent counterpart to the single-agent Gymnasium API: it exposes many agents through one consistent interface, including the agent-by-agent turn order and per-agent observations that the mismatch game needed. Ray RLlib then trains a policy per agent (or shared across agents) on top of it, handling experience collection, the centralized critics of Section 30.5, and the distributed actor and learner scaling from Chapter 20:

from pettingzoo.classic import rps_v2          # a built-in repeated game
from ray.rllib.algorithms.ppo import PPOConfig

env = rps_v2.env()                              # standardized multi-agent API
config = (PPOConfig()
          .environment("rock_paper_scissors")   # one policy learned per agent
          .multi_agent(policies={"agent_0", "agent_1"}))
algo = config.build()
for _ in range(50):
    algo.train()                                # collect, learn, repeat, scaled

Code 30.1.2: The same shape of problem as Code 30.1.1, expressed through PettingZoo and RLlib. The thirty-odd lines of manual game loop, action selection, and value updates collapse to a config and a training loop, and RLlib internally manages the per-agent policies, experience batching, and the distributed actor and learner processes that Chapter 20 built by hand.

Research Frontier: Many Learning LLM Agents (2024 to 2026)

The non-stationarity loop of Output 30.1.1 has resurfaced at the frontier of large-model research, because the newest agents are language models fine-tuned by reinforcement learning, and increasingly several of them learn together. Work on multi-agent and self-play reinforcement learning for LLMs, such as SPAG (Cheng et al., 2024), trains language models to improve through adversarial and cooperative games against copies of themselves, recovering classic self-play dynamics in a text-generation setting. In parallel, group-relative policy optimization (GRPO), the multi-sample reinforcement-learning method behind reasoning models like DeepSeek-R1 (DeepSeek-AI, 2025), treats a batch of an agent's own sampled responses as competitors scored against one another, a within-agent echo of the cross-agent competition studied here. The open research question is the same one this section opened with: when many adaptive LLM agents co-train, how do we keep the joint learning process from cycling or collapsing? The stabilization ideas of Section 30.5 and Section 30.9 are being rediscovered and extended for exactly this regime.

Fun Note: The Oldest MARL Lesson Is in a Hand Game

The mismatch game in Code 30.1.1 is matching pennies, a game children play, and its lesson is over a century old: against a thinking opponent, any predictable pattern is a weakness, so the only safe play is to be genuinely random. A run of human rock-paper-scissors fails for the same reason a greedy Q-learner fails here. People cannot help leaning, and a good opponent learns the lean. MARL is, in part, the science of teaching a machine the unintuitive discipline of refusing to have a favorite move.

We now have the shift that defines the chapter, that many learners make each other's worlds non-stationary, a demonstration that ignoring it leaves naive learning chasing its tail, and a clean separation between the algorithms of interacting learners and the infrastructure of scaling one. The next step is to give this many-agent world a proper formalism, the way the MDP formalizes the single-agent world. That formalism is the Markov game, and it begins in Section 30.2.

Exercise 30.1.1: Name the Non-Stationarity Conceptual

For each scenario, explain precisely why the environment is non-stationary from a single agent's point of view, and identify what plays the role of the "other learning agent": (a) two trading bots adjusting their bid strategies against each other in a market; (b) a self-driving car learning to merge while the human drivers around it also adapt to its behavior; (c) a single reinforcement-learning agent whose reward function is being changed periodically by a human supervisor. For (c), argue whether this is genuinely a MARL problem or only resembles one, and what the distinction turns on.

Exercise 30.1.2: Make the Cycling Worse, Then Better Coding

Starting from Code 30.1.1, first raise the learning rate alpha and lower the exploration rate eps, and report what happens to the number of greedy-action flips; explain the trend in terms of how sharply each agent over-reacts to the other. Then replace the greedy action selection with a softmax (Boltzmann) policy over the Q-values and track the empirical play frequencies over the last quarter of training. Does the softmax version sit closer to the fifty-fifty equilibrium that the game's structure demands? Explain why a stochastic policy is fundamentally better suited to this game than any deterministic one.

Exercise 30.1.3: Stationary Versus Non-Stationary, Side by Side Analysis

Freeze the column agent in Code 30.1.1 by removing its Q-update so its policy never changes (a fixed, possibly random, opponent), and let only the row agent learn. Argue from the structure of the problem why the row agent's learning should now converge, since it faces a genuine stationary MDP again. Then re-enable the column agent's learning and contrast the two runs. Use this comparison to state, in one sentence, the exact assumption that distinguishes single-agent reinforcement learning (Chapter 20) from MARL, and connect it to the discounted-return objective $J(\pi)$ from Section 1.