Part VI: Distributed AI and Multi-Agent Systems
Chapter 30: Multi-Agent Reinforcement Learning

Markov Games

"I optimized my policy to perfection overnight, and woke up to find the world had a different transition function, because the other three agents had been optimizing too."

A Learner Whose Environment Kept Learning Back
Big Picture

A Markov game is the multi-agent generalization of the Markov decision process: many agents choose actions at once, the environment transitions on the joint action, and each agent collects its own reward and chases its own return. Single-agent reinforcement learning, the subject of Chapter 20, assumes one decision-maker facing a fixed, stationary world. The moment a second learning agent enters, that assumption breaks: the "environment" each agent faces now contains other agents whose policies change as they learn. The Markov game makes this precise. It is the extensive-form game of Chapter 28 made sequential and stochastic, and it inherits game theory's hardest gift: the answer is no longer a single optimal policy but an equilibrium of policies. This section sets that framework, shows why the joint action makes the action space explode with the number of agents, and so motivates every structured method in the rest of the chapter.

By the end of Chapter 20 we had a clean object to reason about: a single agent, a Markov decision process with states, actions, a transition kernel, and a reward, and a policy whose job is to maximize one expected discounted return. Section 30.1 argued that the interesting systems of Part VI, fleets of trading bots, teams of warehouse robots, swarms of drones, are not one agent but many, learning at the same time. To study them we need a model that keeps the Markov structure (the future depends on the present state, not the full history) while admitting several decision-makers whose actions interact. That model is the Markov game, also called the stochastic game, introduced by Shapley in 1953 and adopted as the standard formalism for multi-agent reinforcement learning. The rest of this section builds it piece by piece, then runs a tiny one to make the coupling unmistakable.

1. From One Decision-Maker to Many Beginner

Recall the single-agent Markov decision process: a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition kernel $P(s' \mid s, a)$ giving the probability of landing in $s'$ after taking $a$ in $s$, and a reward $R(s, a)$. A policy $\pi(a \mid s)$ chooses actions, and the agent maximizes the expected discounted return $\mathbb{E}\big[\sum_t \gamma^t R(s_t, a_t)\big]$. Everything about that world is stationary: the transition and reward depend only on the state and the agent's own action, so the optimal policy is a fixed target the agent can converge to.

A Markov game keeps the state and the discount but replaces the single decision-maker with $n$ agents acting simultaneously. At each step every agent $i$ observes the state and picks an action $a^i$; the actions stack into a joint action $\mathbf{a} = (a^1, a^2, \dots, a^n)$. The environment then transitions on that joint action, not on any single agent's choice, and hands each agent its own reward. The structural change is small to write and enormous in consequence: because the transition depends on $\mathbf{a}$, the world that agent $i$ experiences is shaped by what everyone else does, and since the others are also learning, that world is no longer stationary from $i$'s point of view. This is the non-stationarity that Section 30.3 and Section 30.9 wrestle with directly; here we simply name it as the inevitable price of the joint action.

Key Insight: The Joint Action Is the Whole Difference

An MDP and a Markov game differ in exactly one place: whether the transition and reward depend on one agent's action or on the joint action of all agents. That single coupling is the source of everything hard about MARL. It makes each agent's environment non-stationary (the others keep changing), it turns the solution from a single optimum into an equilibrium, and it makes the action space grow as the product of the individual action sets. Every method in this chapter is, at bottom, a way to tame one consequence of the joint action.

2. The Markov Game, Formally Intermediate

A Markov game (stochastic game) is a tuple $$\mathcal{G} = \big(\, n,\ \mathcal{S},\ \{\mathcal{A}^i\}_{i=1}^{n},\ P,\ \{R^i\}_{i=1}^{n},\ \gamma \,\big),$$ where $n$ is the number of agents, $\mathcal{S}$ is the state space, and $\mathcal{A}^i$ is agent $i$'s action set. The joint action space is the product $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$, and a joint action is $\mathbf{a} = (a^1, \dots, a^n) \in \mathcal{A}$. The transition kernel $$P(s' \mid s, \mathbf{a}) = P\big(s' \mid s,\, a^1, \dots, a^n\big)$$ gives the probability of the next state given the current state and the joint action; it is a single function of all the agents at once. Each agent has its own reward function $R^i(s, \mathbf{a})$, which also depends on the joint action, so one agent's payoff can hinge on another agent's move. Each agent follows its own policy $\pi^i(a^i \mid s)$, and the joint policy is $\boldsymbol{\pi} = (\pi^1, \dots, \pi^n)$.

Agent $i$ seeks to maximize its own expected discounted return, holding the others' policies fixed: $$V^i_{\boldsymbol{\pi}}(s) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, R^i\big(s_t, \mathbf{a}_t\big) \;\middle|\; s_0 = s,\ \mathbf{a}_t \sim \boldsymbol{\pi}(\cdot \mid s_t) \right].$$ Read this carefully against the single-agent return: the reward is now indexed by $i$ and evaluated at the joint action $\mathbf{a}_t$, and the expectation is taken over everyone's policies. Agent $i$ cannot maximize $V^i$ by itself, because $V^i$ depends on $\pi^{-i}$, the policies of all the other agents (the standard notation $-i$ means "everyone except $i$"). The single-agent MDP is the special case $n = 1$, where $\mathbf{a} = a^1$ and there is no $-i$ to worry about. Three special structures of the reward, fully cooperative ($R^i$ identical for all $i$), fully competitive (zero-sum, $\sum_i R^i = 0$), and mixed, carve up the landscape that Section 30.3 explores; the framework here is agnostic to which one holds.

A 2-agent Markov game: the joint action drives the transition and both rewards state s state s′ agent 1: a¹ agent 2: a² joint action a = (a¹, a²) P(s′ | s, a) stochastic next state per-agent rewards R¹(s, a)  R²(s, a) Partial observability (POMG / Dec-POMDP) s agent 1 sees o¹ agent 2 sees o² each agent sees a local observation, never the full state
Figure 30.2.1: The Markov game in one picture. In state $s$, agents 1 and 2 independently choose $a^1$ and $a^2$; the joint action $\mathbf{a}=(a^1,a^2)$ feeds a single stochastic transition $P(s'\mid s,\mathbf{a})$ and produces a separate reward $R^i(s,\mathbf{a})$ for each agent. The lower inset shows the partially observable extension of Section 4: the true state $s$ is hidden, and each agent acts on its own local observation $o^i$ rather than on $s$.

3. The Solution Is an Equilibrium, Not an Optimum Intermediate

In a single-agent MDP, "solve it" has an unambiguous meaning: find the policy that maximizes the return, and the Bellman optimality equation pins down a unique optimal value. A Markov game has no such single target, because agent $i$'s best policy depends on what the others do, and theirs depend on $i$. We borrow the resolution from game theory. A joint policy $\boldsymbol{\pi}^* = (\pi^{1*}, \dots, \pi^{n*})$ is a Markov-perfect (Nash) equilibrium if no agent can raise its own return by unilaterally changing its policy: $$V^i_{(\pi^{i*},\, \pi^{-i*})}(s) \;\ge\; V^i_{(\pi^i,\, \pi^{-i*})}(s) \quad \text{for every agent } i,\ \text{every policy } \pi^i,\ \text{every state } s.$$ This is exactly the Nash condition of Section 28.3, lifted from a one-shot game to the sequential, stochastic setting: at every state, the joint policy induces a stage game whose equilibrium each agent plays, and "perfect" means the property holds at every state, not just along the equilibrium path. The shift from optimum to equilibrium is not a technicality; it imports the full weight of game theory's two hard facts. Equilibria may not be unique (several stable joint policies can coexist, raising the equilibrium-selection problem of which one a learning rule should converge to), and computing even one Nash equilibrium is computationally hard in general. MARL inherits both. An algorithm that converges to a single point in the single-agent case may cycle, diverge, or settle on a poor equilibrium when several learners interact.

Thesis Thread: Game Theory Becomes Sequential and Stochastic

This section is the hinge that turns the static game theory of Chapter 28 into the learning dynamics of the rest of Chapter 30. The normal-form payoff matrix of Section 28.2 becomes a per-state stage game; the Nash equilibrium of Section 28.3 becomes the Markov-perfect equilibrium above; and the negotiation and task-allocation mechanisms of Chapter 29 become things agents must learn rather than be handed. Whenever a later section claims an algorithm "converges," ask: converges to which equilibrium, and is that equilibrium good for the team? The Markov game is the object that makes the question well-posed.

4. Partial Observability: The Realistic, Harder Case Advanced

The Markov game above assumes every agent sees the full state $s$. Real distributed agents almost never do. A warehouse robot sees its own camera, not the global map; a trading bot sees a noisy slice of the order book; a drone sees only its neighbors within radio range. The realistic model gives each agent its own observation. A partially observable Markov game (POMG) adds, for each agent, an observation set $\mathcal{O}^i$ and an observation function $O^i(o^i \mid s)$ that draws a local observation from the hidden state. Agent $i$ now conditions its policy on $o^i$ (or on its history of observations), never on $s$ directly. When the agents are fully cooperative, sharing one common reward $R(s, \mathbf{a})$, this specializes to the decentralized partially observable MDP (Dec-POMDP), the standard model for cooperative MARL under partial observability and the formal backbone of the centralized-training methods in Section 30.5.

Partial observability is not a mild complication; it is a hardness jump. Each agent must infer the hidden state from a stream of local clues while the other agents, equally in the dark, do the same, and an agent's optimal action can depend on what it believes the others have observed. Solving a Dec-POMDP optimally is NEXP-complete, a complexity class far above the polynomial-time solvability of a single-agent MDP, which is precisely why cooperative MARL leans on centralized training with decentralized execution: a learner that can peek at the global state during training sidesteps the inference problem, while each agent still acts on its own observation at deployment. We unpack that idea in Section 30.5; the link from coordination graphs (Section 29.5) is that when each agent's reward depends on only a few neighbors, the joint policy factorizes along the graph and the inference problem shrinks accordingly.

Fun Note: The Environment That Studies You Back

A single-agent learner enjoys a comforting fiction: the world is indifferent. It does not care what policy you pick; it just transitions. In a Markov game that fiction collapses. The "environment" now contains other agents who are quietly watching your moves and adjusting their own, so every improvement you make subtly rewrites the problem you were solving. It is less like climbing a fixed hill and more like climbing a hill that grows a new slope each time you find the top, because the other climbers are reshaping the terrain under your feet. Equilibrium is just the polite word for "everyone finally stopped moving the hill."

5. The Action Space Explodes With the Number of Agents Intermediate

The joint action carries a cost that grows brutally with $n$. If each of the $n$ agents has $|\mathcal{A}^i| = m$ individual actions, the joint action space has size $$|\mathcal{A}| = \prod_{i=1}^{n} |\mathcal{A}^i| = m^{\,n},$$ exponential in the number of agents. A naive learner that treats the team as one giant agent and learns a value over joint actions, or a centralized policy that outputs a distribution over $\mathcal{A}$, faces $m^n$ entries to estimate. With $m = 5$ moves and $n = 10$ agents, that is nearly ten million joint actions for a single state, and the count multiplies by the size of the state space. This is the curse of dimensionality specific to multi-agent learning, and it is the single most important reason the field does not simply reduce a Markov game to one enormous MDP and call Chapter 20's machinery. Sections 30.5 through 30.7, value decomposition and multi-agent policy gradients, are essentially structured answers to the question "how do we avoid ever materializing the $m^n$ joint action space while still coordinating?" Coordination graphs (Section 29.5) give one structural escape: if rewards couple only graph-adjacent agents, the joint value factorizes into small local terms whose total size grows linearly, not exponentially, in $n$.

6. A Tiny Markov Game You Can Run Beginner

Abstraction settles fastest against a concrete instance. The code below builds a minimal two-agent Markov game, a stochastic pursuit on a ring of five cells, and evaluates joint policies by Monte-Carlo rollout. A Pursuer and an Evader each move one step left, right, or stay; the transition resolves both moves at once (with a small random slip on the Pursuer, so the kernel is genuinely stochastic), and the rewards are per-agent and zero-sum: the Pursuer scores $+1$ for landing on the Evader, who scores $-1$. The point of the experiment is to swap one agent's policy and watch both agents' returns move, the signature that this is a Markov game and not two independent MDPs.

import random

R = 5                      # ring size
GAMMA = 0.95               # discount
H = 60                     # rollout horizon
EPISODES = 20000           # Monte-Carlo episodes per evaluation
random.seed(0)

def step(state, a_p, a_e):
    """Joint transition: both agents act, the environment resolves the joint move.
    A 10% 'slip' perturbs the Pursuer, so the kernel depends on BOTH actions
    and is genuinely stochastic."""
    pos_p, pos_e = state
    if random.random() < 0.1:
        a_p = random.choice((-1, 0, 1))          # slip: pursuer noise
    return ((pos_p + a_p) % R, (pos_e + a_e) % R)

def reward(next_state):
    """Per-agent reward from the JOINT next state. Zero-sum on capture."""
    pos_p, pos_e = next_state
    if pos_p == pos_e:
        return +1.0, -1.0                        # (r_P, r_E): Pursuer catches Evader
    return -0.02, +0.02                          # otherwise: step cost / bonus

def signed_gap(state):                           # shortest signed pursuer->evader distance
    pos_p, pos_e = state
    d = (pos_e - pos_p) % R
    return d if d <= R // 2 else d - R

def pursuer_chase(s):  g = signed_gap(s); return 0 if g == 0 else (1 if g > 0 else -1)
def pursuer_lazy(s):   return 0                  # naive baseline: never move
def evader_flee(s):    g = signed_gap(s); return -1 if g > 0 else (1 if g < 0 else 1)

def evaluate(policy_p, policy_e):
    """Monte-Carlo estimate of (E[return_P], E[return_E]) under a JOINT policy."""
    tot_p = tot_e = 0.0
    for _ in range(EPISODES):
        s = (0, R // 2)                          # fixed start: agents apart
        gp = ge = 0.0; disc = 1.0
        for _ in range(H):
            s = step(s, policy_p(s), policy_e(s))    # JOINT action drives transition
            r_p, r_e = reward(s)
            gp += disc * r_p; ge += disc * r_e; disc *= GAMMA
        tot_p += gp; tot_e += ge
    return tot_p / EPISODES, tot_e / EPISODES

print("Joint policy             return_P      return_E")
print("-" * 50)
vp, ve = evaluate(pursuer_chase, evader_flee)
print(f"(chase, flee)         {vp:11.3f}  {ve:11.3f}")
vp2, ve2 = evaluate(pursuer_lazy, evader_flee)
print(f"(lazy,  flee)         {vp2:11.3f}  {ve2:11.3f}")
print("-" * 50)
print(f"Swapping ONLY the Pursuer's policy shifts return_P by {vp - vp2:+.3f}")
print(f"... and ALSO shifts the Evader's return_E by         {ve - ve2:+.3f}")
Code 30.2.1: A complete two-agent Markov game in pure Python. step is the joint transition (it reads both agents' actions), reward returns one payoff per agent from the joint outcome, and evaluate estimates each agent's discounted return under a joint policy. Changing one agent's policy changes both returns, which two independent MDPs could never do.
Joint policy             return_P      return_E
--------------------------------------------------
(chase, flee)               1.262       -1.262
(lazy,  flee)               8.772       -8.772
--------------------------------------------------
Swapping ONLY the Pursuer's policy shifts return_P by -7.510
... and ALSO shifts the Evader's return_E by         +7.510
Output 30.2.1: Real output from the run. Holding the Evader's flee policy fixed and swapping only the Pursuer's policy moves the Pursuer's return by $-7.510$ and, in lockstep, the Evader's by $+7.510$ (the zero-sum reward keeps them mirror images). The coupling is the whole point: one agent's choice rewrites the other's outcome.

The numbers also deliver a small lesson in equilibrium intuition. The lazy Pursuer scores far higher than the chasing one, because an Evader that greedily flees a stationary Pursuer keeps circling the small ring and repeatedly steps back onto it. A best response is defined against the opponent's actual policy, not against an imagined optimal one, which is exactly why a Markov game's solution is an equilibrium pair and why "play the obviously smart move" can be the wrong move once the other agent's behavior is fixed. We make this Pursuer-Evader structure the running example for the cooperative-versus-competitive taxonomy in Section 30.3.

Library Shortcut: PettingZoo Gives You the Markov Game API

The hand-rolled loop in Code 30.2.1 spells out the Markov game so you can see every moving part, but you would not build research environments this way. PettingZoo (the Farama Foundation's multi-agent counterpart to Gymnasium) is the de facto standard: it packages the tuple of Section 2 behind a small API, where step takes a dictionary of per-agent actions (the joint action) and returns per-agent observations and rewards, and ships dozens of cooperative, competitive, and mixed environments ready to use.

# pip install "pettingzoo[mpe]"
from pettingzoo.mpe import simple_tag_v3        # a pursuit (tag) Markov game

env = simple_tag_v3.parallel_env(num_good=1, num_adversaries=3)
observations, infos = env.reset(seed=0)
while env.agents:
    # ONE action per agent -> the joint action, exactly as in Section 2
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    # rewards is a dict: a SEPARATE per-agent reward R^i, just like the tuple above
env.close()
Code 30.2.2: The same joint-action, per-agent-reward contract as Code 30.2.1, now as a few lines against PettingZoo's parallel_env. The library supplies the state, transition, partial observations, and reward bookkeeping; you supply the policies. Every learning algorithm from Section 30.4 onward plugs into this interface.
Practical Example: Modeling a Ride-Hailing Fleet as a Markov Game

Who: A research engineer on the dispatch team at a ride-hailing company.

Situation: Hundreds of idle drivers in a city must be repositioned each minute to meet incoming ride requests, and the team wanted a learning policy rather than hand-tuned heuristics.

Problem: Their first prototype trained each driver as an independent single-agent RL policy, treating other drivers as part of a fixed environment. It looked fine in simulation and degraded badly on rollout.

Dilemma: Keep the simple independent-agent model, which trains fast but ignores that drivers compete for the same riders, or adopt the heavier Markov-game model, where the transition (which driver gets which rider) and each driver's reward depend on the joint repositioning of all drivers.

Decision: They reformulated dispatch as a partially observable Markov game: state is the city's supply-demand map, each driver observes only its local neighborhood, and the reward for a reposition depends on what every other nearby driver did, because two drivers chasing one rider both lose.

How: They modeled the system as a Dec-POMDP and trained with centralized training and decentralized execution (Section 30.5), letting the critic see the global supply map while each driver acted on its local view, exactly the partial-observability structure of Section 4.

Result: Accounting for the joint action eliminated the "everyone rushes the same hotspot" failure that the independent model could not even represent, and fleet utilization rose because drivers spread out to a near-equilibrium coverage.

Lesson: When agents share a resource, independent single-agent RL silently mismodels the world; the Markov game is the model that makes the competition for that resource explicit, and only then can the learner coordinate around it.

7. Why This Framework Organizes the Rest of the Chapter Beginner

The Markov game is the lens through which every later section of Chapter 30 comes into focus. The cooperative, competitive, and mixed settings of Section 30.3 are just constraints on the reward functions $\{R^i\}$. Independent learners (Section 30.4) are the temptation to ignore the joint action and pretend each agent lives in its own MDP, and non-stationarity (Section 30.9) is the bill that temptation runs up. Centralized training with decentralized execution (Section 30.5), value decomposition (Section 30.6), and multi-agent policy gradients (Section 30.7) are three disciplined ways to respect the joint action without paying the full $m^n$ cost. And the distributed-systems payoff arrives in Section 30.10: training many agents in many parallel environments is exactly the actor-learner pattern of Chapter 20, now generating joint-action trajectories instead of single-agent ones. Hold the tuple of Section 2 in mind, and the chapter reads as a sequence of answers to the problems the joint action creates.

Research Frontier: Markov Games at Scale and Equilibrium Guarantees (2024 to 2026)

Two threads keep the Markov game framework alive at the research frontier. The first is theoretical: a wave of work on the sample complexity of learning equilibria in Markov games (Markov-game extensions of V-learning and decentralized Q-learning, and finite-sample guarantees for converging to coarse-correlated and Nash equilibria) is putting MARL on firmer footing than the empirical-only era that preceded it, and clarifying when equilibrium learning is even tractable. The second is empirical and large: open-ended and population-based systems, in the lineage of DeepMind's XLand and AlphaStar and now framed as multi-agent curricula, treat the Markov game as a generator of ever-harder co-adapting opponents, and 2024 to 2026 work on large-population and mean-field games scales the formalism toward thousands of agents by approximating the effect of the crowd on any one agent. The PettingZoo and JaxMARL ecosystems (the latter running thousands of parallel Markov-game environments on a GPU) have made these studies reproducible. The common thread: the tuple of Section 2 is unchanged, but the questions have moved from "can an agent learn at all?" to "to which equilibrium does a population of learners converge, and how fast?"

Exercise 30.2.1: MDP or Markov Game? Conceptual

For each system, state whether it is best modeled as a single-agent MDP or a multi-agent Markov game, and if the latter, identify $n$, give an example individual action, and say what the joint action couples: (a) one robot vacuum cleaning an empty apartment; (b) four robot vacuums sharing one apartment and one charging dock; (c) a single chess engine playing against a fixed, non-learning opponent; (d) two chess engines that both update their policies after every game. For each Markov game, name one way agent $i$'s reward depends on $a^{-i}$, the others' actions.

Exercise 30.2.2: Watch the Action Space Explode Coding

Write a short function that, given the number of agents $n$ and per-agent action count $m$, returns the joint action space size $m^n$ and prints a table for $m = 4$ and $n = 1, 2, 4, 8, 16$. Then extend Code 30.2.1 to a third agent (a second pursuer) so the joint action is a triple, and confirm the rollout still runs but the number of distinct joint actions a tabular method would track has grown from $3^2$ to $3^3$. Explain in two sentences why a tabular joint-action value table becomes hopeless well before $n = 16$, and which later section (30.6 or 30.7) offers the escape.

Exercise 30.2.3: Best Response and Equilibrium Analysis

Using Code 30.2.1, hold the Evader fixed at evader_flee and add a third Pursuer policy of your own (for example, one that moves toward the Evader only when adjacent and stays otherwise). Evaluate all three Pursuer policies against the fixed Evader and report which is the Pursuer's best response. Then flip it: fix the Pursuer at its best response and search for the Evader's best response among two or three candidate flee policies. Argue whether the resulting pair looks like a Markov-perfect equilibrium, and connect your reasoning to the Nash condition of Section 28.3 and the equilibrium-selection problem this section raised.