"They handed all of us one number and called it the team reward. It took us a thousand episodes to figure out which of us had earned it, and we are still arguing."
A Worker Awaiting Its Share of the Credit
The reward structure of a Markov game, not the learning algorithm, is the first thing that decides what "learning" even means when several agents share an environment. Hand every agent the same reward and the problem becomes coordination under a credit-assignment fog: who actually earned the shared number? Make one agent's gain exactly another's loss and the problem becomes a search for an equilibrium that no opponent can exploit, solved by self-play and minimax reasoning. Leave the rewards partly aligned and partly opposed, the realistic middle, and the central question turns sociological: will independently learning agents discover cooperation or collapse into mutual defection? This section maps those three regimes, shows the same agents behaving completely differently under each, and explains why the regime you are in fixes the family of algorithms the rest of this chapter can use.
The previous section gave us the Markov game as the formal object of multi-agent reinforcement learning: a shared state, a joint action, a transition rule, and a separate reward function $R_i$ for each of the $n$ agents. That last detail, one reward function per agent, is where the entire field forks. A single-agent problem has one reward and one notion of "better". A Markov game has $n$ rewards, and how those rewards relate to one another determines whether the agents are teammates, adversaries, or something messier in between. Before choosing an algorithm, before worrying about non-stationarity or credit assignment, you must classify the reward structure, because that classification tells you which solution concept you are even allowed to aim for.
There are three structures, and they form a spectrum. At one end the rewards are identical; at the other they sum to zero; in the wide middle they are arbitrary. We take them in turn, then show with a runnable experiment that the same learning agents, changing nothing but the reward table, produce three qualitatively different outcomes.
1. Fully Cooperative: One Reward, Many Hands Beginner
In the fully cooperative setting every agent receives the identical reward, $R_1 = R_2 = \dots = R_n = R$. There is one team, one scoreboard, and a single global objective the agents must jointly maximize: the expected return of that shared reward,
$$J(\pi_1, \dots, \pi_n) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, R(s_t, a^1_t, \dots, a^n_t)\right].$$This is the friendliest structure in principle and one of the hardest in practice, for two reasons that recur throughout the chapter. The first is coordination. Because the reward depends on the joint action, an agent's best move depends on what the others do; two agents that each act optimally against the wrong assumption about the other can land on a poor joint outcome. The second is credit assignment. When the team earns a reward, the shared signal says nothing about which agent's action produced it, so each agent struggles to tell whether its own behavior helped or merely rode along. This is the credit-assignment problem, and it is important enough that Section 30.8 is devoted to disentangling individual contributions from a team reward.
The saving grace of the cooperative setting is structural, and it powers most of the rest of the chapter. Because the agents share an objective, training can be centralized even when execution must stay decentralized: during learning we may use information from all agents (the full joint state, every action) to compute a single team value, while each agent still acts on only its own local observation at deployment. That pattern is centralized training with decentralized execution, developed in Section 30.5, and it is the structural opening through which value decomposition methods such as VDN and QMIX (Section 30.6) factor the team value into per-agent pieces. None of that machinery is available unless the rewards are shared; it is bought entirely by the cooperative structure.
You do not pick a MARL algorithm by taste; the reward structure picks the family for you. A shared reward unlocks centralized training, value decomposition, and team-credit methods, because there is a single objective to factor. A zero-sum reward forbids those methods (there is no common objective) and instead invites self-play and minimax solvers, because the goal is an unexploitable equilibrium. A general-sum reward rules out both clean stories and forces you into opponent-aware learning and repeated-game strategy. Classify the rewards first; the algorithm shortlist falls out of the classification, not the other way around.
2. Fully Competitive: Zero-Sum and the Minimax Goal Intermediate
At the opposite end sit two agents whose rewards are exact opposites: $R_1(s, a^1, a^2) + R_2(s, a^1, a^2) = 0$ at every step. One agent's gain is precisely the other's loss, so there is no shared objective to maximize and no team to coordinate. This is the world of game-playing AI, of chess, Go, poker, and competitive video games, where "winning" for one side is "losing" for the other by definition.
Because the agents want opposite things, the right notion of a solution is no longer "maximize a common return" but an equilibrium that neither side can improve on unilaterally. In a two-player zero-sum game that equilibrium is the minimax point: agent 1 plays the policy that maximizes its worst-case return against any opponent, and by the minimax theorem this value coincides with the one agent 2 obtains by minimizing agent 1's return. The state-action value an agent learns is therefore defined against an adversarial opponent,
$$Q_1(s, a^1) = R_1(s, a^1, a^2) + \gamma\, \mathbb{E}_{s'}\!\left[\,\max_{\pi_1}\min_{a^2}\, \mathbb{E}_{a^1 \sim \pi_1}\, Q_1(s', a^1, a^2)\right],$$which is the update at the heart of Littman's minimax-Q algorithm, the zero-sum analogue of ordinary Q-learning. The equilibrium policy is often a mixed (randomized) strategy, exactly as in the matching-pennies game of Chapter 28: any deterministic choice would be exploitable, so the unexploitable answer is to randomize in a way the opponent cannot predict. We will see this concretely in the demo, where the competitive agents settle into near-uniform play rather than any fixed joint action.
The cleanest way to get a strong opponent for a zero-sum game is to clone the agent you are training and let the two copies fight. Each improvement on one side raises the difficulty for the other, so the pair drag each other up a difficulty ladder that no fixed opponent could provide. The eerie part is that nobody ever told the system the rules of good play; the curriculum is generated entirely by two copies of the same network trying to embarrass each other.
3. Self-Play and Its Instabilities Intermediate
The training paradigm that made zero-sum MARL famous is self-play: the agent improves by playing against copies or past versions of itself, using the resulting games as training data. This is the lineage of TD-Gammon, of AlphaGo and AlphaZero (one network learning chess, shogi, and Go from self-play alone), and of the league-based training that produced grandmaster-level StarCraft II play. The appeal is that self-play manufactures an automatic curriculum: the opponent is always exactly as strong as the current agent, so the difficulty rises in lockstep with skill, with no human-labeled games required.
Self-play is powerful and unstable in equal measure, and the instability is worth naming because it shapes the engineering. Training against only the latest copy of yourself can produce cycling, where strategy A beats B, B beats C, and C beats A, so the agent chases its own tail and never converges; this is the rock-paper-scissors dynamic of non-transitive games. It can also produce strategic collapse, where both copies forget how to handle styles they no longer face. The standard remedies keep a population: train against a league of past checkpoints and diverse opponents rather than only the current self, prioritizing opponents that currently exploit you. These population methods, and the actor-learner infrastructure that generates self-play games at scale, build directly on the distributed RL machinery of Chapter 20, which is what makes self-play at the AlphaZero scale a distributed-systems problem and not only an algorithmic one.
4. Mixed-Motive: The Realistic Middle and Social Dilemmas Advanced
Most real multi-agent situations are neither purely cooperative nor purely competitive. Self-driving cars share the goal of avoiding collisions but compete for the same gap in traffic. Trading agents profit individually yet collectively shape a market. Firms, nations, and recommender systems all have interests that are partly aligned and partly opposed. These are general-sum (mixed-motive) games, where the reward functions are arbitrary and $R_1 + R_2$ need not be constant. There is no single objective to maximize and no clean minimax to compute; the relevant solution concept is a Nash equilibrium, but general-sum games can have several, of differing quality, which is precisely what makes them hard.
The sharpest version of the difficulty is the social dilemma, whose archetype is the Prisoner's Dilemma. Each agent has a dominant action (defect) that is individually better regardless of what the other does, yet if both follow that dominant action they reach an outcome worse for both than if they had jointly cooperated. The payoffs satisfy $T > R > P > S$, where $R$ is mutual cooperation, $P$ mutual defection, and $T$, $S$ the temptation and sucker payoffs; individual rationality drives the pair from the $(R, R)$ they would both prefer to the $(P, P)$ they both regret. The central empirical question of mixed-motive MARL is whether learning agents, left to optimize their own returns, discover cooperation or converge to defection.
The answer hinges on whether the game is played once or repeatedly. In a one-shot dilemma defection is unavoidable. In a repeated game, the shadow of future interaction makes conditional cooperation rational: strategies such as tit-for-tat reward cooperation and punish defection across rounds, and the folk theorem (built in Section 28.7 on repeated games) shows that cooperation can be sustained as an equilibrium when agents value the future enough. Whether independent learners actually find these cooperative equilibria, rather than the safe defecting one, is not guaranteed and depends on their algorithms, their memory of past play, and their exploration, which is exactly the open problem the research frontier below pursues.
Mixed-motive learning is among the most active corners of MARL. Sequential social dilemmas, spatially and temporally extended games where cooperation and defection unfold over many steps rather than a single choice, are the standard testbed; DeepMind's Melting Pot suite (treated in Section 29.3) evaluates whether agents generalize cooperative behavior to unfamiliar co-players. A second thread is opponent shaping: methods in the LOLA and POLA lineage have each agent account for the fact that its own learning changes how others learn, and recent model-free and scalable variants push opponent shaping toward larger populations. The newest and fastest-moving thread studies large language model agents in matrix and repeated social-dilemma games, asking whether prompted or fine-tuned LLM agents reciprocate, retaliate, or get exploited; 2024 to 2026 work reports that LLM agents can sustain cooperation under the right prompting and memory yet remain sensitive to framing, and that population-based and reputation mechanisms shift the cooperation rate measurably. The unifying open question is unchanged from the classical theory: can selfish learners be reliably steered to the cooperative equilibrium they both prefer?
5. The Same Agents, Three Reward Structures Intermediate
The claim that the reward structure, not the algorithm, decides the qualitative outcome is testable, so we test it. The code below puts two identical tabular Q-learning agents into a repeated two-action stage game (action 0 is cooperate or share, action 1 is defect or grab) and runs them three times. Nothing changes between runs except the reward table: a shared team reward, a zero-sum reward, and a Prisoner's Dilemma. The agents, the learning rate, and the exploration schedule are byte-for-byte identical across all three. We then read off the joint action they converge to and how often each joint outcome occurs in the converged tail of training.
import numpy as np
ACTIONS = (0, 1) # 0 = Cooperate / share, 1 = Defect / grab
def payoff_cooperative(a0, a1):
# Shared team reward: both get the SAME number, maximized by joint cooperate.
table = {(0, 0): (2.0, 2.0), (0, 1): (-1.0, -1.0),
(1, 0): (-1.0, -1.0), (1, 1): (0.0, 0.0)}
return table[(a0, a1)]
def payoff_competitive(a0, a1):
# Zero-sum: r0 + r1 = 0. A matching-pennies-style game with a mixed Nash.
r0 = 1.0 if a0 == a1 else -1.0
return (r0, -r0)
def payoff_mixed(a0, a1):
# Prisoner's Dilemma (a social dilemma): defection dominates each round, yet
# mutual cooperation pays more than mutual defection. T > R > P > S.
table = {(0, 0): (3.0, 3.0), (0, 1): (0.0, 5.0),
(1, 0): (5.0, 0.0), (1, 1): (1.0, 1.0)}
return table[(a0, a1)]
def run(payoff, episodes=20000, alpha=0.05, eps0=0.5, seed=0):
rng = np.random.default_rng(seed)
Q = np.zeros((2, 2)) # Q[agent, action], stateless single-state game
counts = np.zeros((2, 2)) # joint-action visit counts for the tail window
for t in range(episodes):
eps = eps0 * (1.0 - t / episodes) # anneal toward greedy
a = [int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[i]))
for i in range(2)]
r = payoff(a[0], a[1])
for i in range(2):
Q[i, a[i]] += alpha * (r[i] - Q[i, a[i]]) # independent Q-learning
if t >= int(0.9 * episodes): # record converged tail
counts[a[0], a[1]] += 1
return Q, counts / counts.sum()
def describe(name, payoff):
Q, freq = run(payoff)
greedy = (int(np.argmax(Q[0])), int(np.argmax(Q[1])))
label = {0: "Cooperate", 1: "Defect"}
r = payoff(*greedy)
print(f"=== {name} ===")
print(f" learned greedy joint action : ({label[greedy[0]]}, {label[greedy[1]]})"
f" -> payoff {r}")
print(f" joint-action freq (last 10%): CC={freq[0,0]:.2f} CD={freq[0,1]:.2f} "
f"DC={freq[1,0]:.2f} DD={freq[1,1]:.2f}")
print(f" sum of rewards at greedy : {r[0] + r[1]:+.1f}\n")
describe("Cooperative (shared team reward)", payoff_cooperative)
describe("Competitive (zero-sum)", payoff_competitive)
describe("Mixed-motive (Prisoner's Dilemma)", payoff_mixed)
run; the agents, learning rule, and exploration schedule never change, so any difference in outcome is caused by the reward structure alone.=== Cooperative (shared team reward) ===
learned greedy joint action : (Cooperate, Cooperate) -> payoff (2.0, 2.0)
joint-action freq (last 10%): CC=0.97 CD=0.01 DC=0.01 DD=0.00
sum of rewards at greedy : +4.0
=== Competitive (zero-sum) ===
learned greedy joint action : (Cooperate, Cooperate) -> payoff (1.0, -1.0)
joint-action freq (last 10%): CC=0.26 CD=0.24 DC=0.25 DD=0.26
sum of rewards at greedy : +0.0
=== Mixed-motive (Prisoner's Dilemma) ===
learned greedy joint action : (Defect, Defect) -> payoff (1.0, 1.0)
joint-action freq (last 10%): CC=0.00 CD=0.01 DC=0.01 DD=0.98
sum of rewards at greedy : +2.0
The three outcomes are exactly the three stories of this section, produced from one program. Under the shared reward the agents coordinate on the joint optimum without difficulty. Under the zero-sum reward they never settle on a pure joint action; because any deterministic choice is exploitable, the converged behavior spreads over all four joint outcomes and the reward sum stays at zero, the fingerprint of a mixed minimax equilibrium. Under the Prisoner's Dilemma the independent learners each follow their dominant action and land in mutual defection, leaving the superior cooperative outcome on the table, which is the social dilemma in its starkest form. Change nothing but the rewards and you change what the agents become.
The equilibria of Chapter 28 were static: a payoff matrix and a fixed-point strategy you solve for analytically. Here those same equilibria reappear as the resting points of a distributed learning process, reached (or missed) by agents that hold no payoff matrix and only sample rewards from a shared environment. The cooperative optimum, the mixed minimax, and the dilemma's defection trap in Output 30.3.1 are not computed; they are converged to, by separate processes exchanging nothing but their effect on a common world. Carrying a game-theoretic solution concept from a matrix into a population of learners, and asking whether the learners actually find it, is the move that turns game theory into multi-agent reinforcement learning.
Code 30.3.1 hand-rolled the stage games to expose the mechanism. In practice you reach for PettingZoo, the standard multi-agent environment API (the multi-agent counterpart to Gymnasium), which provides cooperative, competitive, and mixed-motive environments behind one uniform interface, so swapping the reward structure is swapping an import rather than rewriting a loop:
# pip install pettingzoo
from pettingzoo.mpe import simple_spread_v3 # cooperative: shared coverage reward
from pettingzoo.classic import connect_four_v3 # competitive: zero-sum board game
from pettingzoo.butterfly import pistonball_v6 # mixed cooperation under shared physics
env = simple_spread_v3.parallel_env() # same .reset()/.step() API for all three
observations, infos = env.reset(seed=0)
while env.agents: # agents may enter and leave over time
actions = {a: env.action_space(a).sample() for a in env.agents}
observations, rewards, terminations, truncations, infos = env.step(actions)
Who: An applied-RL engineer at a logistics company training a fleet of picking robots that share aisles.
Situation: Each robot was trained with its own per-robot throughput reward (items picked per hour), the obvious individual objective.
Problem: The robots learned to race for the same high-value shelves and block one another at intersections; aggregate throughput stalled even as each robot's individual reward rose, a textbook mixed-motive trap where individually rational behavior was collectively poor.
Dilemma: Switch to a single shared fleet-throughput reward, which makes the problem cleanly cooperative and unlocks centralized training, but blurs which robot earned what and slows credit assignment, or keep per-robot rewards and try to engineer cooperation into a general-sum game with opponent-aware learning, which is harder to get to converge.
Decision: They moved to a shared team reward, accepting the cooperative structure precisely because it unlocked centralized training with decentralized execution and value decomposition for the credit-assignment problem.
How: The fleet reward became total items picked minus a congestion penalty; training used a centralized critic over the joint state while each robot still acted on its local view, then a QMIX-style value mix attributed the team return back to individual robots.
Result: Aggregate throughput rose by a double-digit percentage and intersection deadlocks dropped sharply, because the agents now optimized the quantity the business actually cared about rather than a private proxy that pitted them against each other.
Lesson: The reward structure is a design choice, not a given. Reshaping a mixed-motive problem into a cooperative one, when the application allows it, can buy you an entire family of well-behaved algorithms; the cost is the credit-assignment work of Section 30.8, which is usually a price worth paying.
6. From Structure to Algorithm Beginner
The practical payoff of this section is a routing rule for the rest of the chapter. Once you have classified the reward structure, the available algorithm families and their typical failure modes follow, and Table 30.3.1 makes the routing explicit. Read it as the decision you make before opening any of the method sections that follow.
| Reward structure | Reward relation | Solution concept | Algorithm family (this chapter) |
|---|---|---|---|
| Fully cooperative | $R_1 = \dots = R_n$ | Joint optimum | CTDE, value decomposition (30.5, 30.6) |
| Fully competitive | $R_1 + R_2 = 0$ | Minimax / Nash | Self-play, minimax-Q |
| Mixed-motive | $R_i$ arbitrary | Nash (possibly many) | Opponent-aware learning, repeated-game strategy |
One structure cuts across all three rows. Whatever the reward relation, the moment several agents learn at once each one's environment is non-stationary, because the others are changing too; that difficulty, deferred here, is the subject of Section 30.9. And before any of these structured methods, the simplest possible approach is to ignore the multi-agent nature entirely and let each agent run an ordinary single-agent algorithm, treating the others as part of the environment. Whether that naive idea works, and exactly where it breaks, is the independent-learners story that Section 30.4 takes up next.
For each scenario, state whether the natural reward structure is cooperative, competitive (zero-sum), or mixed-motive, and name one algorithmic consequence of that classification using Table 30.3.1: (a) two teams of agents in a capture-the-flag game where only the winning team scores; (b) a fleet of delivery drones rewarded by total packages delivered fleet-wide; (c) several ad-bidding agents owned by different companies competing in the same auction; (d) a single chess engine playing itself to improve. For any case you find ambiguous, explain what additional detail would settle the classification.
Extend Code 30.3.1 so the Prisoner's Dilemma is played repeatedly with memory: give each agent a state equal to the opponent's previous action (so the Q-table becomes $Q[\text{agent}, \text{prev\_opp\_action}, \text{action}]$) and let it learn a conditional policy. Run the mixed-motive game again and report the converged joint-action frequencies. Does conditional cooperation (a learned tit-for-tat) now emerge where mutual defection appeared in Output 30.3.1? Vary the discount factor and exploration schedule and describe which settings tip the outcome from defection toward cooperation, connecting your finding to the folk theorem of Section 28.7.
Consider a zero-sum game whose only structure is non-transitive: strategy A beats B, B beats C, and C beats A, like rock-paper-scissors. Argue why naive self-play against only the latest opponent can cycle indefinitely without converging, and why training against a population of past checkpoints can break the cycle. Estimate, in terms of the number of distinct strategies a population must retain, the memory and compute overhead of league-based self-play relative to single-opponent self-play, and relate that overhead to the distributed actor-learner infrastructure of Chapter 20 that generates the games.