"I finally learned the optimal response to my neighbor. Took me ten thousand episodes. By then she had learned a new trick, and I was wrong again."
An Agent Stuck Chasing a Moving Target
An agent's environment is everything it does not control, and in a multi-agent system that includes the other agents, who are themselves learning and changing. A single agent faces a world it can model as fixed rules plus noise: if it acts the same way twice it can expect, on average, the same consequences. The moment a second adapting agent shares the world, that assumption breaks. Each agent has become part of the others' environment, and because every agent is changing its behavior over time, the world that each one sees is non-stationary: the same action in the same observed state can lead to different outcomes simply because the others have moved on. This non-stationarity is the single property that makes multi-agent learning hard, and it is the root cause behind most of the machinery in the next two chapters. This section names the environment dimensions that classify any agent problem, isolates the two that multi-agent settings add, and demonstrates non-stationarity in a few lines of Python that you can run yourself.
The previous section gave us agents and their internal architectures: how a single agent perceives, decides, and acts. This section turns the camera around to look at the world those agents inhabit. The reason the world deserves its own section is that the properties of the environment, not the cleverness of any one agent, dictate which coordination and learning methods are even feasible. A method that converges beautifully in a static, fully observable, single-agent world can diverge or oscillate forever once the world becomes partially observable and populated by other learners. Before you choose an algorithm, you classify the environment, and the vocabulary for that classification is the subject here.
We borrow the classical taxonomy of environment dimensions from the agent literature, then extend it with the two additions that matter most when many agents share a world. The classical dimensions tell you how hard a single agent's problem is. The multi-agent additions tell you why, even when each agent's individual problem looks easy, the joint problem can be brutal. Keeping these two layers distinct is the conceptual move that the rest of the chapter, and all of Chapter 30, rests on.
1. The Classical Environment Dimensions (Beginner)
Before agents had to worry about each other, the agent literature already had a compact checklist for describing any environment, popularized in the Russell and Norvig framing. Five axes do most of the work, and each one independently makes an agent's life harder when it tips toward the difficult end. Table 29.3.1 lists them. The value of the list is operational: read off where your problem sits on each axis, and you have narrowed the space of methods that can possibly apply before writing a line of code.
| Dimension | The question it answers | Consequence when it tips toward "hard" |
|---|---|---|
| Fully vs partially observable | Can the agent see the entire state at each step? | The agent must maintain a belief over hidden state, not just react to an observation. |
| Deterministic vs stochastic | Does an action have one outcome or a distribution of outcomes? | Planning becomes optimization over expectations, not lookup of a single result. |
| Static vs dynamic | Can the world change while the agent deliberates? | Decisions have deadlines; a slow optimal answer can be worse than a fast adequate one. |
| Discrete vs continuous | Are states and actions countable or real-valued? | Tabular methods give way to function approximation and continuous control. |
| Single vs multi-agent | Is the agent alone, or sharing the world with other deciders? | The environment now contains adaptive entities; see Section 2. |
The first four axes are familiar from single-agent reinforcement learning and planning, and we treat them as background here. A self-driving car faces a partially observable (occluded pedestrians), stochastic (uncertain tire grip), dynamic (traffic moves while you plan), continuous (steering and throttle are real-valued) world, and that combination already justifies a large body of method. The fifth axis is different in kind, not just degree, and it is where this chapter lives. Crossing from single to multi-agent does something the other four axes never do: it makes the environment contain other minds.
Single versus multi-agent is not just one more difficulty knob. Adding adapting agents to the world re-injects the other four difficulties through the back door. Other agents are a source of stochasticity (you cannot predict their action exactly), of partial observability (you cannot see their internal state or intentions), and of dynamism (they move while you deliberate). A world that was deterministic, fully observable, and static for a lone agent can become effectively stochastic, partially observable, and dynamic the instant a second learner joins it, even though the underlying physics never changed. That is why the fifth axis earns its own chapter.
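To make the "back door" argument concrete, here is a minimal sketch that records where a problem sits on the five axes and then shows the worst case once adapting agents join. The names `EnvProfile` and `effective_profile` are illustrative, not from any library, and the rule that a second learner always tips three axes to the hard end is a deliberately pessimistic assumption: the text only says this can happen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvProfile:
    """Where a problem sits on the five classical axes (True = the hard end)."""
    partially_observable: bool
    stochastic: bool
    dynamic: bool
    continuous: bool
    multi_agent: bool

    def hard_axes(self):
        """Names of the axes that sit at the hard end, in table order."""
        return [name for name, value in vars(self).items() if value]


def effective_profile(lone, other_learners):
    """Worst-case profile one agent experiences once adapting agents join.

    Assumption for illustration: other learners make the world effectively
    stochastic, partially observable, and dynamic. Discreteness is untouched.
    """
    if not other_learners:
        return lone
    return EnvProfile(
        partially_observable=True,
        stochastic=True,
        dynamic=True,
        continuous=lone.continuous,
        multi_agent=True,
    )


# A lone agent on a deterministic, fully observable, static, discrete board.
board = EnvProfile(False, False, False, False, False)
print(board.hard_axes())                                        # []
print(effective_profile(board, other_learners=True).hard_axes())
# ['partially_observable', 'stochastic', 'dynamic', 'multi_agent']
```

The physics of the board never changes between the two printouts; only the presence of another learner does.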
2. What Multi-Agent Adds: Non-Stationarity and Hidden Minds (Intermediate)
From the vantage point of any single agent, the rest of the world, including all the other agents, is summarized by a transition model: given the current state and the agent's own action, what is the distribution over the next state and reward? In a single-agent world that model is fixed by the environment's physics. In a multi-agent world the next state depends not only on agent $i$'s action $a_i$ but on the joint action of everyone, $a_1, \dots, a_n$. Agent $i$ only chooses $a_i$; the others' actions are determined by their policies $\pi_{-i}$. So the effective transition model that agent $i$ experiences, after marginalizing out the others, is
$$P_i^{\text{eff}}(s' \mid s, a_i) = \sum_{a_{-i}} \Big[ \prod_{j \ne i} \pi_j(a_j \mid s) \Big]\, P(s' \mid s, a_i, a_{-i}).$$

The crucial term is the policies $\pi_j$ baked into that sum. While the other agents learn, their policies $\pi_j$ change over time, so $P_i^{\text{eff}}$ changes over time even though the true physics $P$ never does. The environment that agent $i$ is trying to model is a moving target: non-stationary by construction. This is the root cause of the difficulty in multi-agent reinforcement learning, because the standard convergence guarantees of single-agent learning all assume a stationary environment, and that assumption is precisely what is violated. We give this its full treatment, with the algorithms designed to cope with it, in Chapter 30; here we simply name it and, in Section 4, make it visible in code.
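The marginalization can be checked by hand on a toy problem. The sketch below assumes a made-up world with two states, one other agent $j$, and a deterministic physics rule (the next state is 1 exactly when the two actions match); the two policies `pi_j_early` and `pi_j_late` are hypothetical snapshots of $j$ at different points in its learning. None of these specifics come from the text; they only instantiate the sum.

```python
ACTIONS = (0, 1)
STATES = (0, 1)


def physics(s_next, s, a_i, a_j):
    """True P(s' | s, a_i, a_j): next state is 1 iff the two actions match.

    This rule is fixed for the whole script; it ignores s for simplicity.
    """
    target = 1 if a_i == a_j else 0
    return 1.0 if s_next == target else 0.0


def effective_transition(pi_j, s, a_i):
    """P_i^eff(s' | s, a_i) = sum over a_j of pi_j(a_j | s) * P(s' | s, a_i, a_j)."""
    return {
        s_next: sum(pi_j(a_j, s) * physics(s_next, s, a_i, a_j) for a_j in ACTIONS)
        for s_next in STATES
    }


# Two hypothetical snapshots of agent j's policy, before and after it learns.
def pi_j_early(a_j, s):
    return 0.9 if a_j == 1 else 0.1


def pi_j_late(a_j, s):
    return 0.2 if a_j == 1 else 0.8


early = effective_transition(pi_j_early, s=0, a_i=1)
late = effective_transition(pi_j_late, s=0, a_i=1)
print("early:", early)   # {0: 0.1, 1: 0.9}
print("late: ", late)    # {0: 0.8, 1: 0.2}
```

The function `physics` is identical in both calls, yet agent $i$'s transition model for the same state and action has changed, purely because $\pi_j$ did.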
The second addition is partial observability of the other agents specifically. Even a world whose physical state is fully visible hides the parts that matter most in a multi-agent setting: what each other agent knows, believes, intends, and is about to do. Agent $i$ can see where the other robots are, but not their goals or their plans. This forces agents to reason about each other's hidden state, a problem of distributed belief that connects directly to the distributed-AI machinery of Chapter 27, where multiple agents maintain and reconcile beliefs about a world none of them sees completely. Modeling other agents' beliefs and intentions is what separates a coordinating multi-agent system from a collection of mutually oblivious learners.
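One simple way to reason about a hidden goal is a Bayesian belief that is updated from the other agent's visible moves. The sketch below is an illustration under stated assumptions, not the machinery of Chapter 27: the other robot is assumed to have one of two goals on a line, and to step toward its goal with probability `p_toward` (a made-up noise level). The function name `update_belief` is hypothetical.

```python
def update_belief(belief, move, p_toward=0.9):
    """One Bayes step over the other robot's hidden goal.

    belief: dict mapping goal name -> probability.
    move:   observed step, +1 (right) or -1 (left).
    Assumption: the robot steps toward its goal with probability p_toward.
    """
    likelihood = {
        "left_goal": p_toward if move == -1 else 1 - p_toward,
        "right_goal": p_toward if move == +1 else 1 - p_toward,
    }
    unnormalized = {goal: belief[goal] * likelihood[goal] for goal in belief}
    total = sum(unnormalized.values())
    return {goal: value / total for goal, value in unnormalized.items()}


# The robot's position is fully visible; its goal is not. Start undecided.
belief = {"left_goal": 0.5, "right_goal": 0.5}
for observed_move in (+1, +1, -1, +1):
    belief = update_belief(belief, observed_move)

print({goal: round(p, 3) for goal, p in belief.items()})
# {'left_goal': 0.012, 'right_goal': 0.988}
```

Three steps right and one step left leave the observer about 99% sure of a goal it never saw directly, which is the sense in which agents must infer each other's hidden state from behavior.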
3. Cooperative, Competitive, and Mixed-Motive Worlds (Beginner)
One more environment property reorganizes everything: the alignment of the agents' goals. In a fully cooperative environment all agents share a single reward, so improving one agent's outcome improves everyone's, and the challenge is coordination, getting agents to act as a coherent team without a central controller. In a fully competitive (zero-sum) environment one agent's gain is exactly another's loss, and the challenge is strategy, anticipating and countering an adversary. Most realistic worlds are mixed-motive: agents have partly aligned and partly conflicting interests, like drivers who all want to reach their destinations (aligned) but compete for the same lane (conflicting). Mixed-motive worlds are the hardest and the most common, and they are where coordination, negotiation, and trust all become necessary at once.
This cooperative-competitive-mixed classification is exactly the game-type distinction developed formally in Chapter 28, now read as a property of the environment rather than of an abstract game. The connection is deliberate: the equilibrium concepts and solution methods from game theory are the tools for reasoning about what rational agents will do in each kind of world, and the type of the environment tells you which of those tools you need. A cooperative world invites team-reward methods; a competitive world invites minimax and best-response reasoning; a mixed-motive world demands the full apparatus of incentives, communication, and reputation that occupies the rest of this chapter.
Take one plain gridworld with two agents and a pile of apples. Make the apples a shared team score and you have a cooperation problem: the agents must avoid harvesting each other's targets and getting in each other's way. Make the apples a fixed pile they split, and you have a competition problem: every apple one agent grabs is one the other cannot. Leave the apples shared but let them regrow only if harvested sustainably, and you have a mixed-motive social dilemma where short-term greed starves everyone later. Same tiles, same physics, three entirely different sciences, decided purely by how the reward is wired.
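The apple example can be sketched in a few lines. Everything here is an illustrative assumption: `wire_rewards` turns one harvest into rewards under two wirings (the competitive case is written as a score margin so that it is exactly zero-sum), and `harvest` is a toy commons in which whatever is left on the tiles doubles each step up to a cap. The regrowth rule, the three-step horizon, and the take sizes are invented for the demonstration.

```python
def wire_rewards(apples_a, apples_b, wiring):
    """Turn the same harvest into rewards under two different wirings."""
    if wiring == "cooperative":          # shared team score
        team = apples_a + apples_b
        return team, team
    if wiring == "competitive":          # zero-sum: margin over the rival
        return apples_a - apples_b, apples_b - apples_a
    raise ValueError(f"unknown wiring: {wiring}")


def harvest(take_a, take_b, steps=3, stock=10, capacity=10):
    """Mixed-motive commons: each agent keeps its own apples.

    Assumed regrowth rule: the apples left after both agents harvest
    double before the next step, capped at `capacity`.
    """
    total_a = total_b = 0
    for _ in range(steps):
        got_a = min(take_a, stock)               # A harvests first
        got_b = min(take_b, stock - got_a)       # B takes from what remains
        total_a += got_a
        total_b += got_b
        stock = min(capacity, 2 * (stock - got_a - got_b))
    return total_a, total_b


print(wire_rewards(3, 1, "cooperative"))   # (4, 4)
print(wire_rewards(3, 1, "competitive"))   # (2, -2)
print(harvest(2, 2))                       # (6, 6)   both restrained
print(harvest(5, 2))                       # (10, 3)  A defects
print(harvest(5, 5))                       # (5, 5)   both greedy
```

In the commons, grabbing more is individually tempting (10 beats 6) but mutual greed leaves both agents worse off than mutual restraint (5 versus 6), which is the social-dilemma structure the paragraph describes.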
Who: A robotics team deploying a fleet of autonomous picking robots in a fulfilment warehouse.
Situation: Each robot ran its own reinforcement-learning policy to choose routes, trained in a simulator where the other robots were treated as ordinary moving obstacles.
Problem: In the live warehouse the robots kept deadlocking at aisle intersections, far more often than the single-robot training success rate had predicted.
Dilemma: Retrain each robot harder against the fixed obstacle model, which was cheap but kept the flawed stationary assumption, or rebuild the simulator so that each robot's environment contained the other robots' actual learning policies, which was costly but honest about the non-stationarity of Section 2.
Decision: They rebuilt the simulator to be genuinely multi-agent, modeling each robot's world as containing the others' current policies rather than passive obstacles.
How: They moved to a standardized multi-agent environment API (the PettingZoo loop of Section 5), co-trained the robots so each one's experience reflected the others adapting, and evaluated on held-out co-players the way Melting Pot does.
Result: Intersection deadlocks fell sharply, and the policies kept working when a new robot model with a different policy was added to the floor, because they had been shaped against a moving, not a frozen, set of neighbors.
Lesson: Treating other learning agents as fixed obstacles is the stationary-world fallacy in disguise; the environment must include the other agents as the adapting entities they are, or the deployed system meets a world its training never showed it.
4. Seeing Non-Stationarity in Code (Intermediate)
The claim that the environment is non-stationary from one agent's perspective sounds abstract until you watch it happen. The demo below builds a minimal two-agent gridworld on a one-dimensional line. Agent A is rewarded when it lands on the same cell as agent B (a simple coordination goal), so the reward A reaps for a given move depends entirely on where B goes. We hold agent A completely fixed (same start, same single action of stepping right, every episode) and change only agent B's policy, from "mostly drift right" to "mostly drift left." From A's fixed viewpoint, the reward distribution for that one identical state-action pair flips, even though A never changed and the grid's physics never changed. That flip is non-stationarity, measured directly.
```python
import numpy as np
from collections import Counter

# A 1-D gridworld of L cells. Two agents sit on cells; each step they move.
# Agent A's "effective environment" = the reward/next-state statistics it sees
# for a FIXED state-action pair, which are entirely shaped by agent B's policy.
L = 9
rng = np.random.default_rng(0)


def clamp(p):
    return min(max(p, 0), L - 1)


# Agent A's FIXED action under study: from the centre, always step RIGHT.
# Agent A is rewarded when it lands on the SAME cell as B (a coordination goal).
A_START, B_START, A_ACTION = 4, 4, +1


# Two DIFFERENT stochastic policies for agent B. A never sees which is active;
# it only experiences the resulting outcomes for its one fixed action.
def b_mostly_right(_rng):
    return +1 if _rng.random() < 0.85 else -1


def b_mostly_left(_rng):
    return -1 if _rng.random() < 0.85 else +1


def measure_outcomes(policy_b, episodes=20000):
    """From A's view, take its ONE fixed action from the fixed start and record
    whether it coincided with B (reward 1) or not (reward 0). Only B differs."""
    rewards = Counter()
    for _ in range(episodes):
        pa = clamp(A_START + A_ACTION)        # A's fixed move, every episode
        pb = clamp(B_START + policy_b(rng))   # B moves under its own policy
        rewards[int(pa == pb)] += 1           # 1 if A coordinated with B
    total = sum(rewards.values())
    return {r: c / total for r, c in sorted(rewards.items())}


dist_when_b_right = measure_outcomes(b_mostly_right)
dist_when_b_left = measure_outcomes(b_mostly_left)

print("Agent A is held FIXED: same start, same action (step right) every episode.")
print("Reward A observes for that ONE state-action pair (1 = coordinated with B):")
print(" B mostly-RIGHT :", {r: round(p, 3) for r, p in dist_when_b_right.items()})
print(" B mostly-LEFT :", {r: round(p, 3) for r, p in dist_when_b_left.items()})

exp_right = sum(r * p for r, p in dist_when_b_right.items())
exp_left = sum(r * p for r, p in dist_when_b_left.items())
print(f"\nExpected reward A sees for its fixed action: {exp_right:.3f} (B right) vs {exp_left:.3f} (B left)")

# How far apart are the two outcome distributions A experiences? (total variation)
keys = set(dist_when_b_right) | set(dist_when_b_left)
tv = 0.5 * sum(abs(dist_when_b_right.get(k, 0) - dist_when_b_left.get(k, 0)) for k in keys)
print(f"Total-variation distance between the two worlds A faces: {tv:.3f}")
print("Nonzero distance => A's environment is NON-STATIONARY: same A, different dynamics.")
```
```
Agent A is held FIXED: same start, same action (step right) every episode.
Reward A observes for that ONE state-action pair (1 = coordinated with B):
 B mostly-RIGHT : {0: 0.147, 1: 0.853}
 B mostly-LEFT : {0: 0.854, 1: 0.146}

Expected reward A sees for its fixed action: 0.853 (B right) vs 0.146 (B left)
Total-variation distance between the two worlds A faces: 0.706
Nonzero distance => A's environment is NON-STATIONARY: same A, different dynamics.
```
The result is stark: A's single fixed action looks good (expected reward 0.853) under one neighbor and bad (0.146) under another, and A did nothing differently in between. The grid obeyed the same rules. Only B's policy changed, and from A's perspective that was indistinguishable from the laws of the world being rewritten underneath it. Now imagine B is not switching between two fixed policies but continuously learning, nudging its policy a little every episode. Then A's environment drifts continuously, and any learning rule A uses that assumes a fixed environment is fitting a value that keeps moving. That is the precise mechanism behind the convergence troubles of independent multi-agent learners, unpacked with its remedies in Chapter 30.
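The continuous-drift case can be sketched without the grid at all. The snippet below assumes, purely for illustration, that B's probability of stepping right slides linearly from 0.85 to 0.15 over 5,000 episodes (a stand-in for a learning neighbor, not a real learning rule), while A keeps a plain sample-average estimate of the value of its one fixed action. The names `P_START`, `P_END`, and `estimate` are hypothetical.

```python
import random

rng = random.Random(0)
EPISODES = 5000
P_START, P_END = 0.85, 0.15   # assumed linear drift of B's "step right" probability

estimate = 0.0                # A's sample-average value of its fixed action
for t in range(1, EPISODES + 1):
    # B's policy moves a little every episode; A's action never changes.
    p_right = P_START + (P_END - P_START) * (t - 1) / (EPISODES - 1)
    reward = 1 if rng.random() < p_right else 0   # 1 when A and B coincide
    estimate += (reward - estimate) / t           # incremental sample average

true_value_now = P_END        # expected reward under B's current policy
print(f"A's learned estimate : {estimate:.3f}")
print(f"True value right now : {true_value_now:.3f}")
print(f"Gap                  : {estimate - true_value_now:.3f}")
```

Because the sample average weights every past episode equally, A's estimate settles near the midpoint of the drift (about 0.5) rather than near the value its action has under B's current policy (0.15). A has fitted an average over worlds that no longer exist.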
5. Environments You Can Actually Run: Simulators and Benchmarks (Intermediate)
Studying multi-agent systems requires environments you can run, reset, and share, and a small ecosystem of standardized simulators has grown up for exactly that. Multi-agent gridworlds and particle worlds (the classic example is the multi-agent particle environment, where point agents push around landmarks under cooperative or competitive rewards) give cheap, fast, fully controllable testbeds where you can dial each dimension from Table 29.3.1 on and off. They are the fruit-fly organisms of the field: simple enough to reason about, rich enough to exhibit non-stationarity, coordination, and competition.
Standardization matters here for the same reason it matters anywhere in empirical science: without a common environment, two papers reporting "our method coordinates better" are not comparable, because they ran on different worlds with different reward wiring and different observation spaces. A shared, versioned environment API turns a vague claim into a reproducible measurement. This is why the community converged on common interfaces and curated suites, and why a multi-agent result today is expected to report which environment, which version, and which configuration it used, exactly the reproducibility discipline that Chapter 28 applies to game-theoretic claims.
Code 29.3.1 hand-rolled a two-agent loop with bespoke state and step logic, around forty lines before it does anything interesting. PettingZoo is the multi-agent counterpart to the single-agent Gym/Gymnasium API: it gives every environment, from simple gridworlds to the multi-agent particle worlds, the same standardized interface, so the same training code runs across dozens of worlds unchanged. Its two APIs, parallel (all agents act at once) and AEC (agents act in turn), cover the cooperative, competitive, and mixed-motive settings from Section 3:
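```python
# pip install "pettingzoo[mpe]"   (the mpe extra pulls in the particle-world dependencies)
# Note: recent PettingZoo releases move the MPE environments to the separate
# mpe2 package; adjust the import below if your installed version warns about it.
from pettingzoo.mpe import simple_spread_v3   # cooperative particle world

env = simple_spread_v3.parallel_env(N=3)      # 3 agents, shared team reward
observations, infos = env.reset(seed=0)

while env.agents:                             # loop until all agents are done
    actions = {a: env.action_space(a).sample() for a in env.agents}   # random policy
    observations, rewards, terms, truncs, infos = env.step(actions)   # all act at once

env.close()
```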
The frontier in multi-agent environments has moved from "can agents solve this one task" to "can agents generalize to other agents they have never trained with." DeepMind's Melting Pot suite is the flagship of this shift: it scores agents on held-out social scenarios, including cooperation, competition, and mixed-motive social dilemmas, with co-players the agent never saw during training, directly probing the non-stationarity of Section 2 by making the other agents genuinely novel at test time. Around it, the field has converged on maintained benchmark families (the StarCraft Multi-Agent Challenge and its successor SMACv2, cooperative and mixed PettingZoo suites, and large open-ended social environments) and on standardized libraries such as BenchMARL that run many algorithms across many of these worlds under one harness, so 2024 to 2026 multi-agent papers increasingly report cross-environment, cross-co-player generalization rather than single-world scores. The throughline is that a reproducible multi-agent claim now requires a named, versioned environment and an explicit account of which other agents populated it.
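The idea behind held-out co-player evaluation can be shown on a toy problem. The sketch below is not Melting Pot's API or scoring procedure; it assumes a two-action coordination game in which the focal agent earns 1 when its action matches its partner's, and each partner is summarized by a single made-up number, its probability of playing action 0. The focal agent simply best-responds to the partners it trained with.

```python
def expected_score(focal_action, partners):
    """Mean probability that the focal action matches a partner's action.

    Each partner is a number: P(partner plays action 0). Reward is 1 on a match.
    """
    matches = [p if focal_action == 0 else 1 - p for p in partners]
    return sum(matches) / len(matches)


# Hypothetical populations: training partners favor action 0, held-out ones do not.
train_partners = [0.9, 0.8, 0.7]
held_out_partners = [0.3, 0.2]

# "Training": pick the action that scores best against the training partners.
focal_action = max((0, 1), key=lambda a: expected_score(a, train_partners))

train_score = expected_score(focal_action, train_partners)
held_out_score = expected_score(focal_action, held_out_partners)
print(f"focal action      : {focal_action}")
print(f"score in training : {train_score:.2f}")      # 0.80
print(f"score on held-out : {held_out_score:.2f}")   # 0.25
```

A single-population score would report 0.80 and stop there; the held-out score exposes that the learned behavior was a response to particular neighbors, not a general coordination skill.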
6. Why the Environment Decides the Method (Intermediate)
We close by making the section's organizing claim explicit: the environment's properties, not preference, dictate which coordination and learning methods are feasible. If the world is fully observable and cooperative with a shared reward, agents can in principle be trained as one joint policy and the hard part is scaling that joint training, an approach the next chapter calls centralized training. If the world is partially observable, each agent must act on its own local observation at execution time, which rules out any method that needs the global state online and pushes you toward learning local policies that were shaped with global information only during training. If the world is non-stationary because every agent learns, naive independent learning loses its convergence guarantee, and you reach for methods that explicitly stabilize the others, model them, or coordinate their updates.
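The three conditionals in this paragraph can be written down as a small lookup. This is only a restatement of the prose as code, under the assumption that three booleans are enough to describe the world; the function name `shortlist` and its wording are illustrative, and it names method families rather than specific algorithms.

```python
def shortlist(fully_observable, shared_reward, others_learning):
    """Map environment properties to the method families the text points to."""
    notes = []
    if fully_observable and shared_reward:
        notes.append(
            "a single joint policy is feasible in principle; "
            "the hard part is scaling the joint training"
        )
    if not fully_observable:
        notes.append(
            "each agent acts on local observations at execution time; "
            "global information may be used only during training"
        )
    if others_learning:
        notes.append(
            "naive independent learning loses its convergence guarantee; "
            "stabilize, model, or coordinate the other learners"
        )
    return notes


# Example: a partially observable, cooperative world where every agent learns.
for note in shortlist(fully_observable=False, shared_reward=True, others_learning=True):
    print("-", note)
```

For that example the function prints the second and third notes: local execution with training-time global information, plus the warning about independent learners.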
This is the practical payoff of the taxonomy. Before choosing a coordination protocol (the subject of the coming sections on communication, negotiation, and task allocation) or a learning algorithm (the subject of Chapter 30), you read off the environment: its observability, its determinism, its dynamics, its reward alignment, and above all whether the other agents are fixed or learning. Those answers cut the space of viable methods down to a shortlist before you commit to anything. An agent that ignores the kind of world it lives in will pick a method that cannot work; an agent that classifies its world first picks a method that can. The remaining sections of this chapter equip agents to act inside these worlds, starting with how they exchange information at all, in Section 29.4.
This chapter sits on the sixth axis of distribution from Chapter 1: distribute the intelligence itself. The non-stationarity diagnosed here is the multi-agent face of a problem the whole book circles, what happens when no single node sees or controls the whole system. In data-parallel training, the partial views are reconciled by an exact collective (all-reduce); here there is no collective that makes the agents' views agree for free, because the other nodes are not cooperating workers computing one gradient, they are independent deciders with their own objectives. The distributed-RL infrastructure of Chapter 20 returns in the next chapter as the substrate on which these many learners run, and the coordination this chapter builds is the substitute for the all-reduce that a world of independent minds does not get to assume.
For each system, place it on all five axes of Table 29.3.1 and state whether it is cooperative, competitive, or mixed-motive: (a) a fleet of warehouse robots that share a single throughput score and can see the whole warehouse map; (b) two trading bots bidding against each other in a continuous-price market where each sees only its own order book; (c) a ride-hailing platform's drivers, who each want their own fares but collectively shape surge pricing. For each, name the single environment property you expect to cause the most trouble for a learning agent, and why.
Modify Code 29.3.1 so that agent B, instead of switching between two fixed policies, learns: have B move toward the cell A least often occupies, updating a simple count every episode. Hold A fixed as before. Measure the total-variation distance between A's outcome distribution in the first 500 episodes and the last 500 episodes. Plot or print how the distance grows as B keeps learning, and explain in two sentences why this is a more faithful model of real multi-agent non-stationarity than the abrupt right-to-left switch in the original code.
Consider an agent that models its world as stationary and fits a transition model $\hat{P}(s' \mid s, a)$ from experience, ignoring that the data came from episodes where the other agents' policies were themselves changing. Argue, using the effective-transition equation from Section 2, why the fitted $\hat{P}$ is a time-average of many different effective environments and therefore matches none of them. Then explain what this implies for the validity of any plan the agent computes from $\hat{P}$, and connect your answer to why methods in Chapter 30 either model the other agents explicitly or coordinate the learners' updates rather than treating the world as fixed.