Section 29.3: Multi-Agent Environments

"I finally learned the optimal response to my neighbor. Took me ten thousand episodes. By then she had learned a new trick, and I was wrong again."
An Agent Stuck Chasing a Moving Target

Big Picture

An agent's environment is everything it does not control, and in a multi-agent system that includes the other agents, who are themselves learning and changing. A single agent faces a world it can model as fixed rules plus noise: if it acts the same way twice it can expect, on average, the same consequences. The moment a second adapting agent shares the world, that assumption breaks. Each agent has become part of the others' environment, and because every agent is changing its behavior over time, the world that each one sees is non-stationary: the same action in the same observed state can lead to different outcomes simply because the others have moved on. This non-stationarity is the single property that makes multi-agent learning hard, and it is the root cause behind most of the machinery in the next two chapters. This section names the environment dimensions that classify any agent problem, isolates the two that multi-agent settings add, and demonstrates non-stationarity in a few lines of Python that you can run yourself.

The previous section gave us agents and their internal architectures: how a single agent perceives, decides, and acts. This section turns the camera around to look at the world those agents inhabit. The reason the world deserves its own section is that the properties of the environment, not the cleverness of any one agent, dictate which coordination and learning methods are even feasible. A method that converges beautifully in a static, fully observable, single-agent world can diverge or oscillate forever once the world becomes partially observable and populated by other learners. Before you choose an algorithm, you classify the environment, and the vocabulary for that classification is the subject here.

We borrow the classical taxonomy of environment dimensions from the agent literature, then extend it with the two additions that matter most when many agents share a world. The classical dimensions tell you how hard a single agent's problem is. The multi-agent additions tell you why, even when each agent's individual problem looks easy, the joint problem can be brutal. Keeping these two layers distinct is the conceptual move that the rest of the chapter, and all of Chapter 30, rests on.

1. The Classical Environment Dimensions Beginner

Before agents had to worry about each other, the agent literature already had a compact checklist for describing any environment, popularized in the Russell and Norvig framing. Five axes do most of the work, and each one independently makes an agent's life harder when it tips toward the difficult end. Table 29.3.1 lists them. The value of the list is operational: read off where your problem sits on each axis, and you have narrowed the space of methods that can possibly apply before writing a line of code.

Figure 29.3.1: The five classical environment dimensions, each an axis from an easier end to a harder end. The orange markers place a representative multi-agent gridworld, which lands toward the hard end of every axis. The last axis, single versus multi-agent, is the one this chapter is built around; it does not merely add difficulty, it changes the kind of difficulty, as Section 2 explains.

Table 29.3.1: The classical environment dimensions and the question each one answers. The rightmost column names the consequence for method choice.

Dimension	The question it answers	Consequence when it tips toward "hard"
Fully vs partially observable	Can the agent see the entire state at each step?	The agent must maintain a belief over hidden state, not just react to an observation.
Deterministic vs stochastic	Does an action have one outcome or a distribution of outcomes?	Planning becomes optimization over expectations, not lookup of a single result.
Static vs dynamic	Can the world change while the agent deliberates?	Decisions have deadlines; a slow optimal answer can be worse than a fast adequate one.
Discrete vs continuous	Are states and actions countable or real-valued?	Tabular methods give way to function approximation and continuous control.
Single vs multi-agent	Is the agent alone, or sharing the world with other deciders?	The environment now contains adaptive entities; see Section 2.

The first four axes are familiar from single-agent reinforcement learning and planning, and we treat them as background here. A self-driving car faces a partially observable (occluded pedestrians), stochastic (uncertain tire grip), dynamic (traffic moves while you plan), continuous (steering and throttle are real-valued) world, and that combination already justifies a large body of method. The fifth axis is different in kind, not just degree, and it is where this chapter lives. Crossing from single to multi-agent does something the other four axes never do: it makes the environment contain other minds.
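Reading a problem off the five axes can be made mechanical. The sketch below is one hypothetical way to record that classification; the `EnvProfile` class and its field names are illustrative assumptions, not part of any library, and the Sudoku entry is an added contrast case rather than an example from this chapter.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EnvProfile:
    """Hypothetical record of where a problem sits on the five classical axes.

    Each field is True when the problem sits at the HARD end of that axis.
    """
    partially_observable: bool
    stochastic: bool
    dynamic: bool
    continuous: bool
    multi_agent: bool

    def hard_axes(self):
        """Return the names of the axes on which this environment is hard."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


# The self-driving car as described in the text, viewed as a lone agent:
# hard on the first four axes.
car = EnvProfile(partially_observable=True, stochastic=True,
                 dynamic=True, continuous=True, multi_agent=False)

# A Sudoku puzzle (added contrast case): easy on every axis.
sudoku = EnvProfile(partially_observable=False, stochastic=False,
                    dynamic=False, continuous=False, multi_agent=False)

print("car   :", car.hard_axes())
print("sudoku:", sudoku.hard_axes())
```

The record does nothing clever; its value is that it forces every axis to be answered explicitly before a method is chosen.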

Key Insight: The Fifth Axis Folds the Other Four Back on Themselves

Single versus multi-agent is not just one more difficulty knob. Adding adapting agents to the world re-injects the other four difficulties through the back door. Other agents are a source of stochasticity (you cannot predict their action exactly), of partial observability (you cannot see their internal state or intentions), and of dynamism (they move while you deliberate). A world that was deterministic, fully observable, and static for a lone agent can become effectively stochastic, partially observable, and dynamic the instant a second learner joins it, even though the underlying physics never changed. That is why the fifth axis earns its own chapter.

2. What Multi-Agent Adds: Non-Stationarity and Hidden Minds Intermediate

From the vantage point of any single agent, the rest of the world, including all the other agents, is summarized by a transition model: given the current state and the agent's own action, what is the distribution over the next state and reward? In a single-agent world that model is fixed by the environment's physics. In a multi-agent world the next state depends not only on agent $i$'s action $a_i$ but on the joint action of everyone, $(a_1, \dots, a_n)$. Agent $i$ chooses only $a_i$; the others' actions, written $a_{-i}$, are drawn from their own policies $\pi_{-i}$. Assuming each of the others acts independently given the state $s$, the effective transition model that agent $i$ experiences, after marginalizing out $a_{-i}$, is

$$P_i^{\text{eff}}(s' \mid s, a_i) = \sum_{a_{-i}} \Big[ \prod_{j \ne i} \pi_j(a_j \mid s) \Big]\, P(s' \mid s, a_i, a_{-i}).$$

The crucial term is the policies $\pi_j$ baked into that sum. While the other agents learn, their policies $\pi_j$ change over time, so $P_i^{\text{eff}}$ changes over time even though the true physics $P$ never does. The environment that agent $i$ is trying to model is a moving target: non-stationary by construction. This is the root cause of the difficulty in multi-agent reinforcement learning, because the standard convergence guarantees of single-agent learning all assume a stationary environment, and that assumption is precisely what is violated. We give this its full treatment, with the algorithms designed to cope with it, in Chapter 30; here we simply name it and, in Section 4, make it visible in code.
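The marginalization can be checked numerically. The sketch below assumes a hypothetical two-agent, two-state, two-action world whose physics is deterministic ($s' = a_i \oplus a_j$); the array layout, the toy physics, and the two policies are illustrative choices, not taken from the text. With only one other agent, the product over $j \ne i$ reduces to a single factor $\pi_j(a_j \mid s)$.

```python
import numpy as np

S, A = 2, 2  # number of states, number of actions per agent

# Fixed physics P[s, a_i, a_j, s']: the next state is a_i XOR a_j.
# This array never changes anywhere in the script.
P = np.zeros((S, A, A, S))
for s in range(S):
    for a_i in range(A):
        for a_j in range(A):
            P[s, a_i, a_j, a_i ^ a_j] = 1.0


def effective_transition(P, pi_j):
    """P_eff[s, a_i, s'] = sum over a_j of pi_j[s, a_j] * P[s, a_i, a_j, s']."""
    return np.einsum("sj,sijt->sit", pi_j, P)


# Two snapshots of the OTHER agent's policy pi_j[s, a_j] (hypothetical values).
pi_early = np.array([[0.9, 0.1], [0.9, 0.1]])
pi_late = np.array([[0.2, 0.8], [0.2, 0.8]])

P_eff_early = effective_transition(P, pi_early)
P_eff_late = effective_transition(P, pi_late)

# Same state (0), same own action (0), same physics, different dynamics.
print(P_eff_early[0, 0])  # [0.9 0.1]
print(P_eff_late[0, 0])   # [0.2 0.8]
```

The physics array `P` is deterministic and untouched between the two calls, yet agent $i$ sees a stochastic transition whose probabilities track the other agent's policy.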

The second addition is partial observability of the other agents specifically. Even a world whose physical state is fully visible hides the parts that matter most in a multi-agent setting: what each other agent knows, believes, intends, and is about to do. Agent $i$ can see where the other robots are, but not their goals or their plans. This forces agents to reason about each other's hidden state, a problem of distributed belief that connects directly to the distributed-AI machinery of Chapter 27, where multiple agents maintain and reconcile beliefs about a world none of them sees completely. Modeling other agents' beliefs and intentions is what separates a coordinating multi-agent system from a collection of mutually oblivious learners.

Figure 29.3.2: Agent $i$'s environment is everything outside its own control, and in a multi-agent system that boundary encloses both the fixed physical world and the other agents. The orange path is the source of non-stationarity: because the other agents keep changing their policies, the effective dynamics agent $i$ observes keep changing too, even though the physics never moves. The dashed link marks the parts of the other agents that stay hidden, the partial observability of goals and intentions discussed in Section 2.

3. Cooperative, Competitive, and Mixed-Motive Worlds Beginner

One more environment property reorganizes everything: the alignment of the agents' goals. In a fully cooperative environment all agents share a single reward, so improving one agent's outcome improves everyone's, and the challenge is coordination, getting agents to act as a coherent team without a central controller. In a fully competitive (zero-sum) environment one agent's gain is exactly another's loss, and the challenge is strategy, anticipating and countering an adversary. Most realistic worlds are mixed-motive: agents have partly aligned and partly conflicting interests, like drivers who all want to reach their destinations (aligned) but compete for the same lane (conflicting). Mixed-motive worlds are the hardest and the most common, and they are where coordination, negotiation, and trust all become necessary at once.

This cooperative-competitive-mixed classification is exactly the game-type distinction developed formally in Chapter 28, now read as a property of the environment rather than of an abstract game. The connection is deliberate: the equilibrium concepts and solution methods from game theory are the tools for reasoning about what rational agents will do in each kind of world, and the type of the environment tells you which of those tools you need. A cooperative world invites team-reward methods; a competitive world invites minimax and best-response reasoning; a mixed-motive world demands the full apparatus of incentives, communication, and reputation that occupies the rest of this chapter.
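For a two-agent world small enough to write down as a pair of payoff matrices, the three-way classification can be read directly off the rewards. The function below is a minimal sketch under that assumption; `classify_alignment` is a hypothetical helper, and the three example games are standard textbook ones chosen here for illustration.

```python
import numpy as np


def classify_alignment(r1, r2, tol=1e-9):
    """Classify a two-agent reward structure from its payoff matrices.

    r1[x, y] and r2[x, y] are the rewards to agents 1 and 2 when they play
    actions x and y. Identical rewards mean a shared team reward; a constant
    sum means one agent's gain is exactly the other's loss.
    """
    r1, r2 = np.asarray(r1, dtype=float), np.asarray(r2, dtype=float)
    if np.allclose(r1, r2, atol=tol):
        return "cooperative"
    total = r1 + r2
    if np.allclose(total, total.flat[0], atol=tol):
        return "competitive"
    return "mixed-motive"


# Pure coordination: both agents score only when they pick the same action.
team = [[1, 0], [0, 1]]
print(classify_alignment(team, team))

# Matching pennies: whatever agent 1 wins, agent 2 loses.
pennies = np.array([[1, -1], [-1, 1]])
print(classify_alignment(pennies, -pennies))

# Prisoner's dilemma: interests partly aligned, partly conflicting.
pd_row = [[3, 0], [5, 1]]
pd_col = [[3, 5], [0, 1]]
print(classify_alignment(pd_row, pd_col))
```

Real environments rarely expose a full payoff matrix, so in practice the same question is answered by inspecting how the reward function is wired.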

Fun Note: The Same Grid, Three Different Nightmares

Take one plain gridworld with two agents and a pile of apples. Make the apples a shared team score and you have a cooperation problem: the agents must avoid harvesting each other's targets and getting in each other's way. Make the apples a fixed pile they split, and you have a competition problem: every apple one agent grabs is one the other cannot. Score each agent on its own harvest from a common orchard whose apples regrow only if harvested sustainably, and you have a mixed-motive social dilemma where short-term greed starves everyone later. Same tiles, same physics, three entirely different sciences, decided purely by how the reward is wired.

Practical Example: The Warehouse Robots That Learned to Fight

Who: A robotics team deploying a fleet of autonomous picking robots in a fulfilment warehouse.

Situation: Each robot ran its own reinforcement-learning policy to choose routes, trained in a simulator where the other robots were treated as ordinary moving obstacles.

Problem: In the live warehouse the robots kept deadlocking at aisle intersections, far more often than the single-robot training success rate had predicted.

Dilemma: Retrain each robot harder against the fixed obstacle model, which was cheap but kept the flawed stationary assumption, or rebuild the simulator so that each robot's environment contained the other robots' actual learning policies, which was costly but honest about the non-stationarity of Section 2.

Decision: They rebuilt the simulator to be genuinely multi-agent, modeling each robot's world as containing the others' current policies rather than passive obstacles.

How: They moved to a standardized multi-agent environment API (the PettingZoo loop of Section 5), co-trained the robots so each one's experience reflected the others adapting, and evaluated on held-out co-players the way Melting Pot does.

Result: Intersection deadlocks fell sharply, and the policies kept working when a new robot model with a different policy was added to the floor, because they had been shaped against a moving, not a frozen, set of neighbors.

Lesson: Treating other learning agents as fixed obstacles is the stationary-world fallacy in disguise; the environment must include the other agents as the adapting entities they are, or the deployed system meets a world its training never showed it.

4. Seeing Non-Stationarity in Code Intermediate

The claim that the environment is non-stationary from one agent's perspective sounds abstract until you watch it happen. The demo below builds a minimal two-agent gridworld on a one-dimensional line. Agent A is rewarded when it lands on the same cell as agent B (a simple coordination goal), so the reward A reaps for a given move depends entirely on where B goes. We hold agent A completely fixed (same start, same single action of stepping right, every episode) and change only agent B's policy, from "mostly drift right" to "mostly drift left." From A's fixed viewpoint, the reward distribution for that one identical state-action pair flips, even though A never changed and the grid's physics never changed. That flip is non-stationarity, measured directly.

import numpy as np
from collections import Counter

# A 1-D gridworld of L cells. Two agents sit on cells; each step they move.
# Agent A's "effective environment" = the reward/next-state statistics it sees
# for a FIXED state-action pair, which are entirely shaped by agent B's policy.
L = 9
rng = np.random.default_rng(0)

def clamp(p):
    return min(max(p, 0), L - 1)

# Agent A's FIXED action under study: from the centre, always step RIGHT.
# Agent A is rewarded when it lands on the SAME cell as B (a coordination goal).
A_START, B_START, A_ACTION = 4, 4, +1

# Two DIFFERENT stochastic policies for agent B. A never sees which is active;
# it only experiences the resulting outcomes for its one fixed action.
def b_mostly_right(_rng):  return +1 if _rng.random() < 0.85 else -1
def b_mostly_left(_rng):   return -1 if _rng.random() < 0.85 else +1

def measure_outcomes(policy_b, episodes=20000):
    """From A's view, take its ONE fixed action from the fixed start and record
    whether it coincided with B (reward 1) or not (reward 0). Only B differs."""
    rewards = Counter()
    for _ in range(episodes):
        pa = clamp(A_START + A_ACTION)          # A's fixed move, every episode
        pb = clamp(B_START + policy_b(rng))     # B moves under its own policy
        rewards[int(pa == pb)] += 1             # 1 if A coordinated with B
    total = sum(rewards.values())
    return {r: c / total for r, c in sorted(rewards.items())}

dist_when_b_right = measure_outcomes(b_mostly_right)
dist_when_b_left  = measure_outcomes(b_mostly_left)

print("Agent A is held FIXED: same start, same action (step right) every episode.")
print("Reward A observes for that ONE state-action pair (1 = coordinated with B):")
print("  B mostly-RIGHT :", {r: round(p, 3) for r, p in dist_when_b_right.items()})
print("  B mostly-LEFT  :", {r: round(p, 3) for r, p in dist_when_b_left.items()})

exp_right = sum(r * p for r, p in dist_when_b_right.items())
exp_left  = sum(r * p for r, p in dist_when_b_left.items())
print(f"\nExpected reward A sees for its fixed action: {exp_right:.3f} (B right) vs {exp_left:.3f} (B left)")

# How far apart are the two outcome distributions A experiences? (total variation)
keys = set(dist_when_b_right) | set(dist_when_b_left)
tv = 0.5 * sum(abs(dist_when_b_right.get(k, 0) - dist_when_b_left.get(k, 0)) for k in keys)
print(f"Total-variation distance between the two worlds A faces: {tv:.3f}")
print("Nonzero distance => A's environment is NON-STATIONARY: same A, different dynamics.")

Code 29.3.1: A minimal two-agent gridworld. Agent A's start and action are held identical across both runs; only agent B's policy is swapped. The total-variation distance between the reward distributions A experiences for that one fixed state-action pair quantifies how much A's effective environment shifted without A changing at all.

Agent A is held FIXED: same start, same action (step right) every episode.
Reward A observes for that ONE state-action pair (1 = coordinated with B):
  B mostly-RIGHT : {0: 0.147, 1: 0.853}
  B mostly-LEFT  : {0: 0.854, 1: 0.146}

Expected reward A sees for its fixed action: 0.853 (B right) vs 0.146 (B left)
Total-variation distance between the two worlds A faces: 0.706
Nonzero distance => A's environment is NON-STATIONARY: same A, different dynamics.

Output 29.3.1: The identical action that earns A a reward 85% of the time when B drifts right earns it only 15% of the time when B drifts left. The expected reward A attributes to its own fixed action swings from 0.853 to 0.146 (total-variation distance 0.706), yet A's behavior and the grid physics were identical in both. The other agent alone moved the world.

The result is stark: A's single fixed action looks good (expected reward 0.853) under one neighbor and bad (0.146) under another, and A did nothing differently in between. The grid obeyed the same rules. Only B's policy changed, and from A's perspective that was indistinguishable from the laws of the world being rewritten underneath it. Now imagine B is not switching between two fixed policies but continuously learning, nudging its policy a little every episode. Then A's environment drifts continuously, and any learning rule A uses that assumes a fixed environment is fitting a value that keeps moving. That is the precise mechanism behind the convergence troubles of independent multi-agent learners, unpacked with its remedies in Chapter 30.
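The "value that keeps moving" can be made concrete with a small sketch. It assumes a hypothetical drift schedule in which B's probability of stepping right slides linearly from 0.85 to 0.15 over the run (a stand-in for B learning, not a learning rule), and it gives A the simplest stationary-world estimator, a running sample average of the reward for its one fixed action.

```python
import numpy as np

rng = np.random.default_rng(0)
EPISODES = 20_000

# Hypothetical drift: B's chance of stepping right slides from 0.85 to 0.15.
p_right = np.linspace(0.85, 0.15, EPISODES)

q_avg = 0.0  # A's sample-average estimate of the value of "step right"
for t in range(EPISODES):
    # As in Code 29.3.1, A (stepping right from the shared start cell)
    # coincides with B exactly when B also steps right.
    reward = float(rng.random() < p_right[t])
    q_avg += (reward - q_avg) / (t + 1)  # incremental mean, assumes stationarity

true_now = p_right[-1]  # the value of A's action under B's CURRENT policy
print(f"value under B's current policy: {true_now:.2f}")
print(f"A's sample-average estimate   : {q_avg:.2f}")
```

The sample average settles near the midpoint of the drift (about 0.5), far from the 0.15 that describes B's current behavior: it summarizes a history of worlds rather than the one A is actually in.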

5. Environments You Can Actually Run: Simulators and Benchmarks Intermediate

Studying multi-agent systems requires environments you can run, reset, and share, and a small ecosystem of standardized simulators has grown up for exactly that. Multi-agent gridworlds and particle worlds (the classic example is the multi-agent particle environment, where point agents move among landmarks under cooperative or competitive rewards) give cheap, fast, fully controllable testbeds where you can dial each dimension from Table 29.3.1 on and off. They are the fruit-fly organisms of the field: simple enough to reason about, rich enough to exhibit non-stationarity, coordination, and competition.

Standardization matters here for the same reason it matters anywhere in empirical science: without a common environment, two papers reporting "our method coordinates better" are not comparable, because they ran on different worlds with different reward wiring and different observation spaces. A shared, versioned environment API turns a vague claim into a reproducible measurement. This is why the community converged on common interfaces and curated suites, and why a multi-agent result today is expected to report which environment, which version, and which configuration it used, exactly the reproducibility discipline that Chapter 28 applies to game-theoretic claims.

Library Shortcut: PettingZoo Standardizes the Multi-Agent Loop

Code 29.3.1 hand-rolled a two-agent loop with bespoke state and step logic, around forty lines before it does anything interesting. PettingZoo is the multi-agent counterpart to the single-agent Gym/Gymnasium API: it gives every environment, from simple gridworlds to the multi-agent particle worlds, the same standardized interface, so the same training code runs across dozens of worlds unchanged. Its two APIs, parallel (all agents act at once) and AEC (agents act in turn), cover the cooperative, competitive, and mixed-motive settings from Section 3:

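# pip install "pettingzoo[mpe]"   (the mpe extra installs the particle-world dependencies)
# Note: recent PettingZoo releases deprecate pettingzoo.mpe in favor of the separate mpe2 package.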
from pettingzoo.mpe import simple_spread_v3   # cooperative particle world

env = simple_spread_v3.parallel_env(N=3)      # 3 agents, shared team reward
observations, infos = env.reset(seed=0)
while env.agents:                             # loop until all agents are done
    actions = {a: env.action_space(a).sample() for a in env.agents}  # random policy
    observations, rewards, terms, truncs, infos = env.step(actions)  # all act at once
env.close()

Code 29.3.2: The same multi-agent stepping loop as Code 29.3.1, now through the PettingZoo standardized API. The roughly forty lines of bespoke gridworld collapse to one import and a six-line loop, and switching from a cooperative to a competitive world is a one-line change of environment, with the library handling agent bookkeeping, observation and action spaces, and turn order.
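To see what the parallel-style interface asks of an environment, the sketch below wraps the line world of Code 29.3.1 in the same reset/step shape. `LineWorldParallel` is a hypothetical class written for illustration: it imitates the dictionary-per-agent convention but does not subclass or import PettingZoo, and the shared reward and five-step horizon are assumptions made here.

```python
import random


class LineWorldParallel:
    """Hypothetical parallel-style wrapper around the 1-D world of Code 29.3.1."""

    def __init__(self, length=9, max_steps=5):
        self.length, self.max_steps = length, max_steps
        self.possible_agents = ["A", "B"]
        self.agents = []

    def _observe(self):
        # Every agent sees both positions (physical state fully visible).
        both = (self._pos["A"], self._pos["B"])
        return {a: both for a in self.possible_agents}

    def reset(self, seed=None):
        # seed is accepted for API shape only; this toy world is deterministic.
        self.agents = list(self.possible_agents)
        self._pos = {a: self.length // 2 for a in self.agents}
        self._t = 0
        return self._observe(), {a: {} for a in self.agents}

    def step(self, actions):
        for agent, move in actions.items():  # each move is -1 or +1
            new = self._pos[agent] + move
            self._pos[agent] = min(max(new, 0), self.length - 1)
        self._t += 1
        together = float(self._pos["A"] == self._pos["B"])
        done = self._t >= self.max_steps
        rewards = {a: together for a in self.possible_agents}  # shared reward
        terms = {a: False for a in self.possible_agents}
        truncs = {a: done for a in self.possible_agents}
        infos = {a: {} for a in self.possible_agents}
        if done:
            self.agents = []  # an empty agent list ends the caller's loop
        return self._observe(), rewards, terms, truncs, infos


policy_rng = random.Random(1)
env = LineWorldParallel()
observations, infos = env.reset(seed=0)
steps = 0
while env.agents:
    actions = {a: policy_rng.choice([-1, +1]) for a in env.agents}  # random policy
    observations, rewards, terms, truncs, infos = env.step(actions)
    steps += 1
print("episode length:", steps, "final observation:", observations["A"])
```

The driving loop at the bottom has the same shape as the one in Code 29.3.2, which is the point of a shared interface: training code written against it does not need to know which world sits behind it.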

Research Frontier: Standardized Suites for Social Intelligence (2024 to 2026)

The frontier in multi-agent environments has moved from "can agents solve this one task" to "can agents generalize to other agents they have never trained with." DeepMind's Melting Pot suite is the flagship of this shift: it scores agents on held-out social scenarios, including cooperation, competition, and mixed-motive social dilemmas, with co-players the agent never saw during training, directly probing the non-stationarity of Section 2 by making the other agents genuinely novel at test time. Around it, the field has converged on maintained benchmark families (the StarCraft Multi-Agent Challenge and its successor SMACv2, cooperative and mixed PettingZoo suites, and large open-ended social environments) and on standardized libraries such as BenchMARL that run many algorithms across many of these worlds under one harness, so 2024 to 2026 multi-agent papers increasingly report cross-environment, cross-co-player generalization rather than single-world scores. The throughline is that a reproducible multi-agent claim now requires a named, versioned environment and an explicit account of which other agents populated it.
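The idea of scoring against held-out co-players can be illustrated on the line world of Code 29.3.1. The sketch below is not Melting Pot's API or scoring protocol; it only assumes a hypothetical split of B policies into those seen when A's action was chosen and those reserved for testing, with the drift probabilities picked for illustration.

```python
import random


def make_b_policy(p_right):
    """Return a B policy that steps right with probability p_right."""
    def policy(rng):
        return +1 if rng.random() < p_right else -1
    return policy


def score(a_action, b_policy, episodes=5000, seed=0):
    """Fraction of episodes in which A's fixed move coincides with B's move.

    Both agents start on the same interior cell, so they end on the same
    cell exactly when they make the same move.
    """
    rng = random.Random(seed)
    hits = sum(a_action == b_policy(rng) for _ in range(episodes))
    return hits / episodes


# Hypothetical co-player populations, described by P(step right).
train_partners = [make_b_policy(p) for p in (0.8, 0.9)]
heldout_partners = [make_b_policy(p) for p in (0.5, 0.2)]

A_ACTION = +1  # the best fixed response to the training partners

train_score = sum(score(A_ACTION, b) for b in train_partners) / len(train_partners)
heldout_score = sum(score(A_ACTION, b) for b in heldout_partners) / len(heldout_partners)

print(f"mean score vs training co-players: {train_score:.2f}")
print(f"mean score vs held-out co-players: {heldout_score:.2f}")
```

A policy that looks strong against the partners it was tuned on can score far lower against partners it never met, which is the gap that held-out evaluation is designed to expose.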

6. Why the Environment Decides the Method Intermediate

We close by making the section's organizing claim explicit: the environment's properties, not preference, dictate which coordination and learning methods are feasible. If the world is fully observable and cooperative with a shared reward, agents can in principle be trained as one joint policy and the hard part is scaling that joint training, an approach the next chapter calls centralized training. If the world is partially observable, each agent must act on its own local observation at execution time, which rules out any method that needs the global state online and pushes you toward learning local policies that were shaped with global information only during training. If the world is non-stationary because every agent learns, naive independent learning loses its convergence guarantee, and you reach for methods that explicitly stabilize the others, model them, or coordinate their updates.
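The three conditionals in the paragraph above can be written down as a lookup, which makes the "classify first, choose second" habit explicit. The function below is a deliberately coarse sketch: `shortlist` and its argument names are hypothetical, and the returned strings only paraphrase this section rather than naming specific algorithms.

```python
def shortlist(fully_observable, shared_reward, others_learning):
    """Map three environment properties to the method families the text names."""
    notes = []
    if fully_observable and shared_reward:
        notes.append("one joint policy can be trained centrally in principle; "
                     "scaling the joint training is the hard part")
    if not fully_observable:
        notes.append("act on local observations at execution time; "
                     "use global information only during training")
    if others_learning:
        notes.append("naive independent learning loses its convergence guarantee; "
                     "stabilize, model, or coordinate the other learners")
    return notes


# Exercise-style usage: a shared-score robot fleet that sees the whole map
# and whose members all keep learning.
for line in shortlist(fully_observable=True, shared_reward=True, others_learning=True):
    print("-", line)
```

An empty result does not mean the problem is easy, only that none of the three conditions discussed here rules anything out.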

This is the practical payoff of the taxonomy. Before choosing a coordination protocol (the subject of the coming sections on communication, negotiation, and task allocation) or a learning algorithm (the subject of Chapter 30), you read off the environment: its observability, its determinism, its dynamics, its reward alignment, and above all whether the other agents are fixed or learning. Those answers cut the space of viable methods down to a shortlist before you commit to anything. An agent that ignores the kind of world it lives in will pick a method that cannot work; an agent that classifies its world first picks a method that can. The remaining sections of this chapter equip agents to act inside these worlds, starting with how they exchange information at all, in Section 29.4.

Thesis Thread: The Environment Is Distributed Because Intelligence Is

This chapter sits on the sixth axis of distribution from Chapter 1: distribute the intelligence itself. The non-stationarity diagnosed here is the multi-agent face of a problem the whole book circles, what happens when no single node sees or controls the whole system. In data-parallel training, the partial views are reconciled by an exact collective (all-reduce); here there is no collective that makes the agents' views agree for free, because the other nodes are not cooperating workers computing one gradient, they are independent deciders with their own objectives. The distributed-RL infrastructure of Chapter 20 returns in the next chapter as the substrate on which these many learners run, and the coordination this chapter builds is the substitute for the all-reduce that a world of independent minds does not get to assume.

Exercise 29.3.1: Classify Three Worlds Conceptual

For each system, place it on all five axes of Table 29.3.1 and state whether it is cooperative, competitive, or mixed-motive: (a) a fleet of warehouse robots that share a single throughput score and can see the whole warehouse map; (b) two trading bots bidding against each other in a continuous-price market where each sees only its own order book; (c) a ride-hailing platform's drivers, who each want their own fares but collectively shape surge pricing. For each, name the single environment property you expect to cause the most trouble for a learning agent, and why.

Exercise 29.3.2: Make the Drift Continuous Coding

Modify Code 29.3.1 so that agent B, instead of switching between two fixed policies, learns: have B move toward the cell A least often occupies, updating a simple count every episode. Hold A fixed as before. Measure the total-variation distance between A's outcome distribution in the first 500 episodes and the last 500 episodes. Plot or print how the distance grows as B keeps learning, and explain in two sentences why this is a more faithful model of real multi-agent non-stationarity than the abrupt right-to-left switch in the original code.

Exercise 29.3.3: The Cost of Ignoring the Others Analysis

Consider an agent that models its world as stationary and fits a transition model $\hat{P}(s' \mid s, a)$ from experience, ignoring that the data came from episodes where the other agents' policies were themselves changing. Argue, using the effective-transition equation from Section 2, why the fitted $\hat{P}$ is a time-average of many different effective environments and therefore matches none of them. Then explain what this implies for the validity of any plan the agent computes from $\hat{P}$, and connect your answer to why methods in Chapter 30 either model the other agents explicitly or coordinate the learners' updates rather than treating the world as fixed.