"I finally learned the perfect response to my opponent. By the time I had, my opponent had become a different opponent, and I was perfectly responding to a ghost."
A Learner Chasing a Moving Target
Non-stationarity is the central, recurring difficulty of multi-agent reinforcement learning: every agent learns against a moving target, because the other agents are updating their policies at the same time, so the transition and reward distributions each agent experiences keep changing underneath it. Single-agent reinforcement learning rests on a stationary environment, and almost every convergence guarantee in the field is built on that assumption. The moment several agents learn together, the assumption fails, the guarantees vanish, replay buffers fill with data that was generated by policies no one runs anymore, and learning can cycle or destabilize instead of settling. This section gathers that difficulty into one place, states it precisely, and shows how the methods built earlier in this chapter, centralized critics, value decomposition, self-play snapshots, and opponent modeling, each attack a different facet of the same problem. The deep point is that this is the multi-agent face of the same coordination-under-change problem the book has met before, as staleness in distributed optimization and as concept drift in serving.
Earlier sections of this chapter introduced the building blocks of multi-agent reinforcement learning: Markov games as the formalism, independent learners as the simplest method, centralized training with decentralized execution as the dominant paradigm, value decomposition and multi-agent policy gradients as concrete algorithms, and credit assignment as the question of who deserves which reward. In the previous section we untangled credit. Now we confront the difficulty that has been lurking behind all of them, the reason independent learning is fragile and the reason every later method is shaped the way it is. Each agent's environment is not fixed; it is partly made of other agents who are themselves still learning. That single fact is non-stationarity, and once you see it clearly, the architecture of the whole field reads as a sequence of answers to it.
1. The Moving Target, Stated Precisely (Intermediate)
Single-agent reinforcement learning models the world as a Markov decision process: from state $s$, taking action $a$, the environment hands back a reward and a next state drawn from fixed distributions $R(s, a)$ and $P(s' \mid s, a)$. The word fixed is doing enormous work. Because those distributions do not change, the agent is solving one well-posed problem, and methods such as tabular $Q$-learning are guaranteed to converge to the optimal value function under mild conditions (every state-action pair is visited infinitely often and the step sizes decay appropriately). The agent can collect experience now, store it, and reuse it later, because experience gathered an hour ago describes the same environment it faces now.
In a Markov game with $n$ agents, agent $i$ does not see the joint action; it sees only its own action $a_i$ and the consequences. From its private vantage point, the effective transition and reward it experiences are obtained by marginalizing over what everyone else does, and everyone else acts according to their current policies $\boldsymbol{\pi}_{-i}$:
$$P_i^{\text{eff}}(s' \mid s, a_i) = \sum_{a_{-i}} \Big( \prod_{j \neq i} \pi_j(a_j \mid s) \Big)\, P(s' \mid s, a_i, a_{-i}), \qquad R_i^{\text{eff}}(s, a_i) = \mathbb{E}_{a_{-i} \sim \boldsymbol{\pi}_{-i}} \big[ R_i(s, a_i, a_{-i}) \big].$$
Read the subscript on $\boldsymbol{\pi}_{-i}$ as a clock. While agent $i$ learns, the other agents are revising their policies, so $\boldsymbol{\pi}_{-i}$ is really $\boldsymbol{\pi}_{-i}^{(t)}$, and therefore $P_i^{\text{eff}}$ and $R_i^{\text{eff}}$ carry a hidden time index too. The problem agent $i$ is trying to solve is being rewritten on every step by the learning of its peers. This is the moving target: from agent $i$'s perspective the environment is non-stationary not because of any external drift, but because the environment literally contains other learners.
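The marginalization above is short enough to compute directly. The sketch below assumes a two-agent game, so the product over $j \neq i$ reduces to a single opponent policy; the array shapes, the random dynamics, and the two opponent policies `pi_early` and `pi_late` are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2  # hypothetical sizes: 3 states, 2 actions per agent

# Joint dynamics P[s, a_i, a_opp, s'] and joint reward R[s, a_i, a_opp].
# Random values, used only to make the marginalization concrete.
P = rng.random((S, A, A, S))
P /= P.sum(axis=-1, keepdims=True)   # each (s, a_i, a_opp) row is a distribution
R = rng.standard_normal((S, A, A))

def effective_model(P, R, pi_opp):
    """Marginalize the opponent out of the joint model.

    pi_opp[s, a_opp] is the opponent's current policy.
    Returns P_eff[s, a_i, s'] and R_eff[s, a_i].
    """
    P_eff = np.einsum("sb,sabt->sat", pi_opp, P)
    R_eff = np.einsum("sb,sab->sa", pi_opp, R)
    return P_eff, R_eff

# The same game seen at two moments of the opponent's learning.
pi_early = np.tile([0.9, 0.1], (S, 1))
pi_late = np.tile([0.2, 0.8], (S, 1))
P_early, R_early = effective_model(P, R, pi_early)
P_late, R_late = effective_model(P, R, pi_late)

print("largest change in P_eff:", np.abs(P_late - P_early).max())
print("largest change in R_eff:", np.abs(R_late - R_early).max())
```

Nothing in `P` or `R` changed between the two calls; only the opponent's policy did, and that alone is enough to hand agent $i$ a different effective environment.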
The whole point of multi-agent learning is that agents adapt to one another: a defender that learns to counter an attacker, a team that learns to pass the ball, a market maker that learns the order flow. But co-adaptation is exactly what makes each agent's environment non-stationary, which is exactly what dissolves the convergence guarantees that single-agent methods rely on. The agents must co-adapt to do anything interesting, and co-adaptation is what makes learning unstable. Every technique in this section is a way to keep the benefit of mutual adaptation while taming the instability it creates.
Three concrete failures follow directly from the moving target, and naming them sharpens the diagnosis. First, convergence guarantees disappear: a contraction argument that worked for a fixed Bellman operator no longer applies when the operator itself changes between updates, so an agent's value estimates can oscillate indefinitely. Second, replay buffers go stale in a way single-agent practitioners never worry about. A transition $(s, a_i, r, s')$ stored when the opponents played one way is, strictly, off-policy for the wrong reason: it describes the environment induced by $\boldsymbol{\pi}_{-i}^{(t_0)}$, the opponents' policies at storage time $t_0$, which has since been replaced, so training on it pulls the agent toward responding to opponents that no longer exist. Third, learning can cycle. In a competitive game, agent A adapts to beat B, B adapts to beat the new A, A adapts to beat the new B, and the pair can rotate forever around the same set of strategies without either improving, the multi-agent analogue of a limit cycle.
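The second failure, the stale replay buffer, can be made concrete with a few lines. This is a minimal sketch under assumed numbers: a stateless two-action game, an opponent that switches policy at a hypothetical step `SWITCH`, and two buffer sizes chosen only for illustration.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Learner payoff R[a_learner, a_opp]; we track the reward of learner action 0.
R = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

SWITCH, TOTAL = 10_000, 10_500       # illustrative: opponent changes at SWITCH
big = deque(maxlen=100_000)          # large buffer keeps everything
small = deque(maxlen=200)            # small buffer keeps only recent rewards

for t in range(TOTAL):
    # Opponent plays action 1 with probability 0.1 early, 0.9 after the switch.
    p1 = 0.1 if t < SWITCH else 0.9
    opp = int(rng.random() < p1)
    r = R[0, opp]
    big.append(r)
    small.append(r)

current_eff = 0.1 * R[0, 0] + 0.9 * R[0, 1]   # effective reward under the CURRENT opponent
big_est = float(np.mean(big))
small_est = float(np.mean(small))

print(f"current effective reward of action 0: {current_eff:+.2f}")
print(f"estimate from large buffer:           {big_est:+.2f}")
print(f"estimate from small buffer:           {small_est:+.2f}")
```

The large buffer is dominated by rewards earned against the old opponent, so its average has the wrong sign for the opponent the learner faces now; the small buffer is noisier but describes the current game.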
2. How This Chapter's Methods Mitigate It (Intermediate)
Read backward, the chapter is a catalogue of defenses against the moving target, and it helps to line them up against the precise failures from Section 1. The fragility of independent learning, introduced in Section 30.4, is now fully explained: an independent learner treats the other agents as part of a fixed environment, which is exactly the assumption non-stationarity violates, so its $Q$-values chase a target that the other learners keep moving. Every method that follows is, in effect, a way to make some part of the agent's world stationary again.
Centralized training with decentralized execution, the paradigm of Section 30.5, attacks the problem at its root. During training, a centralized critic is allowed to see the joint state and the other agents' actions, so it conditions on $a_{-i}$ rather than marginalizing over a shifting $\boldsymbol{\pi}_{-i}$. Conditioned on what the others actually did, the transitions and rewards the critic learns from are stationary even while their policies drift, because the part that was moving has been moved into the inputs. Centralized critics in the policy-gradient methods of Section 30.7, MADDPG and MAPPO, are the concrete realization of this idea: the actor stays decentralized for execution, but the critic that trains it sees enough of the joint picture to keep its target still.
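The conditioning argument can be written out in two lines. This is a sketch under the Markov-game setup of Section 1; the discount factor $\gamma$ and the centralized action-value function $Q_i$ are notation assumed here for illustration.

$$P\big(s' \mid s, a_i, a_{-i}, \boldsymbol{\pi}_{-i}^{(t)}\big) = P(s' \mid s, a_i, a_{-i}) \quad \text{for every } t, \qquad \text{whereas } P_i^{\text{eff}}(s' \mid s, a_i) \text{ depends on } t \text{ through } \boldsymbol{\pi}_{-i}^{(t)}.$$

$$y = R_i(s, a_i, a_{-i}) + \gamma\, Q_i(s', a_i', a_{-i}'), \qquad a_{-i}' \sim \boldsymbol{\pi}_{-i}^{(t)}(\cdot \mid s').$$

Once the joint action is given, the others' policies drop out of the one-step dynamics and reward. The bootstrapped target $y$ still reaches the current policies through the next actions $a_{-i}'$, so conditioning removes the largest source of movement rather than all of it.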
Value decomposition, from Section 30.6, factors the joint problem so that a single learned joint value is consistent with per-agent components; by training the team value end to end rather than letting each agent fit its own slice against a moving backdrop, VDN and QMIX absorb a large part of the cross-agent movement into a shared structure. Self-play with frozen opponent snapshots, and the policy pools that generalize it into a league, slow the moving target by construction: if the opponent you train against is a fixed snapshot rather than a live learner, your environment is stationary for the duration of that phase, and the league schedules which frozen opponents you face so that progress accumulates instead of cycling. Opponent modeling takes the complementary route of anticipation: instead of pretending the opponents are fixed, the agent predicts how they will act or how they are changing, and conditions its own policy on that prediction, turning an unobserved moving quantity into a modeled, and therefore trackable, one.
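The anticipation route is the easiest of these to sketch in a few lines. The example below assumes a stateless two-action game and uses an exponentially weighted frequency estimate as the opponent model; that estimator, the forgetting rate `BETA`, and the drift schedule are simple stand-ins chosen for illustration, and practical opponent models are usually richer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Learner payoff R[a_learner, a_opp]: the learner wants to match the opponent.
R = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

BETA = 0.05                        # hypothetical forgetting rate of the model
belief = np.array([0.5, 0.5])      # predicted opponent action distribution
choices = []

for t in range(2000):
    # The opponent's policy drifts: mostly action 0 early, mostly action 1 late.
    p1 = 0.1 if t < 1000 else 0.9
    opp = int(rng.random() < p1)

    # Act on the prediction BEFORE observing the opponent's move.
    choices.append(int(np.argmax(R @ belief)))

    # Update the model toward the observed action, forgetting old behavior.
    belief = (1 - BETA) * belief + BETA * np.eye(2)[opp]

print("choice just before the drift:", choices[999])
print("choice at the end:           ", choices[-1])
print("final belief:                ", np.round(belief, 2))
```

Because the model forgets old observations at rate `BETA`, the learner's best response follows the opponent's drift instead of averaging over behavior that no longer occurs; a smaller `BETA` gives a smoother but slower-tracking prediction.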
Non-stationarity is not a quirk of games; it is the multi-agent face of a tension this book keeps meeting. In distributed optimization, workers compute gradients against parameters that other workers are simultaneously updating, so a gradient applied late is stale: it was correct for a model state that no longer exists (Section 10.6). In online serving, a deployed model scores a data distribution that drifts away from the one it was trained on, so yesterday's correct predictor is today's slightly wrong one (concept drift, Section 9.9). Stale gradients, drifting distributions, and moving opponents are three names for one structural fact: when many components adapt at once, each one is optimizing against a world the others are busy changing. The remedies rhyme too: bound the staleness, snapshot the target, or condition on the change rather than ignore it.
3. Watching the Target Move, Then Pinning It Down (Intermediate)
The cleanest way to feel non-stationarity is to make a learner's value estimate visibly unstable, then quiet it with the two fixes above and measure the difference. The demonstration below uses a tiny repeated two-action game where the opponent's choice flips which of the learner's actions is good. In the first setting the opponent is a live learner that best-responds, committing to whichever action punishes the learner's current preference; this is the moving target, and the learner's temporal-difference error never settles. In the second, the opponent is a frozen self-play snapshot, one fixed policy for the whole run, so the learner faces a stationary environment. In the third, a centralized critic observes the opponent's action and keeps one value estimate per opponent action, so each target it tracks is stationary. One caveat about the toy: in this third setting the opponent's rule reads the conditioned estimates and, after the first update, stays on a single action, so the run shows conditioned targets holding still rather than surviving sustained flips. We report the variance of the learner's temporal-difference error over the second half of training, which a stationary target drives down toward the irreducible observation noise.
import numpy as np
# A repeated 2-action game. The learner runs tabular TD on the value of its
# action 0; the reward depends on what the OPPONENT does, so as the opponent
# adapts the learner's target moves: the classic MARL non-stationarity.
rng = np.random.default_rng(7)
# Reward to the learner for (learner_action, opponent_action). The opponent's
# choice flips which learner action is good (a coordination tension).
R = np.array([[ 1.0, -1.0],    # learner action 0
              [-1.0,  1.0]])   # learner action 1
STEPS, ALPHA, NOISE = 6000, 0.1, 0.2   # steps, TD step size, reward noise floor
def run(mode):
    v0 = 0.0                   # marginal estimate V(a0): moving + frozen
    vc = {0: 0.0, 1: 0.0}      # opponent-conditioned V(a0 | opp): centralized
    td = np.empty(STEPS)
    for t in range(STEPS):
        if mode == "moving":
            # Live opponent best-responds: it commits to the action that PUNISHES
            # the learner's current preference. As v0 crosses zero the opponent
            # flips, the reward flips, and the single target v0 never settles.
            opp = 1 if v0 > 0 else 0
            r = R[0, opp] + NOISE * rng.standard_normal()
            err = r - v0; v0 += ALPHA * err
        elif mode == "frozen":
            # Self-play snapshot: opponent FROZEN at one action all run, so the
            # environment is stationary and V(a0) settles.
            r = R[0, 1] + NOISE * rng.standard_normal()
            err = r - v0; v0 += ALPHA * err
        elif mode == "centralized":
            # Centralized critic: the critic observes the opponent action and
            # keeps one estimate per opponent action, so each target is stationary.
            # Note: this opponent rule reads vc, and after the first update it
            # stays on action 1, so the opponent does not keep flipping here.
            opp = 0 if vc[1] > vc[0] else 1   # opponent rule driven by vc
            r = R[0, opp] + NOISE * rng.standard_normal()
            err = r - vc[opp]; vc[opp] += ALPHA * err
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        td[t] = err
    return td
tail = slice(STEPS // 2, STEPS)   # measure post warm-up
var = {m: np.var(run(m)[tail]) for m in ("moving", "frozen", "centralized")}
print(f"{'moving target (naive)':<32}{var['moving']:.4e}")
print(f"{'frozen opponent (self-play)':<32}{var['frozen']:.4e}")
print(f"{'centralized conditioned critic':<32}{var['centralized']:.4e}")
print(f"reduction frozen vs moving : {var['moving']/var['frozen']:.1f}x")
print(f"reduction centralized vs moving : {var['moving']/var['centralized']:.1f}x")
print(f"observation-noise floor (NOISE^2): {NOISE**2:.4e}")
moving opponent is a live best-responder (the moving target); frozen is a fixed self-play snapshot; centralized conditions the value on the observed opponent action. Variance of the temporal-difference error over the run's second half measures how unsettled the learner's value is.
moving target (naive)           1.1518e+00
frozen opponent (self-play)     4.3983e-02
centralized conditioned critic  4.1268e-02
reduction frozen vs moving : 26.2x
reduction centralized vs moving : 27.9x
observation-noise floor (NOISE^2): 4.0000e-02
The numbers make the abstract argument tangible. A factor of nearly thirty in variance is the difference between a value estimate that wanders and one that has converged to the noise floor, and it is bought by the two structural moves the chapter has been building toward: snapshot the opponent so the target stops moving, or condition the critic on what the opponent does so the moving part is no longer hidden inside the target. Neither move changes the game; both change what the learner treats as stationary, which is precisely the lever non-stationarity offers.
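One way to check the conditioning fix against an opponent that keeps flipping is to let a separately tracked marginal estimate drive the opponent, as in the moving setting, while the critic keeps one estimate per observed opponent action. The sketch below is a variant of the demonstration, not the code that produced the figures above; `run_flipping`, its `conditioned` flag, and the flip counter are names introduced here for illustration.

```python
import numpy as np

R = np.array([[1.0, -1.0],
              [-1.0, 1.0]])            # learner payoff R[a_learner, a_opp]
STEPS, ALPHA, NOISE = 6000, 0.1, 0.2   # same hyperparameters as the demonstration

def run_flipping(conditioned, seed=0):
    """TD errors against an opponent that punishes the marginal preference.

    The opponent's rule always reads the marginal estimate v0, so its action
    sequence is identical whether or not the critic conditions on it.
    """
    rng = np.random.default_rng(seed)
    v0 = 0.0                  # marginal estimate; drives the opponent
    vc = np.zeros(2)          # one estimate per observed opponent action
    td = np.empty(STEPS)
    flips, prev = 0, None
    for t in range(STEPS):
        opp = 1 if v0 > 0 else 0
        if prev is not None and opp != prev:
            flips += 1
        prev = opp
        r = R[0, opp] + NOISE * rng.standard_normal()
        marginal_err = r - v0
        v0 += ALPHA * marginal_err
        if conditioned:
            err = r - vc[opp]          # target depends only on the observed action
            vc[opp] += ALPHA * err
        else:
            err = marginal_err         # target moves whenever the opponent flips
        td[t] = err
    return td, flips

tail = slice(STEPS // 2, STEPS)
td_naive, flips_naive = run_flipping(conditioned=False)
td_cond, flips_cond = run_flipping(conditioned=True)
print(f"opponent flips:            {flips_naive}")
print(f"naive TD-error variance:   {np.var(td_naive[tail]):.4e}")
print(f"conditioned TD variance:   {np.var(td_cond[tail]):.4e}")
```

Because both runs share a seed and the opponent's rule never looks at `vc`, the two learners face exactly the same sequence of opponent actions; only what the critic conditions on differs.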
In Code 30.9.1 we froze the opponent by hand to make its policy a constant. Production self-play libraries do this with a policy snapshot that you take, store, and later sample from. In RLlib, a multi-agent environment becomes stationary for a training phase when the opponent's agent id is mapped to a frozen policy copy:
# RLlib self-play: freeze a snapshot of the current policy as the opponent.
import ray.rllib   # pip install "ray[rllib]"
def policy_mapping_fn(agent_id, *args, **kwargs):
    # "learner" trains; "opponent" is served by a frozen past snapshot,
    # so the learner sees a STATIONARY environment for this phase.
    return "learner" if agent_id == 0 else "opponent_frozen_v3"
# The mapping alone does not freeze anything: the snapshot stays fixed only if
# its policy id is left out of training, e.g. policies_to_train=["learner"]
# in the algorithm config's multi_agent(...) settings.
# When the learner improves enough, snapshot it into the pool and advance the
# league; the opponent the learner faces only changes at controlled checkpoints,
# never mid-update. This is the moving target, slowed to a walk.
Who: A reinforcement-learning team at an ad exchange training two competing auto-bidding agents in a shared simulated auction.
Situation: Both agents trained continuously against each other, each adjusting its bid policy from the rewards it observed in the live simulator.
Problem: Returns plateaued, then began oscillating: every few thousand episodes one agent would surge, the other would adapt, and the surge would reverse, with neither policy actually improving against a fixed benchmark.
Dilemma: Slow the learning rates to damp the oscillation, which made convergence glacial, or keep both agents fully live and accept a system that cycled without progress.
Decision: They diagnosed the oscillation as non-stationarity, the limit-cycle failure from Section 1, and restructured training around frozen snapshots rather than chasing it with smaller step sizes.
How: One agent trained against a pool of frozen snapshots of the other, with snapshots advanced only when the live agent beat the whole pool, mirroring Code 30.9.2; a centralized critic that observed both bids was added so each value target was conditioned rather than marginalized.
Result: The oscillation flattened into monotone improvement against a held-out benchmark, and the team could finally read a learning curve that meant something, because the target had stopped moving during each training phase.
Lesson: When multi-agent returns oscillate without improving, the cause is usually the moving target, not the learning rate. Snapshot the opponent and condition the critic before you reach for a smaller step size.
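The gating rule in this case study, advance the pool only when the live agent beats every frozen snapshot, fits in a small class. This is a library-independent sketch: `SnapshotPool`, `maybe_advance`, the `win_threshold` of 0.55, and the dictionary standing in for policy parameters are all hypothetical names and values, not part of RLlib or of the team's actual system.

```python
import copy
import random

class SnapshotPool:
    """Hypothetical pool of frozen opponent policies (illustrative only)."""

    def __init__(self, initial_policy, win_threshold=0.55):
        self.frozen = [copy.deepcopy(initial_policy)]
        self.win_threshold = win_threshold   # assumed bar for "beating" a snapshot

    def sample(self, rng):
        """Pick the frozen opponent for the next training phase."""
        return rng.choice(self.frozen)

    def maybe_advance(self, live_policy, win_rates):
        """Freeze a copy of the live policy only if it beats every snapshot.

        win_rates[k] is the live agent's measured win rate against self.frozen[k].
        """
        if len(win_rates) != len(self.frozen):
            raise ValueError("need one win rate per frozen snapshot")
        if min(win_rates) >= self.win_threshold:
            self.frozen.append(copy.deepcopy(live_policy))
            return True
        return False

rng = random.Random(0)
live = {"bid_scale": 1.0}                 # stand-in for real policy parameters
pool = SnapshotPool(live)

live["bid_scale"] = 1.3                   # the live agent keeps learning
held = pool.maybe_advance(live, win_rates=[0.48])      # below the bar: no change
advanced = pool.maybe_advance(live, win_rates=[0.61])  # beats the pool: snapshot

live["bid_scale"] = 2.0                   # later updates do not touch snapshots
opponent = pool.sample(rng)
print(held, advanced, [p["bid_scale"] for p in pool.frozen])
```

The deep copy is the design choice that matters: the opponent a learner faces changes only when `maybe_advance` returns `True`, never as a side effect of the live policy's own updates.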
4. The Frontier and the Fundamental Tension (Advanced)
None of the remedies abolish non-stationarity; they relocate it. A frozen snapshot is stationary only until you advance the checkpoint, at which point the target jumps. A centralized critic is stationary in the joint view but must still be trained, and at execution time each agent acts on its own observations, where the others' policies are once again unobserved and moving. This is the fundamental tension restated: agents must co-adapt to be useful, and co-adaptation is what makes learning unstable, so every method is a negotiation about how much movement to allow and where to absorb it. There is no setting of the knobs that makes the tension disappear, only settings that make it manageable for a given game.
The non-stationarity problem remains a live research front. League-based self-play, descended from the AlphaStar architecture, has been refined into population-based schemes that deliberately maintain diverse frozen opponents so a learner cannot overfit to a single moving adversary; recent open frameworks generalize this into automatic curricula over opponent pools. On the analysis side, work on the convergence of multi-agent learning continues to map when independent and decentralized learners provably reach equilibrium and when they cycle, sharpening the failure taxonomy of Section 1. A third thread brings sequence models to opponent modeling: transformer-based agents that condition on a history of others' behavior to predict the next joint move, turning the unobserved moving quantity into an explicitly forecast one, and there is growing interest in zero-shot coordination, where agents must cooperate with partners they never trained against, the hardest version of facing a target whose movement you cannot observe. The unifying message of the 2024 to 2026 literature is that non-stationarity is not a bug to be patched but a property to be engineered around, by snapshotting, conditioning, or forecasting the change.
The purest demonstration of a multi-agent limit cycle is two learners in rock-paper-scissors. One, observing that it loses too often to rock, shifts toward paper; the other, observing that shift, moves toward scissors; and the pair chase each other around the triangle of strategies without end, never settling, never improving, perfectly rational and perfectly stuck. The unique equilibrium is to play all three uniformly at random, which no greedy best-responder ever quite reaches on its own. It is the smallest possible reminder that in multi-agent learning, doing the locally smart thing forever can be exactly the trap.
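The cycle is easy to reproduce with the crudest possible learners. The sketch below assumes each player simply plays the pure best response to the other's previous action, a deliberately blunt stand-in for the gradual shifts described above; the function names and the starting pair are choices made here for illustration.

```python
from collections import Counter

ROCK, PAPER, SCISSORS = 0, 1, 2

def beats(action):
    """Return the action that beats `action` (paper > rock, scissors > paper, rock > scissors)."""
    return (action + 1) % 3

def best_response_dynamics(a0, b0, steps):
    """Both players simultaneously best-respond to the other's last action."""
    traj = [(a0, b0)]
    for _ in range(steps):
        a, b = traj[-1]
        traj.append((beats(b), beats(a)))
    return traj

traj = best_response_dynamics(ROCK, PAPER, steps=12)

# Length of the cycle: first time the joint action returns to where it started.
period = next(k for k in range(1, len(traj)) if traj[k] == traj[0])
counts_a = Counter(a for a, _ in traj[:period])
counts_b = Counter(b for _, b in traj[:period])

print("joint actions:", traj[:period + 1])
print("cycle length: ", period)
print("player A counts over one cycle:", dict(counts_a))
print("player B counts over one cycle:", dict(counts_b))
```

From this start the joint action returns to its initial value after six steps, and over one cycle each player uses every action exactly twice. The time-averaged play is uniform, matching the equilibrium, yet at no single step does either player actually randomize.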
With non-stationarity named and its remedies in hand, the remaining question of this chapter is operational: how do we actually run multi-agent reinforcement learning at scale, across many actors and learners, when each training step already carries the coordination burden this section described. That is where the distributed-systems machinery of the book returns, the actor-learner architectures of Chapter 20 reappearing in a multi-agent setting, and it is the subject of Section 30.10.
An independent $Q$-learner in a two-agent game stores transitions in a replay buffer and samples uniformly from it during training, exactly as a single-agent DQN would. Using the effective-reward expression $R_i^{\text{eff}}(s, a_i)$ from Section 1, explain precisely why a transition stored ten thousand steps ago can be actively misleading rather than merely uninformative, and contrast this with single-agent DQN, where old transitions are stale only in the harmless sense of being off-policy. What does this imply about the right size of a replay buffer in MARL, and why might a smaller buffer outperform a larger one here?
Extend Code 30.9.1 with a fourth setting in which the opponent best-responds as in moving, but the learner uses a centralized critic that conditions on the opponent's action with a delay of $k$ steps (it sees the opponent's action from $k$ steps ago, not the current one). Sweep $k$ from $0$ to $50$ and plot the tail variance of the temporal-difference error against $k$. At $k = 0$ you should recover the stationary, low-variance centralized result; as $k$ grows the conditioning becomes increasingly wrong. Report the value of $k$ at which the centralized critic loses its advantage over the naive moving-target learner, and explain what this says about the freshness requirement on the information a centralized critic conditions on.
A self-play training run freezes the opponent for $B$ episodes between checkpoint advances. Argue qualitatively how the choice of $B$ trades off two errors: a small $B$ keeps the opponent close to the live policy (low staleness) but allows the target to move often (more non-stationarity), while a large $B$ holds the environment stationary for long stretches (stable learning) but trains the agent against an increasingly outdated opponent (high staleness). Relate this trade-off explicitly to the staleness bound of bounded-asynchronous distributed optimization in Section 10.6, and propose a measurable criterion, computable during training, for deciding when to advance the checkpoint rather than fixing $B$ in advance.