"I trained beautifully against the partner I imagined. Unfortunately the real one kept learning too, and my replay buffer is now a museum of policies nobody plays anymore."
An Independent Learner, Surprised Again
The simplest way to do multi-agent reinforcement learning is to refuse to do anything multi-agent at all: hand each agent its own single-agent learner, point it at its own rewards, and let it treat every other agent as part of the environment. This is independent learning, and its appeal is structural rather than algorithmic. Because no agent reads another's parameters, gradients, or experience, training is embarrassingly parallel; you can run $n$ agents as $n$ ordinary RL jobs with no collective communication during learning, scaling to large populations on commodity infrastructure. The catch is equally structural. From any one learner's point of view the world is no longer stationary, because the other learners are quietly rewriting the dynamics it is trying to fit, so the convergence guarantees of single-agent RL evaporate and a stored replay buffer slowly fills with transitions generated by opponents who no longer exist. This section shows why independent learning is simultaneously the strongest available baseline and a method built on a cracked foundation, and when that crack matters enough to pay for the coordination of Section 30.5.
The previous sections built the formal stage: a Markov game (Section 30.2) in which several agents act at once, and the cooperative, competitive, and mixed reward structures (Section 30.3) that decide whether those agents are allies, rivals, or both. The honest question now is how to actually learn a policy in that setting. Before reaching for any machinery designed for many agents, it is worth asking what happens if we ignore the multi-agent structure entirely and simply deploy the single-agent reinforcement-learning algorithms we already trust. The answer is more interesting than it sounds, and it sets the baseline that every more elaborate method in this chapter must beat.
Independent learning is exactly that move. Each agent runs a standard single-agent algorithm, Q-learning or PPO, against the stream of observations and rewards it happens to receive, and it never represents the existence of the others. The two canonical instances have names: Independent Q-Learning (IQL), where each agent maintains its own action-value function, and Independent PPO (IPPO), where each agent runs its own clipped policy-gradient update. The rest of this section is about why this naive idea scales so beautifully, why it is theoretically broken, and why, despite being broken, it keeps winning benchmark tables.
1. The Embarrassingly Parallel Ideal (Beginner)
The reason independent learning deserves the first word is that it is the cheapest thing to distribute in this entire chapter. Recall the data-parallel identity from Section 1.1: distributing training was attractive precisely because the combine step (an all-reduce of gradients) was exact and bounded. Independent learning goes one step further and removes the combine step altogether. There is no gradient to all-reduce across agents, no parameter server to consult, no shared replay to synchronize. Each agent's learner is a self-contained process that reads its own observations, computes its own loss, and updates its own parameters. The only thing the agents share is the environment they happen to inhabit together.
Concretely, agent $i$ keeps an action-value estimate $Q_i(s, a_i)$ over its own action set and applies the ordinary single-agent Q-learning update,
$$Q_i(s, a_i) \;\leftarrow\; Q_i(s, a_i) + \alpha \Big[\, r_i + \gamma \max_{a_i'} Q_i(s', a_i') - Q_i(s, a_i) \,\Big],$$
using only the local reward $r_i$ and its own next-state value. Nothing in this update references another agent's action, policy, or value function; the joint action of the others is folded silently into the transition from $s$ to $s'$ and into the reward $r_i$. That is the whole trick, and it is what makes the scheme trivially distributed: the update for agent $i$ has the same shape as a single-agent update, so any existing single-agent RL stack, including the distributed actor-learner infrastructure of Chapter 20, runs it unchanged, once per agent.
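The update above can be sketched in a few lines of tabular Python. This is a minimal illustration, not a reference implementation: the function name `iql_update`, the table sizes, and the example transition are all assumptions chosen for the sketch. The point to notice is the function signature, which contains nothing belonging to any other agent.

```python
import numpy as np

def iql_update(Q_i, s, a_i, r_i, s_next, alpha=0.1, gamma=0.99):
    """One independent Q-learning step for a single agent.

    Q_i is that agent's own table of shape (n_states, n_own_actions).
    No other agent's action, policy, or value appears in the arguments.
    """
    td_target = r_i + gamma * np.max(Q_i[s_next])
    Q_i[s, a_i] += alpha * (td_target - Q_i[s, a_i])
    return Q_i

# Three agents, each with a private table (sizes are illustrative).
n_agents, n_states, n_actions = 3, 4, 2
tables = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]

# Hypothetical transition: all agents saw s=0 -> s'=1, each with its own
# action and its own reward. The loop body is n unrelated updates, so it
# could just as well run as n separate processes.
actions = [0, 1, 1]
rewards = [1.0, 0.0, 2.0]
for Q_i, a_i, r_i in zip(tables, actions, rewards):
    iql_update(Q_i, s=0, a_i=a_i, r_i=r_i, s_next=1)
```

Because the iterations of the final loop share no data, no ordering or synchronization between agents is needed during learning.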
Figure 30.4.1 makes the structural bargain visible. The wiring is identical to a single-agent loop, which is exactly why it distributes for free: replicate the orange box $n$ times, give each its own slice of the world's feedback, and you have an $n$-agent learner with zero inter-agent communication during training. The blue boxes inside the dashed boundary are the price of that simplicity, and the next sections are about what they cost.
Data-parallel training (Chapter 15) keeps the learning problem stationary and pays for it with an all-reduce every step. Independent learning deletes the all-reduce, so it scales to many agents with no collective communication, but it pays a different and subtler tax: each learner now optimizes against a target that the other learners are actively moving. You do not get distribution for free; you choose which price to pay. Independent learning is the option that pushes the entire cost out of the network and into the statistics of learning.
2. Why the Foundation Is Cracked: Non-Stationarity (Intermediate)
Tabular single-agent Q-learning converges to the optimal value function (under the usual step-size and exploration conditions) because it is fitting a fixed object: the environment's transition and reward functions do not change while you learn. Independent learning quietly violates that assumption. From agent $i$'s perspective the effective transition dynamics are obtained by marginalizing out the other agents' actions under their current policies,
$$P_i^{\text{eff}}(s' \mid s, a_i) \;=\; \sum_{a_{-i}} \Big[\textstyle\prod_{j \neq i} \pi_j(a_j \mid s)\Big]\, P(s' \mid s, a_i, a_{-i}),$$
where $a_{-i}$ denotes the joint action of everyone except $i$. The true Markov game $P$ is fixed, but $\pi_j$ is being updated by every other learner, so $P_i^{\text{eff}}$ drifts over the course of training. Agent $i$ is therefore solving a non-stationary problem with a tool that assumes stationarity, and the comforting convergence theorem of single-agent Q-learning simply does not apply. We treat the full theory of this drift, and the methods built to tame it, in Section 30.9; here we only need the consequence: learning can oscillate, chase its own tail, or settle into a poor joint policy, and which of these happens can depend on something as innocent as the random initialization.
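To make the drift concrete, the marginalization can be computed directly for a toy case. The sketch below assumes a hypothetical two-agent, two-state game whose transition rule (land in state 1 only when the actions match) is invented purely for illustration; `effective_dynamics` is likewise a name introduced here. With a single partner, the product over $j \neq i$ has just one factor.

```python
import numpy as np

# Hypothetical FIXED joint kernel P[s, a_i, a_j, s'] of a 2-agent, 2-state game.
P = np.zeros((2, 2, 2, 2))
for s in range(2):
    for a_i in range(2):
        for a_j in range(2):
            # Assumed rule: reach state 1 only when the two actions match.
            s_next = 1 if a_i == a_j else 0
            P[s, a_i, a_j, s_next] = 1.0

def effective_dynamics(P, pi_j):
    """P_eff[s, a_i, s'] = sum over a_j of pi_j[s, a_j] * P[s, a_i, a_j, s']."""
    return np.einsum("sj,sijt->sit", pi_j, P)

pi_j_early = np.array([[0.9, 0.1], [0.9, 0.1]])  # partner mostly plays 0
pi_j_late = np.array([[0.1, 0.9], [0.1, 0.9]])   # partner has switched to 1

P_eff_early = effective_dynamics(P, pi_j_early)
P_eff_late = effective_dynamics(P, pi_j_late)

# Same state, same own action, same underlying game, different "environment".
print("early:", P_eff_early[0, 1])
print("late: ", P_eff_late[0, 1])
```

The kernel `P` never changes between the two calls; only the partner's policy does, yet agent $i$ sees its next-state distribution for the same state and action flip.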
The damage is worst for off-policy methods that reuse old experience. A replay buffer (the workhorse of deep Q-learning) stores transitions $(s, a_i, r_i, s')$ for later reuse, on the assumption that an old transition still describes the current environment. Under independent learning that assumption is false: a transition recorded ten thousand steps ago was generated when the other agents played a policy they have since abandoned, so its reward and next state reflect a world that no longer exists. The buffer becomes an inconsistent mixture of experience drawn from many vanished opponent policies, and replaying it teaches the agent to be good against ghosts. This is why on-policy methods such as IPPO, which discard experience after each update, tend to behave far better in practice than off-policy IQL with a long replay buffer.
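The staleness argument can be checked numerically with a minimal sketch. Everything here is an assumption made for illustration: a one-state setting in which agent $i$'s action pays $1$ only when the partner also plays action 1, a partner whose policy jumps abruptly at step 4000, and a buffer capacity of 5000. The comparison is between what a uniformly sampled buffer averages toward and what only fresh experience reports.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
buffer = deque(maxlen=5000)  # agent i's stored rewards for one fixed own action

def partner_action(step):
    # Assumed drift: partner plays 1 with prob 0.1 early, 0.9 after step 4000.
    p_one = 0.1 if step < 4000 else 0.9
    return int(rng.random() < p_one)

for step in range(5000):
    # Hypothetical reward: 1.0 only if the partner also plays action 1.
    reward = 1.0 if partner_action(step) == 1 else 0.0
    buffer.append(reward)

stored = list(buffer)
replay_estimate = float(np.mean(stored))        # uniform replay over the whole buffer
recent_estimate = float(np.mean(stored[-500:]))  # only the freshest experience
true_current_value = 0.9                         # value under the partner's CURRENT policy

print(f"replay buffer estimate: {replay_estimate:.2f}")
print(f"recent-only estimate:   {recent_estimate:.2f}")
print(f"true current value:     {true_current_value:.2f}")
```

Most of the buffer was written under the abandoned partner policy, so the replay average sits far below the action's current value, while the recent-only estimate tracks it closely.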
Picture two people learning to tango with their eyes closed, each treating the other purely as "the floor's behavior." Whenever one improves a step, the other's idea of how the floor moves silently changes, so it adjusts, which changes the floor again for the first. Sometimes they luck into a rhythm and lock onto it. Sometimes they oscillate forever, each endlessly correcting for the partner's last correction. Nobody is doing anything wrong by single-agent standards; the problem is that there is no fixed floor to learn.
3. Watching Non-Stationarity Decide the Outcome (Intermediate)
The cleanest way to feel the effect is the smallest possible cooperative game. Two agents each choose one of two actions and receive a shared reward: the joint action $(1,1)$ pays the most ($5$), the joint action $(0,0)$ is a safe-but-worse equilibrium ($3$), and any miscoordination pays nothing. This is a one-state coordination game (a Stag-Hunt-flavored payoff), and it is cooperative in the sense of Section 30.3: both agents want the same outcome. The code below runs two independent tabular Q-learners on it, with no model of each other, and asks a single question: starting from $200$ different random initializations, how often do the two learners actually converge to the optimal joint policy rather than the mediocre one?
import numpy as np

# A 2x2 cooperative coordination game (a "Stag Hunt"-like joint-reward payoff).
# Both agents pick action 0 or 1 and get a SHARED reward for the joint action.
# The best joint action is (1,1)=5; (0,0)=3 is a safe but worse equilibrium.
PAYOFF = np.array([[3.0, 0.0],   # this agent plays 0: partner 0 -> 3, partner 1 -> 0
                   [0.0, 5.0]])  # this agent plays 1: partner 0 -> 0, partner 1 -> 5

def run(seed, episodes=4000, alpha=0.1, eps=0.1):
    rng = np.random.default_rng(seed)
    Q = rng.uniform(0, 1, size=(2, 2))  # Q[agent, action]; random init per seed
    for _ in range(episodes):
        a = [int(np.argmax(Q[i])) if rng.random() > eps else rng.integers(2)
             for i in range(2)]          # epsilon-greedy, each agent independent
        r = PAYOFF[a[0], a[1]]           # shared reward from the joint action
        for i in range(2):               # independent Q-update, no partner model
            Q[i, a[i]] += alpha * (r - Q[i, a[i]])
    greedy = (int(np.argmax(Q[0])), int(np.argmax(Q[1])))
    return greedy, PAYOFF[greedy[0], greedy[1]]

outcomes = {}
for seed in range(200):
    joint, val = run(seed)
    outcomes[joint] = outcomes.get(joint, 0) + 1

print("joint policy reached (over 200 random initializations):")
for joint in sorted(outcomes, key=lambda k: -outcomes[k]):
    label = "optimal (1,1)=5" if joint == (1, 1) else (
        "safe (0,0)=3" if joint == (0, 0) else "miscoordinated =0")
    print(f" {joint} -> {outcomes[joint]:3d} runs [{label}]")
opt = outcomes.get((1, 1), 0)
print(f"reached the optimal joint policy: {opt}/200 = {opt/200:.0%} of seeds")
joint policy reached (over 200 random initializations):
(0, 0) -> 104 runs [safe (0,0)=3]
(1, 1) -> 96 runs [optimal (1,1)=5]
reached the optimal joint policy: 96/200 = 48% of seeds
The lesson of Output 30.4.1 is stark precisely because the game is so easy: a single state, two actions, fully shared reward, perfect information. A coordinating pair could trivially agree on $(1,1)$. Yet two independent learners reach it only $48\%$ of the time, with the outcome decided by initialization, because each agent is climbing its own value estimate while the partner shifts the ground underneath. When agent $1$ tentatively favors action $1$ but agent $0$ has not yet, the miscoordination reward of $0$ punishes action $1$, reinforcing the retreat to the safe $(0,0)$ basin. Whether the pair escapes that basin is a matter of which agent commits first, which is set by the initial Q-values. This is the non-stationarity penalty of Section 30.9 in miniature, and it is the entire argument for the centralized training of Section 30.5.
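The punishment mechanism in the paragraph above can be worked out as plain arithmetic. The sketch below reuses the payoff matrix of Code 30.4.1 and computes each action's expected shared reward as a function of how often the partner plays action 1; the helper name `expected_rewards` is introduced here for illustration. It assumes the partner is $\epsilon$-greedy on action 0 with $\epsilon = 0.1$, so it plays action 1 only through exploration, with probability $\epsilon/2 = 0.05$.

```python
import numpy as np

PAYOFF = np.array([[3.0, 0.0],
                   [0.0, 5.0]])

def expected_rewards(p_partner_plays_1):
    """Expected shared reward of own action 0 and own action 1."""
    partner_mix = np.array([1.0 - p_partner_plays_1, p_partner_plays_1])
    return PAYOFF @ partner_mix  # [E[r | a=0], E[r | a=1]]

# Partner settled on action 0 and exploring with eps = 0.1.
stuck = expected_rewards(0.05)
print("partner greedy on 0:", stuck)

# Action 1 overtakes action 0 once 5p > 3(1 - p), i.e. p > 3/8.
threshold = 3.0 / 8.0
at_threshold = expected_rewards(threshold)
print("at p = 3/8:         ", at_threshold)
```

Against a partner who is greedy on action 0, action 1 is worth $0.25$ in expectation against $2.85$ for action 0, so each learner's own estimates keep telling it to stay put. Only the partner's behavior can lift $p$ past $3/8$, and the partner faces the mirror-image calculation.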
This book's spine is that you distribute work across machines and then pay to recombine the pieces correctly. Independent learning is the limit case where the recombination cost during training is zero: no all-reduce, no shared replay, no coordinator, $n$ agents as $n$ untethered single-agent jobs. That makes it the most scalable training scheme in the chapter and, by the same token, the one that pushes the entire coordination problem out of the system layer and into the learning dynamics. Every more sophisticated method ahead, centralized critics (Section 30.5), value decomposition (Section 30.6), and multi-agent policy gradients (Section 30.7), spends a measured amount of communication or shared structure to buy back the coordination that Output 30.4.1 shows independence cannot guarantee on its own.
4. Why the Cracked Baseline Keeps Winning (Advanced)
Given Output 30.4.1, you might expect independent learning to be a strawman that real methods crush. The empirical record refuses to cooperate with that story. On many standard cooperative benchmarks, Independent PPO matches or beats considerably more elaborate algorithms that were specifically designed to handle the multi-agent structure. There are two reasons, and they pull in the same direction. First, IPPO is on-policy, so it sidesteps the worst of the stale-replay problem from Section 2; each agent always learns from experience generated under the current opponent policies, which keeps its effective environment locally close to stationary. Second, PPO's clipped surrogate objective, a trust-region-style constraint, keeps each policy update small, which slows the rate at which any one agent moves the others' environment, giving the population time to track a shared trajectory rather than chasing wild swings.
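The second reason is easier to see with the clipped objective written out. The following is a minimal NumPy sketch of PPO's standard per-sample clipped surrogate, $\min\big(\rho A,\ \mathrm{clip}(\rho, 1-\epsilon, 1+\epsilon)\,A\big)$, where $\rho$ is the probability ratio between the new and old policy and $A$ is the advantage. The function name and the value `clip_eps=0.2` are assumptions for the sketch (0.2 is a commonly used setting), not values taken from this chapter.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)

# Probability ratios pi_new(a|s) / pi_old(a|s) for five sampled actions,
# all with the same positive advantage.
ratios = np.array([0.5, 1.0, 1.1, 1.5, 3.0])
adv = np.ones_like(ratios)
obj = ppo_clip_objective(ratios, adv)
print(obj)
```

For a positive advantage the objective stops growing once the ratio exceeds $1 + \epsilon$, so there is no incentive to push a single update further. In the independent setting that cap on each agent's step is also a cap on how fast it can move every other agent's effective environment.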
The practical upshot is a strong default: before deploying a method with a centralized critic, a mixing network, or a learned communication channel, run IPPO and make the complicated method prove it is worth the added coupling. Independent learning is the right answer more often than its broken theory suggests, especially when rewards are dense, the number of agents is large enough that centralized training is expensive, and the task does not demand tight, brittle coordination. It is the wrong answer when the optimal policy requires agents to commit to a joint action no individual would choose alone, exactly the basin-escape failure of Output 30.4.1, where the moving-target penalty is not a nuisance but the whole difficulty.
The rehabilitation of independent learning began with de Witt et al.'s study showing that Independent PPO is competitive with, and sometimes superior to, centralized-critic methods like MAPPO on the StarCraft Multi-Agent Challenge, a result that reframed IPPO from a strawman into a baseline you must beat. Follow-on benchmark work (the EPyMARL and later JaxMARL evaluation suites, 2023 to 2025) hardened this finding: when hyperparameters are tuned with equal care, the gap between independent and centralized methods shrinks dramatically across many cooperative tasks, and several "improvements" fail to survive a fair comparison. The 2024 to 2026 frontier asks the sharper question of where independence provably fails, isolating the classes of coordination problem (tight joint commitments, sparse joint rewards, adversarial non-stationarity) for which centralized training earns its cost, and building diagnostics that predict in advance which regime a task falls into. The honest takeaway for a system designer: treat independent learning as the null hypothesis, and require evidence before paying for coordination.
Code 30.4.1 hand-rolled the per-agent loop. In production you express independent learning declaratively: Ray RLlib's multi-agent API lets you map each agent to its own policy, and that single mapping decides whether you are running independent learners or a shared-parameter variant. The framework handles the per-agent rollout collection, the separate optimizer state, and the parallel sampling across a worker fleet (the actor-learner machinery of Chapter 20):
# pip install "ray[rllib]"
from ray.rllib.algorithms.ppo import PPOConfig

# Independent learners: every agent gets its OWN policy (separate weights + optimizer).
def per_agent(agent_id, *args, **kwargs):
    return agent_id  # one policy id per agent -> IPPO

config = (
    PPOConfig()
    .environment("my_marl_env")  # any MultiAgentEnv (PettingZoo, custom, ...)
    .multi_agent(
        policies={"agent_0", "agent_1", "agent_2"},  # independent policies
        policy_mapping_fn=per_agent,                 # no sharing, no central critic
    )
    .env_runners(num_env_runners=8)  # 8 parallel samplers; agents never sync grads
)

algo = config.build()
for _ in range(100):
    algo.train()  # each policy runs its own PPO update
policy_mapping_fn turns the job into IPPO with no shared parameters and no centralized critic; pointing several agents at one policy id instead would give the shared-policy variant. The roughly thirty lines of manual rollout and update logic collapse to one config object, and RLlib supplies the parallel sampling and per-policy optimizer state.
Who: A robotics engineer building a control policy for a fleet of forty warehouse picking robots.
Situation: Each robot navigates aisles and fetches items; the team wanted a learned policy that improved throughput over the hand-tuned planner without rewriting the control stack.
Problem: A centralized-training method needed every robot's observations funneled to one learner, which strained the network and coupled the training job to all forty robots being healthy at once.
Dilemma: Pay for centralized training with a joint critic over all forty agents, giving in-principle better coordination but heavy communication and a fragile single training job, or run independent PPO per robot, embarrassingly parallel and robust to dropouts but exposed to the non-stationarity penalty of Output 30.4.1.
Decision: They started with independent PPO, because the reward was dense (throughput credited continuously) and the task rarely required two specific robots to commit to a joint action, the regime where independence is known to hold up.
How: Each robot ran its own PPO policy via the RLlib mapping of Code 30.4.2, sampled in parallel across the fleet, with on-policy updates so no stale replay accumulated; a single benchmark task with a tight handoff was held back as the test for whether centralization would later be needed.
Result: Independent PPO beat the hand-tuned planner on aggregate throughput and matched a centralized-critic prototype within noise on all but the one handoff task, at a fraction of the training communication and with graceful degradation when robots went offline.
Lesson: Make independence the default and the centralized method earn its coupling. The one task where IPPO lagged was the signal for where to spend the coordination budget of Section 30.5, not a verdict against independent learning everywhere.
5. When Independence Suffices and When It Does Not (Intermediate)
The decision rule that emerges is concrete enough to apply before writing any code. Reach for independent learning first when the agents' rewards are dense and well aligned with progress, when the population is large enough that centralized training would be expensive or fragile, when on-policy updates are acceptable so stale replay never accumulates, and when the optimal behavior does not hinge on a brittle joint commitment that no single agent would risk alone. In those regimes the moving-target penalty stays small, the embarrassingly parallel structure of Section 1 pays off directly, and the simplicity is a feature rather than a liability.
Move to the centralized training with decentralized execution of Section 30.5 when the opposite holds: when rewards are sparse or shared in a way that demands credit assignment across agents (Section 30.8), when the optimal policy requires the basin-escaping coordination that Output 30.4.1 showed independence fumbles, or when adversarial non-stationarity makes each learner's target swing too violently to track. The recurring narrative of distributed RL infrastructure (Chapter 20) returns here transformed: the same actor-learner fan-out that scaled single-agent RL now scales a population of agents, and independent learning is the configuration where that fan-out needs no synchronization at all. The next section keeps the decentralized execution but reintroduces a carefully bounded amount of centralized information at training time, buying back exactly the coordination that this section showed independence cannot guarantee. That story will reappear at fleet scale in the multi-robot case study of Chapter 39.
In Code 30.4.1 the safe equilibrium $(0,0)$ pays $3$ and the optimal $(1,1)$ pays $5$, yet independent learners reach the optimal one only $48\%$ of the time. Explain, in terms of the effective non-stationary dynamics $P_i^{\text{eff}}$ from Section 2, why a tentative move toward action $1$ by one agent gets punished when the partner has not yet moved, and why that punishment pushes the pair back into the $(0,0)$ basin. Then predict qualitatively how the optimal-policy rate would change if the miscoordination payoff were raised from $0$ to $2.5$, and say which property of the game (rather than the algorithm) your prediction depends on.
Extend Code 30.4.1 into a small two-state game and give each independent Q-learner a replay buffer of the last $K$ transitions, sampling a minibatch from it on every update instead of learning only from the current step. Sweep $K \in \{1, 50, 500, 5000\}$ over the same $200$ seeds and report the optimal-policy rate for each. Show that a longer buffer degrades the outcome, and explain the result using the inconsistency argument of Section 2 (old transitions were generated under opponent policies that no longer exist). Then replace the off-policy replay update with an on-policy update and confirm the IPPO-style intuition that discarding stale experience helps.
You are choosing a training scheme for three systems: (a) a $200$-agent traffic-light controller with dense per-intersection delay rewards and no need for two lights to commit jointly; (b) a two-agent cooperative task where the only reward arrives when both agents simultaneously occupy specific cells; (c) a competitive two-player game where each opponent deliberately exploits the other's current policy. For each, decide whether to start with independent learning (IQL or IPPO) or go straight to centralized training (Section 30.5), and justify the choice using the suffices-versus-fails rule of Section 5 and the non-stationarity argument of Section 30.9. State, for the systems where you chose independence, what observable signal during training would tell you to switch.