Part VI: Distributed AI and Multi-Agent Systems
Chapter 30: Multi-Agent Reinforcement Learning

Distributed MARL Training

"I trained against my mirror image for a thousand games and learned to beat exactly one player: yesterday's me. Then they put me in a league, and for the first time I had to be good against everyone at once."

A Policy That Outgrew Its Own Reflection

Big Picture

Multi-agent reinforcement learning is the most experience-hungry training problem in this book, and it is solved by pouring the multi-agent rollouts of this chapter into the distributed actor-learner machinery of Chapter 20. Every method you have met so far in this chapter (independent learners, centralized critics, value decomposition, multi-agent policy gradients) learns from interaction data, and a multi-agent environment produces that data slowly: one episode at a time, with many agents acting inside it. Learning a strong policy can take millions to billions of such episodes, which only a fleet of machines can generate in reasonable time. So distributed MARL is two systems bolted together: a swarm of distributed actors running multi-agent rollouts, and a set of learners updating the (often shared) policies. For competitive games a second, harder system sits on top: the self-play league, a distributed population of agents and frozen snapshots that play each other to manufacture a robust meta-strategy and to tame the non-stationarity of Section 30.9. This closing section connects the algorithms of Chapter 30 to the systems of Part IV, shows why a league beats a lone self-play pair, and then steps back to summarize the whole chapter.

In the previous section we saw why non-stationarity is the defining hazard of multi-agent learning: every other agent is itself a moving target, so the environment a learner faces is never twice the same. The methods of this chapter (centralized training with decentralized execution in Section 30.5, value decomposition in Section 30.6, multi-agent policy gradients in Section 30.7) tame that hazard at the level of the learning rule. This final section tames it at the level of the system. The reason both levels are needed is throughput: a single multi-agent rollout is expensive, the policies are noisy, and the only way to average out that noise within a human's patience is to run thousands of rollouts in parallel and feed them to a shared learner. That is precisely the distributed reinforcement learning infrastructure built in Chapter 20, now carrying many agents inside each episode instead of one.

1. The Actor-Learner Architecture, Now Multi-Agent (Intermediate)

Recall the actor-learner split from Section 20.2: a large pool of actor processes runs the environment and collects experience using a recent copy of the policy, while a small set of learner processes consumes that experience and performs the gradient updates, periodically broadcasting fresh weights back to the actors. The split exists because environment simulation is CPU-bound and embarrassingly parallel, whereas the gradient update is GPU-bound and wants the data gathered in one place. MARL inherits this architecture wholesale; the single change is what happens inside one actor. Instead of a single agent stepping the environment, the actor now drives a full multi-agent episode: $n$ policies (which may be $n$ distinct networks or, under parameter sharing, $n$ copies of one network) act simultaneously, the environment returns a joint transition, and the actor emits the per-agent experience tuples that the learners will consume.
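To make "what happens inside one actor" concrete, the following is a minimal sketch of a single multi-agent rollout. Everything here is hypothetical scaffolding: `ToyMarkovGame`, `Transition`, and `run_episode` are illustrative names, the toy reward is invented for the example, and the "policy" is a trivial threshold function standing in for a network.

```python
import random
from dataclasses import dataclass

@dataclass
class Transition:
    """One agent's experience tuple for a single environment step."""
    agent_id: int
    obs: float
    action: int
    reward: float
    next_obs: float
    done: bool

class ToyMarkovGame:
    """Hypothetical stand-in for a multi-agent environment with a fixed horizon."""
    def __init__(self, n_agents, horizon, seed=0):
        self.n, self.horizon = n_agents, horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.rng.random() for _ in range(self.n)]

    def step(self, joint_action):
        self.t += 1
        next_obs = [self.rng.random() for _ in range(self.n)]
        # Toy cooperative reward: every agent earns 1 when all actions agree.
        agree = float(len(set(joint_action)) == 1)
        return next_obs, [agree] * self.n, self.t >= self.horizon

def run_episode(env, policies):
    """Drive all n policies through one joint trajectory; emit per-agent tuples."""
    obs, done, out = env.reset(), False, []
    while not done:
        joint_action = [pi(o) for pi, o in zip(policies, obs)]
        next_obs, rewards, done = env.step(joint_action)
        for i in range(env.n):
            out.append(Transition(i, obs[i], joint_action[i],
                                  rewards[i], next_obs[i], done))
        obs = next_obs
    return out

def shared_policy(o):
    """Trivial threshold rule standing in for a policy network."""
    return int(o > 0.5)

env = ToyMarkovGame(n_agents=4, horizon=10)
batch = run_episode(env, [shared_policy] * env.n)   # n copies of one policy
print(len(batch))                                   # n_agents * horizon = 40
```

The only multi-agent-specific parts are the joint action and the inner loop that fans one joint transition out into `n` per-agent tuples; the list that `run_episode` returns is what an actor would stream to the learner.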

This single change multiplies the throughput pressure. A two-team game with five agents per side produces ten action selections per environment step, and a competitive match must be played to completion before its outcome is known. The experience volume needed for a stable policy therefore scales with both the number of agents and the length of an episode, which is why landmark MARL systems ran on the order of tens of thousands of CPU cores feeding a comparatively small bank of GPUs. The communication pattern is the same asynchronous actor-to-learner stream of Chapter 10's asynchronous SGD, with the same staleness trade-off: actors run on weights a few updates behind the learner, and the system accepts that staleness in exchange for never blocking a fast actor on a slow one. Figure 30.10.1 shows the full picture, including the self-play league we build in Section 3.
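The staleness trade-off can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular system: the `Learner` class, the `MAX_LAG` budget, and the policy of discarding over-stale batches are all hypothetical (the text above only says the system tolerates some lag, and real systems may correct for lag rather than drop data).

```python
from collections import deque

MAX_LAG = 4   # hypothetical staleness budget, measured in learner updates

class Learner:
    def __init__(self):
        self.version = 0          # incremented on every gradient update
        self.inbox = deque()      # asynchronous actor-to-learner stream
        self.used = 0
        self.dropped = 0

    def submit(self, batch_version, n_tuples):
        """Actors call this and return immediately; nobody blocks."""
        self.inbox.append((batch_version, n_tuples))

    def update(self):
        """Consume one batch; skip it if its weights are too far behind."""
        batch_version, n_tuples = self.inbox.popleft()
        lag = self.version - batch_version
        if lag > MAX_LAG:
            self.dropped += n_tuples
        else:
            self.used += n_tuples
            self.version += 1     # one gradient step, then broadcast new weights
        return lag

learner = Learner()
for _ in range(10):
    learner.submit(batch_version=learner.version, n_tuples=100)  # fast actor, fresh weights
    learner.submit(batch_version=0, n_tuples=100)                # slow actor, never refreshed
    learner.update()
    learner.update()

print(learner.version, learner.used, learner.dropped)
```

The fast actor is never held back by the slow one, which is the point of the asynchronous design; the cost shows up as batches whose lag exceeds the budget once the slow actor falls far enough behind.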

[Figure 30.10.1 diagram. Left: distributed actors (Actor 1 through Actor M, tens of thousands of cores), each running a multi-agent episode in which agents 1..n act jointly, stream per-agent experience through a replay / experience buffer to the learner(s), which update the shared or decomposed policy and broadcast fresh weights back to the actors. Right: the self-play league of Section 3, a shared snapshot pool of frozen policies pi_0, pi_1, ... replicated across the cluster, which grows as learners publish snapshots, and a matchmaker that picks opponents from the pool and tells each actor which frozen opponent to play.]
Figure 30.10.1: Distributed MARL as two coupled systems. On the left, the actor-learner architecture of Section 20.2: many actors each run a full multi-agent rollout and stream per-agent experience to the learner, which updates the shared (or value-decomposed) policy and broadcasts new weights back. On the right, the self-play league of Section 3: a shared, cluster-replicated pool of frozen snapshots and a matchmaker that assigns each actor an opponent drawn from the pool. The dashed orange arrows are the league's control plane; the solid blue arrows are the experience and weight data plane.
Key Insight: MARL Reuses the RL System and Only Changes the Rollout

Distributed MARL does not invent a new distributed-systems pattern; it adopts the actor-learner architecture of single-agent distributed RL and changes exactly one thing, what an actor does per episode. The actor now steps a multi-agent environment, drives $n$ policies through one joint trajectory, and emits per-agent experience. Everything downstream, the asynchronous experience stream, the replay buffer, the GPU learner, the weight broadcast, is the Chapter 20 system unchanged. This is why the right mental model for "training a MARL team at scale" is "Chapter 20's infrastructure with a multi-agent environment plugged into each actor," not a fresh design.

Thesis Thread: A Distributed System Returns, Carrying Many Agents

This section is where the chapter advances the book's spine most directly. The actor-learner architecture introduced for single-agent RL in Section 20.2 returns here essentially unchanged, now with a multi-agent environment plugged into each actor and a self-play league wrapped around the population. The move the book keeps making (build a primitive once, then scale it out and reuse it) applies to whole systems and not only to collectives: distributed RL infrastructure is the substrate that turns the learning rules of Sections 30.5 to 30.7 into something a fleet of machines can actually train, and it returns once more on physical robots in Chapter 39.

2. Parameter Sharing as the Scaling Lever (Intermediate)

When the agents are homogeneous, that is, interchangeable units drawn from the same role such as a fleet of identical drones or a squad of identical units, parameter sharing (introduced in Section 30.7) is the lever that keeps distributed MARL affordable as the agent count grows. Instead of training $n$ separate networks, every agent acts with a single shared policy $\pi_\theta$, distinguished only by its own observation and, usually, an agent-identity feature appended to the input. The shared policy turns the per-agent experience from all $n$ agents into training data for one network, so the effective batch size for the learner grows linearly with the number of agents while the parameter count and the weight-broadcast cost stay fixed.
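As a small illustration of the agent-identity feature, the sketch below appends a one-hot agent ID to each local observation and feeds the result to one shared policy. The linear "policy", the weight values, and the function names `with_agent_id` and `shared_policy` are assumptions made for the example; a real system would use a neural network in the same role.

```python
def with_agent_id(obs, agent_id, n_agents):
    """Append a one-hot agent-identity feature to a local observation."""
    one_hot = [1.0 if j == agent_id else 0.0 for j in range(n_agents)]
    return list(obs) + one_hot

def shared_policy(theta, features):
    """Toy linear policy: one weight vector theta scores every agent's input."""
    score = sum(w * x for w, x in zip(theta, features))
    return 1 if score > 0 else 0

n_agents, obs_dim = 3, 2
theta = [0.5, -0.5, 1.0, -1.0, 0.0]                  # obs_dim + n_agents shared weights
local_obs = [[0.2, 0.1], [0.2, 0.1], [0.2, 0.1]]     # identical observations on purpose

actions = [shared_policy(theta, with_agent_id(o, i, n_agents))
           for i, o in enumerate(local_obs)]
print(actions)   # [1, 0, 1]
```

Even with identical observations, the three agents can act differently because the identity feature shifts the score; the parameter count is `obs_dim + n_agents` for this toy model and one network's worth in general, regardless of how many agents use it.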

The systems consequence is direct. Under parameter sharing the learner holds one set of weights, the broadcast back to the actors moves one model regardless of $n$, and a ten-agent episode yields ten times as many training samples as a one-agent episode of the same length at no extra parameter cost (the samples come from one joint trajectory and are therefore correlated, so the gain in gradient signal is somewhat less than tenfold). A shared policy with parameters $\theta$ collecting experience from agents $i = 1, \ldots, n$ optimizes the pooled objective $$J(\theta) = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} \sum_{t} \gamma^{t}\, r^{i}_{t}\right],$$ where each agent contributes its own reward stream $r^{i}_{t}$ to the same expectation. Homogeneity is the condition that makes this sound: if the agents truly play interchangeable roles, one policy can serve all of them, and every additional agent adds experience for that one policy. When roles differ (a goalkeeper and a striker), you fall back to a small number of shared policies, one per role, which is still far cheaper than one network per agent. We make the throughput contrast concrete in Exercise 30.10.1.
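The pooled objective is easy to estimate from logged rollouts. The following sketch computes a Monte Carlo estimate of $J(\theta)$ by averaging discounted returns over agents within each episode and then over episodes; the reward streams are made-up numbers and the function names are illustrative.

```python
def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t for one agent's reward stream."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def pooled_objective(episodes, gamma):
    """episodes: list of episodes, each a list of n per-agent reward streams."""
    per_episode = []
    for streams in episodes:
        n = len(streams)
        per_episode.append(sum(discounted_return(r, gamma) for r in streams) / n)
    return sum(per_episode) / len(per_episode)    # sample mean approximates E[.]

episodes = [
    [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # episode 1: two agents, three steps
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],   # episode 2
]
J_hat = pooled_objective(episodes, gamma=0.5)
print(J_hat)   # 0.625
```

With $\gamma = 0.5$, the first episode contributes $(1.25 + 0.25)/2 = 0.75$ and the second $(0.5 + 0.5)/2 = 0.5$, giving the estimate $0.625$.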

3. The Self-Play League: A Distributed Population Against Non-Stationarity (Advanced)

Competitive games add a problem that no amount of raw throughput solves on its own. If you train an agent only against a copy of itself, the classic self-play loop, the two policies chase each other around the strategy space and can cycle forever: the agent becomes excellent at beating its current opponent and, in doing so, forgets how to beat the opponents it defeated ten thousand games ago. This is non-stationarity (Section 30.9) in its sharpest competitive form, and it is exactly the rock-paper-scissors trap, where chasing the best response to the latest opponent leads in a circle.
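The rock-paper-scissors trap can be reproduced in a dozen lines. This is a deliberately tiny sketch: both players are pure strategies and each "training step" is an exact best response to the other's latest choice.

```python
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def best_response(opponent):
    """Return the pure strategy that beats the opponent's current strategy."""
    return next(s for s, loser in BEATS.items() if loser == opponent)

agent, opp, trace = "rock", "rock", []
for _ in range(6):
    agent = best_response(opp)     # agent adapts to the latest opponent
    opp = best_response(agent)     # opponent adapts right back
    trace.append(agent)

print(trace)
# ['paper', 'rock', 'scissors', 'paper', 'rock', 'scissors']
```

The agent's strategy repeats with period three: each update wins against the current opponent and loses what was learned two updates earlier, so no amount of additional rounds produces progress.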

The AlphaStar-style answer is a league: a distributed population of agents together with a growing pool of frozen snapshots of past policies. Learners do not train against a single mirror; they train best responses against opponents sampled from the whole pool, including old versions of themselves and specialized "exploiter" agents whose only job is to find and punish a main agent's weaknesses. Because the pool preserves history, a learner cannot win by forgetting; it must stay strong against everything the population has ever produced. The result is a robust meta-strategy that is far harder for any single opponent to exploit. Structurally this is a distributed evolutionary and self-play system: it needs matchmaking (which snapshot does each actor play?), a shared opponent pool replicated across the cluster, and enough rollout throughput to keep the population improving. These three are the real infrastructure challenges of competitive MARL, and they map cleanly onto the league control plane in Figure 30.10.1.
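The matchmaking question ("which snapshot does each actor play?") can be sketched as a small control-plane object. This is a hypothetical design for illustration only: the `Matchmaker` class, its method names, the smoothed win-rate estimate, and the squared-loss-rate weighting are all assumptions, chosen so that snapshots the trainee still loses to are scheduled more often.

```python
import random

class Matchmaker:
    """Hypothetical league control plane: tracks results, samples opponents."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.wins = {}     # snapshot_id -> trainee wins against that snapshot
        self.games = {}    # snapshot_id -> games played against that snapshot

    def publish(self, snapshot_id):
        """Register a newly frozen snapshot in the pool."""
        self.wins[snapshot_id], self.games[snapshot_id] = 0, 0

    def report(self, snapshot_id, trainee_won):
        self.games[snapshot_id] += 1
        self.wins[snapshot_id] += int(trainee_won)

    def win_rate(self, snapshot_id):
        # Smoothed estimate so an unseen snapshot starts at 0.5.
        return (self.wins[snapshot_id] + 1) / (self.games[snapshot_id] + 2)

    def weights(self):
        # Assumed weighting: low trainee win rate -> more matches.
        return {s: (1.0 - self.win_rate(s)) ** 2 for s in self.games}

    def assign(self):
        """Pick one opponent for an actor's next match."""
        w = self.weights()
        ids = list(w)
        return self.rng.choices(ids, weights=[w[s] for s in ids], k=1)[0]

mm = Matchmaker()
for sid in ("pi_0", "pi_1", "pi_2"):
    mm.publish(sid)
for _ in range(8):
    mm.report("pi_0", trainee_won=True)     # trainee has mastered pi_0
    mm.report("pi_2", trainee_won=False)    # trainee keeps losing to pi_2

picks = [mm.assign() for _ in range(1000)]
print({s: picks.count(s) for s in ("pi_0", "pi_1", "pi_2")})
```

Old snapshots never leave the rotation entirely (every weight stays positive), which is what prevents forgetting, while most matches go to the opponents that currently expose a weakness.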

The demonstration below makes the contrast measurable in pure Python, with no learning framework. We define a cyclic game on a circle of strategies (a scaled-up rock-paper-scissors) in which every pure strategy is fully exploitable and only a well-spread mixture is robust. We then run two trainers: a single self-play pair in which each side best-responds to the other's latest strategy, and a distributed league that keeps a shared snapshot pool and trains best responses against samples of that pool. We track each trainer's strength, defined as one minus its exploitability (the best score any fixed opponent can achieve against it); a strength of $0.5$ corresponds to a perfectly unexploitable mixture and $0$ to a strategy that some opponent beats outright.

import random
random.seed(7)
N = 30          # strategies on a circle
W = N // 2 - 1  # how many clockwise neighbours a strategy beats

def payoff(a, b):                     # score of a vs b in {0, 0.5, 1}
    d = (b - a) % N
    if d == 0: return 0.5
    if d <= W: return 1.0             # a beats b
    if (a - b) % N <= W: return 0.0   # b beats a
    return 0.5

def best_response_to_mixture(w):      # best pure strategy vs an opponent mixture
    tot = sum(w) or 1.0
    return max(range(N), key=lambda s: sum(w[o]*payoff(s, o) for o in range(N))/tot)

def exploitability(w):                # best score any pure opponent gets vs the agent mixture
    tot = sum(w) or 1.0
    return max(1.0 - sum(w[s]*payoff(s, o) for s in range(N))/tot for o in range(N))

def single_self_play(rounds):         # one agent, one opponent, each chases the other
    agent, opp, hist = random.randrange(N), random.randrange(N), []
    for _ in range(rounds):
        wo = [0]*N; wo[opp] = 1;   agent = best_response_to_mixture(wo)
        wa = [0]*N; wa[agent] = 1; opp   = best_response_to_mixture(wa)
        cur = [0]*N; cur[agent] = 1
        hist.append(1.0 - exploitability(cur))     # current policy is one pure strategy
    return hist

def league(rounds, learners=5, sample=12):         # shared pool + distributed matchmaking
    pool = [random.randrange(N) for _ in range(learners)]
    hist = []
    for _ in range(rounds):
        new = []
        for _l in range(learners):
            opps = random.sample(pool, min(sample, len(pool)))   # sample the SHARED pool
            w = [0]*N
            for o in opps: w[o] += 1
            new.append(best_response_to_mixture(w))              # train, then freeze
        pool.extend(new)                                         # publish snapshots
        if len(pool) > 150: pool = pool[-150:]
        meta = [0]*N
        for s in pool: meta[s] += 1
        hist.append(1.0 - exploitability(meta))                  # league meta-strategy
    return hist

ROUNDS = 60
single, lg = single_self_play(ROUNDS), league(ROUNDS)
print(f"game: cyclic payoff on a circle of N={N} (uniform mixture is unexploitable)")
print(f"rounds: {ROUNDS}   (strength = 1 - exploitability; 0.5 = unexploitable)\n")
print("round |  single self-play  |  distributed league")
print("-" * 50)
for r in (0, 4, 14, 29, 44, 59):
    print(f"  {r+1:3d} |       {single[r]:.3f}        |       {lg[r]:.3f}")
print(f"\nsingle self-play  mean over last 15 : {sum(single[-15:])/15:.3f}")
print(f"distributed league mean over last 15: {sum(lg[-15:])/15:.3f}")
Code 30.10.1: A self-play league versus a lone self-play pair, in pure Python. The league keeps a shared, growing pool of frozen snapshots and assigns each learner opponents sampled from that pool (distributed matchmaking); the single pair only ever best-responds to its latest mirror. Strength is one minus exploitability, so higher is more robust and $0.5$ is unexploitable.
game: cyclic payoff on a circle of N=30 (uniform mixture is unexploitable)
rounds: 60   (strength = 1 - exploitability; 0.5 = unexploitable)

round |  single self-play  |  distributed league
--------------------------------------------------
    1 |       0.000        |       0.150
    5 |       0.000        |       0.233
   15 |       0.000        |       0.338
   30 |       0.000        |       0.437
   45 |       0.000        |       0.340
   60 |       0.000        |       0.280

single self-play  mean over last 15 : 0.000
distributed league mean over last 15: 0.292
Output 30.10.1: The single self-play pair never escapes strength $0.000$: its policy is always one pure strategy, which some opponent beats outright, and the pair just cycles around the circle. The distributed league climbs to a peak strength of $0.437$ and sustains a robust mixture near $0.3$, because its shared snapshot pool fills the strategy space and forces each learner to be good against the whole population at once. Distribution plus a league buys robustness that a lone self-play pair, at any throughput, cannot.

The numbers tell the whole story of why competitive MARL is distributed by necessity rather than convenience. The lone pair stays pinned at zero strength no matter how long it trains; throughput cannot rescue a trainer that is structurally stuck in a cycle. The league, by keeping a population and a shared opponent pool, manufactures a broad meta-strategy that climbs well above zero. A real system replaces our best-response oracle with a neural-network learner and our circle game with StarCraft or Dota, but the architecture is identical: a distributed population, a shared snapshot pool, a matchmaker, and the actor-learner throughput to keep them all fed.

Practical Example: Standing Up a League Without Drowning the Network

Who: A reinforcement-learning systems engineer at a games-AI lab training agents for a 5v5 competitive title.

Situation: A naive self-play setup had plateaued; the agent beat its own latest checkpoint but lost badly to checkpoints from a week earlier, the textbook cycling failure of Section 30.9.

Problem: Moving to a league meant every one of roughly 4,000 actor processes needed to load opponent snapshots, and a growing pool of hundreds of frozen policies threatened to saturate the storage fabric if every actor pulled every snapshot.

Dilemma: Keep one global snapshot store (simple, but a bandwidth hotspot as the pool and actor count grow) or replicate the pool with a matchmaker that hands each actor only the one opponent it needs for its next match (more moving parts, far less traffic).

Decision: They built a matchmaker as the league control plane: it sampled an opponent from the pool per match and told the actor which single snapshot to fetch from a replicated cache, so each actor held two policies at a time, its trainee and one opponent, never the whole pool.

How: Snapshots were published to a content-addressed store and replicated to per-rack caches; the matchmaker emitted lightweight matchup assignments (a few bytes) while the heavy weight transfers stayed local to a rack, reusing the topology-aware placement ideas of Chapter 4.

Result: Snapshot traffic fell by more than an order of magnitude versus the all-actors-pull-everything design, the pool grew to hundreds of policies without a bandwidth wall, and within days the main agent stopped losing to its own past selves.

Lesson: In a distributed league the control plane (who plays whom) is tiny and the data plane (moving policy weights) is enormous; scaling the league is mostly the engineering of keeping snapshot transfer local while matchmaking decisions stay global.
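A back-of-the-envelope comparison of the two designs in this example might look like the sketch below. The numbers are illustrative assumptions, not measurements from the lab described above: the 4,000 actors come from the example, the 200 MB snapshot size is borrowed from Exercise 30.10.3, and the pool size, matches per actor, and 16-byte assignment message are invented for the arithmetic. Design B is costed at its worst case, with no cache hits.

```python
def snapshot_traffic_bytes(n_actors, pool_size, snapshot_bytes, matches_per_actor,
                           assignment_bytes=16):
    """Compare two hypothetical snapshot-distribution designs over one window."""
    # Design A: every actor pulls the entire pool once during the window.
    pull_everything = n_actors * pool_size * snapshot_bytes
    # Design B: the matchmaker names one opponent per match; worst case the actor
    # fetches that snapshot every time, plus a tiny assignment message.
    one_per_match = n_actors * matches_per_actor * (snapshot_bytes + assignment_bytes)
    return pull_everything, one_per_match

a, b = snapshot_traffic_bytes(n_actors=4000, pool_size=300,
                              snapshot_bytes=200 * 10**6, matches_per_actor=10)
print(f"pull everything: {a / 1e12:.1f} TB, one per match: {b / 1e12:.1f} TB")
# pull everything: 240.0 TB, one per match: 8.0 TB
```

Under these assumptions the matchmaker design moves about thirty times less data, and the control-plane messages are negligible next to the weights. The ratio depends entirely on how many matches an actor plays per pool refresh, so it should be recomputed with real numbers before drawing conclusions.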

Library Shortcut: RLlib Multi-Agent Runs the Actors, Learners, and League for You

Code 30.10.1 hand-rolled the population loop to expose the mechanism. In practice you describe the multi-agent environment and the policy mapping, and Ray RLlib stands up the distributed actor-learner system for you: the rollout workers, the replay or sample buffers, the learner, and the weight broadcast. A self-play setup is a few lines: map the learning agents to a shared trainee policy, register frozen snapshots as additional (non-trained) policies, and supply a callback that periodically copies the trainee into the pool and samples opponents from it.

# Ray RLlib multi-agent self-play sketch (rllib, new API stack).
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("competitive_arena")                 # your multi-agent env
    .multi_agent(
        policies={"main", "snapshot_pool"},           # trainee + frozen opponents
        # homogeneous agents share ONE policy network (parameter sharing):
        policy_mapping_fn=lambda agent_id, ep, **kw:
            "main" if agent_id.startswith("learner") else "snapshot_pool",
        policies_to_train=["main"],                    # only the trainee updates
    )
    .env_runners(num_env_runners=64)                  # 64 distributed rollout workers
)
algo = config.build()
for i in range(1000):
    algo.train()                                      # actors roll out, learner updates
    if i % 20 == 0:
        add_snapshot_of_main_to_pool(algo)            # user-defined league helper (not an RLlib call): freeze + publish
Code 30.10.2: The roughly forty lines of population, matchmaking, and rollout bookkeeping in Code 30.10.1 collapse to a configuration object. RLlib's num_env_runners spins up the distributed actors of Figure 30.10.1, policies_to_train freezes the snapshot pool, and the new-API-stack learner handles the gradient updates and weight broadcast; MARLlib and PyMARL offer the same multi-agent rollout-and-learner loop with libraries of MARL algorithms (QMIX, MAPPO, MADDPG) ready to drop in.
Fun Note: The Agent That Trained Against a Museum

The deep reason a league works is that it refuses to let the agent forget. Every frozen snapshot is a small museum exhibit, a preserved version of a past self, and the matchmaker keeps dragging the trainee back through the gallery: "remember when you lost to this one? Beat it again, now, while also beating everyone in the next room." An agent that only spars with its reflection learns to beat a single moving target. An agent that must hold the whole museum at bay learns something closer to actual skill. The price is a building full of old champions you can never quite throw away.

4. The Landmark Systems and the Frameworks (Intermediate)

Two systems are the existence proofs that distributed MARL works at scale. OpenAI Five trained a team of five agents to play Dota 2 at world-champion level using large-scale distributed PPO with parameter sharing across the five agents and a thin, partially centralized value signal; its scale came from running thousands of game instances in parallel and feeding a centralized learner, the actor-learner pattern of Figure 30.10.1 taken to industrial size. AlphaStar reached Grandmaster level at StarCraft II by adding the league: a population of main agents, past-self snapshots, and dedicated exploiters, with a matchmaker (prioritized fictitious self-play) deciding who played whom, all running on a large distributed cluster. Both systems combined the algorithmic ideas of this chapter (centralized-ish critics, parameter sharing, self-play against a population) with the distributed RL infrastructure of Chapter 20, and both needed compute many orders of magnitude beyond what a single machine provides.

For your own work you reach for a framework rather than rebuilding either system. Ray RLlib provides production-grade distributed multi-agent training with self-play support, as in Code 30.10.2. MARLlib, built on RLlib, packages a large library of cooperative and competitive MARL algorithms behind a unified interface. PyMARL (and its successor PyMARL2) is the reference research codebase for value-decomposition methods such as QMIX on the StarCraft Multi-Agent Challenge benchmark. The common shape across all three is the one in Figure 30.10.1: distributed actors running multi-agent rollouts, a learner updating shared or decomposed policies, and, for competitive settings, a population with a snapshot pool.

Research Frontier: Scaling and Stabilizing Distributed MARL (2024 to 2026)

Three threads are active. First, scalable MARL libraries: JAX-based, end-to-end-on-accelerator frameworks such as JaxMARL and Mava (2024) run thousands of vectorized multi-agent environments directly on GPUs and TPUs, collapsing the CPU-actor bottleneck of the classic architecture and reporting order-of-magnitude wall-clock speedups for MARL training. Second, open-endedness and automatic curricula: work descending from AlphaStar's league and from population-based training (the lineage of open-ended learning and PSRO) is formalizing how a population and its matchmaker should grow so that the pool keeps producing novel, harder opponents rather than collapsing to a single style. Third, large models as multi-agent learners: 2024 to 2026 has seen rapid interest in training and coordinating teams of LLM-based agents with reinforcement learning, where the "policy" is a language model and the distributed-rollout cost is dominated by inference, pulling distributed MARL toward the LLM-serving systems of Chapter 24. The common pressure across all three is the one this section opened with: multi-agent experience is expensive, and the frontier is about generating and learning from more of it, faster.

5. Chapter Summary: Multi-Agent Reinforcement Learning (Beginner)

This chapter took the single-agent reinforcement learning of classical RL and confronted it with the hardest complication in distributed AI: the other agents are also learning. We began by framing the problem as a Markov game (Section 30.2), the multi-agent generalization of a Markov decision process, and separated the three regimes that shape every design choice, cooperative, competitive, and mixed (Section 30.3). The simplest approach, independent learners (Section 30.4), treats every other agent as part of the environment; it is easy to implement and sometimes works, but it is fragile precisely because that "environment" is non-stationary. The cure that organizes the modern field is centralized training with decentralized execution (Section 30.5): let a critic see everything during training so the learning signal is stationary, while each agent still acts on its own local observation at deployment. Within CTDE, value decomposition (VDN and QMIX, Section 30.6) factors a joint value into per-agent pieces, and centralized-critic policy gradients (MADDPG and MAPPO, Section 30.7) extend actor-critic methods to many agents. Two challenges recurred throughout: credit assignment (Section 30.8), deciding which agent deserves the shared reward, and non-stationarity (Section 30.9), coping with co-adapting opponents. This final section showed that all of it is scaled on the distributed actor-learner systems of Chapter 20, with parameter sharing as the lever for many homogeneous agents and a self-play league as the distributed answer to competitive non-stationarity.

Key Takeaway: The Whole of Chapter 30 in One Breath

Multi-agent reinforcement learning studies learning agents embedded in a Markov game, in a setting that is cooperative, competitive, or mixed. Independent learners are the simple baseline but are fragile because each agent's environment is non-stationary. Centralized training with decentralized execution tames that non-stationarity by giving training a global view while keeping execution local; within it, value decomposition (VDN, QMIX) and centralized critics (MADDPG, MAPPO) are the workhorses. The two core difficulties are credit assignment (who earned the shared reward?) and non-stationarity (everyone is moving at once). And because multi-agent experience is enormously expensive, all of this is trained on distributed actor-learner systems, scaled by parameter sharing for homogeneous agents and stabilized by self-play leagues for competitive games, exactly the OpenAI Five and AlphaStar recipe.

The thread of this chapter, many learning agents coordinating under partial information, does not end here. Chapter 31 pushes the agent count to the hundreds or thousands and asks how simple local rules produce coordinated collective behavior, the swarm-intelligence end of the multi-agent spectrum where no agent learns a complex policy but the group still acts intelligently. And the robotics case study in Chapter 39 puts the MARL of this chapter onto physical multi-robot and drone-swarm systems, where the distributed actor-learner architecture meets real sensors, real latency, and real failure. The Markov game you learned to reason about here is the formal heart of both.

Exercise 30.10.1: Map the Rollout onto the Architecture (Conceptual)

Consider a cooperative MARL task with eight homogeneous agents trained with MAPPO under parameter sharing, run on the actor-learner architecture of Figure 30.10.1. (a) For a single episode of length $T$ on one actor, how many per-agent experience tuples are produced, and how does the learner's effective batch size depend on the number of agents? (b) Explain why parameter sharing leaves the weight-broadcast cost from learner to actors unchanged as the agent count grows from eight to eighty, while the experience volume per episode grows tenfold. (c) State one situation in which parameter sharing would hurt, and what you would do instead.

Exercise 30.10.2: Why the Pool Beats the Mirror (Coding)

Modify Code 30.10.1 to instrument the cycling failure directly. (a) Inside single_self_play, also record the agent's strategy index each round and plot or print it; confirm it walks steadily around the circle (the cycle) rather than settling. (b) In league, print the number of distinct strategies in the pool each round and show that the league's strength rises as that diversity grows. (c) Now sabotage the league by changing the opponent sampling to always pick only the single most recent snapshot; show that the league's strength collapses back toward the single-pair behavior, and explain in one or two sentences why sampling the whole pool, not just the latest snapshot, is what defeats cycling.

Exercise 30.10.3: Throughput Arithmetic for a League (Analysis)

Suppose each competitive match takes 5 minutes of wall-clock to simulate, you run 4,000 actors in parallel, and a stable league needs $2 \times 10^{8}$ matches. (a) Estimate the wall-clock training time, ignoring learner and matchmaking overhead. (b) The pool grows by one snapshot per learner every 20 updates; if each snapshot is 200 MB and you keep 500 of them, how much storage does the replicated pool consume, and why does the matchmaker design in the Practical Example matter for network cost rather than storage? (c) Argue from your numbers whether actor throughput or snapshot transfer is the more likely bottleneck, and connect your answer to the communication-versus-computation trade-off first quantified in Chapter 3.

Project Ideas

These projects turn Chapter 30 into running systems. Each one can start on a single machine and scale out with the frameworks named in Section 4.

1. Train a cooperative MARL team with MAPPO. Pick a standard cooperative benchmark (the StarCraft Multi-Agent Challenge, or a simpler grid-world predator-prey or cooperative-navigation environment) and train a team of homogeneous agents with MAPPO under parameter sharing, using RLlib or MARLlib. Measure how sample efficiency and final return change as you (a) turn parameter sharing on and off and (b) add or remove the centralized critic of Section 30.7. Report wall-clock and total environment steps, and relate the speedup from more rollout workers to the actor-learner architecture of Figure 30.10.1.

2. Build a self-play league for a competitive game. Starting from the pure-Python skeleton of Code 30.10.1, replace the best-response oracle with a small neural-network policy and the circle game with a real two-player game (a simplified fighting game, a card game, or even Connect Four). Implement a shared snapshot pool, a matchmaker that samples opponents, and a periodic freeze-and-publish callback. Demonstrate the same contrast Output 30.10.1 shows: the league produces an agent robust to its own past selves while a lone self-play pair cycles or overfits to its latest opponent.

3. Stress-test the league control plane. Take the league from project 2 (or RLlib's self-play) and scale the actor count up while logging snapshot-transfer bytes and per-actor memory. Compare an all-actors-pull-the-whole-pool design against the matchmaker-assigns-one-opponent design from the Practical Example, and quantify the network-traffic reduction. Use the communication-cost reasoning of Chapter 3 to predict where the design hits a bandwidth wall, then verify your prediction empirically.