Part IV: Parallel Deep Learning and Large Models
Chapter 20: Distributed Reinforcement Learning Infrastructure

Why RL Is a Distributed-Systems Problem

"The learner asked the actors for a billion experiences. The actors asked the environment. The environment, being a simulator, simply asked for more CPUs."

A Replay Buffer That Is Always Hungry
Big Picture

Reinforcement learning is a distributed-systems problem because, unlike supervised learning, it generates its own training data by interacting with an environment, so a single run must interleave two workloads with opposite hardware appetites: many CPU-heavy actors simulating environments to collect experience, and one GPU-heavy learner optimizing the policy from that experience. Supervised learning reads a fixed dataset from disk; reinforcement learning has no dataset until its current policy goes out and produces one, and the policy is only as good as the experience it has already collected. That loop, act to learn, learn to act better, is cheap in arithmetic but ruinously expensive in samples: state-of-the-art results consume millions to billions of environment steps. The only way to produce that many steps in a reasonable time is to run thousands of environment instances in parallel, which is why reinforcement learning gets its own infrastructure rather than reusing the data-parallel training loop of Chapter 15. This section establishes that thesis and measures the bottleneck that the rest of the chapter exists to manage.

By Chapter 19 we knew how to train an enormous model on an enormous fixed dataset, splitting data, the model, and the cluster across thousands of accelerators at once. Reinforcement learning breaks the one assumption that made all of that work: that the training data exists before training starts. In supervised learning the corpus is a given; the system reads it, computes gradients, and updates parameters. In reinforcement learning an agent must act in an environment, observe the consequences, and learn a policy that maximizes a long-run reward, and the data that teaches the policy is produced by the policy itself. The dataset is not read from disk; it is manufactured on the fly by interaction, and it shifts every time the policy improves. That single difference reorganizes the whole training system.

Formally, an agent in state $s_t$ takes an action $a_t \sim \pi_\theta(\cdot \mid s_t)$ under a policy with parameters $\theta$, the environment returns a reward $r_t$ and a next state $s_{t+1}$, and the goal is to maximize the expected discounted return $J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t r_t\right]$ with discount $\gamma \in [0,1)$. The expectation is taken over trajectories drawn from the current policy, which is exactly why the data distribution moves as $\theta$ changes. Estimating $\nabla_\theta J(\theta)$ requires fresh samples from $\pi_\theta$, and producing those samples means running the environment, possibly millions of times. The arithmetic of the gradient is modest; the simulation behind it is the cost.

Many CPU actors (rollout) Actor 1 policy copy + environment Actor 2 policy copy + environment Actor 3 policy copy + environment . . . Actor M policy copy + environment Replay buffer collected experience (state, action, reward, next state) experience One GPU learner samples experience, computes policy gradient, updates θ batches updated policy weights broadcast back to actors
Figure 20.1.1: The actor-learner shape of a distributed reinforcement-learning system, which Section 20.2 makes precise. Many CPU actors (left, orange) each pair a copy of the policy with an environment and stream experience into a shared replay buffer (center). A single GPU learner (right, green) draws batches from the buffer, computes the policy gradient, and updates $\theta$; the dashed arrow returns the new weights so actors keep collecting under an improving policy. The two workloads are physically and economically distinct: rollout is embarrassingly parallel CPU simulation, optimization is one accelerator.

Figure 20.1.1 already exposes the defining tension of this chapter. The left half of the diagram and the right half want different machines. Environment simulation is branch-heavy, often single-threaded per instance, and rarely touches a matrix multiply large enough to justify a GPU; it scales by running more independent copies on more CPU cores. Policy optimization is the opposite: dense linear algebra over large batches, exactly what an accelerator is built for, but fundamentally one shared computation because there is one policy to improve. A reinforcement-learning run therefore interleaves a many-CPU workload with a one-GPU workload, and the engineering problem is keeping both busy without either starving the other.

1. Two Workloads in One Loop Beginner

It helps to name the two halves precisely, because the rest of the chapter is about the seam between them. The first workload is rollout, also called experience collection: an actor holds a copy of the current policy, steps the environment forward, and records the resulting transitions. Each actor is independent of every other actor; they share no state during a rollout, so adding a thousandth actor costs nothing in coordination, only in CPUs. Rollout is, in the language of Chapter 6, embarrassingly parallel: the work partitions cleanly and the parts never talk to each other. The second workload is learning, the policy optimization step: a learner consumes collected experience, computes a gradient of the objective, and updates the single shared set of policy parameters. There is one policy, so there is one authoritative update path, and that is the part of the system that does not trivially parallelize across machines.

These two workloads have genuinely different throughput characteristics, which is the entire reason reinforcement learning earns dedicated infrastructure rather than reusing the data-parallel SGD loop of Chapter 15. In data-parallel supervised training every worker runs the identical computation on a different shard of a fixed dataset and they all synchronize through an all-reduce; there is one workload, replicated. In reinforcement learning the actors and the learner run different code on different hardware at different rates, connected by a buffer of experience rather than by a collective. The whole-cluster question changes from "how do I average gradients across identical workers?" to "how do I keep a GPU learner fed by a swarm of CPU actors whose data goes stale the moment the policy moves?"

Key Insight: One Run, Two Hardware Profiles, One Seam

Reinforcement learning is distributed not because the model is large but because the data is manufactured. A run interleaves rollout (CPU-heavy, embarrassingly parallel, throughput scales with the number of actors) and optimization (GPU-heavy, one shared policy, throughput capped by a single learner). The actors and the learner are different programs on different hardware running at different speeds, joined by a buffer of experience. Almost every design decision in this chapter, synchronous versus asynchronous collection, replay-buffer placement, off-policy correction, is about managing that one seam so neither side starves the other.

2. Why You Cannot Avoid Distributing the Rollout Beginner

The reason this is a scale-out problem and not a single-machine one is sample efficiency, or rather the lack of it. A supervised model may converge after a handful of passes over a fixed corpus. A reinforcement-learning agent learns from sparse, noisy, delayed reward, discards most of what it collects as the policy shifts, and must explore before it can exploit; the result is that landmark agents consume environment steps in quantities that have no analogue in supervised learning. Game-playing and continuous-control agents are routinely trained on hundreds of millions to tens of billions of frames. At a single environment instance running even a few thousand steps per second, ten billion steps is months of wall-clock time. Run ten thousand environment instances in parallel and the same experience budget collapses to hours. The distribution of rollout is not an optimization; it is the only thing that makes the run finish.

This is the same logic that drove the data-parallel decision in Chapter 15, where throughput was the binding ceiling, but here the parallel work is environment simulation rather than gradient computation, and it lands on CPUs rather than accelerators. The learner, by contrast, does not benefit from a thousand copies, because there is one policy to optimize. So the cluster is lopsided by design: a large pool of cheap CPU actors and a small set of expensive GPU learners, sized so the actors produce experience about as fast as the learners can consume it. Getting that ratio right is the central tuning problem, and it has a clean signature that we can measure directly.

The program below models exactly that signature in pure Python. It times one actor's environment-step throughput, then projects what a cluster of $M$ parallel actors would produce, because rollout is embarrassingly parallel and scales linearly. It then caps that throughput by a single learner's fixed ingest rate. The point of interest is the crossover: the actor count at which the system stops being limited by how fast experience is collected and starts being limited by how fast the one learner can consume it.

import time, math, random

# A toy "environment step": a small amount of CPU work standing in for a
# simulator advancing one timestep (physics, game logic, reward computation).
def env_step(state):
    acc = 0.0
    for _ in range(220):
        acc = math.sin(acc + state) * 1.000001 + 0.5
    return acc

ROLLOUT = 4000                      # env steps produced per actor per rollout

def actor_rollout():                # one actor: pure CPU simulation, no GPU
    s = random.random()
    for _ in range(ROLLOUT):
        s = env_step(s)
    return s

def one_rollout_seconds():          # best-of-three timing of a single rollout
    def _one():
        t0 = time.perf_counter(); actor_rollout(); return time.perf_counter() - t0
    return min(_one() for _ in range(3))

per_rollout_s = one_rollout_seconds()
steps_per_actor_per_s = ROLLOUT / per_rollout_s

# The single learner consumes experience at a FIXED rate (one GPU optimizing the
# one policy). Past this rate, extra actors only pile up experience the learner
# cannot ingest, so the learner becomes the system bottleneck.
LEARNER_CAPACITY = 9.0 * steps_per_actor_per_s     # learner keeps up with ~9 actors

print(f"per-actor rollout throughput : {steps_per_actor_per_s:11.0f} env-steps/s")
print(f"single-learner ingest cap    : {LEARNER_CAPACITY:11.0f} env-steps/s\n")
print(f"{'actors':>7} | {'rollout (steps/s)':>18} | {'system (steps/s)':>17} | bottleneck")
print("-" * 64)
for n_actors in [1, 2, 4, 8, 16, 32, 64]:
    rollout_rate = n_actors * steps_per_actor_per_s      # linear: parallel actors
    system_rate  = min(rollout_rate, LEARNER_CAPACITY)   # capped by the one learner
    who = "actors (rollout)" if rollout_rate <= LEARNER_CAPACITY else "learner (policy opt)"
    print(f"{n_actors:7d} | {rollout_rate:18.0f} | {system_rate:17.0f} | {who}")
Code 20.1.1: A pure-Python model of actor-learner throughput. It measures one actor's real environment-step rate, scales it linearly across parallel actors (valid because rollout is embarrassingly parallel), and caps the total by a single learner's fixed ingest rate to expose the crossover from a rollout-bound regime to a learner-bound regime.
per-actor rollout throughput :       53931 env-steps/s
single-learner ingest cap    :      485378 env-steps/s

 actors |  rollout (steps/s) |  system (steps/s) | bottleneck
----------------------------------------------------------------
      1 |              53931 |             53931 | actors (rollout)
      2 |             107862 |            107862 | actors (rollout)
      4 |             215724 |            215724 | actors (rollout)
      8 |             431447 |            431447 | actors (rollout)
     16 |             862894 |            485378 | learner (policy opt)
     32 |            1725788 |            485378 | learner (policy opt)
     64 |            3451577 |            485378 | learner (policy opt)
Output 20.1.1: Total environment-steps-per-second grows linearly while the actors are the bottleneck (1 to 8 actors here), then flattens hard once the rollout rate exceeds the single learner's ingest cap (from 16 actors on). Adding actors past the crossover buys collected experience the one learner cannot consume; the measured per-actor rate near 54,000 steps/s and the projected 485,000 steps/s ceiling are this machine's numbers, but the shape is universal.

Output 20.1.1 is the chapter in miniature. While the actors are the limiting factor, throughput rises in lockstep with the actor count, exactly the linear speedup that makes distributing the rollout worthwhile. The moment the combined rollout rate crosses the learner's fixed ingest capacity, the curve goes flat: more actors no longer help, because the single learner cannot absorb what they produce. Every subsequent section of this chapter is an attack on one side of that crossover, either pushing the learner ceiling up (sharing learner work, faster off-policy ingestion) or making each unit of collected experience more valuable so the learner needs less of it. Section 20.8 returns to this sampling-versus-learning bottleneck with the full machinery to diagnose which side binds in a real system.

Fun Note: The Buffet Problem

An actor-learner cluster is a buffet where the kitchen (the actors) can cook arbitrarily fast by hiring more cooks, but there is exactly one diner (the learner) with one stomach. Hire a thousand cooks and the counters overflow; the diner still eats at one diner's pace, and most of the food spoils before it is touched. Reinforcement-learning infrastructure is, to an unreasonable degree, the art of either growing the diner or making every plate count.

3. RLHF: The Environment Is a Reward Model Intermediate

If reinforcement learning felt like a niche corner reserved for game-playing and robotics, the rise of large language models moved it to the center, because reinforcement learning from human feedback (RLHF) is how modern assistants are aligned, and RLHF is reinforcement learning wearing the actor-learner shape from Figure 20.1.1. The policy is the language model itself; an action is generating a token (or a whole response); and the environment, the thing that returns a reward, is a learned reward model that scores how good a response is, standing in for human preferences. Section 19.8 developed RLHF as the final stage of foundation-model training; here we name what kind of system it is. Generating responses from the policy is rollout, scoring them with the reward model is the environment step, and optimizing the policy against those scores (with PPO or a related algorithm) is the learner.

This mapping is not a loose analogy; it changes where the cost lives. In classic reinforcement learning the environment is cheap (a game simulator) and the policy is small, so rollout is CPU-bound. In RLHF the "environment" is itself a multi-billion-parameter neural network, and generating a rollout means autoregressive decoding from a multi-billion-parameter policy, so both the actor side and the learner side now demand accelerators. The actor-learner seam is the same, but both halves moved onto GPUs, which is why production RLHF stacks look like a fusion of the serving systems of Chapter 24 (fast generation for rollout) and the training systems of this part. The infrastructure question of this chapter, keep the learner fed without letting experience go stale, is exactly the question an RLHF system must answer, now with rollout that costs as much as a serving fleet.

Research Frontier: Large-Scale RL for Reasoning (2024 to 2026)

The most visible recent use of distributed reinforcement learning is training language models to reason. DeepSeek-R1 (DeepSeek-AI, 2025) showed that reasoning ability can be elicited largely through reinforcement learning with verifiable rewards, where the "environment" is an automatic checker (a math grader, a unit-test harness) rather than a human-trained preference model, which makes the reward cheap and exact and shifts the bottleneck squarely onto rollout throughput. This reinforcement-learning-with-verifiable-rewards setting has driven a wave of open infrastructure, including verl (Sheng et al., 2024) and OpenRLHF, that co-locates a high-throughput inference engine for generation with a distributed trainer for optimization, precisely the two-workload seam of this section, and schedules them to keep expensive accelerators busy on both sides. The frontier question is no longer "can we do RL on an LLM" but "how do we balance generation and optimization throughput across a heterogeneous cluster so neither stalls," which is the scaling bottleneck of Section 20.8 at foundation-model scale.

4. Single-Agent Infrastructure, Not Multi-Agent RL Intermediate

One boundary needs drawing before we build anything, because two different ideas share the word "many." This chapter is about the infrastructure for training a single agent fast: one policy, one objective, many machines collecting experience for that one policy. The many actors in Figure 20.1.1 are not many agents; they are many copies of the same policy collecting data in parallel, the way many data-parallel workers in Chapter 15 are copies of the same model. The distribution is a systems trick to manufacture data faster, and the learning problem underneath is the ordinary single-agent one.

That is distinct from multi-agent reinforcement learning, where several agents with possibly different policies and possibly conflicting objectives interact in a shared environment and must learn while the environment itself is non-stationary because the other agents are also learning. That is a different learning problem (it brings in game theory and equilibria) and it is the subject of Chapter 30. The two connect: the distributed actor-learner machinery built in this chapter is the substrate that distributed multi-agent training in Chapter 30 runs on, which is why the cross-reference arc lists distributed RL infrastructure as introduced here and transformed there. Keep the distinction clean: this chapter scales the data pipeline for one learner; Chapter 30 changes what is being learned.

Library Shortcut: Ray RLlib Wires the Actor-Learner Cluster for You

Everything Figure 20.1.1 sketches, a pool of environment actors, a shared experience path, and a learner that broadcasts updated weights, is fiddly to build correctly: process management, weight synchronization, batching, fault handling. Ray RLlib provides it as configuration. You declare an algorithm and the number of environment runners, and the library spins up the distributed actors, routes their experience to the learner, and pushes new policy weights back, scaling from one machine to a cluster by changing two numbers:

# pip install "ray[rllib]"
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=64)   # 64 parallel CPU actors collecting experience
    .learners(num_learners=1)          # one GPU learner optimizing the single policy
)
algo = config.build_algo()
for _ in range(50):
    result = algo.train()              # one iteration: collect rollouts, then optimize
    print(result["env_runners"]["episode_return_mean"])
Code 20.1.2: The actor-learner system of Figure 20.1.1 as a few lines of Ray RLlib. The num_env_runners and num_learners fields are the two sides of the crossover measured in Output 20.1.1; RLlib handles the process spawning, experience routing, weight broadcast, and fault tolerance that Section 20.9 unpacks. The hand-built throughput model of Code 20.1.1 becomes a tuning knob.
Practical Example: The Robotics Team Whose GPU Sat Idle

Who: A reinforcement-learning team training a locomotion policy for a quadruped in a physics simulator.

Situation: They ran training on a single beefy GPU box, with the environment simulator and the PPO learner sharing the same machine and the same eight CPU cores.

Problem: The expensive GPU sat below twenty percent utilization for most of every step, while the CPUs pegged at one hundred percent; training a usable gait took nine days.

Dilemma: Buy a second, even bigger GPU (a scale-up reflex), or separate the two workloads and scale only the side that was actually saturated, the CPU-bound rollout.

Decision: They scaled out the rollout, because the profile was unambiguous: the GPU was starved, not the bottleneck. The learner needed feeding, not replacing.

How: They moved to an actor-learner layout with 200 CPU environment actors across cheap nodes streaming experience to the one existing GPU learner, the arrangement of Figure 20.1.1, configured in a few lines like Code 20.1.2.

Result: GPU utilization rose past eighty percent, the learner was no longer waiting on simulation, and wall-clock training fell from nine days to under a day at a fraction of the cost of a second flagship GPU.

Lesson: In reinforcement learning the binding ceiling is usually rollout, not optimization. Profile the seam first; the cheap fix is almost always more CPU actors, not a bigger accelerator.

5. What the Rest of the Chapter Builds Beginner

We now have the thesis (reinforcement learning is distributed because it manufactures its own data, interleaving CPU rollout with GPU optimization), the measurement that defines the problem (the actor-versus-learner crossover of Output 20.1.1), and the boundary (single-agent infrastructure, not multi-agent learning). The chapter proceeds straight down the seam. Section 20.2 makes the actor-learner architecture precise, naming who holds the policy, how experience flows, and how weights propagate back. From there the chapter widens the two pipes in turn: distributed experience collection and replay buffers feed the learner faster, off-policy correction such as V-trace keeps stale experience usable, the Ape-X, R2D2, and SEED RL designs show full systems that balance the two sides, and the synchronous-versus-asynchronous choice (a return of the coordination question from Chapter 2) decides how tightly the actors track the learner. It all serves the one goal this section measured: keep the learner fed and the experience fresh, at the largest scale the budget allows.

Thesis Thread: A New Reason to Scale Out

Every part of this book has distributed work because a resource ran out: data too big for one disk (Part II), gradients too slow on one worker (Part III), a model too large for one accelerator (Part IV so far). Reinforcement learning adds a reason the earlier chapters never needed: the data itself does not exist until thousands of machines manufacture it by interacting with an environment. Scale-out here is not about fitting or speeding up a fixed computation; it is the only way to produce the training signal at all. The same actor-learner substrate returns, transformed, when many distinct agents learn together in Chapter 30, and again in the multi-robot swarm case study of Chapter 39.

Exercise 20.1.1: Why Not Just Reuse Data-Parallel SGD? Conceptual

The data-parallel training loop of Chapter 15 has every worker run the identical computation on a different shard of a fixed dataset and synchronize through an all-reduce. Explain, in terms of the two workloads named in Section 1, why this loop does not directly fit reinforcement learning. Address three differences specifically: (a) where the training data comes from, (b) whether the actors and the learner run the same code on the same hardware, and (c) why an all-reduce of gradients is not the natural connective tissue between actors and a learner. What replaces the all-reduce as the thing that joins the two halves?

Exercise 20.1.2: Move the Crossover Coding

Reproduce Code 20.1.1 and confirm the crossover from a rollout-bound to a learner-bound regime on your own machine (the actor count where it flips will differ from Output 20.1.1). Then change LEARNER_CAPACITY to model a learner that is three times faster (for example by sharding the optimization across three accelerators) and report how the crossover actor count moves. Separately, make each environment step cheaper (lower the inner loop count in env_step) so per-actor throughput rises, and explain which regime that change helps and which it does not. Summarize, in one sentence, the two distinct levers you have for raising total environment-steps-per-second.

Exercise 20.1.3: Size an RLHF Cluster Analysis

Using the RLHF mapping from Section 3, suppose a reinforcement-learning-from-human-feedback run must process $2 \times 10^{8}$ generated tokens of experience to converge, generation (rollout) produces tokens at 5,000 tokens per second per inference replica, and the PPO learner can ingest experience equivalent to 40,000 tokens per second. Estimate how many generation replicas are needed so that rollout is not the bottleneck, and state what the system is limited by once you provision that many. Then argue qualitatively how the answer shifts if the reward is a cheap automatic verifier (as in the reasoning-RL frontier above) rather than an expensive reward-model forward pass. Which side of the seam does a cheaper reward relieve?