"I stepped a thousand worlds at once so the learner would never go hungry. It still complained that the batch was late."
A Vectorized Actor Between Resets
A reinforcement learning system is starved or fed by one number: how many environment steps per second its actors can produce. Unlike supervised training, where the data already exists on disk, an RL agent must generate its own data by acting in environments, and that generation is the bottleneck for most modern systems. The previous section split the work into actors that collect experience and learners that update the policy. This section scales the actor side: it batches many environments into one process (vectorization), spreads thousands of environment instances across many cheap CPU machines, and works out the throughput arithmetic that tells you how many actors a single GPU learner needs. The recurring tension is a hardware mismatch: environment simulation usually wants CPUs while the policy's forward pass wants a GPU, and every rollout architecture is a different answer to where that split should fall.
In Section 20.2 we drew the actor-learner architecture: actors run the current policy inside environments and emit transitions, a learner consumes those transitions and updates the policy, and the updated weights flow back to the actors. That diagram hides the question that decides whether the whole system is fast or slow. A learner on a modern accelerator can apply gradient updates extremely quickly; if it sits idle waiting for transitions to arrive, the accelerator you paid for is wasted. The job of distributed experience collection is to produce transitions fast enough, and cheaply enough, that the learner never waits. Everything in this section is in service of that single goal.
We measure the rollout side in environment steps per second, the rate at which actors advance their environments and record transitions. This is the currency of RL infrastructure the way examples-per-second is the currency of supervised training. A step is one call to the environment's transition function: feed it an action, get back the next observation, a reward, and a done flag. Producing a step requires two things in sequence, an action from the policy and a transition from the environment, and the rest of this section is about making both of those happen at scale.
1. Vectorized Environments: Batch the Rollout in One Process Beginner
The first lever costs no extra machines at all. A naive rollout loop advances one environment at a time: call the policy for one observation, get one action, step one environment, repeat. In an interpreted language this loop is dominated by per-call overhead: the Python interpreter dispatches the same handful of operations millions of times, and each call processes a single tiny observation. A vectorized environment instead holds $B$ independent environment instances and advances all of them together, so one Python-level call processes a batch of $B$ observations and emits $B$ transitions. The policy's forward pass also becomes a single batched matrix multiply over $B$ observations instead of $B$ separate ones, which is exactly the shape that both CPUs and GPUs execute most efficiently.
The throughput gain is large and requires no additional machines. If the per-step Python and dispatch overhead is the dominant cost, batching $B$ environments amortizes that fixed cost across $B$ transitions, so a single process can reach roughly an order of magnitude more environment steps per second. The demonstration below measures this directly: it times a scalar single-environment loop against a vectorized loop over a batch, both doing the same per-step arithmetic, and reports the speedup.
import math
import time

import numpy as np

STATE_DIM = 64

def env_step_scalar(state):  # one transition, pure-Python loop
    s = 0.0
    for i in range(STATE_DIM):
        s += math.sin(state * 0.001 + i) * 1.000001
    return s

def run_single_env(num_steps):  # advance ONE env, num_steps times
    state, t0 = 1.0, time.perf_counter()
    for _ in range(num_steps):
        state = env_step_scalar(state)
    return num_steps / (time.perf_counter() - t0)  # env-steps per second

def run_vectorized_env(num_steps, batch):  # advance B envs together each step
    state = np.ones(batch)
    idx = np.arange(STATE_DIM, dtype=np.float64)
    t0 = time.perf_counter()
    for _ in range(num_steps):
        # Broadcast (batch, 1) against (STATE_DIM,) so one call steps every env.
        state = np.sin(state[:, None] * 0.001 + idx).sum(axis=1) * 1.000001
    return (num_steps * batch) / (time.perf_counter() - t0)

STEPS, BATCH = 20_000, 256
single = run_single_env(STEPS)
vec = run_vectorized_env(STEPS, BATCH)
print(f"single-env throughput : {single:,.0f} env-steps/s")
print(f"vectorized (B={BATCH}) throughput : {vec:,.0f} env-steps/s")
print(f"speedup from batching : {vec / single:,.1f}x")

# How many CPU actors keep one GPU learner fed? Model a realistic per-actor rate.
learner_demand = 512 / (8.0 / 1000.0)  # 512 transitions every 8 ms learner step
realistic_actor_sps = 1_000.0  # one actor running a real game/sim
print()
print(f"learner demand : {learner_demand:,.0f} transitions/s")
print(f"per-actor supply (real) : {realistic_actor_sps:,.0f} env-steps/s")
print(f"actors to feed 1 learner: {math.ceil(learner_demand / realistic_actor_sps)}")
single-env throughput : 130,049 env-steps/s
vectorized (B=256) throughput : 1,728,380 env-steps/s
speedup from batching : 13.3x
learner demand : 64,000 transitions/s
per-actor supply (real) : 1,000 env-steps/s
actors to feed 1 learner: 64
The 13.3x speedup is the headline, but note that the toy benchmark's vectorized rate is far higher than any real simulator would reach, because the per-step work here is trivial. That is why the actor-count model uses a deliberately realistic 1,000 steps per second for a single CPU actor running an actual game or physics engine. With that grounded number, one learner needs 64 actors, which is precisely why the rollout side must be distributed across machines and not merely vectorized within one.
An RL system's speed is governed by the slower of two rates: how fast actors produce transitions and how fast the learner consumes them. Vectorization raises the actor rate for free by amortizing interpreter overhead across a batch of environments; distribution raises it further by adding actor machines. The learner's accelerator is only worth its cost if the combined actor rate keeps it busy, so the first quantity to measure in any RL infrastructure is environment-steps-per-second, not floating-point operations or wall-clock per gradient step.
2. The Throughput Math: How Many Actors Feed One Learner Intermediate
The actor-count calculation in Code 20.3.1 deserves to be written out, because it is the back-of-the-envelope that sizes every rollout fleet. Let a single actor produce $r_a$ environment steps per second. With $M$ actors running in parallel, the system's transition supply rate is
$$S = M \cdot r_a \quad \text{(transitions per second produced)}.$$
The learner consumes transitions in batches. If it processes $b$ transitions per gradient step and each step takes $t_\ell$ seconds, its demand rate is
$$D = \frac{b}{t_\ell} \quad \text{(transitions per second consumed)}.$$
To keep the learner busy without starving it, supply must at least meet demand, $S \ge D$, which rearranges into the number of actors you must provision:
$$M \ge \frac{D}{r_a} = \frac{b}{t_\ell \, r_a}.$$
With the numbers from Output 20.3.1, $b = 512$, $t_\ell = 8$ milliseconds, and $r_a = 1000$, this gives $M \ge 64$. The formula also tells you what each design lever does. A faster simulator (larger $r_a$) lowers the actor count linearly. A bigger learner batch (larger $b$) raises it. A slower learner step (larger $t_\ell$, perhaps because the policy grew) lowers the demand rate and so lowers the actor count, which is the counterintuitive observation that a heavier learner is easier to keep fed. Whenever supply exceeds demand, the surplus transitions pile up in a buffer, which is exactly the structure Section 20.4 builds out as the distributed replay buffer.
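The inequality is small enough to keep as a helper function. The sketch below is illustrative only: `actors_needed` is a hypothetical name, the first call uses the numbers from the text, and the other three calls double one lever at a time with assumed values to show which direction the actor count moves.

```python
import math

def actors_needed(batch_size, learner_step_s, actor_sps):
    """Smallest integer M with M * r_a >= b / t_l (hypothetical helper)."""
    demand = batch_size / learner_step_s   # D = b / t_l, transitions per second
    return math.ceil(demand / actor_sps)   # M >= D / r_a, rounded up

baseline = actors_needed(512, 0.008, 1_000)        # numbers from the text
faster_sim = actors_needed(512, 0.008, 2_000)      # r_a doubled (assumed value)
bigger_batch = actors_needed(1_024, 0.008, 1_000)  # b doubled (assumed value)
slower_learner = actors_needed(512, 0.016, 1_000)  # t_l doubled (assumed value)

print(f"baseline       : {baseline}")
print(f"faster sim     : {faster_sim}")
print(f"bigger batch   : {bigger_batch}")
print(f"slower learner : {slower_learner}")
```

Doubling the simulator rate or the learner step time halves the fleet, while doubling the batch doubles it, matching the three levers described above.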
Who: An RL engineer training a control policy for a warehouse-robotics simulator on a single GPU learner plus a pool of CPU actors.
Situation: Training a policy that needed roughly two billion environment steps to converge, on a cloud cluster billed by the hour.
Problem: GPU utilization on the learner hovered around 15 percent; the expensive accelerator spent most of its time waiting for transitions to arrive.
Dilemma: Buy a second, faster learner GPU to "speed things up," or leave the learner alone and instead grow the cheap actor pool that was actually the bottleneck.
Decision: They measured first. One actor produced about 900 environment steps per second; the learner demanded near 60,000 transitions per second, so by $M \ge D / r_a$ they were running far too few actors.
How: They vectorized each actor to a batch of 64 environments and scaled the actor pool from 8 to 80 CPU instances, leaving the single learner untouched.
Result: Learner GPU utilization rose above 85 percent, wall-clock to convergence fell by roughly four times, and total cost dropped because the added capacity was cheap CPUs rather than a second GPU.
Lesson: Size the actor fleet from the throughput inequality, not from intuition. The accelerator is rarely the rollout bottleneck; the supply of environment steps almost always is.
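Plugging the measured figures into the inequality makes the decision concrete. The arithmetic below uses only the approximate numbers quoted in the case (about 900 steps per second per actor, about 60,000 transitions per second of learner demand) and assumes the original 8 actors each ran at that measured rate:

$$M \ge \frac{D}{r_a} \approx \frac{60{,}000}{900} \approx 66.7 \;\Rightarrow\; M = 67, \qquad \frac{S}{D} \approx \frac{8 \times 900}{60{,}000} = 0.12.$$

Under that assumption, the 8-actor pool could supply only about 12 percent of what the learner demanded, which is in line with the observed utilization of roughly 15 percent.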
3. The CPU-GPU Split: Where Does Policy Inference Run? Intermediate
Producing one environment step requires two pieces of work with different hardware appetites. The environment's transition function (a physics integrator, a game engine, a market simulator) is usually CPU-bound and, in its conventional form, does not map cleanly onto a GPU. The policy's forward pass, by contrast, is a neural-network inference that runs fastest on a GPU, especially when batched. These two preferences pull in opposite directions, and the architecture you choose is essentially a decision about where to place the policy inference relative to the environment simulation.
One option runs a small policy directly on the actor's CPU. The actor steps its vectorized environments and computes actions locally with no network round trip, which is simple and has low latency, but it wastes a GPU's inference speed and forces the policy to stay small enough to run acceptably on a CPU. This is the design implied by the cheap-CPU-actor fleet in Figure 20.3.1. The opposite option, central batched inference, gathers observations from many actors onto a dedicated inference GPU, runs one large batched forward pass, and returns actions; this keeps the policy on the hardware it prefers and lets a big model serve many actors, at the cost of a network hop on every step. That second design is the heart of the SEED RL architecture, which Section 20.6 develops in full alongside Ape-X and R2D2.
Engineers sometimes discover that shrinking the policy until it runs on the actor's CPU beats a fancy central-inference setup, not because CPU inference is fast, but because the network round trip on every single environment step was quietly dominating everything. The cheapest packet is the one you never send. Always measure the round trip before you architect around avoiding it.
There is no universally correct placement; the right answer depends on policy size, environment cost, and network latency. A heavy vision policy serving a slow, expensive simulator favors central batched inference, because the forward pass dominates and a GPU is worth the network hop. A tiny policy on a fast, cheap environment favors local CPU inference, because the round trip would cost more than the inference it saves. The throughput inequality from Section 2 still governs either way: whichever placement yields the higher sustained environment-steps-per-second per dollar is the one to pick.
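The trade-off can be made concrete with a deliberately crude timing model. Everything in this sketch is an assumption for illustration: the function names are hypothetical, CPU inference is modeled as a fixed cost per observation, central GPU inference as one flat cost per batch plus a network round trip, and all of the timings are invented, not measured.

```python
def local_cpu_sps(num_envs, env_step_s, cpu_infer_s_per_obs):
    """Env-steps/s when the actor runs the policy on its own CPU."""
    # Assumed model: one batched env step plus per-observation CPU inference.
    step_time = env_step_s + num_envs * cpu_infer_s_per_obs
    return num_envs / step_time

def central_gpu_sps(num_envs, env_step_s, round_trip_s, gpu_infer_s):
    """Env-steps/s when each step pays a round trip to an inference GPU."""
    # Assumed model: one batched env step, one network hop, one batched forward pass.
    step_time = env_step_s + round_trip_s + gpu_infer_s
    return num_envs / step_time

# Scenario A (invented numbers): tiny policy, fast and cheap environment.
tiny_local = local_cpu_sps(64, env_step_s=0.001, cpu_infer_s_per_obs=5e-6)
tiny_central = central_gpu_sps(64, env_step_s=0.001, round_trip_s=0.002, gpu_infer_s=0.0005)

# Scenario B (invented numbers): heavy vision policy, slow simulator.
heavy_local = local_cpu_sps(64, env_step_s=0.050, cpu_infer_s_per_obs=0.002)
heavy_central = central_gpu_sps(64, env_step_s=0.050, round_trip_s=0.002, gpu_infer_s=0.004)

print(f"tiny policy  : local {tiny_local:,.0f} vs central {tiny_central:,.0f} env-steps/s")
print(f"heavy policy : local {heavy_local:,.0f} vs central {heavy_central:,.0f} env-steps/s")
```

With these made-up timings, the round trip dominates in the tiny-policy scenario and CPU inference dominates in the heavy-policy one, so the better placement flips. A real decision would replace each constant with a measurement and divide by the hardware cost.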
4. Heterogeneous Resources and GPU-Accelerated Simulators Advanced
The classic rollout fleet is deliberately heterogeneous: many cheap CPU actors and a few expensive GPU learners, sized so the cluster spends its money where it produces value. This is a cost-aware placement problem, and it is exactly the kind of mixed-resource scheduling that Chapter 33 treats in general: the cluster scheduler must pack CPU actor jobs and GPU learner jobs onto the right node types and keep them co-located enough that the transition stream does not cross a slow network link. Getting this placement wrong (scheduling actors far from the learner, or starving the learner of CPU nodes) undoes the throughput math no matter how cleverly the code is vectorized.
A more recent shift rewrites the split entirely. GPU-accelerated simulators such as Isaac Gym and Brax run thousands of environment instances directly on the GPU, as a single massive batched simulation, so the environment step and the policy forward pass live on the same device and the CPU-to-GPU transfer of observations disappears. When the simulator itself is on the GPU, the rollout no longer waits on slow CPU physics or on the network, and a single GPU can produce environment steps at rates that previously required a large CPU cluster. The bottleneck moves: it is no longer the supply of transitions but the learner's ability to consume them and the memory needed to hold thousands of parallel environment states. We return to this frontier below.
Whichever simulator hardware you use, the transitions still have to travel from where they are produced to where they are consumed, and that travel is not free. Each transition carries an observation, an action, a reward, and bookkeeping; serializing those structures and shipping them over the network costs CPU time on both ends and bandwidth in between. This is the same communication tax that Chapter 4 quantifies for collective operations, applied here to the experience stream. When observations are large (raw frames, point clouds), the serialization and network cost can rival the simulation cost, and practical systems compress observations, send only deltas, or move the policy to the data rather than the data to the policy.
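A quick way to see the size of this tax is to measure the payload directly. The sketch below is a toy under stated assumptions: the 64x64 RGB frame is synthetic and hypothetical (a flat background with one moving sprite), so it compresses far better than real frames would, and zlib plus an XOR delta stand in for whatever codec a production system might actually use.

```python
import zlib

import numpy as np

# Hypothetical 64x64 RGB observation: flat background with one bright sprite.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
frame[10:18, 20:28] = 200

# The next observation: same scene, sprite moved one pixel to the right.
next_frame = np.zeros_like(frame)
next_frame[10:18, 21:29] = 200

raw_bytes = next_frame.nbytes                                # uncompressed payload
compressed_bytes = len(zlib.compress(next_frame.tobytes()))  # whole frame, compressed

# Delta encoding: XOR is nonzero only where pixels changed, and it is lossless
# because XOR-ing the delta with the previous frame restores the next frame.
delta = np.bitwise_xor(frame, next_frame)
delta_bytes = len(zlib.compress(delta.tobytes()))
restored = np.bitwise_xor(frame, delta)

ASSUMED_SPS = 1_000  # assumed per-actor step rate, for scale only
print(f"raw        : {raw_bytes} B/step, {raw_bytes * ASSUMED_SPS / 1e6:.1f} MB/s per actor")
print(f"compressed : {compressed_bytes} B/step")
print(f"delta      : {delta_bytes} B/step, lossless = {np.array_equal(restored, next_frame)}")
```

On real observations the savings depend heavily on content, and compression spends CPU on both ends, so it only pays off when the network, not the CPU, is the tighter constraint.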
You do not write the vectorized stepping loop of Code 20.3.1 by hand in practice. Gymnasium ships a vector-environment API that batches many environment instances behind one step call, and EnvPool implements the same idea in C++ with a thread pool, reaching far higher environment-steps-per-second than a pure-Python loop:
# pip install gymnasium envpool
import gymnasium as gym
import numpy as np
# Gymnasium: B independent envs stepped together, one batched call.
envs = gym.make_vec("CartPole-v1", num_envs=256, vectorization_mode="sync")
obs, _ = envs.reset(seed=0)
actions = np.zeros(256, dtype=np.int64) # your policy would produce these
obs, rewards, terminated, truncated, info = envs.step(actions) # 256 transitions
print(obs.shape, rewards.shape) # (256, 4) (256,)
# EnvPool: the same batched stepping, implemented in C++ for much higher throughput.
# import envpool
# pool = envpool.make("CartPole-v1", env_type="gymnasium", num_envs=256)
The hand-written stepping loop reduces to a single gym.make_vec call. Gymnasium and EnvPool handle the batched reset, the per-environment auto-reset on episode end, and the conversion of $B$ environments into one batched observation tensor; EnvPool additionally moves the stepping loop out of Python into a C++ thread pool.
5. When Collection Stops Being the Bottleneck Advanced
For most of RL's history the rollout was the bottleneck, which is why this section exists. The arithmetic of Section 2 assumed actor supply was the scarce quantity and the learner was easy to keep busy. GPU-accelerated simulators flip that assumption: when a single GPU produces millions of environment steps per second, the learner, not the actor, becomes the constraint, and the design question changes from "how many actors do I need" to "how do I make the learner consume transitions fast enough to use them all before they go stale." That shift reshapes the rest of this chapter, because off-policy correction (Chapter 10 covers the optimization background) and replay-buffer design both exist to let a learner safely consume experience that its current policy did not generate.
It also reframes the synchronous-versus-asynchronous choice that Chapter 30 revisits for multi-agent settings. When actors and learner run on the same GPU in lockstep, the rollout is naturally synchronous and there is no policy staleness; when thousands of CPU actors stream into a remote learner, some actors are always running a slightly old policy, and the system must tolerate that lag. The throughput math is the same in both regimes, but where the bottleneck sits decides which complications you must engineer around, and that is the thread Section 20.4 picks up by giving the surplus transitions a place to live.
The dominant rollout-scaling trend of the past few years moves the environment onto the accelerator. NVIDIA's Isaac Lab (the successor to Isaac Gym) and Google DeepMind's Brax and MuJoCo XLA (MJX) run tens of thousands of robotics environments as one batched GPU simulation, and the Madrona engine generalizes the idea to many environment types, reporting millions of environment steps per second from a single GPU. Recent work pushes this further with end-to-end JAX RL stacks where simulation, policy inference, and the learner update all stay on-device and compile into one graph, removing the CPU-to-GPU transfer and the network hop entirely for single-node training. The research questions have shifted accordingly: how to scale these on-GPU rollouts across multiple GPUs without reintroducing the serialization tax, how to keep thousands of parallel environment states in limited GPU memory, and how to balance a learner against a simulator that can now out-produce it. The actor-count inequality of Section 2 still holds; the frontier is making $r_a$ so large that $M$ collapses toward one very busy device.
A learner processes a batch of $b = 1024$ transitions per gradient step and each step takes $t_\ell = 20$ milliseconds. Each CPU actor sustains $r_a = 800$ environment steps per second. Using the inequality $M \ge b / (t_\ell\, r_a)$ from Section 2, compute the minimum number of actors needed to keep the learner fed. Now suppose you replace the CPU actors with a single GPU-accelerated simulator producing $2 \times 10^6$ steps per second; explain in words what becomes the new bottleneck and why adding more simulators would not help.
Extend Code 20.3.1 to sweep the batch size $B$ over the values 1, 4, 16, 64, 256, and 1024, recording environment-steps-per-second at each. Plot or tabulate throughput against $B$. Identify the point where the curve flattens, the batch size beyond which adding more environments per process stops helping, and explain what fixed cost has been fully amortized at that point. Relate the flattening to the CPU-GPU split of Section 3: would the flattening point move if the per-step work were a GPU forward pass instead of CPU arithmetic?
An actor produces transitions whose observations are raw $84 \times 84 \times 4$ uint8 image stacks (one byte per value). At 1,000 environment steps per second, estimate the bytes per second one actor must serialize and ship to the learner. For a fleet of 80 such actors, compare the aggregate transition bandwidth against a 10 gigabyte-per-second network link, using the same style of estimate as the communication-cost reasoning in Chapter 4. At what fleet size does the network, rather than the simulator, become the binding constraint, and what two changes from Section 4 would you make to push that limit back?