"I generated experience all night. The learner read one batch, looked at the other nine million, and quietly suggested I slow down."
An Actor That Outran Its Learner
A distributed reinforcement learning system is a two-stage pipeline, and like every pipeline it runs at the speed of its slower stage. Actors generate experience at some sampling throughput; the learner consumes experience at some learning throughput; the system as a whole advances only as fast as the smaller of the two. Adding actors to a learner-bound system buys nothing but a growing pile of stale experience, and adding learner GPUs to an actor-bound system buys nothing but idle silicon waiting for data. The job of scaling an RL system is therefore not "add workers" but "find which stage is starving and feed it", and the single number that ties the two stages together is the replay ratio, the count of gradient updates the learner performs per environment step the actors take. This section turns that balance into something you can measure, plot, and tune.
The previous section weighed synchronous against asynchronous actor-learner designs, a choice about when experience moves between the two halves of the system. This section asks a sharper and more practical question: once the design is fixed, how fast does the whole thing actually run, and what is holding it back? Almost every distributed RL system you will profile is limited by exactly one of two throughputs, and the symptoms of each are distinct enough that a few minutes of measurement tells you which one, and therefore where to spend money. The trap, repeated in production again and again, is to scale the stage that is already fast. Figure 20.8.1 frames the system as the two-stage pipeline whose narrower neck caps everything, the picture the rest of the section makes quantitative.
1. Two Throughputs, One Pipeline Beginner
Name the two quantities precisely, because the whole section hangs on them. Sampling throughput is the number of environment steps, or experience tuples, the actors collectively generate per second. With $N$ identical actors each running at $r_{\text{actor}}$ steps per second, the sampling throughput is $N \cdot r_{\text{actor}}$, and it grows linearly as you add actors until some shared resource (the network, the replay buffer, the inference server that serves actor policies) saturates. Learning throughput is the number of experience tuples the learner consumes per second through its gradient updates: a learner doing $u$ updates per second on batches of size $B$ consumes $r_{\text{learn}} = u \cdot B$ tuples per second. These two numbers are produced by entirely different hardware (actors are often CPU-bound simulators; the learner runs on GPUs), which is exactly why they rarely match by accident.
Because the actors feed the learner through a queue, the system is a pipeline, and a pipeline's steady-state throughput is the throughput of its slowest stage. This is the same arithmetic as the speedup ceilings of Section 3.5: a serial bottleneck caps the whole, and parallelizing the parts that are already fast cannot move the cap. Here the cap is the slower of sampling and learning,
$$T_{\text{system}} = \min\!\left(\underbrace{N \cdot r_{\text{actor}}}_{\text{sampling}},\; \underbrace{u \cdot B}_{\text{learning}}\right).$$

Everything downstream follows from which term wins. If sampling is smaller, the learner periodically empties the queue and its GPU stalls waiting for the next batch; the system is actor-bound. If learning is smaller, experience arrives faster than it can be consumed, the queue grows without bound, and old experience either spills or is trained on long after the policy that produced it has moved on; the system is learner-bound. The two regimes call for opposite remedies, so diagnosing which one you are in is the first move.
A distributed RL system runs at $\min(\text{sampling rate}, \text{learning rate})$, not at the sum and not at the faster of the two. Doubling the actors in a learner-bound system leaves system throughput unchanged and only deepens the backlog; doubling the learner GPUs in an actor-bound system leaves throughput unchanged and only idles the new GPUs. Every scaling decision in RL begins by identifying the binding stage, because spending on the non-binding stage is spending on nothing.
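As a quick numeric check of the formula, the sketch below computes both rates from the section's symbols ($N$, $r_{\text{actor}}$, $u$, $B$) and names the binding stage. The function name `binding_stage` and the example numbers are illustrative assumptions, not part of any framework.

```python
def binding_stage(n_actors, r_actor, updates_per_s, batch_size):
    """Return (sampling, learning, system, regime) for the two-stage pipeline."""
    sampling = n_actors * r_actor          # N * r_actor: tuples produced per second
    learning = updates_per_s * batch_size  # u * B: tuples consumed per second
    system = min(sampling, learning)       # the pipeline runs at its slower stage
    if sampling < learning:
        regime = "actor-bound"
    elif learning < sampling:
        regime = "learner-bound"
    else:
        regime = "balanced"
    return sampling, learning, system, regime


# Hypothetical numbers: actors at 1,000 steps/s each; learner at 40 updates/s, batch 256.
print(binding_stage(8, 1_000, 40, 256))    # (8000, 10240, 8000, 'actor-bound')
print(binding_stage(16, 1_000, 40, 256))   # (16000, 10240, 10240, 'learner-bound')
```

Doubling the actors from 8 to 16 raises system throughput only from 8000 to 10240 tuples per second, because the learner becomes the binding stage partway through.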
2. The Replay Ratio Ties the Two Together Intermediate
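The cleanest single knob connecting sampling and learning is the replay ratio (also called the update-to-data or UTD ratio). It is conventionally quoted as gradient updates per environment step; this section uses the sample-normalized form, tuples consumed per tuple generated, which differs from the conventional count only by the batch-size factor $B$ and makes $\rho = 1$ the point of exact balance. If the learner consumes $r_{\text{learn}}$ tuples per second and the actors produce $r_{\text{sample}}$ tuples per second, the replay ratio is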
$$\rho = \frac{r_{\text{learn}}}{r_{\text{sample}}} = \frac{u \cdot B}{N \cdot r_{\text{actor}}}.$$

A replay ratio of $\rho = 1$ means each experience tuple is consumed, on average, exactly once: sampling and learning are balanced and the queue neither grows nor starves. A ratio $\rho < 1$ means the learner is too slow to keep up and the system is sampling-rich, learner-bound; experience accumulates and ages. A ratio $\rho > 1$ means the learner revisits each tuple multiple times (sample reuse via the replay buffer of Section 20.4) and the actors cannot keep the learner fed; the system is learner-rich, actor-bound. The replay ratio is therefore not just an efficiency statistic, it is the dial you turn to move the bottleneck from one stage to the other.
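The ratio is simple enough to compute directly. The sketch below evaluates $\rho$ from the four symbols in the formula, reports the corresponding gradient updates per environment step ($\rho / B$), and classifies the regime. The helper names and the tolerance band `tol` are assumptions chosen for illustration.

```python
def replay_ratio(updates_per_s, batch_size, n_actors, r_actor):
    """Sample-normalized replay ratio: rho = (u * B) / (N * r_actor)."""
    return (updates_per_s * batch_size) / (n_actors * r_actor)


def updates_per_env_step(updates_per_s, n_actors, r_actor):
    """Gradient updates per environment step: u / (N * r_actor) = rho / B."""
    return updates_per_s / (n_actors * r_actor)


def regime(rho, tol=0.05):
    """Classify the pipeline from rho; `tol` is an arbitrary balance band."""
    if rho < 1.0 - tol:
        return "learner-bound"   # experience accumulates and ages
    if rho > 1.0 + tol:
        return "actor-bound"     # learner reuses tuples or waits for data
    return "balanced"


# Hypothetical learner: 50 updates/s at batch 256; actors at 800 steps/s each.
for n in (8, 16, 32):
    rho = replay_ratio(50, 256, n, 800)
    print(n, rho, regime(rho))
# 8 2.0 actor-bound
# 16 1.0 balanced
# 32 0.5 learner-bound
```

Sweeping the actor count with a fixed learner moves the system through all three regimes, which is the sense in which the ratio acts as a dial.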
The ratio also carries a statistical cost that this systems view must respect. Pushing $\rho$ high to keep an expensive learner busy means each gradient step leans on increasingly reused, increasingly off-policy data, the staleness this chapter treats in Section 20.5 with V-trace and related corrections. So the replay ratio sits at the meeting point of a systems constraint (keep both stages busy) and a learning constraint (do not over-train on stale experience), and the right value is the one that respects both. High-replay-ratio off-policy methods are a live research area precisely because the systems incentive to crank $\rho$ up collides with the optimization damage it can cause.
The reflex of scale-out is "throw more machines at it", and for the exact-gradient data parallelism of Section 1.1 that reflex is nearly right: more workers, proportionally more throughput. Distributed RL breaks that reflex. It is two coupled stages running on different hardware, and adding machines to the wrong stage moves nothing. The lesson this section folds into the book's spine is that scaling is not always multiplication; sometimes it is balance. The skill is reading where the system is starved and feeding precisely that stage, a discipline that returns in Chapter 30, where distributed multi-agent RL training stacks the same actor-learner imbalance across many agents at once.
3. Profiling: Find the Starved Stage Intermediate
You cannot rebalance what you have not measured, and RL systems hide their bottleneck behind a misleadingly busy dashboard: actors can look fully utilized while the learner starves, or the learner can look pegged while actors idle on a slow simulator. Four measurements, taken together, locate the bottleneck unambiguously, and they connect directly to the evaluation discipline of Section 5.4 on communication-to-computation ratios. First, actor steps per second, summed across actors, gives the sampling rate. Second, learner samples per second (updates per second times batch size) gives the learning rate. Third, queue depth in the replay buffer or transport: a queue pinned at zero means the learner is starved (actor-bound); a queue pinned at its cap means the learner cannot drain it (learner-bound). Fourth, policy lag, the gap in update steps between the policy that generated a sampled tuple and the current learner policy, which measures how stale the consumed experience actually is.
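Policy lag is only measurable if each tuple records which policy produced it. The following is a minimal sketch, assuming actors stamp every tuple with the learner update count of the weights they acted under. `Experience`, `LagTracker`, and their fields are hypothetical names, not any framework's API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Experience:
    payload: object        # observation/action/reward data (opaque here)
    policy_version: int    # learner update count of the weights the actor used


class LagTracker:
    """Reports queue depth and mean policy lag for each batch the learner pulls."""

    def __init__(self):
        self.queue = deque()
        self.learner_version = 0   # gradient updates completed so far

    def push(self, payload, actor_policy_version):
        self.queue.append(Experience(payload, actor_policy_version))

    def pull_batch(self, batch_size):
        """Pop up to batch_size tuples; return (depth_after, mean_lag), or None if starved."""
        if not self.queue:
            return None            # empty queue: the actor-bound signature
        n = min(batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        lags = [self.learner_version - e.policy_version for e in batch]
        self.learner_version += 1  # one gradient update per pulled batch
        return len(self.queue), sum(lags) / len(lags)


tracker = LagTracker()
for step in range(8):
    tracker.push(payload=step, actor_policy_version=0)  # actors never refresh weights here
print(tracker.pull_batch(4))   # (4, 0.0)
print(tracker.pull_batch(4))   # (0, 1.0)
print(tracker.pull_batch(4))   # None
```

Logging the two returned numbers alongside actor steps per second and learner samples per second gives all four measurements from one place.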
The signatures are crisp. An actor-bound system shows a near-empty queue, a learner GPU utilization well below 100 percent, and low policy lag; the fix is more sampling (Section 4). A learner-bound system shows a full or growing queue, a learner GPU pegged near 100 percent, and rising policy lag; the fix is more learning throughput or, where tuples are replayed several times, fewer passes per tuple so that fresh experience drains faster. The code below makes this concrete by modeling the pipeline directly: it sweeps actor count against a fixed learner, reports which stage binds, and then tracks how the replay queue and policy lag grow without bound once the actors overtake the learner.
def system_throughput(n_actors, actor_rate, learner_rate):
    """Sampling rate scales with actors; system runs at the slower stage."""
    sampling = n_actors * actor_rate   # experiences produced per second
    learning = learner_rate            # experiences consumed per second
    return sampling, learning, min(sampling, learning)


def simulate_policy_lag(n_actors, actor_rate, learner_rate, seconds, batch=256):
    """Queued (unconsumed) experience and the resulting policy lag in updates."""
    sampling, learning, _ = system_throughput(n_actors, actor_rate, learner_rate)
    queue = 0.0
    for _ in range(seconds):
        queue += sampling              # actors push new experience
        queue -= learning              # learner pulls and consumes
        if queue < 0.0:
            queue = 0.0                # learner starved, GPU idles
    return queue, queue / batch        # lag = backlog / batch size


learner, actor = 12_000, 1_000         # learner 12k/s; each actor 1k/s

print("=== Sweep actor count at a fixed learner rate (12000 samples/s) ===")
print(f"{'actors':>7} {'sampling/s':>11} {'learning/s':>11} {'system/s':>10} {'bottleneck':>11} {'util':>6}")
for n in (2, 6, 12, 18, 24):
    sampling, learning, system = system_throughput(n, actor, learner)
    side = "learner" if learning < sampling else "actors"
    # util = system throughput as a fraction of the faster stage's capacity
    print(f"{n:>7} {sampling:>11} {learning:>11} {system:>10} {side:>11} {system / max(sampling, learning):>5.0%}")

print("\n=== Policy lag after 60 s, sweeping actor count (learner fixed) ===")
print(f"{'actors':>7} {'sampling/s':>11} {'queue@60s':>10} {'lag(updates)':>13}")
for n in (6, 12, 16, 20, 24):
    q, lag = simulate_policy_lag(n, actor, learner, seconds=60)
    print(f"{n:>7} {n * actor:>11} {q:>10.0f} {lag:>13.1f}")

print("\n=== Replay ratio when the learner is the bottleneck (12 actors) ===")
n = 12
sampling = n * actor                   # 12000 experiences/s produced
for lr in (6_000, 12_000, 24_000):
    bound = "learner-bound" if lr < sampling else "actor-bound"
    print(f"learner={lr:>6}/s replay_ratio={lr / sampling:>4.2f} "
          f"system={min(sampling, lr):>6}/s {bound}")
=== Sweep actor count at a fixed learner rate (12000 samples/s) ===
 actors  sampling/s  learning/s   system/s  bottleneck   util
      2        2000       12000       2000      actors   17%
      6        6000       12000       6000      actors   50%
     12       12000       12000      12000      actors  100%
     18       18000       12000      12000     learner   67%
     24       24000       12000      12000     learner   50%

=== Policy lag after 60 s, sweeping actor count (learner fixed) ===
 actors  sampling/s  queue@60s  lag(updates)
      6        6000          0           0.0
     12       12000          0           0.0
     16       16000     240000         937.5
     20       20000     480000        1875.0
     24       24000     720000        2812.5

=== Replay ratio when the learner is the bottleneck (12 actors) ===
learner=  6000/s replay_ratio=0.50 system=  6000/s learner-bound
learner= 12000/s replay_ratio=1.00 system= 12000/s actor-bound
learner= 24000/s replay_ratio=2.00 system= 12000/s actor-bound
Read the three blocks of Output 20.8.1 together and the section's whole argument is visible in the numbers. Throughput climbs with actors up to twelve, where sampling exactly matches the learner, then stops; the extra actors at 18 and 24 add only queue depth, and the "util" column, which reports system throughput as a fraction of the faster stage's capacity, falls to 67 and then 50 percent, money spent on nothing. The middle block is the cost of ignoring that: at sixteen actors the queue reaches 240000 tuples after a minute, a backlog of roughly 940 batches, so a tuple entering the queue at that point waits nearly a thousand updates before it is trained on. The final block makes the replay-ratio point precise: once the actors cap the system at 12000 per second, doubling the learner rate to reach a replay ratio of 2.0 leaves throughput flat and merely re-reads stale data.
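The simulation above lets the backlog grow without limit, whereas the profiling discussion assumes a queue with a cap. The variant below is a hedged sketch of the bounded case, assuming a drop-oldest overflow policy (the text says only that old experience "spills") and the same per-second fluid model. `simulate_bounded_queue` and the 100,000-tuple cap are illustrative choices.

```python
def simulate_bounded_queue(sampling, learning, seconds, cap, batch=256):
    """Per-second fluid model of a bounded FIFO that discards overflow.

    Returns (final_queue_depth, tuples_dropped, lag_in_updates).
    """
    queue = 0.0
    dropped = 0.0
    for _ in range(seconds):
        queue += sampling                    # actors push new experience
        queue = max(queue - learning, 0.0)   # learner drains; floor at empty
        if queue > cap:
            dropped += queue - cap           # overflow never reaches the learner
            queue = float(cap)               # queue stays pinned at its cap
    return queue, dropped, queue / batch


# 16 actors at 1,000 steps/s against a 12,000 samples/s learner, as in the sweep above.
depth, dropped, lag = simulate_bounded_queue(16_000, 12_000, seconds=60, cap=100_000)
print(depth, dropped, lag)   # 100000.0 140000.0 390.625
```

With a cap, policy lag stops growing at `cap / batch` updates, but the surplus sampling is now discarded outright: the learner-bound system pays for 140,000 tuples in a minute that are never trained on.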
4. Scaling Each Side Intermediate
Once the binding stage is known, the remedy is specific. To raise sampling throughput, add actors. Sampling scales almost embarrassingly well because actors are independent until they hit a shared resource: more actor processes, more environment replicas, or batched-inference actor servers in the SEED-RL style of Section 20.6 that pack many environments behind one accelerator. The ceiling on this side is usually the inference cost of the actor policy or the bandwidth into the replay buffer, not the actors themselves. To raise learning throughput, make the learner faster, and the canonical move is to turn the single learner into a data-parallel learner: shard each large batch across several GPUs, compute partial gradients, and combine them with the all-reduce of Chapter 15. A data-parallel learner consuming $G$ GPUs raises $r_{\text{learn}}$ by close to a factor of $G$, moving the cap upward so that more actors can be productively added.
The two remedies interact, and the goal is to walk both stages up together so the replay ratio stays in its healthy band. Add a data-parallel learner GPU, the learner rate rises, the system becomes actor-bound, so add actors until sampling catches up, at which point the learner binds again. Balanced scaling is this back-and-forth, and the largest RL systems are tuned so that neither the actor fleet nor the learner cluster is ever the clear loser. That is the sense in which RL scaling is pipeline balancing rather than simple worker multiplication: you are co-sizing two stages so that $\min(\cdot, \cdot)$ has no slack on either argument.
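The co-sizing described above can be sketched as a small planning loop: for each learner GPU count, find the smallest actor count whose sampling rate keeps the learner fed. This assumes ideal linear learner scaling ($r_{\text{learn}} = G \cdot r_{\text{gpu}}$), which real all-reduce overhead will undercut; `balanced_scaling_plan` and the rates used are hypothetical.

```python
import math


def balanced_scaling_plan(r_actor, r_gpu, max_gpus):
    """For G = 1..max_gpus, size the actor fleet to match the learner.

    Returns a list of (gpus, actors, system_throughput, replay_ratio).
    Assumes the learner rate scales linearly with GPU count (idealized).
    """
    plan = []
    for gpus in range(1, max_gpus + 1):
        learning = gpus * r_gpu                   # idealized data-parallel learner rate
        actors = math.ceil(learning / r_actor)    # fewest actors that keep it fed
        sampling = actors * r_actor
        plan.append((gpus, actors, min(sampling, learning), learning / sampling))
    return plan


# Hypothetical rates: 1,500 steps/s per actor, 10,000 samples/s per learner GPU.
for gpus, actors, system, rho in balanced_scaling_plan(1_500, 10_000, 4):
    print(f"{gpus} GPU(s) -> {actors} actors, system {system}/s, rho {rho:.2f}")
```

Because actors come in whole units, the ratio lands slightly below 1 at most steps rather than exactly on it; the point of the exercise is that each added GPU must be accompanied by a matching block of actors for the cap to move.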
Who: A reinforcement learning team training a robotics control policy on a simulated-physics environment.
Situation: Throughput had plateaued at roughly 30000 environment steps per second despite a healthy budget, and training a policy to convergence took five days.
Problem: The instinct, backed by a vendor quote, was to double the learner from four GPUs to eight to "train faster".
Dilemma: Spend the budget on more learner GPUs, which speeds learning throughput, or on more actor machines, which speeds sampling throughput, with no clear evidence which stage was actually binding.
Decision: Before purchasing, they profiled the four numbers from Section 3 and found the replay queue pinned near empty, learner GPU utilization at 38 percent, and policy lag near zero, the textbook signature of an actor-bound system.
How: They left the learner untouched, tripled the CPU actor pool that ran the physics simulator, and added a batched-inference actor server so a single GPU served policy actions to many environments at once.
Result: Sampling throughput rose to match the learner, system throughput climbed to about 78000 steps per second, the learner GPUs reached 90 percent utilization, and convergence wall-clock fell from five days to under two, at a fraction of the cost of the eight-GPU learner.
Lesson: Profile before you provision. The expensive component was already fast; the cheap one was starving it. Buying more of the fast stage would have lowered utilization, not raised throughput.
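The team's diagnosis can be written down as a rule of thumb. The sketch below classifies a profile from three of the Section 3 signals; the thresholds are illustrative assumptions rather than calibrated values, and `diagnose` is a hypothetical helper.

```python
def diagnose(queue_fill, learner_gpu_util, lag_growth_per_min):
    """Classify a pipeline from three profiling signals.

    queue_fill:         queue depth divided by its capacity, in [0, 1]
    learner_gpu_util:   learner GPU utilization, in [0, 1]
    lag_growth_per_min: change in policy lag, in update steps per minute
    """
    # Thresholds below are assumptions chosen for illustration only.
    if queue_fill < 0.05 and learner_gpu_util < 0.70:
        return "actor-bound"
    if queue_fill > 0.95 and learner_gpu_util > 0.90 and lag_growth_per_min > 0:
        return "learner-bound"
    return "balanced or inconclusive"


# The profile from the case study: queue near empty, GPU at 38 percent, lag near zero.
print(diagnose(queue_fill=0.0, learner_gpu_util=0.38, lag_growth_per_min=0.0))  # actor-bound
```

A check like this is cheap enough to run on every training job, so that a provisioning request arrives with the binding stage already identified.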
The pipeline arithmetic above is exactly what a production RL framework exposes as configuration. In Ray RLlib you declare the sampling stage (how many rollout actors) and the learning stage (how many data-parallel learner GPUs) as two independent counts, and the framework runs the queue, the all-reduce across learners, and the policy-weight broadcast back to actors. Rebalancing a bottleneck is a two-line config change rather than a rewrite of the experience-transport code:
# pip install "ray[rllib]"
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")           # placeholder env (an assumption); substitute your own
    .env_runners(num_env_runners=64)      # SAMPLING stage: 64 rollout actors
    .learners(num_learners=4,             # LEARNING stage: 4 data-parallel
              num_gpus_per_learner=1)     # learner GPUs, all-reduced for you
    .training(train_batch_size=16_000)    # tuples consumed per learner step
)

algo = config.build_algo()                # framework runs the queue + all-reduce
for _ in range(100):
    algo.train()                          # one balanced sample-then-learn cycle
The two stage sizes are num_env_runners and num_learners; tuning the bottleneck means changing those numbers, not the transport layer.

5. Why Balance Is the Whole Game Advanced
It is worth saying plainly why RL resists the simple scale-out story that data-parallel supervised training enjoys. In supervised training every worker does the same kind of work, and the all-reduce that combines them is exact, so $K$ workers genuinely approximate $K$ times the throughput until communication costs bite. RL splits the work into two qualitatively different jobs, sampling and learning, run on different hardware at rates that are coupled only through a queue. The system's throughput is a $\min$, and a $\min$ has the unforgiving property that improving the larger argument does nothing. That is why a beautifully optimized learner can sit half-idle behind a slow simulator, and why a thousand actors can drown a learner that physically cannot keep up.
The practical consequence is a habit, not a formula: instrument both stages, watch the queue and the policy lag, and treat any large imbalance as a signal that you are paying for capacity you cannot use. The next section turns this diagnosis into practice, surveying the frameworks (Ray RLlib and the broader distributed-RL stacks) that implement the actor-learner pipeline, the replay transport, and the data-parallel learner as components you assemble rather than build, so that the balancing this section describes becomes a matter of configuration and profiling rather than custom systems code. That tour begins in Section 20.9.
The replay ratio has become a first-class object of study. A line of work on sample-efficient off-policy RL pushes the replay ratio to 8, 16, or higher to extract more learning from each costly environment step, using periodic network resets and regularization to counter the "primacy bias" and overfitting that high reuse otherwise causes (the lineage of Nikishin et al.'s reset work and the subsequent high-UTD continuous-control methods through 2024 and 2025). A complementary thread asks the compute-optimal version of this section's question directly: given a fixed budget, how should it be split between actors and learners, and how does the answer shift as accelerators get faster relative to simulators? With reinforcement learning now central to post-training large language models, the actor (generation) stage and the learner (policy-gradient) stage are both enormous, and 2024 to 2026 work on RLHF and reasoning-model infrastructure (frameworks in the lineage of OpenRLHF and large-scale verifier-based RL) treats the sampling-versus-learning balance of this section as a primary scaling axis, not an afterthought. The open question is no longer whether the pipeline must be balanced but how to balance it automatically as both stages grow.
For each profile, state whether the system is actor-bound or learner-bound and name the single most cost-effective fix: (a) replay queue pinned at its 1-million cap, learner GPU at 99 percent, policy lag rising by 400 update-steps per minute; (b) replay queue averaging 12 tuples, learner GPU at 35 percent, policy lag near zero; (c) sampling rate 50000/s, learning rate 50000/s, queue stable at half its cap, both stages above 90 percent utilization. For (c), explain why adding hardware to either stage alone would lower its utilization without raising system throughput.
Modify Code 20.8.1 so the learner is data-parallel: add a parameter num_learner_gpus that multiplies the learner rate by an all-reduce efficiency factor (model it as $G \cdot (1 - 0.04 \cdot (G-1))$ to reflect collective overhead growing with GPU count, per Chapter 15). Sweep num_learner_gpus from 1 to 8 at a fixed 20-actor sampling stage, print the resulting replay ratio and system throughput, and identify the GPU count at which the learner stops being the bottleneck. Explain why adding learner GPUs beyond that point no longer raises system throughput.
A cluster runs 40 actors at 1500 steps/s each and a single learner at 36000 samples/s. Compute the sampling rate, the system throughput, the replay ratio, and the steady-state utilization of each stage. Suppose actor machines cost \$0.50/hour and the learner GPU costs \$3.00/hour; compute the cost per million environment steps. Then propose a rebalanced configuration (changing only the actor count) that maximizes utilization of both stages, recompute the cost per million steps, and quantify the savings. State which of the four profiling signals from Section 3 would have revealed the original imbalance fastest.