"I generated a beautiful trajectory under the policy from three updates ago. The learner thanked me politely and corrected most of it away."
An Actor Rolling Out Against a Lagging Policy
Every distributed reinforcement-learning system must choose whether its actors wait for the learner or never wait, and that single choice sets both its throughput ceiling and how far its data drifts off-policy. A synchronous system stops all actors at a barrier, lets the learner update once, and resumes everyone with the fresh policy: the data is clean and on-policy, but the slowest actor sets the pace and the learner-update window is dead time. An asynchronous system lets actors run flat out against whatever policy they last grabbed: throughput climbs, but the experience is generated by stale policies and must be corrected before it can be trusted. This is the same synchronous-versus-asynchronous tension you met in distributed optimization (Section 10.10), transplanted from gradients to trajectories. This section pins the trade-off down with numbers and tells you which regime wins where.
The previous section built out the Ape-X, R2D2, and SEED RL designs, each of which takes a specific stance on a single question: when an actor finishes a trajectory, does it wait for the learner before continuing, or not? That question looks like an implementation detail and is in fact the most consequential architectural decision in distributed RL. It governs the throughput of the whole system on one side and the statistical quality of the data feeding the learner on the other, and those two pull in opposite directions. We will treat the choice as a single axis, measure both ends of it, and then map real systems onto it.
1. The Same Tension, Now in Reinforcement Learning Intermediate
In data-parallel supervised training the synchronous-asynchronous axis is about gradients. Synchronous SGD makes every worker compute a gradient on the current parameters, averages them at a barrier with an all-reduce, and steps once; asynchronous SGD lets each worker push its gradient to a parameter server and pull whatever parameters are there, so gradients are computed against slightly stale weights. Section 10.10 framed the result as a throughput-versus-staleness trade-off: dropping the barrier raises hardware utilization but injects gradient staleness that, past a point, slows or destabilizes convergence.
Reinforcement learning has the identical axis with one extra twist. In RL the workers are actors that generate experience by acting in an environment, and the learner trains on that experience. If actors never wait, they generate trajectories under an old policy while the learner has already moved on, so the data is not merely computed against stale parameters; it is distributed differently from the policy being optimized. That is off-policy drift, and unlike gradient staleness it changes the meaning of the data, not just its freshness. Correcting it is the job of importance sampling and V-trace, built in Section 20.5. The throughput-versus-bias trade-off of RL is thus the throughput-versus-staleness trade-off of SGD with a sharper penalty for going fast.
Removing the barrier between actors and learner always raises sample throughput, because no machine ever idles waiting for another. What it costs is on-policy purity: the experience now comes from policies that lag the learner by some number of updates, and that lag is a bias the algorithm must either tolerate or correct. Synchronous systems pay throughput to keep bias at zero; asynchronous systems pay bias to keep throughput high. Neither is free, and the right choice depends on which currency your problem can spare.
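To make the drift concrete, here is a minimal sketch on a made-up three-action problem. The probabilities and rewards are illustrative assumptions, not values from any system in this chapter. It shows that averaging rewards under the stale behavior policy gives the wrong answer for the policy being optimized, and that weighting each action by the ratio of the two policies recovers the right one.

```python
# Hypothetical numbers: mu is the stale policy the actor used,
# pi is the policy the learner is currently optimizing.
mu = [0.5, 0.3, 0.2]
pi = [0.2, 0.3, 0.5]
reward = [1.0, 2.0, 4.0]  # reward for each of the three actions


def expectation(probs, values):
    """Expected value of `values` when actions are drawn with `probs`."""
    return sum(p * v for p, v in zip(probs, values))


on_policy_value = expectation(pi, reward)   # what the learner wants to estimate
naive_value = expectation(mu, reward)       # what uncorrected stale data estimates

# Importance ratios pi/mu reweight data drawn under mu so it averages like pi.
weights = [p / q for p, q in zip(pi, mu)]
corrected_value = expectation(mu, [w * r for w, r in zip(weights, reward)])

print(f"target-policy value : {on_policy_value:.2f}")
print(f"naive stale average : {naive_value:.2f}")
print(f"importance-corrected: {corrected_value:.2f}")
```

The naive average is biased because the samples themselves are distributed wrongly, which is the sense in which a stale policy changes the meaning of the data and not just its freshness.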
2. Synchronous On-Policy Systems: Clean Data, Idle Machines Intermediate
A synchronous RL system runs in rounds. All actors are handed the current policy parameters $\theta_t$. Each rolls out a fixed amount of experience, the system waits at a barrier until the slowest actor is done, the learner consumes the assembled batch and produces $\theta_{t+1}$, and the round repeats. This is how synchronous PPO and A2C (the synchronous, batched form of A3C) operate. Because every sample in the batch was generated by exactly the policy being updated, the data is on-policy: the policy-gradient estimator is unbiased and the update is as stable as RL updates get. For PPO this matters doubly, because its clipped objective is derived assuming the data comes from a policy close to the current one, and a barrier guarantees that closeness.
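The round structure can be sketched in a few lines. This is a toy under stated assumptions, not a real training loop: `rollout` and `learner_update` are hypothetical stand-ins, threads play the role of actors, and the only thing checked is the on-policy invariant that every sample in a batch carries the version the learner is about to update.

```python
from concurrent.futures import ThreadPoolExecutor

N_ACTORS = 4
ROLLOUT_LEN = 3


def rollout(actor_id, policy_version):
    # Stand-in for acting in an environment: tag each step with the
    # policy version that produced it.
    return [(actor_id, step, policy_version) for step in range(ROLLOUT_LEN)]


def learner_update(policy_version, batch):
    # A real learner would take a gradient step here. This sketch only
    # verifies that the whole batch is on-policy, then bumps the version.
    assert all(version == policy_version for _, _, version in batch)
    return policy_version + 1


def run_rounds(num_rounds):
    policy_version = 0
    batch_sizes = []
    with ThreadPoolExecutor(max_workers=N_ACTORS) as pool:
        for _ in range(num_rounds):
            # Hand every actor the same current policy version.
            futures = [pool.submit(rollout, i, policy_version)
                       for i in range(N_ACTORS)]
            # Barrier: result() blocks until each actor, including the
            # slowest, has finished its rollout.
            batch = [sample for f in futures for sample in f.result()]
            # Actors are idle while the learner produces the next version.
            policy_version = learner_update(policy_version, batch)
            batch_sizes.append(len(batch))
    return policy_version, batch_sizes


final_version, batch_sizes = run_rounds(3)
print(final_version, batch_sizes)
```

Both stalls discussed next are visible in the control flow: the list comprehension over `f.result()` is the barrier, and nothing is submitted to the pool while `learner_update` runs.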
The cost is utilization. Two distinct stalls appear in every round. First, the barrier itself: actors finish at different times because environments have variable episode lengths and machines have variable speed, so all $N$ actors wait for the slowest, the straggler problem from Section 2.7. With heavy-tailed rollout times the expected wait is governed by the maximum of $N$ samples, which grows with $N$, so adding actors gives diminishing returns. Second, the learner-update window: while the learner computes $\theta_{t+1}$, the actors hold the old policy and have nothing to do, so they sit idle. Both stalls are dead time that no amount of fast actor compute can fill.
Figure 20.7.1 makes the two stalls visible: in the top timeline the dashed gaps and the learner block are pure waste, while in the bottom timeline they vanish. That visible waste is exactly what asynchrony reclaims, and exactly what it pays for.
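The straggler stall is easy to estimate by Monte Carlo. The sketch below assumes exponentially distributed rollout times with a 40 ms mean (an illustrative choice) and measures, for several actor counts, how long the barrier takes and how long an average actor sits finished and waiting. The function name and its parameters are ours, not part of any framework.

```python
import random


def mean_barrier_stats(n_actors, mean_rollout=0.040, trials=2000, seed=0):
    """Estimate (mean barrier wait, mean idle time per actor) in seconds."""
    rng = random.Random(seed)
    total_max, total_idle = 0.0, 0.0
    for _ in range(trials):
        # expovariate takes a rate, so pass 1/mean to get the desired mean.
        times = [rng.expovariate(1.0 / mean_rollout) for _ in range(n_actors)]
        slowest = max(times)
        total_max += slowest
        # Idle time: how long each actor waits, finished, for the straggler.
        total_idle += sum(slowest - t for t in times) / n_actors
    return total_max / trials, total_idle / trials


for n in (1, 8, 64):
    barrier, idle = mean_barrier_stats(n)
    print(f"N={n:>2}  mean barrier wait = {barrier * 1000:6.1f} ms  "
          f"mean idle per actor = {idle * 1000:6.1f} ms")
```

The barrier wait keeps growing with the actor count while the mean rollout time stays fixed, so each added actor spends a larger share of the round idle. That is the diminishing return described above.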
3. Asynchronous Systems: No Idle Machines, Stale Data Intermediate
An asynchronous RL system removes the barrier. Each actor holds a local copy of the policy, rolls out a trajectory, ships the experience to the learner (or to a replay buffer, Section 20.4), pulls the most recent published policy, and immediately starts the next rollout. The learner trains continuously on whatever arrives. A3C did this with many CPU actors each updating shared parameters; IMPALA centralized the learner on a GPU and streamed trajectories from a fleet of actors; Ape-X did it with hundreds of actors feeding one prioritized replay buffer. No actor ever waits for another, and the learner never waits for a full synchronized batch, so the system runs near the throughput its slowest single component allows rather than near the speed of its slowest actor.
The price is policy lag. By the time an actor's trajectory reaches the learner, the learner has applied several more updates, so the behavior policy $\mu$ that generated the data is several versions behind the target policy $\pi$ being optimized. A naive policy gradient computed on this data is biased, because the trajectory is sampled from $\mu$ but the gradient is meant for $\pi$. The fix is an importance-sampling correction, weighting each sample by $\pi/\mu$, which IMPALA's V-trace estimator does with truncated weights for stability (Section 20.5). The correction works only while the lag stays bounded; if actors fall arbitrarily far behind, the importance weights explode or get clipped to uselessness, and the learning signal degrades. Asynchronous systems therefore live or die by keeping policy lag small, which is why they publish fresh policies to actors frequently and cap how stale a usable sample may be.
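The two safeguards named above, truncating the importance weight and capping staleness, can be sketched as one small function. This is a simplified illustration, not V-trace itself: the real estimator applies truncated ratios per time step inside a multi-step value target (Section 20.5). The function name, `rho_bar`, and `max_lag` are hypothetical, and their defaults are placeholders rather than recommended settings.

```python
def correction_weight(ratio, sample_version, learner_version,
                      rho_bar=1.0, max_lag=10):
    """Weight to apply to one sample, or None if it is too stale to use.

    ratio is pi(a|s) / mu(a|s) for the action the actor actually took.
    """
    lag = learner_version - sample_version
    if lag < 0:
        raise ValueError("sample cannot come from a policy newer than the learner's")
    if lag > max_lag:
        return None              # staleness cap: drop the sample entirely
    return min(rho_bar, ratio)   # truncation: bound the weight from above


learner_version = 100
samples = [
    # (ratio pi/mu, policy version that generated the sample)
    (0.8, 98),   # mild lag, ratio below the cap: used as is
    (3.5, 95),   # ratio above the cap: truncated to rho_bar
    (1.2, 80),   # lag of 20 exceeds max_lag: dropped
]
for ratio, version in samples:
    print(version, correction_weight(ratio, version, learner_version))
```

Truncation trades a little bias for bounded variance, and the cap keeps the learner from spending updates on samples whose weights would mostly be clipped anyway.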
4. Measuring Both Ends of the Axis Intermediate
The trade-off is easy to state and easy to mismeasure, so we simulate it directly. The code below models eight actors over a fixed two-second wall-clock; rollout times are drawn from an exponential distribution, a simple skewed stand-in for heavy-tailed rollouts (most quick, a few slow stragglers). In the synchronous regime, every round waits for the slowest of the eight actors and then for one learner update, and all data has policy lag zero. In the asynchronous regime, the simulation is event-driven: whichever actor finishes next delivers its sample, the learner publishes one new policy version per delivered sample, and the actor restarts at once under the newest policy, so each consumed sample carries a measurable policy lag. The learner's own clock is tracked but never delays publishing, which is the idealization that the learner keeps up with the actors. We report samples per second and average policy lag for each.
```python
import random

random.seed(0)

N_ACTORS = 8
HORIZON = 2.0          # seconds of simulated wall-clock
LEARN_TIME = 0.010     # learner update cost, seconds
MEAN_ROLLOUT = 0.040   # mean actor rollout time, seconds


def rollout_time():
    # Heavy-tailed: most rollouts quick, a few slow stragglers.
    return MEAN_ROLLOUT * random.expovariate(1.0)


def run_sync():
    t, policy_version, samples, lag_sum = 0.0, 0, 0, 0
    while t < HORIZON:
        durations = [rollout_time() for _ in range(N_ACTORS)]
        slowest = max(durations)        # barrier ends with the straggler
        samples += N_ACTORS
        lag_sum += 0 * N_ACTORS         # every sample is on-policy: lag 0
        t += slowest + LEARN_TIME       # actors idle through both stalls
        policy_version += 1
    return samples, lag_sum, policy_version


def run_async():
    finish = [rollout_time() for _ in range(N_ACTORS)]
    started_under = [0] * N_ACTORS
    published_version, samples, lag_sum, learner_busy_until = 0, 0, 0, 0.0
    while True:
        i = min(range(N_ACTORS), key=lambda k: finish[k])  # next to finish
        t = finish[i]
        if t >= HORIZON:
            break
        lag_sum += published_version - started_under[i]    # policy lag
        samples += 1
        # Learner clock: tracked for reference only; it does not delay publishing.
        learner_busy_until = max(t, learner_busy_until) + LEARN_TIME
        published_version += 1
        started_under[i] = published_version               # restart, fresh policy
        finish[i] = t + rollout_time()                     # never waits
    return samples, lag_sum, published_version


s = run_sync()
a = run_async()
print(f"{'regime':<6} {'samples':>8} {'samples/s':>10} {'updates':>8} {'avg policy lag':>15}")
print(f"{'sync':<6} {s[0]:>8} {s[0]/HORIZON:>10.1f} {s[2]:>8} {s[1]/s[0]:>15.2f}")
print(f"{'async':<6} {a[0]:>8} {a[0]/HORIZON:>10.1f} {a[2]:>8} {a[1]/a[0]:>15.2f}")
print(f"throughput ratio async/sync : {a[0]/s[0]:.2f}x")
```
```text
regime  samples  samples/s  updates  avg policy lag
sync        136       68.0       17            0.00
async       393      196.5      393            6.85
throughput ratio async/sync : 2.89x
```
The numbers in Output 20.7.1 are the whole trade-off in one table. Asynchrony nearly tripled throughput, and it did so by trading away on-policy purity: the average sample fed to the asynchronous learner came from a policy almost seven updates behind. A synchronous PPO learner is not built for that data, because its clipped objective assumes samples from a policy close to the current one; an IMPALA learner accepts it precisely because V-trace reweights it back toward on-policy (Section 20.5). The 6.85-update lag is the bias that the correction must absorb, and it stays finite only because the simulation assumes the learner consumes every sample as it arrives. Let the actors outpace the learner and that number climbs without bound, which is the failure mode Section 20.8 studies as a sampling-versus-learning throughput imbalance.
People expect a synchronous round to cost the average rollout time. It costs the maximum. With eight actors whose times are heavy-tailed, the slowest is routinely several times the typical one, so most actors spend most of the round finished and waiting. The simulation above ran 17 synchronous rounds in two seconds while the asynchronous learner fired 393 updates: the barrier did not just slow things down, it left the learner almost idle. The waiting actors are not resting; they are burning money.
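The simulated numbers can be checked on the back of an envelope. The sketch below uses the standard result that the expected maximum of N independent exponentials with mean m is m times the N-th harmonic number, together with the same constants as the simulation. The lag estimate of N minus one is a property of this toy model only: every delivered sample bumps the version, and while one actor completes a rollout the other seven complete about one each on average.

```python
N_ACTORS = 8
HORIZON = 2.0          # seconds
LEARN_TIME = 0.010     # seconds
MEAN_ROLLOUT = 0.040   # seconds

# H_N = 1 + 1/2 + ... + 1/N
harmonic = sum(1.0 / k for k in range(1, N_ACTORS + 1))

# Synchronous: each round costs the expected slowest rollout plus one update.
expected_round = MEAN_ROLLOUT * harmonic + LEARN_TIME
sync_rounds = HORIZON / expected_round
sync_rate = N_ACTORS / expected_round

# Asynchronous: every actor delivers one sample per mean rollout time.
async_rate = N_ACTORS / MEAN_ROLLOUT

# Toy-model lag: versions published by the other actors during one rollout.
approx_lag = N_ACTORS - 1

print(f"expected sync round : {expected_round * 1000:.1f} ms")
print(f"expected sync rounds: {sync_rounds:.1f}")
print(f"sync samples/s      : {sync_rate:.1f}")
print(f"async samples/s     : {async_rate:.1f}")
print(f"throughput ratio    : {async_rate / sync_rate:.2f}x")
print(f"approx. async lag   : {approx_lag}")
```

The predictions (about 16.8 rounds, 67 versus 200 samples per second, a lag near 7) sit close to the simulated 17 rounds, 68.0 versus 196.5 samples per second, and 6.85, with the gaps attributable to a single short random run. The mean rollout is 40 ms, but the synchronous round is priced at roughly 119 ms.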
5. Where Each Regime Wins Intermediate
The choice is not a matter of taste; different problems put different weights on throughput and bias, and that determines the answer. Table 20.7.1 lays out the considerations and where each regime lands.
| Consideration | Synchronous (PPO, A2C) | Asynchronous (A3C, IMPALA, Ape-X) |
|---|---|---|
| Data policy lag | Zero; strictly on-policy | Positive; needs off-policy correction |
| Throughput ceiling | Set by the slowest actor plus learner stall | Set by the slowest single component |
| Update stability | High; estimator is unbiased | Depends on bounded lag and a working correction |
| Best when | Stability and reproducibility dominate | Raw sample throughput dominates |
| Typical home | LLM RLHF, continuous control | Massive-scale Atari, large actor fleets |
Synchronous PPO dominates two important settings. The first is RL from human feedback for large language models, where the policy is a multi-billion-parameter model, each update is expensive, and stability and reproducibility matter more than squeezing out the last unit of actor throughput; PPO's on-policy guarantee is worth the barrier. The second is many continuous-control tasks (robotics, locomotion), where sample efficiency and stable updates beat raw sample count. Asynchronous designs dominate the opposite regime: massive-throughput settings such as large-scale Atari or simulated environments where samples are cheap, the policy is small enough to update very fast, and the bottleneck is simply getting enough experience through the system. There IMPALA-style streaming with V-trace, or Ape-X-style prioritized replay, turns hundreds of actors into a firehose the learner can drink from.
Who: An applied-RL engineer standing up the reinforcement-learning stage of an instruction-tuned language model.
Situation: Generation (actors sampling completions from the 13-billion-parameter policy) was the slow part, and completion lengths varied wildly, so some actors finished a batch in seconds and others took far longer.
Problem: A synchronous PPO loop left the fast actors and the whole learner GPU idle while a handful of long-completion stragglers finished, wasting expensive accelerator time on a barrier.
Dilemma: Keep synchronous PPO for its on-policy stability and eat the straggler tax, or go asynchronous for throughput and risk the off-policy drift that PPO's clipped objective is not designed to tolerate.
Decision: They kept the synchronous PPO update for stability but attacked the straggler tax directly, bucketing prompts by expected completion length and capping generation length, so the barrier wait collapsed instead of the algorithm changing.
How: Length-grouped batching plus a generation-length cap made rollout times far more uniform, shrinking the maximum-of-N wait; the learner stayed strictly on-policy.
Result: Accelerator utilization rose sharply with no change to the on-policy guarantee, and training stayed as reproducible as the synchronous design promised.
Lesson: Asynchrony is not the only cure for the barrier. When stability is non-negotiable, attack the straggler distribution (the maximum in Output 20.7.1) before you trade away on-policy purity.
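The two fixes in this case study can be sketched with plain lists. Everything here is hypothetical: the expected lengths are invented, a real system would need some predictor to supply them, and the cost model (a batch takes as long as its longest completion) is a deliberately crude proxy for barrier time.

```python
def chunk(indices, batch_size):
    """Split a list of prompt indices into consecutive batches."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def barrier_cost(batches, lengths):
    # Each batch finishes only when its longest completion does.
    return sum(max(lengths[i] for i in batch) for batch in batches)


def bucket_by_length(lengths, batch_size):
    # Sort prompts by expected completion length, then batch neighbours together.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return chunk(order, batch_size)


def cap_lengths(lengths, max_new_tokens):
    # A generation-length cap truncates the tail of the distribution.
    return [min(n, max_new_tokens) for n in lengths]


# Hypothetical expected completion lengths, in tokens, for eight prompts.
expected = [40, 900, 60, 55, 1200, 45, 70, 800]

naive = barrier_cost(chunk(list(range(len(expected))), 4), expected)
grouped = barrier_cost(bucket_by_length(expected, 4), expected)
capped = cap_lengths(expected, max_new_tokens=512)
grouped_capped = barrier_cost(bucket_by_length(capped, 4), capped)

print(f"arrival-order batches : {naive} token-steps")
print(f"length-grouped batches: {grouped} token-steps")
print(f"grouped and capped    : {grouped_capped} token-steps")
```

Grouping keeps short completions from waiting on long ones, and the cap shrinks the longest batch. Neither touches the policy version that generated the data, so the on-policy guarantee is unchanged.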
6. Scaling Synchronous PPO With Data-Parallel Learners Advanced
The RLHF example raises an obvious worry: if synchronous PPO must stay on-policy and the policy is enormous, how does it scale at all? The answer is that the two halves of the loop scale by different means. The actor side scales out by replication, more generation workers producing more completions per round, exactly the distributed experience collection of Section 20.3. The learner side scales by data parallelism, the same all-reduce that has run through this book since Chapter 15: the large on-policy batch is split across many learner replicas, each computes the PPO gradient on its shard, and an all-reduce averages the gradients before the optimizer step. Modern large-batch synchronous PPO thus keeps the barrier and the on-policy guarantee while letting both the batch and the learner grow with the available hardware, which is precisely why the same recipe that trains foundation models (Chapter 19) carries over to their RL fine-tuning stage.
The result is that the synchronous-versus-asynchronous choice is not the same as small-versus-large. A synchronous system scales to thousands of accelerators by enlarging the on-policy batch and data-parallelizing the learner; its limit is the straggler-and-stall overhead this section measured, not a hard size cap. An asynchronous system scales by decoupling, letting actors and learner run at their own speeds; its limit is how much policy lag the correction can absorb. The two scaling stories sit on the two ends of one axis, and Section 20.8 turns the overheads of both into the bottleneck analysis that decides how far each can actually go.
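The data-parallel learner step described above can be sketched without any framework. The squared-error loss is a stand-in for the PPO loss, chosen only because its gradient fits on one line, and `all_reduce_mean` is a hypothetical single-process imitation of the collective. The point it illustrates is that averaging per-shard gradients over equal-sized shards reproduces the full-batch gradient, so splitting the learner changes nothing statistically.

```python
def grad_squared_error(w, shard):
    # Gradient with respect to w of the mean of (w*x - y)^2 over the shard.
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    # Stand-in for a collective all-reduce: every replica ends with the average.
    avg = sum(values) / len(values)
    return [avg] * len(values)


# One on-policy batch (made-up numbers), split evenly across four replicas.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
         (5.0, 9.0), (6.0, 13.0), (7.0, 13.0), (8.0, 17.0)]
w = 1.5
n_replicas = 4
shard_size = len(batch) // n_replicas
shards = [batch[r * shard_size:(r + 1) * shard_size] for r in range(n_replicas)]

local_grads = [grad_squared_error(w, shard) for shard in shards]  # in parallel
synced_grads = all_reduce_mean(local_grads)                       # the barrier
full_batch_grad = grad_squared_error(w, batch)

print(f"per-replica gradients: {[round(g, 3) for g in local_grads]}")
print(f"after all-reduce     : {synced_grads[0]:.3f}")
print(f"full-batch gradient  : {full_batch_grad:.3f}")
```

With unequal shard sizes the plain average would need to be weighted by shard size to match the full-batch gradient, which is one reason these systems keep shards uniform.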
The synchronous-versus-asynchronous axis is one of the book's signature arcs, and you have now seen all three of its layers. It was introduced as a pure coordination question in Section 2.7 (do machines wait for each other?), deepened into sync-versus-async SGD and bounded staleness in Chapter 10, and transformed here into the actor-learner trade-off where asynchrony costs not just freshness but on-policy validity. The same barrier, the same staleness, the same all-reduce on the learner side: distributed RL is distributed optimization with the data distribution itself moving underfoot. The arc continues into distributed multi-agent RL training in Chapter 30.
Code 20.7.1 hand-built both regimes to expose the mechanism. In practice you do not write the actor-learner clock at all; a distributed-RL framework gives you both regimes behind a configuration switch and handles the policy publishing, the replay buffer, and the off-policy correction for you. In Ray RLlib, choosing synchronous PPO versus asynchronous IMPALA is essentially picking the algorithm and a knob:
```python
from ray.rllib.algorithms.ppo import PPOConfig        # synchronous, on-policy
from ray.rllib.algorithms.impala import IMPALAConfig  # asynchronous, V-trace corrected

sync = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=8)    # 8 actors, barrier each round
)
async_ = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=64)   # 64 actors, never wait; V-trace on
)
```
The whole choice reduces to picking PPOConfig (synchronous, on-policy) or IMPALAConfig (asynchronous, V-trace corrected) and setting the actor count; RLlib supplies the barrier or the streaming queue, the policy broadcast, and the importance-sampling correction. Section 20.9 builds out these frameworks.
7. Research Frontier Advanced
The synchronous-asynchronous axis is being actively renegotiated by the rise of RL for LLM reasoning, where the cost of the barrier and the cost of staleness are both being re-measured at frontier scale.
The 2024 to 2026 wave of RL for LLM reasoning (DeepSeek-R1's RL training, and the GRPO objective that drops PPO's value network) has made synchronous, large-batch, on-policy RL the production default, because reasoning rewards are sparse and stability is paramount. Yet the very cost this section measured, generation stragglers idling expensive learner accelerators, has pushed the frontier back toward controlled asynchrony. Systems work such as fully asynchronous RLHF pipelines and disaggregated generation-and-training stacks (for example AReaL-style and OpenRLHF-style asynchronous designs reported in 2024 to 2025) decouples the rollout fleet from the learner and accepts a small, bounded policy lag in exchange for keeping both sides saturated, then leans on importance-sampling corrections to stay close to on-policy. The open question is exactly how much staleness a reasoning-RL objective can absorb before its accuracy gains erode, which is the throughput-versus-bias trade-off of Output 20.7.1 asked at frontier scale. We return to the throughput accounting behind these designs in Section 20.8.
You now hold the central design axis of distributed RL: synchronous for clean on-policy data at the cost of idle machines, asynchronous for saturated machines at the cost of off-policy bias, with large-batch data-parallel learners letting the synchronous side scale and bounded-lag corrections letting the asynchronous side stay valid. What this section deliberately left as a single number, the learner keeping up with the actors, becomes the main subject next: when sampling throughput and learning throughput fall out of balance, the system bottlenecks, and which side bottlenecks decides everything. That accounting begins in Section 20.8.
Asynchronous SGD (Section 10.10) computes gradients against stale parameters, and asynchronous RL computes trajectories under a stale policy. Both inject staleness, yet RL needs an explicit importance-sampling correction (V-trace) while async SGD often needs none. Explain, in terms of what the staleness changes, why a stale policy alters the distribution of the data and not merely the point at which a gradient is evaluated, and why that distinction forces RL to correct what SGD can sometimes ignore.
Reproduce Code 20.7.1 and add a sweep over N_ACTORS from 2 to 64. Plot synchronous and asynchronous samples-per-second on one axis and average asynchronous policy lag on another. Identify the actor count at which adding more actors stops helping the synchronous regime (the straggler tax saturates) and the actor count at which the asynchronous policy lag exceeds, say, 20 updates. Argue from your two crossover points which regime you would deploy if your off-policy correction is reliable only up to a lag of 10.
In the synchronous regime, a round costs $\mathbb{E}[\max_i T_i] + T_{\text{learn}}$, where $T_i$ are the actor rollout times and $T_{\text{learn}}$ is the learner-update time. Take $T_i$ exponential with mean $40$ ms and $T_{\text{learn}} = 10$ ms. Using the fact that the expected maximum of $N$ exponentials with mean $\mu$ is $\mu H_N$ where $H_N$ is the $N$-th harmonic number, compute the expected round time and the fraction of it that is straggler wait versus learner stall for $N \in \{8, 32, 128\}$. At which $N$ does the straggler wait dominate, and what does that imply for whether you should fix the barrier or abandon it?