Part IV: Parallel Deep Learning and Large Models
Chapter 20: Distributed Reinforcement Learning Infrastructure

Ape-X, R2D2, and SEED RL Designs

"They asked me to step the environment and also run the whole policy and also compute priorities. I am one CPU. I have opinions about this arrangement."

An Actor Carrying More Than Its Share
Big Picture

The famous distributed-RL systems are not a chronology to memorize; they are four answers to one design question: given the actor-learner split, where do you put inference, do you learn from replay or from a fresh on-policy stream, and how do you divide the work between cheap CPUs and scarce accelerators? IMPALA, Ape-X, R2D2, and SEED RL each fix one inefficiency in the others by moving a single piece of the pipeline. IMPALA pairs many actors with a central learner corrected by V-trace. Ape-X feeds hundreds of actors into one prioritized replay buffer. R2D2 carries that design into recurrent agents and confronts the systems problem of storing hidden state. SEED RL notices that running a big policy on every actor wastes accelerators, and moves inference off the actors entirely. This section reads the four as coordinates in a design space, then measures the exact tradeoff SEED exploits: when does centralizing inference beat running it on each actor?

The previous sections built the parts. Section 20.2 separated acting from learning into the actor-learner architecture, Section 20.3 scaled experience collection across many actors, Section 20.4 turned the replay buffer into a sharded distributed service, and Section 20.5 made off-policy learning correct at scale with V-trace. Now we assemble those parts into named systems. The point of studying them together is not historical completeness; it is that each system makes one different choice on one axis, so seeing them side by side teaches the axes themselves. A practitioner who internalizes the design space can place a new system on it in a minute, and can design a fifth point that fits a workload none of the four were built for.

1. One Design Space, Four Coordinates Beginner

Every distributed-RL system in this section is a loop in which actors generate experience and a learner improves a policy from it. The systems differ in three decisions that, once you name them, become a small coordinate grid. The first axis is where policy inference runs: on each actor (the actor holds a copy of the policy and computes its own actions) or centrally (actors send raw observations and a central service computes actions in a batch). The second is how experience reaches the learner: through a replay buffer that stores and re-samples past transitions (off-policy), or as a fresh near-on-policy stream consumed once and discarded. The third is what the actor itself must compute and store, which is where recurrence bites, because an LSTM agent has hidden state that has to be carried, stored, and replayed.

These are not independent knobs you can set at random; some combinations are natural and others fight themselves. Replay demands off-policy correction because stored transitions were generated by an older policy, the exact problem V-trace solves in Section 20.5. Centralizing inference only pays off when the policy is large enough that batching many observations on one accelerator beats running it many times on small actor devices, the tradeoff this section measures. Recurrence forces a choice about hidden state that neither feed-forward replay nor on-policy streaming had to make. Table 20.6.1 places the four systems on the grid, and the rest of the section walks each coordinate and explains why its neighbors are where they are.

Table 20.6.1: Four landmark distributed-RL systems as coordinates in one design space. Each row is the same actor-learner loop with one or two axes set differently; the right column names the single inefficiency that system removed from its neighbors.
SystemInference runsExperience pathAgent typeKey contribution
IMPALAOn each actorOn-policy stream, V-trace correctedFeed-forward / recurrentScalable actor-critic template; off-policy correction for a fast on-policy stream
Ape-XOn each actorShared prioritized replayFeed-forward (DQN/DDPG)Hundreds of actors, actor-side priority computation, decouples data scale from learning
R2D2On each actorShared prioritized replay of sequencesRecurrent (LSTM)Recurrence under replay: stored hidden state and burn-in for sequence replay
SEED RLCentralized, batched on learnerEither (V-trace or replay)Feed-forward / recurrentInference off the actors, batched on accelerators over fast networking
Key Insight: Each System Moved Exactly One Piece

The progress from IMPALA to SEED is not four unrelated inventions; it is four single-axis moves on a shared loop. Ape-X swapped IMPALA's on-policy stream for a prioritized replay buffer (the experience-path axis). R2D2 took Ape-X and made the agent recurrent (the agent-type axis), which forced a new answer on hidden-state storage. SEED took the inference location and moved it off the actors (the inference-location axis). When you meet a new RL system, do not ask "what is it"; ask "which of these axes did it change, and what inefficiency did that fix?" The answer locates it on Table 20.6.1 immediately.

2. IMPALA and Ape-X: Many Actors, One Learner Intermediate

IMPALA is the template the others vary. It runs many actors that each hold a recent copy of the policy, step their environments, and ship trajectory fragments to a single central learner. Because actors run slightly stale policy copies while the learner keeps updating, the stream that arrives is not quite on-policy, and IMPALA corrects that lag with the V-trace targets developed in Section 20.5. The contribution is the shape: a centralized learner with high accelerator utilization fed by a swarm of cheap actors, with a principled correction that keeps the off-by-a-few-updates staleness from biasing the gradient. Nearly every later actor-critic system at scale is a descendant of this layout.

Ape-X keeps the swarm of actors but changes the experience path. Instead of a once-consumed on-policy stream, actors write transitions into a shared prioritized replay buffer, the sharded service of Section 20.4, and the learner samples from it by priority. The decisive systems trick is that each actor computes the priority of its own transitions locally, from the temporal-difference error it already has in hand, so the buffer never has to score incoming data centrally. This decouples the scale of data generation from the scale of learning: you can run hundreds of actors filling the buffer while a single learner drains it at its own pace, and the priorities steer that single learner toward the most informative transitions. Ape-X showed that with feed-forward DQN-style agents, raw actor count alone buys large improvements, because the bottleneck was never the learner's math, it was the rate and quality of incoming experience.

Fun Note: The Buffer Does Not Care Who Filled It

An Ape-X actor that has computed a transition's priority has done a small favor for a learner it will never meet, on a shard it did not choose, for a sample that may be drawn a thousand updates later or never. The replay buffer is a beautifully indifferent intermediary: it accepts priorities from anyone and serves them to anyone, which is exactly why you can add a hundred more actors without telling the learner.

3. R2D2: Recurrence Turns Replay Into a Systems Problem Advanced

R2D2 (Recurrent Replay Distributed DQN) takes the Ape-X layout and makes the agent recurrent, an LSTM, so the policy depends on history through a hidden state $h_t$. For a feed-forward agent a stored transition is self-contained: state in, action and reward out. For a recurrent agent it is not, because the action at time $t$ depended on $h_t$, which summarizes everything before $t$. Replaying a stored sequence requires knowing what the hidden state was when the data was generated, and that hidden state was produced by a policy that has since changed. This is a storage and consistency problem, not a learning-theory problem, and it is the reason R2D2 belongs in a systems chapter.

R2D2's answer has two parts that every recurrent-replay system since has borrowed. First, it stores the actor's hidden state alongside the transition sequence in the buffer, so the learner has a starting point rather than a zero. Second, because that stored state is stale (it came from an older policy and the network weights have moved), the learner does not trust it directly; it runs a burn-in, replaying the first part of the sequence purely to warm up a fresh hidden state under current weights before computing any loss. The stored state seeds the warm-up; the burn-in repairs the staleness. The cost is concrete and worth stating plainly: the replay buffer now stores sequences plus hidden-state vectors instead of single transitions, which multiplies the bytes per sample and the network traffic the buffer service of Section 20.4 must move. Recurrence does not change the actor-learner shape; it inflates what flows through it.

4. SEED RL: Move Inference Off the Actors Advanced

IMPALA, Ape-X, and R2D2 all run the policy on the actors. SEED RL questions that. If the policy is a large neural network, running it on every actor means either putting an accelerator on each actor (most of them idle, since one environment step produces one observation and a batch of one wastes the hardware) or running the big policy on a CPU (slow). Either way, as the policy grows, the actors spend more of their time on inference and the expensive accelerators are starved by tiny batches. SEED RL's move is to take inference off the actors entirely: actors become thin clients that only step environments and exchange observations and actions with a central service over fast networking. The learner machine, which already has the accelerators, batches observations from all actors into one large forward pass, computes actions, and streams them back.

The result is a different CPU/GPU split from the one in Section 20.3. Actors are pure CPU environment-steppers with no policy and no accelerator. The central accelerator runs both learning and inference, and crucially it runs inference at a batch size equal to the number of actors, which is exactly the regime accelerators are built for. The cost SEED pays is two network hops per step, an observation out and an action back, which is why it depends on fast networking to keep that latency small. The diagram in Figure 20.6.1 contrasts the two layouts, and the demonstration that follows measures when the trade pays off.

Actor-side inference (IMPALA / Ape-X / R2D2) Actor 1 env + policy copy runs its own forward Actor 2 env + policy copy runs its own forward Actor A env + policy copy runs its own forward Learner accelerator learns only experience (small batches of 1 per forward) Centralized batched inference (SEED RL) Actor 1 env only (CPU) Actor 2 env only (CPU) Actor A env only (CPU) Inference + Learner accelerator batches all A obs into one forward observations out actions back (2 fast hops/step)
Figure 20.6.1: The inference-location axis. On the left, every actor carries a policy copy and runs its own forward pass on a batch of one, so accelerators sit idle or the big policy runs slowly on CPU. On the right, SEED RL strips the policy off the actors; they step environments and send observations (orange) to a central service that batches all $A$ of them into one forward pass and returns actions, paying two fast network hops per step. The demonstration in Code 20.6.1 measures which side wins as the policy grows.
Thesis Thread: Inference Is a Distribution Axis Too

SEED RL is a clean instance of the book's central move: an essential activity, here policy inference, is partitioned across machines and recombined, and the recombination (batching all actors' observations into one forward) is what makes it efficient. The same logic that made batched serving the right shape for LLM inference in Chapter 24 reappears here inside the training loop: gather many small requests, run them as one large batch on the accelerator, scatter the results back. When you see "batch the small requests centrally" in serving and again in RL, you are seeing one distribution pattern wearing two hats.

5. When Centralizing Inference Wins Intermediate

SEED's bet is only worth the two network hops when the policy is large enough that batched inference on an accelerator decisively beats per-actor inference. We can see the crossover with a pure-throughput model that needs no RL at all. Each environment step costs a fixed amount of CPU time plus one policy forward pass. In the actor-side layout, each actor runs the forward alone on a small device, so the cost grows roughly linearly with policy size and never enjoys batching. In the centralized layout, the single accelerator batches all $A$ actors' observations into one forward, so a fixed launch overhead is amortized across the batch and the per-observation cost is far lower, at the price of two network hops per step. The code below sweeps policy size and reports the aggregate step throughput of each layout.

ENV_MS = 4.0          # env.step() per actor, milliseconds (fixed)
NET_MS = 0.3          # one-way obs/action round-trip on fast networking
A = 256               # number of actors stepping in parallel

# Actor device is small: per-forward latency grows ~linearly in policy FLOPs
# with no batching benefit (each actor has a batch of one).
def actor_fwd_ms(p_mflops):
    return 0.05 * p_mflops

# The central accelerator has a fixed launch cost but is far faster per FLOP,
# and it amortizes that launch over a full batch of A observations.
def central_fwd_ms_per_obs(p_mflops, batch):
    launch = 0.20                     # kernel launch / fixed overhead, ms
    per_obs = 0.004 * p_mflops        # accelerator ~12x faster per FLOP
    return (launch + per_obs * batch) / batch

print(f"{'policy':>8} | {'actor-side':>22} | {'centralized (SEED)':>22} | winner")
print("-" * 72)
for p in [1, 5, 20, 80, 320]:        # policy size in MFLOPs per forward
    step_actor = ENV_MS + actor_fwd_ms(p)           # env + own forward, parallel
    thr_actor = A / step_actor * 1000.0             # steps/sec across all actors

    fwd_amort = central_fwd_ms_per_obs(p, A)        # batched over all A actors
    step_central = ENV_MS + 2 * NET_MS + fwd_amort  # env + 2 hops + batched fwd
    thr_central = A / step_central * 1000.0

    win = "centralized" if thr_central > thr_actor else "actor-side"
    print(f"{p:>5} MF  | {thr_actor:>12.0f} steps/s   | {thr_central:>12.0f} steps/s   | {win}")
Code 20.6.1: A from-scratch throughput model of the inference-location axis. The only difference between the two layouts is where the policy forward pass runs and whether it is batched; everything else (env time, actor count) is held equal so the crossover is attributable to inference alone.
  policy |             actor-side |     centralized (SEED) | winner
------------------------------------------------------------------------
    1 MF  |        63210 steps/s   |        55594 steps/s   | actor-side
    5 MF  |        60235 steps/s   |        55402 steps/s   | actor-side
   20 MF  |        51200 steps/s   |        54692 steps/s   | centralized
   80 MF  |        32000 steps/s   |        52024 steps/s   | centralized
  320 MF  |        12800 steps/s   |        43532 steps/s   | centralized
Output 20.6.1: For a tiny policy, actor-side inference wins: there is nothing to batch, and SEED's two network hops are pure overhead. As the policy grows past roughly 20 MFLOPs per forward, centralized batched inference pulls ahead and the gap widens fast; at 320 MFLOPs the actor-side layout collapses to a quarter of SEED's throughput because every actor is grinding a big forward alone.

The numbers match the intuition exactly. With a 1-MFLOP policy there is nothing to amortize, so SEED's network hops only cost throughput and the actor-side layout wins. The crossover sits near 20 MFLOPs, and beyond it the actor-side throughput falls off a cliff while SEED's barely moves, because SEED was always running one efficient batch regardless of policy size. This is precisely the regime SEED RL was designed for: large policies, many actors, fast networking. It also explains why the earlier systems were content to run inference on the actors; with the small feed-forward networks of Ape-X-era benchmarks, the crossover had not been reached, so there was no inefficiency to fix. SEED did not contradict its predecessors; it identified the regime where their choice stopped paying off.

Practical Example: The Actor Fleet That Was Mostly Idle CPUs Holding GPUs

Who: An RL infrastructure engineer at a robotics lab training a large vision-based control policy.

Situation: Training used 512 actors, each on a machine with a small GPU so it could run the policy, feeding an Ape-X-style replay buffer and a single learner.

Problem: Profiling showed every actor GPU sat near 4% utilization; they ran one forward on a batch of one per environment step, and the policy was large enough that even that was slow, capping experience throughput.

Dilemma: Buy bigger actor GPUs (scale up 512 machines, expensive and still batch-of-one wasteful), or restructure to SEED-style centralized inference (cheaper CPU-only actors plus one inference accelerator, but adds two network hops per step and needs fast networking).

Decision: They moved to centralized batched inference, because the policy sat well past the crossover in Output 20.6.1, where batching all actors on one accelerator dominates per-actor forwards.

How: Actors became CPU-only environment steppers sending observations over a low-latency interconnect; one accelerator batched all 512 observations per cycle into a single forward, returned actions, and ran the learner in the same box.

Result: Experience throughput rose several-fold at lower total hardware cost, the 512 idle actor GPUs were eliminated, and the single inference accelerator ran near peak batch utilization.

Lesson: Where inference runs is a distribution decision with a measurable crossover. Past it, stripping the policy off the actors and batching centrally is both faster and cheaper; before it, the two hops are not worth paying.

Library Shortcut: Ray RLlib Gives You These Designs as Config

You do not implement Ape-X, IMPALA, or a SEED-style central-inference loop from scratch. Ray RLlib ships them as named algorithms behind a few lines of configuration; switching the design space coordinate is changing one string, and the framework handles the actor fan-out, the sharded replay buffer, the priority plumbing, and the policy-weight broadcast that Section 20.4 built by hand:

# pip install "ray[rllib]"
from ray.rllib.algorithms.apex_dqn import ApexDQNConfig

config = (
    ApexDQNConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=64)        # 64 actors filling the buffer
    .training(num_steps_sampled_before_learning_starts=50_000)
)
algo = config.build()                        # learner + sharded prioritized replay
for _ in range(100):
    algo.train()                             # actors sample, learner drains by priority
Code 20.6.2: The Ape-X layout of Section 2 as RLlib configuration. The hundreds of lines of actor management, prioritized replay sharding, and actor-side priority computation collapse to a config object; selecting IMPALA instead is swapping ApexDQNConfig for IMPALAConfig, the same design space, one coordinate moved.

6. Reading a New System Off the Grid Intermediate

The payoff of treating these four as a design space rather than a reading list is that the grid keeps working for systems built after them. A modern large-scale RL stack for training language-model policies, for instance, almost always centralizes inference (the policy is enormous, so it sits far past the crossover), almost always streams near-on-policy data corrected like V-trace rather than replaying from a buffer (the policy changes fast and stale samples hurt), and inherits R2D2's lesson that any state the actor carries must be shipped and warmed up. Placing it is a matter of reading three coordinates, not learning a new system from zero. The grid also tells you what is missing: there are unfilled cells, such as centralized inference paired with prioritized replay for a recurrent agent, that a workload might call for, and the design space tells you exactly what that system would have to handle.

The thread continues. These actor-learner infrastructures return in Chapter 30, where many learning agents share an environment and the single-learner picture splits further, and the next section confronts the question that has hovered under all four designs: should the actors and the learner march in lockstep or run free? That synchronous-versus-asynchronous choice, foreshadowed by the staleness that V-trace and R2D2's burn-in both repair, is the subject of Section 20.7.

Research Frontier: Distributed RL for LLM Post-Training (2024 to 2026)

The hottest current use of this infrastructure is reinforcement learning from human or verifiable feedback to align and reason-tune large language models, and it has revived every axis in Table 20.6.1 at a new scale. Open stacks such as OpenRLHF, NeMo-Aligner, and Hugging Face TRL, together with veRL (HybridFlow) and the inference-batching ideas behind SEED, all wrestle with the same split: generation (policy rollout, which is inference-bound and benefits enormously from centralized batched serving via engines like vLLM) versus learning (the gradient update). The 2024 to 2026 systems literature on PPO and GRPO for LLMs is, read through this section's lens, a re-derivation of SEED's "batch the inference centrally" insight, now with the policy at hundreds of billions of parameters and the rollout engine and trainer often placed on separate accelerator pools. The design space did not change; the policies got large enough that centralized inference is no longer optional.

Exercise 20.6.1: Place a Fifth System Conceptual

A new RL stack trains a large recurrent policy by streaming near-on-policy data (no replay buffer) with V-trace-style correction, and it runs inference centrally on a batched accelerator. Using the three axes of Section 1 (inference location, experience path, agent type), give its coordinates and say which one system in Table 20.6.1 it is closest to on each axis. Then name the one R2D2 lesson it must still implement even though it has no replay buffer, and explain why.

Exercise 20.6.2: Find the Crossover Yourself Coding

Modify Code 20.6.1 to find the exact policy size (in MFLOPs) where centralized inference overtakes actor-side, for actor counts $A \in \{16, 64, 256, 1024\}$. Plot or print the crossover policy size against $A$. Explain the trend: why does adding more actors move the crossover toward smaller policies, and what does that say about when SEED-style centralization is worth it for a small policy with a very large actor fleet?

Exercise 20.6.3: The Hidden-State Storage Bill Analysis

An R2D2-style buffer stores sequences of length $L = 80$ transitions, each with an observation of $4{,}096$ bytes, plus one LSTM hidden state of $512$ floats (4 bytes each) per stored sequence for the burn-in seed. Compute the bytes per stored sequence with and without the hidden state, and the percentage overhead the hidden state adds. Then argue, using the distributed-replay traffic model of Section 20.4, whether storing the hidden state or recomputing it from a longer burn-in is the better trade when buffer network bandwidth, not buffer capacity, is the binding constraint.