Part IV: Parallel Deep Learning and Large Models
Chapter 20: Distributed Reinforcement Learning Infrastructure

The Actor-Learner Architecture

"I collected a thousand perfectly good rollouts. By the time the learner thanked me, it had already become a different policy and politely informed me my rollouts were off-policy."

An Actor Two Versions Behind
Big Picture

Distributed reinforcement learning splits one tightly coupled loop into two roles that run on different machines: many actors that run the current policy in parallel environment instances and emit experience, and a learner that consumes that experience, updates the policy, and broadcasts the new weights back. Experience flows up from actors to the learner; weights flow down from the learner to actors. The split is what lets RL scale, because environment simulation and gradient computation have completely different resource profiles and can now be sized independently. The split also creates the defining headache of distributed RL: while the learner moves on, the actors keep acting under a slightly old policy, so the experience they return is generated by a policy the learner has already abandoned. This section builds the architecture, names the two flows, and quantifies the resulting policy lag with a runnable loop.

In the previous section we argued that reinforcement learning is fundamentally a distributed-systems problem because a single process must both interact with an environment and learn from that interaction, and those two activities starve each other for resources. The actor-learner architecture is the answer the field converged on. It takes the monolithic RL loop (gather experience, improve the policy, gather more) and cuts it along the seam between gathering and improving. Everything in the rest of this chapter (the replay buffers of Section 20.4, the off-policy corrections of Section 20.5, and the IMPALA, Ape-X, and SEED designs of Section 20.6) is a refinement of this one pattern. Get the pattern right and the rest is engineering.

[Figure 20.2.1 diagram: $K$ actors, each paired with an environment, run the policy in parallel (Actors 1, 2, and K hold policy v6; Actor 3 holds v5). Experience (state, action, reward) flows up to the learner, which consumes it, updates the policy, and is now at v7; updated weights are broadcast back down. Policy lag: learner at v7, slowest actor at v5, so lag = 2.]
Figure 20.2.1: The actor-learner architecture and its two opposing flows. Experience trajectories travel up from $K$ actors (orange, solid) into the learner; updated policy weights travel down from the learner back to the actors (blue, dashed). Because the broadcast is not instantaneous and not every actor refreshes on the same step, actors act under older policy versions than the learner currently holds. The gap between the learner's version and the slowest actor's version is the policy lag, quantified in Output 20.2.1.

1. Cutting the RL Loop in Two (Beginner)

A reinforcement learning agent improves by trial and error: it acts in an environment, observes the reward and the next state, and adjusts its policy so that high-reward actions become more likely. Run serially on one machine, this loop alternates between two phases that want opposite hardware. Acting in the environment is mostly CPU work (stepping a simulator, a game engine, a physics model) and is dominated by latency, one step at a time. Learning is mostly accelerator work (a forward and backward pass over a neural network) and is dominated by throughput, large batches at once. When both share one process, the accelerator sits idle while the simulator steps, and the simulator sits idle while the accelerator trains. The actor-learner architecture breaks that lockstep.
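A back-of-the-envelope model shows how much the lockstep costs. The sketch below is illustrative only: the function names and the timings (`sim_seconds` to collect one batch of rollouts in one CPU process, `train_seconds` for one accelerator update) are assumptions, and the decoupled case is idealized, with perfect parallelism across actors and no transfer cost.

```python
def serial_accelerator_utilization(sim_seconds: float, train_seconds: float) -> float:
    """Fraction of wall-clock time the accelerator is busy when simulation
    and training alternate inside one process."""
    return train_seconds / (sim_seconds + train_seconds)


def decoupled_accelerator_utilization(sim_seconds: float, train_seconds: float,
                                      num_actors: int) -> float:
    """Same fraction when num_actors processes simulate in parallel and the
    learner trains whenever a batch is ready (idealized: no transfer cost)."""
    batch_interval = sim_seconds / num_actors      # time between ready batches
    return min(1.0, train_seconds / batch_interval)


# Hypothetical timings: 8 s of simulation and 2 s of training per batch.
SIM, TRAIN = 8.0, 2.0
print(f"serial:          {serial_accelerator_utilization(SIM, TRAIN):.0%}")
for k in (1, 2, 4, 8):
    util = decoupled_accelerator_utilization(SIM, TRAIN, k)
    print(f"decoupled, K={k}:  {util:.0%}")
```

With these made-up timings the serial loop keeps the accelerator busy only a fifth of the time, and four idealized actors are enough to saturate it; real systems need more actors than the model suggests because transfer and batching are not free.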

An actor is a process that holds a copy of the policy and one or more environment instances. Its job is to act: feed the current observation to the policy, sample an action, apply it to the environment, record the resulting transition (a tuple of state, action, reward, next state), and repeat. A sequence of such transitions is a trajectory, and trajectories are the actor's only output. A learner is a process that holds the authoritative copy of the policy and an optimizer. Its job is to learn: consume the trajectories the actors produce, compute a policy-update gradient, apply it, and produce a new version of the weights. With the roles split, you run many actors per learner (environment simulation is cheap and embarrassingly parallel) and concentrate the expensive gradient work on one or a few accelerator-backed learners, the arrangement Figure 20.2.1 draws as experience flowing up and weights flowing down. Section 20.3 develops the actor side, the distributed collection of experience, in full.

Key Insight: Decoupling Lets You Size Two Resources Independently

The whole point of the actor-learner split is that experience generation and policy learning have different bottlenecks, CPU-bound latency versus accelerator-bound throughput, so coupling them wastes both. Once decoupled, you scale actors to saturate the learner's appetite for data and scale learners (or their batch size) to keep up with the actors' supply. The ratio of actors to learners becomes a tuning knob, not a fixed property of the algorithm. This is the same decoupling that Section 20.8 revisits as the sampling-versus-learning throughput balance, the central scaling bottleneck of distributed RL.

2. Two Flows: Experience Up, Weights Down (Beginner)

The architecture has exactly two data flows, moving in opposite directions, and keeping them straight is most of understanding distributed RL. The first flow is experience, travelling from the actors up to the learner. Each actor streams its trajectories to the learner (directly, or through the replay buffer of Section 20.4), which gathers experience from all of them into training batches. This is a gather: many producers, one consumer, partial results combined into a whole, exactly the collective family introduced in Section 4.7. The second flow is weights, travelling from the learner down to the actors. Each time the learner updates the policy, it must get the new parameters into the actors' hands so they act under an improved policy. This is a broadcast: one producer, many consumers, the identical tensor delivered to each.

These two flows are the engine of the whole system, and they are the same broadcast-and-gather pair that runs through every parallel method in this book. In data-parallel training (Chapter 15) the gather is the gradient all-reduce and the broadcast is implicit in the symmetric result. In RL the two halves come apart and run at different rates: experience flows continuously and in bulk, weights flow in discrete version-stamped bursts. The asymmetry in their rates is precisely what produces the staleness this section is about.
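The two flows can be sketched as a pair of plain functions. This is a single-process illustration, not a networked implementation: `ActorStub`, `gather`, and `broadcast` are hypothetical names introduced here, and a real system would move the same data over RPC or a collective library.

```python
from dataclasses import dataclass, field


@dataclass
class ActorStub:
    """Hypothetical stand-in for a remote actor process."""
    weights: list[float]
    version: int = 0
    outbox: list[tuple] = field(default_factory=list)   # trajectories waiting to ship


def gather(actors: list[ActorStub]) -> list[tuple]:
    """Experience UP: drain every actor's outbox into one training batch."""
    batch = []
    for actor in actors:
        batch.extend(actor.outbox)
        actor.outbox.clear()
    return batch


def broadcast(weights: list[float], version: int, actors: list[ActorStub]) -> None:
    """Weights DOWN: give every actor its own copy of the identical parameters."""
    for actor in actors:
        actor.weights = list(weights)    # copy, so actors never alias the master
        actor.version = version


actors = [ActorStub(weights=[0.0, 0.0]) for _ in range(3)]
for i, actor in enumerate(actors):
    actor.outbox.append((0.0, float(i), 1.0, 1.0))   # (state, action, reward, next_state)

batch = gather(actors)                 # many producers, one consumer
broadcast([0.5, -0.2], 1, actors)      # one producer, many consumers
print(len(batch), [a.version for a in actors])
```

Nothing forces the two calls to alternate one-for-one: `gather` can run many times between two `broadcast` calls, which is the rate asymmetry described above.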

Thesis Thread: Broadcast and Gather Return, Pulled Apart in Time

The broadcast-and-gather pair you first met as a single fused collective in Section 4.7, and saw fused again as gradient all-reduce in data-parallel deep learning, reappears here as the two flows of the actor-learner loop, but now unfused. The gather (experience, actors to learner) and the broadcast (weights, learner to actors) run on separate clocks, at separate rates, over separate links. Distributed RL is in large part the study of what happens to a fused communication primitive once you let its two halves drift apart in time, and the policy lag below is the first consequence you will measure.

3. The Policy-Lag Problem (Intermediate)

Because the weight broadcast is neither instantaneous nor perfectly synchronized, an actor almost always acts under a policy that is a few versions behind the learner's current one. Call the learner's current policy version $v_L$ and the version cached on actor $k$ at the moment it generated a transition $v_k$. The policy lag of that transition is

$$\text{lag}_k = v_L - v_k \;\ge\; 0,$$

the number of learner updates that happened between when the actor's cached policy was frozen and now. A lag of zero means the actor is perfectly current; a lag of two means the learner has taken two gradient steps that the actor has not yet seen. This lag is unavoidable in any decoupled design: the instant the learner applies an update, every transition already in flight from the actors was generated by an older policy.
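Measuring the lag requires that each transition carry the version of the policy that produced it. A minimal sketch, assuming a hypothetical `TaggedTransition` record (the extra `policy_version` field is an illustration, not a fixed wire format) and using the version numbers drawn in Figure 20.2.1:

```python
from collections import Counter
from typing import NamedTuple


class TaggedTransition(NamedTuple):
    state: float
    action: float
    reward: float
    next_state: float
    policy_version: int      # v_k: version the actor held when it acted


def transition_lags(batch, learner_version):
    """Return lag_k = v_L - v_k for every transition in the batch."""
    lags = [learner_version - t.policy_version for t in batch]
    if any(lag < 0 for lag in lags):
        raise ValueError("transition tagged with a version the learner has not minted")
    return lags


# Hypothetical batch echoing Figure 20.2.1: learner at v7, actors at v6, v6, v5, v6.
batch = [TaggedTransition(0.0, 0.1, 1.0, 1.0, v) for v in (6, 6, 5, 6)]
lags = transition_lags(batch, learner_version=7)
print(lags, max(lags), dict(Counter(lags)))
```

The maximum over the batch is the figure's lag of two; the per-lag counts are what a learner would inspect to decide whether a batch is fresh enough to use.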

The consequence is statistical, not just bookkeeping. Reinforcement-learning algorithms care deeply about which policy generated the data. An on-policy method assumes the data it learns from came from the very policy it is updating; experience generated by an older policy is off-policy data, and feeding it to an on-policy update silently biases the gradient. Policy lag is therefore an RL-specific form of staleness, the same hazard you met as stale gradients in asynchronous optimization (Section 10.6), where workers compute gradients against parameters the server has already moved past. There the staleness was measured in gradient steps; here it is measured in policy versions, and it corrupts not just the magnitude of the update but the very distribution the data was drawn from. Correcting for it (with importance sampling and methods such as V-trace) is the subject of Section 20.5; for now we only need to measure it.
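A tiny exact calculation makes the bias visible. The sketch assumes a hypothetical two-action bandit with made-up policies and rewards, and it computes expectations in closed form rather than by sampling. It illustrates plain importance weighting only; it is not V-trace, which Section 20.5 develops.

```python
def expected_reward(sampling_policy, rewards, weights=None):
    """Exact expectation of (weight * reward) when actions are drawn from
    sampling_policy; weights default to 1 (no correction)."""
    if weights is None:
        weights = [1.0] * len(rewards)
    return sum(p * w * r for p, w, r in zip(sampling_policy, weights, rewards))


rewards = [0.0, 1.0]      # hypothetical two-action bandit
pi_old = [0.8, 0.2]       # stale policy the actor used
pi_new = [0.4, 0.6]       # policy the learner is updating now

target = expected_reward(pi_new, rewards)                   # what on-policy data would give
naive = expected_reward(pi_old, rewards)                    # stale data, used as-is
ratios = [new / old for new, old in zip(pi_new, pi_old)]    # importance weights
corrected = expected_reward(pi_old, rewards, ratios)        # stale data, reweighted

print(f"target={target:.3f}  naive={naive:.3f}  corrected={corrected:.3f}")
```

Used as-is, the stale data understates the current policy's expected reward by a factor of three in this toy; reweighting each action by the ratio of new to old probability recovers the on-policy value.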

Fun Note: The Polite Fiction of On-Policy at Scale

Strictly, the moment you put one actor on a different machine from the learner, "on-policy" becomes a polite fiction: there is always some lag, even if it is a fraction of a version. Large-scale RL systems do not pretend otherwise. They embrace a bounded amount of off-policy-ness and pay for it with a correction term, much as a parameter server embraces bounded staleness rather than insisting on perfect synchrony. Purity is a single-machine luxury.

The loop below makes the lag concrete. It implements a minimal actor-learner system in pure Python: four actors each run a trivial environment under a cached policy version, the learner consumes their batch every round and bumps its version, and the actors pull fresh weights only every third round. We track, each round, the learner's version, every actor's cached version, and the maximum lag across actors.

import random
random.seed(0)

class Learner:                                  # owns the authoritative policy
    def __init__(self):
        self.weight, self.version = 0.0, 0
    def update(self, experiences):
        if experiences:                         # tiny step toward mean reward
            signal = sum(r for (_, _, r, _) in experiences) / len(experiences)
            self.weight += 0.1 * signal
        self.version += 1                       # every update mints a new version
        return self.weight, self.version

class Actor:                                    # holds a CACHED copy of the policy
    def __init__(self, weight, version):
        self.weight, self.version = weight, version
    def rollout(self):                          # act under the (stale) cached policy
        action = self.weight + random.gauss(0.0, 0.5)
        reward = 1.0 - abs(action - 1.0)        # reward peaks when action ~ 1.0
        return (0.0, action, reward, 1.0)       # (state, action, reward, next_state)

NUM_ACTORS, ROUNDS, PULL_EVERY = 4, 12, 3       # actors refresh weights every 3 rounds
learner = Learner()
actors = [Actor(learner.weight, learner.version) for _ in range(NUM_ACTORS)]

print(f"{'round':>5} | {'learner_v':>9} | {'actor_vers':>18} | {'max_lag':>7}")
print("-" * 52)
for rnd in range(1, ROUNDS + 1):
    batch = [a.rollout() for a in actors]                 # experience flows UP (gather)
    new_w, new_v = learner.update(batch)
    if rnd % PULL_EVERY == 0:                             # weights flow DOWN (broadcast)
        for a in actors:
            a.weight, a.version = new_w, new_v
    versions = [a.version for a in actors]
    max_lag = learner.version - min(versions)            # policy lag = version gap
    print(f"{rnd:>5} | {learner.version:>9} | {str(versions):>18} | {max_lag:>7}")
print("-" * 52)                                          # closing rule, as shown in Output 20.2.1
Code 20.2.1: A minimal actor-learner loop from scratch. The learner mints a new policy version on every update (the gather of experience); actors broadcast-refresh their cached weights only every PULL_EVERY rounds, so their version trails the learner's and the gap is the measured policy lag.
round | learner_v |         actor_vers | max_lag
----------------------------------------------------
    1 |         1 |       [0, 0, 0, 0] |       1
    2 |         2 |       [0, 0, 0, 0] |       2
    3 |         3 |       [3, 3, 3, 3] |       0
    4 |         4 |       [3, 3, 3, 3] |       1
    5 |         5 |       [3, 3, 3, 3] |       2
    6 |         6 |       [6, 6, 6, 6] |       0
    7 |         7 |       [6, 6, 6, 6] |       1
    8 |         8 |       [6, 6, 6, 6] |       2
    9 |         9 |       [9, 9, 9, 9] |       0
   10 |        10 |       [9, 9, 9, 9] |       1
   11 |        11 |       [9, 9, 9, 9] |       2
   12 |        12 |   [12, 12, 12, 12] |       0
----------------------------------------------------
Output 20.2.1: Policy lag follows a sawtooth: it climbs by one each round the learner updates without a broadcast, then drops to zero the round actors pull fresh weights. The slowest actor here trails the learner by up to two versions, so a third of the experience the learner consumes is two versions off-policy. Broadcasting more often (smaller PULL_EVERY) flattens the sawtooth at the cost of more weight traffic.

The sawtooth is the signature of every decoupled actor-learner system. The lag's ceiling is set by how often you broadcast weights: in Code 20.2.1 it is PULL_EVERY - 1, so pulling every round holds the lag at zero, pulling every third round caps it at two, and never pulling lets it grow without bound. That single knob trades freshness against network traffic, and choosing it well is a recurring design decision across the chapter. Notice also that the learner's weight in this toy drifts rather than converging on the reward-maximizing value of 1.0: the update is a crude step proportional to mean reward, computed from noisy rollouts with no correction for staleness. The staleness component of that error is the off-policy bias Section 20.5 repairs.
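The relationship between the pull interval and the lag ceiling can be checked directly. The sketch below replays only the version bookkeeping of Code 20.2.1 (rewards and weights are dropped because they do not affect the versions); `lag_profile` is a hypothetical helper name, and lag is measured after the learner's update, as in Output 20.2.1.

```python
def lag_profile(pull_every: int, rounds: int = 12):
    """Replay the version bookkeeping of Code 20.2.1 without the rewards.
    Returns (largest lag observed after any round, number of broadcasts)."""
    learner_version = actor_version = 0
    max_lag = broadcasts = 0
    for rnd in range(1, rounds + 1):
        learner_version += 1                    # learner updates every round
        if rnd % pull_every == 0:               # actors pull on schedule
            actor_version = learner_version
            broadcasts += 1
        max_lag = max(max_lag, learner_version - actor_version)
    return max_lag, broadcasts


for pull_every in (1, 2, 3, 6, 12):
    max_lag, broadcasts = lag_profile(pull_every)
    print(f"PULL_EVERY={pull_every:>2}  max_lag={max_lag:>2}  broadcasts={broadcasts:>2}")
```

Under this accounting the ceiling is one less than the pull interval, while the number of broadcasts falls in proportion to it: halving the lag roughly doubles the weight traffic.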

Library Shortcut: Ray RLlib Runs Actors and Learners for You

Code 20.2.1 hand-rolls the version bookkeeping, the broadcast schedule, and the gather. A production RL framework supplies all three. In Ray RLlib you declare how many actors to run (recent releases call them env runners; older releases called them rollout workers), and the framework spawns them as remote processes, streams their experience to the learner, and broadcasts updated weights on its own schedule, version tracking included:

from ray.rllib.algorithms.impala import IMPALAConfig

config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=4)     # four actor processes (experience UP)
    .training(lr=0.0005)                # one learner (weights DOWN, broadcast)
)
algo = config.build()
for _ in range(10):
    algo.train()                        # gather + update + broadcast, all handled
Code 20.2.2: The same actor-learner pattern as Code 20.2.1, now four lines of configuration. RLlib spawns the four actor processes, pipes their trajectories to the learner, broadcasts refreshed weights, and applies V-trace off-policy correction internally, replacing roughly the entire loop above and the correction machinery of Section 20.5. Section 20.9 returns to RLlib and other distributed RL stacks in depth.

4. Centralized and Decentralized Topologies (Intermediate)

The two flows can be wired together in more than one shape. In a centralized topology, a single learner owns the authoritative policy: all actors send experience to it and all receive weights from it. This is the simplest and most common arrangement, and it is the one in Code 20.2.1. Its virtue is a single source of truth for the policy version; its limit is that the lone learner is a throughput bottleneck and a single point of failure, so once the actors out-produce one learner you must either grow the learner's batch or shard the learning itself. In a decentralized topology, the learning is spread across several learner replicas that keep their copies of the policy in sync with each other (typically by all-reducing their gradients, exactly the data-parallel pattern of Chapter 15) while each serves a subset of the actors. This removes the single learner bottleneck at the cost of an extra synchronization among learners and a subtler version story, since "the" policy version is now an agreement among replicas.
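The decentralized case can be sketched in a few lines. This is a single-process stand-in under stated assumptions: `all_reduce_mean` imitates a real all-reduce collective with a plain average, `LearnerReplica` is a hypothetical class, and the "gradient" is simply the mean reward of each replica's shard, chosen to keep the arithmetic visible.

```python
def all_reduce_mean(local_grads):
    """Average one gradient per replica and hand every replica the same result
    (a plain-Python stand-in for a real all-reduce collective)."""
    mean = sum(local_grads) / len(local_grads)
    return [mean] * len(local_grads)


class LearnerReplica:
    def __init__(self, weight=0.0):
        self.weight, self.version = weight, 0

    def local_gradient(self, rewards):
        # Hypothetical learning signal: mean reward from this replica's actors.
        return sum(rewards) / len(rewards)

    def apply(self, grad, lr=0.1):
        self.weight += lr * grad
        self.version += 1


replicas = [LearnerReplica() for _ in range(2)]
shards = [[1.0, 0.0], [1.0, 1.0]]            # experience from each replica's own actors

for _ in range(3):
    grads = [r.local_gradient(s) for r, s in zip(replicas, shards)]
    for replica, grad in zip(replicas, all_reduce_mean(grads)):
        replica.apply(grad)                   # identical update on every replica

print([(r.weight, r.version) for r in replicas])
```

Each replica sees different experience, yet both end every step with the same weight and the same version, because they apply the same averaged gradient. That shared step count is what "the" policy version means once there is no single learner.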

This pair of choices should feel familiar, because it is the parameter-server question in new clothing. A centralized learner that holds the master weights, receives updates, and pushes refreshed parameters back to workers is structurally a parameter server (Section 11.9): push-up, pull-down, with bounded staleness tolerated in between. The resemblance is real and worth leaning on. The difference is what travels up. In supervised parameter-server training the workers push gradients, already-digested learning signal computed against a known loss. In RL the actors push raw experience, undigested trajectories whose learning value depends on which policy produced them, which is exactly why the policy version must ride along with the data and why the off-policy correction of Section 20.5 has no counterpart in the supervised parameter server.

Practical Example: The Robotics Lab Whose Simulator Starved the GPU

Who: A reinforcement-learning engineer at a robotics lab training a locomotion policy in a physics simulator.

Situation: A single-process PPO loop on one workstation trained a quadruped to walk, but each policy iteration took most of a day.

Problem: Profiling showed the GPU idle roughly 80 percent of the time, blocked waiting for the CPU physics simulator to produce the next batch of rollouts.

Dilemma: Buy a faster GPU, which would only sit idle longer, or restructure into actor-learner so the simulator could run on many cores in parallel while the GPU trained continuously.

Decision: They restructured, running thirty-two actor processes across the workstation's CPU cores feeding one GPU-backed learner, accepting a small policy lag in exchange for a saturated accelerator.

How: Actors stepped the simulator and emitted trajectories tagged with their policy version; the learner gathered them into large batches, updated, and broadcast new weights every few iterations, with a V-trace correction for the lag.

Result: GPU utilization rose above 85 percent and wall-clock time per policy iteration fell from most of a day to under an hour, with final policy quality unchanged because the bounded lag was corrected rather than ignored.

Lesson: When the simulator starves the accelerator, the fix is not a bigger accelerator but the actor-learner split, paying a measured policy lag to keep both resources busy.

5. Why This Pattern, and What It Is Not (Advanced)

It is worth stating plainly what the actor-learner architecture does and does not buy you, because it is easy to over-credit. It buys decoupling and therefore scale: you can throw cheap CPU machines at experience generation until the learner is saturated, which is the single largest lever in scaling RL throughput. It does not, by itself, make learning faster per unit of experience; in fact it makes each unit slightly less valuable, because lag turns fresh on-policy data into stale off-policy data. The architecture is a throughput trade, more experience per second at the price of each sample being a little less on-policy, and it only pays off when you can correct the staleness cheaply enough that the extra volume wins. That is why this section and the correction section that follows are inseparable: the pattern without the correction is just a faster way to compute a biased gradient.

Research Frontier: Actor-Learner at LLM Scale (2024 to 2026)

The actor-learner split has become the backbone of reinforcement learning from human feedback and, more recently, RL for reasoning models, where the "environment" is text generation and the "reward" comes from a learned or rule-based scorer. The dominant bottleneck flips: the actor's rollout is now an expensive autoregressive generation. Systems from 2024 to 2026 such as OpenRLHF, NeMo-Aligner, and the open-source frameworks behind DeepSeek-R1-style training (the GRPO family) therefore pair a fast inference engine such as vLLM, serving as the actor, with a separate trainer serving as the learner, and keep the policy lag between generation and update deliberately bounded. The live research questions are how stale the generating policy may be before the off-policy correction of Section 20.5 breaks down, and how to schedule the weight broadcast (now a multi-billion-parameter tensor) without stalling either side. The two flows are the same as in Code 20.2.1; only the byte counts and the cost of each rollout have grown by orders of magnitude.
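To get a feel for the byte counts, a rough estimate is enough. Everything in the sketch below is an assumption chosen for illustration: the parameter counts, two bytes per parameter (a 16-bit format), and a 25 GB/s link. It ignores sharding, compression, and overlap with compute.

```python
def broadcast_cost(num_params: float, bytes_per_param: int, link_gb_per_s: float):
    """Return (payload in GB, seconds to ship one copy) for a weight broadcast.
    Ignores compression, sharding, and overlap with compute."""
    payload_gb = num_params * bytes_per_param / 1e9
    return payload_gb, payload_gb / link_gb_per_s


for name, params in [("1M-parameter policy", 1e6), ("7B-parameter model", 7e9)]:
    gb, secs = broadcast_cost(params, bytes_per_param=2, link_gb_per_s=25.0)
    print(f"{name:>20}: {gb:10.3f} GB per copy, {secs:8.4f} s per receiver")
```

Under these assumptions a small policy ships in well under a millisecond, while the large model needs about half a second per receiver, long enough that the broadcast schedule itself becomes part of the lag budget.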

The pattern also reshapes failure. A crashed actor in a centralized topology costs only the in-flight trajectories it had not yet sent, and the learner continues with one fewer producer, a graceful degradation the monolithic loop cannot offer. A crashed learner, by contrast, halts all learning until it recovers, which is why decentralized topologies and the elastic, fault-tolerant techniques of Chapter 18 matter as the actor and learner counts grow. We now have the architecture (two roles, two flows), the defect it introduces (policy lag, an RL-specific staleness), and the two topologies that wire it together. Section 20.3 descends into the actor side to ask how thousands of environment instances actually produce, batch, and ship their experience without drowning the network.

Exercise 20.2.1: Name the Flow and the Collective (Conceptual)

For each event in an actor-learner system, state whether it belongs to the experience flow or the weight flow, and which collective from Section 4.7 it most resembles: (a) the learner sends identical updated parameters to all forty actors; (b) the learner assembles a training batch from the trajectories of all forty actors; (c) in a decentralized topology, four learner replicas combine their gradients so their policy copies stay identical. Then explain in one sentence why the experience flow and the weight flow can run at different rates, when in data-parallel training (Chapter 15) the two halves are locked together.

Exercise 20.2.2: Tune the Lag Sawtooth (Coding)

Starting from Code 20.2.1, add a per-actor random refresh: instead of all actors pulling on the same round, give each actor an independent probability $p$ of pulling fresh weights after each round. Run for 50 rounds at $p = 0.2$, $0.5$, and $0.9$, and for each report the average and maximum policy lag across actors and rounds. Plot or tabulate average lag against $p$. Explain the shape of the curve and connect it to the freshness-versus-traffic trade-off: what does a real system pay to push $p$ toward 1?

Exercise 20.2.3: Centralized Learner as a Bottleneck (Analysis)

Suppose each actor produces $1{,}000$ transitions per second and the centralized learner can consume $200{,}000$ transitions per second. (a) How many actors saturate one learner? (b) If you run twice that many actors, what happens to the experience that cannot be consumed, and how does the replay buffer of Section 20.4 change the answer? (c) Argue from these numbers when you should move from a centralized to a decentralized topology, and relate the learner's role here to the parameter server of Section 11.9, naming one thing the RL learner must track that a supervised parameter server does not.