Part IV: Parallel Deep Learning and Large Models
Chapter 20: Distributed Reinforcement Learning Infrastructure

Frameworks and Practice

"I have been a hand-rolled actor loop, a Ray RLlib config, and a six-thousand-GPU RLHF pipeline. The architecture never changed; only the size of the broadcast did."

An Actor-Learner That Has Worn Many Frameworks

Big Picture

Every distributed-RL framework in production, from a classic Atari agent to a frontier RLHF run, is the same actor-learner machine you built across this chapter: actors generate experience, a replay or rollout buffer holds it, a learner consumes it, and fresh weights broadcast back. What differs between Ray RLlib, Acme, SampleFactory, and the LLM-RL stacks is not the architecture but which stage dominates the cost and therefore which knob the framework exposes. Choosing a framework is choosing where your bottleneck lives. This closing section maps the major stacks onto the architecture, gives a decision guide for the three forks that matter (single-machine versus distributed, on-policy versus off-policy, classic-RL versus LLM-RL), and assembles the entire chapter into one runnable minimal stack so you can measure the three numbers the whole chapter taught you to balance: sampling throughput, replay ratio, and policy lag.

The previous section, Section 20.8, ended on a diagnosis: a distributed-RL system is a pipeline of two interleaved workloads, sampling and learning, and its throughput is set by whichever stage is slower. You do not, in practice, write that pipeline from raw sockets. You reach for a framework that has already solved process-group setup, replay sharding, weight broadcast, and fault recovery, and that lets you spend your attention on the one stage that binds. The skill this section builds is reading a framework as an implementation of the actor-learner architecture, so that when you open Ray RLlib or OpenRLHF you already know what every component is and which dial changes which number.
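The "slower stage sets the pace" diagnosis fits in a few lines. The sketch below is a back-of-the-envelope model, not a measurement: the function names are hypothetical, the rates are made-up numbers in samples per second, and the model assumes that actors scale linearly and that the learner consumes each fresh sample once.

```python
import math

def pipeline_rate(n_actors, per_actor_rate, learner_rate):
    """Steady-state samples/sec of a two-stage pipeline: the slower stage wins."""
    sampling_rate = n_actors * per_actor_rate      # assumes linear actor scaling
    return min(sampling_rate, learner_rate)

def binding_stage(n_actors, per_actor_rate, learner_rate):
    """Name the stage that currently limits throughput."""
    return "sampling" if n_actors * per_actor_rate < learner_rate else "learning"

def imbalance_point(per_actor_rate, learner_rate):
    """Smallest actor count at which the learner becomes the bottleneck."""
    return math.ceil(learner_rate / per_actor_rate)

# Hypothetical numbers: each actor yields 5,000 samples/s, the learner absorbs 30,000/s.
for k in (2, 4, 6, 8, 16):
    print(k, pipeline_rate(k, 5_000, 30_000), binding_stage(k, 5_000, 30_000))
print("imbalance point:", imbalance_point(5_000, 30_000))
```

Under these assumed rates, throughput stops growing at six actors; every actor added beyond that point waits on the learner.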

[Figure 20.9.1 diagram. Sampling stage: vectorized actors 1 to K (CPU/GPU fleet), each holding local weights at version v and a rollout batch. Experience store: sharded ring replay buffer of transitions tagged with policy version v. Learning stage: a learner that takes a gradient step on a sampled minibatch and bumps the version to v + 1. Arrows: experience from actors to replay, samples from replay to learner (replay ratio), and a weight broadcast that pushes fresh parameters back to every actor (sync or async).]
Figure 20.9.1: The full distributed-RL stack that every framework in this section implements. Vectorized actors (orange) roll out with a local copy of the weights and a version tag $v$; the sharded replay buffer (blue) stores version-tagged transitions; the learner (green) samples a minibatch, takes a gradient step, and bumps the version; the dashed broadcast arrow returns fresh weights. The replay ratio is set by how often the learner samples per fresh transition, and the policy lag is the gap between the learner's current version and the version that produced a consumed sample. Sections 20.2 through 20.8 built each box; this section runs them together.
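The two quantities named in the caption can be written out explicitly. The notation here is introduced for convenience and is an assumption rather than the chapter's own: $U$ is the number of learner updates, $B$ the minibatch size, $N$ the number of fresh transitions collected, $v_u$ the learner's version at update $u$, and $v_{u,i}$ the version tag on the $i$-th sample of that minibatch.

```latex
\text{replay ratio} = \frac{U \cdot B}{N},
\qquad
\text{mean policy lag} = \frac{1}{U B} \sum_{u=1}^{U} \sum_{i=1}^{B} \bigl( v_u - v_{u,i} \bigr)
```

These are the same ratios that the instrumentation in Code 20.9.2 prints. With the values reported in Output 20.9.2 ($U = 800$, $B = 512$, $N = 204{,}800$), the replay ratio is $800 \cdot 512 / 204{,}800 = 2.0$.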

1. Ray RLlib: The Architecture as a Library (Beginner)

Ray RLlib is the most direct mapping of this chapter's architecture onto a framework, because RLlib is built on Ray, and Ray's distributed primitives are exactly the actor model the architecture needs. A Ray actor is a stateful worker process addressable across the cluster; an RLlib rollout worker is one such actor that holds an environment and a policy copy and returns batches of experience. The learner is another Ray actor (or a group of them, for multi-GPU learning) that owns the optimizer. Ray's object store moves experience and weights between them without you touching a socket, and Ray's placement groups pin actors to the right machines. The actor-learner diagram you have been reading all chapter is, almost line for line, RLlib's internal object graph. We meet Ray as a cluster-computing substrate in its own right in Chapter 33, and Ray Serve as an inference fabric in Chapter 23; here we use only its actor primitive, the unit that makes RLlib's rollout-worker fan-out a few lines instead of a networking project.

Key Insight: A Framework Is a Default Choice for Every Knob You Studied

Across this chapter you exposed a set of dials: number of actors, replay capacity, replay ratio, sync versus async weight broadcast, and the off-policy correction that pays for lag. A framework does not remove those dials; it ships sensible defaults for them and a config surface to change them. Reading a framework well means locating each dial. In RLlib, num_env_runners is your actor count, train_batch_size and replay_buffer_config set the learning side, and the algorithm class (PPO versus APPO versus DQN) picks on-policy or off-policy and therefore whether lag must be corrected. When a run scales badly, you do not invent a fix; you find the dial from this chapter that the framework named differently.
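One way to practice "locating each dial" is to keep the mapping as a small lookup. The table below only restates the names mentioned in this section; it is not an exhaustive or authoritative list of RLlib settings, and locate_dial is a hypothetical helper written for this sketch.

```python
# Chapter dial -> the RLlib name this section associates with it.
DIALS = {
    "actor count": "num_env_runners",
    "learning side: batch size": "train_batch_size",
    "learning side: replay settings": "replay_buffer_config",
    "on-policy versus off-policy": "algorithm class (PPO, APPO, DQN)",
    "off-policy correction for lag": "vtrace (see Code 20.9.1)",
}

def locate_dial(keyword):
    """Return (dial, RLlib name) pairs whose dial description mentions the keyword."""
    kw = keyword.lower()
    return [(dial, name) for dial, name in DIALS.items() if kw in dial]

print(locate_dial("actor"))    # which setting controls the fan-out?
print(locate_dial("replay"))   # which setting shapes the experience store?
```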

Library Shortcut: Ray RLlib Builds the Whole Stack From a Config

The end-to-end stack you assemble by hand later in this section (vectorized actors, a replay buffer, a learner, and weight broadcast) is in RLlib a configuration object and one .build() call. The framework wires the Ray actors, the experience transport, and the weight broadcast for you:

# pip install "ray[rllib]"
from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()                                  # async actor-learner (IMPALA-style)
    .environment("CartPole-v1")
    .env_runners(num_env_runners=8)               # 8 rollout actors (env runners) = the fan-out
    .training(
        train_batch_size=2000,                    # samples per training batch (not the SGD minibatch)
        vtrace=True,                              # off-policy correction for policy lag
    )
)
algo = config.build()                             # wires actors + learner + broadcast
for _ in range(50):
    result = algo.train()                         # one round: sample -> learn -> sync
    print(result["env_runners"]["episode_return_mean"])
Code 20.9.1: The actor-learner architecture of Section 20.2 as about a dozen lines of RLlib. num_env_runners is the actor count from Section 20.3, vtrace=True is the V-trace correction of Section 20.5, and APPO is the asynchronous design of Section 20.7. The roughly two hundred lines of process-group setup, experience transport, and weight broadcast you would otherwise write collapse into the config. Config method names and result keys have changed across Ray releases, so check them against the version you install.

2. The Other Stacks, and Where Each One Wins (Intermediate)

RLlib is general; several other stacks specialize, and the specialization is always a bet on which stage of Figure 20.9.1 dominates. Stable Baselines3 is the single-machine baseline: clean reference implementations of PPO, SAC, and DQN that run in one process. It is the right starting point and the correct answer whenever no resource ceiling from Chapter 1 actually binds, and you should reach for it before any distributed stack so that you have a correctness yardstick. Acme, from DeepMind, is a research framework that separates actors, learners, and replay into composable modules, and it pairs with Reverb, a dedicated distributed replay-buffer service that implements the prioritized, version-tagged storage of Section 20.4 as a standalone server other processes call into. When the replay buffer is your bottleneck, Reverb is the specialist.
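To make "prioritized, version-tagged storage" concrete, here is a minimal in-process sketch of the idea. It is a toy stand-in, not Reverb's API: the class name, its methods, and the caller-supplied priority value are all hypothetical, and a real service would add sharding, rate limiting, and network transport.

```python
import random
from collections import deque

class PrioritizedVersionedReplay:
    """Toy replay store: bounded, version-tagged, sampled in proportion to priority."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)    # entries are (transition, version, priority)
        self.rng = random.Random(seed)

    def add(self, transition, version, priority=1.0):
        # Once capacity is reached, the oldest entry falls off the left end.
        self.items.append((transition, version, float(priority)))

    def sample(self, n):
        # Draw n entries with replacement, probability proportional to priority.
        weights = [p for _, _, p in self.items]
        return self.rng.choices(list(self.items), weights=weights, k=n)

    def mean_lag(self, learner_version):
        # Average number of learner versions by which the stored data is stale.
        return sum(learner_version - v for _, v, _ in self.items) / len(self.items)

buf = PrioritizedVersionedReplay(capacity=3)
for step in range(5):
    buf.add(transition=f"t{step}", version=step, priority=1.0 + step)
batch = buf.sample(4)
print([t for t, _, _ in batch], buf.mean_lag(learner_version=5))
```

Because the capacity is three, only t2, t3, and t4 survive the five inserts, and their mean lag against learner version 5 is 2.0. The version tag stored with each entry is what lets the learner measure staleness at all.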

When the sampling stage dominates, two stacks attack throughput directly. SampleFactory pushes a single machine to extreme rollout rates (tens to hundreds of thousands of environment steps per second) by keeping actors, a policy-inference worker, and the learner in tightly coupled processes with shared-memory queues, the asynchronous design of Section 20.7 taken to its single-node limit. EnvPool attacks the same bottleneck from the environment side: it runs many environment instances in a C++ thread pool behind a batched, vectorized interface, so a Python learner sees one fast vectorized environment instead of thousands of slow ones. EnvPool composes with the others; it is the sampling-stage accelerator you bolt onto whatever learner you already have. Table 20.9.1 places these stacks on the architecture.
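The "one fast vectorized environment instead of thousands of slow ones" idea is an interface choice as much as a speed trick. The sketch below shows only the interface, under stated assumptions: ToyEnv and BatchedEnv are hypothetical classes, the loop runs in plain Python rather than a native thread pool, and none of this is EnvPool's actual API.

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment: reward 1.0 when the action matches the sign of the state."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = self.rng.uniform(-1.0, 1.0)

    def step(self, action):
        reward = 1.0 if (action == 1) == (self.state > 0) else 0.0
        self.state = self.rng.uniform(-1.0, 1.0)    # next observation
        return self.state, reward

class BatchedEnv:
    """One object, N environments, one step call per batch of actions."""
    def __init__(self, n_envs):
        self.envs = [ToyEnv(seed=i) for i in range(n_envs)]

    def observe(self):
        return [e.state for e in self.envs]

    def step(self, actions):
        # A native accelerator would run this loop in worker threads; here it is sequential.
        results = [e.step(a) for e, a in zip(self.envs, actions)]
        return [s for s, _ in results], [r for _, r in results]

venv = BatchedEnv(n_envs=8)
obs = venv.observe()
actions = [1 if s > 0 else 0 for s in obs]     # one batched policy decision
next_obs, rewards = venv.step(actions)
print(len(next_obs), sum(rewards))             # 8 observations, total reward 8.0
```

The policy sees a batch of eight observations and returns a batch of eight actions, so per-environment Python overhead is paid once per batch rather than once per environment.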

Table 20.9.1: Distributed-RL frameworks read as implementations of the actor-learner architecture, with the stage of Figure 20.9.1 each one specializes in. "Classic RL" means environment-driven agents; the LLM-RL row is developed in Section 3.
Stack | Scope | Stage it specializes | Reach for it when
Stable Baselines3 | Single machine, classic RL | None (one process) | No ceiling binds; you need a correctness baseline
Ray RLlib | Distributed, classic RL | General; all stages | You want one config-driven stack across many machines
Acme + Reverb | Distributed, classic RL | Replay buffer | Replay is the bottleneck or you need prioritized storage as a service
SampleFactory | Single node, high throughput | Sampling (async, shared memory) | One big machine must hit extreme rollout rates
EnvPool | Environment layer | Sampling (vectorized envs) | The environment, not the policy, is your sampling bottleneck
OpenRLHF / veRL / TRL / NeMo-Aligner | Distributed, LLM-RL | Inference + learning fusion | You are doing RLHF or RL for reasoning on an LLM

Fun Note: The Same Diagram, Three Orders of Magnitude Apart

A SampleFactory Atari agent and a frontier RLHF run draw the identical actor-learner diagram, yet the "actor" in one is a few kilobytes of policy weights rolling out a game frame, and in the other it is a multi-billion-parameter language model generating a paragraph. The broadcast that is a trivial pointer copy in the first becomes, in the second, a sharded weight transfer across hundreds of GPUs that can take longer than the gradient step itself. When people say RLHF is "just RL," they are right about the diagram and wrong about the bill.
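The size of "the bill" is easy to estimate. The arithmetic below is a lower bound under loudly stated assumptions: the parameter counts, bytes per parameter, and link bandwidths are illustrative numbers chosen for this sketch, and the model ignores sharding, overlap, and protocol overhead.

```python
def broadcast_seconds(n_params, bytes_per_param, link_gb_per_s):
    """Lower bound on the time to ship one full weight copy over one link."""
    payload_gb = n_params * bytes_per_param / 1e9
    return payload_gb / link_gb_per_s

# Assumed numbers, for illustration only.
small_policy = broadcast_seconds(n_params=2_000, bytes_per_param=4, link_gb_per_s=10)
llm_policy = broadcast_seconds(n_params=13e9, bytes_per_param=2, link_gb_per_s=25)

print(f"small policy (8 KB): {small_policy * 1e6:.1f} microseconds")
print(f"13B model (26 GB)  : {llm_policy:.2f} seconds per receiving copy")
```

Under these assumptions the small policy's broadcast takes under a microsecond of link time, while the 13-billion-parameter model needs about a second per receiving copy, a gap of roughly six orders of magnitude for the same arrow in the diagram.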

3. The LLM-RL Stacks: Inference and Training, Fused (Advanced)

Reinforcement learning from human feedback, and its newer sibling, reinforcement learning for reasoning, are the reason distributed RL has become a frontier concern again. The architecture is unchanged: a policy generates experience, a learner improves it, weights broadcast back. What changes is that the policy is a large language model, so the sampling stage is itself a distributed-inference problem, the very subject of Chapter 24. An RLHF actor does not step a cheap simulator; it runs autoregressive generation through a sharded model, often served by an inference engine such as vLLM. The learner, meanwhile, runs a sharded training step using the foundation-model machinery of Section 19.8 in Chapter 19. The hard engineering is the fusion: keeping a fast inference deployment and a heavy training deployment on the same cluster, and broadcasting updated weights from the trainer into the inference engine quickly enough that policy lag stays bounded.

The stacks that solve this fusion are the current center of gravity. TRL (Hugging Face) is the accessible entry point, with PPO, DPO, and GRPO trainers that run on a single multi-GPU node and integrate with the wider Hugging Face ecosystem. OpenRLHF and veRL are the scale-out stacks: both use Ray to schedule the inference (rollout) engines and the training engines as distinct workloads and engineer the weight-broadcast path between them, which is precisely the broadcast arrow of Figure 20.9.1 turned into a multi-hundred-GPU operation. NeMo-Aligner (NVIDIA) brings the same fusion to the Megatron-Core training stack for the largest models. Whenever one of these stacks lets generation and training run asynchronously, the off-policy correction of Section 20.5 reappears: the samples the learner consumes were produced by a slightly older policy, and the importance weighting that pays for that lag is the same idea you met for IMPALA, now applied to token sequences.
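Applied to token sequences, the importance weighting reduces to a per-token ratio between the learner's probability and the generating policy's probability, truncated from above. The sketch below shows only that truncated ratio, in the style of V-trace's clipped weights; the function name, the rho_bar default, and the log-probability values are hypothetical, and whether a given LLM-RL stack uses exactly this form is an assumption.

```python
import math

def truncated_importance_weights(logp_learner, logp_behavior, rho_bar=1.0):
    """Per-token weights min(rho_bar, pi_learner / pi_behavior), computed from log-probs."""
    if len(logp_learner) != len(logp_behavior):
        raise ValueError("log-prob sequences must align token for token")
    return [min(rho_bar, math.exp(lp_new - lp_old))
            for lp_new, lp_old in zip(logp_learner, logp_behavior)]

# Hypothetical log-probabilities for a four-token response.
logp_behavior = [-1.20, -0.50, -2.00, -0.10]   # older policy that generated the tokens
logp_learner = [-1.00, -0.50, -2.50, -0.05]    # current learner policy, same tokens

weights = truncated_importance_weights(logp_learner, logp_behavior)
print([round(w, 3) for w in weights])          # [1.0, 1.0, 0.607, 1.0]
```

Tokens the learner now likes more than the generating policy did are capped at rho_bar, which bounds the variance; tokens it likes less are down-weighted, which is how the update discounts stale experience.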

Research Frontier: RL for Reasoning and the Rollout-Bound Era (2024 to 2026)

Two shifts define the current frontier. First, RL for reasoning: methods in the lineage of GRPO (Shao et al., 2024) and the open reasoning-model efforts of 2025 use rule-based or verifiable rewards instead of a learned reward model, which makes the learner cheaper and pushes nearly all the cost onto generation. These runs are rollout-bound: the sampling stage of Figure 20.9.1 dominates so completely that the central optimization is co-locating and overlapping a vLLM-class inference engine with the trainer. Second, the stacks themselves have consolidated; veRL (Sheng et al., 2024) introduced a hybrid-controller design that flexibly maps the rollout and training workloads onto devices, and OpenRLHF popularized Ray-scheduled separation of the two, both reporting large throughput gains over naive co-location. The open question is the weight-broadcast path: as models grow, getting fresh parameters from trainer to inference engine without stalling either side is the bottleneck that the next generation of frameworks is built to attack, the multi-hundred-GPU descendant of the simple broadcast you measure below.

4. A Decision Guide: Three Forks (Intermediate)

Choosing a stack reduces to three forks, taken in order. Each one is a question this chapter taught you to answer with numbers, not taste.

Fork one: single-machine or distributed? This is the Chapter 1 question applied to RL. If your environment is cheap, your model fits on one accelerator, and one machine's rollout rate keeps the learner fed, you are done: use Stable Baselines3 and do not pay the communication and failure taxes of distribution. You distribute only when a specific ceiling binds, when the environment is too slow (reach for EnvPool or SampleFactory), when the model does not fit (reach for a sharded learner), or when one machine cannot generate experience fast enough (fan out actors with RLlib). Measure the imbalance from Section 20.8 before you scale; an unbalanced single machine does not become balanced by adding machines.

Fork two: on-policy PPO or off-policy replay? On-policy methods (PPO) require that the data used for an update came from the current policy, which forces tight synchronization and caps the replay ratio near one; they are stable and are the default for RLHF. Off-policy methods (DQN, SAC, Ape-X) reuse a replay buffer, allowing high replay ratios and large actor fleets, and asynchronous designs such as IMPALA accept slightly stale actor data even without a large buffer; both pay for the resulting policy lag with the off-policy correction of Section 20.5. The fork is a throughput-versus-stability trade: choose off-policy when sampling is expensive and you must reuse experience, on-policy when stability matters more than sample reuse.

Fork three: classic RL or LLM-RL? If your policy is a small network stepping an environment, the classic stacks (RLlib, Acme, SampleFactory) apply directly. If your policy is a language model, the sampling stage becomes distributed inference and you need an LLM-RL stack (TRL for one node, OpenRLHF or veRL for many, NeMo-Aligner for the largest), because only those engineer the inference-training fusion and the weight-broadcast path that classic stacks never had to. Misreading this fork is the most expensive mistake: running an LLM policy through a classic stack's rollout worker ignores the distributed-inference problem that is most of the cost.
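The three forks can be collapsed into a small lookup, which is a useful check that you can answer each question with a measurement. This is a sketch of the guide above and nothing more: the function names, argument names, and the binding_stage labels are hypothetical, and the returned strings are simply the stacks named in Table 20.9.1.

```python
def choose_stack(policy_is_llm, one_machine_suffices, binding_stage=None,
                 largest_models=False):
    """Suggest a stack. binding_stage is None, "environment", "rollout", or "replay"."""
    if policy_is_llm:                                    # fork three
        if largest_models:
            return "NeMo-Aligner"
        return "TRL" if one_machine_suffices else "OpenRLHF or veRL"
    if one_machine_suffices and binding_stage is None:   # fork one: no ceiling binds
        return "Stable Baselines3"
    if binding_stage == "environment":
        return "EnvPool (bolted onto your existing learner)"
    if binding_stage == "rollout" and one_machine_suffices:
        return "SampleFactory"
    if binding_stage == "replay":
        return "Acme + Reverb"
    return "Ray RLlib"                                   # general distributed default

def choose_algorithm_family(sampling_is_expensive, stability_first):
    """Fork two: the throughput-versus-stability trade."""
    if stability_first or not sampling_is_expensive:
        return "on-policy (PPO)"
    return "off-policy (replay plus correction)"

print(choose_stack(policy_is_llm=False, one_machine_suffices=True))
print(choose_stack(policy_is_llm=False, one_machine_suffices=False, binding_stage="replay"))
print(choose_stack(policy_is_llm=True, one_machine_suffices=False))
print(choose_algorithm_family(sampling_is_expensive=True, stability_first=False))
```

Every argument corresponds to something you measure or know before choosing; if you cannot fill one in, that is the profiling run to do first.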

Practical Example: The RLHF Run That Was Bottlenecked in the Wrong Place

Who: An applied-research team fine-tuning a 13-billion-parameter model with RLHF on a 32-GPU cluster.

Situation: Their first pipeline co-located generation and training in one process group and reported a dismal 8 percent GPU utilization on the training side.

Problem: Engineers assumed the gradient step was slow and requested more training GPUs, the instinctive scale-up move.

Dilemma: Add training GPUs, which were already mostly idle, or re-examine which stage of the actor-learner pipeline actually bound the run.

Decision: They profiled the two stages separately, as Section 20.8 prescribes, and found generation, not training, consumed 85 percent of wall-clock; the trainers sat idle waiting for rollouts.

How: They migrated to an OpenRLHF-style layout, placing vLLM rollout engines on their own GPUs as a separate deployment, overlapping generation with training, and tightening the weight-broadcast cadence to keep policy lag bounded.

Result: End-to-end throughput rose roughly fourfold with no extra training GPUs, because the freed capacity went to generation, the stage that actually bound the pipeline.

Lesson: In LLM-RL the bottleneck is usually sampling, not learning. The framework choice that matters is the one that lets you scale the binding stage independently, exactly the imbalance lesson of this chapter applied at frontier scale.
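Profiling the two stages separately needs very little machinery. The sketch below is one minimal way to do it, assuming the stages run back to back in one loop: StageProfiler and speedup_ceiling are hypothetical names, and the time.sleep calls are stand-ins for real generation and training work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageProfiler:
    """Accumulates wall-clock per named stage; the clock is injectable for testing."""
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.seconds[name] += self.clock() - start

    def fractions(self):
        total = sum(self.seconds.values())
        return {name: s / total for name, s in self.seconds.items()} if total else {}

def speedup_ceiling(fraction_left_unchanged):
    """Amdahl-style bound: best case if one stage became free and the rest stayed the same."""
    return 1.0 / fraction_left_unchanged

prof = StageProfiler()
for _ in range(3):
    with prof.stage("generation"):
        time.sleep(0.02)           # stand-in for rollout generation
    with prof.stage("training"):
        time.sleep(0.005)          # stand-in for the gradient step
print({name: round(frac, 2) for name, frac in prof.fractions().items()})
print(f"ceiling from faster training alone: {speedup_ceiling(0.85):.2f}x")
```

The second print is the team's mistake in one number: with generation at 85 percent of a serial pipeline, even an infinitely fast trainer yields at most about 1.18x, so the training side was never the place to add GPUs.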

5. The Whole Chapter, Running (Intermediate)

To close, we assemble the entire chapter into one minimal stack and run it: vectorized actors that roll out with a local copy of the weights and a version tag, a ring replay buffer that stores version-tagged transitions, a learner that samples minibatches and bumps the policy version, and a weight broadcast that pushes fresh parameters back to the actors. The point is not the toy task (a linear contextual bandit, so the math stays out of the way) but the instrumentation. The same code reports the three numbers this chapter is about: the sampling throughput in samples per second, the replay ratio (how many times the learner reuses the average sample), and the mean policy lag (how many learner updates old the consumed experience is). Code 20.9.2 is the structure of Figure 20.9.1 in plain Python and NumPy.

import time, collections
import numpy as np
rng = np.random.default_rng(0)

D, N_ACTIONS = 4, 2
W_ENV = rng.standard_normal((N_ACTIONS, D))               # hidden environment weights

def reward(states, actions):                              # contextual-bandit reward
    return (np.einsum("bd,bd->b", states, W_ENV[actions])
            + 0.05 * rng.standard_normal(states.shape[0]))

class Policy:                                             # linear Q(s,a) = theta[a] @ s
    def __init__(self): self.theta = np.zeros((N_ACTIONS, D)); self.version = 0
    def act(self, s, eps=0.0):
        greedy = np.argmax(s @ self.theta.T, axis=1)
        mask = rng.random(s.shape[0]) < eps
        return np.where(mask, rng.integers(0, N_ACTIONS, s.shape[0]), greedy)

class VectorActor:                                       # holds a LOCAL weight copy + version
    def __init__(self, batch): self.batch = batch; self.theta = np.zeros((N_ACTIONS, D)); self.ver = 0
    def sync(self, p): self.theta = p.theta.copy(); self.ver = p.version   # weight broadcast
    def rollout(self):
        s = rng.standard_normal((self.batch, D))
        greedy = np.argmax(s @ self.theta.T, axis=1)
        mask = rng.random(self.batch) < 0.1
        a = np.where(mask, rng.integers(0, N_ACTIONS, self.batch), greedy)
        return s, a, reward(s, a), np.full(self.batch, self.ver)          # tag with version

class Replay:                                            # version-tagged ring buffer
    def __init__(self, cap): self.buf = collections.deque(maxlen=cap)
    def add(self, s, a, r, v):
        for it in zip(s, a, r, v): self.buf.append(it)
    def sample(self, n):
        idx = rng.integers(0, len(self.buf), size=n); items = [self.buf[i] for i in idx]
        return (np.stack([i[0] for i in items]), np.array([i[1] for i in items]),
                np.array([i[2] for i in items]), np.array([i[3] for i in items]))

class Learner:                                           # gradient step, bumps version
    def __init__(self, p, lr=0.05): self.p = p; self.lr = lr
    def step(self, s, a, r):
        for act in range(N_ACTIONS):
            m = a == act
            if m.any():
                grad = 2.0 * (s[m].T @ (s[m] @ self.p.theta[act] - r[m])) / m.sum()
                self.p.theta[act] -= self.lr * grad
        self.p.version += 1

# ---- assemble the stack and run it
N_ACTORS, ACTOR_BATCH, LEARNER_BATCH, REUSE, SYNC_EVERY, ROUNDS = 4, 256, 512, 4, 2, 200
policy = Policy(); learner = Learner(policy); replay = Replay(50_000)
actors = [VectorActor(ACTOR_BATCH) for _ in range(N_ACTORS)]
for ac in actors: ac.sync(policy)

samples = updates = lag_sum = lag_n = 0
t0 = time.perf_counter()
for rnd in range(ROUNDS):
    for ac in actors:                                    # sampling stage
        s, a, r, v = ac.rollout(); replay.add(s, a, r, v); samples += len(r)
    if len(replay.buf) >= LEARNER_BATCH:                 # learning stage
        for _ in range(REUSE):
            s, a, r, v = replay.sample(LEARNER_BATCH)
            lag_sum += int((policy.version - v).sum()); lag_n += len(v)   # policy lag
            learner.step(s, a, r); updates += 1
    if rnd % SYNC_EVERY == 0:                            # weight broadcast stage
        for ac in actors: ac.sync(policy)
elapsed = time.perf_counter() - t0

ts = rng.standard_normal((10_000, D))
v_greedy = reward(ts, policy.act(ts)).mean()
v_rand = reward(ts, rng.integers(0, N_ACTIONS, 10_000)).mean()
print(f"actors                : {N_ACTORS} x batch {ACTOR_BATCH}")
print(f"samples collected     : {samples:,}")
print(f"learner updates       : {updates:,}")
print(f"wall-clock (s)        : {elapsed:.3f}")
print(f"samples/sec           : {samples / elapsed:,.0f}")
print(f"replay ratio (reuse)  : {updates * LEARNER_BATCH / samples:.2f}")
print(f"mean policy lag       : {lag_sum / max(lag_n, 1):.2f} learner updates")
print(f"value greedy vs random: {v_greedy:+.3f} vs {v_rand:+.3f}")
Code 20.9.2: The complete actor-learner stack of Figure 20.9.1 in one runnable file: vectorized actors with version-tagged rollouts, a ring replay buffer, a learner that bumps the policy version, and a periodic weight broadcast. The instrumentation reports the three balance numbers of Section 20.8 directly.
actors                : 4 x batch 256
samples collected     : 204,800
learner updates       : 800
wall-clock (s)        : 0.894
samples/sec           : 229,148
replay ratio (reuse)  : 2.00
mean policy lag       : 87.46 learner updates
value greedy vs random: +0.542 vs -0.007
Output 20.9.2: The minimal stack collected 204,800 transitions at about 229,000 samples per second, the learner reused each sample twice (replay ratio 2.0), and the consumed experience was on average 87 learner updates stale, the policy lag that the off-policy correction of Section 20.5 exists to absorb. The learned greedy policy scores $+0.542$ against a random baseline near zero ($-0.007$), so the whole pipeline works end to end.

The three balance numbers in Output 20.9.2 are exactly the quantities Section 20.8 told you to watch, and three knobs in Code 20.9.2 move them; together they are the chapter in miniature. Raising REUSE lifts the replay ratio and the sample efficiency but also raises the policy lag, because old samples are reused more before fresh weights arrive. Lowering SYNC_EVERY broadcasts weights more often and cuts the lag, at the cost of more synchronization, the sync-versus-async trade of Section 20.7. Adding actors raises samples per second until the learner cannot keep up, at which point the imbalance of Section 20.8 returns and more actors stop helping. One caveat applies to this toy: most of the measured lag comes from sample age in the replay buffer, because 50,000 slots hold about 49 rounds of data at 1,024 transitions per round, so the capacity passed to Replay moves the lag far more than SYNC_EVERY does. You can now turn every dial in this chapter and watch the three numbers move.
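The effect of each knob on policy lag can be previewed without running the learner at all, because lag depends only on version bookkeeping. The sketch below replays the loop of Code 20.9.2 with version tags alone and computes the expected lag of a uniform draw from the buffer; expected_mean_lag is a hypothetical helper, it reports an expectation rather than a sampled value, and its defaults mirror the constants in Code 20.9.2.

```python
from collections import deque

def expected_mean_lag(reuse=4, sync_every=2, capacity=50_000, rounds=200,
                      n_actors=4, actor_batch=256, learner_batch=512):
    """Expected policy lag (in learner updates) of the Code 20.9.2 loop, tags only."""
    buf = deque()                    # one version tag per stored transition
    tag_sum = 0                      # running sum of the tags currently in buf
    version = actor_version = 0
    lag_total, updates = 0.0, 0
    for rnd in range(rounds):
        for _ in range(n_actors * actor_batch):          # sampling stage
            buf.append(actor_version)
            tag_sum += actor_version
            if len(buf) > capacity:                      # ring buffer evicts the oldest
                tag_sum -= buf.popleft()
        if len(buf) >= learner_batch:                    # learning stage
            mean_tag = tag_sum / len(buf)
            for _ in range(reuse):
                lag_total += version - mean_tag          # expected lag of a uniform draw
                updates += 1
                version += 1
        if rnd % sync_every == 0:                        # weight broadcast stage
            actor_version = version
    return lag_total / max(updates, 1)

print(f"defaults        : {expected_mean_lag():.1f}")
print(f"SYNC_EVERY = 1  : {expected_mean_lag(sync_every=1):.1f}")
print(f"SYNC_EVERY = 8  : {expected_mean_lag(sync_every=8):.1f}")
print(f"REUSE = 8       : {expected_mean_lag(reuse=8):.1f}")
print(f"capacity = 5,000: {expected_mean_lag(capacity=5_000):.1f}")
```

With the defaults the expectation lands close to the measured lag in Output 20.9.2. The sweep also shows that in this toy the broadcast cadence shifts the lag by a modest amount, while REUSE and the replay capacity move it by large factors, since most of the lag is the age of the data sitting in the buffer.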

6. Chapter Summary (Beginner)

This chapter treated reinforcement learning as a distributed-systems problem and built its infrastructure from the ground up. The through-line is a single architecture refracted through eight concerns. RL is two interleaved workloads, sampling (actors generating experience) and learning (a learner improving the policy), and the actor-learner architecture of Section 20.2 separates them so each can scale on its own. Experience collection (Section 20.3) fans out across many actors, and the replay buffer (Section 20.4) decouples the rate at which experience is produced from the rate at which it is consumed, storing version-tagged transitions for reuse. Because actors run on slightly stale weights, the learner consumes off-policy data, and the off-policy correction of Section 20.5 (V-trace and its relatives) reweights that data so the lag does not bias the update. The landmark designs of Section 20.6 (Ape-X, R2D2, SEED RL) are points in this design space; the synchronous-versus-asynchronous choice of Section 20.7 trades determinism against throughput; and the scaling analysis of Section 20.8 showed that the whole system runs at the speed of its slower stage, so balancing the sampling and learning pipeline is the central engineering task. This section closed the loop by mapping every framework onto that one architecture and running the full stack to measure the three numbers, throughput, replay ratio, and policy lag, that the balance comes down to.

Thesis Thread: The Broadcast and Gather Return, One More Time

Distributed RL is, at its core, the same scale-out pattern this book has followed since Chapter 1: split a workload across machines, move the necessary information between them, recombine correctly, and keep the cost of that movement under control. The actor-learner architecture is a broadcast (fresh weights pushed to every actor) paired with a gather (experience collected back into replay), a close analogue of the gradient all-reduce in data-parallel training. RL adds one twist the earlier chapters did not have: the data is generated by the model being trained, so the broadcast and gather run in a loop, and the staleness between them, policy lag, becomes a quantity to engineer. That same actor-learner machine returns once more, distributed across competing and cooperating agents, in multi-agent reinforcement learning (Chapter 30), where the gather must now reconcile experience from many policies at once.

Key Takeaway: Chapter 20 in One Breath

Reinforcement learning at scale is two interleaved workloads, sampling and learning, wired together by an actor-learner architecture that broadcasts fresh weights out and gathers experience back. Distributed experience collection and a version-tagged replay buffer decouple the two rates; off-policy correction absorbs the policy lag that decoupling creates; the synchronous-versus-asynchronous choice trades determinism for throughput; and because the pipeline runs at the speed of its slower stage, the master skill is balancing sampling against learning. Every framework, from Stable Baselines3 to veRL, is this one architecture with a different stage scaled out, so choosing a framework is choosing where your bottleneck is allowed to live.

7. Exercises (Intermediate)

Exercise 20.9.1: Map the Framework to the Architecture (Conceptual)

For each stack, name which stage of Figure 20.9.1 it specializes in and state one resource ceiling from Chapter 1 that would make you reach for it over Stable Baselines3: (a) EnvPool, (b) Acme with Reverb, (c) SampleFactory, (d) OpenRLHF. Then explain why running an LLM policy through a classic RLlib rollout worker, ignoring the distributed-inference nature of generation, misreads fork three of the decision guide and wastes most of the compute.

Exercise 20.9.2: Trade Replay Ratio Against Policy Lag (Coding)

Take Code 20.9.2 and sweep REUSE over $\{1, 2, 4, 8, 16\}$ while holding everything else fixed. Plot or tabulate the resulting replay ratio and mean policy lag, and record whether the final greedy-versus-random value still shows learning. Then sweep SYNC_EVERY over $\{1, 2, 4, 8\}$ at fixed REUSE and show that more frequent broadcast lowers the lag. Explain, using Section 20.5, why a high replay ratio without off-policy correction would eventually bias the learner, and why this toy linear task tolerates more lag than a deep policy would.

Exercise 20.9.3: Find the Imbalance Point (Analysis)

Instrument Code 20.9.2 to time the sampling stage and the learning stage separately within each round, as Section 20.8 prescribes. Add an artificial per-rollout delay to simulate an expensive environment, then increase N_ACTORS and report at what actor count the system stops gaining samples per second because the learner has become the bottleneck. Argue from your measured stage times which framework from Table 20.9.1 you would adopt to relieve the binding stage, and why adding machines past the imbalance point buys nothing.

Project Ideas

Three larger builds that turn this chapter into a system you can measure and defend.

1. Build an actor-learner system and measure its balance. Extend Code 20.9.2 into a real multi-process system using Ray actors (or Python multiprocessing): run the actors as separate processes feeding a shared replay buffer, with the learner in its own process and an explicit weight-broadcast step. Instrument the three numbers, samples per second, replay ratio, and mean policy lag, and produce a balance plot showing throughput versus actor count for a cheap environment and an artificially expensive one. Deliverable: a short report identifying the imbalance point for each and the framework you would adopt to push past it.

2. Reproduce the off-policy correction effect. On a standard environment (CartPole or LunarLander via Gymnasium), implement an asynchronous actor-learner with a tunable policy lag, then run it twice, once with a naive uncorrected update and once with a V-trace-style importance correction from Section 20.5. Measure how the learning curve degrades with lag in each case, and quantify how much lag the correction buys back. Deliverable: a learning-curve comparison across three lag levels with and without correction.

3. Profile a real RLHF or RL-for-reasoning stack. Stand up a small TRL or OpenRLHF run on a single multi-GPU node with a modest model, and profile the split between generation (sampling) and training (learning) wall-clock. Vary the rollout batch size and the weight-broadcast cadence, and report how each moves the generation-training balance and the policy lag. Deliverable: a profile confirming (or refuting) the rollout-bound claim of the research-frontier callout for your model size, and a one-paragraph recommendation on where to add GPUs.