"The learner kept sending me the same weights as everyone else, then acted surprised when the experience I sent back was nothing like my neighbor's. That, it turns out, is the entire point."
An Actor Rolling Out a Slightly Stale Policy
Not every collective is symmetric. The two operations in this section are deliberately lopsided: broadcast sends one rank's data, unchanged, to every other rank, and gather pulls each rank's distinct data into one rank. All-reduce, the star of the previous sections, treats every worker identically because the question it answers ("what is the average gradient?") is symmetric. But a great deal of distributed AI is not symmetric at all. One machine often holds the authoritative model and must push it out to many workers; those workers each produce something different (gradients, rollouts, partial sums) that one machine must pull back in. Broadcast and gather are the primitives for exactly that one-to-many and many-to-one movement, and they are the communication skeleton of two large families of AI systems: parameter servers and distributed reinforcement learning. This section introduces them through those two uses and shows precisely where the symmetry of all-reduce ends and the asymmetry of broadcast and gather begins.
The collectives in Section 4.3 through Section 4.6 share a quiet assumption: every participating rank plays the same role. In all-reduce every worker contributes a gradient and every worker receives the same averaged result. In all-gather every worker contributes a shard and every worker ends up with the full set. That symmetry is natural for data-parallel SGD, where the workers are interchangeable replicas of one model. Many AI systems break that symmetry on purpose. They designate one rank as special, a root, that either owns the data everyone needs or needs the data everyone owns. The two primitives that serve a root are broadcast (the root sends, everyone receives) and gather (everyone sends, the root receives), together with gather's mirror image, scatter (the root sends a different piece to each worker). This section is about that asymmetric family and the AI systems built on it.
1. Broadcast: One Model, Sent to Everyone Beginner
A broadcast takes a buffer that lives on one rank, the root, and delivers a byte-identical copy to every other rank in the group. Formally, if rank $r$ holds vector $v$ and is the root, then after $\text{broadcast}(v, \text{root}=r)$ every rank $j$ holds the same $v$. There is no arithmetic; nothing is combined or reduced. The only thing that happens is replication of one rank's data across the whole group. That sounds almost too simple to deserve a name, yet it is one of the most frequently issued collectives in real training systems, because distributed training is full of moments where one machine knows something authoritative that all the others must adopt exactly.
The first such moment happens before a single gradient is computed. When a data-parallel job starts, every worker must begin from identical model weights, otherwise the exact-gradient identity of Section 1.1 falls apart, because the workers would be computing gradients of different models. The standard recipe is to initialize the model on rank 0 and broadcast its parameters to all other ranks, guaranteeing a common starting point. The same pattern recurs whenever a new model version must be published to a fleet: one rank loads the new checkpoint and broadcasts it so that, from the next step onward, every worker runs the same model.
Broadcast and all-reduce are easy to confuse because both end with every rank holding the same buffer. The difference is where that buffer comes from. After a broadcast, every rank holds one chosen rank's data, copied verbatim; the other ranks' prior contents are discarded. After an all-reduce, every rank holds a combination of all ranks' data (typically the sum or average). Broadcast answers "everyone adopt this single source of truth"; all-reduce answers "everyone agree on the aggregate of what we all had". Picking the wrong one silently corrupts training: a broadcast where you needed a reduction throws away every worker's contribution but one.
2. Gather and Scatter: Collecting and Distributing Distinct Data Beginner
Gather is the inverse of broadcast in direction but not in content. In a gather, every rank holds a distinct buffer and the root ends up holding the concatenation of all of them, indexed by source rank. If rank $j$ holds $v_j$, then after $\text{gather}(\text{root}=r)$ the root holds the ordered collection $(v_0, v_1, \dots, v_{K-1})$ while the non-root ranks keep only their own piece. Crucially the pieces are not combined: the root receives every worker's data individually, which is the whole point when the root needs to inspect, store, or process each contribution separately rather than only their sum.
Scatter is gather run backward. The root holds $K$ distinct pieces and sends piece $j$ to rank $j$, so that each worker receives a different slice of the root's buffer. Where broadcast sends the same data to everyone, scatter sends different data to each. The canonical use is handing out work: a coordinator that holds a batch of tasks, or a parameter server that holds a partitioned model, scatters the relevant slice to each worker so that no worker receives more than it needs. Table 4.7.1 lays the four root-centric collectives side by side so the symmetry, or rather the asymmetry, is explicit.
| Collective | Direction | Data per rank | Combines values? | Typical AI use |
|---|---|---|---|---|
| Broadcast | root → all | same | no | publish initial or updated weights |
| Scatter | root → all | distinct | no | hand out data shards or task slices |
| Gather | all → root | distinct | no | collect gradients or experience batches |
| All-reduce | all ↔ all | same result | yes (sum/mean) | average gradients in data-parallel SGD |
Reading Table 4.7.1 top to bottom is the cleanest way to see the design choice these primitives encode. If you only ever need the aggregate of what the workers computed, all-reduce is both sufficient and efficient, because it never materializes the individual contributions anywhere. The moment you need the individual contributions preserved, because the root must store each gradient, replay each trajectory, or route each piece somewhere specific, you reach for gather instead, and you accept that the root becomes a traffic and memory bottleneck that all-reduce was carefully designed to avoid.
3. First AI Use: Parameter-Server Push and Pull Intermediate
The parameter-server architecture, which Chapter 11 develops in full, is the oldest large-scale distributed-training design and it is built almost entirely from broadcast-like and gather-like movement. One set of machines, the servers, holds the authoritative model parameters. Another set, the workers, holds shards of the data. Each training step is two asymmetric movements. First a pull: every worker fetches the current parameters from the servers, which is broadcast-shaped, one authoritative copy delivered to many workers. Then a push: every worker sends its locally computed gradient back to the servers, which is gather-shaped, many distinct gradients collected at the root, where the server applies them to update the parameters.
This is a genuinely different communication pattern from the all-reduce of data-parallel SGD, and the difference is instructive. In all-reduce there is no privileged rank; the averaged gradient emerges symmetrically and every worker updates its own copy of the model. In a parameter server there is a privileged rank, the server, that owns the model and is the sole writer. That asymmetry buys flexibility, the server can apply updates asynchronously, hold a model far larger than any one worker by sharding it across many servers, and let workers join and leave, which is why parameter servers dominate the enormous sparse embedding tables of recommendation systems. It also concentrates traffic and state on the servers, the bottleneck that ring all-reduce in Section 4.4 was invented to escape for dense gradients. The choice between the two is a recurring theme, traced explicitly in the parameter-server-versus-all-reduce arc that runs from here into Chapter 15.
Broadcast-out and gather-in are not a one-chapter curiosity; they are a structural pattern that returns scaled up throughout the book. Here they are pull and push against a parameter server. In Chapter 11 they become sharded across many servers so the model can exceed any single machine. In sharded data-parallel training (Section 4.5 and Chapter 16) the same one-to-many and many-to-one shapes reappear as all-gather and reduce-scatter, which are broadcast and gather generalized so that no single rank is the bottleneck. When you meet a new distributed-training method later, ask whether its weight movement is broadcast-shaped and its update movement gather-shaped; the answer is yes more often than not.
4. Second AI Use: Broadcasting Policies, Gathering Experience Intermediate
Distributed reinforcement learning makes the asymmetry even sharper, and it is the cleanest setting in which to see why all-reduce simply cannot do the job. The dominant architecture, the actor-learner pattern that Chapter 20 builds into a full system, separates two roles. A small number of learners hold the policy and improve it by gradient descent. A large number of actors, often hundreds, each run a copy of the policy against their own environment instance and produce experience: sequences of states, actions, and rewards. The two roles are coupled by exactly the primitives of this section. The learner broadcasts the latest policy weights to all actors, identical bytes to every actor, and then gathers the experience each actor collected, distinct batches from every actor.
The reason this cannot be an all-reduce is that the learner does not want the average of the actors' experience; it wants every actor's experience kept separate, because each trajectory is a distinct training example that the learner will replay, weight by importance, and learn from individually. Averaging rollouts would destroy the very signal reinforcement learning depends on. So the experience movement is irreducibly a gather, not a reduction. The demo below simulates one round of this loop with $K$ actors, makes the broadcast and gather concrete, and reports the distinct experience the learner collects.
import numpy as np
rng = np.random.default_rng(7)
K, d, steps_per_actor = 6, 4, 1000 # actors, policy dim, rollout length
# Learner holds the current policy weights (one vector).
policy = rng.standard_normal(d).round(3)
# --- BROADCAST: every actor receives a byte-identical copy of `policy`. ---
actor_copies = [policy.copy() for _ in range(K)]
broadcast_ok = all(np.array_equal(c, policy) for c in actor_copies)
print(f"broadcast: 1 learner -> {K} actors, weight dim = {d}")
print(f" every actor holds identical weights : {broadcast_ok}")
print(f" bytes sent by learner : {K} x {policy.nbytes} = {K*policy.nbytes} B")
def rollout(weights, env_seed):
# Each actor acts with its policy copy in its OWN environment instance,
# so the experience it produces is distinct from every other actor's.
r = np.random.default_rng(env_seed)
obs = r.standard_normal((steps_per_actor, d))
reward = obs @ weights + 0.3 * r.standard_normal(steps_per_actor)
return {"return": float(reward.sum()), "n": steps_per_actor}
experience = [rollout(actor_copies[k], env_seed=100 + k) for k in range(K)]
# --- GATHER: the learner collects the K distinct experience batches. ---
print(f"gather: {K} actors -> 1 learner (distinct experience each)")
for k, e in enumerate(experience):
print(f" actor {k}: return = {e['return']:+9.2f} over {e['n']} steps")
total_steps = sum(e["n"] for e in experience)
mean_return = sum(e["return"] for e in experience) / K
print(f" gathered batches : {len(experience)}")
print(f" total experience steps : {total_steps}")
print(f" mean episodic return across actors : {mean_return:+.2f}")
broadcast: 1 learner -> 6 actors, weight dim = 4
every actor holds identical weights : True
bytes sent by learner : 6 x 32 = 192 B
gather: 6 actors -> 1 learner (distinct experience each)
actor 0: return = +6.70 over 1000 steps
actor 1: return = +70.80 over 1000 steps
actor 2: return = +51.67 over 1000 steps
actor 3: return = -24.22 over 1000 steps
actor 4: return = +31.58 over 1000 steps
actor 5: return = +0.17 over 1000 steps
gathered batches : 6
total experience steps : 6000
mean episodic return across actors : +22.78
An all-reduce is a tidy operation; it hands everyone the average and forgets the rest. The gather in Code 4.7.1 is the opposite temperament: the learner refuses to let go of a single trajectory, insisting on all six batches even though it could compute the mean return in one line. In reinforcement learning that hoarding is correct behavior. The variance between actor 3's $-24.22$ and actor 1's $+70.80$ is not noise to be averaged out; it is the difference between a state worth avoiding and a state worth seeking, and the learner needs both, intact, to improve the policy.
5. Cost, Bottlenecks, and When Symmetry Wins Advanced
The asymmetry of broadcast and gather has a price, and the cost models of Chapter 3 name it precisely. A naive broadcast in which the root sends $K$ separate copies costs the root $K$ message transmissions; under the $\alpha + \beta n$ model, sending an $n$-byte buffer to $K$ ranks one at a time costs the root roughly $K(\alpha + \beta n)$, growing linearly in the number of workers. A naive gather is worse in a subtle way: the root must receive $K$ distinct buffers, so its inbound bandwidth, not the workers', sets the floor, and that floor is $K \beta n$ no matter how the messages are scheduled, because every byte every worker produced must physically arrive at the one root. This is the concentration problem: the root is a single point that all traffic squeezes through.
Good implementations soften the broadcast cost dramatically. A tree broadcast forwards the buffer through $\log_2 K$ levels, so the data reaches all $K$ ranks in $O(\log K)$ steps rather than $O(K)$, and a pipelined or ring broadcast can overlap transmission so that the bandwidth term, not the latency term, dominates. NCCL, which Section 4.8 covers, ships exactly these optimized broadcasts. Gather is harder to accelerate because its fundamental limit is the root's inbound bandwidth, which no algorithm can route around; the only real escape is to stop using a single root. That escape is precisely what all-gather and reduce-scatter (Section 4.5) provide, spreading the role of the root across every rank so the per-rank traffic stays flat as $K$ grows. The practical rule that falls out of these costs is the subject of the example below.
Who: A research engineer scaling a distributed reinforcement-learning agent for a robotics simulator.
Situation: A single learner broadcast policy weights to a growing pool of actors and gathered every actor's full trajectory each iteration.
Problem: Past about 200 actors, throughput stopped rising; profiling showed the learner's network card saturated on inbound experience while its GPU sat idle.
Dilemma: Add more actors to collect experience faster, which only worsened the gather bottleneck, or cap the actor pool, leaving compute on the table.
Decision: They kept the broadcast (cheap, tree-based, scaled fine) but restructured the gather: actors first compressed and pre-aggregated trajectories locally, and a small tier of intermediate aggregators gathered from subsets of actors before forwarding to the learner.
How: The hierarchy turned one $K$-way gather into a two-level tree of smaller gathers, so the learner's inbound traffic grew with the number of aggregators, not the number of actors.
Result: The actor pool scaled past 1,000 with the learner's link no longer saturated, and end-to-end sample throughput rose roughly fourfold at the same learner cost.
Lesson: Broadcast scales gracefully with good algorithms; a single-root gather does not, because its floor is the root's inbound bandwidth. When gather becomes the wall, break the single root into a hierarchy, the same instinct that motivates the rootless collectives of Section 4.5.
The weight-broadcast and experience-gather pattern has come under fresh pressure from the scale of recent reinforcement-learning-from-human-feedback and reasoning-model pipelines, where the policy being broadcast is now a full large language model of tens or hundreds of billions of parameters. Systems such as OpenRLHF and the post-training stack around verl and NeMo-Aligner (2024 to 2025) treat the periodic broadcast of fresh policy weights from the training engine to the generation actors as a first-class cost, and they overlap it with rollout generation or stream it shard by shard so that actors are never idle waiting for a full broadcast. On the gather side, the same systems lean on hierarchical and asynchronous experience collection so a single learner is not throttled by inbound rollouts, echoing the IMPALA and SEED RL lineage but at language-model scale. A parallel thread on geo-distributed and decentralized training (the DiLoCo-style line discussed in Section 1.1) revisits broadcast and gather over slow wide-area links, where their latency terms, not their bandwidth terms, dominate. The throughline is that as models grow, the once-cheap broadcast of weights becomes a scheduling problem in its own right, and Chapter 20 returns to these systems with the machinery to measure them.
6. The Library View Beginner
You almost never implement broadcast, gather, or scatter by hand. Every collective-communication library exposes them directly, with the root chosen by an argument, and handles the tree or pipeline algorithm, the buffer management, and the network transport internally. The shortcut below shows the three primitives of this section in torch.distributed, which dispatches to NCCL on GPUs and Gloo on CPUs (both covered in Section 4.8).
The hand-rolled replication, collection, and slicing of this section collapse to one call each. The library picks an $O(\log K)$ broadcast tree, manages the receive buffers the root needs for gather, and routes scatter's per-rank slices, work that would be dozens of lines of careful socket code:
# Run with: torchrun --nproc_per_node=4 thisfile.py
import torch, torch.distributed as dist
dist.init_process_group("nccl") # join the group of K workers
rank, world = dist.get_rank(), dist.get_world_size()
# BROADCAST: rank 0 owns the weights; all ranks adopt its copy in place.
weights = torch.empty(1024, device="cuda")
if rank == 0:
weights = load_initial_weights() # only the root has real values
dist.broadcast(weights, src=0) # now every rank holds rank 0's weights
# GATHER: every rank sends its own experience; only the root keeps them all.
local_exp = collect_local_experience() # this rank's distinct batch, a tensor
bucket = [torch.empty_like(local_exp) for _ in range(world)] if rank == 0 else None
dist.gather(local_exp, gather_list=bucket, dst=0) # root's `bucket` = all K batches
# SCATTER: the root hands a different task slice to each worker.
slices = list(task_slices) if rank == 0 else None # K distinct tensors on the root
my_slice = torch.empty(256, device="cuda")
dist.scatter(my_slice, scatter_list=slices, src=0) # each rank receives its own slice
src and dst arguments naming the root, the feature that makes these operations asymmetric; all-reduce in Section 4.3 has no such argument because it has no privileged rank.With the asymmetric collectives in hand, the chapter has now named every primitive the rest of the book relies on: the symmetric all-reduce, all-gather, reduce-scatter, and all-to-all of the preceding sections, and the root-centric broadcast, gather, and scatter of this one. What remains is to see how a production library actually implements them across real hardware, choosing algorithms and topologies on your behalf. That is the subject of Section 4.8, which opens the lid on NCCL, MPI, and Gloo.
For each task, name the single collective from Table 4.7.1 that fits best and say why the other three are wrong: (a) at job start, every data-parallel worker must begin from the same randomly initialized weights held on rank 0; (b) a parameter server must apply the sum of all workers' gradients to its parameters, and no worker needs the individual gradients afterward; (c) a learner must store every actor's full trajectory for later replay; (d) a coordinator holds a list of 64 distinct hyperparameter configurations and must give one to each of 64 workers. For case (b), explain why all-reduce, not gather, is the better choice even though both could work.
Extend Code 4.7.1 so each actor returns not a summary but a full trajectory array of shape (steps_per_actor, d), and have the learner concatenate all $K$ of them (a true gather). Time the concatenation and the total bytes the learner receives as you sweep $K$ from 2 to 64. Plot received bytes against $K$ and confirm the linear $K\beta n$ growth predicted in Section 5. Then implement a two-level hierarchical gather (group actors into $\sqrt{K}$ buckets, gather within each bucket, then gather the bucket results) and report how the learner's inbound byte count changes.
A learner must send a policy of $P = 5 \times 10^8$ parameters (4 bytes each) to $K = 256$ actors every iteration. Using the $\alpha + \beta n$ model with $\alpha = 5\,\mu\text{s}$ and $\beta = 0.1\,\text{ns/byte}$ (10 GB/s), estimate the time of a naive linear broadcast (root sends $K$ copies) and of a tree broadcast ($\log_2 K$ rounds). Compare both to the per-iteration gather of experience if each actor returns a $10^6$-byte batch. From these three numbers, argue which movement, weight broadcast or experience gather, is the scaling bottleneck for this RL system, and how your conclusion shifts if the policy grows to $5 \times 10^{10}$ parameters as in the frontier systems of Section 5. We make this style of estimate rigorous in Chapter 3.