"They asked me to aggregate my neighbors. I asked which ones. They said all of them. I have eight thousand neighbors. We compromised on fifteen."
A Vertex With Far Too Many Friends
Neighbor sampling is the single idea that turns graph neural networks from a beautiful theory into a system you can actually run on a billion-edge graph across many machines. A graph neural network builds each node's representation by aggregating its neighbors, then its neighbors' neighbors, and the receptive field grows multiplicatively with depth. On a graph with high-degree hubs, a two-layer model can pull nearly the entire graph into the computation for a single node, which is both a compute blow-up and, once the graph is partitioned across machines, a remote-fetch storm. Neighbor sampling caps the fan-out at each layer to a fixed number, so the computation graph and the cross-machine feature traffic per mini-batch are bounded no matter how popular any node is. This section derives that bound, surveys the families of sampling that achieve it, and shows why feature fetching, not arithmetic, is the real bottleneck the distributed system has to engineer around.
In Section 13.5 we built a distributed graph neural network and met its defining hazard: the neighborhood explosion. Because each layer of a GNN aggregates a node's neighbors, an $L$-layer model aggregates the node's full $L$-hop neighborhood, and on a real graph that neighborhood can be enormous. A single celebrity account in a social graph, a single popular product in a purchase graph, or a single common token in a citation graph has a degree in the millions, and any node within two hops of it inherits that node into its receptive field. Full-neighbor aggregation therefore makes the cost of a single forward pass depend on the most connected vertex anywhere nearby, which is exactly the property a distributed system cannot tolerate, because it makes per-mini-batch cost and per-mini-batch network traffic both unbounded and wildly skewed.
The fix is disarmingly simple and is the reason GraphSAGE (Hamilton, Ying, and Leskovec, 2017) became the template for scalable GNNs: do not aggregate all neighbors, aggregate a fixed-size random sample of them. If every node, at every layer, contributes representations from at most $b$ sampled neighbors instead of all of them, the receptive field stops depending on the graph's degree distribution and starts depending only on the numbers you choose. That one change converts an unbounded, skew-sensitive workload into a bounded, predictable one, and bounded predictable workloads are the only kind a cluster can schedule, cache for, and load-balance.
1. The Receptive Field, Bounded Intermediate
To see why the bound matters, count vertices. Consider a node $v$ and an $L$-layer GNN. Under full-neighbor aggregation, layer $L$ needs $v$'s direct neighbors, layer $L-1$ needs the neighbors of those neighbors, and so on, so the computation for $v$ touches its entire $L$-hop neighborhood. If the average degree is $\bar{d}$, the receptive field is on the order of $\bar{d}^{\,L}$ in the best case, but a single hub of degree $d_{\max}$ within the neighborhood pushes the true count toward $d_{\max}$, which on web-scale graphs is millions. The cost of one node's forward pass is governed by the most popular vertex near it, not by anything about $v$ itself.
Node-wise sampling replaces every "all neighbors" with "at most $b_\ell$ neighbors at layer $\ell$." Walking outward from $v$, the seed is one vertex; it samples at most $b_1$ first-hop neighbors; each of those samples at most $b_2$ second-hop neighbors; and so on. The number of vertices in the sampled computation graph is therefore bounded by a geometric sum that does not mention the degree of any vertex,
$$\bigl|\mathcal{N}_{\text{sampled}}(v)\bigr| \;\le\; 1 + b_1 + b_1 b_2 + \cdots + \prod_{\ell=1}^{L} b_\ell \;=\; \sum_{\ell=0}^{L} \prod_{j=1}^{\ell} b_j.$$For the common two-layer case with fan-outs $(b_1, b_2) = (15, 10)$, this is $1 + 15 + 150 = 166$ vertices per seed, full stop, whether the seed is an isolated node or sits next to a hub with eight thousand edges. Multiply by a mini-batch of $B$ seeds and the whole batch touches at most $B \cdot \sum_\ell \prod_j b_j$ vertices, which is the budget the data loader and the network plan around. This is the precise sense in which sampling makes the workload schedulable: the per-batch vertex count, feature-fetch count, and arithmetic are all constants you set, not quantities the graph dictates.
Full-neighbor aggregation makes the cost of training depend on the graph's degree distribution, the one thing about a real graph you cannot change and cannot bound. Neighbor sampling severs that dependence: it makes per-seed cost a function of the fan-out vector $(b_1, \dots, b_L)$, which you choose. Every downstream system property that matters at scale, the size of a mini-batch's computation graph, the number of remote feature fetches, the memory a worker needs, the time a step takes, becomes a constant you tune rather than a hostage to the most popular vertex in the dataset.
2. Three Families of Sampling Intermediate
The bound above came from node-wise sampling, but it is only the first of three families, and they differ in how they fight a second problem that node-wise sampling has: even with a fixed fan-out, the sampled subgraphs of different seeds in a batch may have little overlap, so the total vertex count still grows with the fan-out raised to the depth. The families trade off this redundancy against bias and implementation complexity, and Table 13.6.1 lays them out.
| Family | What it samples | Representative method | Main trade-off |
|---|---|---|---|
| Node-wise | A fixed number of neighbors per node, per layer | GraphSAGE | Simple and unbiased per node, but vertex count still grows as $\prod_\ell b_\ell$ |
| Layer-wise | A fixed number of nodes for an entire layer, shared across the batch | FastGCN | Bounds each layer to a constant, but needs importance weighting to stay unbiased |
| Subgraph | An induced subgraph sampled once, reused for all its nodes | Cluster-GCN, GraphSAINT, ShaDow | Maximal feature reuse and locality, but introduces sampling bias at subgraph edges |
Node-wise sampling is GraphSAGE: each node independently draws $b_\ell$ neighbors at layer $\ell$. It is the easiest to reason about and the one our demo implements, but because each seed expands its own tree, the per-batch vertex count is the product of the fan-outs, and aggressive depth ($L \ge 3$) still inflates it. Layer-wise sampling, introduced by FastGCN (Chen, Ma, and Xiao, 2018), fixes this by sampling a single fixed-size set of nodes for an entire layer, shared by every seed in the batch, so the layer's width is a constant rather than a product; the cost is that nodes must be drawn from an importance distribution and reweighted, or the estimator becomes biased. Subgraph sampling goes furthest: Cluster-GCN (Chiang et al., 2019) partitions the graph into dense clusters (using the partitioners of Section 13.2) and trains on one or a few clusters at a time, while GraphSAINT (Zeng et al., 2020) samples an induced subgraph by random walks or node/edge sampling and runs a full GCN on it. Because a subgraph is sampled once and every node inside it reuses the same fetched features, subgraph methods have the best feature-reuse and the best locality on a partitioned graph, which is precisely why they dominate when the bottleneck is communication rather than arithmetic.
A pleasant accident of these methods is that the GNN architecture does not change at all. The same aggregation layers, the same weights, the same loss; only the question "which neighbors do I see this step?" gets a smaller answer. A model trained with fan-out $(15, 10)$ can be evaluated with full neighbors and usually does better, because more neighbors at inference time is strictly more information. You train cheap and predict rich, which is a rare direction for a free lunch to point.
3. Bias, Variance, and Convergence Advanced
Sampling buys a bounded workload, and the bill is paid in the statistics of the gradient. Each sampled neighborhood produces an estimate of the true full-neighbor aggregation, and that estimate has bias and variance. Node-wise uniform sampling is unbiased for a mean aggregator: the expected sampled mean equals the true neighborhood mean, so in expectation the model trains toward the same objective as full-neighbor training. What sampling adds is variance, because a small random subset of a large neighborhood is a noisy summary of it, and the variance compounds across layers since a layer's input is itself a sampled estimate from the layer below. Higher fan-out lowers this variance toward zero (at the limit $b_\ell \to d$ you recover full aggregation), which is the dial you turn when training is unstable.
Layer-wise and subgraph methods can introduce genuine bias, not just variance, because the nodes they keep are not a uniform sample of a node's neighbors. FastGCN counters this with importance sampling and explicit reweighting; GraphSAINT counters it with normalization coefficients estimated from the sampler so that the expected subgraph loss matches the full-graph loss. When those corrections are applied, the methods converge to the same optimum as full-graph training, just along a noisier path; when they are omitted, the model optimizes a subtly different objective. The practical consequence for a distributed run is that the fan-out vector is not only a system knob but a statistical one: too small and the gradient is too noisy to converge well, too large and you have surrendered the communication savings that justified sampling in the first place. We treat the variance-versus-communication trade-off with the formal cost models of Chapter 10, where the same tension governs gradient compression and local-update SGD.
4. Why Sampling Is a Systems Problem, Not a Math Problem Intermediate
On a single machine, sampling is a few lines of code and the only thing it saves is arithmetic. On a partitioned graph spread across machines, sampling is the entire ball game, because the sampled vertices for a seed almost never all live on the seed's own partition. Each vertex pulled into the computation graph that resides on a remote partition is a network round trip to fetch its feature vector, and feature vectors are large (hundreds to thousands of floats per node), so the bytes moved per mini-batch are dominated by feature fetching, not by the graph topology. Profiles of distributed GNN training consistently show the sampler and the feature loader, not the GPU, as the bottleneck: the accelerator sits idle waiting for features to arrive over the network.
This reframing changes the engineering targets. First, the system caches hot vertex features: a hub appears in a large fraction of mini-batches, so its features are fetched again and again, and a feature cache keyed on the hottest vertices turns most of those remote fetches into local hits. Second, the system overlaps sampling and feature fetching with GPU compute, so that while the accelerator processes mini-batch $t$, background workers are already sampling and prefetching the features for mini-batch $t+1$. This is exactly the prefetch-and-overlap pattern from the distributed data-loading pipeline of Section 8.5, applied to graph features instead of image shards; the sampler is just a structured data loader whose access pattern is dictated by the graph. Third, partitioning quality (the edge-cut and vertex-cut methods of Section 13.2) directly determines the remote-fetch rate, because a partition that keeps each node's likely-sampled neighbors local converts would-be network fetches into local reads. The sampler, the cache, the prefetcher, and the partitioner are one co-designed subsystem, and the GNN on top is almost an afterthought by comparison.
Who: A graph machine learning engineer on the recommendations team at a large e-commerce platform.
Situation: A two-layer GraphSAGE model over a user-item graph with 400 million nodes was training on 16 GPUs, but the GPUs reported under 20 percent utilization.
Problem: Each mini-batch sampled neighborhoods that spanned every partition, and the popular-item hubs were re-fetched on nearly every step, so the network and the CPU samplers could not feed the GPUs fast enough.
Dilemma: Raise the fan-out to improve accuracy and make the starvation worse, or cut the fan-out to relieve the network and risk a noisier, weaker model.
Decision: Neither; they attacked the data path instead of the fan-out, keeping $(b_1, b_2) = (15, 10)$ fixed and engineering the fetch to be cheaper.
How: They added an LRU feature cache for the top one percent hottest vertices, repartitioned the graph so each machine held a locality-preserving cluster, and moved sampling and feature prefetch into background threads overlapped with the backward pass.
Result: Remote feature fetches per step dropped by roughly three quarters, GPU utilization rose past 70 percent, and epoch time fell by more than half with identical model accuracy, because the fan-out and therefore the statistics were unchanged.
Lesson: When sampling is in place, the lever that moves wall-clock is the feature data path (cache, partition, overlap), not the model. Sampling fixes the compute; the system has to fix the communication.
5. Node-Wise Sampling From Scratch Beginner
The code below builds a 20,000-node graph with a handful of hub vertices, then computes, for a few seed nodes, the size of the two-layer computation graph under full-neighbor expansion versus node-wise sampling with fan-out $(15, 10)$. Each vertex in the computation graph that is not the seed implies one feature fetch, which on a partitioned graph is a candidate remote fetch, so the vertex count is also the communication count. The point to watch is that the sampled count stays near the theoretical bound of $1 + 15 + 15 \cdot 10 = 166$ for every seed, while the full count tracks the degree of whatever hubs are nearby.
import random
random.seed(7)
# Build a graph: sparse backbone plus a few very high-degree hubs.
N = 20_000
adj = [set() for _ in range(N)]
def add_edge(u, v):
if u != v:
adj[u].add(v); adj[v].add(u)
for _ in range(N * 3): # average degree ~ 6
add_edge(random.randrange(N), random.randrange(N))
for h in (0, 1, 2): # three hubs, ~8000 edges each
for _ in range(8_000):
add_edge(h, random.randrange(N))
deg = [len(a) for a in adj]
E = sum(deg) // 2
median_deg = sorted(deg)[N // 2]
def full_expand(seed, layers=2): # all neighbors, every hop
frontier, touched = {seed}, {seed}
for _ in range(layers):
nxt = set()
for u in frontier:
nxt |= adj[u]
touched |= nxt; frontier = nxt
return touched
def sampled_expand(seed, fanouts=(15, 10)): # at most b neighbors per node per hop
frontier, touched = {seed}, {seed}
for k in fanouts:
nxt = set()
for u in frontier:
nbrs = adj[u]
nxt |= nbrs if len(nbrs) <= k else set(random.sample(tuple(nbrs), k))
touched |= nxt; frontier = nxt
return touched
print(f"vertices : {N}")
print(f"edges : {E}")
print(f"max degree (a hub) : {max(deg)}")
print(f"median degree : {median_deg}\n")
# A hub seed, an ordinary-looking seed that is nonetheless 2 hops from a hub,
# and a truly peripheral seed. Each non-seed vertex is one (candidate remote) fetch.
for seed, label in [(0, "hub"), (1716, "high non-hub"), (12345, "ordinary")]:
full, samp = full_expand(seed), sampled_expand(seed)
print(f"seed {seed} ({label}, degree {deg[seed]}):")
print(f" full 2-hop comp graph : {len(full):6d} vertices -> {len(full)-1:6d} remote feature fetches")
print(f" sampled (15,10) : {len(samp):6d} vertices -> {len(samp)-1:6d} remote feature fetches")
print(f" shrink factor : {len(full)/len(samp):6.1f}x\n")
b1, b2 = 15, 10
print(f"theoretical sampled bound : 1 + 15 + 15*10 = {1 + b1 + b1 * b2} vertices (degree-independent)")
print(f"mini-batch of 1024 seeds : <= {1024 * (1 + b1 + b1 * b2):,} fetched vertices (bounded), vs unbounded under full expansion")
sampled_expand caps each node's contribution to k neighbors per hop; the resulting vertex count equals the number of feature fetches a partitioned system would issue for that seed.vertices : 20000
edges : 79714
max degree (a hub) : 6602
median degree : 7
seed 0 (hub, degree 6578):
full 2-hop comp graph : 18197 vertices -> 18196 remote feature fetches
sampled (15,10) : 111 vertices -> 110 remote feature fetches
shrink factor : 163.9x
seed 1716 (high non-hub, degree 18):
full 2-hop comp graph : 13961 vertices -> 13960 remote feature fetches
sampled (15,10) : 106 vertices -> 105 remote feature fetches
shrink factor : 131.7x
seed 12345 (ordinary, degree 3):
full 2-hop comp graph : 29 vertices -> 28 remote feature fetches
sampled (15,10) : 27 vertices -> 26 remote feature fetches
shrink factor : 1.1x
theoretical sampled bound : 1 + 15 + 15*10 = 166 vertices (degree-independent)
mini-batch of 1024 seeds : <= 169,984 fetched vertices (bounded), vs unbounded under full expansion
The middle case is the one that should change how you think. Node 1716 has degree 18, unremarkable, yet its full two-hop computation graph is nearly 14,000 vertices, because its ordinary neighbors are themselves connected to hubs. The explosion is not a property of the seed; it is contagious through the graph, which is why no amount of "just avoid the hubs as seeds" rescues full-neighbor training. Sampling cuts the contagion at every hop, holding all three seeds, hub and ordinary alike, to roughly a hundred fetches.
Code 13.6.1 hand-rolls the sampler to expose the bound; in production you declare the fan-out and the framework builds the sampled computation graph, fetches the features, and overlaps it with training. PyTorch Geometric's NeighborLoader and DGL's DataLoader with a NeighborSampler both express exactly the $(15, 10)$ fan-out of the demo:
# PyTorch Geometric
from torch_geometric.loader import NeighborLoader
loader = NeighborLoader(data, num_neighbors=[15, 10], # fan-out per layer
batch_size=1024, shuffle=True)
# DGL (the same sampler powers DistDGL across machines)
import dgl
sampler = dgl.dataloading.NeighborSampler([15, 10])
loader = dgl.dataloading.DataLoader(g, train_nids, sampler,
batch_size=1024, shuffle=True)
NeighborSampler to fetch remote features across partitions, the system we examine in Section 13.7 and Section 13.8.Because feature fetching is the bottleneck, the active research line is about not fetching. GPU-resident sampling and caching systems keep the graph structure and the hottest features in GPU memory so that sampling and gathering never touch the host or the network, building on the line that runs through GNS, BGL, and the GPU feature caches of DGL and the WholeGraph library in NVIDIA's cuGraph stack (2023 to 2025). A second thread reduces fetches statistically: historical-embedding methods such as GAS and its descendants cache each node's stale layer embeddings and reuse them instead of re-fetching and recomputing neighbors, trading a controlled staleness for a large drop in communication, an idea that rhymes with the bounded-staleness parameter servers of Chapter 11. A third asks whether sampling is needed at all at billion-edge scale: subgraph-sampling systems in the GraphSAINT and Cluster-GCN lineage, combined with locality-aware partitioning, are competitive with node-wise sampling while issuing far fewer cross-machine fetches, and recent work pushes full-graph training back onto large GPU clusters when the feature cache hit rate is high enough to make it affordable. The unifying message across all three is the one this section opened with: at scale, the sampler is a communication-optimization problem wearing a statistics costume.
With sampling in hand, the per-mini-batch cost is bounded and the data path is the thing to engineer. That sets up the choice that organizes the rest of distributed GNN training: do you train on small sampled mini-batches, as here, or on the full graph held collectively across machines, accepting more communication for an exact gradient? Section 13.7 puts mini-batch and full-graph distributed training side by side and shows when each wins.
A team runs a three-layer GraphSAGE model with fan-out $(b_1, b_2, b_3) = (20, 15, 10)$ on a graph whose maximum degree is two million. (a) Using the bound $\sum_{\ell=0}^{L} \prod_{j=1}^{\ell} b_j$, compute the worst-case number of vertices in one seed's sampled computation graph, and state how it compares to the two-million-degree hub. (b) For a mini-batch of 2,048 seeds, give the upper bound on feature fetches. (c) The team wants to halve the per-batch fetch budget by changing exactly one fan-out value; which one should they cut and why, and what does that cut cost statistically per Section 3?
Extend Code 13.6.1 so that, instead of counting vertices, it estimates the mean of a synthetic feature over a hub node's true neighborhood, then over many independent node-wise samples of fan-out $b$. Plot or print the sample-mean variance as a function of $b \in \{5, 10, 20, 40, 80\}$, and confirm it falls roughly as $1/b$. Then repeat with a two-layer sampled estimate and observe how variance compounds across layers. Explain, from your numbers, why a practitioner facing unstable training would raise the deeper-layer fan-out first.
Assign each of the 20,000 vertices in Code 13.6.1 to one of 8 partitions, first at random and then by a simple locality heuristic (place each vertex on the partition where most of its neighbors already live, iterating a few passes). For a batch of 256 random seeds with fan-out $(15, 10)$, count how many sampled vertices are remote (on a different partition than the seed) under each partitioning. Quantify the remote-fetch reduction the locality heuristic buys, and connect it to why Section 13.2 insists that partition quality and sampler cost are the same problem.