Part III: Distributed Machine Learning
Chapter 13: Distributed Graph Machine Learning

Why Graphs Are Hard to Distribute

"You sharded my examples and called them independent. Then a vertex on machine three asked about its neighbor on machine eight, and the whole party had to phone each other."

An Edge That Was Cut By the Partitioner
Big Picture

Every scalable method in this book so far has rested on one quiet assumption: the examples are independent, so a shard can be processed without talking to any other shard. Graphs break that assumption at the root, because a vertex's computation is defined by its neighbors, and the partitioner cannot guarantee that a vertex and its neighbors live on the same machine. The moment you cut a graph across machines, you create cross-machine edges, and every cross-machine edge is communication that some step must pay. Worse, real graphs are not uniform: a handful of super-hub vertices hold a large fraction of all edges, so any cut concentrates both load and communication on whoever owns them. This section explains exactly why the clean data-parallel decomposition fails for graphs, measures the cost of a bad cut in real numbers, and previews the four answers the rest of the chapter builds.

Through Part II and the first three chapters of Part III, distribution stayed comfortable because the data cooperated. A web crawl, a table of card transactions, a corpus of training examples: each row could be shipped to whichever machine had a free slot, processed there, and combined at the end, because no row needed to consult any other row. That independence is what made the gradient an average we could split exactly in Section 1.1, and what made the random forest of Section 12.4 embarrassingly parallel: grow each tree on a different machine, never synchronize, average the votes. Graphs revoke that permission. A social network, a citation graph, a knowledge graph, a molecule: in all of them the meaning of a vertex lives in its connections, and the connections refuse to respect any partition you draw.

This chapter is about machine learning on graphs at a scale where the graph does not fit on one machine: billion-edge social networks, web-scale link graphs, planet-scale knowledge graphs. Before we can train anything on such a graph, we have to confront why graphs are the hardest data structure in this book to distribute well. The difficulty has three faces, structural dependency, skewed degree, and irregular access, and the rest of this section takes them in turn before sketching the answers.

Machine A Machine B Machine C hub vertex orange dashed = cross-partition edge = one message per step grey = local edge, free; the hub on A reaches into B and C every iteration
Figure 13.1.1: A graph cut across three machines. Grey local edges sit entirely inside one machine and cost nothing to traverse; the orange dashed cross-partition edges span machines, and each one is a message the system must send on every iteration that propagates information along edges. The orange hub vertex on Machine A, with neighbors on both B and C, shows why high-degree vertices turn into communication hotspots no matter where the partitioner places them.

1. The Data-Parallel Assumption, and Why Graphs Revoke It Beginner

Recall the engine of data parallelism. A loss is an average over independent examples, $L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(w; x_i, y_i)$, and because the terms are independent we can scatter them across $K$ workers, let each worker sum over its own shard, and combine the partial sums into the exact full gradient. The independence is doing all the work: worker $k$ never needs a single byte from worker $j$ to compute its contribution, so the only communication is the one combine step at the end. That is why Section 12.4 could grow a random forest with zero inter-worker traffic during training, and why the whole earlier arc of the book felt almost too easy.

A graph computation has the opposite shape. The quantity we want at a vertex $v$ is a function of $v$'s neighbors, and theirs of their neighbors. Take the simplest non-trivial example, one round of neighbor aggregation, the operation at the heart of both classical graph analytics and graph neural networks:

$$h_v^{(\ell+1)} = \phi\!\left(h_v^{(\ell)},\ \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_u^{(\ell)}\big)\right),$$

where $\mathcal{N}(v)$ is the set of $v$'s neighbors, $\psi$ transforms a neighbor's state, $\bigoplus$ aggregates over neighbors, and $\phi$ updates the vertex. The term $\sum_{u \in \mathcal{N}(v)} \psi(h_u^{(\ell)})$ is a sum, exactly like the gradient sum, so you might hope it decomposes the same way. It does not, because the index set $\mathcal{N}(v)$ is not "all examples" but "the specific neighbors of $v$", and the partitioner has no way to keep all of them on $v$'s machine. If even one neighbor $u$ lives on another machine, computing $h_v^{(\ell+1)}$ requires fetching $h_u^{(\ell)}$ across the network. Every such fetch is communication, and it recurs at every layer or iteration, not once at the end.

Key Insight: Cross-Partition Edges Are the Unit of Communication

For independent examples, distribution costs one combine step per training step regardless of how you shard. For a graph, the cost is set by the partition itself: the number of edges whose endpoints land on different machines, the edge cut, is the number of messages each propagation step must send. Partitioning a graph is therefore not a load-balancing afterthought, it is the central performance decision, because a bad cut does not just unbalance the work, it manufactures communication that a good cut would have avoided. Minimizing the edge cut while keeping the machines balanced is the problem the next section is built around.

2. The Cost of a Bad Cut, Measured Intermediate

To make the edge cut concrete, the code below builds a small graph by preferential attachment (new vertices attach to already-popular vertices, the classic recipe for a power-law degree distribution), then partitions it across four machines two ways. The first cut hashes each vertex by its id, the kind of partition you get for free from a key-value store and the kind that worked perfectly for independent rows. The second is a greedy locality cut that places each vertex on the machine already holding the most of its neighbors. For each cut we count the cross-partition edges, the communication an iteration must pay, and the local edge work on the busiest machine.

import random

# Build a small power-law (preferential-attachment) graph: a few hubs, many leaves.
random.seed(7)
N = 60
edges = []
targets = [0, 1]                       # seed: the first vertices accumulate degree
for v in range(2, N):
    m = 2                              # each new vertex attaches with 2 edges
    chosen = set()
    while len(chosen) < m:
        chosen.add(random.choice(targets))   # attach preferentially to high-degree nodes
    for u in chosen:
        edges.append((v, u))
        targets.append(u)             # rich-get-richer: popular vertices reappear
        targets.append(v)

deg = {v: 0 for v in range(N)}
for a, b in edges:
    deg[a] += 1
    deg[b] += 1

def cross_edges(part):
    # An edge is cross-partition iff its endpoints live on different machines.
    return sum(1 for a, b in edges if part[a] != part[b])

P = 4   # four machines

# Cut A (bad): hash by vertex id. Hubs and their many leaves scatter across all
# machines, so almost every hub edge becomes cross-machine.
bad = {v: v % P for v in range(N)}

# Cut B (better): greedy locality. Place each vertex (heaviest first) on the
# machine already holding the most of its neighbors; a soft cap keeps balance.
adj = {v: [] for v in range(N)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)
owner, load, cap = {}, {p: 0 for p in range(P)}, N / P + 6
for v in sorted(range(N), key=lambda x: -deg[x]):
    score = {p: 0 for p in range(P)}
    for u in adj[v]:
        if u in owner:
            score[owner[u]] += 1       # reward machines already holding neighbors
    cand = max(range(P), key=lambda p: (score[p], -load[p]))
    if load[cand] >= cap:
        cand = min(load, key=load.get)
    owner[v] = cand
    load[cand] += 1
good = owner

def busiest_local(part):
    return max(sum(1 for a, b in edges if part[a] == p and part[b] == p)
               for p in range(P))

ca, cb = cross_edges(bad), cross_edges(good)
print(f"top hub degree                     : {max(deg.values())}")
print(f"cross-partition edges, hash cut    : {ca}  ({100*ca//len(edges)}% of edges)")
print(f"cross-partition edges, locality cut: {cb}  ({100*cb//len(edges)}% of edges)")
print(f"communication reduced by           : {100*(ca-cb)//ca}%")
print(f"busiest machine local edges, hash  : {busiest_local(bad)}")
print(f"busiest machine local edges, local : {busiest_local(good)}")
Code 13.1.1: Partitioning a power-law graph two ways and counting the cross-partition edges, which equal the messages one propagation step must send. The hash cut treats vertices like independent rows; the locality cut tries to keep neighbors together.
top hub degree                     : 18
cross-partition edges, hash cut    : 85  (73% of edges)
cross-partition edges, locality cut: 64  (55% of edges)
communication reduced by           : 24%
busiest machine local edges, hash  : 11
busiest machine local edges, local : 37
Output 13.1.1: The hash cut strands 73% of all edges across machines, so nearly three of every four edges become a network message every step; its busiest machine does only 11 edges of local work. The locality cut drops the cut to 55% and triples the local work concentrated on one machine. Even the good cut still pays heavily, because the degree-18 hub reaches almost everywhere no matter where it sits.

Two lessons sit in Output 13.1.1. First, the partition that was free and perfect for independent rows, hashing by id, is a disaster for a graph: it turned 73% of the edges into communication, because it scattered every hub's neighbors across all four machines. The locality-aware cut recovered roughly a quarter of that traffic with the same number of machines and the same balance budget, which is the entire reason Section 13.2 exists. Second, and more soberingly, even the good cut leaves more than half the edges crossing machines. That residue is not the partitioner's failure; it is the hub. A single vertex of degree 18 in a 60-vertex graph cannot be kept local to all its neighbors when only 15 vertices fit per machine, and that structural fact is the subject of the next axis of difficulty.

3. Power-Law Degree: A Few Vertices Hold Most of the Edges Intermediate

Real graphs are not uniform. In a social network, a typical account has a few dozen connections while a celebrity has tens of millions; in a web graph, most pages have a handful of inbound links while a few portals have billions; in a citation graph, a landmark paper accumulates orders of magnitude more references than the median. The degree distribution follows an approximate power law, $P(\deg = k) \propto k^{-\gamma}$ with $\gamma$ usually between 2 and 3, which means the mean degree is modest but the variance is enormous and a small set of super-hub vertices holds a large fraction of all edges. This is the same skew that wrecked naive joins and group-bys in Section 7.7, where a few hot keys overwhelmed a few reducers while the rest sat idle; here the hot keys are the hub vertices, and they overwhelm whichever machine owns them.

Skew attacks a graph partition from two sides at once. On the load side, whichever machine owns a hub inherits its entire neighborhood of work, so a vertex-balanced partition can be wildly edge-imbalanced: equal vertex counts, lopsided edge counts, and edges are where the computation lives. On the communication side, a hub by definition touches many vertices, and those vertices cannot all share its machine, so a hub generates cross-partition edges no matter how cleverly it is placed, exactly the residual cut we just saw survive the good partition in Output 13.1.1. The standard remedy, which Section 13.2 develops, is to stop insisting that a vertex live in one place: vertex-cut and edge-partitioning schemes split a hub across several machines, replicating the vertex and distributing its edges, so no single machine carries the whole hub.

Fun Note: The Tyranny of the Celebrity Node

Partitioning engineers have a grim affection for the celebrity node, the one vertex with so many edges that it ruins every cut. You can shuffle the other billion vertices into beautifully balanced, low-cut groups, and one pop star with eighty million followers will still light up your network monitor like a holiday. The honest fix is not a smarter cut, it is to admit the vertex is too big to live anywhere and split it across machines on purpose. Sometimes the right answer to "where does this vertex go?" is "everywhere, a little."

4. Irregular Access Defeats Locality Advanced

There is a third difficulty that bites even before the network does. The machines we distribute across are themselves built on an assumption: that memory is accessed in predictable, contiguous runs, so caches and prefetchers can stay a step ahead. A dense matrix multiply, the workhorse of Chapter 15, streams through memory in long sequential sweeps and saturates the hardware. A graph traversal does the opposite. Following $v$'s edges sends you to wherever $v$'s neighbors happen to be stored, which is data-dependent and effectively random, so each neighbor lookup is a cache miss and the prefetcher gives up. This is the locality problem of Section 2.8 in its sharpest form: the access pattern is determined by the data itself, so no static layout makes it sequential.

This matters for distribution because locality is the lever that turns a partition from a liability into a benefit. The whole point of placing neighbors on the same machine is that an in-memory neighbor lookup is thousands of times cheaper than a network round trip; cross-partition edges hurt precisely because they convert a (already slow) random memory access into a (far slower) random network access. So the irregular-access problem and the edge-cut problem are the same problem at two scales: within a machine, irregular access wastes the cache; across machines, it wastes the network. A graph system that does not respect locality pays the random-access tax twice, and at billion-edge scale that tax decides whether a computation finishes in minutes or days.

5. The Chapter's Four Answers Beginner

Having seen why graphs resist clean decomposition, we can name the four moves that the rest of the chapter makes to tame them, each a direct response to a difficulty above. Table 13.1.1 lays them out so the chapter's structure reads as a sequence of answers rather than a list of topics.

Table 13.1.1: The four difficulties of distributing graphs and the section of this chapter that answers each.
DifficultyThe answerWhere
Cutting a graph creates cross-machine edgesPartition to minimize the edge cut (edge-cut and vertex-cut schemes)Section 13.2
Neighbor dependency is awkward to programThe vertex-centric model: "think like a vertex", with the system handling messagingSection 13.3
Whole-graph algorithms must run distributedDistributed analytics: PageRank, connected components, shortest paths at scaleSection 13.4
Learning on graphs needs the same machineryDistributed graph neural network training, with neighbor samplingSections 13.5 onward

The thread tying these together is the one this book has followed since the start: a computation that is trivial on one machine becomes an exercise in moving the right bytes to the right place when the data outgrows a single box. For graphs the bytes are neighbor states, the right place is set by the partition, and the cost is the edge cut. The vertex-centric model of Section 13.3 is the graph world's answer to the same question data parallelism answered for independent examples, namely how to express the computation so the system can distribute it for you, and the parameter-server ideas of Chapter 11 reappear when the per-vertex states grow into embedding tables too large for one machine.

Thesis Thread: The Same Question, the Hardest Data

The spine of this book is a single question asked of ever harder data: how do we split this work across machines, move the necessary information between them, and recombine it correctly while keeping movement cheap? For independent examples the answer was almost free, one combine step at the end (Section 1.1). Graphs are where that question gets its sharpest form, because the data itself dictates the communication pattern through its edges, and you cannot shuffle the dependency away. Every technique in this chapter, from edge-cut partitioning to neighbor sampling, is an attempt to make the unavoidable communication of a dependency-laden dataset as small and as regular as the structure allows.

Library Shortcut: A Partitioner You Do Not Write by Hand

The greedy locality cut in Code 13.1.1 is a teaching toy; production systems delegate partitioning to a dedicated routine. PyTorch Geometric ships a one-call cluster-based partitioner that builds locality-aware shards (using METIS underneath) and hands back the per-partition subgraphs ready for distributed training:

# pip install torch-geometric
from torch_geometric.loader import ClusterData

# data: a PyG graph with data.edge_index of shape [2, num_edges]
clusters = ClusterData(data, num_parts=4)   # METIS-based low-edge-cut partition
# clusters[k] is the induced subgraph for machine k, neighbors kept together
print(clusters.num_parts, "balanced, low-cut partitions ready for distribution")
Code 13.1.2: The whole heuristic of Code 13.1.1 collapses to one ClusterData call. The library handles the balance constraint, the edge-cut objective via METIS, and the bookkeeping of which vertex lives where, the machinery Section 13.2 opens up by hand.
Practical Example: The Recommendation Graph That Hashed Itself Into a Corner

Who: A platform engineer on the recommendations team of a large social network.

Situation: A two-billion-edge user-item interaction graph was sharded across 32 machines by hashing the user id, the same scheme that served the team's key-value store flawlessly.

Problem: A nightly graph-propagation job that scored items for users was taking nine hours and pinning the network at full utilization, while the machine CPUs sat mostly idle waiting on remote neighbor fetches.

Dilemma: Buy more machines, which a hash cut would only make worse by scattering neighbors further, or invest in a locality-aware repartition, which meant a one-time expensive shuffle and a new pipeline to maintain.

Decision: They repartitioned, because the profiler showed the job was network-bound, not compute-bound, and adding machines to a network-bound job buys nothing, exactly the trap Code 13.1.1 illustrates in miniature.

How: They replaced the hash with a METIS-style edge-cut partition and split the few thousand celebrity vertices with a vertex-cut so no single machine owned a whole hub.

Result: Cross-machine edge traffic fell by roughly 70%, the nightly job dropped from nine hours to under two, and the network stopped being the bottleneck, on the same 32 machines.

Lesson: A partition that is perfect for independent rows can be the single worst decision for a graph. For dependency-laden data, the cut is the performance, and it deserves to be measured before any machines are added.

6. Frontier: Graphs at a Billion Edges and Beyond Advanced

The difficulties of this section do not vanish at scale, they intensify, and they define the current research frontier. The practical demonstration above used 60 vertices; the systems people actually build operate at six to ten orders of magnitude more, where every percentage point of edge cut is measured in terabytes of nightly network traffic.

Research Frontier: Billion-Edge GNNs and Graph Foundation Models (2024 to 2026)

Two threads dominate recent work, and both are responses to the difficulties above. The first is scaling graph neural network training to billion- and trillion-edge graphs: systems in the lineage of DistDGL and the partition-and-sample line keep neighbor fetches local with locality-aware partitioning plus quantized or cached neighbor states, and 2024 to 2025 work on disk-based and out-of-core GNN training (such as the Ginex and MariusGNN families) pushes graphs that dwarf cluster memory through carefully staged neighbor sampling. The second is graph foundation models: the 2024 wave of work argues for pretraining a single transferable model across many graphs, raising a sharp version of this section's problem, because a model meant to serve every graph must cope with every graph's skew and irregularity at once. A parallel line on scalable graph transformers attacks the quadratic all-pairs attention cost that makes naive global attention impossible on a large graph, using sparse or sampled attention so the access pattern stays as local as the partition allows. The unifying message of the frontier matches this section's thesis: the bottleneck is rarely arithmetic, it is moving neighbor information across a partition, and the winning methods are the ones that move the least.

We now have the problem stated precisely, the data-parallel assumption breaks because a vertex depends on its neighbors, power-law degree concentrates load and communication on hubs, and irregular access defeats locality at both the cache and the network. We have measured a bad cut against a good one in real numbers, and we have the map of the chapter's four answers. The first and most consequential answer is partitioning, the decision that, as Output 13.1.1 showed, sets the communication cost of everything that follows. That is where the next section begins, in Section 13.2.

Exercise 13.1.1: Why the Forest Was Free and the Graph Is Not Conceptual

The random forest of Section 12.4 trained with zero inter-worker communication, while the neighbor aggregation of this section forces communication on every step. Both compute a sum over a set. Explain precisely what property of the index set differs between the two sums and why that property, not the sum itself, is what makes one embarrassingly parallel and the other communication-bound. Then state what would have to be true of a graph for neighbor aggregation to become embarrassingly parallel, and why real graphs almost never satisfy it.

Exercise 13.1.2: Cut, Balance, and the Hub Coding

Extend Code 13.1.1 to sweep the number of machines $P$ over $\{2, 4, 8, 16\}$ and, for each $P$, report the cross-partition edge fraction and the edge-count imbalance (busiest machine's local edges divided by the average) for both the hash cut and the locality cut. Then add a third partitioner that splits the single highest-degree vertex across all machines (a one-vertex vertex-cut) by assigning each of its edges to the machine owning the other endpoint, and measure how much the residual cut drops. Explain the trend you observe as $P$ grows, and why the vertex-cut helps the hub specifically.

Exercise 13.1.3: Is It Worth Distributing at All? Analysis

A graph has $E = 2 \times 10^9$ edges. A locality cut places a fraction $f$ of edges across machines, and each cross-partition edge costs one 64-byte neighbor-state message per propagation step. The job runs 30 propagation steps nightly on a cluster whose aggregate cross-machine bandwidth is 200 gigabytes per second. Estimate the nightly communication time as a function of $f$, and find the largest $f$ for which the communication still fits in a two-hour window. Using Output 13.1.1 as a rough guide to achievable $f$, argue whether a hash cut ($f \approx 0.73$) or a locality cut ($f \approx 0.55$) would meet the window, and connect your answer to the communication-cost reasoning of Chapter 10.