Section 33.1: Anatomy of an AI Cluster

"Every paper in this library assumes I exist and never once describes me. I am the racks, the cables, and the one slow link nobody benchmarked. The algorithms are elegant; I am where they actually run."
A Cluster Fabric, Tired of Being Abstracted Away

Big Picture

Every distribution axis in this book, data, training, models, inference, coordination, and intelligence, ultimately executes on one physical thing: a cluster of networked machines with accelerators, a tiered interconnect, a storage hierarchy, and a control plane that decides what runs where. Until now we have reasoned about workers, shards, and collectives as abstractions. This chapter grounds those abstractions in hardware. An AI cluster is not a uniform pool of interchangeable compute; it is a deliberately heterogeneous, hierarchical machine in which bandwidth, latency, and failure probability all change as you move from inside a chip to across a rack to across the data center. This first section lays out what such a cluster is made of and why its shape, especially the shape of its network, determines which distributed algorithms run well on it and which stall.

The earlier parts of this book treated the cluster as a given. Data parallelism in Chapter 15 assumed $K$ workers that could all-reduce a gradient; sharded training in Chapter 16 assumed devices fast enough to exchange shards every step; LLM serving in Chapter 24 assumed a fleet of accelerators behind a router. Those assumptions are not free. Each one is a claim about a physical machine: how many accelerators it has, how they are wired together, how fast each link moves bytes, and how a scheduler placed the job onto them. This section opens the chapter by making the physical substrate explicit, because the performance models, the collective costs, and the placement decisions of the chapters that follow are all properties of the cluster's anatomy, not of the algorithm in isolation.

Figure 33.1.1: The hierarchy of an AI cluster. A control plane (scheduler, placement, health, quotas) draws jobs from a queue and binds them to nodes that share a storage tier. Nodes sit in racks; inside a node, several accelerators are wired by a fast intra-node mesh (solid links) and reach the rest of the cluster through a host CPU and a network card onto the slower inter-node fabric (dashed links). Bandwidth falls and latency rises at every step outward, which is the single fact that governs how distributed AI workloads should be placed.

1. The Cluster as the Substrate Beneath Every Axis Beginner

An AI cluster is a collection of compute nodes joined by a network and managed as one pool. A node is a single server: it has one or more host CPUs, a large pool of system memory, local disks, network cards, and, in the machines that matter for this book, a set of accelerators (typically GPUs, sometimes TPUs or other domain-specific chips). The accelerators do the dense linear algebra that training and inference are made of; the host CPU feeds them data, runs the framework's Python, drives the network stack, and handles everything that is not a matrix multiply. A cluster gathers many such nodes behind a control plane so that a user submits a job and the system, not the user, decides which physical accelerators run it.

The reason to insist on this picture now, in Part VII rather than Part I, is that the algorithms have accumulated specific demands the hardware must satisfy. Synchronous data-parallel training needs its workers close on the network, because they all-reduce a gradient every step and the slowest link sets the pace, as Chapter 4 established for collectives. Sharded and pipeline parallelism in Chapter 16 push even harder, exchanging shards or activations on a per-layer cadence that only the fastest intra-node links can hide. Inference fleets in Chapter 24 ask the opposite: many independent replicas that rarely talk to each other, where spreading across the cluster improves availability rather than hurting it. One cluster serves all of these, and the scheduler's job is to place each workload where the cluster's anatomy suits it.

Thesis Thread: The Substrate That Scale-Out Runs On

This book's thesis is that AI at scale is the engineering of intelligence distributed across many machines. Every axis we have studied, distributing data, training, the model, inference, and coordination, is a way of cutting work into pieces that must then be mapped onto real silicon and real cable. The cluster is where that mapping happens. A distribution strategy that looks optimal on paper can be defeated by a placement that puts two chatty workers on opposite ends of the fabric, and a mediocre strategy can be rescued by a topology-aware scheduler. Holding the physical substrate in view is what turns the abstract six axes of Section 1.1 into systems that actually scale.

2. Inside a Node: Accelerators, Hosts, and the First Bandwidth Cliff Beginner

The smallest unit worth distributing across is a single accelerator, and the first thing to understand about a node is that its accelerators are not equidistant from each other. Within one server, accelerators are wired by a dedicated high-bandwidth mesh, NVLink and NVSwitch on current GPU systems, that moves hundreds of gigabytes per second between any two chips in the box. This is the fast tier in Figure 33.1.1, drawn with solid links. It is fast enough that eight accelerators in one node can behave almost like one large accelerator for the purposes of a collective, which is exactly why tensor parallelism and the most communication-intensive sharding are kept inside a node whenever possible.

The host CPU and its system memory sit beside the accelerators and connect to them over PCIe or a comparable host link, a slower path than the inter-accelerator mesh. The host stages training data from storage, decodes and augments it, and pushes batches to the accelerators; if it cannot keep up, the accelerators starve and the expensive silicon idles. The network card, often a high-speed host channel adapter (HCA), is the node's door to the rest of the cluster. Crossing that door is the first real bandwidth cliff: leaving the node means leaving the fast mesh for the inter-node fabric, which moves perhaps a tenth of the bytes per second. The whole art of placement, developed through this chapter, is built on respecting that cliff.

Key Insight: The Cluster Is a Bandwidth Hierarchy, Not a Flat Pool

Treat "the cluster" as uniform compute and you will design algorithms that the hardware punishes. Bandwidth is highest inside a single chip, very high between accelerators on the intra-node mesh, much lower across the rack, and lowest across the data center; latency and failure probability rise as you move outward through the same tiers. A correct mental model of a cluster is a tree of bandwidth tiers, and a good distribution strategy assigns the chattiest communication to the fattest tier. Every scheduling and parallelism decision in this chapter is a consequence of that single structural fact.
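The tree-of-tiers model can be made concrete with a small sketch. The `Accel` coordinates, the tier names, and the per-tier bandwidth figures are illustrative assumptions rather than measurements. The sketch only shows that the tier two distinct accelerators share is set by the outermost coordinate at which their locations differ.

```python
from dataclasses import dataclass

# Illustrative per-tier bandwidths in Gb/s (assumed values, not measurements).
TIER_GBPS = {
    "intra-node mesh": 450.0,
    "intra-rack fabric": 50.0,
    "cross-rack fabric": 25.0,
}

@dataclass(frozen=True)
class Accel:
    rack: int
    node: int   # node index within the rack
    gpu: int    # accelerator index within the node

def shared_tier(a: Accel, b: Accel) -> str:
    """Return the narrowest tier that traffic between two distinct accelerators crosses."""
    if a.rack != b.rack:
        return "cross-rack fabric"
    if a.node != b.node:
        return "intra-rack fabric"
    return "intra-node mesh"

def pair_bandwidth_gbps(a: Accel, b: Accel) -> float:
    """Look up the assumed bandwidth of the tier the pair shares."""
    return TIER_GBPS[shared_tier(a, b)]

origin = Accel(rack=0, node=0, gpu=0)
for other in (Accel(0, 0, 7), Accel(0, 3, 0), Accel(2, 1, 4)):
    tier = shared_tier(origin, other)
    print(f"{origin} -> {other}: {tier}, {pair_bandwidth_gbps(origin, other):.0f} Gb/s")
```

A placement policy built on this model would try to keep the pairs that exchange the most bytes on the tier with the largest entry in `TIER_GBPS`.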

3. The Interconnect: NVLink, InfiniBand, and RoCE Intermediate

The network is the part of an AI cluster that most distinguishes it from a generic compute cloud, and it comes in tiers. The intra-node tier, NVLink-class, connects accelerators within a server and is the fastest. The inter-node tier carries traffic between servers, and here two technologies dominate. InfiniBand is a purpose-built low-latency fabric with hardware support for remote direct memory access (RDMA), letting one node write into another node's memory without involving either host CPU in the data path; it is the default for high-end training clusters precisely because collective operations stress latency and CPU overhead. RoCE (RDMA over Converged Ethernet) brings the same RDMA semantics to Ethernet hardware, trading some latency predictability for the operational familiarity and cost profile of Ethernet, and it has become common in large inference and mixed-use clusters.

What unites these technologies is that they exist to make the collective operations of Chapter 4 cheap. An all-reduce that synchronizes a gradient, a reduce-scatter that distributes optimizer state in sharded training, an all-to-all that routes tokens to experts: each is a flood of traffic between accelerators, and the fabric's bandwidth and latency set how long that flood takes. A useful single number for comparing fabrics is the bisection bandwidth: cut the cluster into two equal halves at the narrowest such cut and sum the bandwidth of all links crossing it. For $N$ nodes each contributing one uplink of bandwidth $b$ in an idealized non-blocking fabric, the bisection bandwidth is

$$B_{\text{bisect}} \;=\; \frac{N}{2}\, b,$$

which bounds the throughput of any communication pattern that must move data between the two halves, such as a large all-to-all. A fabric with high per-link bandwidth but a thin bisection (an oversubscribed tree) will serve isolated traffic well and choke on global collectives, which is why training clusters pay for non-blocking or lightly oversubscribed fabrics while inference clusters often do not need to.
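The effect of oversubscription on that bound can be sketched in a few lines. The function names, the 128-node cluster, the 100 Gb/s uplinks, and the 100 MB per-pair payload are all hypothetical. The model further assumes full-duplex links, a bisection figure counted per direction, and a uniform all-to-all in which every node sends the same payload to every other node.

```python
def bisection_gbps(nodes: int, uplink_gbps: float, oversubscription: float = 1.0) -> float:
    """B = (N/2) * b, divided by the spine oversubscription ratio (1.0 = non-blocking)."""
    return (nodes / 2) * uplink_gbps / oversubscription

def all_to_all_lower_bound_s(nodes: int, pair_bytes: float, bisect_gbps: float) -> float:
    """Lower bound on all-to-all time: one half's traffic to the other half must cross the cut."""
    half = nodes // 2
    cross_bits = half * half * pair_bytes * 8   # bits flowing one way across the cut
    return cross_bits / (bisect_gbps * 1e9)

nodes, uplink_gbps, pair_bytes = 128, 100.0, 100e6   # hypothetical cluster and payload
for ratio in (1.0, 2.0):
    b = bisection_gbps(nodes, uplink_gbps, ratio)
    t = all_to_all_lower_bound_s(nodes, pair_bytes, b)
    print(f"oversubscription {ratio:.0f}:1 -> bisection {b/1e3:.1f} Tb/s, all-to-all >= {t:.3f} s")
```

Under these assumptions the bound scales linearly with the oversubscription ratio, so a global collective pays the full ratio. Traffic that stays within one half of the fabric is unaffected.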

Numeric Example: Where the Time Goes in a 512-Accelerator All-Reduce

The code below models a cluster of 64 nodes with 8 accelerators each (512 accelerators total) and a two-tier interconnect: a fast NVLink-class intra-node mesh and a slower InfiniBand-class inter-node fabric. It estimates the time to all-reduce a 7-billion-parameter gradient in half precision, first within one node, then across all 64 nodes, using the standard ring all-reduce volume of $2\frac{n-1}{n}M$ bytes per participant for a message of size $M$. It also reports the idealized bisection bandwidth from the formula above.

nodes, gpus_per_node = 64, 8
intra_gbps, inter_gbps = 450.0, 50.0          # illustrative per-link bandwidth in Gb/s, NVLink vs IB class
P, bytes_per_param = 7_000_000_000, 2         # 7B params, fp16 gradient
msg_bytes = P * bytes_per_param

def ring_allreduce_seconds(n, link_gbps):
    link_Bps = link_gbps * 1e9 / 8.0
    return 2.0 * (n - 1) / n * msg_bytes / link_Bps   # ring volume per participant

intra = ring_allreduce_seconds(gpus_per_node, intra_gbps)   # 8 GPUs, one node
inter = ring_allreduce_seconds(nodes, inter_gbps)           # 64 nodes, fabric
bisection_gbps = (nodes / 2) * inter_gbps                   # B = (N/2) * b

print(f"total accelerators           : {nodes * gpus_per_node}")
print(f"gradient payload (fp16)       : {msg_bytes/1e9:.1f} GB")
print(f"intra-node all-reduce (8 GPU) : {intra*1e3:7.1f} ms")
print(f"inter-node all-reduce (64 nd) : {inter*1e3:7.1f} ms")
print(f"fabric is slower by factor     : {inter/intra:7.1f}x")
print(f"idealized bisection bandwidth : {bisection_gbps/1e3:.2f} Tb/s")

Code 33.1.1: A first-order bandwidth model of a two-tier cluster. The ring all-reduce moves nearly the same volume per participant at either scale (the factor $2\frac{n-1}{n}$ is 1.75 for 8 participants and about 1.97 for 64), so the difference between the intra-node and inter-node estimates is dominated by which link tier carries the bytes.

total accelerators           : 512
gradient payload (fp16)       : 14.0 GB
intra-node all-reduce (8 GPU) :   435.6 ms
inter-node all-reduce (64 nd) :  4410.0 ms
fabric is slower by factor     :    10.1x
idealized bisection bandwidth : 1.60 Tb/s

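Output 33.1.1: The inter-node all-reduce is about ten times slower than the intra-node one. A factor of nine comes from the nine-to-one bandwidth ratio of the two tiers (450 versus 50 Gb/s); the rest comes from the ring factor $2\frac{n-1}{n}$, which is larger for 64 participants than for 8. This is the cliff of Section 2 made quantitative: every step that crosses the fabric pays it.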

Output 33.1.1 makes the chapter's central tension concrete. The same collective costs roughly $4.4$ seconds across the fabric versus $0.44$ seconds within a node, an order of magnitude, which is why real training systems do hierarchical all-reduce (reduce within each node on the fast mesh, then across nodes on the fabric, then broadcast back down) and why a scheduler that scatters a tightly coupled job across distant nodes can quietly halve its throughput.
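The three phases of a hierarchical all-reduce can be checked for correctness on toy data. The sketch below simulates them in a single process with plain Python lists, so it demonstrates only the data flow and models neither communication nor timing. The function names and the 2-node, 2-accelerator layout are hypothetical.

```python
def hierarchical_allreduce(grads):
    """grads[node][gpu] is a list of floats; returns the same shape, all-reduced.

    Phase 1: sum within each node (would run on the fast intra-node tier).
    Phase 2: sum the per-node partials across nodes (would cross the fabric).
    Phase 3: broadcast the global sum back to every accelerator.
    """
    # Phase 1: one partial sum per node.
    node_partials = [[sum(vals) for vals in zip(*node)] for node in grads]
    # Phase 2: combine the per-node partials into the global sum.
    global_sum = [sum(vals) for vals in zip(*node_partials)]
    # Phase 3: every accelerator receives its own copy of the result.
    return [[list(global_sum) for _ in node] for node in grads]

def flat_allreduce(grads):
    """Reference result: sum over every accelerator in one step."""
    flat = [g for node in grads for g in node]
    total = [sum(vals) for vals in zip(*flat)]
    return [[list(total) for _ in node] for node in grads]

# 2 nodes x 2 accelerators, each holding a 3-element gradient (toy sizes).
grads = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
result = hierarchical_allreduce(grads)
assert result == flat_allreduce(grads)
print(result[0][0])   # [22.0, 26.0, 30.0]
```

Only the per-node partials take part in phase 2, so one participant per node crosses the fabric, not one per accelerator.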

4. Storage, the Control Plane, and the Scheduler Intermediate

Compute and network are only half of a cluster. A training job reads a dataset that does not fit on any one node, so the cluster provides a storage tier: a distributed or parallel file system, or an object store fronted by a cache, that every node can mount and read in parallel. This is the same storage and data-loading machinery built in Chapter 8, now seen as a physical component of the cluster rather than a data-pipeline abstraction. Its job is to keep the host CPUs fed so the accelerators never starve; a fast fabric and idle GPUs waiting on a slow file system is one of the most common and most expensive cluster pathologies.
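The starvation condition reduces to a comparison of rates, which a back-of-the-envelope sketch makes explicit. Every number below is hypothetical, and the model assumes input supply is the only limit on accelerator activity.

```python
def accelerator_duty_cycle(demand_samples_s: float, supply_samples_s: float) -> float:
    """Fraction of time accelerators can stay busy if input supply is the only limit."""
    return min(1.0, supply_samples_s / demand_samples_s)

# Hypothetical node: 8 accelerators, each consuming 2,500 samples per second.
demand = 8 * 2500.0
# Hypothetical supply limits in samples per second.
storage_supply = 30_000.0    # what the storage tier can deliver to this node
host_supply = 16_000.0       # what the host CPUs can decode and augment
supply = min(storage_supply, host_supply)   # the slower stage sets the pace

duty = accelerator_duty_cycle(demand, supply)
print(f"demand {demand:.0f}/s, supply {supply:.0f}/s, duty cycle {duty:.0%}")
```

In this made-up case the host preprocessing stage, not the storage tier, caps the accelerators at 80 percent, which shows why both stages need to be measured before blaming either.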

Above the hardware sits the control plane: the set of services that know the whole cluster's state and decide what runs where. Its centerpiece for our purposes is the scheduler, which takes jobs from a queue and binds them to specific accelerators, subject to availability, quotas, and placement constraints. For AI workloads the scheduler must do something a generic batch scheduler need not: it must be topology-aware and gang-aware. Topology-aware means it knows the bandwidth tree of Figure 33.1.1 and prefers to pack the accelerators of one job close together. Gang-aware means it understands that a synchronous training job's $K$ workers are useless unless all $K$ start together, because a half-placed all-reduce job simply blocks. Gang scheduling, the subject of the next section, exists for exactly this reason. It is the placement layer's defense against stragglers, continuing an arc that began when stragglers were introduced as a concept in Chapter 1.
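The two properties can be illustrated with a toy placement routine. This is a minimal sketch under stated assumptions, not how any production scheduler works: the `place_gang` function, the rack and node names, and the best-fit heuristic are all hypothetical, and the routine places whole nodes only.

```python
from typing import Dict, List, Optional

def place_gang(free_gpus: Dict[str, Dict[str, int]], gpus_per_node: int,
               nodes_needed: int) -> Optional[List[str]]:
    """All-or-nothing placement of a job that needs `nodes_needed` whole nodes.

    free_gpus maps rack -> node -> free accelerator count. Returns the chosen
    node names, or None if the whole gang cannot be placed right now.
    """
    # Fully free nodes available in each rack.
    candidates = {
        rack: [n for n, free in nodes.items() if free >= gpus_per_node]
        for rack, nodes in free_gpus.items()
    }
    # Topology-aware step: prefer the tightest single rack that fits the gang.
    fitting = [r for r, ns in candidates.items() if len(ns) >= nodes_needed]
    if fitting:
        best = min(fitting, key=lambda r: len(candidates[r]))
        return sorted(candidates[best])[:nodes_needed]
    # Fallback: span racks, fullest rack first, to touch as few racks as possible.
    pool = [n for r in sorted(candidates, key=lambda r: -len(candidates[r]))
            for n in sorted(candidates[r])]
    # Gang-aware step: return nothing instead of a partial placement.
    return pool[:nodes_needed] if len(pool) >= nodes_needed else None

free = {
    "rack-a": {"a0": 8, "a1": 8, "a2": 4},
    "rack-b": {"b0": 8, "b1": 8, "b2": 8, "b3": 8},
}
print(place_gang(free, gpus_per_node=8, nodes_needed=2))   # fits inside rack-a
print(place_gang(free, gpus_per_node=8, nodes_needed=5))   # must span both racks
print(place_gang(free, gpus_per_node=8, nodes_needed=7))   # cannot be placed: None
```

Returning `None` for the seven-node request leaves the job in the queue. Starting six workers that would block on their first collective would waste the hardware they hold.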

Practical Example: The Job That Ran at Half Speed for a Week

Who: A platform engineer running a shared 256-GPU training cluster for several research teams.

Situation: A flagship language-model run was hitting only 55 percent of the throughput it reached during an earlier benchmark on the same hardware.

Problem: Nothing in the training code had changed, the GPUs were fully utilized in the profiler, and the data loader was not the bottleneck.

Dilemma: Chase a subtle software regression in the framework, or suspect the part of the system the team did not write, the placement.

Decision: They dumped the scheduler's actual assignment and found the job's 32 nodes spread across four racks behind an oversubscribed spine, instead of the contiguous block the benchmark had used.

How: They added a topology-aware gang constraint requiring the job's nodes to share a fabric island, then resubmitted.

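Result: Throughput returned to the benchmark figure. The inter-node all-reduce had been crossing the thin bisection described in Section 3. The cause is the same one behind the gap in Output 33.1.1, traffic forced onto a thinner tier, although here the end-to-end penalty was roughly $1.8\times$, not the $10\times$ of a full fall from the intra-node mesh to the fabric.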

Lesson: On a shared cluster, placement is part of your performance. A topology-blind scheduler will give you working but slow assignments, and the slowdown is invisible in any single-node profiler.
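The check the engineer performed by hand is easy to automate. The sketch below assumes the scheduler's assignment can be dumped as a mapping from node name to rack name. That format, the `audit_placement` helper, and the even four-way spread are all hypothetical.

```python
from collections import Counter

def audit_placement(assignment, max_racks=1):
    """Summarize how one job's nodes are spread; assignment maps node -> rack."""
    per_rack = Counter(assignment.values())
    return {
        "racks_used": len(per_rack),
        "nodes_per_rack": dict(per_rack),
        "ok": len(per_rack) <= max_racks,
    }

# Hypothetical dump: 32 nodes spread evenly over four racks.
assignment = {f"node{i:02d}": f"rack{i % 4}" for i in range(32)}
report = audit_placement(assignment)
print(report)
if not report["ok"]:
    print("warning: job spans", report["racks_used"], "racks; check spine oversubscription")
```

Run at job start, a check like this surfaces a scattered placement immediately, before days of slow training accumulate.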

Library Shortcut: Ray and Kubernetes Express the Cluster as a Resource Graph

You do not hand-place accelerators in production. Cluster managers model the machine as a graph of typed resources and let you request a shape; the scheduler does the binding. Ray, for example, exposes accelerators as resources you annotate on tasks and actors, and a few lines request a gang of GPUs that the placement layer keeps together:

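import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")                       # join the running cluster

# Request 8 bundles of 1 GPU + 8 CPUs, packed onto as few nodes as possible.
pg = placement_group([{"GPU": 1, "CPU": 8}] * 8, strategy="PACK")
ray.get(pg.ready())                            # blocks until the whole gang is placed

@ray.remote(num_gpus=1)
def shard_worker(rank): ...                     # one worker per reserved GPU

# Bind each worker to the reserved gang; otherwise tasks are scheduled outside it.
gang = PlacementGroupSchedulingStrategy(placement_group=pg)
futures = [shard_worker.options(scheduling_strategy=gang).remote(r) for r in range(8)]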

Code 33.1.2: A gang of eight GPUs reserved as one placement group with a packing strategy. The topology reasoning and all-or-nothing reservation that Section 4 describes in prose collapse into placement_group(..., strategy="PACK"), with Ray (or, in the Kubernetes world, a gang-scheduling plugin such as Volcano or Kueue) handling the binding, the quotas, and the atomic placement internally. PACK is a best-effort preference that may spill bundles onto additional nodes; Ray's STRICT_PACK strategy instead requires all bundles to fit on a single node.

5. Why the Anatomy Decides the Algorithm Advanced

The components assembled so far, accelerators on a fast intra-node mesh, nodes on a tiered inter-node fabric, a shared storage tier, and a topology-aware control plane, are not neutral plumbing. They decide which distributed AI algorithm is even viable. Synchronous training wants its workers inside one fabric island so the all-reduce stays cheap, and it wants gang scheduling so all workers start together; this is why the most aggressive parallelism, tensor and sequence parallelism, is confined to within a node and the looser data parallelism is allowed to span the fabric. Sharded training in Chapter 16 chooses its mesh dimensions to match the bandwidth tree, mapping the heaviest collective onto the fastest tier. Inference serving in Chapter 24 inverts the priorities: replicas barely communicate, so the scheduler should spread them across racks and power domains for availability, and the storage tier matters more than the fabric.
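One way to see how a mesh is matched to the bandwidth tree is to look at how global ranks are grouped. The sketch below assumes ranks are numbered node by node (ranks 0 to 7 on node 0, and so on) and that the tensor-parallel dimension varies fastest, a common convention but not the only one. The function names are hypothetical.

```python
def tp_group(rank: int, tp_size: int) -> list:
    """Ranks that share a tensor-parallel group with `rank` (consecutive ranks)."""
    start = (rank // tp_size) * tp_size
    return list(range(start, start + tp_size))

def dp_group(rank: int, world_size: int, tp_size: int) -> list:
    """Ranks that hold the same model shard as `rank`, one per TP group."""
    return list(range(rank % tp_size, world_size, tp_size))

def node_of(rank: int, gpus_per_node: int) -> int:
    """Node index under the assumed node-by-node rank numbering."""
    return rank // gpus_per_node

gpus_per_node, nodes, tp_size = 8, 4, 8
world_size = gpus_per_node * nodes
# A TP group stays inside one node only if tp_size divides gpus_per_node.
assert gpus_per_node % tp_size == 0

rank = 10
tp = tp_group(rank, tp_size)
dp = dp_group(rank, world_size, tp_size)
print("TP group:", tp, "on nodes", sorted({node_of(r, gpus_per_node) for r in tp}))
print("DP group:", dp, "on nodes", sorted({node_of(r, gpus_per_node) for r in dp}))
```

With this layout the per-layer tensor-parallel collectives for rank 10 stay on node 1, while its once-per-step data-parallel all-reduce spans one accelerator on each of the four nodes.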

The same anatomy governs failure and elasticity. A thousand-accelerator job is, at any moment, a system in which some component is probably degraded, the reliability problem raised in Section 1.1 and answered for training in Chapter 18. The cluster's responses (draining a sick node, migrating or preempting a job, restarting a gang from a checkpoint) all happen in the control plane, and all depend on the scheduler understanding the job's structure. With the anatomy in hand, the rest of this chapter can take up its real subject: how the scheduler places, packs, partitions, and preempts AI workloads on this substrate, beginning with gang scheduling in the next section.

Research Frontier: Topology-Aware and Reconfigurable Fabrics (2024 to 2026)

Because the fabric sets the ceiling on every collective, recent work attacks the cluster's network directly rather than only its software. Optically reconfigurable interconnects, in the lineage of Google's TPU optical circuit switches and a wave of 2024 to 2025 academic designs, let the cluster physically rewire its topology to match a job's communication pattern, turning bisection bandwidth from a fixed property into a scheduled resource. On the software side, rail-optimized and topology-aware collective libraries (NCCL's topology detection, and research schedulers that co-design placement with the collective algorithm) aim to keep the heaviest traffic on the fattest links automatically. A parallel thread studies fault-aware scheduling for the very large training runs of Chapter 19, where component failure is frequent enough that the scheduler must plan around it rather than merely react. The common theme is that the cluster's anatomy, long treated as fixed, is becoming something the system actively shapes.

Fun Note

Ask a cluster operator for the bottleneck and they will rarely name a chip. The answer is almost always a cable: a flapping transceiver, a link negotiated down to a fraction of its rated speed, or one node quietly stuck on the slow tier while its seven siblings wait. The accelerators get the headlines and the budget; the humble link decides whether any of it goes fast.

Exercise 33.1.1: Map the Workload to the Tier Conceptual

For each workload, state which tier of the bandwidth hierarchy (intra-chip, intra-node mesh, inter-node fabric) its heaviest communication should be placed on, and what the scheduler should optimize for: (a) tensor-parallel attention layers that exchange activations every forward and backward pass; (b) data-parallel replicas that all-reduce a gradient once per step; (c) a fleet of independent inference replicas behind a router. Explain, using the bandwidth-hierarchy insight of Section 2, why placing each workload one tier "too far out" would hurt it, and why for case (c) spreading out is actually the goal.

Exercise 33.1.2: Hierarchical All-Reduce Coding

Extend Code 33.1.1 with a third estimate for a hierarchical all-reduce: first reduce within each node over the fast intra-node tier, then all-reduce the per-node partial results across the 64 nodes over the fabric, then broadcast back down. Model each phase with the same ring formula on the appropriate tier and message size, and sum the phases. Compare the total to a flat 512-way ring over the inter-node fabric, which Code 33.1.1 does not compute but which you can estimate with ring_allreduce_seconds(512, inter_gbps). By roughly what factor does respecting the hierarchy beat the flat scheme, and which phase dominates the hierarchical cost?

Exercise 33.1.3: Bisection and Oversubscription Analysis

A 64-node cluster has per-node uplink bandwidth $b = 50$ Gb/s. Using $B_{\text{bisect}} = \tfrac{N}{2}b$, compute the idealized non-blocking bisection bandwidth, then recompute it for a fabric oversubscribed $4{:}1$ at the spine (the spine provides only one quarter of the uplink bandwidth). Estimate how the time of a global all-to-all (which must cross the bisection) changes between the two fabrics. Argue from your numbers why a training cluster pays for a non-blocking fabric while an inference cluster of mostly independent replicas often does not, connecting your answer to the per-node serving economics of Chapter 22.