Chapter 33: Cluster Infrastructure and Scheduling

"I asked the cluster for sixty-four GPUs that could talk to each other quickly. It gave me sixty-four GPUs, eventually, on the understanding that two of them were in another building and one was already someone else's."
A Training Job Waiting in the Queue

Big Picture

Every distributed algorithm in this book eventually runs on a physical cluster, and a cluster is two things welded together: a pile of hardware and a control plane that decides which work runs on which piece of it. The first six parts asked what to distribute and how: the gradient all-reduce, the sharded optimizer state, the replicated inference server. This chapter asks the question those parts deferred: where does that work actually land, and who decides? A scheduler is the answer. It takes a queue of jobs and a fabric of machines and produces a placement, and for AI workloads that placement is not a detail. A training job whose workers are scattered across the wrong racks pays a communication tax that no amount of algorithmic cleverness recovers; an inference fleet that packs poorly burns money on idle accelerators. This chapter teaches you to read a cluster, name its scheduler, and reason about placement as a first-class part of system design.

Chapter Overview

Until now this book has lived one level above the machine. A data-parallel step in Chapter 15 assumed the workers existed, were connected, and stayed alive long enough to finish an epoch. The serving fleet of Chapter 24 assumed the replicas had somewhere to run. Part VII descends to the substrate those assumptions stand on. A cluster is hardware (compute nodes, accelerators, a network fabric, storage) plus a control plane that turns a stream of submitted work into a sequence of placements on that hardware. The control plane is the scheduler, and scheduling is the central act of this chapter: given who wants to run and what is available, decide who runs where, for how long, and at what priority. For ordinary batch jobs that is a packing problem. For AI workloads it is something sharper, because AI imposes demands that generic schedulers were never designed to meet.

Those demands organize the chapter. Section 33.1 dissects an AI cluster into its parts so the rest of the chapter has a shared vocabulary, and Section 33.2 looks closely at the compute itself: the CPUs, GPUs, TPUs, and accelerator instance types whose memory and interconnect set the ceilings every later section schedules around. Section 33.3 introduces containers and Kubernetes as the packaging and orchestration layer that makes a heterogeneous cluster addressable, and Section 33.4 surveys the batch schedulers (Slurm, Kubernetes batch, Volcano) that queue and place long-running jobs. The middle of the chapter confronts the demand that makes AI scheduling distinctive: a training job is not one process but a tightly coupled gang whose members must all run at once or not at all. Section 33.5 develops gang scheduling and collective-aware placement, the discipline of starting every worker together and putting them where the all-reduce of Chapter 4 stays cheap.
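
The all-or-nothing rule is easy to state and easy to get subtly wrong, so a small sketch may help. The Python below is an illustration under simplifying assumptions, not the algorithm used by Slurm, Volcano, or any real scheduler; the helper name try_place_gang and the node names and GPU counts are hypothetical. It admits a job only if every worker fits, and leaves the cluster untouched otherwise.

```python
from typing import Dict, Optional


def try_place_gang(free_gpus: Dict[str, int], workers: int,
                   gpus_per_worker: int) -> Optional[Dict[str, int]]:
    """Place all `workers` at once or none of them (hypothetical helper).

    free_gpus maps node name -> free GPU count. Returns node -> number of
    workers placed there, or None if the whole gang does not fit.
    """
    placement: Dict[str, int] = {}
    remaining = workers
    # Greedy: visit the nodes with the most free GPUs first, so the gang
    # lands on as few nodes as possible (a crude stand-in for locality).
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        fit = min(remaining, free // gpus_per_worker)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            break
    if remaining > 0:
        return None  # partial placement is rejected; nothing is reserved
    # Commit only after the whole gang is known to fit.
    for node, count in placement.items():
        free_gpus[node] -= count * gpus_per_worker
    return placement


cluster = {"node-a": 8, "node-b": 8, "node-c": 4}
print(try_place_gang(cluster, workers=4, gpus_per_worker=4))  # gang fits
print(try_place_gang(cluster, workers=4, gpus_per_worker=4))  # gang waits
```

The second request finds one free slot on node-c and still returns None: a generic scheduler would have started that one worker and let it idle while holding its GPUs, which is exactly the failure gang scheduling exists to prevent.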

The second half turns from getting accelerators to using them well. Section 33.6 covers multi-tenant GPU sharing through MIG, MPS, and time-slicing, the techniques that let several smaller jobs share one large accelerator without trampling each other, which matters because a whole GPU is wasteful for a job that needs a fraction of it. Section 33.7 treats Ray clusters and the distributed object store as a Python-native alternative to the Kubernetes and Slurm worlds, the framework that carries much of the modern training and serving stack. Section 33.8 takes on cost directly: spot and preemptible instances are far cheaper than on-demand capacity, and the price of that discount is preemption, which turns the checkpoint-interval mathematics of Chapter 18 into a scheduling concern. Section 33.9 closes the chapter by stepping up to managed platforms (Databricks, SageMaker, Vertex AI) and weighing the standing trade between running your own cluster and renting one that hides the scheduler entirely.
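
To see why preemption makes the checkpoint interval a scheduling concern, it helps to put numbers through the classic first-order rule from Young (1974), which sets the interval to the square root of twice the checkpoint cost times the mean time between failures. The sketch below assumes that rule; the function name young_interval and all of the numbers are illustrative, and Chapter 18 may use different notation or the higher-order refinement.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum: sqrt(2 * C * M), in seconds."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Illustrative numbers only: a 60 s checkpoint, with interruptions arriving
# on average every 24 h (on-demand-like) versus every 2 h (spot-like).
for label, mtbf_h in [("reliable", 24.0), ("spot", 2.0)]:
    tau = young_interval(60.0, mtbf_h * 3600.0)
    print(f"{label}: checkpoint about every {tau / 60.0:.1f} min")
```

Under these assumed numbers the interval shrinks from roughly 54 minutes to roughly 15: the capacity type the scheduler hands out changes how often the job must stop to save its state.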

A word on why this matters even if you never administer a cluster yourself. The placement decisions in this chapter set the constants in every performance model the book has built. The communication cost of Chapter 3 depends on whether the scheduler put your workers one hop apart or three; the failure rate that drives Chapter 18 depends on whether you bought reliable on-demand capacity or cheap preemptible capacity; the throughput of the serving fleet depends on how tightly the scheduler packs replicas onto accelerators. Reading a system's scheduler is reading the boundary conditions of its physics. This chapter gives you the language to do that.
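
One way to make "placement sets the constants" tangible is the standard bandwidth term of a ring all-reduce, in which each worker moves about 2(N-1)/N times the message size over its slowest link. The sketch below uses only that term; it is not the cost model of Chapter 3, it ignores latency, and the gradient size and link bandwidths are assumed values chosen for illustration.

```python
def ring_allreduce_seconds(num_workers: int, message_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: 2 * (N - 1) / N * S / B.

    Latency terms are ignored; B is the slowest link in the ring.
    """
    if num_workers < 2:
        return 0.0
    return (2.0 * (num_workers - 1) / num_workers
            * message_bytes / link_bytes_per_s)


GRADIENT_BYTES = 10e9  # hypothetical 10 GB of gradients per step
placements = {
    "same rack (assumed 50 GB/s)": 50e9,
    "across racks (assumed 12.5 GB/s)": 12.5e9,
}
for name, bandwidth in placements.items():
    t = ring_allreduce_seconds(64, GRADIENT_BYTES, bandwidth)
    print(f"{name}: {t:.2f} s per all-reduce")
```

The algorithm is identical in both rows; only the bandwidth the placement exposes differs, and the step time scales inversely with it.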

Prerequisites

This chapter sits near the end of the book and draws the algorithmic threads down onto hardware, so it assumes the parts that defined those threads. From Chapter 4 it assumes the collective communication primitives (all-reduce, all-gather, all-to-all) and the idea that their cost depends on network topology, which is exactly what collective-aware placement in Section 33.5 optimizes. From Chapter 15 and Chapter 16 it assumes data, model, and sharded parallelism, the gang of tightly coupled workers that gang scheduling exists to serve. From Chapter 18 it assumes elastic and fault-tolerant training, the checkpointing and recovery machinery that spot scheduling in Section 33.8 leans on. From Chapters 22 through 24 it assumes the per-node inference economics and fleet sizing that make GPU packing and sharing a cost question rather than an abstract one. Readers comfortable with those four threads can read this chapter as the place where they finally touch the metal.

Learning Objectives

Decompose an AI cluster into hardware (compute nodes, accelerators, network fabric, storage) and a control plane, and explain what a scheduler does with the two.
Distinguish CPUs, GPUs, and TPUs by memory and interconnect, and read an accelerator instance type for the ceilings it imposes on a workload.
Explain how containers and Kubernetes make a heterogeneous cluster addressable, and place a workload on a batch scheduler (Slurm, Kubernetes batch, or Volcano).
Justify gang scheduling for tightly coupled training and describe collective-aware placement that keeps the all-reduce of Chapter 4 cheap.
Choose among MIG, MPS, and time-slicing to share one accelerator across several jobs, and reason about the isolation each provides.
Model the cost and risk of spot and preemptible scheduling using the checkpoint-interval mathematics of Chapter 18, and weigh self-run clusters against managed platforms.

The One Idea to Carry Out of This Chapter

If you keep one thing from this chapter, keep this: scheduling is where a distributed algorithm meets a finite, contended, failure-prone pile of hardware, and the placement it produces sets the constants in every performance model you have built. A training gang wants all its workers at once and close together; an inference fleet wants tight packing and cheap fractional accelerators; a cost-conscious job wants spot capacity and a checkpoint cadence that survives preemption. These pulls conflict, and the scheduler is where they are reconciled. Read forward, the chapter is a tour of the schedulers (Slurm, Volcano, Kubernetes, Ray) and the sharing and cost mechanisms (MIG, MPS, spot) that resolve those pulls in different ways. Read as a question, it is a single checklist you apply to any cluster: what is the hardware, what is the control plane, and what does its placement cost the algorithm running on top? The roadmap below walks the nine sections that build that checklist.

Chapter Roadmap

33.1 Anatomy of an AI Cluster. The parts beneath every workload: compute nodes, accelerators, the network fabric, storage, and the control plane that turns submitted work into placements, with the shared vocabulary the rest of the chapter reuses.
33.2 Compute: CPUs, GPUs, TPUs, and Accelerator Instances. The compute itself, read for the ceilings it imposes: device memory, interconnect bandwidth, and the accelerator instance types whose shape every later section schedules around.
33.3 Containers and Kubernetes for AI. The packaging and orchestration layer that makes a heterogeneous cluster addressable: containers for reproducible environments and Kubernetes for declaring, placing, and reconciling AI workloads.
33.4 Batch Schedulers: Slurm, Kubernetes Batch, and Volcano. The queue-and-place engines for long-running jobs, from the HPC heritage of Slurm to the Kubernetes-native batch and Volcano schedulers that bring AI-aware policies to the cloud-native world.
33.5 Gang Scheduling and Collective-Aware Placement. The demand that makes AI scheduling distinctive: a training job is one gang that must start together or not at all, placed where the collective communication of Chapter 4 stays cheap.
33.6 Multi-Tenant GPU Sharing: MIG, MPS, and Time-Slicing. Three ways to fit several smaller jobs onto one large accelerator without trampling each other, and the isolation, predictability, and utilization each technique trades.
33.7 Ray Clusters and the Object Store. A Python-native cluster framework with a distributed object store at its core, carrying much of the modern training, tuning, and serving stack as an alternative to Kubernetes and Slurm.
33.8 Spot and Preemptible Scheduling for Cost Optimization. The cheapest capacity in the cloud comes with the right to take it back; the discount is real, and paying for it correctly turns the checkpoint-interval mathematics of Chapter 18 into a scheduling decision.
33.9 Managed Platforms: Databricks, SageMaker, and Vertex AI. The chapter's closing trade: rent a cluster that hides the scheduler entirely, or run your own and keep the control. What each managed platform automates, and what it costs to give up.

Read the nine sections in order and you will have a working map of the substrate the whole book runs on: Sections 33.1 and 33.2 name the parts, Sections 33.3 through 33.7 name the orchestration, scheduling, and sharing layers that place work on them, and Sections 33.8 and 33.9 name the cost and convenience trades that decide which substrate you buy. The thread to watch is the one that runs back to Chapter 4: the topology that made collectives cheap or expensive there is a thing the scheduler controls here, which is why collective-aware placement in Section 33.5 is the hinge of the chapter.

What's Next?

This chapter kept the work inside the datacenter, where the network is fast, the power is plentiful, and the scheduler owns a contiguous fabric of machines. Chapter 34: Edge, Fog, and On-Device Distributed AI pushes the infrastructure outward from that warm center to the periphery, where the machines are phones, sensors, vehicles, and gateways, the network is intermittent, and no single scheduler owns anything. The gang scheduling and topology-aware placement of this chapter assumed a cluster you control; the edge withdraws that assumption and asks how distributed AI behaves when compute is scattered across thousands of weak, unreliable, geographically spread devices. The placement problem does not vanish; it inverts, and the federated and decentralized ideas of Chapter 14 return as the way to cope. Read it next to follow the infrastructure from the rack to the road.

Bibliography & Further Reading

Foundational Papers

Yoo, A. B., Jette, M. A., Grondona, M. "SLURM: Simple Linux Utility for Resource Management." Job Scheduling Strategies for Parallel Processing (JSSPP), 2003. slurm.schedmd.com

The original paper for the batch scheduler that runs many of the world's largest HPC systems and a large share of its training clusters; the queue-and-place model that Section 33.4 starts from.

📄 Paper

Verma, A., Pedrosa, L., Korupolu, M., et al. "Large-scale cluster management at Google with Borg." EuroSys 2015. research.google

The system whose ideas became Kubernetes: priority, preemption, and bin-packing at warehouse scale. The intellectual ancestor of the control plane in Sections 33.3 and 33.4.

📄 Paper

Moritz, P., Nishihara, R., Wang, S., et al. "Ray: A Distributed Framework for Emerging AI Applications." OSDI 2018. usenix.org

Introduces Ray and the distributed object store at its center; the design Section 33.7 develops as the Python-native alternative to Kubernetes and Slurm.

📄 Paper

Daly, J. T. "A higher order estimate of the optimum checkpoint interval for restart dumps." Future Generation Computer Systems 22(3), 2006. sciencedirect.com

The refined formula for how often to checkpoint given a failure rate and checkpoint cost; the mathematics Section 33.8 borrows to schedule against preemption.

📄 Paper

Young, J. W. "A first order approximation to the optimum checkpoint interval." Communications of the ACM 17(9), 1974. dl.acm.org

The original square-root rule relating optimal checkpoint interval to mean time between failures; the half-century-old result that still sizes spot-training cadence in Section 33.8.

📄 Paper

Tools & Libraries

Kubernetes Documentation: scheduling, preemption, and eviction. kubernetes.io/docs

The reference for the cloud-native control plane: how the default scheduler places pods, and how priority and preemption work, the foundation Section 33.3 builds on.

🔧 Tool

Volcano: a Kubernetes-native batch scheduling system (CNCF). volcano.sh

The kube-batch successor that brings gang scheduling, fair-share queues, and topology awareness to Kubernetes; the AI-aware batch scheduler of Sections 33.4 and 33.5.

🔧 Tool

NVIDIA Multi-Instance GPU (MIG) User Guide. docs.nvidia.com

The official guide to partitioning one accelerator into hardware-isolated instances; the strong-isolation end of the GPU-sharing spectrum in Section 33.6.

🔧 Tool

NVIDIA Multi-Process Service (MPS) Documentation. docs.nvidia.com

The official reference for sharing a GPU concurrently across processes through a shared MPS server; the higher-utilization, weaker-isolation counterpart to MIG in Section 33.6.

🔧 Tool

Ray Documentation: clusters, autoscaling, and the object store. docs.ray.io

The practical entry point for standing up and autoscaling a Ray cluster; the toolkit behind Section 33.7 and much of the training and serving stack elsewhere in the book.

🔧 Tool

SkyPilot: run AI on any cloud. docs.skypilot.co

An open framework for launching jobs across clouds and chasing spot capacity automatically; a concrete realization of the cost-driven spot scheduling of Section 33.8.

🔧 Tool

Platforms & Documentation

Amazon SageMaker Documentation: training and inference. docs.aws.amazon.com

The managed-platform reference for AWS; a concrete instance of the hide-the-scheduler trade weighed in Section 33.9.

📖 Docs

Google Cloud Vertex AI Documentation. cloud.google.com

Google Cloud's managed AI platform, with TPU access and pipeline orchestration; the second of the three managed platforms compared in Section 33.9.

📖 Docs

Databricks Documentation: clusters and jobs. docs.databricks.com

The Spark-rooted lakehouse platform with managed clusters and job scheduling; the data-and-ML managed option in Section 33.9, tying back to the Spark of Part II.

📖 Docs