Section 33.3: Containers and Kubernetes for AI

"They promised me eight identical siblings on eight identical machines. I was rescheduled onto a node that has no GPU, and now I am importing torch into the void."
A Pod, Rescheduled Onto a Node That Has No GPU

Big Picture

A cluster does not run your model; it runs a container, and Kubernetes decides where that container lands. Before any distributed-training collective fires or any inference request is served, the work has to be packaged into an immutable image that pins the exact CUDA, driver-userspace, and Python stack, and then placed onto machines that actually have the accelerators the image expects. Kubernetes is the layer that turns "a YAML description of what I want" into "running processes on real nodes", advertising GPUs as schedulable resources, honoring placement rules, and enforcing per-tenant quotas. This section shows how containers freeze the environment and how Kubernetes schedules it, and it is honest about one sharp limitation: the default scheduler places one Pod at a time and knows nothing about the fact that a training job needs all of its Pods at once. That gap is the entire motivation for the gang and topology-aware scheduling that the next two sections build.

A distributed AI system is, at the level of an operating cluster, a population of processes that must be created, placed on machines with the right hardware, kept alive, and torn down. Doing that by hand (logging into nodes, checking which GPUs are free, copying environments, launching processes in the right order) does not survive contact with a fleet of hundreds of machines and dozens of concurrent jobs. Two technologies have become the near-universal answer. Containers package the software so that the environment a job runs in is identical on every node and reproducible months later. Kubernetes orchestrates those containers, deciding which node each one runs on, restarting the ones that die, and rationing scarce accelerators among competing teams. Together they are the substrate on which everything in Parts III through VI ultimately executes. The previous section introduced the cluster as a pool of heterogeneous resources; this section makes the resources schedulable and the workloads portable.

1. Why Containers: Freezing the CUDA Stack Beginner

The defining pain of running AI software is that it depends on a tall, brittle stack of versioned components: a Python interpreter, a specific PyTorch or JAX build, a matching CUDA toolkit, cuDNN, NCCL, and a host GPU driver whose userspace libraries must be compatible with all of the above. Two machines that look identical can differ in one of these layers and produce a training run that either crashes on import or, worse, runs slightly differently. A container is a filesystem image plus metadata that bundles everything above the host kernel, the interpreter, the libraries, the CUDA userspace, and the application code, into one immutable artifact. When you run that image, the process sees exactly the files baked into it, regardless of what the host machine happens to have installed. The reproducibility this buys is the same reproducibility that makes a result trustworthy, which is why container images are the unit of deployment in the MLOps practices of Chapter 26.

The one component a container does not carry is the kernel-level GPU driver itself, which lives on the host. The container brings the CUDA userspace and expects the host driver to be present and new enough; on a GPU cluster this is arranged by a device runtime (historically the NVIDIA Container Toolkit) that injects the host driver libraries and device files into the container at launch. The practical rule, which we make precise in the next section, is that the image pins everything above the driver and the cluster guarantees a compatible driver below it. Get that contract right and the same image runs unchanged on a laptop with one GPU and on a node with eight.

Key Insight: The Image Is the Reproducible Unit, the Driver Is the Contract

A container freezes the entire software stack above the host kernel into one immutable image, so the environment is identical on every node and reconstructable later. The single layer it cannot freeze is the host GPU driver, which it borrows from the machine it lands on. Distributed AI on Kubernetes therefore rests on one contract: the image pins CUDA, cuDNN, NCCL, and the framework; the cluster guarantees a host driver compatible with them. Every reproducibility failure on a GPU cluster is, at bottom, a violation of one side of that contract.

2. The GPU Operator and the Device Plugin Intermediate

Kubernetes by default understands two resources: CPU and memory. It has no native concept of a GPU. Accelerators enter the scheduler's vocabulary through the device plugin mechanism, a small agent running on every GPU node that discovers the local accelerators and advertises them to the node's kubelet as a custom countable resource, conventionally named nvidia.com/gpu. Once advertised, a GPU is treated like any other quantity the scheduler can hand out: a node reports it has eight of them, each Pod asks for some number, and the scheduler subtracts as it places. The device plugin also handles the low-level work of exposing the right device files into each container so the process can actually open the hardware.

Installing the device plugin, the container runtime hooks, the driver, and the monitoring exporters by hand on every node is itself a distributed-systems chore. The common solution is the GPU Operator, a Kubernetes operator that installs and reconciles the entire GPU software stack (driver, container toolkit, device plugin, and metrics exporter) as managed cluster resources, so that adding a new GPU node is a matter of labeling it rather than configuring it. Operators encode operational knowledge as control loops, the same reconciliation pattern that makes Deployments self-healing, applied to infrastructure. With the operator in place, the abstraction the rest of this section relies on holds: a Pod that requests nvidia.com/gpu: 1 will only ever be placed on a node that has a free GPU and the software to use it.

Library Shortcut: One Helm Install Replaces Per-Node Driver Surgery

Manually installing a matching driver, the container toolkit, the device plugin, and a metrics exporter on every GPU node is dozens of error-prone steps per machine, repeated whenever a node is reimaged. The NVIDIA GPU Operator collapses the entire stack into one Helm release that reconciles itself across all current and future nodes:

# Install the full GPU software stack as a self-reconciling operator.
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Confirm every GPU node now advertises schedulable accelerators.
kubectl get nodes -o custom-columns=NODE:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu'

Code 33.3.1: The GPU Operator turns per-node driver and plugin installation into a single declarative release. The second command reads the nvidia.com/gpu allocatable count the device plugin advertised, the same quantity the scheduler subtracts from when it places a GPU Pod.

3. Pods, Deployments, and Jobs Beginner

The atom of execution in Kubernetes is the Pod: one or more containers that share a network identity and are always scheduled together onto a single node. A Pod is the smallest thing the scheduler places, and for AI work the mental model is "one Pod is one process on one machine, typically owning some slice of the accelerators on that machine". You rarely create bare Pods, because a bare Pod that dies stays dead. Instead you declare a higher-level controller that owns Pods and continuously drives the cluster's real state toward your declared intent. Two controllers carry almost all AI workloads, and the choice between them is exactly the choice between training and serving.

A Deployment manages a set of identical, interchangeable, long-lived Pods and keeps a target number of them running. If a Pod or its node dies, the Deployment creates a replacement; if you ask for more replicas, it scales out. This is precisely the shape of an inference service: a horizontally scaled pool of stateless model-server replicas sitting behind a load-balancing Service, which is the deployment pattern that Chapter 23 develops and that Chapter 24 specializes for large language models. A Job, by contrast, manages Pods that run to completion and then stop, which is the shape of a training run or a batch evaluation: it launches the work, tracks how many Pods finished successfully, and retries failures up to a limit. Figure 33.3.1 shows the control plane translating both kinds of declared intent into placed Pods on worker nodes.

Figure 33.3.1: The control plane translates declared intent into placed Pods. A training Job (top) is a set of GPU Pods spread across nodes that, for a distributed run, must all be running at once to make progress. An inference Deployment (bottom) is a set of interchangeable replicas behind a load-balancing Service, where any one replica can serve any request and a missing replica only reduces capacity. The default scheduler places each Pod independently, which is why the top group has a coordination hazard the bottom group does not.

The distinction in Figure 33.3.1 is the heart of why AI strains Kubernetes. An inference Deployment is embarrassingly tolerant: replicas are independent, the loss of one is a small capacity dip, and Pod restarts driven by the controller map cleanly onto graceful fleet churn. A distributed training Job is the opposite: its Pods are not interchangeable, they are coordinated peers running a collective such as the all-reduce of Chapter 4, and a single missing peer stalls the entire group at the next synchronization barrier. The controller abstractions describe both, but they protect only the first.

4. Requests, Limits, and Bin Packing Intermediate

Every container declares two numbers per resource: a request, the amount the scheduler reserves and uses for placement, and a limit, the ceiling the runtime enforces at execution. For CPU and memory the two can differ, allowing oversubscription. For whole GPUs they are effectively the same: an accelerator is an indivisible countable unit, so a Pod that requests one GPU holds that GPU exclusively for its lifetime, and no other Pod can share it. This is why GPU clusters live or die by packing efficiency. If jobs request GPUs in shapes that do not tile cleanly onto the nodes, accelerators sit idle inside partially filled machines while new jobs wait. The scheduler's placement problem is a bin-packing problem, and its quality is measured by utilization,

$$U = \frac{\sum_{j} g_j}{G_{\text{total}}}, \qquad \text{fragmentation} = 1 - \frac{\sum_{j} g_j}{\sum_{n} \mathbb{1}[\text{node } n \text{ has any GPU in use}] \cdot G_n},$$

where $g_j$ is the number of GPUs job $j$ holds, $G_{\text{total}}$ is the cluster's total, $G_n$ is node $n$'s GPU count, and the fragmentation term measures how much capacity is stranded inside nodes that are partly but not fully used. A cluster can show high allocation yet poor effective utilization when many nodes each carry one small job and refuse a large one that needs a whole node. Requesting GPUs in node-aligned shapes (a full node of eight, or a clean half) is how large jobs keep fragmentation low; the topology-aware placement of Chapter 4 adds a second reason to pack a job onto as few nodes as possible, namely that intra-node accelerator links are far faster than the inter-node network.

When a whole GPU per Pod is wasteful, the hardware offers a middle path. Multi-Instance GPU partitioning splits one physical accelerator into several smaller, hardware-isolated instances, each advertised to the scheduler as its own unit so that several light inference Pods can pack onto one card. We treat that partitioning in detail in the next section as one lever for raising utilization; here it is enough to note that requests and limits are the dial through which a tenant tells the scheduler how much accelerator it intends to hold.

Practical Example: The Cluster That Was Full and Idle at the Same Time

Who: A platform engineer running a shared 16-node, 128-GPU research cluster for several model teams.

Situation: The dashboard showed 96 percent of GPUs allocated, yet a new eight-GPU training Job had been pending for two hours and the teams complained the cluster felt empty.

Problem: Many single-GPU notebook Pods had been placed one per node by the default scheduler, leaving seven free GPUs on each of a dozen nodes but no node with eight free at once.

Dilemma: Buy more hardware to relieve apparent pressure, or accept that the cluster was fragmented, not full, and change how it packed.

Decision: They treated it as a packing problem, not a capacity problem, since effective utilization by the fragmentation measure above was far below the 96 percent allocation figure.

How: They moved interactive single-GPU Pods into a dedicated node pool with bin-packing placement, reserved the remaining nodes as whole-node units for multi-GPU Jobs, and enforced it with node affinity and taints (Section 5).

Result: The pending eight-GPU Job scheduled in seconds, and measured GPU-hours of useful work rose by roughly a third with no new hardware.

Lesson: Allocation is not utilization. A scheduler that places each Pod greedily and independently can strand most of a cluster's capacity inside partly filled nodes, and the fix is shaping requests and placement, not buying GPUs.

5. Placing Work on the Right Accelerator: Selectors, Taints, and Affinity Intermediate

A real GPU cluster is heterogeneous: some nodes carry top-tier training accelerators, others carry cheaper inference cards, some are spot or preemptible instances, some sit in a different network rack. Getting the right Pod onto the right node is the job of three placement controls, and AI workloads use all three. A node selector (or its richer cousin, node affinity) is a requirement a Pod attaches to itself: "schedule me only on a node labeled with this accelerator type". It is how a memory-hungry training Pod insists on a high-memory card and how a latency-sensitive inference replica insists on staying off a slow one. A taint works in the opposite direction: it is a mark a node attaches to itself to repel Pods that do not explicitly tolerate it, the mechanism by which a pool of expensive GPU nodes refuses ordinary CPU jobs that would otherwise pack onto it and waste the accelerators. Together, affinity pulls the right Pods toward a node and taints push the wrong ones away.

The third control, Pod affinity and anti-affinity, expresses relationships between Pods rather than between a Pod and a node. Anti-affinity spreads the replicas of an inference Deployment across distinct nodes and racks so that one machine or rack failure cannot take out the whole service, a direct application of the failure-domain thinking that fault tolerance demands. Affinity does the reverse, packing the peers of a training Job close together to shorten the communication path. These rules shape placement but do not coordinate it: each rule is still evaluated for one Pod at a time, which is the seam this section keeps returning to.

apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetune
spec:
  backoffLimit: 4                       # retry a failed Pod up to 4 times
  completions: 1
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        accelerator: a100-80gb          # only land on A100-80GB nodes
      tolerations:
      - key: nvidia.com/gpu             # tolerate the GPU-node taint so we may
        operator: Exists                #   schedule onto the reserved GPU pool
        effect: NoSchedule
      containers:
      - name: trainer
        image: registry.example.com/ml/bert-train:cuda12.4-torch2.5
        command: ["torchrun", "--nproc_per_node=1", "finetune.py"]
        resources:
          requests:
            cpu: "8"
            memory: 64Gi
            nvidia.com/gpu: 1           # request reserves; for whole GPUs
          limits:
            nvidia.com/gpu: 1           #   request and limit must match

Code 33.3.2: A single-Pod GPU training Job. The nodeSelector pins it to A100-80GB nodes, the toleration lets it onto the tainted GPU pool, and the matching GPU request and limit reserve exactly one accelerator. The pinned image (cuda12.4-torch2.5) is the frozen stack of Section 1; backoffLimit is the Job-level analogue of the Pod restarts that elastic training exploits.

Code 33.3.2 is a complete, runnable manifest for the common case of a single-worker fine-tune. Its backoffLimit is worth pausing on: a Job retries a failed Pod automatically, which means a transient node failure during training is recovered by Kubernetes recreating the Pod. That recovery is exactly the Pod-restart mechanism that elastic and fault-tolerant training in Chapter 18 builds on, pairing automatic restarts with checkpointing so a multi-hour run survives losing a worker. The manifest scales to one worker cleanly; scaling it to many coordinated workers is where the default scheduler's blind spot appears.

6. Namespaces and Quotas for Multi-Tenant Clusters Intermediate

A cluster shared by several teams needs walls. A namespace is a logical partition of the cluster that scopes names and serves as the unit to which policy attaches, so each team works inside its own namespace without colliding with another team's Pod names or seeing its objects. The teeth come from a ResourceQuota bound to a namespace, a hard cap on the total resources its Pods may collectively request, including the custom GPU resource. A quota of nvidia.com/gpu: 16 on a team's namespace means that team can never hold more than sixteen accelerators at once, no matter how many Jobs it submits, which protects the other tenants from a single team monopolizing the cluster. Quotas turn a shared cluster from a free-for-all into a governed commons, and they are the cluster-level expression of the fairness concerns that scheduling policy, in the next section, makes dynamic.

Quotas interact with the packing story of Section 4 in a way worth flagging. A namespace can be well within its GPU quota yet still unable to schedule a large Job because the cluster is fragmented, and conversely a cluster with free capacity may refuse a Job because its namespace quota is exhausted. The two limits are independent gates, one about fairness across tenants and one about physical fit, and a Pod schedules only when it clears both. Naming this explicitly saves many hours of debugging a Pod that is "pending" for reasons that look contradictory until you check both the quota and the node fit.

7. The Gang-Scheduling Gap Advanced

Everything so far describes a scheduler that, by design, considers one Pod at a time. It picks a Pod from the queue, finds a node that satisfies that Pod's requests, affinity, taints, and the namespace quota, binds it, and moves on. For an inference Deployment this is exactly right: replicas are independent, so placing them one by one and starting each as soon as it lands is optimal. For a distributed training Job it is a trap. Suppose a Job needs eight GPU Pods that must run together to form a single all-reduce group. The default scheduler may place six of them, exhaust the available GPUs, and leave the last two pending. The six placed Pods are now running, holding six precious GPUs, blocked at the first synchronization barrier waiting for two siblings that cannot be scheduled. They make no progress and they release nothing. Worse, two such half-placed Jobs can each hold half the cluster and each wait for the other to release GPUs, a deadlock that wastes the entire machine.

The missing property has a name: gang scheduling, the rule that a group of Pods is scheduled all-or-nothing, never partially. Vanilla Kubernetes does not provide it, because its core scheduler has no notion that a set of Pods forms an indivisible unit whose value is zero until every member is placed. This is not a bug; it is a design choice optimized for the independent, long-lived services that Kubernetes was born to run, and AI training is the workload that exposes its limit. The remedy is to extend the scheduler with awareness of the group, which is precisely what gang and batch schedulers add, and the straggler-avoidance benefit of placing all peers together at once was previewed back in the scheduling arc of Chapter 4.

Thesis Thread: The Collective Reaches Down Into the Scheduler

The all-reduce that Chapter 4 introduced and Chapter 15 turned into a training step is not only a network operation; it is a scheduling constraint. Because every worker must reach the barrier for the collective to complete, the workers form an indivisible gang, and a scheduler that places them one at a time can stall the whole group. The same primitive that makes data parallelism exact (one synchronized step across all workers) is what makes naive placement fail. Scale-out, in other words, pushes its demands all the way down from the math of the gradient to the policy of the cluster scheduler, and the next two sections answer that demand.

Research Frontier: Topology-Aware and Gang Scheduling for Large Jobs (2024 to 2026)

Because the gap above wastes the most expensive hardware in the building, gang-aware batch scheduling is an active line. Kubernetes-native batch schedulers such as Volcano and the Kueue queueing layer add all-or-nothing placement, fair-share queues, and preemption on top of the default scheduler, and have become the standard way to run training Jobs at scale. A parallel thread makes placement topology-aware so a job's gang lands on GPUs that share fast interconnect: Kubernetes added dynamic resource allocation and structured device topology so the scheduler can reason about which accelerators sit on the same high-bandwidth fabric, and large-cluster operators report that placing a gang onto interconnect-adjacent nodes can change effective training throughput substantially. The frontier question is co-optimizing the three pressures of this section at once, packing for utilization, gang placement for correctness, and topology for communication speed, under live multi-tenant quotas. We take up the schedulers themselves next.

Fun Note

A six-of-eight training gang holding six idle GPUs at a barrier is the distributed-systems version of holding the entire dinner table hostage because two guests are stuck in traffic. Nobody eats, the food gets cold, and the seats cannot be given to anyone else. Gang scheduling is the maitre d' who simply will not seat the party until everyone has arrived.

We now have the substrate. Containers freeze the CUDA stack into a reproducible image; the device plugin and GPU Operator make accelerators schedulable; Pods, Deployments, and Jobs express serving and training; requests, limits, selectors, taints, affinity, namespaces, and quotas govern where work lands and who may hold what. We also have the precise shape of the substrate's limit: it places one Pod at a time and cannot, alone, guarantee a training gang runs all-or-nothing on fast-interconnect nodes. The next section turns to the schedulers and hardware-partitioning techniques that close that gap, and Section 33.4 begins with how the cluster scheduler is extended to reason about whole jobs rather than lone Pods.

Exercise 33.3.1: Job or Deployment? Conceptual

For each workload, state whether you would express it as a Kubernetes Job or a Deployment, and say in one sentence why its Pods are or are not interchangeable: (a) a four-node, 32-GPU pretraining run using all-reduce; (b) a stateless image-classification API that must serve 5,000 requests per second; (c) a one-time batch job that computes embeddings for a 100-million-document corpus and then exits; (d) a vector-search service behind a load balancer. For the two training-flavored cases, explain why the default scheduler's one-Pod-at-a-time placement is a hazard that the serving cases do not face.

Exercise 33.3.2: A Multi-Worker Training Manifest Coding

Extend Code 33.3.2 from a single-Pod Job into a four-worker distributed training Job. Set parallelism: 4 and completions: 4, give each worker one GPU, and add a headless Service plus the environment variables (MASTER_ADDR, WORLD_SIZE=4, a per-Pod RANK) that torchrun needs to form the process group of Chapter 4. Then answer in prose: with the default scheduler and only three free GPUs in the cluster, what state are the four Pods in after placement, and what are the three placed Pods doing while the fourth is pending? Name the property the manifest cannot express that would prevent this.

Exercise 33.3.3: Fragmentation Arithmetic Analysis

A cluster has 16 nodes of 8 GPUs each ($G_{\text{total}} = 128$). Twelve single-GPU interactive Pods are each placed on a distinct node by a greedy scheduler. Using the two formulas in Section 4, compute the raw allocation $U$ and the fragmentation term, then determine the largest single multi-GPU Job (in whole GPUs on one node) that can still be scheduled. Now suppose those twelve Pods had instead been bin-packed onto two nodes; recompute the largest schedulable single-node Job and the number of completely free nodes. Quantify, in GPU-hours over an eight-hour window, how much large-Job capacity the greedy placement stranded.