Part VII: Cluster, Edge, and Reliable Infrastructure
Chapter 33: Cluster Infrastructure and Scheduling

Batch Schedulers: Slurm, Kubernetes Batch, and Volcano

"I am a perfectly reasonable 8-GPU job, sitting in the queue behind a 512-GPU reservation that does not start for six hours. I could have finished twice by now. Nobody asked."

A Job, Sitting in the Queue Behind a 512-GPU Reservation
Big Picture

A cluster is shared, finite, and contested; the batch scheduler is the program that decides, moment by moment, which of the waiting jobs gets which machines. Every training run, hyperparameter sweep, and data-processing pipeline in this book ultimately arrives as a job in a queue, and the scheduler turns a pile of competing requests into an ordered, packed assignment of work to hardware. The hard part is not running one job; it is keeping a thousand-GPU cluster busy and fair when jobs range from a one-GPU notebook to a thousand-GPU foundation-model run that needs every one of its GPUs at the same instant or none of them. This section covers the two scheduler families that AI clusters actually run on, the HPC lineage (Slurm) and the cloud-native lineage (Kubernetes plus Volcano), and the queue, priority, fair-share, and backfill machinery they share. It sets up the gang-scheduling and topology-aware placement that Section 33.5 makes precise.

The previous section treated a cluster as a pool of nodes that some controller hands out. This section names that controller. Whenever a research team submits a large training run, a platform team launches a nightly retraining pipeline, or an AutoML system fires off two hundred trials, the request does not run immediately on bare metal; it is placed in a queue, ranked against everything else waiting, and dispatched only when the scheduler decides its turn has come and enough hardware is free. The scheduler is the single most consequential piece of shared infrastructure in a multi-team AI cluster, because it determines both how fast any one job starts and how much of the expensive accelerator fleet sits idle. For foundation-model training the stakes are extreme: a run that distributes data, the model, and the optimizer state across hundreds of accelerators (Chapter 19) is useless unless all of those accelerators are granted together.

Two scheduler families dominate. The high-performance-computing world, where multi-node simulations have needed coordinated placement for decades, runs Slurm and its relatives. The cloud-native world, built around containers and Kubernetes, originally had only a service-oriented scheduler and grew batch and gang capabilities later, most visibly through Volcano. AI training sits awkwardly between the two: it has the all-or-nothing, topology-sensitive, long-running character of an HPC job, but it increasingly lives in Kubernetes clusters that also run inference services, data pipelines, and web backends. Understanding both families, and why neither is a perfect fit, is the goal here.

1. The Queue, the Partition, and the Priority Model Beginner

Every batch scheduler shares the same three-part skeleton, and naming the parts makes both Slurm and Volcano legible. First, jobs wait in one or more queues. A job is a resource request plus a command: "I need 64 GPUs across 8 nodes for up to 12 hours, then run this script." Second, the cluster is carved into partitions (Slurm's term) or queues with quotas (Volcano's term), which fence off subsets of hardware for particular uses: a partition of high-memory nodes, a partition reserved for short interactive jobs, a partition that only a privileged project may use. Third, the scheduler assigns each waiting job a priority, a single number that ranks it against its peers, and then repeatedly tries to start the highest-priority job that currently fits.

Figure 33.4.1 shows this skeleton. Jobs enter a queue, the scheduler ranks them by priority and consults the partition layout to see what hardware each may use, and dispatched jobs occupy nodes until they finish or are preempted. The same picture describes Slurm and Volcano; they differ in vocabulary and in how cloud-native the plumbing is, not in shape.

[Figure 33.4.1 diagram. Priority-ordered queue: big-512 (prio 90, 16 GPU), smallX (prio 50, 4 GPU), smallY (prio 40, 4 GPU), smallZ (prio 30, 8 GPU), submitted via sbatch / kubectl apply. Scheduler: rank by priority, match to partition, backfill the gaps. Partition "training" (GPU nodes): node A, node B, node C. Partition "interactive" (short): node D, node E. Dispatched jobs hold nodes until they finish or are preempted.]
Figure 33.4.1: The shared skeleton of a batch scheduler. Jobs wait in a priority-ordered queue, the scheduler ranks them and matches each to a partition of eligible nodes, and dispatched jobs occupy hardware until completion or preemption. Slurm and Volcano differ in vocabulary and plumbing, not in this shape. The job names match the simulation in Code 33.4.2.

Priority is rarely a fixed number. A pure first-come-first-served queue is simple but lets one user monopolize the cluster by submitting a thousand jobs at midnight. A pure largest-job-first policy starves small interactive work. Real schedulers compute priority from several weighted factors: how long the job has waited, how large it is, what its project's quota is, and crucially how much the submitting user or group has consumed recently. That last factor is fair-share, and it deserves its own treatment.

Key Insight: The Scheduler Optimizes Two Things in Tension

A batch scheduler is always balancing utilization against fairness and responsiveness. Pack the cluster tighter and a few large jobs win while small jobs starve; spread access more evenly and the expensive fleet sits partly idle waiting for the right-sized job. Every policy knob (priority weights, fair-share decay, partition limits, backfill aggressiveness) is a point on this trade-off. There is no globally correct setting, only a setting that matches your cluster's job mix, which is why scheduler configuration is a continuous operational task, not a one-time install.

2. Slurm: sbatch, Partitions, Fair-Share, and Backfill Intermediate

Slurm (originally an acronym for Simple Linux Utility for Resource Management) is the workhorse of academic and national-lab GPU clusters, and many industrial AI clusters run it too. A user describes a job in a batch script and submits it with sbatch. The script's #SBATCH directives declare the resource request (nodes, GPUs, time limit, partition), and the body runs once the allocation is granted. The body typically launches the distributed program with srun, which spawns one task per allocated slot and sets the per-task rank environment. Code 33.4.1 is a representative multi-node training submission.

#!/bin/bash
#SBATCH --job-name=gpt-pretrain
#SBATCH --partition=training        # which pool of nodes to draw from
#SBATCH --nodes=8                    # 8 machines, all granted together (gang)
#SBATCH --gpus-per-node=8           # 64 GPUs total
#SBATCH --ntasks-per-node=8         # one task (rank) per GPU
#SBATCH --time=12:00:00             # walltime cap; lets backfill plan around us
#SBATCH --exclusive                 # no other job shares these nodes

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=29500

# srun launches all 64 ranks at once; the allocation is all-or-nothing.
srun python -u train.py --config configs/gpt_64gpu.yaml
Code 33.4.1: A Slurm batch script requesting 8 nodes and 64 GPUs for a foundation-model pretraining run. The #SBATCH directives are the resource request the scheduler queues and ranks; srun launches all 64 ranks together once the gang allocation is granted. The --time cap is what lets backfill (Section 3) reason about when these nodes will free up.
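To make the "rank environment" concrete, here is a minimal sketch of how a program like train.py might discover its place in the gang. It assumes the per-task variables SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID that srun sets, plus the MASTER_ADDR and MASTER_PORT exported by the script above. The RankInfo class, the rank_info_from_env helper, and the hostname node001 are hypothetical names for illustration, not part of any library.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    rank: int         # global rank across the whole job
    world_size: int   # total number of ranks in the job
    local_rank: int   # rank within this node, typically the GPU index
    master_addr: str  # rendezvous host exported by the batch script
    master_port: int  # rendezvous port exported by the batch script


def rank_info_from_env(env=None):
    """Read the per-task variables srun sets, plus the rendezvous address."""
    env = os.environ if env is None else env
    return RankInfo(
        rank=int(env["SLURM_PROCID"]),
        world_size=int(env["SLURM_NTASKS"]),
        local_rank=int(env["SLURM_LOCALID"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )


# Simulated environment for the last rank of the 8-node, 64-GPU job.
# "node001" is a made-up hostname.
example_env = {
    "SLURM_PROCID": "63",
    "SLURM_NTASKS": "64",
    "SLURM_LOCALID": "7",
    "MASTER_ADDR": "node001",
    "MASTER_PORT": "29500",
}
info = rank_info_from_env(example_env)
print(info)
```

Passing the environment as a dictionary keeps the sketch testable off-cluster; inside a real allocation the function would be called with no argument and read os.environ.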

Two features of Code 33.4.1 are doing quiet but essential work. The first is that the request is a gang: 8 nodes and 64 GPUs must be granted simultaneously, because a data-parallel or sharded training step (Chapter 15) does an all-reduce across all ranks on every step, and a rank with no peer to reduce against simply hangs. The second is the walltime cap, --time=12:00:00. It is not just a safety limit; it is the information backfill needs to plan, as Section 3 shows. With these declared, Slurm can rank the job, and ranking is where fair-share enters.

Slurm's fair-share priority gives each account a target portion of the cluster and penalizes accounts that have recently overconsumed. Recorded usage decays over time, so a heavy user's jobs sink in the queue only until that usage cools. A common formulation expresses a job's priority as a weighted sum of normalized factors,

$$P_{\text{job}} = w_{\text{fair}}\,F + w_{\text{age}}\,A + w_{\text{size}}\,S + w_{\text{qos}}\,Q,$$

where each factor is normalized to $[0,1]$: $A$ grows with time spent waiting, $S$ with the job's size, $Q$ with an administratively assigned quality-of-service class, and $F$ is the fair-share factor. A widely used form of the fair-share factor compares an account's target share $s$ to its recent effective usage $u$,

$$F = 2^{-\,u / (f \cdot s)},$$

so an account using exactly its target ($u = s$) sits at $F = 2^{-1/f}$ (one half when $f = 1$), an underused account ($u \ll s$) approaches $F \to 1$ and floats to the top of the queue, and an account that has consumed far past its share ($u \gg s$) sees $F \to 0$ and sinks. The tunable $f$ controls how sharply overuse is punished. The exponential shape matters: every additional $f \cdot s$ of usage halves the factor again, so the penalty for hoarding the cluster keeps compounding instead of leveling off, which is exactly the disincentive a shared resource needs.
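The two formulas above are small enough to evaluate directly. The following is a minimal sketch, not Slurm's implementation: the function names, the weight values, and the example factor values are all assumptions chosen for illustration.

```python
def fair_share_factor(usage, share, damping=1.0):
    """F = 2^(-u / (f * s)), with u = usage, s = share, f = damping."""
    return 2.0 ** (-usage / (damping * share))


def job_priority(factors, weights):
    """Weighted sum of normalized factors, each expected to lie in [0, 1]."""
    return sum(weights[name] * factors[name] for name in weights)


# Hypothetical site weights: fair-share dominates, age and QoS matter less.
weights = {"fair": 10000, "age": 1000, "size": 500, "qos": 2000}

share = 0.25  # hypothetical target share of the cluster for one account
for usage in (0.05, 0.25, 1.0):
    f = fair_share_factor(usage, share)
    p = job_priority({"fair": f, "age": 0.5, "size": 0.1, "qos": 0.0}, weights)
    print(f"u={usage:.2f}  F={f:.4f}  P={p:.0f}")
```

With these assumed weights the three jobs differ only in their account's recent usage, so the printed priorities show how far fair-share alone can move a job up or down the queue.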

Practical Example: The Sweep That Buried the Whole Lab

Who: A graduate student on a shared 256-GPU university Slurm cluster.

Situation: Facing a deadline, the student submitted a 400-job hyperparameter sweep, each job claiming 4 GPUs for 6 hours, all at once.

Problem: By morning the student's jobs held most of the cluster, and every other group's jobs were stuck behind them in the queue, generating angry emails to the cluster admin.

Dilemma: Kill the running sweep and waste the compute already spent, or let it finish and leave the rest of the lab idle for a day.

Decision: The admin did neither by hand; Slurm's fair-share had already begun to act, because the student's account usage $u$ had shot far past its target share $s$, driving its fair-share factor $F = 2^{-u/(f s)}$ toward zero.

How: As each sweep job finished, the student's queued jobs were now ranked below every other account's waiting work, so the cluster naturally drained the backlog from other groups before scheduling any more of the sweep.

Result: Within a few hours the cluster was serving every group again, and the sweep finished in the background using only the slack capacity its depressed priority earned it.

Lesson: Fair-share is not a punishment bolted on after the fact; it is the mechanism that lets a shared cluster absorb a burst from one user without a human arbitrating every conflict. Declaring accurate walltimes and reasonable sizes lets it work smoothly.
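How the student's priority eventually recovers can be sketched under one assumption: recorded usage decays exponentially with a half-life once the account stops consuming (Slurm exposes such a setting as PriorityDecayHalfLife). The share, the initial usage, and the 7-day half-life below are hypothetical values, not figures from the story.

```python
def decayed_usage(u0, elapsed, half_life):
    """Recorded usage after `elapsed` time with no new consumption."""
    return u0 * 0.5 ** (elapsed / half_life)


def fair_share_factor(usage, share, damping=1.0):
    """F = 2^(-u / (f * s)), the fair-share factor from Section 2."""
    return 2.0 ** (-usage / (damping * share))


share = 0.25      # hypothetical target share of the student's account
u0 = 1.0          # hypothetical recorded usage right after the burst
half_life = 7.0   # hypothetical decay half-life, in days

for day in (0, 7, 14, 21):
    u = decayed_usage(u0, day, half_life)
    print(f"day {day:2d}: u={u:.3f}  F={fair_share_factor(u, share):.4f}")
```

Under these assumptions the factor climbs from 0.0625 right after the burst back to 0.5 after two half-lives, at which point recorded usage has fallen to the target share.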

3. Backfill: Filling the Holes Without Delaying the Big Job Intermediate

Strict priority order has a costly failure mode. Suppose the highest-priority waiting job needs 16 GPUs but only 8 are free, because a long incumbent job holds the other 8. Pure priority scheduling would leave those 8 GPUs idle, doing nothing, until the incumbent finishes and the big job can finally start. Backfill is the fix: the scheduler is allowed to start a lower-priority job out of order, filling the idle hole, but only if doing so provably does not delay the start of the highest-priority job. This is the EASY backfill discipline, and it depends entirely on the walltime caps from Code 33.4.1, because the scheduler must know when each running job will end to compute the reservation time the big job is waiting for.

The condition is precise: a small job may backfill if it fits in the currently free resources and its walltime guarantees it will finish at or before the moment the high-priority job's reservation becomes satisfiable. Code 33.4.2 simulates this on a 16-GPU cluster, scheduling the same five jobs first under strict priority and then with EASY backfill, and reporting makespan and utilization for each.

import heapq

TOTAL_GPUS = 16
# id: (priority, gpus, walltime)
jobs = {
    "long-8":   (100, 8, 40),   # long incumbent: holds 8 GPUs for 40 units
    "big-512":   (90, 16, 20),  # gang job, needs ALL 16 GPUs, must wait for long-8
    "smallX":    (50, 4,  8),   # short: fits the 8-GPU hole, ends before big's reservation
    "smallY":    (40, 4,  8),
    "smallZ":    (30, 8,  8),
}

def schedule(use_backfill):
    free, now, running = TOTAL_GPUS, 0, []
    pending = sorted(jobs, key=lambda j: -jobs[j][0])   # priority order
    start, finish = {}, {}
    while pending or running:
        progressed = False
        if not pending:
            end, g, jid = heapq.heappop(running)
            finish[jid] = end; free += g; now = end; continue
        head = pending[0]; hp_gpus = jobs[head][1]
        if hp_gpus <= free:                              # head job fits: start it
            _, g, w = jobs[head]; start[head] = now
            heapq.heappush(running, (now + w, g, head))
            free -= g; pending.pop(0); progressed = True
        elif use_backfill and len(pending) > 1:
            freed, t = free, now                         # when can the head job start?
            for end, g, _ in sorted(running):
                freed += g; t = end
                if freed >= hp_gpus: break
            resv = t                                     # head's reservation time
            for cand in pending[1:]:
                _, g, w = jobs[cand]
                if g <= free and now + w <= resv:         # fits AND ends by reservation
                    start[cand] = now
                    heapq.heappush(running, (now + w, g, cand))
                    free -= g; pending.remove(cand); progressed = True; break
        if not progressed:
            end, g, jid = heapq.heappop(running)
            finish[jid] = end; free += g; now = end
    makespan = max(finish.values())
    used = sum(jobs[j][1] * jobs[j][2] for j in jobs)
    return makespan, used / (TOTAL_GPUS * makespan)

for mode in (False, True):
    mk, util = schedule(mode)
    print(f"backfill {'ON ' if mode else 'OFF'}: makespan={mk}  utilization={util:.1%}")
Code 33.4.2: An EASY-backfill simulation on a 16-GPU cluster. The big gang job (big-512) needs all 16 GPUs, so it waits for the long incumbent (long-8) to release its 8; backfill slips the short jobs into the 8-GPU hole only when their walltime guarantees they finish before the big job's reservation. The job names match Figure 33.4.1.
backfill OFF: makespan=68  utilization=70.6%
backfill ON : makespan=60  utilization=80.0%
Output 33.4.2: Backfill cut the makespan from 68 to 60 time units and lifted utilization from 70.6% to 80.0%, without ever delaying the high-priority gang job, whose reservation at $t=40$ is untouched. The idle 8-GPU hole that strict priority would have wasted is now doing useful work.

The lesson of Output 33.4.2 is that backfill is close to free utilization, paid for with a single piece of honesty: an accurate walltime. If users systematically pad their --time values to avoid being killed, short jobs look too long to finish before the reservation and the holes go unfilled, so a cluster's effective throughput depends on a cultural norm of declaring realistic limits. We define utilization as the fraction of available accelerator-time actually spent on jobs, $U = \big(\sum_j g_j w_j\big) / (G \cdot T_{\text{makespan}})$, where $g_j$ and $w_j$ are a job's GPU count and runtime, $G$ is the cluster size, and $T_{\text{makespan}}$ is the wall-clock time needed to drain the queue; backfill raises $U$ by shrinking $T_{\text{makespan}}$.
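As a check on the definition, the figures in Output 33.4.2 can be reproduced by hand from the job table in Code 33.4.2. The total work is the same in both runs,

$$\sum_j g_j w_j = 8 \cdot 40 + 16 \cdot 20 + 4 \cdot 8 + 4 \cdot 8 + 8 \cdot 8 = 768 \text{ GPU-units},$$

and only the makespan in the denominator changes:

$$U_{\text{off}} = \frac{768}{16 \cdot 68} \approx 70.6\%, \qquad U_{\text{on}} = \frac{768}{16 \cdot 60} = 80.0\%.$$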

Library Shortcut: squeue and sbatch Replace a Custom Job Tracker

You never implement the queue logic of Code 33.4.2 in production; Slurm exposes it directly. Submitting and inspecting a job is two commands, and the second shows you exactly why a job is waiting (a Resources or Priority reason, and the estimated start time backfill computed):

sbatch train.sbatch                 # queue the job; prints a job id
squeue --me --start                 # show MY jobs, their state, and predicted start time

# Illustrative squeue output (columns simplified; the real layout depends on the site's format settings):
#  JOBID PARTITION     NAME ST  TIME  NODES   START_TIME        REASON
# 184213  training gpt-pre PD  0:00      8  2026-06-16T18:40    Resources
Code 33.4.3: Slurm's sbatch and squeue --start give the submit-and-inspect loop that Code 33.4.2 simulates by hand. The START_TIME column is the scheduler's predicted start time, which the backfill pass computes from declared walltimes; the REASON column tells you whether the job waits on hardware (Resources) or behind higher-priority jobs (Priority).

4. Kubernetes Batch and Its Limitation for Gang Jobs Intermediate

Kubernetes was built to schedule long-running services: stateless web replicas that can start one at a time, in any order, on any node. Its batch story arrived later as the Job object, which runs a pod (or a fixed number of pods) to completion and retries on failure, and the Indexed Job, which gives each of $N$ completion pods a stable index $0, 1, \ldots, N-1$, exactly the rank a distributed training program needs. For embarrassingly parallel batch work, like a data-cleaning pass over shards or a fleet of independent inference jobs, the Kubernetes Job is entirely adequate.
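For the embarrassingly parallel case, the completion index is all a worker needs to pick its slice of the input. The sketch below assumes the JOB_COMPLETION_INDEX environment variable that Kubernetes sets in each pod of an Indexed Job. The shard names, the worker count of 4, and the shards_for_worker helper are hypothetical.

```python
import os


def shards_for_worker(index, num_workers, shards):
    """Give worker `index` every num_workers-th shard, starting at its own index."""
    return shards[index::num_workers]


# Kubernetes sets JOB_COMPLETION_INDEX in each pod of an Indexed Job.
# The default of "0" only keeps this sketch runnable outside a cluster.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))

# Hypothetical shard names and worker count, for illustration only.
all_shards = [f"shard-{i:03d}.parquet" for i in range(10)]
my_shards = shards_for_worker(index, num_workers=4, shards=all_shards)
print(f"worker {index} processes {my_shards}")
```

Because every worker derives its shards from its index alone, no pod has to wait for any other, which is exactly why this kind of job does not need gang placement.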

The limitation appears the moment the work is a gang. The default Kubernetes scheduler places pods one at a time and independently; it has no native concept of all-or-nothing placement. Submit a training job that needs 64 one-GPU pods, and the scheduler will happily start 40 of them on whatever nodes are free and leave the remaining 24 pending because the cluster is momentarily full. Now 40 GPUs are locked up doing nothing, because their ranks block on an all-reduce that can never complete while peers 40 through 63 do not exist. Worse, two competing gang jobs can each grab half the cluster and deadlock, each holding resources the other needs to reach its full gang. This is the partial-allocation pathology, and it is precisely the failure mode that the gang scheduling of Section 33.5 exists to prevent.
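The pathology can be reproduced in a few lines. This is a toy model, not the behavior of any real scheduler's code: it assumes one-GPU pods, a tiny 8-GPU cluster, and pods from the two jobs arriving interleaved (modeled as round-robin placement). The function and job names are hypothetical.

```python
def place_pod_by_pod(total_gpus, demands):
    """Place one pod at a time, round-robin across jobs, with no knowledge
    that each job needs its full gang before it can make progress."""
    free = total_gpus
    placed = {job: 0 for job in demands}
    progress = True
    while free > 0 and progress:
        progress = False
        for job, need in demands.items():
            if free > 0 and placed[job] < need:
                placed[job] += 1
                free -= 1
                progress = True
    runnable = [j for j, need in demands.items() if placed[j] == need]
    return placed, runnable


def place_gang(total_gpus, demands):
    """Admit a job only if its whole demand fits in the free GPUs right now."""
    free = total_gpus
    placed = {job: 0 for job in demands}
    for job, need in demands.items():
        if need <= free:
            placed[job] = need
            free -= need
    runnable = [j for j, need in demands.items() if placed[j] == need]
    return placed, runnable


demands = {"job-a": 6, "job-b": 6}   # each job needs 6 one-GPU pods
print("pod-by-pod:", place_pod_by_pod(8, demands))
print("gang      :", place_gang(8, demands))
```

Under pod-by-pod placement each job ends up holding 4 of the 8 GPUs and neither can run; under gang admission the first job gets all 6 of its pods and the second waits without holding anything.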

Key Insight: A Training Job Is All-or-Nothing, a Web Service Is Not

The deepest reason AI training stresses cloud-native schedulers is a mismatch in placement semantics. A web service scales pod by pod and tolerates partial readiness; a synchronous training job is a single indivisible unit that needs every rank running before any step can complete. A scheduler designed for the first treats the second as a set of independent pods, and that incremental, one-pod-at-a-time placement is exactly what produces the deadlocks and the idle-but-locked GPUs. Closing this gap is the entire reason batch-for-AI extensions to Kubernetes exist.

5. Volcano: Kubernetes-Native Batch for AI Advanced

Volcano is the most widely adopted batch scheduler for AI on Kubernetes, a CNCF project that replaces or augments the default scheduler with the queue, fair-share, and gang machinery that HPC schedulers have always had. It reintroduces the three-part skeleton of Section 1 in Kubernetes-native form. A Queue is the analogue of a Slurm partition: a named pool with a capacity weight that bounds how much of the cluster its jobs may consume and gives Volcano its fair-share lever across teams. A PodGroup is the unit that makes gang scheduling possible: it declares a minMember count, and Volcano will not start any pod of the group until at least minMember pods can all be placed, which is the all-or-nothing guarantee the default scheduler lacks. Code 33.4.4 shows a queue and a gang job expressed in Volcano's resources.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 4                 # this queue's fair-share weight across the cluster
  reclaimable: true         # idle capacity here may be reclaimed under pressure
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpt-pretrain
spec:
  queue: training
  minAvailable: 8           # GANG: do not start until all 8 pods can be placed
  schedulerName: volcano
  tasks:
    - replicas: 8           # 8 worker pods, each on a GPU node
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: registry.example/gpt-trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: 8     # 8 GPUs per pod -> 64 GPUs, all-or-nothing
Code 33.4.4: A Volcano Queue and a gang training Job. The minAvailable: 8 field creates the implicit PodGroup constraint: Volcano holds all 8 worker pods until it can place every one, eliminating the partial-allocation deadlock of the default Kubernetes scheduler (Section 4). The queue weight is the fair-share lever, the cloud-native counterpart of the Slurm formula in Section 2.

With minAvailable, Volcano gives Kubernetes the gang guarantee that Code 33.4.1 got for free from Slurm's --nodes=8 --exclusive allocation. Volcano also supports priority classes, preemption, and backfill within queues, so the full Section 2 and Section 3 machinery is available, just expressed as Kubernetes custom resources rather than #SBATCH directives. The practical consequence is that an organization already running Kubernetes for its inference services and data pipelines can add Volcano and schedule large training runs in the same cluster, without standing up a separate Slurm island.

6. HPC-Style versus Cloud-Native for AI Training Advanced

The choice between the Slurm lineage and the Kubernetes-plus-Volcano lineage is one of the larger architectural decisions an AI platform team makes, and it is not a matter of one being better. Table 33.4.1 lays out the trade-off along the axes that matter for training workloads. Slurm carries decades of tuning for exactly the gang-scheduled, topology-sensitive, long-running jobs that foundation-model training produces, and its backfill and fair-share are mature; its weakness is that it is a world apart from the container-and-microservice tooling that the rest of a modern stack uses. Volcano on Kubernetes inherits the entire cloud-native ecosystem (containers, autoscaling, service meshes, GitOps) and unifies training with serving on one substrate; its weakness is that gang and topology scheduling are comparatively young and that Kubernetes adds overhead that a bare-metal HPC scheduler does not.

Table 33.4.1: HPC-style (Slurm) versus cloud-native (Kubernetes + Volcano) batch scheduling for AI training. Neither dominates; the right choice depends on whether the cluster is training-only or shared with serving and data workloads.
Axis | Slurm (HPC lineage) | Kubernetes + Volcano (cloud-native)
Submission unit | sbatch script with #SBATCH | YAML Job / PodGroup via kubectl apply
Gang scheduling | Native; an allocation is atomic | Via Volcano minAvailable; absent in default scheduler
Backfill and fair-share | Mature, heavily tuned | Present in Volcano, younger
Topology awareness | Strong (see Section 33.5) | Emerging (topology plugins, NUMA)
Ecosystem fit | Separate from container tooling | Unified with serving, autoscaling, GitOps
Best when | Cluster is training-dominated HPC | Training shares hardware with services and pipelines

For the workloads of this book the convergence point is clear. A pure research supercomputer training one giant model is well served by Slurm. A company whose GPU fleet must train models, serve inference at scale (Chapter 24), and run data pipelines on the same hardware increasingly chooses Kubernetes with Volcano, accepting younger batch tooling in exchange for one operational substrate. What both must deliver, and what the next section makes precise, is the all-or-nothing, topology-aware placement that a distributed training step demands.

Thesis Thread: The Scheduler Is Where Distribution Becomes Physical

Every parallel method in Parts III and IV assumed a set of workers that exist, are reachable, and start together. This section is where that assumption is manufactured. The all-reduce of Chapter 15 requires all ranks present at once, which is exactly the gang guarantee Slurm and Volcano provide; the elastic and fault-tolerant training of Chapter 18 depends on the preemption and reservation mechanics introduced here. Scheduling is the layer that turns "distribute the work across many machines" from an algorithm on paper into specific GPUs granted at a specific instant.

7. Preemption, Reservations, and the Road to Gang Scheduling Advanced

Three remaining mechanisms complete the cluster-scheduling picture and each connects forward. Preemption lets a high-priority job evict a running lower-priority one to claim its resources; the evicted job is killed or, ideally, checkpointed and requeued. Preemption is what makes priority classes meaningful under contention, and it is the foundation of training on spot and preemptible instances, where the cloud provider itself preempts your job on short notice, a regime that Section 33.8 develops in full. Reservations let an administrator fence off a block of nodes for a future window, for a scheduled large run or a maintenance event, and the backfill of Section 3 plans around them just as it plans around the high-priority job's implicit reservation.

The third mechanism, gang scheduling proper, is the all-or-nothing placement that this section has repeatedly pointed at: the guarantee that a job's entire set of workers starts together or not at all, which both Slurm's atomic allocation and Volcano's minAvailable provide. Gang scheduling also explains a connection from Chapter 2: by starting all ranks together rather than trickling them in, the scheduler removes a whole class of stragglers that would otherwise arise from staggered starts, which is why a gang-scheduled training step has a cleaner synchronization profile than one whose workers drift into existence. The next section turns gang scheduling and its topology-aware cousin into the precise placement algorithms that a foundation-model run depends on.
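The preemption mechanism described above can be sketched as a victim-selection rule. This is one plausible policy under stated assumptions (evict the lowest-priority eligible jobs first, and evict nobody unless the incoming job's full request can be satisfied); real schedulers apply additional criteria, and the function name and job table here are hypothetical.

```python
def pick_victims(running, needed_gpus, free_gpus, incoming_priority):
    """Choose running jobs to evict so that `needed_gpus` become free.

    `running` maps job id -> (priority, gpus). Only jobs with strictly lower
    priority than the incoming job are eligible; the lowest priority goes first.
    Returns the list of victims, or None if preemption cannot free enough.
    """
    victims = []
    eligible = sorted(
        (prio, jid, gpus)
        for jid, (prio, gpus) in running.items()
        if prio < incoming_priority
    )
    for prio, jid, gpus in eligible:
        if free_gpus >= needed_gpus:
            break
        victims.append(jid)
        free_gpus += gpus
    # All-or-nothing: evicting jobs without satisfying the request wastes work.
    return victims if free_gpus >= needed_gpus else None


# Hypothetical cluster state: 16 GPUs, all busy.
running = {
    "sweep-1": (10, 4),
    "sweep-2": (10, 4),
    "nightly": (50, 4),
    "prod-ft": (95, 4),
}
print(pick_victims(running, needed_gpus=8, free_gpus=0, incoming_priority=90))
print(pick_victims(running, needed_gpus=16, free_gpus=0, incoming_priority=90))
```

The first call evicts the two sweep jobs to free 8 GPUs; the second returns None because the only way to reach 16 GPUs would be to evict a job that outranks the incoming one.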

Research Frontier: Scheduling Heterogeneous and Elastic AI Jobs (2024 to 2026)

Classical backfill and fair-share assume a job has a fixed, rigid resource request, but modern AI workloads strain that assumption. A vigorous research line schedules elastic jobs whose worker count can shrink or grow at runtime: systems in the lineage of Pollux co-adapt each job's resource allocation and its batch size to maximize cluster-wide goodput rather than raw utilization, and elastic-training integrations (Ray, TorchElastic) let the scheduler resize a run between checkpoints. A second thread targets heterogeneous fleets, where a cluster mixes GPU generations and the scheduler must decide which model trains on which silicon; Gavel and related work frame this as a throughput-aware allocation problem solved with optimization rather than fixed priority. A third addresses the interactive, bursty load of large-scale hyperparameter and AutoML search (Chapter 21), where many short trials must be packed and early-stopped, pushing schedulers toward trial-aware, preemption-friendly policies. The common direction is a scheduler that treats AI jobs as malleable and reward-bearing, not as opaque fixed-size boxes.

Fun Note: The Walltime Arms Race

Every cluster eventually discovers that users pad their walltime estimates to avoid being killed mid-run, which quietly defeats backfill by making every reservation pessimistic. Some sites fight back by giving jobs with tight, accurate walltimes a small priority bonus, turning honest estimation into a competitive advantage. The scheduler, it turns out, is also a small mechanism-design problem, a theme Chapter 28 takes seriously.

Exercise 33.4.1: Why Gangs Break the Default Scheduler Conceptual

A Kubernetes cluster has 100 free GPUs and the default scheduler. Two users each submit a synchronous training job needing 60 GPUs as 60 one-GPU pods. Describe, step by step, how the scheduler can place pods such that neither job ever runs, and identify the exact resource state in which the deadlock is reached. Then explain how Volcano's minAvailable (Code 33.4.4) prevents this state from ever being entered, and why a web-service Deployment of 120 replicas would not deadlock under the same scheduler.

Exercise 33.4.2: Extend the Backfill Simulator Coding

Starting from Code 33.4.2, add a sixth job that is small enough to fit the 8-GPU hole but whose walltime is longer than the time remaining until the high-priority job's reservation. Confirm that backfill correctly refuses to start it (it would delay the gang job), then shorten its walltime by one unit at a time and find the exact threshold at which it becomes backfill-eligible. Report makespan and utilization at the threshold, and explain in one sentence why padding this job's walltime would have cost the cluster utilization.

Exercise 33.4.3: Fair-Share Under a Burst Analysis

Using the fair-share factor $F = 2^{-u/(f s)}$ from Section 2, consider two accounts with equal target shares $s = 0.5$ and $f = 1$. Account A has recent usage $u_A = 0.1$ and account B has $u_B = 0.9$. Compute each account's $F$ and the ratio of their priorities (assuming fair-share dominates the priority sum). Now suppose B stops submitting and its usage decays toward zero while A's holds steady; sketch how the priority ranking flips and after roughly how much decay. Relate your answer to the Practical Example in Section 2: why does the buried sweep eventually get scheduled without any human intervention?