Part VII: Cluster, Edge, and Reliable Infrastructure
Chapter 33: Cluster Infrastructure and Scheduling

Spot and Preemptible Scheduling for Cost Optimization

"I was given two minutes to say goodbye. I spent the first ninety seconds flushing a checkpoint and the last thirty telling the scheduler where to find it. Resurrection, it turns out, is mostly bookkeeping."

A Spot Instance, Given Two Minutes to Say Goodbye
Big Picture

Cloud providers sell their idle capacity at deep discounts on one condition: they may take it back with only seconds of warning, so the cheapest compute in the world is also the least reliable, and the engineering problem is to make a workload that does not care. Long training runs and large hyperparameter sweeps are the most expensive line items in any AI budget and, because they are already elastic and fault tolerant, they are the perfect tenants for interruptible capacity. This section treats spot and preemptible instances as a scheduling and cost-optimization problem: what the market offers, what preemption costs, and how frequent checkpointing, elastic re-scaling, redundancy, and preemption-aware schedulers turn a sixty-to-ninety-percent discount into real savings rather than a stream of lost work. The fault-tolerance machinery of Chapter 18 is the mechanism; this section is the cost lever that machinery exists to pull.

Every public cloud runs far more hardware than its customers reserve at any instant, and that spare capacity earns nothing while it sits idle. To monetize it, providers sell it under a different contract: you pay a fraction of the on-demand price, often a quarter to a third of it, in exchange for surrendering the strongest guarantee a cloud normally gives you, that the machine stays yours until you release it. On this contract the provider may reclaim the instance whenever it needs the capacity back, giving you a short warning, typically thirty seconds to two minutes, before the machine disappears. Amazon calls these Spot Instances, Google calls them Spot or Preemptible VMs, and Azure calls them Spot Virtual Machines; the names differ but the bargain is identical. For a workload that can absorb the interruption, this is the single largest cost lever in distributed AI, and the rest of this section is about earning the discount without paying it back in wasted work.

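[Figure: timeline along wall-clock time. Useful compute on spot capacity is punctuated by periodic checkpoints (ckpt); a 2-minute warning triggers a final-flush checkpoint; the slice since the last checkpoint is lost recompute; after the preemption comes a gap to reacquire capacity, then a restore that resumes from the last checkpoint.]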
Figure 33.8.1: Anatomy of a spot run. The job checkpoints at a fixed interval $\tau$. A preemption arrives with a short warning, just long enough to flush one final checkpoint; the only work truly lost is the slice computed since the previous checkpoint (red). After a gap while the scheduler reacquires capacity, the job restores from the last saved state and continues. Every quantity in this picture, $\tau$, the warning window, the lost slice, and the gap, enters the cost model of Section 4.

1. The Interruptible Market and Why AI Workloads Fit It Beginner

The economics are stark enough to reorganize a budget around. Interruptible capacity routinely trades at sixty to ninety percent below the on-demand price for the identical hardware, which means a training run that costs a thousand dollars on demand can cost one to four hundred on spot, provided it survives the interruptions. Historically the price floated on a supply-and-demand market and could spike, but the major providers have moved to smoother, capacity-driven pricing, so the dominant risk today is not a price spike but a preemption: the provider reclaiming the machine because a full-price customer wants it. The discount is therefore best understood as payment for accepting a known interruption rate, and the engineering question is how to make that interruption rate cheap to absorb.

Not every workload qualifies. An interactive inference endpoint with a strict latency budget cannot tolerate a machine vanishing mid-request, so it stays on on-demand or reserved capacity. The workloads that fit are the ones that are long, batch-shaped, and already designed to recover from failure, and that description fits the two most expensive activities in this book almost exactly. A foundation-model pretraining run (Chapter 19) executes for days or weeks, checkpoints already for its own safety, and is built on the elastic, fault-tolerant training stack of Chapter 18. A large hyperparameter sweep (Chapter 21) is even more forgiving: it is a bag of independent trials, and losing one preempted trial costs at most that trial's progress while the rest continue untouched. These are precisely the line items where a sixty-percent discount moves the budget, which is why spot scheduling sits in the cluster chapter rather than as a footnote.

Key Insight: The Discount Is a Payment for Tolerable Interruption, Not Free Money

Spot capacity is not cheaper compute; it is the same compute sold under a weaker guarantee. You capture the discount only to the extent that your workload can lose its machine on short notice and resume cheaply. That makes spot a derivative of fault tolerance: the more robust the run already is to a node disappearing, the larger the fraction of the headline discount it actually keeps. A workload that spends half its wall-clock time recovering from preemptions runs twice as long for the same useful compute, so it pays $2 \times 0.30 = 0.60$ of the on-demand bill: a seventy-percent discount has become a forty-percent one. The whole craft is keeping the recovery tax small.
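The arithmetic behind the recovery tax is short enough to sketch. The snippet below assumes the bill is simply hourly price times wall-clock hours, the same assumption the cost model of Section 4 makes; `effective_discount` and `break_even_overhead` are illustrative names, not part of any library.

```python
def effective_discount(headline_discount, overhead):
    """Fraction of the on-demand bill actually saved.

    overhead is extra wall-clock time per unit of useful compute
    (0.25 means the run takes 25% longer than an uninterrupted one).
    Assumes bill = hourly price * wall-clock hours.
    """
    return 1.0 - (1.0 + overhead) * (1.0 - headline_discount)


def break_even_overhead(headline_discount):
    """Overhead at which the spot bill equals the on-demand bill."""
    return headline_discount / (1.0 - headline_discount)


if __name__ == "__main__":
    for overhead in (0.0, 0.10, 0.25, 0.50):
        kept = effective_discount(0.70, overhead)
        print(f"overhead {overhead:4.0%} -> effective discount {kept:5.1%}")
    print(f"break-even overhead at 70% off: {break_even_overhead(0.70):.2f}")
```

Under this assumption the discount erodes linearly in the overhead, and a seventy-percent headline discount is only wiped out once the run takes more than three times as long as it would uninterrupted.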

2. The Four Mechanisms That Make Spot Survivable Intermediate

A run survives on interruptible capacity because four mechanisms work together, and each one corresponds directly to a piece of machinery built earlier in the book. The first is frequent checkpointing: the job periodically writes its full state, model weights, optimizer moments, data-loader position, and step counter, to durable storage, so that a preemption costs only the work done since the last write. Checkpointing is the load-bearing mechanism, and its frequency is the one knob with a clean optimum, derived in Section 4. The second is the short warning itself: providers expose the impending preemption as a signal (an instance-metadata notice on AWS, a shutdown hook on Google Cloud), and a well-built training loop catches that signal and flushes one final checkpoint inside the warning window, turning a hard kill into a graceful save, exactly the final-flush step drawn in Figure 33.8.1.

The third mechanism is elastic re-scaling. When a worker is preempted, the job should not halt waiting for an identical replacement; it should continue on the survivors at reduced width and absorb new capacity when it returns. This is the elastic-training capability of Chapter 18, where the world size changes mid-run and the data-parallel group reforms around whoever is present. The fourth is redundancy and diversification: spreading the request across several instance types, several availability zones, and a blend of spot and on-demand so that a preemption wave in one pool does not take the whole job down at once. A common pattern keeps a small on-demand core (the coordinator and one or two stable workers) and fills the rest of the fleet with diversified spot, so the run can always make slow progress even when an entire spot pool is reclaimed. Table 33.8.1 ties each mechanism to the chapter that builds it and to the failure it neutralizes.

Table 33.8.1: The four mechanisms that make interruptible capacity survivable, the preemption symptom each one addresses, and where the underlying machinery is built.
| Mechanism | Preemption symptom it neutralizes | Built in |
| --- | --- | --- |
| Frequent checkpointing | Lost compute since the last saved state | Ch 18, this section §4 |
| Warning-triggered final flush | The in-flight slice at the moment of the kill | Ch 18, this section §3 |
| Elastic re-scaling | Job stalls waiting for an exact replacement node | Ch 18 |
| Redundancy and diversification | A correlated preemption wave across one pool | This section §5, Ch 18 |
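To make elastic re-scaling concrete, here is a minimal sketch of one possible policy: when the world size changes, keep the global batch fixed and redistribute it over the survivors, raising each worker's gradient-accumulation count as needed. This is an assumption for illustration, not necessarily the policy Chapter 18 builds; `rebalance` and its parameters are hypothetical names.

```python
import math


def rebalance(global_batch, micro_batch, world_size):
    """Split a fixed global batch across the workers currently present.

    Returns (shares, accum_steps): samples per worker per optimizer step,
    and how many micro-batches each worker must accumulate to cover its share.
    """
    if world_size < 1:
        raise ValueError("no surviving workers; the job must wait for capacity")
    base, extra = divmod(global_batch, world_size)
    # The first `extra` ranks take one additional sample so shares sum exactly.
    shares = [base + (1 if rank < extra else 0) for rank in range(world_size)]
    accum_steps = [math.ceil(share / micro_batch) for share in shares]
    return shares, accum_steps


if __name__ == "__main__":
    for workers in (8, 6, 3):  # full fleet, then two preemption waves
        shares, accum = rebalance(global_batch=512, micro_batch=16, world_size=workers)
        print(workers, shares, accum)
```

Holding the global batch constant keeps the optimization trajectory comparable across world sizes; the price is that each optimizer step takes longer on fewer workers, which is the "reduced width" progress described above.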

3. A Preemption-Aware Training Loop Intermediate

The signal-handling half of survival is small and worth seeing concretely. A preemption-aware loop checkpoints on a fixed cadence and also installs a handler that fires when the cloud delivers its termination notice, flushing one last checkpoint before the machine goes away. The structure below catches the operating-system signal that cloud agents raise on impending preemption; in production the same handler is wired to the provider's metadata poll, but the recovery logic is identical.

import signal

preempting = False

def on_preempt(signum, frame):
    global preempting
    preempting = True                  # set a flag; do NOT save inside the handler

signal.signal(signal.SIGTERM, on_preempt)   # cloud agents raise SIGTERM on warning

# train_one_step and save_checkpoint are assumed to be defined by the training stack.
def train(model, loader, start_step, total_steps, ckpt_every):
    for step in range(start_step, total_steps):
        loss = train_one_step(model, loader)         # one optimizer step
        done = step + 1                              # steps completed so far
        if preempting:                               # warning arrived during this step
            save_checkpoint(model, done)             # final flush inside the window
            print(f"preempted after step {step}; state is durable, exiting")
            return done                              # next launch starts at this step
        if done % ckpt_every == 0:                   # periodic checkpoint (interval tau)
            save_checkpoint(model, done)
    return total_steps
Code 33.8.1: A preemption-aware training loop. The signal handler only flips a flag; the actual save happens in the main loop where the model state is consistent, avoiding a half-written checkpoint. On the next launch the scheduler passes the returned step as start_step, so the run resumes from the last durable state rather than from zero.
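For the metadata-poll variant mentioned above, the following is a minimal sketch for AWS only. It relies on the EC2 instance metadata service (IMDSv2 session token, then the `spot/instance-action` path, which returns 404 until an interruption is scheduled); the function names, the polling period, and the use of a `threading.Event` in place of the global flag are illustrative assumptions. Other providers expose different endpoints that are not shown here.

```python
import json
import threading
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"   # link-local EC2 metadata address


def fetch_instance_action(timeout=1.0):
    """Return (status, body) from the EC2 spot interruption endpoint via IMDSv2."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, ""              # 404: no interruption is scheduled


def interruption_pending(fetch=fetch_instance_action):
    """True once the metadata service announces a stop, hibernate, or terminate."""
    status, body = fetch()
    if status != 200:
        return False
    return json.loads(body).get("action") in {"terminate", "stop", "hibernate"}


def watch(flag, fetch=fetch_instance_action, period_s=5.0):
    """Poll until an interruption is announced, then set flag (a threading.Event)."""
    while not flag.is_set():
        try:
            if interruption_pending(fetch):
                flag.set()
                return
        except (OSError, ValueError):
            pass                         # metadata service unreachable or bad JSON
        flag.wait(period_s)              # sleep, but wake early if the flag is set
```

A training loop would start `watch` in a daemon thread and test `flag.is_set()` where Code 33.8.1 tests `preempting`. Passing `fetch` as a parameter keeps the polling logic testable off-cloud with a stub.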
Library Shortcut: One Flag Tells the Cluster Manager to Use Spot

You rarely hand-roll the spot request or the resume bookkeeping. A managed launcher such as SkyPilot provisions interruptible capacity, watches for preemptions, and relaunches the job pointed at the last checkpoint, all from a short spec. The thirty lines of metadata polling, capacity reacquisition, and resume wiring collapse to a single field plus a recovery policy:

# SkyPilot task spec: run on spot, auto-recover on preemption.
resources:
  accelerators: A100:8
  use_spot: true            # request interruptible capacity (the discount)
  spot_recovery: FAILOVER   # on preemption, find capacity elsewhere and relaunch

run: |
  python train.py --resume-from /checkpoints/latest   # idempotent resume
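Code 33.8.2: The same survival behavior as Code 33.8.1 expressed declaratively. use_spot: true captures the discount; spot_recovery: FAILOVER hands the launcher the job of detecting preemption, finding fresh capacity (possibly in another zone or instance type), and restarting from --resume-from. Field names have changed across SkyPilot releases (recent ones document the recovery field as job_recovery), so check the spec for the version you run. When requesting capacity directly, the AWS and GCP equivalents are aws ec2 run-instances --instance-market-options MarketType=spot and gcloud compute instances create --provisioning-model=SPOT respectively; these only request spot capacity and leave detection and recovery to you.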
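The phrase "idempotent resume" hides two small pieces of bookkeeping: writing each checkpoint so a kill mid-write cannot corrupt it, and finding the newest complete one on relaunch. The sketch below shows one way to do both on a POSIX filesystem. The `step_<N>.ckpt` naming scheme, the function names, and the use of `pickle` as a stand-in for a framework serializer are all assumptions made for illustration.

```python
import os
import pickle
import re
import tempfile

_CKPT_NAME = re.compile(r"^step_(\d+)\.ckpt$")   # hypothetical naming scheme


def save_checkpoint_atomic(state, directory):
    """Write state as step_<N>.ckpt; readers never observe a partial file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"step_{state['step']}.ckpt")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())           # force bytes to storage before the rename
        os.replace(tmp_path, final_path)   # atomic within one POSIX filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def latest_checkpoint(directory):
    """Return the state with the highest step number, or None if there is none."""
    best_step, best_name = -1, None
    if os.path.isdir(directory):
        for name in os.listdir(directory):
            match = _CKPT_NAME.match(name)      # skips leftover *.tmp files
            if match and int(match.group(1)) > best_step:
                best_step, best_name = int(match.group(1)), name
    if best_name is None:
        return None
    with open(os.path.join(directory, best_name), "rb") as f:
        return pickle.load(f)
```

Comparing step numbers as integers matters: sorted as strings, `step_20.ckpt` would rank after `step_100.ckpt`. Object stores do not offer the same rename semantics as a local filesystem, so a production version would need the equivalent guarantee from its storage layer.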

4. The Cost and Wasted-Work Model Advanced

Checkpointing fights a two-sided tax, and the optimum balances the two sides. Checkpoint too rarely and a preemption discards a large slice of recomputable work; checkpoint too often and you spend the run writing state you never need to restore. Model the run with three quantities: a preemption rate $\lambda$ (preemptions per hour), whose reciprocal is the mean time between preemptions, $\text{MTBF} = 1/\lambda$; a checkpoint cost $C$ (the wall-clock time to write one); and a checkpoint interval $\tau$. If a preemption lands at a uniformly random moment within an interval, the expected work lost is half the interval, because on average the kill arrives halfway between checkpoints:

$$\mathbb{E}[\text{wasted work per preemption}] \approx \frac{\tau}{2}.$$
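The uniform-arrival approximation can be sanity-checked numerically. The sketch below assumes preemptions arrive as a Poisson process (exponential time to preemption with mean MTBF), that checkpoints land at exact multiples of $\tau$, and that writing them takes no time; `mean_lost_work` is an illustrative name.

```python
import random


def mean_lost_work(tau_h, mtbf_h, n=100_000, seed=0):
    """Monte Carlo estimate of the work lost per preemption, in hours."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = rng.expovariate(1.0 / mtbf_h)   # hours from (re)start to the preemption
        total += t % tau_h                  # hours since the most recent checkpoint
    return total / n


if __name__ == "__main__":
    for tau in (0.5, 1.0, 4.0):
        est = mean_lost_work(tau, mtbf_h=6.0)
        print(f"tau = {tau:.1f} h: simulated loss {est:.3f} h, tau/2 = {tau / 2:.3f} h")
```

Under these assumptions the estimate sits just below $\tau/2$ when $\tau \ll \text{MTBF}$ and falls further below it as $\tau$ approaches the MTBF, because an exponential arrival is more likely early in an interval than late. The $\tau/2$ rule is therefore a slightly pessimistic approximation that is tightest in the regime where checkpoints are frequent relative to preemptions.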

Spread over the run, three overheads accumulate per unit of useful compute: the checkpoint-writing fraction $C/\tau$, the expected lost-recompute fraction $\tau/(2\,\text{MTBF})$, and a restore fraction $R/\text{MTBF}$ for the time $R$ to detect the loss, reacquire capacity, and reload. The total overhead fraction is

$$f(\tau) = \frac{C}{\tau} + \frac{\tau}{2\,\text{MTBF}} + \frac{R}{\text{MTBF}}.$$

Only the first two terms depend on $\tau$, and they trade off cleanly: one falls as $\tau$ grows, the other rises. Setting $\mathrm{d}f/\mathrm{d}\tau = 0$ gives $-C/\tau^2 + 1/(2\,\text{MTBF}) = 0$, whose solution is the Young/Daly optimal checkpoint interval, the same square-root law that governs checkpointing in classical high-performance computing:

$$\tau^{\star} = \sqrt{2\,C\,\text{MTBF}}.$$
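As a worked step under the same assumptions, substituting $\tau^{\star}$ back into $f$ shows how large the unavoidable tax is. At the optimum the checkpoint-writing and lost-recompute terms are equal, each $\sqrt{C/(2\,\text{MTBF})}$, so

$$f(\tau^{\star}) = \frac{C}{\sqrt{2\,C\,\text{MTBF}}} + \frac{\sqrt{2\,C\,\text{MTBF}}}{2\,\text{MTBF}} + \frac{R}{\text{MTBF}} = \sqrt{\frac{2\,C}{\text{MTBF}}} + \frac{R}{\text{MTBF}}.$$

The second derivative $\mathrm{d}^2 f/\mathrm{d}\tau^2 = 2C/\tau^3$ is positive for every $\tau > 0$, which confirms that the stationary point is a minimum. For illustration, a 4-minute checkpoint, an 8-minute restore, and a 6-hour (360-minute) MTBF give $\sqrt{8/360} + 8/360 \approx 0.149 + 0.022 = 0.171$, an overhead of about 17 percent.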

The interpretation is worth holding onto: the best interval is proportional to the geometric mean of how expensive a checkpoint is and how long the machine typically survives. Cheap checkpoints or frequent preemptions pull $\tau^{\star}$ down; expensive checkpoints or rare preemptions push it up. Finally, the dollar cost is just price times wall-clock. If the run needs $T_{\text{useful}}$ hours of pure compute and spot sells at a fraction $(1-d)$ of the on-demand price $p$ per hour, the expected spot bill is

$$\text{cost}_{\text{spot}} = \underbrace{T_{\text{useful}}\,(1 + f(\tau^{\star}))}_{\text{wall-clock hours}} \; \cdot \; p \,(1 - d),$$

to be compared against the on-demand bill $T_{\text{useful}}\,p$ when the run is never preempted. The discount $d$ is large; the overhead $f(\tau^{\star})$, at a sensible interval, is small. The demo below evaluates this model end to end, finding $\tau^{\star}$ and the resulting saving for a representative hundred-hour run.

import math

mtbf_hours, C_min, restore_min = 6.0, 4.0, 8.0   # preempt every ~6h; 4-min ckpt; 8-min restore
useful_hours = 100.0                              # pure compute if never interrupted
on_demand_per_hr, spot_discount = 12.00, 0.70     # $/hour and the 70% spot discount

C_hours, restore_h = C_min / 60.0, restore_min / 60.0
tau_opt_h = math.sqrt(2.0 * C_hours * mtbf_hours)             # Young/Daly: sqrt(2*C*MTBF)

def overhead_fraction(tau_h):                                 # f(tau)
    return C_hours / tau_h + tau_h / (2.0 * mtbf_hours) + restore_h / mtbf_hours

def wall_hours(tau_h):
    return useful_hours * (1.0 + overhead_fraction(tau_h))

print("tau (min) | overhead% | wall (h) | spot cost ($)")
for tau in sorted({0.25, 0.5, tau_opt_h, 1.0, 2.0, 4.0}):
    cost = wall_hours(tau) * on_demand_per_hr * (1.0 - spot_discount)
    tag = "  <- Young/Daly" if abs(tau - tau_opt_h) < 1e-9 else ""
    print(f"{tau*60:8.1f}  | {overhead_fraction(tau)*100:8.2f}  | {wall_hours(tau):7.2f}  | {cost:11.2f}{tag}")

on_demand_cost = useful_hours * on_demand_per_hr
spot_cost = wall_hours(tau_opt_h) * on_demand_per_hr * (1.0 - spot_discount)
print(f"\noptimal tau*  : {tau_opt_h*60:.1f} min, overhead {overhead_fraction(tau_opt_h)*100:.2f}%")
print(f"on-demand     : ${on_demand_cost:,.2f}   spot at tau*: ${spot_cost:,.2f}")
print(f"net saving    : ${on_demand_cost - spot_cost:,.2f} "
      f"({100*(on_demand_cost - spot_cost)/on_demand_cost:.1f}%)")
Code 33.8.3: The full spot economics model. It sweeps the checkpoint interval, marks the Young/Daly optimum $\tau^{\star} = \sqrt{2C\cdot\text{MTBF}}$, and prints the wall-clock overhead and spot bill at each, then compares the optimal spot cost to the never-preempted on-demand cost.
tau (min) | overhead% | wall (h) | spot cost ($)
    15.0  |    30.97  |  130.97  |      471.50
    30.0  |    19.72  |  119.72  |      431.00
    53.7  |    17.13  |  117.13  |      421.67  <- Young/Daly
    60.0  |    17.22  |  117.22  |      422.00
   120.0  |    22.22  |  122.22  |      440.00
   240.0  |    37.22  |  137.22  |      494.00

optimal tau*  : 53.7 min, overhead 17.13%
on-demand     : $1,200.00   spot at tau*: $421.67
net saving    : $778.33 (64.9%)
Output 33.8.3: The optimum sits at $\tau^{\star} \approx 53.7$ minutes, at the flat bottom of the overhead curve; checkpointing about three and a half times more often (every 15 minutes) nearly doubles the overhead to 31 percent and adds about fifty dollars to the bill, all of it spent writing state. Even after carrying a 17 percent preemption tax, the run costs $\$421.67$ against $\$1{,}200$ on-demand, a 64.9 percent saving that tracks the headline discount closely.
Key Insight: The Optimum Is Flat, So Aim Slightly Long

Notice in Output 33.8.3 that the cost at $\tau = 60$ minutes is within a dollar of the cost at the exact optimum of 53.7 minutes, while $\tau = 15$ minutes is far worse. The Young/Daly curve is flat near its minimum and, measured in minutes, steeper on the too-frequent side: the write cost $C/\tau$ grows hyperbolically as $\tau$ shrinks, whereas the expected lost work $\tau/(2\,\text{MTBF})$ grows only linearly as $\tau$ lengthens. Over-checkpointing also pays its cost with certainty on every write, whereas under-checkpointing only risks losing work that a preemption may never claim. The practical reading: estimate $\tau^{\star}$, then round up to a convenient interval rather than down. Paying a little for slightly stale checkpoints is cheaper than paying certainly for too-frequent ones.
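The round-up advice can be checked directly against the overhead model. This sketch reuses the parameters of Code 33.8.3 (4-minute checkpoint, 8-minute restore, 6-hour MTBF), expressed in minutes, and compares the penalty for missing $\tau^{\star}$ by the same number of minutes in each direction; `miss_penalty` is an illustrative helper, not part of the original demo.

```python
import math

C_MIN, MTBF_MIN, RESTORE_MIN = 4.0, 360.0, 8.0   # same parameters as Code 33.8.3


def overhead(tau_min):
    """f(tau) with every quantity expressed in minutes."""
    return C_MIN / tau_min + tau_min / (2.0 * MTBF_MIN) + RESTORE_MIN / MTBF_MIN


TAU_STAR = math.sqrt(2.0 * C_MIN * MTBF_MIN)     # Young/Daly optimum, about 53.7 min


def miss_penalty(delta_min):
    """Extra overhead from missing tau* by delta_min on the short and the long side."""
    base = overhead(TAU_STAR)
    return (overhead(TAU_STAR - delta_min) - base,
            overhead(TAU_STAR + delta_min) - base)


if __name__ == "__main__":
    for delta in (10.0, 20.0, 30.0):
        short, long_ = miss_penalty(delta)
        print(f"miss by {delta:2.0f} min: too short +{short:.2%}, too long +{long_:.2%}")
```

For these parameters, undershooting by a given number of minutes costs more than overshooting by the same amount, and the gap widens as the miss grows. The comparison is in absolute minutes; scaling $\tau^{\star}$ up or down by the same factor gives equal penalties under this model.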

5. Diversification, Redundancy, and the Reliability Frontier Advanced

The cost model in Section 4 assumes a single preemption rate, but in practice $\lambda$ is something you shape through how you request capacity. Concentrating an eight-node job on one instance type in one zone maximizes the discount and the correlation of failures: when that pool is reclaimed, all eight nodes go at once, and the run halts rather than degrading. Spreading the same request across several instance families and zones raises the effective MTBF, because an independent preemption now removes one node instead of all of them, and the elastic loop of Chapter 18 simply continues on the survivors. Mixing in a small on-demand base, the coordinator and a worker or two, guarantees the job can always make forward progress even under a total spot wipeout. Each of these choices trades a sliver of discount for a higher MTBF, and Figure 33.8.2 sketches the frontier they trace.

[Figure: cost per hour plotted against reliability (effective MTBF). Three marked points: all spot, one type (cheapest, fragile); diversified spot + on-demand core (sweet spot); all on-demand (priciest, robust).]
Figure 33.8.2: The cost-reliability frontier for capacity sourcing. Pure single-type spot is the cheapest and most fragile point (lower left); pure on-demand is the most reliable and most expensive (upper right). Diversifying spot across instance types and zones and backing it with a small on-demand core moves the run to the knee of the curve (green), where a large fraction of the discount survives but a single preemption wave no longer halts the job. Spot scheduling is the act of choosing a point on this curve.
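The correlation argument can be illustrated with a toy simulation. It assumes each pool is reclaimed independently with a fixed probability in any given hour and that a reclaimed pool takes all of its nodes with it; real pools are neither fully independent nor memoryless, so treat the numbers as qualitative. `simulate` and its parameters are hypothetical.

```python
import random


def simulate(num_pools, nodes=8, reclaim_prob=0.05, hours=100_000, seed=0):
    """Spread `nodes` evenly over `num_pools` independent spot pools.

    Returns (fraction of hours with zero surviving nodes,
             mean fraction of the fleet that survives an hour).
    Assumes nodes is divisible by num_pools.
    """
    rng = random.Random(seed)
    per_pool = nodes // num_pools
    halted_hours = 0
    surviving_nodes = 0
    for _ in range(hours):
        # Each pool independently survives the hour with probability 1 - reclaim_prob.
        alive = sum(per_pool for _ in range(num_pools) if rng.random() >= reclaim_prob)
        halted_hours += (alive == 0)
        surviving_nodes += alive
    return halted_hours / hours, surviving_nodes / (hours * nodes)


if __name__ == "__main__":
    for pools in (1, 2, 4):
        halt, surviving = simulate(pools)
        print(f"{pools} pool(s): job halted {halt:.4%} of hours, "
              f"mean surviving fleet {surviving:.1%}")
```

In this model diversification leaves the average surviving capacity essentially unchanged; what it removes is the hours in which nothing survives. That is the sense in which spreading the request raises the job-level MTBF, provided the elastic loop can keep running on the survivors.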
Practical Example: Halving the Pretraining Bill Without Losing a Day

Who: An ML platform engineer running a multi-week language-model pretraining job for a startup.

Situation: The run kept eight A100s busy around the clock, roughly 190 GPU-hours per day, and the on-demand bill was the largest line in the monthly cloud invoice.

Problem: Finance wanted the cost halved, but the team had been burned once by a naive spot attempt that lost twelve hours of work to a preemption with no checkpoint in place.

Dilemma: Stay on on-demand and keep the predictable but expensive bill, or move to spot and risk repeating the lost-work incident if the recovery story was not airtight.

Decision: They moved to diversified spot with a one-node on-demand core, after wiring the preemption-aware loop of Code 33.8.1 and setting the checkpoint interval near the Young/Daly optimum for their measured MTBF.

How: They launched through a managed launcher (Code 33.8.2) requesting four A100 instance families across three zones, set spot_recovery to fail over on preemption, and measured a real MTBF of about five hours, which put $\tau^{\star}$ near forty-five minutes.

Result: With an MTBF near five hours, preemptions arrived several times a day, but each cost minutes, not hours; the effective overhead held under twenty percent, and the bill fell about sixty percent with no change to final model quality, matching the model in Output 33.8.3.

Lesson: The lost-work disaster was a missing-checkpoint failure, not a spot failure. With the four mechanisms of Section 2 in place, the discount is real and the risk is bounded.

Thesis Thread: Fault Tolerance Is the Enabler, Cost Is the Payoff

Spot scheduling is the clearest place in the book where scale-out machinery pays for itself in dollars. The elastic, fault-tolerant training loop of Chapter 18 was built to survive the random node failures that are inevitable across a thousand machines; here that exact capability is repurposed to survive deliberate preemptions and, in doing so, unlocks a sixty-to-ninety-percent discount. The same checkpoint-and-resume discipline that keeps a long run correct also makes the cheapest capacity in the cloud usable. Reliability engineering and cost engineering turn out to be the same engineering, viewed from two sides, a connection Chapter 26 makes the organizing principle of MLOps.

6. Where Spot Fits in the MLOps Cost Picture Intermediate

Spot scheduling is one lever among several in the cost discipline that Chapter 26 formalizes for production AI. Compute cost in an ML organization breaks into roughly three pools: training and experimentation, which is bursty, batch-shaped, and the natural home of spot; serving, which is latency-bound and mostly stays on reserved or on-demand capacity; and data processing, which sits in between and often tolerates spot for non-urgent batch jobs. The HPO sweeps of Chapter 21 are the ideal spot tenant because their independent trials make the effective cost of a preemption nearly zero, while a synchronous data-parallel run pays the recompute tax of Section 4 and so demands the checkpointing discipline. A mature cost policy routes each workload to the cheapest capacity tier it can tolerate, and Chapter 26 turns that routing into a measured, monitored part of the deployment pipeline rather than a per-job decision.
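The routing rule in the last sentence can be written down as a small policy function. This is a deliberately simplified sketch: the `Workload` fields, the tier labels, and the order of the checks are assumptions for illustration, and a real policy would also weigh deadlines, data locality, and measured preemption rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    latency_bound: bool              # must answer requests within a strict budget
    checkpointable: bool             # can save state and resume after a kill
    independent_tasks: bool = False  # a bag of trials that fail in isolation


def capacity_tier(w):
    """Route a workload to the cheapest tier it can tolerate."""
    if w.latency_bound:
        return "on-demand/reserved"      # a vanished machine breaks requests
    if w.independent_tasks:
        return "spot"                    # a preemption costs one task's progress
    if w.checkpointable:
        return "spot + on-demand core"   # pays the recompute tax of Section 4
    return "on-demand/reserved"          # no recovery story, no spot


if __name__ == "__main__":
    for w in (
        Workload("inference endpoint", latency_bound=True, checkpointable=False),
        Workload("HPO sweep", latency_bound=False, checkpointable=True,
                 independent_tasks=True),
        Workload("data-parallel pretraining", latency_bound=False, checkpointable=True),
    ):
        print(f"{w.name:28s} -> {capacity_tier(w)}")
```

Encoding the rule as code rather than as a per-job decision is what makes it something a deployment pipeline can measure and enforce.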

Research Frontier: Preemption-Aware Schedulers and Spot-Native Training (2024 to 2026)

The frontier is moving from surviving preemptions to scheduling around them. Cluster schedulers are learning to forecast preemption risk per instance pool and place the most loss-sensitive shards on the most stable capacity, while letting redundant or recomputable work ride the cheapest pools; systems in the lineage of Bamboo and Oobleck pursue resilient pipeline-parallel training that tolerates node loss with near-zero recompute by building redundancy into the pipeline stages. A parallel thread on geo-distributed and over-the-internet training (the DiLoCo-style local-update methods noted in Chapter 3's communication-avoiding line) makes spot even more attractive, because runs that already tolerate loose synchronization tolerate preemption almost for free. Managed launchers such as SkyPilot have turned cross-cloud spot arbitrage, chasing the cheapest available interruptible capacity across providers in real time, into a one-line policy, and the open question is how far an automated scheduler can push the cost-reliability frontier of Figure 33.8.2 before the recovery overhead eats the discount.

Fun Note: The Two-Minute Goodbye

The thirty-second-to-two-minute warning is the most consequential short window in cloud computing. Engineers measure their checkpoint-flush time against it the way sprinters measure a start, because a save that does not finish inside the window is a save that did not happen. The healthiest spot loops treat the warning as a fire drill they have rehearsed: flag set, weights flushed, exit clean, all with seconds to spare. The instance gets its dignified goodbye, and the run barely notices it left.

Exercise 33.8.1: When Does Spot Stop Paying? Conceptual

Using the overhead model $f(\tau) = C/\tau + \tau/(2\,\text{MTBF}) + R/\text{MTBF}$, argue qualitatively what happens to the optimal interval $\tau^{\star}$ and to the total saving as the preemption rate $\lambda = 1/\text{MTBF}$ rises. At what point does a workload stop being a good spot candidate? Identify two workload properties (beyond the raw preemption rate) that determine whether the recompute tax stays small, and explain why a large HPO sweep keeps that tax near zero while a tightly synchronized data-parallel run does not.

Exercise 33.8.2: Re-derive and Extend the Optimum Coding

Reproduce Code 33.8.3, then extend it two ways. First, add a finite warning window $w$ during which the final flush either succeeds (if $C \le w$) or fails (losing the whole interval if $C > w$), and re-plot the cost as you scale $C$ past $w$. Second, replace the single MTBF with a diversification factor $g \ge 1$ that multiplies the effective MTBF (modeling spread across $g$ independent pools) at the price of a smaller discount $d(g) = d_0 / \sqrt{g}$. Find the $g$ that minimizes total cost for the parameters in the demo, and relate your answer to the knee of Figure 33.8.2.

Exercise 33.8.3: Budget a Real Pretraining Run Analysis

A pretraining job needs 500 cluster-hours of pure compute on a cluster that costs $\$20$ per hour on-demand and $\$5$ per hour on spot. A checkpoint takes 6 minutes to write, restore takes 10 minutes, and the measured MTBF is 4 hours. Compute $\tau^{\star}$, the overhead fraction at $\tau^{\star}$, the expected wall-clock hours, and the expected spot bill, and compare to the on-demand bill. Then state how many independent spot pools you would need to diversify across to push the MTBF to 12 hours, and recompute the saving. Show that the dominant term in the bill is the discount, not the overhead, and explain why that justifies aiming the checkpoint interval slightly long, as Section 4 advises.