"I judged a thousand configurations after eight epochs each, kept the best dozen, and let the rest go. Eleven of them thanked me. One of them is still bitter about it."
A Scheduler That Learned to Stop Early
The cost of hyperparameter search is dominated by the configurations that were never going to win, so the single most powerful lever is to spend almost nothing finding that out. Multi-fidelity optimization evaluates every candidate cheaply first, at a low fidelity (a few epochs, a slice of the data, a smaller model), reads the noisy early signal, and pays the full training cost only for the promising minority. Because the cheap trials are independent, a cluster can run hundreds of them at once and then promote the survivors, which is exactly the shape that Section 21.4 turns into successive halving and Hyperband. This section establishes the idea, the assumption it rests on, the compute-saving arithmetic, and a runnable demo that finds a near-best configuration for a quarter of the full-search compute.
The previous two sections gave us the search strategies. Section 21.1 showed that hyperparameter search is embarrassingly parallel across configurations yet still wasteful, because most configurations are bad and we pay full price to learn so; Section 21.2 made the proposals smarter with random and Bayesian search. Both treat the evaluation of a single configuration as an atomic, full-cost operation: you pick a configuration, you train it to completion, you read its validation score. This section attacks that assumption directly. A full training run is not atomic at all. It is a sequence of partial results, a learning curve, and the early part of that curve already carries information about where the curve is heading. Multi-fidelity optimization is the discipline of buying that early information cheaply and acting on it before the bill for a full run ever arrives.
1. Fidelity Is a Knob You Already Have Beginner
A fidelity is any cheaper approximation of the full evaluation whose result still correlates with it. Deep learning hands you several such knobs for free, and they cost nothing to install because they are quantities you already set. The most common is the number of training iterations or epochs: a model trained for eight epochs is a rough preview of the same model trained for sixty-four, and it costs one eighth as much. A second knob is the dataset fraction: training on ten percent of the examples gives a noisier but far cheaper estimate of how a configuration will rank. A third is input resolution, lowering image side length or sequence length, which cuts the compute per step quadratically for vision models. A fourth is model size, evaluating a configuration on a scaled-down architecture before committing to the full one. Each knob trades signal quality for compute, and the art is choosing the cheapest setting that still preserves the ranking you care about.
These knobs are not interchangeable, and the right one depends on what is expensive about your workload. When training time dominates, epochs are the natural fidelity. When the dataset itself is the bottleneck, as in the data-loading-bound regimes of Chapter 8, the data fraction is the lever that actually moves the cost. The unifying picture is a budget, a single scalar $b$ measured in whatever unit you spend (epochs, examples seen, wall-clock seconds), where larger $b$ buys a cleaner, more expensive estimate of a configuration's true quality.
Classical search asks "which configuration should I evaluate next?" Multi-fidelity search asks the strictly richer question "which configuration should I evaluate next, and at what fidelity?" The second question contains the first, and answering it well is what makes large-scale tuning affordable. The total cost of a search is set far more by how much budget you waste on configurations that will lose than by how cleverly you propose new ones, so the highest-leverage decision is when to stop spending on a candidate, not which candidate to spend on.
2. The Assumption, and Its Risk Intermediate
Everything rests on one assumption: low-fidelity performance must correlate with full-fidelity performance. If the configurations that look best after eight epochs tend to be the ones that finish best after sixty-four, then a cheap screen preserves enough of the true ranking to be worth acting on. This assumption holds often, because learning curves rarely cross arbitrarily; a configuration with a sensible learning rate and a workable architecture usually pulls ahead early and stays ahead. That is the same early-stopping intuition introduced in Section 21.1, now promoted from a per-trial convenience to the organizing principle of the whole search.
The risk is the slow starter: a configuration whose low-fidelity score is mediocre but whose learning curve overtakes the others late, perhaps because a low learning rate warms up slowly or a regularizer only bites after many epochs. A naive low-fidelity screen will eliminate it before its curve turns, and no amount of compute at the full tier can recover a candidate you already discarded. This is the central failure mode of multi-fidelity methods, and it is why the correlation between fidelities is not a footnote but the quantity you should measure before trusting the scheme. A rank correlation near one means the cheap screen is nearly free information; a correlation near zero means the screen is throwing away winners at random. Real workloads sit in between, and the honest move is to estimate where.
Every practitioner who has run a multi-fidelity search has a story about the configuration that would have won if only it had been allowed to keep training. It is the hyperparameter equivalent of cutting a marathon field after the first kilometer: you will be right about the people walking, and you will occasionally eliminate the one who negative-splits. The defense is not to abandon early stopping; it is to give the cheap tier enough budget that the slow starters have begun to move, and to keep a small fraction of promotions as a hedge against the curve you cannot yet see.
3. The Compute-Saving Arithmetic Intermediate
The saving comes from a simple imbalance: many configurations are evaluated cheaply, and only a few are evaluated expensively. Suppose we screen $M$ configurations at a low budget $b_{\text{low}}$, promote the top fraction $\eta$ to a high budget $b_{\text{high}}$, and train only those survivors to completion. The total budget spent is
$$B_{\text{mf}} = \underbrace{M \, b_{\text{low}}}_{\text{screen all configs}} \;+\; \underbrace{\eta M \, b_{\text{high}}}_{\text{full-train the survivors}},$$against the cost of a full search that trains every configuration to completion,
$$B_{\text{full}} = M \, b_{\text{high}}.$$The fraction of compute we keep is therefore
$$\frac{B_{\text{mf}}}{B_{\text{full}}} = \frac{b_{\text{low}}}{b_{\text{high}}} + \eta,$$which is striking because $M$ cancels: the saving is set entirely by the ratio of the cheap budget to the full budget plus the promotion fraction. With a low tier one eighth the cost of the full tier ($b_{\text{low}}/b_{\text{high}} = 1/8$) and a promotion fraction $\eta = 1/8$, multi-fidelity spends $1/8 + 1/8 = 1/4$ of a full search. You evaluate four times as many configurations for the same compute, or the same number for a quarter of the compute. Stacking more rounds with smaller surviving fractions, as Section 21.4 does, pushes this ratio down further, while Chapter 3 supplies the scalability models that say when the cheap tier is itself worth parallelizing.
The cheap low-fidelity trial is precisely the work unit that a distributed scheduler loves: short, independent, and plentiful. Where Section 21.1 ran a handful of full-length trials in parallel, multi-fidelity runs hundreds of brief ones, fills the cluster more evenly, and turns over jobs fast enough that a straggler costs minutes rather than hours. The promotion step is a small synchronization point, a rank-and-keep that gathers low-fidelity scores and selects survivors, structurally a reduction over trials much like the all-reduce of Chapter 15 is a reduction over workers. The distributed trial-scheduling machinery of Section 21.6 and the elastic, preemptible execution of Chapter 18 are exactly what make a fleet of thousands of cheap trials practical.
4. A Multi-Fidelity Screen From Scratch Intermediate
The code below builds a controlled world where we know each configuration's true quality, so we can measure both how much compute multi-fidelity saves and how much accuracy it sacrifices. Each of $M = 256$ configurations has a hidden asymptotic accuracy and a learning rate that governs how fast its curve approaches that asymptote. Evaluating a configuration at budget $b$ returns a point on its learning curve plus measurement noise that shrinks with $b$, so a low-fidelity look is genuinely noisier than a full one. The scheme screens all configurations at low fidelity, promotes the top eighth, trains only those to full fidelity, and reports the winner against the ground-truth best.
import numpy as np
rng = np.random.default_rng(7)
M = 256 # candidate configurations
B_LOW = 8 # low-fidelity budget: epochs of training
B_HIGH = 64 # high-fidelity budget: full epochs
# Each config has a hidden "true" quality (final-epoch validation accuracy).
# A learning curve rises toward that asymptote at a config-specific rate.
true_quality = 0.55 + 0.40 * rng.beta(2.0, 2.0, size=M) # in [0.55, 0.95]
rate = rng.uniform(0.04, 0.25, size=M) # how fast it learns
def evaluate(cfg, budget):
# Accuracy at a given budget = asymptote scaled by an exponential learning
# curve, plus measurement noise that shrinks with more budget (cleaner signal).
curve = true_quality[cfg] * (1.0 - np.exp(-rate[cfg] * budget))
noise = rng.normal(0.0, 0.18 / np.sqrt(budget), size=np.size(cfg))
return curve + noise
cfgs = np.arange(M)
# --- Stage 1: cheap low-fidelity screen of ALL configs ---
low = evaluate(cfgs, B_LOW)
order = np.argsort(low)[::-1] # best low-fidelity first
keep_frac = 0.125 # promote the top 1/8
n_keep = int(M * keep_frac)
promoted = order[:n_keep]
# --- Stage 2: full high-fidelity evaluation of survivors only ---
high_promoted = evaluate(promoted, B_HIGH)
mf_winner = promoted[np.argmax(high_promoted)]
# --- Ground truth: what full evaluation of EVERY config would have found ---
high_all = evaluate(cfgs, B_HIGH)
best_cfg = int(np.argmax(high_all))
best_acc = high_all[best_cfg]
mf_acc = high_all[mf_winner]
# --- Compute spent, in units of one epoch ---
mf_cost = M * B_LOW + n_keep * B_HIGH # all configs cheap + survivors full
full_cost = M * B_HIGH # every config full
regret = best_acc - mf_acc # accuracy left on the table
# --- How well does the low-fidelity signal rank configs? ---
def spearman(a, b):
ra = np.argsort(np.argsort(a)); rb = np.argsort(np.argsort(b))
return np.corrcoef(ra, rb)[0, 1]
rho = spearman(low, high_all)
top_recall = np.mean(np.isin(np.argsort(high_all)[::-1][:n_keep], promoted))
print(f"configurations screened : {M}")
print(f"low / high fidelity budget : {B_LOW} / {B_HIGH} epochs")
print(f"promoted (top {keep_frac:.0%}) : {n_keep}")
print(f"rank corr (low vs high) : {rho:.3f} (Spearman)")
print(f"top-{n_keep} recall at low fidelity : {top_recall:.2f}")
print(f"multi-fidelity best acc : {mf_acc:.4f}")
print(f"full-search best acc : {best_acc:.4f}")
print(f"accuracy regret : {regret:.4f}")
print(f"compute spent (epoch-units) : {mf_cost} vs {full_cost} full")
print(f"compute saved : {100*(1-mf_cost/full_cost):.1f}%")
configurations screened : 256
low / high fidelity budget : 8 / 64 epochs
promoted (top 12%) : 32
rank corr (low vs high) : 0.485 (Spearman)
top-32 recall at low fidelity : 0.38
multi-fidelity best acc : 0.9441
full-search best acc : 0.9492
accuracy regret : 0.0050
compute spent (epoch-units) : 4096 vs 16384 full
compute saved : 75.0%
Two numbers in Output 21.3.1 carry the whole lesson. The Spearman correlation of 0.485 says the low-fidelity ranking is noisy, genuinely so; the screen is not a clean preview of the final order. The regret of 0.005 says it does not matter, because we did not bet on a single low-fidelity winner. We promoted thirty-two survivors and let the full tier sort them out, so the screen only had to keep the eventual winner somewhere in its top eighth, not at the very top. This is the practical defense against the slow-starter risk of Section 2: widen the promotion set when the correlation is weak, and the noise averages out. The arithmetic of Section 3 predicted the quarter-cost outcome exactly ($1/8 + 1/8 = 1/4$), and the simulation confirms it while showing what that quarter buys: a near-best configuration, not the guaranteed best.
Who: A computer-vision team tuning an image-classification backbone before a large pretraining run.
Situation: A sweep of 200 configurations (learning rate, weight decay, augmentation strength, warmup) each needed roughly nine hours to train to full schedule on the team's eight-GPU nodes.
Problem: Running all 200 to completion would have consumed the cluster for over a week, and most of those runs were predictably going to lose.
Dilemma: Cut the sweep to a few hand-picked configurations and risk missing the good region, or run the full sweep and miss the deadline and the budget.
Decision: They screened all 200 configurations at low fidelity, using ten percent of the data and a quarter of the schedule, then promoted the top fifteen percent to the full run.
How: The cheap screen finished overnight because the short, small-data jobs packed densely onto the cluster and turned over fast; the promotion step was a single rank-and-keep over the screen scores, and the survivors launched as full nine-hour runs the next morning.
Result: The configuration they shipped was indistinguishable in final accuracy from the best of an after-the-fact full sweep run for validation, at roughly thirty percent of the compute, and the project hit its deadline.
Lesson: When the data fraction is a faithful fidelity for your task, screening on it is nearly free signal; promote a wide enough top fraction that the screen's noise cannot eliminate the eventual winner.
5. From One Screen to Many Rounds Advanced
The two-tier screen in Code 21.3.1 is the simplest multi-fidelity method, and it already captures the essential economics. But it leaves budget on the table in a specific way: it commits a flat low budget to every configuration, including the ones that look hopeless after the first epoch. A natural improvement is to add more tiers, screening at very low fidelity, promoting to medium, promoting again to high, so the budget per configuration grows only as a configuration keeps surviving. That is the multi-round generalization of Figure 21.3.1, and it is the direct subject of the next section: successive halving runs a geometric ladder of fidelities, halving the survivor count and multiplying the budget at each rung, and Hyperband hedges over how aggressive that ladder should be. Everything here, the fidelity knobs, the correlation assumption, the cancelling-$M$ arithmetic, carries forward unchanged; the next section only adds structure to the schedule of when to promote.
The frontier is making the cheap signal less noisy and the promotion decision less greedy. Multi-fidelity Bayesian optimization methods in the BoTorch and Ax ecosystems model accuracy jointly as a function of the hyperparameters and the fidelity, so the optimizer reasons explicitly about whether the next dollar is better spent screening a new configuration or upgrading a promising one; cost-aware acquisition functions such as knowledge-gradient and entropy-search variants choose the fidelity, not just the point. A parallel line learns to extrapolate learning curves: recent neural and Gaussian-process curve models, including the prior-fitted-network curve predictors behind tools like the 2024-era ifBO work, forecast where a partial curve is heading and so explicitly hunt for the slow starters that a flat screen discards. Large meta-learned benchmarks (the HPOBench and PD1 / TaskSet curve collections refreshed across 2023 to 2025) let researchers measure inter-fidelity rank correlation across hundreds of real tasks, turning the assumption of Section 2 from an article of faith into a measured property. We meet the population-based methods that blur the line between training and tuning in Section 21.5.
6. When the Cheap Signal Lies Intermediate
Multi-fidelity optimization is not universally safe, and naming its failure modes is part of using it well. The correlation can be low for structural reasons, not just noise: when the hyperparameter being tuned is itself a schedule (a learning-rate decay that only acts late, a curriculum that reorders data over epochs), the low-fidelity prefix may not reflect the full-fidelity behavior at all, and a flat early screen can systematically prefer the wrong region. Reducing the data fraction can change which configurations regularize well, because a small-data regime rewards stronger regularization than the full dataset does, so the cheap ranking is biased rather than merely noisy. The discipline is to verify the fidelity, by occasionally running a handful of configurations at full budget and measuring the rank correlation against their cheap scores, exactly the $\rho$ that Code 21.3.1 computes, before trusting the screen to eliminate anyone. A correlation you have measured is an asset; one you have assumed is a liability.
The promote-and-prune loop of Code 21.3.1, score every trial, rank, keep the top fraction, escalate the survivors, is exactly what a multi-fidelity trial scheduler automates. In Ray Tune you describe the fidelity (here, training iterations) and hand the scheduling decision to an ASHAScheduler, the asynchronous successive-halving scheduler of Section 21.4, which stops weak trials early and promotes strong ones without any manual rank-and-keep:
# pip install "ray[tune]"
from ray import tune
from ray.tune.schedulers import ASHAScheduler
scheduler = ASHAScheduler(
metric="accuracy", mode="max",
max_t=64, # full fidelity: 64 epochs (B_HIGH above)
grace_period=8, # never stop a trial before 8 epochs (B_LOW above)
reduction_factor=8, # keep the top 1/8 at each rung (eta above)
)
tuner = tune.Tuner(
train_fn, # reports accuracy each epoch
param_space=search_space, # 256 configs sampled from here
tune_config=tune.TuneConfig(num_samples=256, scheduler=scheduler),
)
results = tuner.fit() # the cluster runs the funnel
print(results.get_best_result(metric="accuracy", mode="max").config)
ASHAScheduler; Ray Tune handles trial placement across the cluster, early-stop signaling, checkpoint-based promotion, and straggler-tolerant execution, the distributed-scheduling concerns of Section 21.6.We now have the key idea that makes large-scale tuning affordable: spend almost nothing to rank many configurations at low fidelity, then spend the real budget only on the promising few, accepting a small and measurable accuracy regret in exchange for a multiplicative cut in compute. The two-tier screen is the seed; the next section grows it into a principled multi-round schedule. That progression begins in Section 21.4 with successive halving and Hyperband.
For each of the following sweeps, name the fidelity knob (training epochs, data fraction, input resolution, or model size) you would screen on first, and state one way that knob could give a misleading cheap signal: (a) tuning the learning-rate warmup length and final decay for a transformer; (b) tuning augmentation strength and weight decay for an image classifier on a very large dataset; (c) tuning the number of attention heads and layer width for a model that will be trained at high sequence length. Explain in each case why the cheap ranking might not match the full-fidelity ranking, connecting your answer to the slow-starter and biased-screen risks of Sections 2 and 6.
Modify Code 21.3.1 to loop the promotion fraction $\eta$ over $\{1/32, 1/16, 1/8, 1/4, 1/2\}$ while holding the budgets fixed. For each $\eta$, record the compute spent and the accuracy regret against the global best, averaging over at least twenty random seeds to smooth the noise. Plot regret against compute. Identify the knee of the curve, the smallest promotion fraction whose regret is statistically indistinguishable from the largest, and explain how that knee shifts when you make the low-fidelity noise larger (raise the $0.18$ noise coefficient) or smaller. Relate your finding to the cancelling-$M$ formula $b_{\text{low}}/b_{\text{high}} + \eta$ of Section 3.
Using the cost formula $B_{\text{mf}}/B_{\text{full}} = b_{\text{low}}/b_{\text{high}} + \eta$, derive the condition on the fidelity ratio and promotion fraction under which multi-fidelity spends less than a full search, and the condition under which it spends less than half. Then argue, qualitatively, that the saving is not free: relate the inter-fidelity rank correlation $\rho$ to the probability that the true best configuration survives a single screen that keeps a fraction $\eta$, and explain why a very small $b_{\text{low}}$ (cheap but low $\rho$) and a very small $\eta$ (cheap but few survivors) are dangerous in combination. State the regime where a two-tier screen is the wrong tool and you should run the full search instead.