"I gave a thousand configurations one epoch each, kept the hundred that did not embarrass themselves, and only then spent real compute. The other nine hundred never knew how close they came."
A Scheduler That Plays Favorites Early
Successive halving turns the multi-fidelity idea of the previous section into a concrete schedule: launch many configurations on a tiny budget, throw out the worst fraction, multiply the budget of the survivors, and repeat, so compute flows geometrically toward the configurations that keep winning. Hyperband wraps several such schedules with different aggressiveness because the right cut rate is itself unknown. The version that scales across a cluster is ASHA, which promotes a configuration the instant enough of its peers have finished rather than waiting for an entire rung to complete; that one change deletes the synchronization barrier that would otherwise leave most of the cluster idle behind a straggler. This section builds the schedule from scratch, shows the compute concentrating on winners, and demonstrates why asynchronous promotion beats the synchronous barrier on real hardware.
Section 21.3 made the case for multi-fidelity optimization: most hyperparameter configurations are bad, and you can usually tell a bad one from a good one with a cheap, low-budget estimate (a few training epochs, a subset of the data, a smaller image resolution) long before paying for a full run. That observation is a license to stop early, but a license is not a policy. How many configurations should start? At what budget? When exactly do you cut, and how many survive each cut? Successive halving and Hyperband are the algorithms that answer those questions with a fixed, analyzable schedule, and ASHA is the answer to the question this book always asks next: how does that schedule behave when it runs on a thousand machines that fail, straggle, and finish out of order?
1. Successive Halving: A Schedule, Not a Heuristic Intermediate
Successive halving (SHA) takes three inputs: a number of starting configurations $n$, a minimum budget $r$ (the cheapest fidelity, say one epoch), and a halving factor $\eta$ that controls how aggressively it cuts. Despite the name, $\eta$ need not be two; values of three or four are common, and "successive thirding" would be the honest title. The schedule proceeds in rungs. At rung zero, all $n$ configurations train to budget $r$ and are scored. The top $1/\eta$ are promoted; the rest are stopped. At rung one, the survivors train to budget $\eta r$, are rescored, and again only the top $1/\eta$ survive. The pattern repeats until one configuration (or a handful) remains at the maximum budget.
The arithmetic of the schedule is what makes it beautiful. At rung $k$, the number of configurations is $n_k = \lfloor n / \eta^{k} \rfloor$ and the per-configuration budget is $r_k = r\,\eta^{k}$. The total resource spent at rung $k$, in units of the cheapest run, is their product:
$$n_k \cdot r_k = \frac{n}{\eta^{k}} \cdot r\,\eta^{k} = n\,r,$$which does not depend on $k$. Every rung costs the same. This is the design choice at the heart of the algorithm: the geometric thinning of the population exactly cancels the geometric growth of the per-trial budget, so a search that explores $n$ configurations and runs the best one to a budget $\eta^{K}$ times the minimum costs only about $(K+1)\,n\,r$ in total, where $K+1 = \lfloor \log_{\eta} n \rfloor + 1$ is the number of rungs. Compared with the naive policy of running all $n$ configurations to the full budget, which costs $n \cdot r\,\eta^{K}$, successive halving is cheaper by roughly a factor of $\eta^{K} / (K+1)$, an enormous saving that grows with the budget range.
Successive halving is engineered so that every rung spends the same total compute, $n\,r$. The population shrinks by $\eta$ exactly as fast as the budget grows by $\eta$, and the two cancel. That balance is why the algorithm can afford to evaluate a huge initial population: most of those configurations die after one cheap rung, and the expensive budgets are reserved for the few that keep proving themselves. You are not gambling more compute on promising configurations; you are spending the same compute per rung, just on fewer and fewer survivors.
The code below implements the full schedule with $\eta = 3$ and $n = 27$, so the population walks $27 \to 9 \to 3 \to 1$ over four rungs. Each configuration carries a hidden "true quality"; the score observed at budget $r$ is that quality plus noise that shrinks like $1/\sqrt{r}$, which is the multi-fidelity assumption from Section 21.3 made literal: cheap estimates are noisy, expensive ones are trustworthy. The program prints the per-rung bookkeeping and the resulting compute saving.
import math, random
random.seed(7)
eta = 3 # halving factor: keep top 1/eta each rung
n0 = 27 # configs at rung 0
r0 = 1 # budget (epochs) at rung 0
# Each config has a hidden "true quality"; the score observed at budget r is the
# true quality plus noise that shrinks as the budget grows (cheap estimates are
# noisy, expensive ones reliable). Lower score is better.
configs = [{"id": i, "true_q": random.uniform(0.0, 1.0)} for i in range(n0)]
def observe(cfg, budget):
return cfg["true_q"] + random.gauss(0, 0.25 / math.sqrt(budget))
n_rungs = 1 + int(math.floor(math.log(n0) / math.log(eta)))
print(f"eta={eta} n0={n0} rungs={n_rungs}")
print(f"{'rung':>4} {'budget r':>9} {'configs n':>10} {'unit-cost n*r':>14}")
alive, total_units = list(configs), 0
for rung in range(n_rungs):
r, n = r0 * (eta ** rung), len(alive)
total_units += n * r
print(f"{rung:>4} {r:>9} {n:>10} {n * r:>14}")
scored = sorted(alive, key=lambda c: observe(c, r)) # lower is better
alive = scored[: max(1, n // eta)] # keep top 1/eta
best = alive[0]
flat = n0 * (r0 * eta ** (n_rungs - 1)) # all n0 run to full budget
print(f"\nsurvivor id={best['id']} true_q={best['true_q']:.3f}")
print(f"total unit-cost (SHA) : {total_units}")
print(f"flat cost if all ran full: {flat}")
print(f"SHA speedup over flat : {flat / total_units:.2f}x")
alive = scored[: max(1, n // eta)] is the entire promotion rule; everything else is bookkeeping that prints the constant-per-rung budget and the compute saving over running every configuration to full budget.eta=3 n0=27 rungs=4
rung budget r configs n unit-cost n*r
0 1 27 27
1 3 9 27
2 9 3 27
3 27 1 27
survivor id=6 true_q=0.058
total unit-cost (SHA) : 108
flat cost if all ran full: 729
SHA speedup over flat : 6.75x
unit-cost column is flat at $27$ across all four rungs, the constant-per-rung budget the math predicts. Successive halving spends $108$ units total and surfaces a survivor with true quality $0.058$ (near the best possible $0$), versus $729$ units to run all $27$ configurations to full budget: a $6.75\times$ saving for an equally good answer.The survivor has true quality $0.058$, close to the global best, even though every promotion decision was made on noisy low-budget scores. That is the multi-fidelity bet paying off: the cheap rungs are wrong about the exact ranking but right about which configurations are hopeless, and stopping the hopeless ones early is where all the savings come from. The $6.75\times$ figure is modest here only because the budget range is small; at the budget ranges of real deep-learning training, where the full run is hundreds of times the cheapest probe, the saving is two or three orders of magnitude.
2. Hyperband: Hedging the One Knob Successive Halving Cannot Set Intermediate
Successive halving has a single dangerous degree of freedom: how aggressively to cut, which is governed jointly by $\eta$ and by how many configurations $n$ you start at the minimum budget. Start with a huge $n$ at a tiny budget and cut hard, and you explore widely but make every promotion decision on very noisy one-epoch scores, risking the early death of a slow-starting configuration that would have won with patience. Start with a small $n$ at a larger budget and cut gently, and every score is reliable but you have explored only a thin slice of the space. There is a genuine exploration-versus-early-promotion trade-off, and the catch is that the right setting depends on the unknown shape of the problem: how quickly good configurations distinguish themselves from bad ones, which is exactly what you are searching to discover.
Hyperband's answer is to refuse to guess. Instead of betting on one $(n, r)$ pairing, it runs several successive-halving instances, called brackets, that span the trade-off from "many configurations, brutal early cutting" to "few configurations, gentle cutting that is nearly a plain full-budget search". Each bracket is given roughly the same total budget, and the most aggressive brackets achieve that by starting many more configurations. Hyperband then simply takes the best configuration found across all brackets. By hedging across the whole spectrum of aggressiveness, it is robust to the unknown problem shape: if early scores are informative, the aggressive brackets win and waste little; if early scores mislead, the conservative brackets provide a safety net. The price of the hedge is a small constant factor of extra compute over having guessed the optimal bracket, a price the original authors show is provably bounded.
Most optimization algorithms project confidence about a hyperparameter and then hide the one they could not set. Hyperband is refreshingly candid: its central design move is to confess that the most aggressive setting of successive halving is unknowable in advance, and to budget for that ignorance by running every setting a little. It is one of the few algorithms whose pseudocode contains a loop whose entire purpose is to cover for the loop variable you were never able to choose.
3. ASHA: Deleting the Rung Barrier So the Cluster Stays Busy Advanced
Successive halving as written in Code 21.4.1 has a structural flaw the moment it leaves one machine: the promotion at each rung is a barrier. To decide the top $1/\eta$ of a rung, you must have scored every configuration in that rung, which means the rung cannot finish until its slowest trial finishes. On a cluster this is precisely the straggler problem of Section 2.7: one trial that is slow because it landed on a busy node, a thermally throttled GPU, or simply an unlucky configuration that trains slowly holds up the promotion of everyone else, and the workers that have already finished their rung sit idle waiting at the barrier. As the cluster grows, the probability that some trial in each rung is a straggler approaches one, and synchronous successive halving spends much of its time waiting.
Asynchronous Successive Halving (ASHA) removes the barrier with a single change of perspective. Rather than asking "is this whole rung done so I can promote the top fraction?", each worker, whenever it becomes free, asks a local question: "among the configurations that have already reached rung $k$, is there one in the top $1/\eta$ that has not yet been promoted to rung $k+1$?" If yes, it promotes that one and runs it at the higher budget; if no, it starts a fresh configuration at the bottom rung. Promotion is decided against the configurations that have so far reached a rung, not against a completed rung, so a finished configuration can be promoted the instant a quorum of peers exists to judge it against. No worker ever waits for a straggler; a slow trial simply keeps running in its lane while every other worker races ahead. This is the same lesson, in a new costume, that Section 10.4 taught for gradient descent: dropping the global synchronization barrier trades a small amount of decision quality (you sometimes promote on a thinner sample than the full rung) for a large gain in hardware utilization.
The move from successive halving to ASHA is the same move this book makes everywhere a synchronization barrier meets a heterogeneous cluster. Asynchronous SGD (Section 10.4) dropped the all-reduce barrier so fast workers need not wait for slow ones; bounded-staleness parameter servers (Chapter 11) dropped it for embedding updates; asynchronous actor-learner architectures (Chapter 20) dropped it for reinforcement-learning rollouts. ASHA is that identical insight applied to the promotion decision of a hyperparameter search. Whenever a distributed method must wait for the slowest participant to reach a barrier, ask whether the barrier can be replaced by a local quorum rule; the answer, across optimization, serving, and search alike, is usually yes, and the payoff is usually a cluster that stops idling.
The code below runs both schedules under a model of trial duration that makes stragglers explicit: each trial's wall-clock is its budget times a random per-worker speed factor between $1.0$ and $2.5$, so trials at the same rung finish at different times. The synchronous schedule pays, at each rung, the time of the slowest trial (the barrier). The asynchronous schedule with $W$ workers is governed instead by total work divided by throughput, because no worker waits. The promotion decisions are identical; only the cost model differs.
import math, random
random.seed(7)
eta, n0, r0 = 3, 27, 1
configs = [{"id": i, "true_q": random.uniform(0.0, 1.0)} for i in range(n0)]
def observe(c, b): return c["true_q"] + random.gauss(0, 0.25 / math.sqrt(b))
n_rungs = 1 + int(math.floor(math.log(n0) / math.log(eta)))
# Trial wall-clock = budget * random per-worker speed factor, so same-rung
# trials finish at DIFFERENT times (the straggler effect of Section 2.7).
def run_times(items, budget):
return {c["id"]: budget * random.uniform(1.0, 2.5) for c in items}
# Synchronous: a rung cannot start until EVERY trial in the previous rung is
# done, so the rung's wall-clock is its slowest (max) trial: the barrier.
sync_wall, alive = 0.0, list(configs)
for rung in range(n_rungs):
r = r0 * eta ** rung
t = run_times(alive, r)
sync_wall += max(t.values()) # wait for the straggler
alive = sorted(alive, key=lambda c: observe(c, r))[: max(1, len(alive)//eta)]
# Asynchronous (ASHA): promote the moment a rung quorum exists; no worker waits
# for a straggler, so with W workers the cost is total work over throughput.
W, total_work, alive = 9, 0.0, list(configs)
for rung in range(n_rungs):
r = r0 * eta ** rung
total_work += sum(run_times(alive, r).values())
alive = sorted(alive, key=lambda c: observe(c, r))[: max(1, len(alive)//eta)]
async_wall = total_work / W
print(f"sync wall-clock (rung barriers, slowest per rung): {sync_wall:.1f}")
print(f"async wall-clock (W={W} workers, no barrier) : {async_wall:.1f}")
print(f"async speedup over sync: {sync_wall / async_wall:.2f}x")
max(t.values()), paying for the slowest trial at each rung; the asynchronous loop accumulates sum(...)/W, paying only for total work spread over $W$ workers. The promotion rule is byte-for-byte the same in both.sync wall-clock (rung barriers, slowest per rung): 48.7
async wall-clock (W=9 workers, no barrier) : 20.7
async speedup over sync: 2.35x
The $2.35\times$ here is a floor, not a ceiling. The straggler penalty in the synchronous schedule grows with both the number of trials per rung (more trials, higher odds one is slow) and the spread of per-trial durations, so on a real cluster of hundreds of GPUs with heterogeneous hardware and preemptible spot instances the asynchronous advantage is routinely an order of magnitude. ASHA is, for this reason, the default early-stopping scheduler in production hyperparameter-search systems, and the next two sections build the distributed trial scheduler (Section 21.6) and the production framework (Section 21.7) on top of exactly this asynchronous core.
Code 21.4.1 and 21.4.2 implement the schedule and its cost model by hand to expose the mechanics. In production you neither write the rung bookkeeping nor manage the asynchronous promotions yourself: Ray Tune ships ASHA as a scheduler you attach to a tuner, and it handles trial placement across the cluster, the quorum-based promotion logic, checkpoint-aware pausing and resuming, and fault tolerance when a worker dies mid-trial.
# pip install "ray[tune]"
from ray import tune
from ray.tune.schedulers import ASHAScheduler
scheduler = ASHAScheduler(
metric="val_loss", mode="min",
max_t=27, # maximum budget (epochs) at the top rung
grace_period=1, # minimum budget r before a trial can be stopped
reduction_factor=3, # this is eta: keep the top 1/3 at each rung
)
tuner = tune.Tuner(
train_fn, # reports val_loss each epoch
param_space={"lr": tune.loguniform(1e-4, 1e-1),
"wd": tune.loguniform(1e-6, 1e-2)},
tune_config=tune.TuneConfig(scheduler=scheduler, num_samples=27),
)
results = tuner.fit() # asynchronous, no rung barrier
print(results.get_best_result(metric="val_loss", mode="min").config)
ASHAScheduler with reduction_factor as $\eta$. Ray Tune internally handles the asynchronous quorum promotion, cluster-wide trial placement, and worker-failure recovery; you supply only a training function that reports its metric each epoch.Who: An ML platform engineer running hyperparameter search for a vision team on a shared 64-GPU Kubernetes pod.
Situation: A nightly search over learning rate, weight decay, and augmentation strength used synchronous successive halving with $\eta = 4$ across 256 configurations.
Problem: The pod's utilization graph showed deep periodic troughs; GPUs sat at near-zero between rungs, and a search that should have taken three hours took nine.
Dilemma: Buy more GPUs to brute-force the wall-clock down, which would only add more idle hardware to each barrier, or change the scheduler to stop the idling, which meant trusting promotions made on partial rungs.
Decision: They switched the scheduler from synchronous successive halving to ASHA, keeping $\eta$, the budgets, and the search space identical and changing only how promotions were triggered.
How: A one-line swap to ASHAScheduler in their Ray Tune job (Code 21.4.3), plus checkpointing in the training function so paused trials could resume on any free worker.
Result: Pod utilization rose from roughly 35 percent to over 90 percent, wall-clock fell from nine hours to under three, and the best configuration found was statistically indistinguishable from the synchronous run's.
Lesson: When a search idles a cluster, the culprit is almost always a synchronization barrier, not a shortage of hardware; deleting the barrier with asynchronous promotion buys more than buying machines would.
4. When the Schedule Is the Wrong Tool Intermediate
Successive halving and its descendants rest on one assumption, and it is worth stating plainly because when the assumption fails the algorithm fails with it: low-budget performance must be a usable predictor of high-budget performance. The relative ranking of configurations at one epoch need not match the final ranking exactly, but a configuration that is hopeless cheaply must remain hopeless expensively. Most deep-learning settings satisfy this, which is why the methods are so widely used, but there are real exceptions. A learning-rate schedule with a long warmup can make a good configuration look terrible for its first few epochs, and successive halving will execute it for being a slow starter. Some architectures only reveal their advantage after a regime change late in training. In these cases the cheap rungs are not merely noisy but actively misleading, and aggressive early stopping discards the eventual winners.
Hyperband's conservative brackets are a partial defense, since they evaluate fewer configurations at higher minimum budgets and so give slow starters more rope, but a partial defense is not a guarantee. The honest move is to know your training curves before you set the minimum budget: pick a grace period long enough that warmup and early transients have passed, so the first promotion decision is made on signal rather than on warmup noise. This is also why the next section turns to population-based training, which never permanently kills a configuration but instead lets struggling members copy and perturb the parameters of strong ones, sidestepping the irreversible-early-death failure mode entirely.
The fixed-rung schedule of successive halving leaves performance on the table when the search space is structured, and recent work attacks that on several fronts. Model-based asynchronous schemes in the lineage of BOHB combine ASHA's asynchronous promotion with a Bayesian surrogate (Section 21.3) so that which configurations to sample is learned rather than random, and 2024 to 2026 systems extend this with multi-fidelity Gaussian-process and gradient-boosted surrogates that predict full-budget performance from partial learning curves and promote on the prediction rather than the raw rung score. A second thread targets large-language-model fine-tuning, where a single full-budget trial is so costly that even one wasted rung hurts: cost-aware and learning-curve-extrapolation variants stop trials on a predicted-final-loss criterion, and asynchronous Hyperband is increasingly paired with the elastic, preemptible-aware schedulers of Chapter 18 so that a promoted trial can migrate across spot instances without losing its rung. The throughline is that the rung structure is becoming a prior to be refined by a model, not a fixed grid to be obeyed.
You run successive halving with halving factor $\eta = 4$, a minimum budget of $r = 2$ epochs, and $n = 256$ starting configurations. (a) How many rungs are there, and what are the budget $r_k$ and the configuration count $n_k$ at each? (b) Confirm that the per-rung unit-cost $n_k r_k$ is constant and state its value. (c) Compare the total compute to the naive policy of running all 256 configurations to the maximum budget, and express the saving as a single ratio. (d) In one sentence, explain why doubling the maximum budget (adding one more rung) increases the saving rather than shrinking it.
Start from Code 21.4.2. (a) Replace the uniform per-worker speed factor with one heavy straggler per rung: most trials run at factor $1.0$, but one randomly chosen trial runs at factor $8.0$. Re-measure the synchronous and asynchronous wall-clocks and the speedup. (b) Sweep the number of starting configurations $n_0 \in \{27, 81, 243\}$ (keeping $\eta = 3$) and plot or tabulate the async-over-sync speedup against $n_0$. (c) Explain, from the numbers, why the synchronous barrier gets relatively more expensive as the rung population grows, and connect this to the straggler probability argument of Section 2.7.
Section 4 warns that successive halving fails when low-budget performance does not predict high-budget performance. (a) Modify the observe function in Code 21.4.1 so that a fixed 20 percent of configurations are "slow starters": their observed score is inflated (made worse) at low budgets and only reveals their true quality once the budget exceeds, say, 9. (b) Run the search many times with different seeds and measure how often the true global best is killed before reaching the top rung. (c) Now raise the grace period (the minimum budget $r_0$) and show the trade-off: a longer grace period rescues more slow starters but reduces the total compute saving. Argue which budget you would choose if a full run costs 200 times the cheapest probe, and justify it with one of the production constraints discussed in this section.