"I do not pick the configurations and I do not decide when to kill them. I just keep the workers busy and let the two specialists argue. Somehow we always converge."
A Trial Runner With No Opinions of Its Own
Everything this chapter built so far becomes practical only when one library cleanly separates two decisions, which configuration to try next and how long to let each one run, and then drives both across a cluster for you. Sections 21.2 through 21.5 gave us search algorithms (random, Bayesian, BOHB) and stopping rules (successive halving, Hyperband, population-based training); Section 21.6 gave us the distributed scheduling, checkpointing, and fault tolerance that make thousands of trials survivable. Ray Tune is the tool that holds those pieces apart on purpose: a pluggable search algorithm proposes configurations, a pluggable scheduler decides which trials to promote, pause, or kill, and a runtime fans the work out as Ray tasks and actors over many machines. This section explains why that separation is the right abstraction, runs a tiny version of it end to end, and places Ray Tune inside the wider AutoML ecosystem of Optuna, Hyperopt, and the systems that search architectures and whole pipelines.
By the end of Section 21.6 we had every ingredient of a distributed hyperparameter search except a kitchen to cook them in. We knew how to propose configurations cheaply (Section 21.2), how to stop unpromising trials early (Sections 21.3 to 21.5), and how to schedule trials across workers with checkpointing and fault tolerance (Section 21.6). What we did not have was a single piece of software that lets you swap any proposer against any stopping rule without rewriting the loop that drives them. That swap is exactly what Ray Tune is built around, and the design choice behind it is worth dwelling on because it recurs throughout machine-learning infrastructure: keep the policy that decides what to do separate from the policy that decides when to stop, and separate both from the mechanism that runs the work.
1. The Search/Scheduler Separation, and Why It Is the Right Abstraction (Beginner)
A hyperparameter search makes two decisions on every step, and they are genuinely different decisions made with different information. The first is which configuration to evaluate next: this is the search algorithm, and it reasons over the space of hyperparameters using the scores of finished trials. Random search ignores past scores entirely; Bayesian optimization fits a surrogate model to them; tree-structured Parzen estimators (TPE) and BOHB build density models of good and bad regions. The second decision is how much budget to give a trial that has already started, and whether to keep it alive at all: this is the scheduler, and it reasons over partial learning curves. Successive halving and ASHA promote the top fraction at each rung; the median-stopping rule kills a trial that falls below the running median; population-based training (Section 21.5) goes further and mutates a live trial's configuration in place.
Bundling these two decisions into one monolithic optimizer is the natural first design, and it is a mistake, because the two axes vary independently. You might want TPE's sample efficiency with ASHA's aggressive early stopping, or plain random search with population-based training, or Bayesian optimization with no early stopping at all when every trial is cheap. If search and stopping are welded together you get a small fixed menu of combinations; if they are orthogonal interfaces you get their product. Ray Tune makes this orthogonality explicit: a search_alg object implements "propose the next config," a scheduler object implements "given this trial's latest result, promote, pause, or stop it," and the two never call each other. The composition rule that makes BOHB work, Bayesian-optimized sampling feeding Hyperband-style brackets, is in Ray Tune just one search algorithm paired with one scheduler, not a special-cased algorithm.
A distributed hyperparameter search is the product of three independent choices: a search algorithm (what to try, reasoning over completed scores), a scheduler (how long to run each trial, reasoning over partial curves), and a runtime (how trials map onto workers, with checkpointing and fault tolerance). Keeping them orthogonal turns a fixed menu of named algorithms into their full Cartesian product: any proposer with any stopping rule on any cluster. Most of the value of a tool like Ray Tune is that it refuses to let these three concerns leak into each other.
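The "fixed menu versus Cartesian product" point can be made concrete with a few lines. This is only a counting sketch: the component names are taken from the text, and the two lists are illustrative rather than a complete inventory of what any library ships.

```python
from itertools import product

# Names taken from the text; the lists are illustrative, not exhaustive.
search_algorithms = ["random", "bayesian_optimization", "tpe"]
schedulers = ["none", "median_stopping", "successive_halving", "asha", "pbt"]

# Orthogonal interfaces: every proposer can be paired with every stopping rule.
strategies = list(product(search_algorithms, schedulers))

print(len(strategies))                     # 3 * 5 pairings
print(("tpe", "asha") in strategies)       # sample-efficient proposer + early stopping
print(("random", "pbt") in strategies)     # plain random search + population-based training
```

Adding one more scheduler to the list adds a whole row of pairings at once, which is the practical payoff of keeping the two interfaces independent.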
2. What Ray Tune Adds on Top of Ray (Intermediate)
Ray Tune is a library built on Ray, the distributed-execution framework whose tasks and actors we used to run actor-learner reinforcement learning in Chapter 20. Each trial is a Ray task or actor that occupies some slice of the cluster (a fraction of a CPU, one or more GPUs); the Tune driver holds the search algorithm and scheduler, launches trials onto free resources, collects reported metrics, and acts on the scheduler's verdicts by promoting, pausing, or terminating trial actors. The machinery from Section 21.6 lives underneath: when a trial is paused for a higher-priority one, its state is checkpointed so it can resume; when a spot instance is preempted, the trial is rescheduled from its last checkpoint rather than lost. Tune contributes the trial lifecycle, the resource accounting, the result stream, and the glue that lets a scheduler express "pause this, run that instead" as concrete actor operations.
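The pause-and-resume behaviour described above can be sketched without any cluster at all. The class below is a hypothetical toy (ToyTrial, save, and restore are names invented for illustration, not Ray Tune's API); it assumes a trial's entire state is its step count and latest score, and it shows that a preemption costs only the work done since the last checkpoint.

```python
from dataclasses import dataclass, field

@dataclass
class ToyTrial:
    """Hypothetical trial whose whole state is a step count and a score."""
    config: dict
    step: int = 0
    score: float = 0.0
    checkpoint: dict = field(default_factory=dict)

    def train_one_step(self):
        self.step += 1
        self.score = 1.0 - 0.5 ** self.step      # toy learning curve

    def save(self):
        # Snapshot everything needed to continue later.
        self.checkpoint = {"step": self.step, "score": self.score}

    def restore(self):
        # Roll back to the last snapshot (step 0 if none was ever taken).
        self.step = self.checkpoint.get("step", 0)
        self.score = self.checkpoint.get("score", 0.0)


trial = ToyTrial(config={"lr": 1e-2})
for _ in range(3):
    trial.train_one_step()
trial.save()                  # driver pauses the trial: checkpoint at step 3
trial.train_one_step()        # step 4 runs, then the worker is preempted
trial.restore()               # rescheduled elsewhere: resume from step 3
steps_lost = 4 - trial.step
print(trial.step, steps_lost)   # 3 1
```

In a real system the checkpoint would hold model and optimizer state on shared storage; the bookkeeping, however, has the same shape.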
Figure 21.7.1 shows the pieces and how they communicate. The point to read off it is the flow of information: completed scores travel up to the search algorithm, partial curves travel to the scheduler, and only the runtime touches the workers. Neither specialist knows the cluster exists.
3. A Tiny Tune-Like Engine, From Scratch (Intermediate)
The cleanest way to see why the separation matters is to build a miniature of it and change only one component. The code below defines a pluggable search algorithm (RandomSearch), two pluggable schedulers (NoScheduler, which runs every trial to the full budget, and ASHA, which promotes only the top fraction at each rung), and a run_tune engine that pairs any search with any scheduler over a pool of workers. The engine measures total compute as the number of training steps actually executed, the same currency a real cluster bills you in. We then run the identical search and worker pool twice, swapping only the scheduler.

    import math, random, heapq

    # The objective: a noisy "training curve". train_step(config, rung) returns a
    # validation score after `rung` units of budget; more budget climbs a ceiling.
    def train_step(config, rung):
        lr, wd = config["lr"], config["wd"]
        quality = -((math.log10(lr) + 2.0) ** 2) - 30.0 * (wd - 1e-4) ** 2
        ceiling = 1.0 - math.exp(-0.6 * rung)          # more budget -> closer to ceiling
        noise = (random.random() - 0.5) * 0.02
        return (quality * 0.1 + 1.0) * ceiling + noise

    class RandomSearch:                                 # SEARCH ALGORITHM: what to try
        name = "RandomSearch"
        def __init__(self, seed=0):
            self.r = random.Random(seed)
        def ask(self):
            return {"lr": 10 ** self.r.uniform(-4, 0), "wd": 10 ** self.r.uniform(-6, -2)}

    class ASHA:                                         # SCHEDULER: how long to run each
        name = "ASHA"
        def __init__(self, max_rung=3, eta=3):
            self.max_rung, self.eta = max_rung, eta
            self.rungs = {r: [] for r in range(max_rung + 1)}
        def on_result(self, rung, score):
            self.rungs[rung].append(score)
            if rung >= self.max_rung:
                return "stop"
            peers = sorted(self.rungs[rung], reverse=True)
            cutoff = max(1, len(peers) // self.eta)     # keep top 1/eta
            return "promote" if score >= peers[cutoff - 1] else "stop"

    class NoScheduler:                                  # baseline: no early stopping
        name = "FullBudget"
        def __init__(self, max_rung=3):
            self.max_rung = max_rung
        def on_result(self, rung, score):
            return "promote" if rung < self.max_rung else "stop"

    def run_tune(search, scheduler, n_trials, n_workers):   # the RUNTIME mechanism
        budget_used, best, trial_id = 0, (-1e9, None), 0
        pool, pending = [], []
        for w in range(n_workers):
            heapq.heappush(pool, (0.0, w))
        for _ in range(n_trials):
            heapq.heappush(pending, (0.0, trial_id, search.ask(), 0))
            trial_id += 1
        while pending:
            ready_at, tid, config, rung = heapq.heappop(pending)
            free_at, w = heapq.heappop(pool)
            start = max(ready_at, free_at)
            score = train_step(config, rung + 1)
            budget_used += 1
            heapq.heappush(pool, (start + 1.0, w))      # this worker is busy 1 unit
            if score > best[0]:
                best = (score, dict(config))
            if scheduler.on_result(rung, score) == "promote":
                heapq.heappush(pending, (start + 1.0, tid, config, rung + 1))
        return best, budget_used

    if __name__ == "__main__":
        N, W = 27, 4
        print(f"trials={N} workers={W} search={RandomSearch.name}\n")
        for sched in (NoScheduler(), ASHA()):
            random.seed(7)
            (score, cfg), used = run_tune(RandomSearch(seed=1), sched, N, W)
            print(f"scheduler={sched.name:<11} compute_units={used:>4} "
                  f"best_score={score:.4f} lr={cfg['lr']:.2e} wd={cfg['wd']:.2e}")
        print("\nSame search algorithm, same workers: only the scheduler changed.")

Code 21.7.1: run_tune is the runtime mechanism; it knows nothing about the objective, the proposer, or the stopping rule. Swapping NoScheduler for ASHA is the only change between the two runs, exactly the orthogonality of Section 1.

    trials=27 workers=4 search=RandomSearch

    scheduler=FullBudget  compute_units= 108 best_score=0.9146 lr=9.59e-03 wd=6.28e-05
    scheduler=ASHA        compute_units=  44 best_score=0.9132 lr=9.59e-03 wd=6.28e-05

    Same search algorithm, same workers: only the scheduler changed.

Output 21.7.1: ASHA finds the same best configuration as the full-budget run (the same lr and wd to two significant figures, a score within $1.5 \times 10^{-3}$) while spending 44 compute units against the full-budget run's 108, a $2.5\times$ reduction. The search algorithm and worker pool were byte-for-byte the same; only the scheduler changed.
The lesson Output 21.7.1 teaches is the whole argument of this section in miniature. We changed one object, not the loop, not the search, not the workers, and traded under two parts in a thousand of final quality for less than half the compute. In Ray Tune the same swap is a one-line change to the scheduler= argument of tune.Tuner, and the runtime that turns "stop this trial" into a terminated GPU actor on some other machine is the Section 21.6 machinery you no longer have to write.
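As a sanity check on the 108 and 44 compute units in Output 21.7.1, the sketch below computes two reference budgets for the same setting (27 trials, rungs 0 through 3, $\eta = 3$): the full budget, and an idealized synchronous successive halving that waits for every trial at a rung before keeping exactly the top third. The helper names are invented for this illustration.

```python
def full_budget_units(n_trials, n_rungs):
    """Every trial runs every rung."""
    return n_trials * n_rungs

def synchronous_halving_units(n_trials, n_rungs, eta):
    """Idealized successive halving: keep the top 1/eta at each rung."""
    total, alive = 0, n_trials
    for _ in range(n_rungs):
        total += alive                 # each surviving trial trains one unit
        alive = max(1, alive // eta)   # top 1/eta move to the next rung
    return total

full = full_budget_units(27, 4)                    # rungs 0..3 -> 4 units per trial
ideal = synchronous_halving_units(27, 4, eta=3)    # 27 + 9 + 3 + 1
print(full, ideal, round(full / ideal, 2))         # 108 40 2.7
```

The full-budget figure matches the measured 108 exactly. The idealized 40 sits slightly below the measured 44, which is plausible because the ASHA class in Code 21.7.1 decides on each result as it arrives, using only the peers seen so far at that rung; the first trial to report at a rung, for example, is compared only against itself and is always promoted. An asynchronous count can therefore differ from the synchronous ideal.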
4. Ray Tune in Practice, Illustratively (Intermediate)
In real code the three concerns appear as three arguments. You write a training function that reports a metric each epoch, hand Tune a search space, and select a search algorithm and a scheduler independently. The library does the resource accounting, checkpointing, pausing, and fault recovery that Code 21.7.1 only sketched.
The hand-rolled engine of Code 21.7.1 becomes a handful of lines, and the runtime, the part that maps trials onto a cluster with checkpointing and fault tolerance, is entirely handled for you. Swapping the search algorithm or the scheduler is a one-line change; nothing else in the script moves.

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler
    from ray.tune.search.optuna import OptunaSearch        # TPE proposer, via Optuna

    # build() and train_one_epoch() are placeholders for your own model code.
    def train_fn(config):                                  # one trial
        model = build(config)
        for epoch in range(config["epochs"]):
            val_acc = train_one_epoch(model, config["lr"])
            tune.report({"val_acc": val_acc})              # stream the partial curve

    tuner = tune.Tuner(
        tune.with_resources(train_fn, {"gpu": 1}),         # one GPU per trial
        param_space={"lr": tune.loguniform(1e-4, 1e-1),    # the search space
                     "epochs": 50},
        tune_config=tune.TuneConfig(
            search_alg=OptunaSearch(metric="val_acc", mode="max"),   # WHAT to try
            scheduler=ASHAScheduler(metric="val_acc", mode="max"),   # HOW LONG to run
            num_samples=200),                              # 200 trials across the cluster
    )
    results = tuner.fit()                                  # fans out over all Ray workers
    print(results.get_best_result(metric="val_acc", mode="max").config)

search_alg and scheduler are independent arguments; pairing OptunaSearch (a TPE proposer) with ASHAScheduler reconstructs a BOHB-style method, and tune.with_resources plus num_samples let Ray place 200 trials across every GPU in the cluster. The roughly forty lines of the hand-rolled engine in Code 21.7.1 (proposer, schedulers, and run_tune) collapse to the Tuner construction, and Ray supplies the checkpointing and preemption recovery of Section 21.6.
It is easy to spend more engineering effort tuning the tuner than the model. A search over search algorithms, schedulers, and their own knobs (the $\eta$ of ASHA, the surrogate of Bayesian optimization) is a perfectly real meta-problem, and people do run it. The discipline this chapter preaches is to stop one level up: pick a sensible proposer and a sensible stopping rule, give them a fixed budget, and resist the temptation to descend forever. The compute you save by not meta-tuning is compute you can spend on more trials.
5. The Wider Ecosystem: Optuna, Hyperopt, and Full AutoML (Intermediate)
Ray Tune is a distributed driver that happily borrows proposers from elsewhere, which is why it integrates the rest of the ecosystem rather than competing with all of it. Two libraries supply the proposers most often plugged into it. Optuna introduced the define-by-run search space: instead of declaring the space up front, your objective calls trial.suggest_float(...) as it executes, so the space can branch on earlier choices (sample a number of layers, then sample a width per layer). Optuna's default proposer is TPE, and it ships its own pruners (its word for schedulers) including a median-stopping rule and a successive-halving pruner. Hyperopt is the older library that popularized TPE in Python; it remains a common search_alg backend. The relationship is layered, not rivalrous: Optuna or Hyperopt decides what to try, Ray Tune decides how long to run it and on which machine.
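The define-by-run idea is easiest to see in a toy. The sketch below does not use Optuna; RecordingTrial is a hypothetical stand-in whose suggest_int and suggest_float methods merely mimic the shape of a trial object, sampling uniformly at random and recording each parameter as it is requested. The objective's score is a placeholder, since no model is trained.

```python
import random

class RecordingTrial:
    """Toy stand-in for a define-by-run trial object (not Optuna's class)."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.params = {}              # the space is recorded as it is sampled

    def suggest_int(self, name, low, high):
        self.params[name] = self.rng.randint(low, high)
        return self.params[name]

    def suggest_float(self, name, low, high):
        self.params[name] = self.rng.uniform(low, high)
        return self.params[name]

def objective(trial):
    # How many width parameters exist depends on an earlier sample.
    n_layers = trial.suggest_int("n_layers", 1, 3)
    widths = [trial.suggest_int(f"width_{i}", 16, 128) for i in range(n_layers)]
    lr = trial.suggest_float("lr", 1e-4, 1e-1)
    return sum(widths) * lr           # placeholder score, no real training

for seed in range(3):
    t = RecordingTrial(seed)
    objective(t)
    print(sorted(t.params))           # the set of keys varies with n_layers
```

Because the space is discovered while the objective runs, a trial that samples one layer never defines width_1 or width_2, which a space declared up front would have to handle with conditional parameters.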
Above hyperparameter search sits full AutoML, which searches more than a fixed model's knobs. Neural architecture search (NAS) treats the network structure itself as the search space, which can contain billions of candidate architectures; early NAS runs famously consumed thousands of GPU-days, and the field's central effort has been to cut that cost with weight sharing and one-shot supernets. Pipeline AutoML systems search over preprocessing, model family, and hyperparameters jointly: auto-sklearn does this for classical models via Bayesian optimization plus meta-learning, and AutoGluon takes a different bet, ensembling and stacking a portfolio of strong models rather than searching exhaustively, which often wins on tabular and multimodal data for far less compute. What unites them is appetite: the search space is enormous, every candidate costs a training run, and so the distributed scheduling and early stopping of this chapter are not optional niceties but the only thing that makes the search affordable. The compute a naive NAS demands is precisely the compute that ASHA-style promotion and one-shot sharing exist to avoid.
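A back-of-the-envelope count shows how quickly an architecture space reaches the billions. All three numbers below are hypothetical, chosen only to illustrate the arithmetic; they do not describe any particular NAS benchmark.

```python
ops_per_layer = 8      # hypothetical: candidate operations per layer
n_layers = 12          # hypothetical: searchable layers
gpu_hours_each = 1.0   # hypothetical: cost of training one candidate

# Independent choices per layer multiply.
n_architectures = ops_per_layer ** n_layers
naive_gpu_days = n_architectures * gpu_hours_each / 24

print(f"{n_architectures:,} candidate architectures")
print(f"{naive_gpu_days:.2e} GPU-days if each is trained separately")
```

Even at one GPU-hour per candidate, exhaustive per-architecture training is out of reach under these assumptions, which is why the methods in the text share weights or stop candidates early instead.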
Who: An ML platform engineer at a streaming company responsible for the weekly retrain of a deep recommendation model.
Situation: Each trial trained for six hours on one GPU, and a fresh hyperparameter sweep of 200 configurations on a fixed grid would have monopolized the shared cluster for over a week.
Problem: Most configurations were visibly hopeless within the first hour, yet the grid ran every one of them to completion, and a single bad cloud preemption lost an entire trial's progress.
Dilemma: Buy more GPUs to brute-force the grid faster, or change the search itself, accepting the engineering work of wiring up a proper proposer, a stopping rule, and checkpointing.
Decision: They moved the sweep to Ray Tune, pairing OptunaSearch (TPE) as the proposer with ASHAScheduler for early stopping, and let trials checkpoint so preempted spot instances resumed instead of restarting.
How: The training function reported validation AUC each epoch; ASHA killed the bottom two thirds at the first rung, and Tune packed surviving trials onto whatever GPUs the cluster freed up, recovering preempted ones from their last checkpoint.
Result: The sweep finished in under two days instead of nine, the best configuration matched the full-grid winner's AUC within noise, and total GPU-hours fell by roughly $2.5\times$, the same order of saving Output 21.7.1 shows in miniature.
Lesson: The cheapest way to speed up a sweep is usually a better scheduler, not a bigger cluster. Early stopping plus checkpointing turns a brute-force grid into an affordable search without touching the model.
These sweeps do not run in a vacuum: every trial is an experiment, and an experiment that is not recorded cannot be compared or reproduced. Ray Tune, Optuna, and the AutoML systems all integrate with experiment-tracking backends (MLflow, Weights & Biases, TensorBoard) so that each trial's configuration, metric curve, and final checkpoint are logged automatically. Treating a hyperparameter search as a fleet of tracked experiments rather than a throwaway script is the bridge to production operations, which is the subject of MLOps for distributed AI in Chapter 26; the cost side of that same coin, when to stop spending on a search at all, is where Section 21.8 takes us next.
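What "logged automatically" amounts to can be sketched with the standard library. The example below is a minimal stand-in, not the API of MLflow, Weights & Biases, or TensorBoard: log_trial is a hypothetical helper that writes one JSON line per trial to an in-memory sink, and the checkpoint paths are illustrative strings with nothing written to disk.

```python
import io
import json

def log_trial(sink, trial_id, config, curve, checkpoint_path):
    """Append one trial as a JSON line: config, metric curve, final checkpoint."""
    record = {
        "trial_id": trial_id,
        "config": config,
        "val_curve": curve,
        "final_metric": curve[-1],
        "checkpoint": checkpoint_path,   # hypothetical path, nothing is written
    }
    sink.write(json.dumps(record) + "\n")

sink = io.StringIO()                     # in-memory stand-in for a tracking backend
log_trial(sink, 0, {"lr": 1e-2}, [0.61, 0.74, 0.80], "ckpt/trial_0")
log_trial(sink, 1, {"lr": 1e-3}, [0.55, 0.63, 0.69], "ckpt/trial_1")

# Because every trial was recorded, the sweep can be compared after the fact.
records = [json.loads(line) for line in sink.getvalue().splitlines()]
best = max(records, key=lambda r: r["final_metric"])
print(best["trial_id"], best["config"])
```

The useful property is that comparison happens on the records, not on whatever the script happened to print while it ran.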
Two threads are reshaping the ecosystem this section surveys. The first is LLM-driven AutoML: large language models are used as the proposer in the search loop, reading a dataset description and prior trial results in natural language and suggesting the next configuration, pipeline, or even code. Systems in the lineage of AutoML-GPT and agentic data-science frameworks (and Google's 2024 work on LLMs as hyperparameter optimizers) report competitive results with far fewer trials, because the model brings prior knowledge that random or Bayesian proposers lack. The second is efficient NAS at scale: one-shot and weight-sharing supernets, zero-cost proxies that rank architectures without training them, and hardware-aware search that folds latency on a target device into the objective, all aimed at the same goal of cutting NAS from thousands of GPU-days to a tractable budget. Both threads lean entirely on the distributed scheduling and early stopping of this chapter; an LLM proposer still needs a scheduler to kill its weak suggestions, and a one-shot NAS still needs a cluster to train its supernet. We meet the cost-accounting that governs how much of any of this you should run in Section 21.8, and the optimization theory the proposers rest on in Chapter 10.
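Folding latency into the objective can take many forms. The sketch below assumes one simple form, a linear penalty for every millisecond above a target, purely for illustration; the function name, the penalty weight, and the two candidate architectures with their measurements are all hypothetical.

```python
def latency_aware_score(accuracy, latency_ms, target_ms, penalty_per_ms=0.002):
    """Assumed form: subtract a linear penalty for latency above the target."""
    overshoot = max(0.0, latency_ms - target_ms)
    return accuracy - penalty_per_ms * overshoot

candidates = {                       # hypothetical architectures and measurements
    "wide":   {"accuracy": 0.930, "latency_ms": 42.0},
    "narrow": {"accuracy": 0.915, "latency_ms": 18.0},
}
target_ms = 20.0

ranked = sorted(
    candidates,
    key=lambda name: latency_aware_score(target_ms=target_ms, **candidates[name]),
    reverse=True,
)
print(ranked)   # the slower model loses despite its higher raw accuracy
```

The search loop itself is unchanged: the proposer and scheduler still see a single scalar per trial, and only the definition of that scalar now depends on the target device.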
6. When to Reach for Which Tool (Beginner)
The ecosystem is large, but the choice is usually quick once you name your constraint. If your trials are expensive deep-learning runs that must spread across a GPU cluster with early stopping and fault tolerance, reach for Ray Tune and plug a proposer into it. If you are tuning on a single machine and want a clean define-by-run space with built-in pruning, Optuna alone is often enough. If you have tabular or multimodal data and want a strong model fast without writing a search at all, AutoGluon's portfolio-and-ensemble approach frequently beats a hand-built sweep. If you are searching network architectures, you are in NAS territory and should budget for one-shot or weight-sharing methods rather than naive per-architecture training. Table 21.7.1 condenses the map.
| Tool | What it searches | Distinctive feature | Best when |
|---|---|---|---|
| Ray Tune | Hyperparameters, any proposer | Search/scheduler split, runs on a Ray cluster | Expensive trials, many machines, early stopping needed |
| Optuna | Hyperparameters | Define-by-run space, built-in pruners | Single machine or moderate scale, branching spaces |
| Hyperopt | Hyperparameters | Classic TPE proposer | A drop-in proposer for a larger driver |
| auto-sklearn | Pipeline + hyperparameters | Bayesian optimization plus meta-learning | Classical ML on tabular data |
| AutoGluon | Models + ensembles | Portfolio stacking, little searching | Strong tabular/multimodal result fast |
| NAS frameworks | Network architecture | One-shot / weight-sharing supernets | Architecture matters and a GPU budget exists |
We now have the tooling that turns Sections 21.2 through 21.6 from algorithms on paper into searches that run on a cluster, and we have seen why the search/scheduler separation is the abstraction worth defending. What we have not yet done is decide how much searching to buy. Every trial costs money and GPU-hours, early stopping changes the accounting, and at some point the marginal configuration is not worth its compute. That economics, cost-aware distributed experimentation, is the subject of Section 21.8.
Suppose a library offered only three monolithic optimizers (random, Bayesian, BOHB) with stopping rules baked in, versus a library that offered three search algorithms and four schedulers as independent components. Count the distinct search strategies each library can express. Then argue, with a concrete example such as "TPE proposer with no early stopping," why a combination the monolithic library cannot express might be exactly what a given problem needs. Tie your answer to the search/scheduler separation of Section 1.
Extend Code 21.7.1 with a third scheduler, MedianStopping, whose on_result(rung, score) returns "stop" when score is below the running median of all scores seen at that rung and "promote" otherwise (up to max_rung). Run all three schedulers (FullBudget, ASHA, MedianStopping) with the same RandomSearch and worker pool, and report compute units and best score for each. Explain which scheduler is most aggressive and how that shows up in the compute-versus-quality trade-off, without changing the search algorithm or the engine.
A sweep of $200$ trials each trains for $6$ hours on one GPU, and you have $16$ GPUs. Estimate the wall-clock time to run the full grid with no early stopping. Now assume an ASHA scheduler with reduction factor $\eta = 3$ and three rungs that promotes only the top third at each rung, so most trials stop after the first rung. Estimate the total GPU-hours under ASHA relative to the full grid, state your assumptions about how many trials survive each rung, and compare your estimate to the $2.5\times$ saving in Output 21.7.1. Why might your analytic estimate and the measured saving differ?