"I proposed the next configuration with great confidence. Then I learned the other nineteen workers had already proposed it too, and none of us had finished the first one."
A Surrogate Model With Too Many Suitors
The three classic ways to search a hyperparameter space (grid, random, and Bayesian optimization) sit at different points on a single trade-off: how much you let past results inform the next configuration, and how much parallelism you keep when you do. Grid and random search decide every configuration in advance, so a thousand workers can run a thousand trials with zero coordination, but they learn nothing from the trials already finished. Bayesian optimization fits a model of the objective and chooses each configuration to be maximally informative, spending far fewer trials to reach a good answer, but each suggestion depends on prior results, which makes it sequential by nature. Scaling Bayesian optimization across a cluster therefore means deliberately giving back some of its sample efficiency to recover parallelism. This section makes that tension concrete and measures it.
Section 21.1 established that hyperparameter search is embarrassingly parallel at the level of trials: each configuration trains an independent model, and independent work spreads across machines without communication. That section also warned that embarrassing parallelism is necessary but not sufficient, because a search that proposes bad configurations wastes the cluster no matter how perfectly it parallelizes. This section is about the proposal step itself, the algorithm that decides which configurations to try. The choice of search strategy determines both how few trials you need and how cleanly those trials fan out across workers, and those two goals pull in opposite directions.
We treat the search as minimizing an objective $f(\boldsymbol{\lambda})$, the validation loss obtained by training with hyperparameters $\boldsymbol{\lambda}$ drawn from a search space $\Lambda$. Every evaluation of $f$ is one full (or partial) training run, so evaluations are expensive and few, often dozens to low hundreds, not millions. That scarcity is what separates hyperparameter optimization from ordinary continuous optimization and is exactly why a model of $f$ can pay for itself.
1. Grid Search: Exhaustive and Exponential Beginner
Grid search discretizes each hyperparameter into a small set of candidate values and evaluates every combination. With $D$ hyperparameters and $v$ values per axis, the grid has $v^{D}$ points. Its appeal is total simplicity and perfect parallelism: the full list of configurations is known before any worker starts, so you scatter $v^{D}$ independent trials across the cluster and gather the best, with no proposal logic and no coordination. For two or three axes with a handful of values each, grid search is a perfectly sensible default and is easy to reason about.
The trouble is the exponent. Adding one more hyperparameter multiplies the cost by $v$; going from four axes to eight at five values each takes the grid from $625$ to roughly $390{,}000$ points. No cluster makes that affordable when each point is a training run. Worse, the grid spends its budget uniformly even though hyperparameters are almost never equally important. If the learning rate dominates the final loss and the weight-decay value barely matters, a grid still evaluates every weight-decay setting at every learning rate, burning most of its trials distinguishing configurations that differ only along an axis nobody cares about. The grid is blind to which axes deserve resolution.
Grid and random search achieve perfect parallelism precisely because they refuse to learn from finished trials: with nothing to wait for, everything runs at once. That same refusal is their weakness. The expensive resource in hyperparameter search is not worker time in aggregate but the number of training runs you can afford, and a strategy that ignores results wastes that budget on uninformative configurations. The rest of this chapter is a sustained attempt to buy back trial efficiency without paying too much parallelism for it.
2. Random Search: Better Coverage in High Dimensions Beginner
Random search draws each configuration independently and uniformly from the search space. It keeps every advantage of grid search (the whole batch is decided up front, so it is fully parallel) while discarding the rigid lattice. Bergstra and Bengio (2012) showed that this small change is a large improvement in high dimensions, and the reason is geometric rather than statistical. On a grid, the projection of the trials onto any single axis takes only $v$ distinct values, no matter how many total points you spend; every trial that shares a learning rate is wasted resolution on that axis. Random search, by contrast, gives every trial a distinct value on every axis, so its projection onto the few axes that actually matter is as fine as the total trial count allows.
Put differently: if only a handful of the $D$ hyperparameters meaningfully affect the loss, random search effectively concentrates its full budget on those important axes for free, because it never repeats a coordinate. Figure 21.2.1 shows the effect for a two-dimensional space where only the horizontal axis matters. The grid tries just three distinct horizontal values across nine trials; random search tries nine. This is why random search, not grid search, is the right baseline against which every smarter method in this chapter is measured, and why an honest comparison of a new optimizer always reports its budget against random search of the same size.
3. Bayesian Optimization: Sample-Efficient but Sequential Intermediate
Grid and random search never look at a finished trial before launching the next. Bayesian optimization does the opposite: it builds a probabilistic surrogate model of the objective from all evaluations so far, then uses that model to decide where to evaluate next. The surrogate is a cheap stand-in for the expensive $f$, predicting both a mean $\mu(\boldsymbol{\lambda})$ (where the loss is likely to be low) and an uncertainty $\sigma(\boldsymbol{\lambda})$ (where the model has not yet looked). A Gaussian process is the textbook surrogate; the Tree-structured Parzen Estimator (TPE) used by Optuna and Hyperopt is a popular alternative that scales better to mixed and conditional spaces.
The model chooses the next configuration by maximizing an acquisition function that balances exploitation (sampling where the predicted loss is low) against exploration (sampling where the model is uncertain, in case a better basin hides there). A simple and widely used choice is the lower confidence bound, which for a minimization problem is
$$\boldsymbol{\lambda}_{\text{next}} = \arg\min_{\boldsymbol{\lambda} \in \Lambda} \; \big[\, \mu(\boldsymbol{\lambda}) - \kappa \, \sigma(\boldsymbol{\lambda}) \,\big],$$where the constant $\kappa \ge 0$ tunes the exploration-exploitation balance: $\kappa = 0$ chases the current best prediction greedily, while large $\kappa$ rewards uncertain, unexplored regions. Each cycle evaluates the chosen $\boldsymbol{\lambda}_{\text{next}}$, adds the result to the data, refits the surrogate, and repeats. Because $\boldsymbol{\lambda}_{\text{next}}$ depends on every prior result through $\mu$ and $\sigma$, the loop is inherently sequential: the $(t{+}1)$-th proposal cannot be computed until the $t$-th evaluation reports back. This is the exact opposite of the all-at-once parallelism of grid and random search, and it is the central friction of this section.
The payoff for that friction is sample efficiency. The code below pits random search against a small Bayesian-style optimizer on the same four-dimensional objective at the same trial budget, using a distance-weighted neighbor surrogate as a lightweight stand-in for a Gaussian process and the lower confidence bound acquisition above. Both are given equal trials; the question is who reaches a low loss with fewer of them.
import numpy as np
# 4-D validation-loss-like objective to MINIMIZE. One narrow good basin plus
# rugged sinusoidal ripples, so most random points land in poor regions.
def objective(x):
center = np.array([0.30, 0.65, 0.45, 0.55])
basin = np.sum((x - center) ** 2)
ripple = 0.04 * np.sum(np.sin(5.0 * x) ** 2)
return basin + ripple
DIM, BUDGET = 4, 30
LO, HI = np.zeros(DIM), np.ones(DIM)
def random_search(seed): # fully parallel: all configs up front
rng = np.random.default_rng(seed)
y = np.array([objective(x) for x in rng.uniform(LO, HI, size=(BUDGET, DIM))])
return np.minimum.accumulate(y) # best-so-far curve
# Surrogate: distance-weighted neighbor mean (cheap GP stand-in) plus an
# uncertainty proxy that grows with distance from the nearest sampled point.
def surrogate(cand, X, y, k=5):
d = np.linalg.norm(X - cand, axis=1)
idx = np.argsort(d)[:k]
w = 1.0 / (d[idx] ** 2 + 1e-9)
return np.sum(w * y[idx]) / np.sum(w), d[idx].min() # (mu, sigma)
def bayes_search(seed, n_init=6, n_cand=600, kappa=0.6): # SEQUENTIAL: each pick uses all prior y
rng = np.random.default_rng(seed)
X = rng.uniform(LO, HI, size=(n_init, DIM))
y = np.array([objective(x) for x in X])
curve = list(np.minimum.accumulate(y))
for _ in range(BUDGET - n_init):
cands = rng.uniform(LO, HI, size=(n_cand, DIM))
# Lower confidence bound acquisition: minimize mu - kappa*sigma.
scores = np.array([(lambda m, s: m - kappa * s)(*surrogate(c, X, y)) for c in cands])
nxt = cands[np.argmin(scores)] # depends on EVERY prior result
X = np.vstack([X, nxt]); y = np.append(y, objective(nxt))
curve.append(min(curve[-1], y[-1]))
return np.array(curve)
trials_at = [6, 12, 20, 30]
rs = np.array([[random_search(s)[t - 1] for t in trials_at] for s in range(50)])
bo = np.array([[bayes_search(s)[t - 1] for t in trials_at] for s in range(50)])
print("best-so-far validation loss, mean over 50 runs (lower is better)")
print(f"{'trials':>8} | {'random':>10} | {'bayesian':>10}")
for j, t in enumerate(trials_at):
print(f"{t:>8} | {rs[:, j].mean():>10.4f} | {bo[:, j].mean():>10.4f}")
print()
print(f"bayesian reaches {bo[:, 2].mean():.4f} by trial 20")
print(f"random needs all 30 trials to reach {rs[:, 3].mean():.4f}, still worse")
bayes_search steer toward the good basin, while random_search samples blindly; the inner for loop is what forbids running its trials all at once.best-so-far validation loss, mean over 50 runs (lower is better)
trials | random | bayesian
6 | 0.2546 | 0.2546
12 | 0.1893 | 0.2061
20 | 0.1601 | 0.1060
30 | 0.1446 | 0.0876
bayesian reaches 0.1060 by trial 20
random needs all 30 trials to reach 0.1446, still worse
The numbers tell the whole story of the trade-off. Bayesian optimization spends its first six trials no better than random, because the surrogate has too little data to be trustworthy, then sharply overtakes once the model has something to fit: at twenty trials it has reached a loss that random search does not match at thirty. In wall-clock terms on a single worker, Bayesian optimization is the clear winner, fewer expensive training runs for a better answer. The catch is hidden in the code: random_search could have launched all thirty trials simultaneously across thirty workers, whereas bayes_search must finish trial $t$ before it can even propose trial $t{+}1$. On a thirty-worker cluster, naive Bayesian optimization would leave twenty-nine workers idle.
There is a recurring rite of passage in which an engineer, thrilled that Bayesian optimization needs five times fewer trials, deploys it on a freshly provisioned hundred-GPU cluster and watches utilization sit at one percent. The optimizer is busy being clever, proposing one configuration at a time, while ninety-nine GPUs hum quietly at idle and the bill arrives at full price. Sample efficiency and machine efficiency are not the same number, and the cluster only charges you for the second one.
4. Parallelizing Bayesian Optimization Intermediate
To use a cluster, Bayesian optimization must propose more than one configuration at a time, and every way of doing so trades a little of its sample efficiency back for parallelism. Two families dominate. Batch (synchronous) Bayesian optimization proposes a batch of $q$ points from the current surrogate, evaluates them in parallel on $q$ workers, then refits once all $q$ report. The difficulty is that the obvious choice (take the $q$ best acquisition points) returns $q$ nearly identical configurations clustered on the same predicted optimum, which wastes the batch. Practical batch methods enforce diversity, for example by penalizing candidates near already-selected ones or by using the model to hallucinate a plausible outcome for each pending point so the next selection accounts for it.
Asynchronous Bayesian optimization keeps all workers busy at all times: whenever any worker finishes, the surrogate is refit with the new result and a single fresh configuration is proposed for that worker, while the others keep running. No worker ever waits for the slowest, which matters greatly when training times vary across configurations, a near-certainty since larger models and higher epoch counts run longer. The cost is that each proposal is made against a surrogate missing the results of all the still-running trials, so the model is always working with slightly stale information. This is the same throughput-versus-information tension that runs through the whole book: synchronous methods preserve the most information per decision but stall on the slowest worker, while asynchronous methods maximize throughput at the price of acting on partial knowledge, exactly the choice between synchronous and asynchronous SGD in Chapter 10 and between synchronous and asynchronous actor-learner architectures in Chapter 20.
Hyperparameter search adds a layer of distribution on top of the training it controls: each trial may itself be a data-parallel job from Chapter 15, and the search loop coordinates many such jobs. Yet the governing trade-off is unchanged. Just as data-parallel SGD chooses how often workers synchronize their gradients, distributed Bayesian optimization chooses how often the search synchronizes its surrogate, and both pay for staleness in the currency of efficiency. Recognizing that the throughput-versus-information dilemma reappears at the search level, not just the gradient level, is what lets you reason about a full AutoML system instead of memorizing one more algorithm.
Who: An applied scientist tuning a vision model on a shared sixty-GPU research cluster.
Situation: Each trial trained for about three hours, and a fixed budget of 240 trials had to finish over a weekend before a Monday review.
Problem: A textbook Gaussian-process Bayesian optimizer reached good configurations in far fewer trials than random search, but it proposed one configuration at a time, so it could use only one GPU and would need a month.
Dilemma: Fall back to random search, which fills all sixty GPUs instantly but needs many more trials to match the quality, or keep the sample-efficient optimizer and leave fifty-nine GPUs idle.
Decision: They switched to asynchronous Bayesian optimization, refitting the surrogate each time any worker finished and immediately proposing one new configuration for that freed worker.
How: An Optuna study with a TPE sampler and sixty parallel workers sharing one storage backend let each worker pull its own freshly proposed trial without blocking on the others.
Result: Cluster utilization stayed above ninety percent, the 240-trial budget finished in roughly fourteen hours, and the asynchronous staleness cost only a handful of extra trials versus the idealized sequential run, a trade they would take every time.
Lesson: On a real cluster the right Bayesian optimizer is rarely the most sample-efficient one; it is the one that keeps the workers full while staying smarter than random.
Code 21.2.1 hand-built a surrogate, an acquisition function, and the proposal loop, roughly forty lines that still lacked parallelism. Optuna replaces all of it: you define an objective, pick a sampler (TPE here, the same model-based family), and ask for $n$ trials. Pointing many workers at one shared storage URL turns the identical script into asynchronous distributed Bayesian optimization, with the library handling proposal, surrogate fitting, result storage, and the coordination that keeps every worker busy.
# Each of N machines runs this same script; they share one storage backend.
import optuna
def objective(trial):
lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
wd = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
return train_and_validate(lr, wd) # returns the validation loss to MINIMIZE
study = optuna.create_study(
direction="minimize",
sampler=optuna.samplers.TPESampler(), # model-based, like Bayesian optimization
study_name="vision-tune",
storage="postgresql://host/optuna", # shared store = async distributed search
load_if_exists=True,
)
study.optimize(objective, n_trials=240) # workers pull trials concurrently
print(study.best_params, study.best_value)
storage backend is the entire trick: each worker proposes against the latest surrogate and writes its result back, so the cluster stays full without any explicit batching code. scikit-optimize (skopt) and BoTorch offer the same ideas with Gaussian-process surrogates when you want batch (synchronous) proposals instead.5. Choosing a Strategy Intermediate
The three strategies are not a ladder where each strictly dominates the last; they occupy a trade-off, summarized in Table 21.2.1. Grid search is defensible only for two or three axes you genuinely want exhaustively covered. Random search is the right default for high-dimensional spaces, for very large worker counts where coordination is unwelcome, and as the honest baseline every other method must beat. Bayesian optimization earns its complexity when evaluations are expensive enough that cutting their number matters more than keeping every worker perfectly busy, which is to say almost always for deep learning, provided you parallelize it asynchronously rather than running it one trial at a time.
| Strategy | Uses past results? | Trial efficiency | Parallelism | Best when |
|---|---|---|---|---|
| Grid | No | Poor (exponential in $D$) | Perfect, all up front | 2 to 3 axes, exhaustive coverage |
| Random | No | Good in high $D$ | Perfect, all up front | Many axes, huge clusters, baseline |
| Bayesian | Yes | Best per trial | Sequential; recovered via batch / async | Expensive evaluations, modest clusters |
One caveat keeps this section honest: every strategy here decides which configurations to run, but none decides how long to run each one. A search that runs each trial to full convergence still wastes its budget finishing trials that were clearly losing after the first epoch. The largest gains in distributed hyperparameter search come from coupling a good proposal strategy with early stopping of unpromising trials, which is the subject of Section 21.3 on multi-fidelity optimization. The methods of this section choose where to look; multi-fidelity methods choose how hard to look before giving up.
Because every distributed Bayesian optimizer pays a staleness tax for asynchrony, recent work tries to shrink that tax. BoTorch (Balandat et al.) has continued to advance Monte-Carlo acquisition functions whose gradients make large batch proposals tractable, so a single synchronous step can fill dozens of workers with genuinely diverse configurations rather than near-duplicates. A second thread brings large language models into the proposal loop: methods in the lineage of OptFormer and LLM-guided search use a model trained on prior tuning studies to warm-start the surrogate or to suggest configurations directly, cutting the cold-start trials in which classical Bayesian optimization is no better than random. A third pushes cost-awareness into the acquisition function itself, valuing a configuration not only by its expected improvement but by its expected improvement per GPU-hour, so the search prefers cheap-to-evaluate regions when they are nearly as good. We make that cost accounting explicit in this chapter's own Section 21.8 on cost-aware experimentation and return to it when evaluating search systems in Chapter 5.
We now have the proposal half of distributed hyperparameter search: grid and random search trade learning for perfect parallelism, while Bayesian optimization trades parallelism for learning and recovers it through batch and asynchronous variants. The remaining inefficiency is that all three commit a full training run to every configuration they try. Section 21.3 attacks that waste directly, allocating a small budget to many configurations and more budget only to the survivors.
Consider a search space with $D = 6$ hyperparameters, of which only $2$ meaningfully affect the validation loss, and a budget of $64$ trials. Describe the distinct values each strategy explores along an important axis: a grid with $2$ values per axis ($2^6 = 64$ points), and random search with $64$ points. Explain in terms of axis projections (Figure 21.2.1) why random search resolves the important axes far better at the identical budget, and state what would have to be true about the search space for grid search to be the better choice.
Take Code 21.2.1 and sweep the acquisition constant $\kappa$ over $\{0.0, 0.3, 0.6, 1.5, 3.0\}$, reporting the mean best-so-far loss at $30$ trials for each. Explain the shape of the resulting curve: why does $\kappa = 0$ (pure exploitation) underperform, and why does a very large $\kappa$ also underperform by drifting toward pure random search? Identify the $\kappa$ that best balances exploration and exploitation for this objective, and argue why the best value would change if the objective had several widely separated good basins.
You have a budget of $200$ trials and a cluster of $50$ workers; each trial trains for $2$ hours. Estimate the wall-clock time and the worker-hours consumed under three regimes: (a) random search using all $50$ workers; (b) purely sequential Bayesian optimization using one worker; (c) asynchronous Bayesian optimization using all $50$ workers, assuming its staleness costs $10\%$ more trials than the sequential ideal to reach the same loss. Compute the GPU-hour cost of each at a notional \$2 per worker-hour, and argue from your numbers why asynchronous Bayesian optimization is usually the right default on a busy cluster. We formalize this cost accounting in Chapter 3.