Part IV: Parallel Deep Learning and Large Models
Chapter 21: Distributed Hyperparameter Search and AutoML

Distributed Hyperparameter Search and AutoML

Every chapter of this part has scaled a single training run across many machines, holding the configuration fixed: a chosen learning rate, a chosen batch size, a chosen architecture, all decided before the cluster started. This chapter asks the question that production teams actually face once one run works, which is how to find the configuration in the first place. Hyperparameter search looks, at first glance, like the easiest distributed workload in the entire book, because the trials are independent and nothing needs to talk to anything else: launch a hundred configurations on a hundred workers and wait. That first glance is a trap. The trials are independent, but a finite compute budget is not, and the moment you have more configurations to try than accelerators to try them on, the real problem stops being parallelism and becomes allocation: which trials deserve more compute, which should be stopped early, and how a scheduler decides under uncertainty while runs are still in flight. This chapter develops that allocation problem from first principles, from the naive grid through random and Bayesian search, the multi-fidelity idea that you can judge a configuration before it finishes, the successive-halving and Hyperband schedulers that act on that idea, population-based training that mutates configurations mid-flight, the distributed scheduling and early-stopping machinery that makes it all run on a cluster, the Ray Tune and AutoML ecosystem that ships it, and the cost-aware experimentation that decides when the search itself is worth the spend. It closes Part IV by turning every single training run the part built into one point in a search the cluster runs over all of them.

Conceptual illustration for Chapter 21: Distributed Hyperparameter Search and AutoML

"I am one of two hundred trials, all of us convinced our learning rate is the special one. The scheduler does not share our optimism. Half of us will be quietly killed at epoch three, and the cruelest part is that it will probably be right."

A Trial That Plateaued Early
Big Picture

Distributed hyperparameter search is embarrassingly parallel in its trials and genuinely hard in its allocation: the configurations are independent and need no communication, so launching them on a cluster is trivial, but a finite compute budget is shared, so the binding problem is not parallelism but deciding which trials deserve more compute, which to stop early, and how a scheduler makes those calls under uncertainty while runs are still in flight. This is the chapter where "just run more configurations" stops being an answer and becomes a budget. Independence makes the easy part easy: there is no all-reduce here, no parameter server, no gradient to synchronize, just a queue of trials and a pool of workers, and the data-parallel and elastic machinery of the earlier chapters runs inside each trial without the trials ever needing to coordinate. What independence does not buy you is efficiency. With ten thousand configurations and a thousand worker-hours, the question is never "can I run them in parallel" but "which ones should I bother finishing," and that question has no answer until you accept the central idea of the chapter: you can estimate a configuration's final quality from a partial, cheap evaluation, and reallocate compute away from the ones that look bad before they finish. That single idea, multi-fidelity evaluation, is what separates a search that wastes ninety percent of its budget on doomed trials from one that concentrates compute on the promising few. Successive halving and Hyperband turn it into a schedule; population-based training turns it into an evolutionary loop that mutates live configurations instead of discarding them; the distributed scheduler turns it into early-stopping decisions made across a cluster of asynchronous workers reporting partial results at different rates. Underneath all of it sits the same scale-out thesis the book has been building: the hard part of running many experiments is not making them parallel, it is coordinating a shared, finite resource across independent actors that each think they deserve it. The cluster scheduling of Chapter 33 is the general version of this allocation problem; this chapter is its specialization to the case where the jobs being scheduled are themselves trial runs of the model, and where stopping one early is not a failure but the whole point.

Chapter Overview

This is the final chapter of Part IV, and it steps back from any single training run to ask how the configuration that run uses gets chosen at all. Chapters 15 through 19 scaled one model across many machines and the previous chapter built the distributed infrastructure that trains a policy by reinforcement; in every case the hyperparameters arrived fixed, decided by someone before the cluster spun up. This chapter is about that someone's job, mechanized and distributed. The workload it studies is unlike any other in the book in one specific way: the units of work are independent, so the parallelism is free, and that freedom is exactly what makes the chapter's real subject visible. When parallelism costs nothing, the only thing left to engineer is allocation, and the chapter is a sustained study of how to spend a fixed compute budget across many candidate configurations when you cannot afford to finish them all.

The eight sections fall into three movements. The first establishes the problem and the classical toolkit: Section 21.1 shows why search is embarrassingly parallel and why that is precisely not enough, and Section 21.2 develops grid, random, and Bayesian optimization as three answers to where in the configuration space to look. The second movement introduces the idea that reorganizes everything: Section 21.3 develops multi-fidelity optimization, the claim that a cheap partial evaluation predicts a configuration's final quality; Section 21.4 turns that claim into the successive-halving and Hyperband schedules; and Section 21.5 turns it into population-based training, which mutates live configurations rather than discarding the losers. The third movement makes it run on a cluster and pays for it: Section 21.6 builds the distributed trial scheduler and the early-stopping decisions it makes across asynchronous workers, Section 21.7 maps the whole construction onto Ray Tune and the surrounding AutoML ecosystem, and Section 21.8 closes with cost-aware experimentation, the discipline of deciding when the search is worth its own spend.

Read in order, the eight sections take you from "the trials are independent, so the easy part is easy and the hard part is allocation" to "you can place configurations with Bayesian search, evaluate them at low fidelity, schedule compute with Hyperband or population-based training, run the whole thing across a cluster with early stopping, drive it from Ray Tune, and reason about whether the search pays for itself." The argument is cumulative: independence frees the parallelism, multi-fidelity evaluation makes early decisions possible, the schedulers act on those decisions, the distributed machinery runs them at scale, and the cost analysis tells you when to stop searching and start training.

Prerequisites

This chapter assumes the cluster and reliability machinery the earlier chapters built. From Chapter 18: Elastic and Fault-Tolerant Distributed Training you carry checkpointing and the ability to pause, resume, and migrate a run, because multi-fidelity scheduling depends on being able to stop a trial at a checkpoint and either resume it later or discard it cleanly. The chapter also leans forward on the cluster scheduling and resource-allocation ideas developed in Chapter 33: Cluster Infrastructure and Scheduling, since a trial scheduler is a domain-specific scheduler placing jobs on a shared, finite pool of accelerators; you can read this chapter first and Chapter 33 will generalize what you build here. Beyond that the chapter assumes comfortable Python, the basic machine-learning vocabulary of training, validation, learning curves, and overfitting, and a working sense of what a hyperparameter is and why its value matters, but no prior experience with an automated search system. Section 21.1 builds the allocation framing from the ground up before any specific algorithm appears.

Learning Objectives

Remember the Chapter as One Sentence

If you keep one thing from this chapter, keep this: hyperparameter search is the rare distributed workload whose parallelism is free, because the trials are independent, which means the entire engineering problem is allocation, deciding under uncertainty which trials deserve more of a finite compute budget and which to stop early, and every method in the chapter, from Bayesian placement through multi-fidelity evaluation, Hyperband, population-based training, and distributed early-stopping schedulers, exists to spend that budget well. Read forward, the sections build the system in the order it is actually assembled: first the framing that names allocation as the real problem, then the classical placement strategies, then the multi-fidelity idea that makes early decisions possible, then the schedulers that act on it, then the distributed machinery that runs them at scale, and finally the cost analysis that decides when to search at all. Read as a question, the chapter asks of any search at scale: where should I evaluate, how cheaply can I tell a good configuration from a bad one, how do I reallocate compute away from the losers while runs are still in flight, and how do I know the search is worth more than just training the best guess I already have. The roadmap below walks the eight sections that answer it.

Chapter Roadmap

Read the eight sections in order and you will hold a working blueprint for a search that spends its budget well: Section 21.1 names allocation as the real problem behind the easy parallelism, Section 21.2 places configurations with grid, random, and Bayesian search, Section 21.3 introduces the multi-fidelity idea that makes early decisions possible, Sections 21.4 and 21.5 turn that idea into the Hyperband and population-based schedules, Section 21.6 runs them across a cluster with asynchronous early stopping, Section 21.7 maps the whole construction onto Ray Tune, and Section 21.8 decides when the search is worth its cost. The thread to watch is the earlier part reappearing inside each trial: the Chapter 18 checkpointing that lets a scheduler pause and resume a run, the data-parallel step running unchanged inside every configuration, and the cluster allocation of Chapter 33 as the general form of the trial scheduler you build here.

What's Next?

This chapter closes Part IV. Across seven chapters the part took a single model and scaled its training across many machines, through data, model, sharded, and expert parallelism, elastic and fault-tolerant execution, frontier pre-training, distributed reinforcement learning, and finally the search over the configurations themselves; the recurring lesson was that the binding constraint is never one accelerator in isolation but the coordination that keeps a cluster of them productive. With that, the book pivots from building models to serving them. Chapter 22: Per-Node Inference Efficiency: A Prerequisite opens Part V, and it opens it deliberately on one machine. Before any fleet of replicas can be reasoned about, the unit economics of a single node, its quantization, its KV cache, its attention kernels, have to be understood, because distributed serving multiplies per-node behavior across the fleet and a fleet built on an inefficient node is an expensive mistake repeated many times. Read it next, and watch the scale-out story turn around: where Part IV asked how to spread one training run across many machines, Part V starts by asking what one machine actually costs to serve, then multiplies that answer across a distributed inference system.

Bibliography & Further Reading

Search Strategies and Bayesian Optimization

Bergstra, J., Bengio, Y. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research 13, 2012. jmlr.org/papers/v13/bergstra12a.html

The paper that showed random search beats grid search when only a few hyperparameters matter, the surprising baseline that frames the placement question of Section 21.2.

📄 Paper

Snoek, J., Larochelle, H., Adams, R. P. "Practical Bayesian Optimization of Machine Learning Algorithms." arXiv:1206.2944, NeurIPS 2012. arxiv.org/abs/1206.2944

The work that made Gaussian-process Bayesian optimization a practical tool for tuning models, the surrogate-model placement strategy developed in Section 21.2.

📄 Paper

Multi-Fidelity Scheduling

Jamieson, K., Talwalkar, A. "Non-stochastic Best Arm Identification and Hyperparameter Optimization (Successive Halving)." arXiv:1502.07943, AISTATS 2016. arxiv.org/abs/1502.07943

The paper that cast hyperparameter tuning as a best-arm problem and introduced successive halving, the pruning schedule at the core of Section 21.4.

📄 Paper

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A. "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." arXiv:1603.06560, JMLR 2018. arxiv.org/abs/1603.06560

The algorithm that hedges successive halving across the cheap-many versus expensive-few trade-off, the central scheduler studied in Section 21.4.

📄 Paper

Falkner, S., Klein, A., Hutter, F. "BOHB: Robust and Efficient Hyperparameter Optimization at Scale." arXiv:1807.01774, ICML 2018. arxiv.org/abs/1807.01774

The method that combines Bayesian optimization with Hyperband, marrying the placement of Section 21.2 to the multi-fidelity scheduling of Section 21.4.

📄 Paper

Distributed Scheduling and Population Methods

Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Hardt, M., Recht, B., Talwalkar, A. "A System for Massively Parallel Hyperparameter Tuning (ASHA)." arXiv:1810.05934, MLSys 2020. arxiv.org/abs/1810.05934

The asynchronous successive-halving algorithm that makes early-stopping decisions across a cluster of workers reporting at different rates, the direct reference for Section 21.6.

📄 Paper

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., Kavukcuoglu, K. "Population Based Training of Neural Networks." arXiv:1711.09846, 2017. arxiv.org/abs/1711.09846

The work that introduced population-based training, the evolutionary exploit-and-explore loop over live configurations developed in Section 21.5.

📄 Paper

Frameworks and AutoML Systems

Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., Stoica, I. "Tune: A Research Platform for Distributed Model Selection and Training." arXiv:1807.05118, 2018. arxiv.org/abs/1807.05118

The platform that packages distributed search, schedulers, and early stopping behind a small API, the primary framework mapped onto the chapter in Section 21.7.

📄 Paper

Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M. "Optuna: A Next-generation Hyperparameter Optimization Framework." arXiv:1907.10902, KDD 2019. arxiv.org/abs/1907.10902

The define-by-run optimization framework with built-in pruning that complements Ray Tune in the ecosystem surveyed in Section 21.7.

📄 Paper

Scaling Hyperparameters to Large Models

Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., Gao, J. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP)." arXiv:2203.03466, NeurIPS 2021. arxiv.org/abs/2203.03466

The result that tunes a small model and transfers the hyperparameters to a much larger one, a cost-aware alternative to searching at full scale that informs Section 21.8.

📄 Paper