Part III: Distributed Machine Learning
Chapter 12: Distributed Classical Machine Learning

Distributed Classical Machine Learning

The methods that still win most tabular problems, linear and logistic regression, support vector machines, decision trees, random forests, gradient-boosted ensembles, clustering, and nearest-neighbor search, are not deep networks; this chapter shows how each of them is rebuilt so the data, the model, and the search can be spread across a cluster, and why a boosted-tree split or a nearest-neighbor query becomes a distributed-systems problem the moment the dataset stops fitting on one machine.

Conceptual illustration for Chapter 12: Distributed Classical Machine Learning

"They sent me a hundred million rows and asked for the best split. I do not hold the rows. I hold a histogram of them, gathered from forty workers who each saw a slice, and from that summary alone I must choose where the tree branches."

A Split-Finder Working Only From Aggregates
Big Picture

Chapters 10 and 11 built the engine of distributed learning, the data-parallel gradient and the parameter server that houses it; this chapter spends that engine on the classical algorithms that still solve most real prediction problems, and shows that each one becomes a distinct distributed-systems problem the moment its data outgrows a single machine. Classical machine learning is not a relic. On tabular data, the format of most business records, logs, and scientific tables, a gradient-boosted ensemble or a well-tuned linear model routinely beats a deep network, and the methods of this chapter are the ones a practitioner reaches for first. The chapter walks the canon in order of how its distribution works. It opens with the linear models, regression and logistic regression, whose training is a convex optimization that drops straight onto the data-parallel and parameter-server patterns you already hold: each worker computes a partial gradient on its shard, and an all-reduce or a push-pull sums them. Support vector machines come next, where the same convex-optimization framing meets a communication-efficient distributed solver that keeps most of the work local. Then the chapter turns to trees, where distribution stops being a sum of gradients and becomes a sum of histograms: a decision tree's best split is chosen from aggregated feature statistics, so the distributed primitive is the gather-and-reduce of per-feature histograms, not of gradients. That one idea powers the two great tree ensembles, random forests that train independent trees embarrassingly in parallel and gradient-boosted trees whose sequential dependency makes XGBoost and LightGBM masterworks of distributed split-finding. The chapter closes on the two unsupervised workhorses: distributed clustering, where k-means becomes a MapReduce-shaped reduce over cluster assignments, and approximate nearest-neighbor search, where billion-vector indexes are sharded and queried across a cluster in the structure that Part V will turn into production retrieval. The all-reduce of Chapter 10 and the sharding of Chapter 11 return throughout, now carrying histograms and centroids and index shards instead of dense weight vectors.

Chapter Overview

This is the third chapter of Part III, and it is the one that connects the distribution machinery of the first two chapters to the algorithms most practitioners actually deploy. Chapter 10 gave you the distributed gradient and Chapter 11 gave you the parameter server, but both used a deep-network-shaped model as the running example. This chapter takes the classical canon, the linear models, kernel machines, trees, ensembles, clustering, and nearest-neighbor search that dominate tabular and retrieval workloads, and asks the same question of each: when the data no longer fits on one machine, what exactly has to travel between machines, and which distributed primitive carries it. The answer is different for each family, and that difference is the spine of the chapter.

The seven sections fall into three groups. The first group is the convex learners, where training is an optimization and distribution is a sum of partial gradients: linear and logistic regression in Section 12.1 sit directly on the all-reduce and push-pull you already know, and support vector machines in Section 12.2 add a communication-efficient solver that does most of its work on local data and synchronizes sparingly. The second group is the trees and their ensembles, where the distributed primitive shifts from summing gradients to summing histograms of feature values, because a split is chosen from aggregated statistics rather than from a single gradient. Decision trees in Section 12.3 establish the histogram-based split, random forests in Section 12.4 exploit the fact that independent trees train in parallel with almost no coordination, and gradient boosting in Section 12.5 confronts the sequential dependency that makes distributed boosting hard and that XGBoost and LightGBM solve. The third group is the unsupervised and retrieval methods: distributed clustering in Section 12.6, where k-means is a reduce over assignments in the MapReduce shape of Part II, and approximate nearest-neighbor search in Section 12.7, where the index itself is sharded across machines and a query fans out to all shards and merges their candidates.

Read in order, the seven sections take you from "fit a line to data that does not fit on one disk" to "search a billion vectors in milliseconds across a cluster," and they show that the distributed primitive changes with the algorithm: a gradient sum for the linear models, a histogram gather for the trees, a centroid reduce for clustering, and a sharded fan-out for nearest-neighbor search. This is the chapter where the abstract patterns of Chapters 10 and 11 become a concrete toolkit, and where the approximate nearest-neighbor index built in the last section reappears as the retrieval backbone of distributed vector search in Part V.

Prerequisites

This chapter assumes the two chapters that opened Part III. From Chapter 10: Distributed Optimization you carry the data-parallel gradient, synchronous and asynchronous distributed SGD, and the all-reduce that sums partial gradients across workers; the linear models of Section 12.1 and the SVM solver of Section 12.2 are exactly that machinery applied to convex objectives, and the iterative reduces of clustering in Section 12.6 reuse the same synchronization pattern. From Chapter 4: Communication Primitives for Distributed Training you carry the collective operations, all-reduce, all-gather, and broadcast, that move histograms, centroids, and candidate sets between machines, and the bandwidth-and-latency reasoning that tells you when a histogram gather is cheaper than shipping the data. The chapter assumes comfortable Python and scikit-learn-style model APIs, single-machine versions of each algorithm (ordinary least squares, the SVM margin, the decision-tree split criterion, k-means, and a nearest-neighbor query), and the basic convex-optimization and linear-algebra background refreshed in Appendix A: Mathematical Background. Familiarity with the parameter-server sharding of Chapter 11: Parameter Servers and Distributed Embeddings helps with the sharded indexes of Section 12.7 but is not required.

Learning Objectives

Remember the Chapter as One Sentence

If you keep one thing from this chapter, keep this: every classical learner becomes a distributed-systems problem when its data outgrows one machine, and the craft is recognizing which distributed primitive each one needs, a gradient sum for the linear and kernel models, a histogram gather for trees and boosted ensembles, a centroid reduce for clustering, and a sharded fan-out-and-merge for nearest-neighbor search. Read forward, the sections build that recognition family by family: first the convex learners that drop onto the all-reduce and push-pull you already hold, then the tree methods where distribution means summing histograms of feature values rather than gradients, then the unsupervised and retrieval methods where the data and the index themselves are partitioned across the cluster. Read as a question, the chapter asks of any classical algorithm you want to scale: what summary statistic must travel between machines, and which collective carries it, and that question turns a textbook of single-machine algorithms into a toolkit for the cluster. The roadmap below walks the seven sections that build that toolkit.

Chapter Roadmap

Read the seven sections in order and you will hold a toolkit for scaling the classical canon across a cluster: Section 12.1 and Section 12.2 put the convex learners on the gradient-sum machinery of Chapter 10, Section 12.3 introduces the histogram gather that Sections 12.4 and 12.5 turn into distributed random forests and boosted trees, and Sections 12.6 and 12.7 close on distributed clustering and the sharded nearest-neighbor index. The thread to watch is the collective communication of Chapter 4 reappearing under each algorithm as a different payload, and the approximate-nearest-neighbor index of Section 12.7 returning as the retrieval backbone of Chapter 25.

What's Next?

This chapter scaled the classical canon across a cluster, the linear and kernel models on a gradient sum, the trees and boosted ensembles on a histogram gather, and clustering and nearest-neighbor search on a reduce and a sharded fan-out. Chapter 13: Distributed Graph Machine Learning takes the next step in generality, from data that arrives as independent rows to data that arrives as a graph, where each example is connected to others and the computation must follow the edges. Graph algorithms and graph neural networks raise a distribution problem the methods here did not face: the data cannot be sharded into independent pieces, because partitioning a graph cuts edges that the computation needs, so the chapter develops graph partitioning, the gather-scatter-apply pattern of vertex-centric computation, and the sampling tricks that make message passing tractable at billion-edge scale. Read it next, and watch the clean shard-and-reduce of this chapter give way to the harder problem of computing over data whose pieces refuse to come apart.

Bibliography & Further Reading

Distributed Frameworks and Linear Models

Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., et al. "MLlib: Machine Learning in Apache Spark." Journal of Machine Learning Research, 2016. jmlr.org

The reference for Spark's distributed machine-learning library, describing how linear models, trees, and clustering are implemented over a data-parallel engine, the production setting for most of this chapter.

📄 Paper

Smith, V., Forte, S., Ma, C., Takac, M., Jordan, M. I., Jaggi, M. "CoCoA: A General Framework for Communication-Efficient Distributed Optimization." arXiv:1611.02189, 2016. arxiv.org/abs/1611.02189

The communication-efficient framework that does most of the optimization locally on each shard, the basis for the distributed linear models of Section 12.1 and the SVM solver of Section 12.2.

📄 Paper

Trees, Forests, and Gradient Boosting

Breiman, L. "Random Forests." Machine Learning, 45(1), 2001. link.springer.com

The original random-forest paper, defining the ensemble of independent bootstrap-trained trees whose embarrassing parallelism makes Section 12.4 the easiest distribution in the chapter.

📄 Paper

Chen, T., Guestrin, C. "XGBoost: A Scalable Tree Boosting System." KDD, 2016. arxiv.org/abs/1603.02754

The system paper for the most widely used gradient-boosting library, whose distributed histogram-based split-finding is the centerpiece of Section 12.5.

📄 Paper

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." NeurIPS, 2017. papers.nips.cc

The boosting library whose histogram algorithm, gradient-based sampling, and feature bundling cut the communication and compute of distributed split-finding, the second pillar of Section 12.5.

📄 Paper

Distributed Clustering

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. "Scalable K-Means++." arXiv:1203.6402, 2012. arxiv.org/abs/1203.6402

The parallel initialization (k-means||) that replaces the sequential k-means++ seeding with a small number of distributed sampling rounds, the scalable initializer of Section 12.6.

📄 Paper

Approximate Nearest-Neighbor Search

Johnson, J., Douze, M., Jegou, H. "Billion-Scale Similarity Search with GPUs (FAISS)." arXiv:1702.08734, 2017. arxiv.org/abs/1702.08734

The FAISS paper for billion-vector similarity search on GPUs, the engine behind the sharded indexes and fan-out queries of Section 12.7.

📄 Paper

Malkov, Y. A., Yashunin, D. A. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (HNSW)." arXiv:1603.09320, 2016. arxiv.org/abs/1603.09320

The HNSW graph index that gives logarithmic-time approximate search and is the per-shard structure most distributed nearest-neighbor systems in Section 12.7 build on.

📄 Paper

Jegou, H., Douze, M., Schmid, C. "Product Quantization for Nearest Neighbor Search." IEEE TPAMI, 33(1), 2011. ieeexplore.ieee.org

The product-quantization method that compresses vectors so a billion of them fit in memory across a cluster, the storage backbone of the indexes in Section 12.7.

📄 Paper

Subramanya, S. J., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., Kadekodi, R. "DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node." NeurIPS, 2019. papers.nips.cc

The DiskANN index that pushes billion-point search onto SSD-backed nodes, the cost-efficient building block that shards into the distributed retrieval of Section 12.7.

📄 Paper