Part III: Distributed Machine Learning
Chapter 12: Distributed Classical Machine Learning

Random Forests at Scale

"The other models kept asking me to wait at the barrier. I grew a hundred trees, never spoke to a single neighbor, and we agreed on the answer anyway."

A Worker That Never Joined the Group
Big Picture

A random forest is the one classical model whose training needs no communication at all: each tree is grown on its own bootstrap sample with its own random feature choices, so the trees are statistically independent and can be trained on different machines that never talk to each other. This is the ideal scale-out case, the exact opposite of the support vector machine in Section 12.2, where every worker had to exchange dual variables on a barrier. Here the only coordination is a single gather at the end to collect the finished trees, and a single average at inference time. The same independence that makes the forest trivial to parallelize is also what makes it statistically powerful: averaging many independent, high-variance trees drives the ensemble variance down without raising its bias, and the examples each tree never saw hand you a validation score for free. This section makes that variance reduction precise, runs it on real numbers, and shows what changes when the data per tree no longer fits on one machine.

The previous section built a single distributed decision tree by aggregating per-shard histograms, a method in which all workers cooperate on one tree and must agree, split by split, on where to cut. A random forest takes the opposite stance. Instead of many machines building one tree together, it builds many trees, each on one machine, each deliberately decorrelated from the others, and combines them only at the very end. Because a single deep decision tree is notoriously unstable, a small change in the training data can flip whole subtrees, the forest turns that instability into an asset: it grows a crowd of unstable trees on purpose, then averages away their individual quirks. The result is an estimator that is both extremely accurate out of the box and, for our purposes, the cleanest illustration in this book of embarrassingly parallel training.

Training set N rows bootstrap + random features Worker 1 tree on sample 1 Worker 2 tree on sample 2 Worker T tree on sample T no communication during training Average (vote / mean prob.) one prediction
Figure 12.4.1: The random-forest dataflow. One training set is resampled with replacement into $T$ bootstrap samples, each handed to a worker that grows a single tree using a random subset of features at every split. The dashed band marks the training phase, during which the workers never communicate. Only two coordination points exist: the initial fan-out of samples and the final average (a majority vote for classification, a mean for regression). Contrast this with the per-split histogram all-reduce of Section 12.3, where the workers cooperate on a single tree.

1. Why Bagging Trees Reduces Variance Intermediate

The statistical engine of a random forest is bagging, short for bootstrap aggregating. Take the training set, draw $T$ samples of the same size with replacement, fit one predictor on each sample, and average the predictions. The reason this helps is a fact about the variance of an average, which we can state exactly. Suppose each tree produces a prediction modeled as a random variable with the same variance $\sigma^2$, where the randomness comes from the bootstrap draw and the random feature subsets. If the $T$ trees were perfectly independent, the variance of their mean would be

$$\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right) = \frac{\sigma^2}{T}.$$

Averaging $T$ independent estimators shrinks the variance by a factor of $T$ while leaving the bias untouched, because the mean of unbiased estimators is unbiased. That is the whole trick: trees have low bias and high variance, so averaging many of them keeps the low bias and crushes the variance. Real trees are not perfectly independent, however, because they are grown on overlapping bootstrap samples of the same data. If any two trees have pairwise correlation $\rho$, the variance of the average becomes

$$\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2.$$

As $T$ grows, the second term vanishes but the first does not: the residual variance is floored at $\rho\sigma^2$. This single formula explains the central design choice of the random forest. Bagging alone leaves the trees fairly correlated, because a few strong features dominate the top splits of every tree. Drawing a fresh random subset of features as a candidate set at each split, the "random" in random forest, forces different trees to use different features, lowering $\rho$ and therefore lowering the floor that averaging can reach. The forest is engineered to make $\rho$ small so that the $\tfrac{1-\rho}{T}\sigma^2$ term has room to work.

Key Insight: Independence Is Both the Statistics and the Systems Win

The very property that makes bagging reduce variance, low correlation between trees, is the same property that makes the forest embarrassingly parallel. Because each tree depends only on its own bootstrap sample and its own random feature draws, nothing flows between trees during training, so there is no all-reduce, no barrier, no staleness, and no straggler coupling. A model is easy to distribute exactly when its components are statistically independent, and the random forest is the purest classical example. The decorrelation you add for accuracy is communication you never have to pay for.

2. Training a Forest in Parallel, From Scratch Intermediate

The code below grows a small forest the way a cluster would, except on one machine for reproducibility. Each tree gets its own random number generator seeded independently, draws its own bootstrap sample, and is fit with a random feature subset at every split. Crucially, the loop that builds the trees has no cross-tree state: replacing it with a parallel map across workers would change nothing about the result, which is precisely the point. After training, we measure two things. First, how the variance of the ensemble's prediction falls as we average in more trees, the $\sigma^2/T$ law made visible. Second, the out-of-bag error, computed entirely from rows each tree never saw.

import numpy as np

rng = np.random.default_rng(7)

# ----- a small, noisy binary-classification problem -----
N, d = 2000, 8
X = rng.standard_normal((N, d))
signal = (X[:, 0] * X[:, 1] - X[:, 2] ** 2 + 0.5 * X[:, 3]) > 0   # nonlinear target
flip = rng.random(N) < 0.10                                       # 10% label noise
y = (signal ^ flip).astype(np.int64)

n_test = 600
Xte, yte = X[:n_test], y[:n_test]
Xtr, ytr = X[n_test:], y[n_test:]
n_train = len(ytr)


class Tree:
    """Minimal depth-limited CART with a random feature subset per split."""
    def __init__(self, max_depth=6, n_feat=3, rng=None):
        self.max_depth, self.n_feat, self.rng = max_depth, n_feat, rng

    def fit(self, X, y):
        self.root = self._grow(X, y, 0)
        return self

    def _gini(self, y):
        if len(y) == 0:
            return 0.0
        p = np.mean(y)
        return 1.0 - p * p - (1 - p) * (1 - p)

    def _grow(self, X, y, depth):
        if depth >= self.max_depth or len(y) < 5 or y.min() == y.max():
            return float(np.mean(y)) if len(y) else 0.5
        feats = self.rng.choice(X.shape[1], self.n_feat, replace=False)   # random subset
        best = None
        for f in feats:
            thr = np.median(X[:, f])
            left = X[:, f] <= thr
            if left.all() or (~left).all():
                continue
            g = (left.sum() * self._gini(y[left])
                 + (~left).sum() * self._gini(y[~left])) / len(y)
            if best is None or g < best[0]:
                best = (g, f, thr, left)
        if best is None:
            return float(np.mean(y))
        _, f, thr, left = best
        return (f, thr,
                self._grow(X[left], y[left], depth + 1),
                self._grow(X[~left], y[~left], depth + 1))

    def predict_proba(self, X):
        return np.array([self._one(x, self.root) for x in X])

    def _one(self, x, node):
        if not isinstance(node, tuple):
            return node
        f, thr, lo, hi = node
        return self._one(x, lo if x[f] <= thr else hi)


# ----- a bag of independent bootstrap trees (the embarrassingly parallel part) -----
n_trees = 60
forest, oob_masks = [], []
for t in range(n_trees):
    tree_rng = np.random.default_rng(1000 + t)          # each tree is independent
    idx = tree_rng.integers(0, n_train, n_train)         # bootstrap sample (with replacement)
    in_bag = np.zeros(n_train, dtype=bool)
    in_bag[idx] = True
    tree = Tree(max_depth=6, n_feat=3, rng=tree_rng).fit(Xtr[idx], ytr[idx])
    forest.append(tree)
    oob_masks.append(~in_bag)                            # rows this tree never saw

tree_test = np.array([tr.predict_proba(Xte) for tr in forest])   # (n_trees, n_test)

# Variance of the ENSEMBLE estimate: averaging m independent trees should shrink it
# like 1/m. We estimate it by splitting the 60 trees into disjoint groups of m and
# measuring the variance of each group's mean prediction.
single_tree_var = tree_test.var(axis=0).mean()
print("trees   ensemble_var   1/m_predicted   test_error")
for m in (1, 2, 4, 8, 16, 32, 60):
    g = n_trees // m                                     # disjoint m-tree groups
    group_means = np.array([tree_test[i*m:(i+1)*m].mean(axis=0) for i in range(g)])
    ens_var = group_means.var(axis=0).mean() if g > 1 else single_tree_var / m
    err = np.mean((tree_test[:m].mean(axis=0) > 0.5).astype(int) != yte)
    print(f"{m:5d}   {ens_var:11.4f}   {single_tree_var / m:13.4f}   {err:9.3f}")

# ----- out-of-bag error: free validation, no held-out split needed -----
oob_sum = np.zeros(n_train)
oob_cnt = np.zeros(n_train)
for tr, mask in zip(forest, oob_masks):
    oob_sum[mask] += tr.predict_proba(Xtr[mask])
    oob_cnt[mask] += 1
voted = oob_cnt > 0
oob_err = np.mean((oob_sum[voted] / oob_cnt[voted] > 0.5).astype(int) != ytr[voted])
print(f"\nout-of-bag error      : {oob_err:.3f}  (scored on {voted.sum()} train rows)")
print(f"held-out test error   : {np.mean((tree_test.mean(0) > 0.5).astype(int) != yte):.3f}")
Code 12.4.1: An independent-tree forest in pure NumPy. The training loop carries no cross-tree state, so each iteration is a self-contained unit of work that a real cluster would run on a separate worker. The ensemble-variance scan and the out-of-bag scoring both read off the cached per-tree test predictions.
trees   ensemble_var   1/m_predicted   test_error
    1        0.0405          0.0405       0.380
    2        0.0200          0.0203       0.340
    4        0.0098          0.0101       0.282
    8        0.0044          0.0051       0.270
   16        0.0018          0.0025       0.247
   32        0.0013          0.0013       0.248
   60        0.0007          0.0007       0.250

out-of-bag error      : 0.239  (scored on 1400 train rows)
held-out test error   : 0.250
Output 12.4.1: The measured ensemble variance tracks the $\sigma^2/T$ prediction closely as trees are added, falling from $0.0405$ for a single tree to $0.0007$ for sixty. Test error drops sharply and then plateaus near $0.25$ once the variance is mostly averaged out (the noisy target floors accuracy). The out-of-bag error of $0.239$ lands within a hair of the held-out test error of $0.250$, validating the model with no separate test split.

Two lessons sit in that output. The variance column is the bagging formula in the wild: each doubling of the tree count roughly halves the ensemble variance, exactly as $\sigma^2/T$ promises, until correlation between trees starts to matter and the gains taper. The test-error column shows why we bother: a single tree misclassifies $38\%$ of the test set, while the full forest reaches $25\%$ on a problem with $10\%$ irreducible label noise, close to the best any model could do. And the final two lines deliver the out-of-bag estimate, which we examine next.

3. Out-of-Bag Evaluation for Free Beginner

A bootstrap sample of size $N$ drawn with replacement leaves out, on average, a predictable fraction of the original rows. The probability that a given row is missed by one draw is $1 - 1/N$, so the probability it is missed by all $N$ draws is $\left(1 - 1/N\right)^{N}$, which converges to $e^{-1} \approx 0.368$ as $N$ grows. Roughly a third of the data is out-of-bag for each tree, meaning that tree never trained on it and can be scored on it without leakage. To evaluate the whole forest without a held-out split, you predict each training row using only the trees that did not see it, then compare against the true label. Output 12.4.1 did exactly this, and its out-of-bag error of $0.239$ matched the independent test error to within sampling noise. The forest grades its own homework, and the grade is trustworthy.

Fun Note: One Over e Shows Up Uninvited

The constant $0.368$ is the same $e^{-1}$ that governs the secretary problem, the expected number of fixed-point-free permutations, and the chance that a randomly shuffled deck has no card in its original position. Bagging did not ask for it; it falls out of sampling $N$ items with replacement from $N$. Every random forest you have ever trained quietly threw away about $36.8\%$ of each tree's data and handed it back to you as a validation set.

This free validation is a genuine systems advantage at scale. In a distributed setting, holding out a clean test partition means coordinating which rows are reserved across every worker and keeping that partition consistent through reshuffles. Out-of-bag scoring sidesteps all of it: each worker already knows which of its rows a given tree skipped, so the validation signal is computed locally and combined with the same final gather that collects the trees. You get an unbiased generalization estimate as a byproduct of training, with no extra data movement, which is why production forest implementations expose it by default.

4. Two Regimes: When the Data Fits, and When It Does Not Intermediate

How a forest is distributed depends entirely on whether one tree's training data fits on one machine. The two regimes call for completely different strategies, and recognizing which one you are in is the main design decision. Table 12.4.1 lays them out alongside the communication each demands.

Table 12.4.1: The two scaling regimes for random forests. The first is the embarrassingly parallel ideal; the second borrows the cooperative tree-building of the previous section because no single machine can hold a tree's data.
RegimeConditionHow each tree is trainedCommunication
Tree-parallel (data fits per worker)One bootstrap sample fits in a worker's memoryEach worker grows whole trees independently and locallyNone during training; one gather of finished trees at the end
Data-parallel per tree (data too big)Even one tree's data exceeds a single machineEach tree is built cooperatively with the distributed histogram method of Section 12.3, or on a sampled subset that does fitPer-split histogram all-reduce, like a distributed single tree

The first regime is the common one and the reason forests have a reputation for trivial scaling. When data fits per worker, you replicate or shard the dataset so each machine can draw bootstrap samples, then assign each worker a batch of trees to grow alone. With $T$ trees and $W$ workers, each worker grows about $T/W$ trees, and the wall-clock time falls almost linearly in $W$ because there is no synchronization to stall on. The only coordination is collecting the trees, a payload of model structure that is tiny compared with the data. This is as close to perfect linear speedup as distributed training ever gets, bounded only by the fan-out and gather, which Amdahl's law analysis in Chapter 3 shows are negligible here.

The second regime appears when even a single tree's bootstrap sample is too large for one machine, as with web-scale tabular data. Now you cannot grow a tree in isolation, so each tree is built the cooperative way: workers hold row shards, exchange per-split feature histograms with the all-reduce of Section 12.3, and agree on splits together. The forest's outer loop over trees is still parallel, but each tree's inner construction now costs communication. A common shortcut avoids this entirely: train each tree on a subsample small enough to fit one machine, which trades a little accuracy per tree for a return to the embarrassingly parallel regime, and the ensemble average recovers much of the lost accuracy.

Library Shortcut: sklearn for One Node, Spark MLlib for the Cluster

Code 12.4.1 spent roughly a hundred lines growing trees by hand. On a single multicore machine, scikit-learn collapses the whole thing to a few lines and parallelizes the trees across cores automatically with n_jobs, including the out-of-bag score:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=60, max_depth=6,
                            max_features="sqrt",   # random feature subset per split
                            oob_score=True,        # free validation, computed for you
                            n_jobs=-1)             # one tree per core, in parallel
rf.fit(X_train, y_train)
print(rf.oob_score_)                               # out-of-bag accuracy

When the data outgrows one machine, Spark MLlib's RandomForestClassifier distributes the same algorithm across a cluster, building trees from row-sharded DataFrames and aggregating split statistics with the shuffle of Chapter 6. Both handle bootstrap sampling, the per-split feature subsetting, the tree gather, and the final vote internally, so the hundred lines become about five, and the parallelism is the library's job, not yours.

Practical Example: The Fraud Model That Outgrew Its Cron Job

Who: A data scientist on the risk team at a payments company.

Situation: A nightly random forest scored card transactions for fraud, trained on a single beefy machine with scikit-learn and n_jobs=-1.

Problem: The training table grew past 400 GB as new merchants onboarded, and a single bootstrap sample no longer fit in the machine's memory, so the job started failing.

Dilemma: Rent an even larger memory-optimized instance, a scale-up move with a hard ceiling and a steep hourly rate, or move to a cluster and rewrite the pipeline, which meant learning a distributed framework.

Decision: They moved to Spark MLlib but chose the subsampling shortcut: each tree trains on a sample small enough to fit a worker, keeping training embarrassingly parallel rather than paying per-split all-reduce.

How: They sharded the table across the cluster, set each tree to draw a bounded subsample, and grew 500 trees spread across 40 executors, with out-of-bag scoring left on for monitoring.

Result: Training finished in 18 minutes instead of timing out, the out-of-bag AUC matched the old held-out AUC, and total cost fell below the giant single instance because ordinary workers are cheaper per core.

Lesson: When a forest outgrows one machine, prefer the regime that preserves independence. Subsampling each tree keeps training communication-free and recovers accuracy through the ensemble average, beating both the scale-up box and the chattier per-tree all-reduce.

5. Parallel Inference and the Boosting Trade-off Intermediate

The independence that made training embarrassingly parallel makes inference parallel too. Each tree evaluates a query point on its own, following a root-to-leaf path that touches no other tree, so a forest's prediction is a fan-out of $T$ independent traversals followed by one average. On a serving fleet the trees can be partitioned across machines, each returning a partial vote that a coordinator combines, and the latency is set by the slowest single tree rather than the sum. This is the same scale-out shape as the training phase, which is unusual: most models distribute training and inference along different axes, but the forest uses one idea for both.

That symmetry comes at a price, and naming it sets up the next section. A random forest builds its trees independently, so no tree can correct another's mistakes; the ensemble only averages away variance, it never reduces bias below that of its individual trees. Gradient boosting, the subject of Section 12.5, makes the opposite bargain: it builds trees sequentially, each one fit to the errors the previous trees left behind, which drives bias down and typically wins on accuracy. But that sequential dependence is precisely what the forest avoids, so boosting cannot grow its trees in parallel the way a forest can; its parallelism has to be found inside each tree's construction instead. The trade-off is fundamental. Forests give you parallel, robust, and hard to overfit; boosting gives you sequential, more accurate, and more delicate to tune.

Research Frontier: Tabular Forests Versus Deep Nets (2024 to 2026)

A persistent finding is that tree ensembles remain the strongest models on medium-sized tabular data, a claim sharpened by the benchmark of Grinsztajn et al. (2022) and reinforced through 2024 to 2026 as deep tabular architectures kept failing to overtake them decisively. The frontier has split into two efforts. One scales the classics: distributed forest and boosting implementations now train on hundreds of millions of rows across clusters and GPUs, with work on histogram compression and communication-avoiding split aggregation to push the data-too-big regime of Section 4 toward the embarrassingly parallel ideal. The other tries to beat them: foundation models for tabular data such as the TabPFN line (Hollmann et al., 2025) perform in-context learning over a whole table in a single forward pass, and the live question is whether such models, which distribute very differently from a forest, can match bagged ensembles on large heterogeneous tables. For now the random forest remains the baseline every new tabular method must clear, and its scale-out story is the one to beat.

With the parallel, robust end of the ensemble spectrum mapped, we turn to its sequential, accuracy-first counterpart. The next section builds distributed gradient boosting, shows where its parallelism actually lives now that the trees can no longer be grown independently, and explains why systems like XGBoost and LightGBM win competitions while inheriting the communication costs the forest escaped. That comparison begins in Section 12.5.

Exercise 12.4.1: The Correlation Floor Conceptual

Using the variance formula $\rho\sigma^2 + \frac{1-\rho}{T}\sigma^2$, explain why a random forest that draws a random feature subset at every split can reach a lower variance than plain bagged trees, even when both use the same number of trees $T$ and the same per-tree variance $\sigma^2$. What does the formula predict happens to the benefit of adding more trees once $T$ is already large? Tie your answer to the plateau visible in the test-error column of Output 12.4.1.

Exercise 12.4.2: Make the Trees More Independent Coding

In Code 12.4.1, the per-split feature subset size is fixed at n_feat=3 out of 8 features. Re-run the ensemble-variance scan for n_feat=1, n_feat=3, and n_feat=8 (the last one uses every feature, so it is bagging without the random-subspace trick). Report the single-tree variance, the 60-tree ensemble variance, and the test error for each setting. Explain which setting lowers the pairwise correlation $\rho$ the most and why that does not automatically minimize test error, connecting your numbers to the bias-variance reading of the correlation floor.

Exercise 12.4.3: Pick the Regime and Estimate the Cost Analysis

You must train a 200-tree forest on a 2-terabyte tabular dataset across a cluster of machines with 64 GB of memory each. Decide which regime from Table 12.4.1 applies if (a) each tree trains on a 1% subsample, and (b) each tree must use the full dataset. For each case, state whether training requires per-split communication, and estimate how the wall-clock time scales as you go from 10 to 40 workers. Using the embarrassingly parallel argument and the Amdahl framing of Chapter 3, explain why the subsample plan can approach linear speedup while the full-data plan cannot.