"The other models kept asking me to wait at the barrier. I grew a hundred trees, never spoke to a single neighbor, and we agreed on the answer anyway."
A Worker That Never Joined the Group
A random forest is the one classical model whose training needs no communication at all: each tree is grown on its own bootstrap sample with its own random feature choices, so the trees are statistically independent and can be trained on different machines that never talk to each other. This is the ideal scale-out case, the exact opposite of the support vector machine in Section 12.2, where every worker had to exchange dual variables on a barrier. Here the only coordination is a single gather at the end to collect the finished trees, and a single average at inference time. The same independence that makes the forest trivial to parallelize is also what makes it statistically powerful: averaging many independent, high-variance trees drives the ensemble variance down without raising its bias, and the examples each tree never saw hand you a validation score for free. This section makes that variance reduction precise, runs it on real numbers, and shows what changes when the data per tree no longer fits on one machine.
The previous section built a single distributed decision tree by aggregating per-shard histograms, a method in which all workers cooperate on one tree and must agree, split by split, on where to cut. A random forest takes the opposite stance. Instead of many machines building one tree together, it builds many trees, each on one machine, each deliberately decorrelated from the others, and combines them only at the very end. Because a single deep decision tree is notoriously unstable, a small change in the training data can flip whole subtrees, the forest turns that instability into an asset: it grows a crowd of unstable trees on purpose, then averages away their individual quirks. The result is an estimator that is both extremely accurate out of the box and, for our purposes, the cleanest illustration in this book of embarrassingly parallel training.
1. Why Bagging Trees Reduces Variance Intermediate
The statistical engine of a random forest is bagging, short for bootstrap aggregating. Take the training set, draw $T$ samples of the same size with replacement, fit one predictor on each sample, and average the predictions. The reason this helps is a fact about the variance of an average, which we can state exactly. Suppose each tree produces a prediction modeled as a random variable with the same variance $\sigma^2$, where the randomness comes from the bootstrap draw and the random feature subsets. If the $T$ trees were perfectly independent, the variance of their mean would be
$$\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right) = \frac{\sigma^2}{T}.$$Averaging $T$ independent estimators shrinks the variance by a factor of $T$ while leaving the bias untouched, because the mean of unbiased estimators is unbiased. That is the whole trick: trees have low bias and high variance, so averaging many of them keeps the low bias and crushes the variance. Real trees are not perfectly independent, however, because they are grown on overlapping bootstrap samples of the same data. If any two trees have pairwise correlation $\rho$, the variance of the average becomes
$$\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2.$$As $T$ grows, the second term vanishes but the first does not: the residual variance is floored at $\rho\sigma^2$. This single formula explains the central design choice of the random forest. Bagging alone leaves the trees fairly correlated, because a few strong features dominate the top splits of every tree. Drawing a fresh random subset of features as a candidate set at each split, the "random" in random forest, forces different trees to use different features, lowering $\rho$ and therefore lowering the floor that averaging can reach. The forest is engineered to make $\rho$ small so that the $\tfrac{1-\rho}{T}\sigma^2$ term has room to work.
The very property that makes bagging reduce variance, low correlation between trees, is the same property that makes the forest embarrassingly parallel. Because each tree depends only on its own bootstrap sample and its own random feature draws, nothing flows between trees during training, so there is no all-reduce, no barrier, no staleness, and no straggler coupling. A model is easy to distribute exactly when its components are statistically independent, and the random forest is the purest classical example. The decorrelation you add for accuracy is communication you never have to pay for.
2. Training a Forest in Parallel, From Scratch Intermediate
The code below grows a small forest the way a cluster would, except on one machine for reproducibility. Each tree gets its own random number generator seeded independently, draws its own bootstrap sample, and is fit with a random feature subset at every split. Crucially, the loop that builds the trees has no cross-tree state: replacing it with a parallel map across workers would change nothing about the result, which is precisely the point. After training, we measure two things. First, how the variance of the ensemble's prediction falls as we average in more trees, the $\sigma^2/T$ law made visible. Second, the out-of-bag error, computed entirely from rows each tree never saw.
import numpy as np
rng = np.random.default_rng(7)
# ----- a small, noisy binary-classification problem -----
N, d = 2000, 8
X = rng.standard_normal((N, d))
signal = (X[:, 0] * X[:, 1] - X[:, 2] ** 2 + 0.5 * X[:, 3]) > 0 # nonlinear target
flip = rng.random(N) < 0.10 # 10% label noise
y = (signal ^ flip).astype(np.int64)
n_test = 600
Xte, yte = X[:n_test], y[:n_test]
Xtr, ytr = X[n_test:], y[n_test:]
n_train = len(ytr)
class Tree:
"""Minimal depth-limited CART with a random feature subset per split."""
def __init__(self, max_depth=6, n_feat=3, rng=None):
self.max_depth, self.n_feat, self.rng = max_depth, n_feat, rng
def fit(self, X, y):
self.root = self._grow(X, y, 0)
return self
def _gini(self, y):
if len(y) == 0:
return 0.0
p = np.mean(y)
return 1.0 - p * p - (1 - p) * (1 - p)
def _grow(self, X, y, depth):
if depth >= self.max_depth or len(y) < 5 or y.min() == y.max():
return float(np.mean(y)) if len(y) else 0.5
feats = self.rng.choice(X.shape[1], self.n_feat, replace=False) # random subset
best = None
for f in feats:
thr = np.median(X[:, f])
left = X[:, f] <= thr
if left.all() or (~left).all():
continue
g = (left.sum() * self._gini(y[left])
+ (~left).sum() * self._gini(y[~left])) / len(y)
if best is None or g < best[0]:
best = (g, f, thr, left)
if best is None:
return float(np.mean(y))
_, f, thr, left = best
return (f, thr,
self._grow(X[left], y[left], depth + 1),
self._grow(X[~left], y[~left], depth + 1))
def predict_proba(self, X):
return np.array([self._one(x, self.root) for x in X])
def _one(self, x, node):
if not isinstance(node, tuple):
return node
f, thr, lo, hi = node
return self._one(x, lo if x[f] <= thr else hi)
# ----- a bag of independent bootstrap trees (the embarrassingly parallel part) -----
n_trees = 60
forest, oob_masks = [], []
for t in range(n_trees):
tree_rng = np.random.default_rng(1000 + t) # each tree is independent
idx = tree_rng.integers(0, n_train, n_train) # bootstrap sample (with replacement)
in_bag = np.zeros(n_train, dtype=bool)
in_bag[idx] = True
tree = Tree(max_depth=6, n_feat=3, rng=tree_rng).fit(Xtr[idx], ytr[idx])
forest.append(tree)
oob_masks.append(~in_bag) # rows this tree never saw
tree_test = np.array([tr.predict_proba(Xte) for tr in forest]) # (n_trees, n_test)
# Variance of the ENSEMBLE estimate: averaging m independent trees should shrink it
# like 1/m. We estimate it by splitting the 60 trees into disjoint groups of m and
# measuring the variance of each group's mean prediction.
single_tree_var = tree_test.var(axis=0).mean()
print("trees ensemble_var 1/m_predicted test_error")
for m in (1, 2, 4, 8, 16, 32, 60):
g = n_trees // m # disjoint m-tree groups
group_means = np.array([tree_test[i*m:(i+1)*m].mean(axis=0) for i in range(g)])
ens_var = group_means.var(axis=0).mean() if g > 1 else single_tree_var / m
err = np.mean((tree_test[:m].mean(axis=0) > 0.5).astype(int) != yte)
print(f"{m:5d} {ens_var:11.4f} {single_tree_var / m:13.4f} {err:9.3f}")
# ----- out-of-bag error: free validation, no held-out split needed -----
oob_sum = np.zeros(n_train)
oob_cnt = np.zeros(n_train)
for tr, mask in zip(forest, oob_masks):
oob_sum[mask] += tr.predict_proba(Xtr[mask])
oob_cnt[mask] += 1
voted = oob_cnt > 0
oob_err = np.mean((oob_sum[voted] / oob_cnt[voted] > 0.5).astype(int) != ytr[voted])
print(f"\nout-of-bag error : {oob_err:.3f} (scored on {voted.sum()} train rows)")
print(f"held-out test error : {np.mean((tree_test.mean(0) > 0.5).astype(int) != yte):.3f}")
trees ensemble_var 1/m_predicted test_error
1 0.0405 0.0405 0.380
2 0.0200 0.0203 0.340
4 0.0098 0.0101 0.282
8 0.0044 0.0051 0.270
16 0.0018 0.0025 0.247
32 0.0013 0.0013 0.248
60 0.0007 0.0007 0.250
out-of-bag error : 0.239 (scored on 1400 train rows)
held-out test error : 0.250
Two lessons sit in that output. The variance column is the bagging formula in the wild: each doubling of the tree count roughly halves the ensemble variance, exactly as $\sigma^2/T$ promises, until correlation between trees starts to matter and the gains taper. The test-error column shows why we bother: a single tree misclassifies $38\%$ of the test set, while the full forest reaches $25\%$ on a problem with $10\%$ irreducible label noise, close to the best any model could do. And the final two lines deliver the out-of-bag estimate, which we examine next.
3. Out-of-Bag Evaluation for Free Beginner
A bootstrap sample of size $N$ drawn with replacement leaves out, on average, a predictable fraction of the original rows. The probability that a given row is missed by one draw is $1 - 1/N$, so the probability it is missed by all $N$ draws is $\left(1 - 1/N\right)^{N}$, which converges to $e^{-1} \approx 0.368$ as $N$ grows. Roughly a third of the data is out-of-bag for each tree, meaning that tree never trained on it and can be scored on it without leakage. To evaluate the whole forest without a held-out split, you predict each training row using only the trees that did not see it, then compare against the true label. Output 12.4.1 did exactly this, and its out-of-bag error of $0.239$ matched the independent test error to within sampling noise. The forest grades its own homework, and the grade is trustworthy.
The constant $0.368$ is the same $e^{-1}$ that governs the secretary problem, the expected number of fixed-point-free permutations, and the chance that a randomly shuffled deck has no card in its original position. Bagging did not ask for it; it falls out of sampling $N$ items with replacement from $N$. Every random forest you have ever trained quietly threw away about $36.8\%$ of each tree's data and handed it back to you as a validation set.
This free validation is a genuine systems advantage at scale. In a distributed setting, holding out a clean test partition means coordinating which rows are reserved across every worker and keeping that partition consistent through reshuffles. Out-of-bag scoring sidesteps all of it: each worker already knows which of its rows a given tree skipped, so the validation signal is computed locally and combined with the same final gather that collects the trees. You get an unbiased generalization estimate as a byproduct of training, with no extra data movement, which is why production forest implementations expose it by default.
4. Two Regimes: When the Data Fits, and When It Does Not Intermediate
How a forest is distributed depends entirely on whether one tree's training data fits on one machine. The two regimes call for completely different strategies, and recognizing which one you are in is the main design decision. Table 12.4.1 lays them out alongside the communication each demands.
| Regime | Condition | How each tree is trained | Communication |
|---|---|---|---|
| Tree-parallel (data fits per worker) | One bootstrap sample fits in a worker's memory | Each worker grows whole trees independently and locally | None during training; one gather of finished trees at the end |
| Data-parallel per tree (data too big) | Even one tree's data exceeds a single machine | Each tree is built cooperatively with the distributed histogram method of Section 12.3, or on a sampled subset that does fit | Per-split histogram all-reduce, like a distributed single tree |
The first regime is the common one and the reason forests have a reputation for trivial scaling. When data fits per worker, you replicate or shard the dataset so each machine can draw bootstrap samples, then assign each worker a batch of trees to grow alone. With $T$ trees and $W$ workers, each worker grows about $T/W$ trees, and the wall-clock time falls almost linearly in $W$ because there is no synchronization to stall on. The only coordination is collecting the trees, a payload of model structure that is tiny compared with the data. This is as close to perfect linear speedup as distributed training ever gets, bounded only by the fan-out and gather, which Amdahl's law analysis in Chapter 3 shows are negligible here.
The second regime appears when even a single tree's bootstrap sample is too large for one machine, as with web-scale tabular data. Now you cannot grow a tree in isolation, so each tree is built the cooperative way: workers hold row shards, exchange per-split feature histograms with the all-reduce of Section 12.3, and agree on splits together. The forest's outer loop over trees is still parallel, but each tree's inner construction now costs communication. A common shortcut avoids this entirely: train each tree on a subsample small enough to fit one machine, which trades a little accuracy per tree for a return to the embarrassingly parallel regime, and the ensemble average recovers much of the lost accuracy.
Code 12.4.1 spent roughly a hundred lines growing trees by hand. On a single multicore machine, scikit-learn collapses the whole thing to a few lines and parallelizes the trees across cores automatically with n_jobs, including the out-of-bag score:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=60, max_depth=6,
max_features="sqrt", # random feature subset per split
oob_score=True, # free validation, computed for you
n_jobs=-1) # one tree per core, in parallel
rf.fit(X_train, y_train)
print(rf.oob_score_) # out-of-bag accuracy
When the data outgrows one machine, Spark MLlib's RandomForestClassifier distributes the same algorithm across a cluster, building trees from row-sharded DataFrames and aggregating split statistics with the shuffle of Chapter 6. Both handle bootstrap sampling, the per-split feature subsetting, the tree gather, and the final vote internally, so the hundred lines become about five, and the parallelism is the library's job, not yours.
Who: A data scientist on the risk team at a payments company.
Situation: A nightly random forest scored card transactions for fraud, trained on a single beefy machine with scikit-learn and n_jobs=-1.
Problem: The training table grew past 400 GB as new merchants onboarded, and a single bootstrap sample no longer fit in the machine's memory, so the job started failing.
Dilemma: Rent an even larger memory-optimized instance, a scale-up move with a hard ceiling and a steep hourly rate, or move to a cluster and rewrite the pipeline, which meant learning a distributed framework.
Decision: They moved to Spark MLlib but chose the subsampling shortcut: each tree trains on a sample small enough to fit a worker, keeping training embarrassingly parallel rather than paying per-split all-reduce.
How: They sharded the table across the cluster, set each tree to draw a bounded subsample, and grew 500 trees spread across 40 executors, with out-of-bag scoring left on for monitoring.
Result: Training finished in 18 minutes instead of timing out, the out-of-bag AUC matched the old held-out AUC, and total cost fell below the giant single instance because ordinary workers are cheaper per core.
Lesson: When a forest outgrows one machine, prefer the regime that preserves independence. Subsampling each tree keeps training communication-free and recovers accuracy through the ensemble average, beating both the scale-up box and the chattier per-tree all-reduce.
5. Parallel Inference and the Boosting Trade-off Intermediate
The independence that made training embarrassingly parallel makes inference parallel too. Each tree evaluates a query point on its own, following a root-to-leaf path that touches no other tree, so a forest's prediction is a fan-out of $T$ independent traversals followed by one average. On a serving fleet the trees can be partitioned across machines, each returning a partial vote that a coordinator combines, and the latency is set by the slowest single tree rather than the sum. This is the same scale-out shape as the training phase, which is unusual: most models distribute training and inference along different axes, but the forest uses one idea for both.
That symmetry comes at a price, and naming it sets up the next section. A random forest builds its trees independently, so no tree can correct another's mistakes; the ensemble only averages away variance, it never reduces bias below that of its individual trees. Gradient boosting, the subject of Section 12.5, makes the opposite bargain: it builds trees sequentially, each one fit to the errors the previous trees left behind, which drives bias down and typically wins on accuracy. But that sequential dependence is precisely what the forest avoids, so boosting cannot grow its trees in parallel the way a forest can; its parallelism has to be found inside each tree's construction instead. The trade-off is fundamental. Forests give you parallel, robust, and hard to overfit; boosting gives you sequential, more accurate, and more delicate to tune.
A persistent finding is that tree ensembles remain the strongest models on medium-sized tabular data, a claim sharpened by the benchmark of Grinsztajn et al. (2022) and reinforced through 2024 to 2026 as deep tabular architectures kept failing to overtake them decisively. The frontier has split into two efforts. One scales the classics: distributed forest and boosting implementations now train on hundreds of millions of rows across clusters and GPUs, with work on histogram compression and communication-avoiding split aggregation to push the data-too-big regime of Section 4 toward the embarrassingly parallel ideal. The other tries to beat them: foundation models for tabular data such as the TabPFN line (Hollmann et al., 2025) perform in-context learning over a whole table in a single forward pass, and the live question is whether such models, which distribute very differently from a forest, can match bagged ensembles on large heterogeneous tables. For now the random forest remains the baseline every new tabular method must clear, and its scale-out story is the one to beat.
With the parallel, robust end of the ensemble spectrum mapped, we turn to its sequential, accuracy-first counterpart. The next section builds distributed gradient boosting, shows where its parallelism actually lives now that the trees can no longer be grown independently, and explains why systems like XGBoost and LightGBM win competitions while inheriting the communication costs the forest escaped. That comparison begins in Section 12.5.
Using the variance formula $\rho\sigma^2 + \frac{1-\rho}{T}\sigma^2$, explain why a random forest that draws a random feature subset at every split can reach a lower variance than plain bagged trees, even when both use the same number of trees $T$ and the same per-tree variance $\sigma^2$. What does the formula predict happens to the benefit of adding more trees once $T$ is already large? Tie your answer to the plateau visible in the test-error column of Output 12.4.1.
In Code 12.4.1, the per-split feature subset size is fixed at n_feat=3 out of 8 features. Re-run the ensemble-variance scan for n_feat=1, n_feat=3, and n_feat=8 (the last one uses every feature, so it is bagging without the random-subspace trick). Report the single-tree variance, the 60-tree ensemble variance, and the test error for each setting. Explain which setting lowers the pairwise correlation $\rho$ the most and why that does not automatically minimize test error, connecting your numbers to the bias-variance reading of the correlation floor.
You must train a 200-tree forest on a 2-terabyte tabular dataset across a cluster of machines with 64 GB of memory each. Decide which regime from Table 12.4.1 applies if (a) each tree trains on a 1% subsample, and (b) each tree must use the full dataset. For each case, state whether training requires per-split communication, and estimate how the wall-clock time scales as you go from 10 to 40 workers. Using the embarrassingly parallel argument and the Amdahl framing of Chapter 3, explain why the subsample plan can approach linear speedup while the full-data plan cannot.