Part III: Distributed Machine Learning
Chapter 12: Distributed Classical Machine Learning

Distributed Gradient Boosting

"Each of us learns only what the trees before us got wrong. You cannot rush ahead and fix a mistake that has not been made yet."

A Boosting Round Waiting Its Turn
Big Picture

Gradient-boosted decision trees are the workhorse of tabular machine learning, and they scale in a way that is the exact opposite of the random forest you just met: the trees cannot be built in parallel, so all the parallelism has to happen inside each tree. Boosting adds trees one at a time, and every tree is fit to the residual mistakes of the ensemble so far. That sequential dependence kills the embarrassingly parallel trick that made random forests (Section 12.4) easy to distribute. The systems that conquered this, XGBoost and LightGBM, instead parallelize within a round: each worker holds a shard of the data, builds a histogram of gradients and Hessians over its rows, and the cluster all-reduces those histograms so every worker picks the same split from the same global statistics. This section shows why the math forces that design, proves with a runnable demo that the distributed answer equals the single-machine one, and explains why these models still beat deep networks on most tables of numbers.

The previous section built random forests, where hundreds of trees are grown independently on bootstrap samples and then averaged. Independence is what made them trivial to distribute: hand each worker a subset of trees, grow them in isolation, collect the forest at the end, and never communicate during training. Gradient boosting throws that gift away. It also grows many trees, but it grows them in a strict sequence, each one correcting the ensemble that precedes it. You cannot grow tree number fifty until trees one through forty-nine exist, because tree fifty is defined entirely by their collective error. The parallelism that distributes boosting therefore lives at a completely different level, and finding it is the whole story of this section.

1. Why Boosting Refuses to Parallelize Across Trees Intermediate

Gradient boosting builds an additive model. After $t-1$ rounds the ensemble predicts a score $F_{t-1}(x)$, and round $t$ adds one more tree $f_t$ scaled by a learning rate $\eta$, so that $F_t(x) = F_{t-1}(x) + \eta\, f_t(x)$. The new tree is not fit to the labels; it is fit to how wrong the current ensemble is. Concretely, for a differentiable loss $\ell(y, F)$ we compute, for every training example, the first and second derivatives of the loss with respect to the current score,

$$g_i = \frac{\partial \ell(y_i, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)}, \qquad h_i = \frac{\partial^2 \ell(y_i, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)^2},$$

the gradient $g_i$ and the Hessian $h_i$. The tree $f_t$ is then grown to predict a step that reduces a second-order Taylor approximation of the loss around the current scores. The dependence is total: $g_i$ and $h_i$ are functions of $F_{t-1}$, which is the sum of every tree built so far. Change one earlier tree and every later gradient changes. This is exactly the sequential structure of the synchronous-versus-asynchronous trade-off introduced in Chapter 10, except here the sequence is intrinsic to the algorithm rather than a choice about how to schedule updates.

Key Insight: The Parallelism Moved Down a Level

Random forests parallelize across trees because the trees are independent; boosting cannot, because each tree depends on all previous trees through the gradient. So distributed boosting moves the parallelism down one level: trees are still built one after another, but the construction of each single tree is distributed across the cluster. The unit of parallel work is no longer "a tree" but "a split-finding pass within a tree", and the way you parallelize that pass is the distributed histogram method from Section 12.3, run once per boosting round.

2. Splitting on Gradients: the Histogram Inside Each Round Intermediate

Inside one round, growing the tree means repeatedly choosing the best feature and threshold to split a node. Boosting scores a candidate split not by Gini impurity or variance, as an ordinary decision tree would, but by how much it reduces the second-order loss. If a node holds a set of examples and a candidate split sends some to the left child and the rest to the right, define $G = \sum g_i$ and $H = \sum h_i$ summed over the examples in a child. The standard XGBoost split gain is

$$\text{gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma,$$

where $\lambda$ is an $L_2$ regularizer on leaf weights and $\gamma$ is a complexity penalty per split. The optimal weight assigned to a leaf holding totals $(G, H)$ is $w^\star = -G / (H + \lambda)$. The crucial observation for distribution is that the gain depends on the data only through the sums $G_L, H_L, G_R, H_R$. To evaluate every candidate threshold for a feature, you do not need the raw rows; you need the gradient sum and Hessian sum in each bin of that feature. That is precisely a histogram: bin the feature values once, then accumulate $g_i$ and $h_i$ into per-bin buckets. Sweeping the bins left to right gives every candidate split's $(G_L, H_L)$ as a running prefix sum, so one pass over a histogram of $B$ bins evaluates all $B-1$ thresholds.

This is where the data-parallel structure of Chapter 4 reappears. A histogram is additive: the global gradient/Hessian histogram over all rows equals the sum of the per-shard histograms, exactly as the global gradient in Section 1.1 was the sum of per-shard partial gradients. So each worker bins its own shard, builds a small local histogram of size (number of features $\times$ number of bins), and the cluster all-reduces those histograms. After the all-reduce every worker holds the identical global histogram and therefore independently picks the identical best split, with no further communication. Figure 12.5.1 traces the full pattern: trees in sequence, and within each tree a single all-reduce of histograms across the shards.

Boosting is sequential across rounds: each tree fits the gradient of the ensemble so far Tree 1 round 1 Tree 2 round 2 Tree t round t ... Tree T round T one round, expanded: parallelism lives inside the tree Worker 1 / shard 1 bin rows, build local (g, h) histogram Worker 2 / shard 2 bin rows, build local (g, h) histogram Worker K / shard K bin rows, build local (g, h) histogram All-reduce histograms sum per-bin g and h across shards Pick split by gain same global histogram, same split every worker holds the identical global histogram, so all pick the same threshold no raw rows are ever shipped, only the small per-feature histograms
Figure 12.5.1: Distributed gradient boosting has two nested loops. The outer loop (top) is strictly sequential: tree $t$ cannot start until tree $t-1$ has updated the ensemble, because the gradients of round $t$ depend on it. The inner loop (bottom, one round expanded) is where the cluster parallelizes: each of the $K$ workers builds a local gradient/Hessian histogram over its data shard, a single all-reduce sums them into one global histogram, and every worker then picks the identical split. Only the compact histograms cross the network, never the raw rows.

3. A Runnable Distributed Boosting Round Intermediate

The code below implements a few rounds of gradient boosting from scratch in pure NumPy, with the split-finding step distributed exactly as Figure 12.5.1 describes. To keep the geometry visible we use a single numeric feature, bin it into a shared quantile sketch once, and grow a one-split stump per round (the depth-one case shows the all-reduce without burying it in tree-recursion bookkeeping). Each round computes the log-loss gradient $g_i = p_i - y_i$ and Hessian $h_i = p_i(1-p_i)$ at the current scores, every shard builds its local histogram, the histograms are summed (the all-reduce), and the global histogram drives the split gain and leaf weights from Section 2. A second function repeats the same boosting over all the data in one pass, with no shards, so we can check the two agree.

import numpy as np

rng = np.random.default_rng(7)
N, K, ROUNDS, BINS = 4000, 4, 6, 16
LR, LAMBDA = 0.3, 1.0

# One numeric feature, a non-linear target -> binary labels.
x = rng.uniform(-3.0, 3.0, N)
prob = 1.0 / (1.0 + np.exp(-(1.5 * np.sin(1.3 * x) + 0.6 * x)))
y = (rng.uniform(size=N) < prob).astype(np.float64)

# Shared bin edges (a pre-agreed quantile sketch). Each example -> a bin id.
edges = np.quantile(x, np.linspace(0, 1, BINS + 1))
bin_id = np.clip(np.digitize(x, edges[1:-1]), 0, BINS - 1)
shards = np.array_split(np.arange(N), K)             # K disjoint data shards

def best_split_from_hist(g_hist, h_hist, lam):       # sweep bins, max the gain
    G_tot, H_tot = g_hist.sum(), h_hist.sum()
    best_b, best_gain, GL, HL = -1, -np.inf, 0.0, 0.0
    for b in range(BINS - 1):
        GL += g_hist[b]; HL += h_hist[b]
        GR, HR = G_tot - GL, H_tot - HL
        gain = GL*GL/(HL+lam) + GR*GR/(HR+lam) - G_tot*G_tot/(H_tot+lam)
        if gain > best_gain: best_gain, best_b = gain, b
    return best_b, best_gain

def fit_distributed():
    F = np.zeros(N)                                  # ensemble score, all examples
    for r in range(ROUNDS):
        p = 1.0 / (1.0 + np.exp(-F))
        grad, hess = p - y, p * (1 - p)              # log-loss gradient and Hessian
        # ---- each shard builds a LOCAL gradient/Hessian histogram over bins ----
        g_local = np.zeros((K, BINS)); h_local = np.zeros((K, BINS))
        for k, s in enumerate(shards):
            np.add.at(g_local[k], bin_id[s], grad[s])
            np.add.at(h_local[k], bin_id[s], hess[s])
        # ---- ALL-REDUCE: sum the per-shard histograms into ONE global one ----
        g_hist, h_hist = g_local.sum(axis=0), h_local.sum(axis=0)
        b, gain = best_split_from_hist(g_hist, h_hist, LAMBDA)
        GL, HL = g_hist[:b+1].sum(), h_hist[:b+1].sum()
        GR, HR = g_hist[b+1:].sum(), h_hist[b+1:].sum()
        wL, wR = -GL/(HL+LAMBDA), -GR/(HR+LAMBDA)    # optimal leaf weights
        F = F + LR * np.where(bin_id <= b, wL, wR)   # add the stump to the ensemble
        p = np.clip(1/(1+np.exp(-F)), 1e-12, 1-1e-12)
        loss = float(np.mean(-(y*np.log(p) + (1-y)*np.log(1-p))))
        print(f"round {r+1}: split@bin {b:2d}  gain {gain:8.2f}  "
              f"leaf(L,R)=({wL:+.3f},{wR:+.3f})  log-loss {loss:.5f}")
    return F
Code 12.5.1: One distributed boosting round, from scratch. The only communication is g_local.sum(axis=0) and h_local.sum(axis=0), which stand in for the all-reduce of the per-shard histograms; on a real cluster those two sums travel over the network and every other line runs locally on each worker.

Running it prints the per-round log-loss and then compares the distributed final scores against a single-machine baseline that boosts over all the rows in one pass.

data N=4000  shards K=4  rounds=6  bins=16
start log-loss      : 0.69315

round 1: split@bin  7  gain  1597.03  leaf(L,R)=(-1.287,+1.238)  log-loss 0.59115
round 2: split@bin  8  gain   821.44  leaf(L,R)=(-0.832,+1.025)  log-loss 0.53857
round 3: split@bin  6  gain   490.42  leaf(L,R)=(-0.846,+0.636)  log-loss 0.50717
round 4: split@bin  7  gain   288.66  leaf(L,R)=(-0.598,+0.569)  log-loss 0.48867
round 5: split@bin  8  gain   167.70  leaf(L,R)=(-0.409,+0.522)  log-loss 0.47794
round 6: split@bin  6  gain   101.94  leaf(L,R)=(-0.439,+0.312)  log-loss 0.47141

final log-loss (K=4 shards, all-reduced) : 0.47141
final log-loss (single machine)            : 0.47141
max abs score difference                   : 1.78e-15
Output 12.5.1: The log-loss drops every round as boosting corrects its own residuals, and the four-shard all-reduced model matches the single-machine model to $1.78 \times 10^{-15}$, the floating-point rounding floor. Summing histograms across shards changes the result by nothing that matters, just as summing partial gradients did in Section 1.1.

The agreement to fifteen decimal places is the same exactness guarantee that data parallelism gave us for gradient descent, now riding on the additivity of histograms instead of the additivity of gradients. The workers never share rows; they share a sum of small per-bin statistics, and the resulting splits are bit-for-bit the ones a single machine would have chosen. That is what lets a distributed XGBoost or LightGBM job claim it produces the same model as its single-node counterpart, only faster and on data that would never fit on one box.

Thesis Thread: One All-Reduce, A Third Method Built On It

The all-reduce of Section 1.1 keeps earning its keep. It summed partial gradients for data-parallel training, Chapter 10 formalized that synchronous step for distributed optimization, and here it sums partial histograms so that boosted trees pick splits from global statistics. Three different learning algorithms, one collective. When you meet a distributed classical-ML method, the productive question is always the same one Section 1.1 posed: which quantity is additive across shards, and therefore which collective combines it? For boosting the answer is the gradient/Hessian histogram, all-reduced once per round.

4. Data-Parallel and Feature-Parallel Modes Advanced

Code 12.5.1 distributes by rows, the data-parallel mode: each worker owns a shard of examples, builds histograms over its rows, and the cluster all-reduces. This is the default in distributed XGBoost and LightGBM because the communication volume per round depends only on the histogram size (features $\times$ bins), not on the number of rows, so it scales to enormous datasets. There is a second axis. In feature-parallel mode each worker owns a subset of the columns, finds the best split among its own features, and the workers exchange only their local best splits to agree on a global winner; the chosen split then has to be broadcast so everyone can partition their rows. Feature parallelism helps when there are very many features but it forces the workers to communicate row-partitioning information every split, which is why row partitioning dominates in practice. Real systems combine both, and they overlap the histogram all-reduce with computation just as the communication-avoiding training of Chapter 10 overlaps gradient all-reduce with the backward pass.

The two flagship systems made boosting scale through a stack of complementary ideas. XGBoost introduced an approximate split-finding algorithm that proposes candidate thresholds from a weighted quantile sketch (weighting each example by its Hessian $h_i$, because high-Hessian points matter more to the second-order objective), plus a sparsity-aware split finder that learns a default direction for missing values and skips absent entries, which is what lets it tear through sparse one-hot and click data. LightGBM pushed the histogram idea further with leaf-wise (best-first) tree growth instead of level-wise, gradient-based one-side sampling (GOSS), which keeps all the large-gradient examples and subsamples the small-gradient ones to shrink the per-round work, and exclusive feature bundling (EFB), which packs mutually-exclusive sparse features into a single column to cut the histogram cost. All of these accelerate the same inner loop: build histograms, all-reduce them, split.

Practical Example: The Fraud Model That Outgrew One Machine

Who: A data scientist on the risk team at a payments processor.

Situation: A gradient-boosted fraud model trained on a single 64-core machine with single-node XGBoost, on 40 million transactions with 300 engineered features.

Problem: The training table grew past 400 million rows after a merger, and the histograms plus the feature matrix no longer fit in one machine's memory; training started swapping to disk and a single fit ballooned past nine hours.

Dilemma: Subsample the data back down to what one machine could hold, accepting a weaker model on the rare fraud cases, or move to distributed training across a cluster and keep every row.

Decision: They kept every row and switched to data-parallel distributed boosting across eight workers, because fraud lives in the tail and subsampling would have thrown away the positive examples that mattered most.

How: They sharded the transactions across eight nodes, switched the XGBoost training call to its distributed backend, and let the framework all-reduce the gradient/Hessian histograms each round; the feature engineering and scoring code did not change.

Result: A full fit dropped to about 70 minutes, the model used all 400 million rows, and a held-out check confirmed the distributed model scored identically to a single-machine fit on a subset, exactly as Output 12.5.1 predicts for the split decisions.

Lesson: When the binding ceiling is data volume and the model is a boosted tree, do not subsample to fit one machine; all-reduce the histograms across many and keep the tail your model is there to catch.

Library Shortcut: xgboost and lightgbm Distribute the Histogram for You

Code 12.5.1 built and summed histograms by hand. In production you describe the data and ask for a distributed fit; the framework shards the rows, builds the per-worker histograms, all-reduces them every round, and grows full multi-level trees with regularization, early stopping, and missing-value handling. Both major libraries expose a one-call distributed mode (illustrative usage shown here):

# XGBoost on a Dask cluster: rows are sharded, histograms all-reduced per round.
import xgboost as xgb
from xgboost import dask as dxgb

dtrain = dxgb.DaskDMatrix(client, X, y)            # X, y are distributed Dask arrays
out = dxgb.train(client, {"tree_method": "hist", "objective": "binary:logistic",
                          "eta": 0.3, "lambda": 1.0},
                 dtrain, num_boost_round=200)       # cluster all-reduces histograms

# LightGBM in data-parallel mode (one process per worker):
import lightgbm as lgb
params = {"objective": "binary", "tree_learner": "data",   # data-parallel histograms
          "num_leaves": 63, "learning_rate": 0.05}
booster = lgb.train(params, lgb.Dataset(X_local, y_local), num_boost_round=200)
Code 12.5.2: The roughly thirty lines of manual histogram building, all-reduce, and split scoring in Code 12.5.1 collapse to a single train call. The library owns the histogram all-reduce, the quantile sketch, the sparsity-aware splits, and the tree recursion that Section 2 only sketched at depth one.

5. Why Boosted Trees Still Beat Deep Nets on Tables Beginner

It is worth asking why this fifteen-year-old family of models remains the default for tabular problems when deep learning has swept image, text, and audio. Tabular data is heterogeneous: columns mix scales, units, and meanings, many features are uninformative, and the useful signal often sits in sharp, axis-aligned thresholds ("flag the transaction if amount exceeds X and the account age is under Y"). Decision trees model exactly those axis-aligned, non-smooth interactions natively, while a neural network must learn them through smooth compositions of weighted sums, which is a harder fit on the modest dataset sizes typical of tabular problems. Boosting then squeezes the residual signal out greedily, round after round. The result is a model that needs little preprocessing, tolerates irrelevant and correlated features, handles missing values directly, and trains fast. For the great majority of business tables, that combination still wins on accuracy per unit of engineering effort, which is why distributing boosting well, rather than replacing it, is the high-value move.

Fun Note: The Leaderboard Tells On Us

For years the running joke at tabular-data competitions was that the winning solution is a gradient-boosted tree, the second place is a different gradient-boosted tree, and the elaborate deep-learning entry is somewhere on page two of the leaderboard wondering where its embeddings went wrong. The joke endures because it keeps being roughly true on structured data.

Research Frontier: Deep Tabular Learning and In-Context Tables (2024 to 2026)

The "trees still win on tables" verdict is under active, well-funded assault. Large benchmark studies (Grinsztajn et al. and follow-ups through 2024 to 2025) keep finding that tuned gradient-boosted trees match or beat deep tabular networks on most medium-sized datasets, while a parallel line tries to close the gap. The most striking recent entry is TabPFN (Hollmann et al., the v2 published in Nature, 2025), a transformer pre-trained on millions of synthetic tabular tasks that performs Bayesian-style in-context prediction in a single forward pass, no per-dataset training, and reports state-of-the-art accuracy on small and medium tables, with active 2025 work extending it past its original size limits and toward distributed and out-of-core inference. Whether such foundation-model-for-tables approaches finally dethrone boosting on large data, where the all-reduced histogram machinery of this section still rules, is one of the open empirical questions of distributed classical ML right now.

With boosting distributed by all-reducing histograms inside each round, we have now covered the three big supervised tree families: single trees (Section 12.3), forests of independent trees (Section 12.4), and the sequential, communication-coupled boosted ensemble here. The chapter turns next from supervised prediction to unsupervised structure. Section 12.6 distributes clustering, where the quantity that becomes additive across shards is no longer a gradient or a histogram but the per-cluster point sums that define each centroid.

Exercise 12.5.1: Why Not Just Distribute the Trees? Conceptual

Random forests (Section 12.4) distribute by handing each worker a subset of trees to grow independently, with no communication during training. Explain precisely why the identical strategy fails for gradient boosting: which quantity that worker $k$ needs to grow "its" tree is unavailable until the other workers' trees exist? Then state what boosting distributes instead of trees, and identify the one collective operation it relies on per round, naming the additive quantity that collective combines.

Exercise 12.5.2: Grow a Real Tree, Not a Stump Coding

Code 12.5.1 grows a depth-one stump each round. Extend fit_distributed to grow a depth-two tree per round: after the root split, build a fresh gradient/Hessian histogram for each child node (over only the rows that fall into it), all-reduce each child's histogram across the shards, and pick a second-level split per child. Confirm that your distributed depth-two model still matches a single-machine depth-two baseline to floating-point precision, and report how the per-round log-loss compares to the stump version. Explain why the number of all-reduces per round now grows with the number of nodes at the deepest level being split.

Exercise 12.5.3: Communication Cost of a Boosting Round Analysis

A data-parallel boosting round all-reduces one gradient histogram and one Hessian histogram, each of size (number of features) $\times$ (number of bins). For $F = 500$ features, $B = 256$ bins, 8-byte accumulators, and $T = 1000$ rounds, estimate the total bytes each worker sends and receives across a full fit, treating one all-reduce as moving roughly twice the histogram size. Compare that to the cost of shipping a 400-million-row, 500-feature table even once. Argue from these two numbers why the histogram all-reduce, not the data size, sets the communication budget, and why halving the bin count $B$ is a cheaper lever than reducing rows. Relate your estimate to the alpha-beta communication model of Chapter 4.