Section 12.1: Distributed Linear and Logistic Regression

"They asked each of us to summarize our shard in a single small table. We added the tables together, solved once, and the answer was exactly right. I have never felt so useful and so replaceable at the same time."
A Worker That Emitted Its Sufficient Statistics

Big Picture

The simplest classical learners reveal a pattern that the rest of distributed machine learning leans on: when training a model reduces to summing a fixed-size quantity over examples, distribution becomes exact, cheap, and one round long. Linear regression makes this vivid. Its solution depends on the data only through two sums, $X^\top X$ and $X^\top y$, so each worker can compress its entire shard into a small fixed-size table, an all-reduce adds the tables, and one solve recovers the identical global model, no matter how the examples were split. Logistic regression has no closed form, but its gradient is still a sum over examples, so the same combine step drives distributed gradient descent. This section names that pattern, the sufficient-statistics view of distributed training, and shows exactly where it ends and iteration begins.

In the previous chapters of Part III we built the general machinery of distributed training: gradient aggregation and its convergence behavior in Chapter 10, and the parameter-server pattern for sharded model state in Chapter 11. Now we apply that machinery to specific models, starting with the two oldest and most transparent ones. Linear and logistic regression are not interesting because they are powerful; they are interesting because they expose, with nothing hidden, why some models distribute perfectly in a single communication round while others require many. The dividing line is whether the training computation collapses into a sum of per-example statistics, and once you can see that line you can predict the distributed cost of a method before writing a line of code.

1. Linear Regression as a Sum Over Examples Beginner

Ordinary least squares fits a weight vector $w$ that minimizes the squared error between predictions $Xw$ and targets $y$, where $X$ is the $N \times d$ design matrix of $N$ examples and $d$ features. Setting the gradient of the squared-error objective to zero gives the normal equations, a linear system whose solution is the global optimum in closed form:

$$L(w) = \tfrac{1}{2}\lVert Xw - y \rVert_2^2, \qquad X^\top X\, w = X^\top y, \qquad w^\star = \left(X^\top X\right)^{-1} X^\top y.$$

Stare at the two quantities the solution actually needs. The matrix $X^\top X$ is $d \times d$ and the vector $X^\top y$ is $d \times 1$; neither grows with the number of examples $N$. More importantly, both are sums over examples. Writing $x_i$ for the $i$-th row of $X$ and $y_i$ for its target,

$$X^\top X = \sum_{i=1}^{N} x_i x_i^\top, \qquad X^\top y = \sum_{i=1}^{N} x_i\, y_i.$$

This is the whole game. A sum over examples does not care how the examples are grouped, so we can split the $N$ rows into $K$ disjoint shards, hand shard $k$ to worker $k$, and let each worker accumulate its own partial sums $\sum_{i \in S_k} x_i x_i^\top$ and $\sum_{i \in S_k} x_i y_i$ over only the data it holds. Each worker then ships a tiny fixed-size pair of statistics, $d^2 + d$ numbers, regardless of how many millions of rows its shard contained. Adding the $K$ partial matrices and the $K$ partial vectors reconstructs the exact global $X^\top X$ and $X^\top y$, and a single solve yields the same $w^\star$ a single machine would have computed. The combine step is summing one array per worker, which is precisely the all-reduce primitive of Section 4.3, and the per-shard accumulate-then-merge structure is the mergeable aggregation pattern of Section 6.8 applied to matrices instead of counts.

Key Insight: Sufficient Statistics Decouple Communication From Data Size

The quantities a worker must send, $X^\top X$ and $X^\top y$, have size $d^2 + d$, which depends on the number of features but not on the number of examples. A shard with one thousand rows and a shard with one billion rows emit tables of identical size. Communication is therefore decoupled from data volume: you can scale the dataset arbitrarily without sending one extra byte across the network. This is the defining advantage of a model whose training reduces to a fixed-size sufficient statistic, and it is why exact distributed linear regression costs a single all-reduce rather than a streaming flood of raw data.

2. The Per-Shard Statistics Picture Beginner

Figure 12.1.1 makes the dataflow concrete. The full dataset is partitioned by rows across workers; each worker reduces its shard, no matter how large, to one small $X^\top X$ block and one short $X^\top y$ vector; a single all-reduce sums those blocks into the global statistics; and one solve on the summed system produces the exact model. The picture is worth holding in mind because nearly every method in this chapter is a variation on it, differing only in what the per-shard statistic is and whether one round suffices.

Figure 12.1.1: The sufficient-statistics pattern for distributed linear regression. Each worker (orange) reduces its row shard, of any size, to a fixed $d \times d$ matrix $X^\top X_k$ and length-$d$ vector $X^\top y_k$. One all-reduce sums these across workers, and a single solve on the summed normal equations (green) returns a model identical to the single-machine answer. Only $d^2 + d$ numbers leave each worker.

3. The Exact One-Round Algorithm, From Scratch Intermediate

The code below realizes Figure 12.1.1 in pure NumPy. It builds a regression dataset, computes the single-machine normal-equations solution as ground truth, then simulates $K$ workers that each see only their shard, emit the pair $\left(X^\top X_k,\, X^\top y_k\right)$, sum those statistics (the all-reduce), and solve once. It also reports the communication accounting: how many numbers each worker sends versus how many it would send if it naively shipped its raw rows instead.

import numpy as np

rng = np.random.default_rng(7)
N, d, K = 200_000, 12, 16          # examples, features (incl. bias), workers
X = rng.standard_normal((N, d))
X[:, 0] = 1.0                       # intercept column
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(N)

# --- Single-machine normal equations: w = (X^T X)^{-1} X^T y ---
XtX_full = X.T @ X
Xty_full = X.T @ y
w_single = np.linalg.solve(XtX_full, Xty_full)

# --- Distributed: each worker sees ONLY its shard, emits sufficient statistics ---
shards = np.array_split(np.arange(N), K)
partials = []
for s in shards:
    Xs, ys = X[s], y[s]
    partials.append((Xs.T @ Xs, Xs.T @ ys))      # (d x d), (d,) per worker

# all-reduce: sum the per-shard statistics across workers (one round)
XtX_dist = sum(p[0] for p in partials)
Xty_dist = sum(p[1] for p in partials)
w_dist = np.linalg.solve(XtX_dist, Xty_dist)

# communication accounting
floats_per_worker = d * d + d
print(f"examples N             : {N}")
print(f"features d             : {d}")
print(f"workers K              : {K}")
print(f"floats sent / worker   : {floats_per_worker}  (d*d + d, independent of N)")
print(f"naive raw-data floats  : {N * d}  (if shards shipped data instead)")
print(f"max|w_dist - w_single| : {np.max(np.abs(w_dist - w_single)):.3e}")
print(f"relative error         : {np.linalg.norm(w_dist - w_single)/np.linalg.norm(w_single):.3e}")
print(f"identical solution     : {np.allclose(w_dist, w_single)}")

Code 12.1.1: Exact distributed linear regression by all-reducing $X^\top X$ and $X^\top y$. The sixteen workers never share rows; they share only their summed statistics, and the combined solve is checked against the single-machine answer.

examples N             : 200000
features d             : 12
workers K              : 16
floats sent / worker   : 156  (d*d + d, independent of N)
naive raw-data floats  : 2400000  (if shards shipped data instead)
max|w_dist - w_single| : 5.818e-14
relative error         : 1.825e-14
identical solution     : True

Output 12.1.1: The distributed model equals the single-machine model to floating-point rounding ($\sim 6 \times 10^{-14}$). Each of the sixteen workers transmitted 156 numbers, roughly four orders of magnitude fewer than the 2.4 million it would have sent by shipping raw rows, and one all-reduce sufficed.

The result in Output 12.1.1 is the same kind of exactness we met for the data-parallel gradient in Section 1.3, now for an entire closed-form model rather than a single gradient step. The model is identical, the communication is one round, and the bytes per worker are fixed at 156 numbers no matter how many rows the shard held. This is the gold standard a distributed learner can aspire to, and it is achievable precisely because least squares is a sum over examples with a closed-form solve. The relevant cost is no longer accuracy; it is the $O(d^2)$ size of the statistic and the $O(d^3)$ solve, which is why this exact route is ideal for many features in moderate dimension and gives way to iteration when $d$ is very large.

Thesis Thread: One Primitive, From Counts to Models

The all-reduce that summed counts in MapReduce (Section 6.8) and gradients in data-parallel training (Section 4.3) here sums $d \times d$ matrices and produces a complete trained model in one round. Nothing about the primitive changed; only the payload did. Whenever you meet a new classical learner in this chapter, the first question to ask is what fixed-size quantity each worker must contribute to the shared sum, because that quantity, and whether one sum of it suffices, determines the entire distributed cost.

Fun Note: The Model That Fits in a Postcard

A worker holding a billion rows in this example still mails back the same 156 numbers as a worker holding ten. You could write each worker's contribution on a postcard, sum the postcards by hand, and solve the resulting twelve-by-twelve system on paper. The dataset is enormous; its sufficient statistic is stationery-sized.

4. Logistic Regression Has No Postcard, But It Has a Gradient Intermediate

Logistic regression replaces the squared-error objective with the log-loss of a sigmoid, $\sigma(z) = 1/(1 + e^{-z})$, which models the probability of a binary label. Its objective and gradient are

$$L(w) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \sigma(x_i^\top w) + (1 - y_i)\log\big(1 - \sigma(x_i^\top w)\big)\Big], \qquad \nabla L(w) = \frac{1}{N}\sum_{i=1}^{N}\big(\sigma(x_i^\top w) - y_i\big)\,x_i.$$

There is no closed-form solver here: the sigmoid makes the normal-equations trick unavailable, so we cannot reduce the whole fit to one fixed table and one solve. But look at the gradient. It is, once again, a sum over examples. At any current weight $w$, each worker can compute its partial gradient over only its shard, the all-reduce sums those partials, and one synchronized gradient-descent step moves $w$. The catch is that the statistic now depends on the current $w$, so it is not sufficient once and for all; it must be recomputed and re-communicated at every iteration. Distributed logistic regression is therefore exact per step but iterative overall, which is exactly the distributed-SGD setting developed in Section 10.3. Each round is an all-reduce of a length-$d$ gradient; convergence takes many rounds rather than one.

When $d$ is modest, you can do better than plain gradient descent by aggregating richer per-shard statistics. Distributed L-BFGS keeps the communication pattern identical, an all-reduce of the summed gradient, but uses a limited-memory curvature approximation built from a short history of those aggregated gradients to take far better-conditioned steps, converging in tens of rounds where SGD might need thousands. The lesson generalizes cleanly: a model with a closed-form sufficient statistic distributes in one round; a model whose only summable quantity is the gradient distributes in many rounds, one all-reduce each. The number of rounds, not the existence of a combine step, is what separates these two regression cousins.

Practical Example: A Churn Model Across Forty Regional Warehouses

Who: A data scientist at a retail analytics firm building a customer-churn logistic regression.

Situation: Training data lived in forty regional data warehouses, partitioned by geography, and totaled several terabytes of customer-event features.

Problem: Centralizing the rows into one machine for a single sklearn fit meant moving terabytes nightly across regions, which blew past both the network budget and a data-residency policy that forbade copying raw customer rows out of region.

Dilemma: Either ship raw data to a central trainer, fast to code but expensive and non-compliant, or run a distributed fit where each region computes statistics locally and only aggregates leave the region, compliant but requiring a distributed training loop.

Decision: They kept the data in place and distributed the fit, because the gradient is a sum over examples and only the length-$d$ aggregate needs to cross regions, not the rows.

How: Each region computed its partial gradient on local rows; a coordinator all-reduced the forty length-$d$ vectors and ran an L-BFGS step; the loop iterated for about sixty rounds until the loss flattened.

Result: Nightly cross-region traffic fell from terabytes to a few megabytes of gradient aggregates, the residency policy was satisfied because no raw row ever left its region, and the fitted coefficients matched a (policy-violating) centralized reference fit to four decimals.

Lesson: When only the gradient must be summed, the data can stay where it lives. This same insight, communicating statistics instead of data, becomes the foundation of federated learning in Chapter 14.

5. The General Lesson: Summable Statistics Are Cheap to Distribute Intermediate

The two regressions in this section bracket a principle that orders the entire chapter. If a model's training reduces to summing a fixed-size per-example statistic, and that summed statistic determines the model in closed form, distribution is exact and costs one all-reduce; linear regression is the archetype. If the only summable quantity is a gradient that depends on the current parameters, distribution is still exact per step but iterative, costing one all-reduce per round; logistic regression is the archetype. The communication payload in both cases is bounded by the statistic's size, not the data's, which is why classical models with low-dimensional sufficient statistics scale so gracefully even on enormous datasets.

This framing tells you what to look for in every later method. Naive Bayes and Gaussian discriminant models reduce to per-class sums and counts, one round like linear regression. The mean update in k-means clustering (Section 12.6) is a sum of points per cluster, exact per iteration. Tree-based methods aggregate histograms of feature-value statistics rather than raw rows, the trick that makes distributed gradient boosting practical in Section 12.5. In each case the question is identical to the one we asked here: what fixed-size quantity does a worker contribute to the shared sum, and does one sum finish the job or must we iterate. Answering it predicts the distributed cost before any code is written.

Library Shortcut: Spark MLlib and scikit-learn Hide the All-Reduce

In Code 12.1.1 we summed the per-shard statistics by hand. Production tools do exactly this internally and expose a one-line fit. Spark MLlib partitions the rows across the cluster, accumulates each partition's contribution, and reduces them with a tree-aggregate, the all-reduce of Figure 12.1.1 under a different name:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression

# `train` is a distributed Spark DataFrame with columns "features", "label".
lin = LinearRegression(featuresCol="features", labelCol="label").fit(train)   # normal-equations / IRLS, distributed
log = LogisticRegression(featuresCol="features", labelCol="label",
                         maxIter=100).fit(train)                              # distributed L-BFGS under the hood

Code 12.1.2: The same two models as one call each. MLlib's LinearRegression aggregates $X^\top X$ and $X^\top y$ across partitions and solves once; its LogisticRegression runs distributed L-BFGS, all-reducing a gradient per iteration. On one machine, scikit-learn's LinearRegression and LogisticRegression fit the identical models, and the roughly twenty lines of manual aggregation in Code 12.1.1 collapse to a single .fit() that handles partitioning, the tree-aggregate, and the solve.

Research Frontier: Communication-Efficient Distributed GLMs (2024 to 2026)

Even one all-reduce per iteration can dominate when models are wide or workers are many, so recent work compresses or eliminates rounds for generalized linear models. Surrogate-likelihood methods in the CSL and DANE lineage build a local second-order surrogate on each worker so that very few communication rounds match the centralized fit; 2024 and 2025 analyses extend these one-shot and few-shot estimators to heterogeneous, non-IID shards with provable guarantees. A parallel thread pushes single-round federated GLMs: workers exchange compressed Hessian sketches or quantized sufficient statistics so a coordinator reconstructs a near-exact model in one pass, with differential-privacy noise added to the aggregates for cross-silo settings. The recurring research target is the very quantity this section foregrounds, the size and number of the statistics that must be summed, driven toward the one-round ideal that linear regression already enjoys. We pick up the privacy-preserving version of this story in Chapter 14.

Linear and logistic regression, the humblest models in the book, have handed us the organizing idea for distributed classical machine learning: find the summable statistic, count the rounds. The next section carries the same question to support vector machines, where the per-example contributions are constrained by a quadratic program and the combine step becomes more interesting. That tour begins in Section 12.2.

Exercise 12.1.1: Identify the Sufficient Statistic Conceptual

For each model, state the fixed-size per-example quantity each worker would sum and whether one all-reduce finishes the fit or iteration is required: (a) ridge regression, which adds $\lambda \lVert w \rVert_2^2$ to the least-squares objective; (b) a Gaussian naive Bayes classifier; (c) Poisson regression fit by maximum likelihood. For each, give the size of the statistic in terms of the feature count $d$ (and class count $c$ where relevant), and explain why (a) and (b) need only one round while (c) does not.

Exercise 12.1.2: Distributed Ridge in One Round Coding

Starting from Code 12.1.1, implement distributed ridge regression. Each worker still emits $\left(X^\top X_k,\, X^\top y_k\right)$; after the all-reduce, solve $\left(X^\top X + \lambda I\right) w = X^\top y$ for a chosen $\lambda > 0$. Verify against a single-machine ridge solve that the distributed coefficients match to floating-point rounding, and confirm that adding the regularizer changed neither the per-worker message size nor the number of communication rounds. Then time the solve as you grow $d$ from 12 to 500 and report where the $O(d^3)$ solve, rather than communication, starts to dominate.

Exercise 12.1.3: Rounds Versus Features Analysis

Suppose you must fit a model on $K = 1000$ workers over a network where one all-reduce of a length-$d$ vector costs $T(d)$ seconds. Exact linear regression sends a $d^2 + d$ payload once; distributed logistic regression sends a length-$d$ gradient for each of $R$ iterations. Write the total communication time for each as a function of $d$, $R$, and $T$, and find the value of $R$ at which logistic regression's communication overtakes linear regression's. Discuss how this crossover shifts as $d$ grows, and why a curvature-aware method like L-BFGS that shrinks $R$ can be worth its extra per-step bookkeeping. Reference the cost models of Section 4.3.