Appendix A
Mathematical Background

Mathematical Background

"They told me the math would not change once the work was split across a thousand of us. They were right, which was somehow the most reassuring and the most humbling thing I had ever computed."

An All-Reduce That Has Seen Some Gradients
Big Picture

Every distributed-AI method in this book is a way of computing one of four mathematical objects across many machines: a matrix product, an expectation, a gradient step, or an information quantity. The algorithms look intricate because of the machines, the network, and the failures, but the underlying mathematics is the same linear algebra, probability, optimization, and information theory you would use on one laptop. This appendix is a refresher on exactly those four, written so that each idea points forward to the chapter that distributes it. Read it as a map: when a later chapter shards a matrix multiply, averages a gradient across workers, or compresses an update to save bandwidth, the operation being scaled out is defined here. We keep the treatment compact and operational, because the goal is not to re-teach a first course but to fix the notation and the few identities the rest of the book leans on.

This book assumes you have met vectors, probability, and calculus before, so the sections below move quickly. What they emphasize is the small set of facts that make distribution work: that a matrix product decomposes into independent block products, that an average of unbiased estimates is still unbiased, that a gradient is a vector you can sum across workers, and that information has a unit you can spend or save. Each of the four sections is keyed to a difficulty badge and ends pointing at the chapter where the idea does its real work. A worked numerical demonstration and a short set of exercises close the appendix.

A.1 Linear Algebra for Distributed AI Beginner

A vector $x \in \mathbb{R}^d$ is an ordered list of $d$ real numbers; a matrix $A \in \mathbb{R}^{m \times n}$ is a rectangular array. Almost all of the arithmetic an AI system performs reduces to one operation on these objects: the matrix product. For $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, the product $C = AB \in \mathbb{R}^{m \times n}$ has entries

$$C_{ij} = \sum_{p=1}^{k} A_{ip}\, B_{pj}.$$

A fully connected layer applies $y = Wx + b$; an attention block is a sequence of matrix products; a convolution unrolls into one. In libraries this dense product is called GEMM, for general matrix multiply, and it dominates the run time of training and inference because its cost grows as $m\,k\,n$ while the data it touches grows only as $mk + kn$. That high ratio of arithmetic to memory traffic is exactly what accelerators are built to exploit, which is why so much of distributed deep learning is, underneath, the question of how to spread one enormous GEMM across many devices.

Key Insight: A Matrix Product Is a Sum of Independent Block Products

Partition $A$ into row blocks and $B$ into column blocks, or split the shared dimension $k$ into pieces. Either way the product $C = AB$ becomes a sum of smaller products that can be computed on different machines and combined by addition, with no approximation. This single algebraic fact is what makes sharded matrix multiply (Chapter 16) exact: each device holds a slab of $A$ or $B$, multiplies locally, and the partial results are summed or gathered to reconstruct $C$.

To see the block decomposition concretely, split the shared dimension $k$ into $T$ contiguous tiles, so $A = [A^{(1)} \cdots A^{(T)}]$ by columns and $B = [B^{(1)}; \cdots; B^{(T)}]$ by rows. Then the product is the sum of the tile products,

$$C = AB = \sum_{t=1}^{T} A^{(t)} B^{(t)},$$

and each term $A^{(t)} B^{(t)}$ uses disjoint slabs of the inputs. Give tile $t$ to worker $t$, and the final summation over $t$ is the same all-reduce that Chapter 4 studies as a first-class operation. Figure A.1 shows this tiling: the shared dimension is cut into colored strips, each strip becomes one worker's local product, and a reduction adds the strips back together. Tiling along the other dimensions instead splits the output $C$ into blocks that never need to be summed, only concatenated, which is the pattern behind tensor-parallel layers.

A (m by k) 3 column strips × B (k by n) 3 row strips = A¹B¹ + A²B² + A³B³ one partial product per worker reduce (sum) C = m by n
Figure A.1: Block-tiled matrix multiply. Splitting the shared dimension $k$ into three strips turns $C = AB$ into three independent partial products $A^{(t)}B^{(t)}$, one per worker, summed by a reduce step. The same reduction is the all-reduce of Chapter 4; splitting the outer dimensions instead concatenates output blocks rather than summing them.

Two more linear-algebra objects appear throughout the book. A norm measures the size of a vector; the Euclidean (or $\ell_2$) norm is $\lVert x \rVert_2 = \sqrt{\sum_i x_i^2}$, and gradient clipping, convergence bounds, and compression error are all stated in terms of norms. The gradient itself is a vector: for a scalar loss $L(w)$ with parameters $w \in \mathbb{R}^P$, the gradient $\nabla L(w) \in \mathbb{R}^P$ collects the partial derivatives $\partial L / \partial w_i$. Because the training loss is an average over examples and differentiation is linear, the gradient is an average of per-example gradients, which is precisely why it can be summed across workers and divided by the worker count. That averaging is the all-reduce at the heart of data-parallel training in Chapter 4 and Chapter 10.

Library Shortcut: The Matrix Product Is One Operator

You never write the triple loop of the GEMM definition. NumPy and PyTorch expose the product as the @ operator (or matmul), dispatching to a tuned BLAS or cuBLAS kernel, and a tensor norm is a single call:

import numpy as np
A = np.random.randn(512, 256)
B = np.random.randn(256, 128)
C = A @ B                              # GEMM, dispatched to optimized BLAS
print(C.shape)                         # (512, 128)
print(np.linalg.norm(A[0]))           # Euclidean norm of one row vector

# Block-tiled product over the shared dimension equals the full product.
T = 4
tiles = np.array_split(np.arange(256), T)
C_blocks = sum(A[:, t] @ B[t, :] for t in tiles)   # sum of partial products
print(np.allclose(C_blocks, C))        # True
Code A.1: The matrix product, a vector norm, and the block-tiled decomposition in NumPy. The final allclose check is the exactness claim of the Key Insight above, and it is the same identity that sharded matmul (Chapter 16) relies on across devices.

A.2 Probability and Statistics Beginner

A random variable $X$ takes values with probabilities; its expectation $\mathbb{E}[X]$ is the long-run average and its variance $\operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$ measures spread. Expectation is linear, $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$, and for independent variables variances add. These two facts carry most of the statistical reasoning in distributed training. The central object is the mini-batch gradient. If the full-data loss is the average $L(w) = \frac{1}{N}\sum_{i} \ell_i(w)$, then sampling a batch $\mathcal{B}$ of $B$ examples uniformly and forming $g_{\mathcal{B}}(w) = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(w)$ gives an unbiased estimate of the full gradient,

$$\mathbb{E}[\,g_{\mathcal{B}}(w)\,] = \nabla L(w),$$

because each sampled term has expectation equal to the population mean of the per-example gradients. This is the single fact that licenses stochastic gradient descent: a cheap noisy gradient computed on a handful of examples points, on average, in exactly the same direction as the expensive exact gradient over all $N$ examples. The noise around that average has variance proportional to $1/B$, so larger batches give steadier but more expensive estimates, a trade-off that Chapter 10 turns into the central tuning knob of distributed optimization.

Key Insight: Mini-Batch Gradients Are Unbiased, So Averaging More of Them Only Reduces Noise

Because $\mathbb{E}[g_{\mathcal{B}}] = \nabla L$, averaging $K$ independent mini-batch gradients (for example, one per worker) leaves the expectation unchanged but divides the variance by $K$. Distributing the gradient over more workers does not bias the estimate; it sharpens it. This is why a data-parallel step on $K$ workers with batch $B$ each behaves like a single step with batch $KB$, and why the law of large numbers, $\frac{1}{M}\sum_{m} g^{(m)} \to \nabla L$ as $M \to \infty$, is the statistical backbone of scaling out training.

Two further results shape how we read distributed measurements. The law of large numbers says the sample mean of independent draws converges to the true expectation as the sample grows; the central limit theorem (CLT) sharpens this by saying the error of that sample mean is approximately Gaussian with standard deviation shrinking as $1/\sqrt{M}$ for $M$ samples. Concentration inequalities give the same message as one-sided guarantees that a sum stays close to its mean with high probability. Together they explain both why mini-batch training converges and why an evaluation metric estimated on a finite test set comes with a confidence interval of width proportional to $1/\sqrt{M}$. When Chapter 5 reports an accuracy with error bars, those bars are a direct application of the CLT, and variance-reduction techniques (control variates, larger or importance-weighted batches) are how training and evaluation both buy a tighter estimate for the same compute. The demonstration in Section A.3's neighborhood below makes the unbiasedness concrete by averaging thousands of mini-batch gradients and watching them converge to the full-data gradient.

A.3 Convex Optimization and Gradient Methods Intermediate

Training is the minimization of a loss $L(w)$ over parameters $w$. The workhorse is gradient descent, which repeatedly steps against the gradient,

$$w_{t+1} = w_t - \eta\, \nabla L(w_t),$$

where the learning rate $\eta > 0$ controls the step size. Stochastic gradient descent (SGD) replaces the exact gradient with the unbiased mini-batch estimate $g_{\mathcal{B}}(w_t)$ from Section A.2, trading exactness per step for many more steps per second. A function is convex if its graph lies below every chord, equivalently if $L(\lambda u + (1-\lambda)v) \le \lambda L(u) + (1-\lambda)L(v)$; for a convex loss, any point where the gradient vanishes is a global minimum, so gradient descent with a small enough learning rate is guaranteed to converge to the best possible solution. Figure A.2 contrasts the two regimes: a convex bowl, where every descent path reaches the unique floor, and a nonconvex landscape with several valleys, where the path you take depends on where you start.

Convex: one global minimum every path reaches the floor Nonconvex: several local minima where you start decides where you stop
Figure A.2: Convex versus nonconvex loss surfaces. On a convex bowl (left) gradient descent reaches the unique global minimum from any start; on a nonconvex surface (right) the path settles in whichever basin it began near. Deep networks are nonconvex, yet the same gradient step and the same distributed-averaging machinery of Chapter 10 apply.

Deep networks are not convex, so these guarantees do not transfer literally; in practice SGD still finds low-loss solutions, and the convex theory remains the right intuition pump for why it works and how to set the learning rate. Three refinements carry over to the nonconvex case and recur in the book. A learning rate too large diverges and too small crawls, so schedules that warm up then decay $\eta$ are standard. Momentum accumulates a running average of past gradients, $v_{t+1} = \beta v_t + \nabla L(w_t)$ followed by $w_{t+1} = w_t - \eta\, v_{t+1}$, which damps oscillation across steep directions and accelerates progress along shallow ones. Adaptive methods such as Adam rescale each coordinate by its recent gradient magnitude. The distributed connection is direct: in data-parallel SGD every worker computes a mini-batch gradient, the workers all-reduce them to a common averaged gradient, and each then applies the identical update, so the cluster takes exactly the step a single machine with the combined batch would take. Chapter 10 develops this into the full theory of distributed optimization, including the asynchronous and local-update variants that relax the all-reduce.

The code below makes the unbiasedness of the mini-batch gradient (Section A.2) concrete and shows why it underwrites SGD. It computes the exact full-data gradient of a linear-regression loss, then averages twenty thousand independent mini-batch gradients and measures how close the average lands.

import numpy as np

rng = np.random.default_rng(7)
N, d, B = 50_000, 20, 256        # examples, features, mini-batch size
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(N)
w = rng.standard_normal(d)        # arbitrary evaluation point

# Full-data gradient of the mean-squared-error loss.
full = (2.0 / N) * (X.T @ (X @ w - y))

# Mini-batch gradient on a uniformly sampled batch of size B.
def batch_grad():
    idx = rng.integers(0, N, size=B)              # sample with replacement
    Xb, yb = X[idx], y[idx]
    return (2.0 / B) * (Xb.T @ (Xb @ w - yb))

M = 20_000
avg = np.mean([batch_grad() for _ in range(M)], axis=0)   # average over M batches

print("batch size B               :", B)
print("number of batches averaged :", M)
print("||full gradient||          :", f"{np.linalg.norm(full):.4f}")
print("||mean batch - full||      :", f"{np.linalg.norm(avg - full):.2e}")
print("relative error             :", f"{np.linalg.norm(avg - full)/np.linalg.norm(full):.2e}")
Code A.2: Numerical check that the mini-batch gradient is an unbiased estimate of the full-data gradient. Each batch sees only $256$ of $50{,}000$ examples, yet the average over many batches reproduces the exact gradient.
batch size B               : 256
number of batches averaged : 20000
||full gradient||          : 14.5014
||mean batch - full||      : 2.83e-02
relative error             : 1.95e-03
Output A.2: The averaged mini-batch gradient matches the full gradient to a relative error of about $2 \times 10^{-3}$, and that error shrinks like $1/\sqrt{M}$ exactly as the central limit theorem predicts. The mini-batch direction is correct on average, which is all SGD and its data-parallel form in Chapter 10 require.

A.4 Information Theory Essentials Intermediate

Information theory supplies the units in which classification losses and communication costs are measured. For a discrete distribution $p$ over outcomes, the entropy

$$H(p) = -\sum_{x} p(x)\, \log_2 p(x)$$

is the average number of bits needed to encode a sample from $p$; it is largest when $p$ is uniform (maximum uncertainty) and zero when $p$ is a point mass (no uncertainty). When the truth is $p$ but a model predicts $q$, the cross-entropy $H(p, q) = -\sum_x p(x) \log_2 q(x)$ is the cost of coding $p$-distributed data with the code optimized for $q$, and the Kullback-Leibler (KL) divergence

$$D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_{x} p(x)\, \log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p) \ge 0$$

is the extra cost paid for using the wrong code. KL divergence is zero exactly when $p = q$ and is the gap that classifier training drives toward zero: minimizing cross-entropy loss against one-hot labels is minimizing this divergence. Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ measures how many bits observing $Y$ reveals about $X$, and underlies feature-selection and representation-learning objectives.

Key Insight: Bits Are a Currency You Spend on Both Loss and Bandwidth

The same logarithm that defines cross-entropy loss defines the limit of lossless compression: a source of entropy $H$ bits per symbol cannot be encoded in fewer than $H$ bits on average, and a coder that uses $q$ instead of the true $p$ wastes $D_{\mathrm{KL}}(p \Vert q)$ bits per symbol. This duality is why Chapter 10 can treat gradient compression as an information-theoretic problem: every gradient all-reduce moves bits over the network, and reducing those bits (through quantization or sparsification) is bounded below by the information the gradient actually carries.

The practical payoff threads through two chapters. Cross-entropy and KL divergence are the loss functions and the evaluation metrics that Chapter 5 uses to compare distributed models, where perplexity is simply $2^{H(p,q)}$ and a smaller cross-entropy means a model that needs fewer bits to express its surprise at the data. On the systems side, the entropy of a gradient sets a floor on how far gradient compression can shrink the bytes per all-reduce before information loss starts to hurt convergence, which is the lever that communication-avoiding training in Chapter 10 pulls. A model that has learned its data well has low cross-entropy and therefore low residual entropy in its predictions; a gradient that has converged carries little information and therefore compresses well. Information theory is what lets us say both of those sentences with a number attached.

Library Shortcut: Cross-Entropy and KL in One Call Each

The classification loss and the divergence are standard library functions; you write the formula only to understand it, never to run it:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)                 # 8 examples, 5 classes (unnormalized scores)
labels = torch.randint(0, 5, (8,))         # integer class targets

loss = F.cross_entropy(logits, labels)     # mean cross-entropy in nats
print(float(loss))

# KL divergence between two predicted distributions p and q (nats).
logp = F.log_softmax(torch.randn(8, 5), dim=1)
q    = F.softmax(torch.randn(8, 5), dim=1)
kl = F.kl_div(logp, q, reduction="batchmean")
print(float(kl))                           # >= 0, zero only when p == q
Code A.3: Cross-entropy and KL divergence in PyTorch. These compute in nats (natural log) rather than bits; divide by $\ln 2$ to convert. The same two quantities serve as the training loss in Chapter 10 and as evaluation metrics in Chapter 5.

These four bodies of mathematics, linear algebra for the matrix products, probability for the gradient estimates, optimization for the descent steps, and information theory for the bits, are the entire mathematical toolkit the book distributes. Everything that follows adds machines, networks, and failure to operations defined here; the operations themselves do not change. The exercises below check that the load-bearing identities are at your fingertips.

Exercise A.1: Block Multiply Equals Full Multiply Coding

Take random matrices $A \in \mathbb{R}^{200 \times 120}$ and $B \in \mathbb{R}^{120 \times 80}$. Split the shared dimension $120$ into $T = 5$ contiguous tiles and reconstruct $C = AB$ as the sum of the five partial products $A^{(t)}B^{(t)}$, as in Code A.1 and Figure A.1. Verify the result matches A @ B to floating-point precision. Then split the output instead, by row blocks of $A$, and confirm those partials are concatenated rather than summed. Explain in one sentence why the first decomposition needs an all-reduce and the second does not.

Exercise A.2: Variance Falls Like One Over the Batch Analysis

Using the setup of Code A.2, estimate the variance of a single mini-batch gradient for batch sizes $B \in \{16, 64, 256, 1024\}$ by drawing many batches at each size and measuring the average squared distance from full. Plot or tabulate the variance against $B$ and confirm it falls roughly as $1/B$. Then argue, using the law of large numbers from Section A.2, why running $K$ data-parallel workers at batch $B$ each gives the same variance as one worker at batch $KB$, and connect this to the all-reduce step of Chapter 10.

Exercise A.3: Cross-Entropy as a Bit Count Conceptual

A language model assigns probability $q$ to the next token while the true next token occurs with empirical frequency $p$. Show that the cross-entropy $H(p, q)$ equals $H(p) + D_{\mathrm{KL}}(p \Vert q)$, so the loss decomposes into an irreducible part (the data's own entropy) and a reducible part (the model's mismatch). Explain why driving training loss to zero is impossible whenever $H(p) > 0$, and relate the residual entropy to how well a converged gradient compresses under the bandwidth argument of Section A.4.