"I have been called a node, a worker, a replica, a shard, and once, memorably, a peer. The math never minded, as long as everyone agreed which letter I was."
A Worker Answering to Several Names
A book that spans data processing, distributed training, model sharding, inference serving, and multi-agent reasoning will reuse the same handful of symbols and terms in every part, and the only way that reuse stays trustworthy is if the symbols mean one thing throughout. This appendix is the single place where that agreement is written down. Section C.1 fixes the mathematical notation, the same $K$ workers and the same $\eta$ learning rate you met in Chapter 1 and will meet again in Chapter 15 and beyond. Section C.2 defines the vocabulary, every collective, every parallelism, every failure mode, with a link to the chapter that owns each one. Read it once now as a map, and return to it whenever a symbol or a word stops being obvious.
The forty-one chapters of this book draw on big-data systems, numerical optimization, deep-learning frameworks, serving infrastructure, and game theory, fields that each arrived with their own notation and their own dialect. To keep the narrative coherent we chose one symbol per concept and one definition per term, then held to that choice across every part. This appendix records those choices so that any chapter can be opened on its own without re-deriving what $B$ or $b$ means or what separates a shard from a replica. The two sections below are reference tables: Section C.1 is the symbol dictionary, and Section C.2 is the term dictionary.
C.1 Notation Beginner
Table C.1.1 lists the mathematical symbols that recur across the book, with the meaning we hold fixed everywhere they appear. Scalars are lower-case italic, vectors and parameter collections are bold or named explicitly in context, and counts of machines or examples are upper-case. Where a chapter needs a local symbol of its own, it defines it on first use and does not overwrite any entry below. The learning rate $\eta$, the worker count $K$, and the global batch $B$ are the symbols you will see most often, because they appear in nearly every training and scaling argument from Chapter 3 onward.
| Symbol | Meaning | First or primary use |
|---|---|---|
| $N$ | Number of training examples (dataset size); the data dimension that is partitioned across machines. | Ch 1 |
| $d$ | Dimension of a single feature vector or input. | Ch 1 |
| $K$ | Number of workers, data-parallel replicas, or devices cooperating on one job. | Ch 15 |
| $p$ | Number of processors, nodes, or processes in a parallel cost model (the HPC convention). | Ch 3 |
| $P$ | Number of model parameters; the per-worker payload of a gradient all-reduce. | Ch 16 |
| $w$, $\theta$ | Model parameters (weights). The two letters are interchangeable; a chapter picks one and stays with it. | Ch 10 |
| $g$, $\nabla L(w)$ | Gradient of the loss with respect to the parameters; the quantity that data parallelism averages. | Ch 1 |
| $L(w)$, $\ell$ | Average loss over the dataset, and the per-example loss $\ell(w; x_i, y_i)$. | Ch 1 |
| $\eta$ | Learning rate (step size) of a gradient method. | Ch 10 |
| $B$ | Global batch size: the total number of examples contributing to one optimizer step across all workers. | Ch 15 |
| $b$ | Per-worker (local) batch size, so that $B = K \cdot b$ for a balanced data-parallel job. | Ch 15 |
| $T$ | Number of optimizer steps, training iterations, or time steps, depending on context. | Ch 10 |
| $S$, $S_p$ | Speedup: the ratio of single-machine time to parallel time on $p$ machines. | Ch 3 |
| $E$, $E_p$ | Parallel efficiency, $E_p = S_p / p$; the fraction of ideal speedup actually realized. | Ch 3 |
| $\alpha$, $\beta$ | Latency and per-byte cost in the $\alpha$-$\beta$ communication model: time to send $m$ bytes is $\alpha + \beta m$. | Ch 4 |
| $\varepsilon$, $\delta$ | Privacy budget parameters of $(\varepsilon, \delta)$-differential privacy; smaller $\varepsilon$ means stronger privacy. | Ch 14 |
| $f$ | Number of faulty or Byzantine participants a protocol must tolerate. | Ch 2 |
| $x_i$, $y_i$ | The $i$-th input example and its label or target. | Ch 1 |
Two naming conventions are worth stating explicitly because they recur in prose throughout the book. First, "scale out" is a verb (to scale out a workload is to spread it across more machines) and "scale-out" is the hyphenated adjective or noun for that strategy; the same rule applies to "scale up" and "scale-up." Second, the actors in a distributed job carry fixed roles: a worker performs the partitioned computation, a coordinator (or master, or driver) sequences and combines the workers, a shard is one partition of data or parameters assigned to a worker, and a replica is a full copy of the model or service that several workers each hold. A worker may own a shard and host a replica at the same time; the words name roles, not separate machines.
The chapters of this book are meant to stack: data parallelism (Chapter 15) combines with model sharding (Chapter 16) and expert parallelism (Chapter 17) inside a single foundation-model training run (Chapter 19). That stacking only reads cleanly if $K$ means the same thing in all three chapters and $B = K \cdot b$ holds everywhere. Consistent notation is not cosmetic; it is the precondition for reasoning about combined systems, where a single argument has to carry symbols from three different chapters at once without ambiguity.
C.2 Glossary Beginner
The definitions below are deliberately short, one or two sentences each, enough to recall a term you have met and to point you at the chapter that develops it in full. Each entry ends with a link to its owning chapter, following the chapter map of the book. Terms are alphabetized; closely paired concepts (speedup and efficiency, all-gather and reduce-scatter) are defined together where the pairing is the point.
- All-gather
- A collective in which every worker contributes one shard and all workers end up holding the full concatenation of every shard; the gathering half of a sharded parameter update. Chapter 4.
- All-reduce
- The collective that sums (or otherwise reduces) one vector held on each worker and leaves the identical result on every worker; the engine of gradient synchronization in data-parallel training. Chapter 4.
- Asynchronous SGD
- A distributed optimizer in which workers push gradients and pull parameters without waiting for one another, trading some gradient staleness for the elimination of synchronization barriers. Chapter 10.
- Byzantine fault
- A failure in which a participant behaves arbitrarily or maliciously, sending wrong or conflicting messages rather than simply crashing; the hardest fault class a protocol can be asked to tolerate. Chapter 2.
- Checkpoint
- A saved snapshot of model parameters and optimizer state from which a job can resume after a failure or preemption; the unit of progress that elastic and fault-tolerant training protects. Chapter 18.
- Collective
- A communication operation involving a whole group of workers at once (all-reduce, all-gather, broadcast, all-to-all), as opposed to a point-to-point send between two of them. Chapter 4.
- Consensus
- The problem of getting a set of machines to agree on a single value or sequence of decisions despite failures and message delays; solved by protocols such as Paxos and Raft. Chapter 2.
- Continuous batching
- An LLM serving technique that admits and retires requests at the granularity of individual decode steps rather than whole batches, keeping the accelerator busy across requests of different lengths; the scheduling idea at the heart of vLLM. Chapter 24.
- Data parallelism
- The distribution strategy in which every worker holds a full copy of the model and processes a different shard of the data, then averages gradients by all-reduce; the most widely used form of distributed training. Chapter 15.
- Decentralized learning
- Distributed training with no central coordinator, in which workers exchange updates only with neighbors over a communication graph, often by gossip. Chapter 14.
- Differential privacy
- A formal guarantee, parameterized by $(\varepsilon, \delta)$, that the presence or absence of any single record changes the output distribution of a computation by a bounded amount, achieved by adding calibrated noise. Chapter 14.
- Distributed inference
- Serving predictions from a model whose computation or replicas span several machines, because one machine cannot hold the model or meet the request throughput. Chapter 23.
- Elastic training
- Training that can add or remove workers mid-run, rescaling the job to the machines currently available and resuming from a checkpoint when membership changes. Chapter 18.
- Embedding table
- A large lookup matrix mapping discrete ids (words, items, users) to dense vectors; often too large for one machine and therefore sharded across a parameter server. Chapter 11.
- Expert parallelism (Mixture of Experts)
- A sparse model design in which each token is routed to a few of many expert subnetworks, with the experts placed on different machines and connected by an all-to-all collective. Chapter 17.
- FedAvg
- The federated-averaging algorithm: clients take several local SGD steps on their own data, and a server averages the resulting models weighted by client data size. Chapter 14.
- Federated learning
- Training a shared model across many clients that keep their data local, exchanging only model updates with a coordinator so that raw data never leaves the device. Chapter 14.
- FSDP (Fully Sharded Data Parallel)
- A training strategy that shards parameters, gradients, and optimizer state across data-parallel workers, gathering each layer's parameters with all-gather only when needed; the ZeRO idea expressed in PyTorch. Chapter 16.
- Gang scheduling
- A cluster-scheduling policy that starts all the workers of a distributed job at once or not at all, because a partially placed collective job cannot make progress. Chapter 33.
- Gossip
- A decentralized communication pattern in which each worker periodically exchanges and averages state with a few random or neighboring peers, spreading information through the group without a central node. Chapter 14.
- Gradient compression
- Reducing the bytes moved per all-reduce by quantizing, sparsifying, or low-rank-factoring the gradient, lowering the communication tax on data-parallel training. Chapter 10.
- HNSW and ANN
- Approximate nearest neighbor (ANN) search returns vectors close to a query without exhaustive comparison; HNSW (Hierarchical Navigable Small World) is the graph-based index that makes such search fast at scale. Chapter 25.
- KV cache
- The stored key and value tensors from previous tokens that let a transformer decode the next token without recomputing the whole sequence; its memory cost dominates LLM serving economics. Chapter 22.
- MapReduce
- A programming model that expresses a distributed computation as a map over records followed by a shuffle and a reduce over grouped keys, with the framework handling partitioning and fault tolerance. Chapter 6.
- Model parallelism
- Splitting a single model across devices because its parameters and activations do not fit on one; the umbrella term covering tensor, pipeline, and sharded parallelism. Chapter 16.
- Parameter server
- An architecture in which one or more server processes hold the authoritative model parameters and workers push gradients to and pull parameters from them; the classic asynchronous-training and large-embedding pattern. Chapter 11.
- Pipeline parallelism
- Model parallelism that places consecutive layers (stages) on different devices and streams micro-batches through them like an assembly line, overlapping stages to hide the per-stage idle time. Chapter 16.
- Quantization
- Representing weights or activations in fewer bits (8-bit, 4-bit) to shrink memory and bandwidth at a small accuracy cost; a per-node efficiency technique that distribution then multiplies across a fleet. Chapter 22.
- RAG (Retrieval-Augmented Generation)
- Answering a query by first retrieving relevant documents from a vector store and then conditioning a language model on them, combining distributed retrieval with distributed generation. Chapter 25.
- Reduce-scatter
- The collective that reduces one vector per worker and leaves each worker holding only its slice of the result; combined with all-gather it composes a full all-reduce and underlies sharded training. Chapter 4.
- Replica
- A full copy of a model or service held by a worker so that several machines can train or serve in parallel; in data parallelism every worker is a replica. Chapter 15.
- Ring all-reduce
- A bandwidth-optimal all-reduce that arranges workers in a ring and passes partial sums around it, so the bytes each worker sends do not grow with the number of workers. Chapter 4.
- Scale-out
- Adding more machines and dividing the work among them; the distribution-first strategy this book leads with. Chapter 1.
- Scale-up
- Making one machine more capable (a bigger accelerator, more memory, a faster intra-node interconnect); treated as a per-node prerequisite that distribution multiplies, not as the main event. Chapter 22.
- Scatter-gather
- A two-phase pattern that scatters a computation's parts out to workers and gathers the results back, the basic skeleton of a coordinator-driven distributed job. Chapter 4.
- Sharding
- Partitioning data or model parameters into disjoint pieces (shards), each assigned to one worker, so that no single machine holds the whole. Chapter 8.
- Shuffle
- The all-to-all redistribution of records by key that sits between the map and reduce phases, grouping every value for a key onto one machine; the most expensive and most network-bound step of MapReduce. Chapter 6.
- Speedup and efficiency
- Speedup $S_p$ is single-machine time divided by parallel time on $p$ machines; efficiency $E_p = S_p / p$ is the fraction of that ideal realized, the honest scorecard for any scale-out claim. Chapter 3.
- Straggler
- A worker that runs slower than the rest and, under synchronous coordination, holds up the whole group; the practical reason synchronous training rarely reaches ideal speedup. Chapter 3.
- Tensor parallelism
- Model parallelism that splits an individual layer's matrix multiplications across devices, with each device computing a slice of the same operation and a collective stitching the slices back together. Chapter 16.
- vLLM
- A high-throughput LLM serving system built on paged KV-cache management (PagedAttention) and continuous batching, the production reference for the serving ideas of Part V. Chapter 24.
- Vector search
- Finding the stored embeddings nearest to a query embedding, the retrieval step behind semantic search and RAG, distributed across shards when the index outgrows one machine. Chapter 25.
This glossary is intentionally a starting point rather than a complete dictionary: each term's owning chapter develops it with the diagrams, code, and cost models that a single sentence cannot carry. When a later chapter introduces a finer term (PagedAttention, ZeRO stage 3, Nash equilibrium, micro-batch), it defines that term in place and ties it back to the entry here that it specializes. Treat Section C.2 as the index of first definitions and the chapters as the full treatment.