
A practitioner's guide to distributing data, training, models, inference, coordination, and decision-making across many machines.
Modern AI is distributed AI. A single machine can no longer hold the data, the model, the inference traffic, or the fleet of agents that today's systems demand. This book is one connected journey from big-data algorithms to distributed intelligence, organized around six axes of distribution. It leads with scale-out, treats single-node efficiency as a clearly labeled per-node prerequisite, and builds every primitive, from the all-reduce collective to consensus, elastic recovery, and agent orchestration, through the AI operation that uses it, closing with end-to-end case studies and a capstone you design yourself.
Each part stands on the one before it; together they carry one system from a single split gradient to a swarm acting as one.
I. The vocabulary every later part reuses: what scale-out is, distributed-systems concepts, scalability and performance models, the communication primitives, and how to evaluate distributed AI with rigor.
5 chapters · 43 sections
II. The data layer that feeds everything: MapReduce and distributed algorithms, Spark and DataFrames, distributed storage and data loading, and stream processing for online AI.
4 chapters · 36 sections
III. Training distributed by hand: distributed optimization, parameter servers and terabyte embeddings, classical and graph ML at scale, and federated and decentralized learning.
5 chapters · 43 sections
IV. The heart of large-model training: data, model, pipeline, sharded, and expert parallelism; elastic and fault-tolerant training; foundation models; distributed RL; and distributed HPO.
7 chapters · 62 sections
V. Per-node efficiency as a labeled prerequisite, then multiplied across the fleet: distributed inference systems, LLM serving with vLLM, distributed retrieval and vector search, and MLOps.
5 chapters · 44 sections
VI. Distributing the intelligence itself: distributed AI foundations, game theory, multi-agent reinforcement learning, swarm intelligence, and LLM agent orchestration.
6 chapters · 55 sections
VII. The substrate everything runs on, and how it stays alive: cluster infrastructure and scheduling, edge, fog, and on-device AI, and reliable, secure, privacy-preserving distributed AI.
3 chapters · 26 sections
VIII. The whole book assembled into systems: web-scale RAG, federated medical AI, distributed recommendation, multi-agent robotics, agentic LLM applications, and a capstone you design.
6 chapters · 57 sections
Five habits, kept in every chapter from the first split gradient to the last agent.
Every concept built from first principles is paired with a small program that runs and prints real numbers, never an isolated snippet, so distribution is something you measure rather than assume.
After each from-scratch build, a shortcut callout shows the same task in a few lines of PySpark, PyTorch DDP and FSDP, DeepSpeed, Ray, or vLLM, and names exactly what the framework handles for you.
Big-picture framings, key insights, research frontiers, practical examples, and cross-references are typeset as distinct boxes, so you can read deep or skim fast and never miss a trap.
Each chapter closes with typed exercises and buildable project ideas that extend its worked systems, scaling from quick checks to capstone-sized distributed builds.
The MapReduce shuffle becomes all-reduce, parameter-server sharding becomes ZeRO and FSDP, data parallelism becomes expert parallelism, and per-node KV-cache economics return multiplied across the serving fleet. One story, told at every scale.