Front Matter · How to Read This Book
4 entries- F1PrefaceWhy scale out, how the eight parts fit together, the notation, and the suggested course paths.
- F2Mental Models: A Visual GlossaryThe book's recurring concepts in pictures: ring all-reduce, sharding, the retrieve-rank funnel, gang scheduling, and more.
- F3About the AuthorsWho wrote this book and how.
- F4Copyright & LegalEdition, license, and attribution.
Part I · Foundations of Distributed AI
5 chapters · 43 sectionsThe signal vocabulary every later part reuses: scale-out, distributed-systems concepts, performance models, communication primitives, and evaluation.
-
1What Is Scale-Out AI? The thesis, the six axes, and scale-out versus scale-up.
- 1.1 From Artificial Intelligence to Distributed AI
- 1.2 The Six Axes of Distribution
- 1.3 Scale-Out vs Scale-Up
- 1.4 Centralized, Decentralized, and Hybrid Architectures
- 1.5 Batch, Streaming, Online, and Interactive AI
- 1.6 Throughput, Latency, Cost, and Reliability
- 1.7 Examples of Distributed AI Systems
- 1.8 The Distributed AI Design Space
-
2Distributed Systems Concepts for AI CAP, consistency, consensus, and faults, in AI terms.
- 2.1 Processes, Nodes, Workers, and Coordinators
- 2.2 Communication, Synchronization, and Coordination
- 2.3 Partitioning, Sharding, and Replication
- 2.4 Fault Tolerance and Recovery
- 2.5 Consistency Models: From Parameter Staleness to the CAP Trade-off
- 2.6 Coordination and Consensus in the Control Plane
- 2.7 Stragglers and Bottlenecks
- 2.8 Data Locality and Compute Locality
- 2.9 Distributed System Patterns for AI
-
3Scalability and Performance Models Amdahl, Gustafson, speedup, efficiency, and the roofline.
- 3.1 What Does It Mean to Scale?
- 3.2 Horizontal and Vertical Scaling
- 3.3 Strong and Weak Scaling
- 3.4 Latency, Throughput, and Tail Latency
- 3.5 Amdahl's Law and Gustafson's Law
- 3.6 Work, Depth, and Parallelism
- 3.7 The Roofline Model
- 3.8 Communication Cost Models
- 3.9 Scaling Efficiency and Cost-Awareness
-
4Communication Primitives for Distributed Training All-reduce, all-gather, and the collectives that move gradients.
- 4.1 Why Communication, Not Compute, Bounds Distributed Training
- 4.2 The Communication Substrate
- 4.3 All-Reduce: Synchronizing Gradients in Data-Parallel SGD
- 4.4 All-Reduce Algorithms, and Why Ring All-Reduce Mattered for Deep Learning
- 4.5 All-Gather and Reduce-Scatter: The Primitives Behind ZeRO and FSDP
- 4.6 All-to-All: Routing Tokens in Mixture-of-Experts
- 4.7 Broadcast and Gather: Weight and Experience Movement
- 4.8 Communication Libraries: NCCL, MPI, and Gloo
- 4.9 Topology-Aware Placement
- 4.10 Overlapping Communication with the Backward Pass, and Gradient Bucketing
-
5Evaluating Distributed AI Systems Speedup, efficiency, and the discipline of holding quality constant.
- 5.1 Why Distribution Needs Its Own Evaluation Discipline
- 5.2 Speedup, Efficiency, and Scalability Curves
- 5.3 Throughput, Goodput, and Tail Latency / SLOs
- 5.4 Communication-to-Computation Ratio
- 5.5 Cost, Utilization, and Energy Accounting
- 5.6 Benchmarking Methodology and Pitfalls
- 5.7 Reproducible Measurement on Clusters
Part II · Distributed Data Processing for AI
4 chapters · 36 sectionsThe data layer that feeds everything: MapReduce, Spark, distributed storage and loading, and stream processing for online AI.
-
6The MapReduce Model and Distributed Algorithms Map, shuffle, reduce, and the algorithms they express.
- 6.1 Motivation for MapReduce
- 6.2 The Map, Shuffle, and Reduce Pattern
- 6.3 Key-Value Computation: Word Count and Inverted Indexing
- 6.4 Aggregation, Filtering, and Secondary Sorting
- 6.5 Distributed Sorting and Joins
- 6.6 Top-K, Matrix Multiplication, and PageRank
- 6.7 MinHash and Locality-Sensitive Hashing
- 6.8 Approximate Algorithms at Scale
- 6.9 Fault Tolerance, Limits, and Why the Model Still Matters
-
7Spark and Distributed DataFrames RDDs, lazy evaluation, and the shuffle at scale.
-
8Distributed Storage and Data Loading Sharding, replication, and feeding the training loop.
- 8.1 Why the Storage Layer Determines Scale
- 8.2 Object Storage and Distributed Filesystems
- 8.3 Columnar Formats and the Lakehouse
- 8.4 Data Layout, Partitioning, and Compaction
- 8.5 Sharded Training Data and the DataLoader Bottleneck
- 8.6 Streaming and WebDataset-Style Pipelines
- 8.7 Distributed Preprocessing
- 8.8 Data Leakage and Correctness in Distributed Pipelines
- 8.9 Data Versioning and Lineage
-
9Stream Processing and Online AI Event time, windows, watermarks, and online learning.
- 9.1 Batch vs Stream Processing
- 9.2 Events, Streams, and Windows
- 9.3 Event Time and Processing Time
- 9.4 Watermarks and Late Events
- 9.5 Kafka-Style Distributed Logs
- 9.6 Spark Structured Streaming and Flink
- 9.7 Online Feature Computation
- 9.8 Distributed Real-Time Inference Pipelines
- 9.9 Concept Drift and Distributed Monitoring
Part III · Distributed Machine Learning
5 chapters · 43 sectionsTraining distributed by hand: optimization, parameter servers and embeddings, classical and graph ML, and federated learning.
-
10Distributed Optimization Synchronous and asynchronous SGD, and gradient compression.
- 10.1 Empirical Risk Minimization at Scale
- 10.2 Mini-Batch Stochastic Gradient Descent
- 10.3 Synchronous Distributed SGD
- 10.4 Asynchronous Distributed SGD
- 10.5 Gradient Aggregation and All-Reduce SGD
- 10.6 Stale and Delayed Gradients
- 10.7 Communication-Efficient Optimization
- 10.8 Large-Batch Training and Learning-Rate Scaling
- 10.9 Communication Complexity and Lower Bounds
- 10.10 Convergence and Practical Trade-Offs
-
11Parameter Servers and Distributed Embeddings Push-pull training and terabyte embedding tables.
- 11.1 Motivation for Parameter Servers
- 11.2 Push-Pull Architecture
- 11.3 Centralized and Sharded Parameter Servers
- 11.4 Synchronous and Asynchronous Updates
- 11.5 Bounded Staleness
- 11.6 Sparse Models and Distributed Embedding Tables
- 11.7 Terabyte-Scale Embeddings
- 11.8 Fault Tolerance in Parameter Servers
- 11.9 Parameter Servers vs All-Reduce in Modern Systems
-
12Distributed Classical Machine Learning Trees, linear models, and clustering across a cluster.
-
13Distributed Graph Machine Learning Graph partitioning, message passing, and distributed GNNs.
- 13.1 Why Graphs Are Hard to Distribute
- 13.2 Graph Partitioning
- 13.3 The Pregel / Vertex-Centric Model
- 13.4 Distributed Graph Analytics
- 13.5 Distributed Graph Neural Networks
- 13.6 Distributed Neighbor Sampling
- 13.7 Mini-Batch vs Full-Graph Distributed Training
- 13.8 Frameworks and Systems for Distributed Graph ML
-
14Federated and Decentralized Learning FedAvg, non-IID data, and secure aggregation.
Part IV · Parallel Deep Learning and Large Models
7 chapters · 62 sectionsData, model, pipeline, sharded, and expert parallelism; elastic training; foundation models; distributed RL; and distributed HPO.
-
15Data-Parallel Deep Learning Replicas, all-reduce SGD, and large-batch training.
- 15.1 Why Deep Learning Needs Distributed Training
- 15.2 Single-GPU, Multi-GPU, and Multi-Node Training
- 15.3 Data Parallelism
- 15.4 Gradient Synchronization and All-Reduce
- 15.5 Gradient Bucketing and Communication/Computation Overlap
- 15.6 PyTorch Distributed Data Parallel
- 15.7 Horovod and the Broader Ecosystem
- 15.8 Mixed Precision as a Per-Node Enabler
- 15.9 Practical Bottlenecks and Scaling Efficiency
-
16Model, Pipeline, and Sharded Parallelism Tensor and pipeline parallelism, ZeRO, and FSDP.
- 16.1 When the Model No Longer Fits on One Device
- 16.2 Tensor Parallelism
- 16.3 Pipeline Parallelism
- 16.4 Sharded Data Parallelism: ZeRO Stages 1-3
- 16.5 PyTorch FSDP
- 16.6 DeepSpeed and Megatron-LM
- 16.7 Sequence and Context Parallelism
- 16.8 Activation Checkpointing as a Per-Node Enabler
- 16.9 3D and 4D Parallelism
- 16.10 Choosing and Tuning a Parallelism Strategy
-
17Expert Parallelism and Sparse Distributed Models Mixture-of-experts routing and all-to-all communication.
- 17.1 Dense vs Sparse Scaling
- 17.2 The Mixture-of-Experts Layer
- 17.3 Routing and Gating
- 17.4 Expert Parallelism: Sharding Experts Across Nodes
- 17.5 All-to-All Communication for Token Routing
- 17.6 Load Balancing Across Experts
- 17.7 Capacity Factors, Token Dropping, and Stability
- 17.8 Serving Distributed MoE Models
- 17.9 Trade-Offs vs Dense Distributed Models
-
18Elastic and Fault-Tolerant Distributed Training Checkpointing, elasticity, and surviving node failure.
- 18.1 Failure Is the Norm at Thousand-GPU Scale
- 18.2 Distributed Checkpointing
- 18.3 Restart, Replay, and Determinism
- 18.4 Elastic Training
- 18.5 Straggler Detection and Mitigation
- 18.6 Preemption and Spot-Instance Training
- 18.7 Memory Offload Across the Hierarchy
- 18.8 Monitoring and Debugging Distributed Training
-
19Training Foundation Models at Scale 3D parallelism, scaling laws, and training stability.
- 19.1 Foundation Models as Distributed Systems
- 19.2 Scaling Laws
- 19.3 Distributed Dataset Construction
- 19.4 Distributed Deduplication and Data Quality
- 19.5 Tokenization at Scale
- 19.6 Orchestrating Distributed Pretraining
- 19.7 Distributed Fine-Tuning
- 19.8 Distributed Alignment: A Systems View
- 19.9 Energy, Cost, and Responsible Scaling
-
20Distributed Reinforcement Learning Infrastructure Actors, learners, and distributed replay at scale.
- 20.1 Why RL Is a Distributed-Systems Problem
- 20.2 The Actor-Learner Architecture
- 20.3 Distributed Experience Collection
- 20.4 Distributed Replay Buffers
- 20.5 Off-Policy Correction at Scale
- 20.6 Ape-X, R2D2, and SEED RL Designs
- 20.7 Synchronous vs Asynchronous RL Systems
- 20.8 Scaling Bottlenecks: Sampling vs Learning Throughput
- 20.9 Frameworks and Practice
-
21Distributed Hyperparameter Search and AutoML Parallel search, Hyperband, and population-based training.
- 21.1 Why Search Is Embarrassingly Parallel, and Why That Is Not Enough
- 21.2 Grid, Random, and Bayesian Optimization
- 21.3 Multi-Fidelity Optimization
- 21.4 Successive Halving and Hyperband
- 21.5 Population-Based Training
- 21.6 Distributed Trial Scheduling and Early Stopping
- 21.7 Ray Tune and the AutoML Ecosystem
- 21.8 Cost-Aware Distributed Experimentation
Part V · Distributed Inference and Serving
5 chapters · 44 sectionsPer-node efficiency as a labeled prerequisite, multiplied across the fleet: inference systems, LLM serving, vector search, and MLOps.
-
22Per-Node Inference Efficiency: A Prerequisite Quantization, the paged KV cache, and the per-node prerequisite.
- 22.1 Why One Node's Efficiency Determines Fleet Cost
- 22.2 Quantization
- 22.3 Pruning and Sparsity
- 22.4 Knowledge Distillation
- 22.5 KV Cache and Paged Attention
- 22.6 FlashAttention and Efficient Attention
- 22.7 Continuous Batching and Speculative Decoding
- 22.8 Compilation and Kernel Optimization
- 22.9 From Per-Node Numbers to Fleet Sizing
-
23Distributed Inference Systems Load balancing, batching, and autoscaling replicas.
- 23.1 Why Model Serving Differs from Web Serving
- 23.2 Replicas, Load Balancing, and Batch-Aware Routing
- 23.3 Online vs Batch Inference Across a Fleet
- 23.4 Autoscaling on GPU Utilization and Queue Depth
- 23.5 Multi-Model and Multi-Tenant GPU Serving
- 23.6 Large-Model Loading, Cold Starts, and Warm Pools
- 23.7 Availability, Failover, and Redundancy
- 23.8 Serving Frameworks and Practice
-
24Distributed LLM Serving vLLM, continuous batching, and paged attention.
- 24.1 Why Large-Model Serving Spans Many Machines
- 24.2 Tensor-Parallel Inference
- 24.3 Pipeline-Parallel and Multi-Node Inference
- 24.4 Distributed and Paged KV Cache
- 24.5 Prefill/Decode Disaggregation
- 24.6 Request Scheduling and Continuous Batching Across Nodes
- 24.7 Prefix Caching and Multi-LoRA Fleets
- 24.8 Serving Distributed MoE Models
- 24.9 Inference Engines and Practice
-
25Distributed Retrieval and Vector Search Approximate nearest neighbor, sharded indexes, and scatter-gather.
- 25.1 Retrieval-Augmented Generation as a Distributed System
- 25.2 Distributed Embedding Pipelines
- 25.3 Vector Databases
- 25.4 Approximate Nearest Neighbor Search
- 25.5 Index Sharding and Replication
- 25.6 Distributed Hybrid Search
- 25.7 Multi-Stage Retrieval and Distributed Reranking
- 25.8 Distributed Caching for Retrieval
- 25.9 Evaluating Distributed Retrieval
-
26MLOps for Distributed AI Pipelines, registries, monitoring, and drift across a fleet.
- 26.1 Operating AI Across a Fleet
- 26.2 Distributed Data and Training Pipelines
- 26.3 Model and Prompt Registries
- 26.4 CI/CD for Distributed ML
- 26.5 Distributed Experiment Tracking
- 26.6 Fleet-Wide Monitoring and Observability
- 26.7 Distributed Drift Detection
- 26.8 A/B Testing and Shadow Deployment at Scale
- 26.9 Rollbacks, Incident Response, and Guardrails
Part VI · Distributed AI and Multi-Agent Systems
6 chapters · 55 sectionsDistributing the intelligence itself: distributed AI, game theory, multi-agent RL, swarm intelligence, and agent orchestration.
-
27Distributed Artificial Intelligence Agents, coordination, and distributed problem solving.
- 27.1 History of Distributed Artificial Intelligence
- 27.2 Distributed Problem Solving
- 27.3 Centralized, Decentralized, and Hybrid AI
- 27.4 Blackboard Systems
- 27.5 The Contract-Net Protocol
- 27.6 Distributed Constraint Optimization
- 27.7 Coordination and Cooperation
- 27.8 Distributed Knowledge and Belief
- 27.9 DAI in Modern AI Systems
-
28Game-Theoretic Foundations for Multi-Agent AI Equilibria, auctions, and mechanism design for agents.
-
29Multi-Agent Systems Agent architectures, negotiation, and coordination.
- 29.1 What Is an Agent?
- 29.2 Agent Architectures
- 29.3 Multi-Agent Environments
- 29.4 Communication
- 29.5 Coordination
- 29.6 Negotiation
- 29.7 Coalition Formation
- 29.8 Task Allocation
- 29.9 Consensus
- 29.10 Trust and Reputation
-
30Multi-Agent Reinforcement Learning Markov games, CTDE, and value decomposition.
- 30.1 From Reinforcement Learning to MARL
- 30.2 Markov Games
- 30.3 Cooperative, Competitive, and Mixed Settings
- 30.4 Independent Learners
- 30.5 Centralized Training with Decentralized Execution
- 30.6 Value Decomposition
- 30.7 Policy Gradient Methods in MARL
- 30.8 Credit Assignment
- 30.9 Non-Stationarity
- 30.10 Distributed MARL Training
-
31Swarm Intelligence and Collective Behavior Flocking, ant colonies, particle swarms, and emergence.
-
32Distributed Agent Orchestration Tool-using LLM agents, planners, and the MCP and A2A protocols.
- 32.1 LLM Agents as Distributed Components
- 32.2 Tool Use and Function Calling
- 32.3 Planner-Executor and Role-Specialized Agents
- 32.4 Parallel and Distributed Multi-Agent Workflows
- 32.5 Debate, Critique, and Reflection Across Agents
- 32.6 Agent Communication Protocols (MCP and A2A)
- 32.7 Shared State and Distributed Memory
- 32.8 Distributed Orchestration Engines
- 32.9 Evaluating Distributed Agentic Systems
- 32.10 Cost, Latency, and Reliability at Scale
Part VII · Cluster, Edge, and Reliable Infrastructure
3 chapters · 26 sectionsThe substrate everything runs on, and how it stays alive: cluster scheduling, edge and on-device AI, and reliable, secure distributed AI.
-
33Cluster Infrastructure and Scheduling Accelerators, Kubernetes, and gang scheduling.
- 33.1 Anatomy of an AI Cluster
- 33.2 Compute: CPUs, GPUs, TPUs, and Accelerator Instances
- 33.3 Containers and Kubernetes for AI
- 33.4 Batch Schedulers: Slurm, Kubernetes Batch, and Volcano
- 33.5 Gang Scheduling and Collective-Aware Placement
- 33.6 Multi-Tenant GPU Sharing: MIG, MPS, and Time-Slicing
- 33.7 Ray Clusters and the Object Store
- 33.8 Spot and Preemptible Scheduling for Cost Optimization
- 33.9 Managed Platforms: Databricks, SageMaker, and Vertex AI
-
34Edge, Fog, and On-Device Distributed AI Split inference, federated edge, and real-time deadlines.
-
35Reliable and Secure Distributed AI Fault tolerance, Byzantine robustness, and differential privacy.
- 35.1 Reliability in Distributed AI
- 35.2 Fault Tolerance and Recovery
- 35.3 Security in Distributed AI
- 35.4 Data and Model Poisoning in Distributed and Federated Settings
- 35.5 Byzantine-Robust Aggregation
- 35.6 Privacy and Differential Privacy in Distributed Learning
- 35.7 Auditability and Governance Across a Fleet
- 35.8 Bias and Environmental Cost at Scale
Part VIII · Case Studies and Capstone Projects
6 chapters · 57 sectionsThe whole book assembled into systems: web-scale RAG, federated medical AI, recommendation, robotics, agentic apps, and a capstone.
-
36Web-Scale Text Processing and Distributed RAG Crawl, clean, embed, shard, retrieve, and generate at web scale.
-
37Federated Medical AI Training a clinical model across hospitals without moving data.
- 37.1 Problem Definition
- 37.2 Multi-Hospital Data
- 37.3 Privacy Constraints
- 37.4 Federated Learning Setup
- 37.5 Data Heterogeneity
- 37.6 Secure Aggregation
- 37.7 Monitoring and Drift Across Sites
- 37.8 Safety and Responsibility
- 37.9 Project Extension
-
38Distributed Recommendation at Scale Sharded embeddings and the retrieve-then-rank funnel.
- 38.1 Problem Definition
- 38.2 Distributed User and Item Embeddings
- 38.3 Sharded Candidate Generation
- 38.4 Distributed Ranking Models
- 38.5 Feature Stores
- 38.6 Real-Time Personalization
- 38.7 Online Evaluation
- 38.8 System Architecture
- 38.9 Project Extension
-
39Multi-Agent Robotics and Drone Swarms Decentralized coordination, multi-agent RL, and sim-to-real.
- 39.1 Problem Definition
- 39.2 Multi-Robot Coordination
- 39.3 Distributed Task Allocation
- 39.4 Communication Constraints
- 39.5 Shared Situational Awareness
- 39.6 Decentralized Control
- 39.7 Multi-Agent Reinforcement Learning
- 39.8 Simulation-to-Real Transfer
- 39.9 Safety and Failure Modes
- 39.10 Project Extension
-
40Distributed LLM and Agentic Applications Document pipelines, RAG, a vLLM fleet, and agent orchestration.
- 40.1 Problem Definition
- 40.2 Distributed Document Processing
- 40.3 Embedding Pipelines
- 40.4 Sharded Vector Search
- 40.5 RAG at Scale
- 40.6 Distributed Agent Orchestration
- 40.7 Distributed Model Serving with vLLM
- 40.8 Cost Control Across the Fleet
- 40.9 Evaluation
- 40.10 Project Extension
-
41Capstone Project Design Choose, baseline, design, measure, and present a scale-out system.
- 41.1 Choosing a Distributed AI Problem
- 41.2 Defining the Distribution Axis
- 41.3 Building a Single-Machine Baseline
- 41.4 Designing the Distributed Version
- 41.5 Selecting Tools and Infrastructure
- 41.6 Evaluation Metrics: Speedup, Efficiency, and Cost
- 41.7 Cost and Performance Analysis
- 41.8 Reproducibility Package
- 41.9 Final Report
- 41.10 Final Presentation
Back Matter · Appendices
4 appendicesA self-contained math refresher, the companion cluster lab, the notation and glossary, and a catalogue of datasets and benchmarks.