Contents | Scaling Out AI: From Big Data Algorithms to Distributed Intelligence

Front Matter · How to Read This Book

5 entries

F1
Preface Why scale out, how the eight parts fit together, the notation, and the suggested course paths.
F2
Mental Models: A Visual Glossary The book's recurring concepts in pictures: ring all-reduce, sharding, the retrieve-rank funnel, gang scheduling, and more.
F3
About the Authors Who wrote this book and how.
F4
About the Hands-On AI Science Series The nine-book series this volume belongs to, and where to find the others.
F5
Copyright & Legal Edition, license, and attribution.

Part I · Foundations of Distributed AI

5 chapters · 43 sections

The shared vocabulary every later part reuses: scale-out, distributed-systems concepts, performance models, communication primitives, and evaluation.

1
What Is Scale-Out AI? The thesis, the six axes, and scale-out versus scale-up.
2
Distributed Systems Concepts for AI CAP, consistency, consensus, and faults, in AI terms.
3
Scalability and Performance Models Amdahl, Gustafson, speedup, efficiency, and the roofline.
1. 3.1 What Does It Mean to Scale?
2. 3.2 Horizontal and Vertical Scaling
3. 3.3 Strong and Weak Scaling
4. 3.4 Latency, Throughput, and Tail Latency
5. 3.5 Amdahl's Law and Gustafson's Law
6. 3.6 Work, Depth, and Parallelism
7. 3.7 The Roofline Model
8. 3.8 Communication Cost Models
9. 3.9 Scaling Efficiency and Cost-Awareness
4
Communication Primitives for Distributed Training All-reduce, all-gather, reduce-scatter, ring attention, and DiLoCo: the collectives that move gradients, sequences, and sparse updates.
5
Evaluating Distributed AI Systems Speedup, efficiency, and the discipline of holding quality constant.

Part II · Distributed Data Processing for AI

4 chapters · 36 sections

The data layer that feeds everything: MapReduce, Spark, distributed storage and loading, and stream processing for online AI.

6
The MapReduce Model and Distributed Algorithms Map, shuffle, reduce, and the algorithms they express.
7
Spark and Distributed DataFrames RDDs, lazy evaluation, and the shuffle at scale.
1. 7.1 From MapReduce to Spark
2. 7.2 Resilient Distributed Datasets
3. 7.3 DataFrames and Spark SQL
4. 7.4 Lazy Evaluation and DAG Execution
5. 7.5 Transformations and Actions
6. 7.6 Partitioning and Caching
7. 7.7 Joins, Shuffles, and Data Skew
8. 7.8 PySpark for AI Workloads
9. 7.9 Spark Performance Tuning
8
Distributed Storage and Data Loading Sharding, replication, and feeding the training loop.
9
Stream Processing and Online AI Event time, windows, watermarks, and online learning.
1. 9.1 Batch vs Stream Processing
2. 9.2 Events, Streams, and Windows
3. 9.3 Event Time and Processing Time
4. 9.4 Watermarks and Late Events
5. 9.5 Kafka-Style Distributed Logs
6. 9.6 Spark Structured Streaming and Flink
7. 9.7 Online Feature Computation
8. 9.8 Distributed Real-Time Inference Pipelines
9. 9.9 Concept Drift and Distributed Monitoring

Part III · Distributed Machine Learning

5 chapters · 43 sections

Training distributed by hand: optimization, parameter servers and embeddings, classical and graph ML, and federated learning.

10
Distributed Optimization Synchronous and asynchronous SGD, and gradient compression.
1. 10.1 Empirical Risk Minimization at Scale
2. 10.2 Mini-Batch Stochastic Gradient Descent
3. 10.3 Synchronous Distributed SGD
4. 10.4 Asynchronous Distributed SGD
5. 10.5 Gradient Aggregation and All-Reduce SGD
6. 10.6 Stale and Delayed Gradients
7. 10.7 Communication-Efficient Optimization
8. 10.8 Large-Batch Training and Learning-Rate Scaling
9. 10.9 Communication Complexity and Lower Bounds
10. 10.10 Convergence and Practical Trade-Offs
11
Parameter Servers and Distributed Embeddings Push-pull training and terabyte embedding tables.
1. 11.1 Motivation for Parameter Servers
2. 11.2 Push-Pull Architecture
3. 11.3 Centralized and Sharded Parameter Servers
4. 11.4 Synchronous and Asynchronous Updates
5. 11.5 Bounded Staleness
6. 11.6 Sparse Models and Distributed Embedding Tables
7. 11.7 Terabyte-Scale Embeddings
8. 11.8 Fault Tolerance in Parameter Servers
9. 11.9 Parameter Servers vs All-Reduce in Modern Systems
12
Distributed Classical Machine Learning Trees, linear models, and clustering across a cluster.
1. 12.1 Distributed Linear and Logistic Regression
2. 12.2 Distributed Support Vector Machines
3. 12.3 Distributed Decision Trees
4. 12.4 Random Forests at Scale
5. 12.5 Distributed Gradient Boosting
6. 12.6 Distributed Clustering
7. 12.7 Distributed Approximate Nearest Neighbors
13
Distributed Graph Machine Learning Graph partitioning, message passing, and distributed GNNs.
1. 13.1 Why Graphs Are Hard to Distribute
2. 13.2 Graph Partitioning
3. 13.3 The Pregel / Vertex-Centric Model
4. 13.4 Distributed Graph Analytics
5. 13.5 Distributed Graph Neural Networks
6. 13.6 Distributed Neighbor Sampling
7. 13.7 Mini-Batch vs Full-Graph Distributed Training
8. 13.8 Frameworks and Systems for Distributed Graph ML
14
Federated and Decentralized Learning FedAvg, non-IID data, and secure aggregation.
1. 14.1 Motivation for Federated Learning
2. 14.2 Cross-Device and Cross-Silo Learning
3. 14.3 FedAvg and Its Variants
4. 14.4 Non-IID Data
5. 14.5 Communication Constraints
6. 14.6 Privacy and Secure Aggregation
7. 14.7 Personalized Federated Learning
8. 14.8 Decentralized Learning
9. 14.9 Edge and On-Device Learning

Part IV · Parallel Deep Learning and Large Models

7 chapters · 62 sections

Data, model, pipeline, sharded, and expert parallelism; elastic training; foundation models; distributed RL; and distributed HPO.

15
Data-Parallel Deep Learning Replicas, all-reduce SGD, large-batch training, and FP8/MXFP8 precision frontiers.
16
Model, Pipeline, and Sharded Parallelism Tensor and pipeline parallelism, ZeRO, FSDP2, and activation checkpointing cost models.
1. 16.1 When the Model No Longer Fits on One Device
2. 16.2 Tensor Parallelism
3. 16.3 Pipeline Parallelism
4. 16.4 Sharded Data Parallelism: ZeRO Stages 1-3
5. 16.5 PyTorch FSDP
6. 16.6 DeepSpeed and Megatron-LM
7. 16.7 Sequence and Context Parallelism
8. 16.8 Activation Checkpointing as a Per-Node Enabler
9. 16.9 3D and 4D Parallelism
10. 16.10 Choosing and Tuning a Parallelism Strategy
17
Expert Parallelism and Sparse Distributed Models Mixture-of-experts routing, all-to-all communication, and auxiliary-loss-free load balancing.
1. 17.1 Dense vs Sparse Scaling
2. 17.2 The Mixture-of-Experts Layer
3. 17.3 Routing and Gating
4. 17.4 Expert Parallelism: Sharding Experts Across Nodes
5. 17.5 All-to-All Communication for Token Routing
6. 17.6 Load Balancing Across Experts
7. 17.7 Capacity Factors, Token Dropping, and Stability
8. 17.8 Serving Distributed MoE Models
9. 17.9 Trade-Offs vs Dense Distributed Models
18
Elastic and Fault-Tolerant Distributed Training Checkpointing, elasticity, and surviving node failure.
1. 18.1 Failure Is the Norm at Thousand-GPU Scale
2. 18.2 Distributed Checkpointing
3. 18.3 Restart, Replay, and Determinism
4. 18.4 Elastic Training
5. 18.5 Straggler Detection and Mitigation
6. 18.6 Preemption and Spot-Instance Training
7. 18.7 Memory Offload Across the Hierarchy
8. 18.8 Monitoring and Debugging Distributed Training
19
Training Foundation Models at Scale Data curation at scale, 3D parallelism, scaling laws, and the DeepSeek-V3 frontier recipe.
1. 19.1 Foundation Models as Distributed Systems
2. 19.2 Scaling Laws
3. 19.3 Distributed Dataset Construction
4. 19.4 Distributed Deduplication and Data Quality
5. 19.5 Tokenization at Scale
6. 19.6 Orchestrating Distributed Pretraining
7. 19.7 Distributed Fine-Tuning
8. 19.8 Distributed Alignment: A Systems View
9. 19.9 Energy, Cost, and Responsible Scaling
20
Distributed Reinforcement Learning Infrastructure Actors, learners, distributed replay, GRPO, and reinforcement learning with verifiable rewards.
1. 20.1 Why RL Is a Distributed-Systems Problem
2. 20.2 The Actor-Learner Architecture
3. 20.3 Distributed Experience Collection
4. 20.4 Distributed Replay Buffers
5. 20.5 Off-Policy Correction at Scale
6. 20.6 Ape-X, R2D2, and SEED RL Designs
7. 20.7 Synchronous vs Asynchronous RL Systems
8. 20.8 Scaling Bottlenecks: Sampling vs Learning Throughput
9. 20.9 Frameworks and Practice
21
Distributed Hyperparameter Search and AutoML Parallel search, Hyperband, and population-based training.
1. 21.1 Why Search Is Embarrassingly Parallel, and Why That Is Not Enough
2. 21.2 Grid, Random, and Bayesian Optimization
3. 21.3 Multi-Fidelity Optimization
4. 21.4 Successive Halving and Hyperband
5. 21.5 Population-Based Training
6. 21.6 Distributed Trial Scheduling and Early Stopping
7. 21.7 Ray Tune and the AutoML Ecosystem
8. 21.8 Cost-Aware Distributed Experimentation

Part V · Distributed Inference and Serving

5 chapters · 44 sections

Per-node efficiency as a labeled prerequisite, multiplied across the fleet: inference systems, LLM serving, vector search, and MLOps.

22
Per-Node Inference Efficiency: A Prerequisite Quantization, paged and compressed KV cache, FlashAttention-3, chunked prefill, and speculative decoding.
1. 22.1 Why One Node's Efficiency Determines Fleet Cost
2. 22.2 Quantization
3. 22.3 Pruning and Sparsity
4. 22.4 Knowledge Distillation
5. 22.5 KV Cache and Paged Attention
6. 22.6 FlashAttention and Efficient Attention
7. 22.7 Continuous Batching and Speculative Decoding
8. 22.8 Compilation and Kernel Optimization
9. 22.9 From Per-Node Numbers to Fleet Sizing
23
Distributed Inference Systems Load balancing, batching, autoscaling replicas, and multi-tenant GPU packing.
24
Distributed LLM Serving vLLM V1, continuous batching, paged attention, and disaggregated prefill-decode serving.
1. 24.1 Why Large-Model Serving Spans Many Machines
2. 24.2 Tensor-Parallel Inference
3. 24.3 Pipeline-Parallel and Multi-Node Inference
4. 24.4 Distributed and Paged KV Cache
5. 24.5 Prefill/Decode Disaggregation
6. 24.6 Request Scheduling and Continuous Batching Across Nodes
7. 24.7 Prefix Caching and Multi-LoRA Fleets
8. 24.8 Serving Distributed MoE Models
9. 24.9 Inference Engines and Practice
25
Distributed Retrieval and Vector Search Approximate nearest neighbor, sharded indexes, hybrid search, and RaBitQ binary quantization.
1. 25.1 Retrieval-Augmented Generation as a Distributed System
2. 25.2 Distributed Embedding Pipelines
3. 25.3 Vector Databases
4. 25.4 Approximate Nearest Neighbor Search
5. 25.5 Index Sharding and Replication
6. 25.6 Distributed Hybrid Search
7. 25.7 Multi-Stage Retrieval and Distributed Reranking
8. 25.8 Distributed Caching for Retrieval
9. 25.9 Evaluating Distributed Retrieval
26
MLOps for Distributed AI Pipelines, registries, monitoring, statistical drift detection, and retraining triggers across a fleet.
1. 26.1 Operating AI Across a Fleet
2. 26.2 Distributed Data and Training Pipelines
3. 26.3 Model and Prompt Registries
4. 26.4 CI/CD for Distributed ML
5. 26.5 Distributed Experiment Tracking
6. 26.6 Fleet-Wide Monitoring and Observability
7. 26.7 Distributed Drift Detection
8. 26.8 A/B Testing and Shadow Deployment at Scale
9. 26.9 Rollbacks, Incident Response, and Guardrails

Part VI · Distributed AI and Multi-Agent Systems

6 chapters · 55 sections

Distributing the intelligence itself: distributed AI, game theory, multi-agent RL, swarm intelligence, and agent orchestration.

27
Distributed Artificial Intelligence Agents, coordination, and distributed problem solving.
1. 27.1 History of Distributed Artificial Intelligence
2. 27.2 Distributed Problem Solving
3. 27.3 Centralized, Decentralized, and Hybrid AI
4. 27.4 Blackboard Systems
5. 27.5 The Contract-Net Protocol
6. 27.6 Distributed Constraint Optimization
7. 27.7 Coordination and Cooperation
8. 27.8 Distributed Knowledge and Belief
9. 27.9 DAI in Modern AI Systems
28
Game-Theoretic Foundations for Multi-Agent AI Equilibria, auctions, and mechanism design for agents.
1. 28.1 Why Agents Need Game Theory
2. 28.2 Normal-Form and Extensive-Form Games
3. 28.3 Nash Equilibria and Solution Concepts
4. 28.4 Cooperative Games and Coalitions
5. 28.5 Social Welfare and Pareto Optimality
6. 28.6 Mechanism Design and Auctions
7. 28.7 Repeated Games and Learning Dynamics
29
Multi-Agent Systems Agent architectures, negotiation, and coordination.
1. 29.1 What Is an Agent?
2. 29.2 Agent Architectures
3. 29.3 Multi-Agent Environments
4. 29.4 Communication
5. 29.5 Coordination
6. 29.6 Negotiation
7. 29.7 Coalition Formation
8. 29.8 Task Allocation
9. 29.9 Consensus
10. 29.10 Trust and Reputation
30
Multi-Agent Reinforcement Learning Markov games, CTDE, value decomposition, and mean-field methods for large populations.
1. 30.1 From Reinforcement Learning to MARL
2. 30.2 Markov Games
3. 30.3 Cooperative, Competitive, and Mixed Settings
4. 30.4 Independent Learners
5. 30.5 Centralized Training with Decentralized Execution
6. 30.6 Value Decomposition
7. 30.7 Policy Gradient Methods in MARL
8. 30.8 Credit Assignment
9. 30.9 Non-Stationarity
10. 30.10 Distributed MARL Training
31
Swarm Intelligence and Collective Behavior Flocking, ant colonies, particle swarms, emergence, and LLM-driven agent swarms.
1. 31.1 Collective Intelligence
2. 31.2 Swarm Intelligence
3. 31.3 Ant Colony Optimization
4. 31.4 Particle Swarm Optimization
5. 31.5 Flocking and Distributed Consensus
6. 31.6 Collective Perception
7. 31.7 Emergent Communication
8. 31.8 Coordination Without Central Control
9. 31.9 Failure Modes in Collective Systems
32
Distributed Agent Orchestration Tool-using LLM agents, planners, and the MCP and A2A protocols.
1. 32.1 LLM Agents as Distributed Components
2. 32.2 Tool Use and Function Calling
3. 32.3 Planner-Executor and Role-Specialized Agents
4. 32.4 Parallel and Distributed Multi-Agent Workflows
5. 32.5 Debate, Critique, and Reflection Across Agents
6. 32.6 Agent Communication Protocols (MCP and A2A)
7. 32.7 Shared State and Distributed Memory
8. 32.8 Distributed Orchestration Engines
9. 32.9 Evaluating Distributed Agentic Systems
10. 32.10 Cost, Latency, and Reliability at Scale

Part VII · Cluster, Edge, and Reliable Infrastructure

3 chapters · 26 sections

The substrate everything runs on, and how it stays alive: cluster scheduling, edge and on-device AI, and reliable, secure distributed AI.

33
Cluster Infrastructure and Scheduling Accelerators, Kubernetes, DRA, Kueue, KAI Scheduler, and gang scheduling.
34
Edge, Fog, and On-Device Distributed AI Split inference, federated edge, real-time deadlines, and on-device generative LLMs.
1. 34.1 Edge AI as Distribution to the Periphery
2. 34.2 Fog Computing
3. 34.3 On-Device Inference
4. 34.4 Edge-Cloud Collaboration and Split Computing
5. 34.5 Distributed Sensing
6. 34.6 Federated Edge Learning
7. 34.7 Latency-Critical Distributed AI
8. 34.8 Robotics and Autonomous Systems
9. 34.9 Privacy-Preserving Edge AI
35
Reliable and Secure Distributed AI Fault tolerance, Byzantine robustness, differential privacy, GPU TEEs, and LLM security.

Part VIII · Case Studies and Capstone Projects

6 chapters · 57 sections

The whole book assembled into systems: web-scale RAG, federated medical AI, recommendation, robotics, agentic apps, and a capstone.

36
Web-Scale Text Processing and Distributed RAG Crawl, clean, embed, shard, retrieve with GraphRAG and agentic multi-hop, and generate at web scale.
1. 36.1 Problem Definition
2. 36.2 Distributed Crawling
3. 36.3 Distributed Cleaning and Deduplication
4. 36.4 Distributed Indexing
5. 36.5 Distributed Embedding Generation
6. 36.6 Sharded Retrieval and Ranking
7. 36.7 RAG Integration Across a Fleet
8. 36.8 Evaluation
9. 36.9 Project Extension
37
Federated Medical AI Training a clinical model across hospitals without moving data.
1. 37.1 Problem Definition
2. 37.2 Multi-Hospital Data
3. 37.3 Privacy Constraints
4. 37.4 Federated Learning Setup
5. 37.5 Data Heterogeneity
6. 37.6 Secure Aggregation
7. 37.7 Monitoring and Drift Across Sites
8. 37.8 Safety and Responsibility
9. 37.9 Project Extension
38
Distributed Recommendation at Scale Sharded embeddings, the retrieve-then-rank funnel, and generative recommenders with semantic IDs.
1. 38.1 Problem Definition
2. 38.2 Distributed User and Item Embeddings
3. 38.3 Sharded Candidate Generation
4. 38.4 Distributed Ranking Models
5. 38.5 Feature Stores
6. 38.6 Real-Time Personalization
7. 38.7 Online Evaluation
8. 38.8 System Architecture
9. 38.9 Project Extension
39
Multi-Agent Robotics and Drone Swarms Decentralized coordination, multi-agent RL, VLA foundation-model policies, and sim-to-real.
1. 39.1 Problem Definition
2. 39.2 Multi-Robot Coordination
3. 39.3 Distributed Task Allocation
4. 39.4 Communication Constraints
5. 39.5 Shared Situational Awareness
6. 39.6 Decentralized Control
7. 39.7 Multi-Agent Reinforcement Learning
8. 39.8 Simulation-to-Real Transfer
9. 39.9 Safety and Failure Modes
10. 39.10 Project Extension
40
Distributed LLM and Agentic Applications Document pipelines, RAG, disaggregated vLLM serving, agent orchestration, and the A2A protocol.
1. 40.1 Problem Definition
2. 40.2 Distributed Document Processing
3. 40.3 Embedding Pipelines
4. 40.4 Sharded Vector Search
5. 40.5 RAG at Scale
6. 40.6 Distributed Agent Orchestration
7. 40.7 Distributed Model Serving with vLLM
8. 40.8 Cost Control Across the Fleet
9. 40.9 Evaluation
10. 40.10 Project Extension
41
Capstone Project Design Choose, baseline, design, measure, and present a scale-out system.
1. 41.1 Choosing a Distributed AI Problem
2. 41.2 Defining the Distribution Axis
3. 41.3 Building a Single-Machine Baseline
4. 41.4 Designing the Distributed Version
5. 41.5 Selecting Tools and Infrastructure
6. 41.6 Evaluation Metrics: Speedup, Efficiency, and Cost
7. 41.7 Cost and Performance Analysis
8. 41.8 Reproducibility Package
9. 41.9 Final Report
10. 41.10 Final Presentation

Back Matter · Appendices

4 appendices

A self-contained math refresher, the companion cluster lab, the notation and glossary, and a catalogue of datasets and benchmarks.