Table of Contents

A practitioner's guide to distributing data, training, models, inference, coordination, and decision-making across many machines.

First Edition · 2026

8 parts · 41 chapters · 366 sections, plus front matter and 4 appendices. Every chapter and section linked below is complete and live; the directory path is shown under each chapter.

Front Matter · How to Read This Book

4 entries
  1. F1
    Preface
    Why scale out, how the eight parts fit together, the notation, and the suggested course paths.
  2. F2
    Mental Models: A Visual Glossary
    The book's recurring concepts in pictures: ring all-reduce, sharding, the retrieve-rank funnel, gang scheduling, and more.
  3. F3
    About the Authors
    Who wrote this book and how.
  4. F4
    Copyright & Legal
    Edition, license, and attribution.

Part I · Foundations of Distributed AI

5 chapters · 43 sections

The shared vocabulary every later part reuses: scale-out, distributed-systems concepts, performance models, communication primitives, and evaluation.

  1. 1
    What Is Scale-Out AI?
    The thesis, the six axes, and scale-out versus scale-up.
    1. 1.1 From Artificial Intelligence to Distributed AI
    2. 1.2 The Six Axes of Distribution
    3. 1.3 Scale-Out vs Scale-Up
    4. 1.4 Centralized, Decentralized, and Hybrid Architectures
    5. 1.5 Batch, Streaming, Online, and Interactive AI
    6. 1.6 Throughput, Latency, Cost, and Reliability
    7. 1.7 Examples of Distributed AI Systems
    8. 1.8 The Distributed AI Design Space
  2. 2
    Distributed Systems Concepts for AI
    CAP, consistency, consensus, and faults, in AI terms.
    1. 2.1 Processes, Nodes, Workers, and Coordinators
    2. 2.2 Communication, Synchronization, and Coordination
    3. 2.3 Partitioning, Sharding, and Replication
    4. 2.4 Fault Tolerance and Recovery
    5. 2.5 Consistency Models: From Parameter Staleness to the CAP Trade-off
    6. 2.6 Coordination and Consensus in the Control Plane
    7. 2.7 Stragglers and Bottlenecks
    8. 2.8 Data Locality and Compute Locality
    9. 2.9 Distributed System Patterns for AI
  3. 3
    Scalability and Performance Models
    Amdahl, Gustafson, speedup, efficiency, and the roofline.
    1. 3.1 What Does It Mean to Scale?
    2. 3.2 Horizontal and Vertical Scaling
    3. 3.3 Strong and Weak Scaling
    4. 3.4 Latency, Throughput, and Tail Latency
    5. 3.5 Amdahl's Law and Gustafson's Law
    6. 3.6 Work, Depth, and Parallelism
    7. 3.7 The Roofline Model
    8. 3.8 Communication Cost Models
    9. 3.9 Scaling Efficiency and Cost-Awareness
  4. 4
    Communication Primitives for Distributed Training
    All-reduce, all-gather, and the collectives that move gradients.
    1. 4.1 Why Communication, Not Compute, Bounds Distributed Training
    2. 4.2 The Communication Substrate
    3. 4.3 All-Reduce: Synchronizing Gradients in Data-Parallel SGD
    4. 4.4 All-Reduce Algorithms, and Why Ring All-Reduce Mattered for Deep Learning
    5. 4.5 All-Gather and Reduce-Scatter: The Primitives Behind ZeRO and FSDP
    6. 4.6 All-to-All: Routing Tokens in Mixture-of-Experts
    7. 4.7 Broadcast and Gather: Weight and Experience Movement
    8. 4.8 Communication Libraries: NCCL, MPI, and Gloo
    9. 4.9 Topology-Aware Placement
    10. 4.10 Overlapping Communication with the Backward Pass, and Gradient Bucketing
  5. 5
    Evaluating Distributed AI Systems
    Speedup, efficiency, and the discipline of holding quality constant.
    1. 5.1 Why Distribution Needs Its Own Evaluation Discipline
    2. 5.2 Speedup, Efficiency, and Scalability Curves
    3. 5.3 Throughput, Goodput, and Tail Latency / SLOs
    4. 5.4 Communication-to-Computation Ratio
    5. 5.5 Cost, Utilization, and Energy Accounting
    6. 5.6 Benchmarking Methodology and Pitfalls
    7. 5.7 Reproducible Measurement on Clusters

Part II · Distributed Data Processing for AI

4 chapters · 36 sections

The data layer that feeds everything: MapReduce, Spark, distributed storage and loading, and stream processing for online AI.

  1. 6
    The MapReduce Model and Distributed Algorithms
    Map, shuffle, reduce, and the algorithms they express.
    1. 6.1 Motivation for MapReduce
    2. 6.2 The Map, Shuffle, and Reduce Pattern
    3. 6.3 Key-Value Computation: Word Count and Inverted Indexing
    4. 6.4 Aggregation, Filtering, and Secondary Sorting
    5. 6.5 Distributed Sorting and Joins
    6. 6.6 Top-K, Matrix Multiplication, and PageRank
    7. 6.7 MinHash and Locality-Sensitive Hashing
    8. 6.8 Approximate Algorithms at Scale
    9. 6.9 Fault Tolerance, Limits, and Why the Model Still Matters
  2. 7
    Spark and Distributed DataFrames
    RDDs, lazy evaluation, and the shuffle at scale.
    1. 7.1 From MapReduce to Spark
    2. 7.2 Resilient Distributed Datasets
    3. 7.3 DataFrames and Spark SQL
    4. 7.4 Lazy Evaluation and DAG Execution
    5. 7.5 Transformations and Actions
    6. 7.6 Partitioning and Caching
    7. 7.7 Joins, Shuffles, and Data Skew
    8. 7.8 PySpark for AI Workloads
    9. 7.9 Spark Performance Tuning
  3. 8
    Distributed Storage and Data Loading
    Sharding, replication, and feeding the training loop.
    1. 8.1 Why the Storage Layer Determines Scale
    2. 8.2 Object Storage and Distributed Filesystems
    3. 8.3 Columnar Formats and the Lakehouse
    4. 8.4 Data Layout, Partitioning, and Compaction
    5. 8.5 Sharded Training Data and the DataLoader Bottleneck
    6. 8.6 Streaming and WebDataset-Style Pipelines
    7. 8.7 Distributed Preprocessing
    8. 8.8 Data Leakage and Correctness in Distributed Pipelines
    9. 8.9 Data Versioning and Lineage
  4. 9
    Stream Processing and Online AI
    Event time, windows, watermarks, and online learning.
    1. 9.1 Batch vs Stream Processing
    2. 9.2 Events, Streams, and Windows
    3. 9.3 Event Time and Processing Time
    4. 9.4 Watermarks and Late Events
    5. 9.5 Kafka-Style Distributed Logs
    6. 9.6 Spark Structured Streaming and Flink
    7. 9.7 Online Feature Computation
    8. 9.8 Distributed Real-Time Inference Pipelines
    9. 9.9 Concept Drift and Distributed Monitoring

Part III · Distributed Machine Learning

5 chapters · 43 sections

Training distributed by hand: optimization, parameter servers and embeddings, classical and graph ML, and federated learning.

  1. 10
    Distributed Optimization
    Synchronous and asynchronous SGD, and gradient compression.
    1. 10.1 Empirical Risk Minimization at Scale
    2. 10.2 Mini-Batch Stochastic Gradient Descent
    3. 10.3 Synchronous Distributed SGD
    4. 10.4 Asynchronous Distributed SGD
    5. 10.5 Gradient Aggregation and All-Reduce SGD
    6. 10.6 Stale and Delayed Gradients
    7. 10.7 Communication-Efficient Optimization
    8. 10.8 Large-Batch Training and Learning-Rate Scaling
    9. 10.9 Communication Complexity and Lower Bounds
    10. 10.10 Convergence and Practical Trade-Offs
  2. 11
    Parameter Servers and Distributed Embeddings
    Push-pull training and terabyte embedding tables.
    1. 11.1 Motivation for Parameter Servers
    2. 11.2 Push-Pull Architecture
    3. 11.3 Centralized and Sharded Parameter Servers
    4. 11.4 Synchronous and Asynchronous Updates
    5. 11.5 Bounded Staleness
    6. 11.6 Sparse Models and Distributed Embedding Tables
    7. 11.7 Terabyte-Scale Embeddings
    8. 11.8 Fault Tolerance in Parameter Servers
    9. 11.9 Parameter Servers vs All-Reduce in Modern Systems
  3. 12
    Distributed Classical Machine Learning
    Trees, linear models, and clustering across a cluster.
    1. 12.1 Distributed Linear and Logistic Regression
    2. 12.2 Distributed Support Vector Machines
    3. 12.3 Distributed Decision Trees
    4. 12.4 Random Forests at Scale
    5. 12.5 Distributed Gradient Boosting
    6. 12.6 Distributed Clustering
    7. 12.7 Distributed Approximate Nearest Neighbors
  4. 13
    Distributed Graph Machine Learning
    Graph partitioning, message passing, and distributed GNNs.
    1. 13.1 Why Graphs Are Hard to Distribute
    2. 13.2 Graph Partitioning
    3. 13.3 The Pregel / Vertex-Centric Model
    4. 13.4 Distributed Graph Analytics
    5. 13.5 Distributed Graph Neural Networks
    6. 13.6 Distributed Neighbor Sampling
    7. 13.7 Mini-Batch vs Full-Graph Distributed Training
    8. 13.8 Frameworks and Systems for Distributed Graph ML
  5. 14
    Federated and Decentralized Learning
    FedAvg, non-IID data, and secure aggregation.
    1. 14.1 Motivation for Federated Learning
    2. 14.2 Cross-Device and Cross-Silo Learning
    3. 14.3 FedAvg and Its Variants
    4. 14.4 Non-IID Data
    5. 14.5 Communication Constraints
    6. 14.6 Privacy and Secure Aggregation
    7. 14.7 Personalized Federated Learning
    8. 14.8 Decentralized Learning
    9. 14.9 Edge and On-Device Learning

Part IV · Parallel Deep Learning and Large Models

7 chapters · 62 sections

Data, model, pipeline, sharded, and expert parallelism; elastic training; foundation models; distributed RL; and distributed HPO.

  1. 15
    Data-Parallel Deep Learning
    Replicas, all-reduce SGD, and large-batch training.
    1. 15.1 Why Deep Learning Needs Distributed Training
    2. 15.2 Single-GPU, Multi-GPU, and Multi-Node Training
    3. 15.3 Data Parallelism
    4. 15.4 Gradient Synchronization and All-Reduce
    5. 15.5 Gradient Bucketing and Communication/Computation Overlap
    6. 15.6 PyTorch Distributed Data Parallel
    7. 15.7 Horovod and the Broader Ecosystem
    8. 15.8 Mixed Precision as a Per-Node Enabler
    9. 15.9 Practical Bottlenecks and Scaling Efficiency
  2. 16
    Model, Pipeline, and Sharded Parallelism
    Tensor and pipeline parallelism, ZeRO, and FSDP.
    1. 16.1 When the Model No Longer Fits on One Device
    2. 16.2 Tensor Parallelism
    3. 16.3 Pipeline Parallelism
    4. 16.4 Sharded Data Parallelism: ZeRO Stages 1-3
    5. 16.5 PyTorch FSDP
    6. 16.6 DeepSpeed and Megatron-LM
    7. 16.7 Sequence and Context Parallelism
    8. 16.8 Activation Checkpointing as a Per-Node Enabler
    9. 16.9 3D and 4D Parallelism
    10. 16.10 Choosing and Tuning a Parallelism Strategy
  3. 17
    Expert Parallelism and Sparse Distributed Models
    Mixture-of-experts routing and all-to-all communication.
    1. 17.1 Dense vs Sparse Scaling
    2. 17.2 The Mixture-of-Experts Layer
    3. 17.3 Routing and Gating
    4. 17.4 Expert Parallelism: Sharding Experts Across Nodes
    5. 17.5 All-to-All Communication for Token Routing
    6. 17.6 Load Balancing Across Experts
    7. 17.7 Capacity Factors, Token Dropping, and Stability
    8. 17.8 Serving Distributed MoE Models
    9. 17.9 Trade-Offs vs Dense Distributed Models
  4. 18
    Elastic and Fault-Tolerant Distributed Training
    Checkpointing, elasticity, and surviving node failure.
    1. 18.1 Failure Is the Norm at Thousand-GPU Scale
    2. 18.2 Distributed Checkpointing
    3. 18.3 Restart, Replay, and Determinism
    4. 18.4 Elastic Training
    5. 18.5 Straggler Detection and Mitigation
    6. 18.6 Preemption and Spot-Instance Training
    7. 18.7 Memory Offload Across the Hierarchy
    8. 18.8 Monitoring and Debugging Distributed Training
  5. 19
    Training Foundation Models at Scale
    3D parallelism, scaling laws, and training stability.
    1. 19.1 Foundation Models as Distributed Systems
    2. 19.2 Scaling Laws
    3. 19.3 Distributed Dataset Construction
    4. 19.4 Distributed Deduplication and Data Quality
    5. 19.5 Tokenization at Scale
    6. 19.6 Orchestrating Distributed Pretraining
    7. 19.7 Distributed Fine-Tuning
    8. 19.8 Distributed Alignment: A Systems View
    9. 19.9 Energy, Cost, and Responsible Scaling
  6. 20
    Distributed Reinforcement Learning Infrastructure
    Actors, learners, and distributed replay at scale.
    1. 20.1 Why RL Is a Distributed-Systems Problem
    2. 20.2 The Actor-Learner Architecture
    3. 20.3 Distributed Experience Collection
    4. 20.4 Distributed Replay Buffers
    5. 20.5 Off-Policy Correction at Scale
    6. 20.6 Ape-X, R2D2, and SEED RL Designs
    7. 20.7 Synchronous vs Asynchronous RL Systems
    8. 20.8 Scaling Bottlenecks: Sampling vs Learning Throughput
    9. 20.9 Frameworks and Practice
  7. 21
    Distributed Hyperparameter Search and AutoML
    Parallel search, Hyperband, and population-based training.
    1. 21.1 Why Search Is Embarrassingly Parallel, and Why That Is Not Enough
    2. 21.2 Grid, Random, and Bayesian Optimization
    3. 21.3 Multi-Fidelity Optimization
    4. 21.4 Successive Halving and Hyperband
    5. 21.5 Population-Based Training
    6. 21.6 Distributed Trial Scheduling and Early Stopping
    7. 21.7 Ray Tune and the AutoML Ecosystem
    8. 21.8 Cost-Aware Distributed Experimentation

Part V · Distributed Inference and Serving

5 chapters · 44 sections

Per-node efficiency as a labeled prerequisite, multiplied across the fleet: inference systems, LLM serving, vector search, and MLOps.

  1. 22
    Per-Node Inference Efficiency: A Prerequisite
    Quantization, the paged KV cache, and the per-node prerequisite.
    1. 22.1 Why One Node's Efficiency Determines Fleet Cost
    2. 22.2 Quantization
    3. 22.3 Pruning and Sparsity
    4. 22.4 Knowledge Distillation
    5. 22.5 KV Cache and Paged Attention
    6. 22.6 FlashAttention and Efficient Attention
    7. 22.7 Continuous Batching and Speculative Decoding
    8. 22.8 Compilation and Kernel Optimization
    9. 22.9 From Per-Node Numbers to Fleet Sizing
  2. 23
    Distributed Inference Systems
    Load balancing, batching, and autoscaling replicas.
    1. 23.1 Why Model Serving Differs from Web Serving
    2. 23.2 Replicas, Load Balancing, and Batch-Aware Routing
    3. 23.3 Online vs Batch Inference Across a Fleet
    4. 23.4 Autoscaling on GPU Utilization and Queue Depth
    5. 23.5 Multi-Model and Multi-Tenant GPU Serving
    6. 23.6 Large-Model Loading, Cold Starts, and Warm Pools
    7. 23.7 Availability, Failover, and Redundancy
    8. 23.8 Serving Frameworks and Practice
  3. 24
    Distributed LLM Serving
    vLLM, continuous batching, and paged attention.
    1. 24.1 Why Large-Model Serving Spans Many Machines
    2. 24.2 Tensor-Parallel Inference
    3. 24.3 Pipeline-Parallel and Multi-Node Inference
    4. 24.4 Distributed and Paged KV Cache
    5. 24.5 Prefill/Decode Disaggregation
    6. 24.6 Request Scheduling and Continuous Batching Across Nodes
    7. 24.7 Prefix Caching and Multi-LoRA Fleets
    8. 24.8 Serving Distributed MoE Models
    9. 24.9 Inference Engines and Practice
  4. 25
    Distributed Retrieval and Vector Search
    Approximate nearest neighbor, sharded indexes, and scatter-gather.
    1. 25.1 Retrieval-Augmented Generation as a Distributed System
    2. 25.2 Distributed Embedding Pipelines
    3. 25.3 Vector Databases
    4. 25.4 Approximate Nearest Neighbor Search
    5. 25.5 Index Sharding and Replication
    6. 25.6 Distributed Hybrid Search
    7. 25.7 Multi-Stage Retrieval and Distributed Reranking
    8. 25.8 Distributed Caching for Retrieval
    9. 25.9 Evaluating Distributed Retrieval
  5. 26
    MLOps for Distributed AI
    Pipelines, registries, monitoring, and drift across a fleet.
    1. 26.1 Operating AI Across a Fleet
    2. 26.2 Distributed Data and Training Pipelines
    3. 26.3 Model and Prompt Registries
    4. 26.4 CI/CD for Distributed ML
    5. 26.5 Distributed Experiment Tracking
    6. 26.6 Fleet-Wide Monitoring and Observability
    7. 26.7 Distributed Drift Detection
    8. 26.8 A/B Testing and Shadow Deployment at Scale
    9. 26.9 Rollbacks, Incident Response, and Guardrails

Part VI · Distributed AI and Multi-Agent Systems

6 chapters · 55 sections

Distributing the intelligence itself: distributed AI, game theory, multi-agent RL, swarm intelligence, and agent orchestration.

  1. 27
    Distributed Artificial Intelligence
    Agents, coordination, and distributed problem solving.
    1. 27.1 History of Distributed Artificial Intelligence
    2. 27.2 Distributed Problem Solving
    3. 27.3 Centralized, Decentralized, and Hybrid AI
    4. 27.4 Blackboard Systems
    5. 27.5 The Contract-Net Protocol
    6. 27.6 Distributed Constraint Optimization
    7. 27.7 Coordination and Cooperation
    8. 27.8 Distributed Knowledge and Belief
    9. 27.9 DAI in Modern AI Systems
  2. 28
    Game-Theoretic Foundations for Multi-Agent AI
    Equilibria, auctions, and mechanism design for agents.
    1. 28.1 Why Agents Need Game Theory
    2. 28.2 Normal-Form and Extensive-Form Games
    3. 28.3 Nash Equilibria and Solution Concepts
    4. 28.4 Cooperative Games and Coalitions
    5. 28.5 Social Welfare and Pareto Optimality
    6. 28.6 Mechanism Design and Auctions
    7. 28.7 Repeated Games and Learning Dynamics
  3. 29
    Multi-Agent Systems
    Agent architectures, negotiation, and coordination.
    1. 29.1 What Is an Agent?
    2. 29.2 Agent Architectures
    3. 29.3 Multi-Agent Environments
    4. 29.4 Communication
    5. 29.5 Coordination
    6. 29.6 Negotiation
    7. 29.7 Coalition Formation
    8. 29.8 Task Allocation
    9. 29.9 Consensus
    10. 29.10 Trust and Reputation
  4. 30
    Multi-Agent Reinforcement Learning
    Markov games, CTDE, and value decomposition.
    1. 30.1 From Reinforcement Learning to MARL
    2. 30.2 Markov Games
    3. 30.3 Cooperative, Competitive, and Mixed Settings
    4. 30.4 Independent Learners
    5. 30.5 Centralized Training with Decentralized Execution
    6. 30.6 Value Decomposition
    7. 30.7 Policy Gradient Methods in MARL
    8. 30.8 Credit Assignment
    9. 30.9 Non-Stationarity
    10. 30.10 Distributed MARL Training
  5. 31
    Swarm Intelligence and Collective Behavior
    Flocking, ant colonies, particle swarms, and emergence.
    1. 31.1 Collective Intelligence
    2. 31.2 Swarm Intelligence
    3. 31.3 Ant Colony Optimization
    4. 31.4 Particle Swarm Optimization
    5. 31.5 Flocking and Distributed Consensus
    6. 31.6 Collective Perception
    7. 31.7 Emergent Communication
    8. 31.8 Coordination Without Central Control
    9. 31.9 Failure Modes in Collective Systems
  6. 32
    Distributed Agent Orchestration
    Tool-using LLM agents, planners, and the MCP and A2A protocols.
    1. 32.1 LLM Agents as Distributed Components
    2. 32.2 Tool Use and Function Calling
    3. 32.3 Planner-Executor and Role-Specialized Agents
    4. 32.4 Parallel and Distributed Multi-Agent Workflows
    5. 32.5 Debate, Critique, and Reflection Across Agents
    6. 32.6 Agent Communication Protocols (MCP and A2A)
    7. 32.7 Shared State and Distributed Memory
    8. 32.8 Distributed Orchestration Engines
    9. 32.9 Evaluating Distributed Agentic Systems
    10. 32.10 Cost, Latency, and Reliability at Scale

Part VII · Cluster, Edge, and Reliable Infrastructure

3 chapters · 26 sections

The substrate everything runs on, and how it stays alive: cluster scheduling, edge and on-device AI, and reliable, secure distributed AI.

  1. 33
    Cluster Infrastructure and Scheduling
    Accelerators, Kubernetes, and gang scheduling.
    1. 33.1 Anatomy of an AI Cluster
    2. 33.2 Compute: CPUs, GPUs, TPUs, and Accelerator Instances
    3. 33.3 Containers and Kubernetes for AI
    4. 33.4 Batch Schedulers: Slurm, Kubernetes Batch, and Volcano
    5. 33.5 Gang Scheduling and Collective-Aware Placement
    6. 33.6 Multi-Tenant GPU Sharing: MIG, MPS, and Time-Slicing
    7. 33.7 Ray Clusters and the Object Store
    8. 33.8 Spot and Preemptible Scheduling for Cost Optimization
    9. 33.9 Managed Platforms: Databricks, SageMaker, and Vertex AI
  2. 34
    Edge, Fog, and On-Device Distributed AI
    Split inference, federated edge, and real-time deadlines.
    1. 34.1 Edge AI as Distribution to the Periphery
    2. 34.2 Fog Computing
    3. 34.3 On-Device Inference
    4. 34.4 Edge-Cloud Collaboration and Split Computing
    5. 34.5 Distributed Sensing
    6. 34.6 Federated Edge Learning
    7. 34.7 Latency-Critical Distributed AI
    8. 34.8 Robotics and Autonomous Systems
    9. 34.9 Privacy-Preserving Edge AI
  3. 35
    Reliable and Secure Distributed AI
    Fault tolerance, Byzantine robustness, and differential privacy.
    1. 35.1 Reliability in Distributed AI
    2. 35.2 Fault Tolerance and Recovery
    3. 35.3 Security in Distributed AI
    4. 35.4 Data and Model Poisoning in Distributed and Federated Settings
    5. 35.5 Byzantine-Robust Aggregation
    6. 35.6 Privacy and Differential Privacy in Distributed Learning
    7. 35.7 Auditability and Governance Across a Fleet
    8. 35.8 Bias and Environmental Cost at Scale

Part VIII · Case Studies and Capstone Projects

6 chapters · 57 sections

The whole book assembled into systems: web-scale RAG, federated medical AI, recommendation, robotics, agentic apps, and a capstone.

  1. 36
    Web-Scale Text Processing and Distributed RAG
    Crawl, clean, embed, shard, retrieve, and generate at web scale.
    1. 36.1 Problem Definition
    2. 36.2 Distributed Crawling
    3. 36.3 Distributed Cleaning and Deduplication
    4. 36.4 Distributed Indexing
    5. 36.5 Distributed Embedding Generation
    6. 36.6 Sharded Retrieval and Ranking
    7. 36.7 RAG Integration Across a Fleet
    8. 36.8 Evaluation
    9. 36.9 Project Extension
  2. 37
    Federated Medical AI
    Training a clinical model across hospitals without moving data.
    1. 37.1 Problem Definition
    2. 37.2 Multi-Hospital Data
    3. 37.3 Privacy Constraints
    4. 37.4 Federated Learning Setup
    5. 37.5 Data Heterogeneity
    6. 37.6 Secure Aggregation
    7. 37.7 Monitoring and Drift Across Sites
    8. 37.8 Safety and Responsibility
    9. 37.9 Project Extension
  3. 38
    Distributed Recommendation at Scale
    Sharded embeddings and the retrieve-then-rank funnel.
    1. 38.1 Problem Definition
    2. 38.2 Distributed User and Item Embeddings
    3. 38.3 Sharded Candidate Generation
    4. 38.4 Distributed Ranking Models
    5. 38.5 Feature Stores
    6. 38.6 Real-Time Personalization
    7. 38.7 Online Evaluation
    8. 38.8 System Architecture
    9. 38.9 Project Extension
  4. 39
    Multi-Agent Robotics and Drone Swarms
    Decentralized coordination, multi-agent RL, and sim-to-real.
    1. 39.1 Problem Definition
    2. 39.2 Multi-Robot Coordination
    3. 39.3 Distributed Task Allocation
    4. 39.4 Communication Constraints
    5. 39.5 Shared Situational Awareness
    6. 39.6 Decentralized Control
    7. 39.7 Multi-Agent Reinforcement Learning
    8. 39.8 Simulation-to-Real Transfer
    9. 39.9 Safety and Failure Modes
    10. 39.10 Project Extension
  5. 40
    Distributed LLM and Agentic Applications
    Document pipelines, RAG, a vLLM fleet, and agent orchestration.
    1. 40.1 Problem Definition
    2. 40.2 Distributed Document Processing
    3. 40.3 Embedding Pipelines
    4. 40.4 Sharded Vector Search
    5. 40.5 RAG at Scale
    6. 40.6 Distributed Agent Orchestration
    7. 40.7 Distributed Model Serving with vLLM
    8. 40.8 Cost Control Across the Fleet
    9. 40.9 Evaluation
    10. 40.10 Project Extension
  6. 41
    Capstone Project Design
    Choose, baseline, design, measure, and present a scale-out system.
    1. 41.1 Choosing a Distributed AI Problem
    2. 41.2 Defining the Distribution Axis
    3. 41.3 Building a Single-Machine Baseline
    4. 41.4 Designing the Distributed Version
    5. 41.5 Selecting Tools and Infrastructure
    6. 41.6 Evaluation Metrics: Speedup, Efficiency, and Cost
    7. 41.7 Cost and Performance Analysis
    8. 41.8 Reproducibility Package
    9. 41.9 Final Report
    10. 41.10 Final Presentation

Back Matter · Appendices

4 appendices

A self-contained math refresher, the companion cluster lab, the notation and glossary, and a catalogue of datasets and benchmarks.