First Edition · 2026
Book cover: a glowing neural core radiating circuits into a constellation of servers, devices, and agents, with the title Building Scalable AI, From Big Data Algorithms to Distributed Intelligence

Building Scalable AI: From Big Data Algorithms to Distributed Intelligence

A practitioner's guide to distributing data, training, models, inference, coordination, and decision-making across many machines.

Alexander (Sasha) Apartsin, Ph.D. & Yehudit Aperstein, Ph.D.

Modern AI is distributed AI. A single machine can no longer hold the data, the model, the inference traffic, or the fleet of agents that today's systems demand. This book is one connected journey from big-data algorithms to distributed intelligence, organized around six axes of distribution. It leads with scale-out, treats single-node efficiency as a clearly labeled per-node prerequisite, and builds every primitive, from the all-reduce collective to consensus, elastic recovery, and agent orchestration, through the AI operation that uses it, closing with end-to-end case studies and a capstone you design yourself.

8 parts · 41 chapters · 366 sections · 4 appendices & a capstone

The Eight-Part Arc

Each part stands on the one before it; together they carry one system from a single split gradient to a swarm acting as one.

I

Foundations of Distributed AI

The vocabulary every later part reuses: what scale-out is, distributed-systems concepts, scalability and performance models, the communication primitives, and how to evaluate distributed AI with rigor.

5 chapters · 43 sections

II

Distributed Data Processing for AI

The data layer that feeds everything: MapReduce and distributed algorithms, Spark and DataFrames, distributed storage and data loading, and stream processing for online AI.

4 chapters · 36 sections

III

Distributed Machine Learning

Training distributed by hand: distributed optimization, parameter servers and terabyte embeddings, classical and graph ML at scale, and federated and decentralized learning.

5 chapters · 43 sections

IV

Parallel Deep Learning and Large Models

The heart of large-model training: data, model, pipeline, sharded, and expert parallelism; elastic and fault-tolerant training; foundation models; distributed RL; and distributed HPO.

7 chapters · 62 sections

V

Distributed Inference and Serving

Per-node efficiency as a labeled prerequisite, then multiplied across the fleet: distributed inference systems, LLM serving with vLLM, distributed retrieval and vector search, and MLOps.

5 chapters · 44 sections

VI

Distributed AI and Multi-Agent Systems

Distributing the intelligence itself: distributed AI foundations, game theory, multi-agent reinforcement learning, swarm intelligence, and LLM agent orchestration.

6 chapters · 55 sections

VII

Cluster, Edge, and Reliable Infrastructure

The substrate everything runs on, and how it stays alive: cluster infrastructure and scheduling; edge, fog, and on-device AI; and reliable, secure, privacy-preserving distributed AI.

3 chapters · 26 sections

VIII

Case Studies and Capstone Projects

The whole book assembled into systems: web-scale RAG, federated medical AI, distributed recommendation, multi-agent robotics, agentic LLM applications, and a capstone you design.

6 chapters · 57 sections

How This Book Teaches

Five habits, kept in every chapter from the first split gradient to the last agent.

Runnable Demos

Every concept built from first principles comes with a small, complete program, never an isolated snippet, that runs and prints real numbers, so distribution is something you measure rather than assume.

Library Shortcuts

After each from-scratch build, a shortcut callout shows the same task in a few lines of PySpark, PyTorch DDP and FSDP, DeepSpeed, Ray, or vLLM, and names exactly what the framework handles for you.

A Callout System

Big-picture framings, key insights, research frontiers, practical examples, and cross-references are typeset as distinct boxes, so you can read deep or skim fast and never miss a trap.

Exercises & Projects

Each chapter closes with typed exercises and buildable project ideas that extend its worked systems, scaling from quick checks to capstone-sized distributed builds.

Primitives Return Scaled Out

The MapReduce shuffle becomes all-reduce, parameter-server sharding becomes ZeRO and FSDP, data parallelism becomes expert parallelism, and per-node KV-cache economics return multiplied across the serving fleet. One story, told at every scale.
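
To give a flavor of the all-reduce primitive named above, here is a minimal single-process sketch in plain Python. It simulates a ring all-reduce (a reduce-scatter phase followed by an all-gather phase) over per-worker gradient vectors. The ring variant, the function name `ring_all_reduce`, and the toy integer gradients are assumptions chosen for illustration; they are not taken from the book, and a real system would use a communication library rather than in-memory lists.

```python
def ring_all_reduce(worker_vectors):
    """Simulate a sum all-reduce over a logical ring of workers.

    Hypothetical helper for illustration: every worker starts with its own
    vector and ends with the element-wise sum of all vectors.
    """
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    assert all(len(v) == length for v in worker_vectors)
    assert length % n == 0, "sketch assumes the vector splits into n equal chunks"
    chunk = length // n
    bufs = [list(v) for v in worker_vectors]  # private buffer per worker

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    def exchange(first_chunk_offset, combine):
        # n - 1 steps; in each step every worker sends one chunk to its
        # right-hand neighbor. Messages are snapshotted first so that all
        # sends within a step behave as if they happened simultaneously.
        for step in range(n - 1):
            msgs = []
            for i in range(n):
                c = (i + first_chunk_offset - step) % n
                msgs.append((c, [bufs[i][k] for k in span(c)]))
            for i, (c, payload) in enumerate(msgs):
                dst = (i + 1) % n
                for k, val in zip(span(c), payload):
                    bufs[dst][k] = combine(bufs[dst][k], val)

    # Phase 1, reduce-scatter: worker i ends up owning the full sum of
    # chunk (i + 1) mod n.
    exchange(0, lambda mine, incoming: mine + incoming)
    # Phase 2, all-gather: the finished chunks circulate around the ring
    # and overwrite the partial sums.
    exchange(1, lambda mine, incoming: incoming)
    return bufs


# Toy "gradients": 4 workers, 8 integer entries each.
grads = [[w * 10 + k for k in range(8)] for w in range(4)]
reduced = ring_all_reduce(grads)
for rank, vec in enumerate(reduced):
    print(f"worker {rank}: {vec}")
```

In this sketch each worker sends and receives 2 × (n − 1) chunks of size length / n, so per-worker traffic stays close to twice the vector size regardless of how many workers join the ring. That property is one common reason ring-style collectives are used for gradient averaging.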