Appendix D
Datasets, Benchmarks, and Resources

"The chapters taught me how to split the work. Then I went looking for work big enough to be worth splitting, and discovered that the dataset, not the method, decides whether a project ever feels distributed at all."

A Worker Searching for a Shard Worth Owning

Big Picture

A distributed-AI project is only as convincing as the resource it runs on: a method that scales out is proven by a dataset, a benchmark, and a framework that actually exercise the axis you claim to have distributed. The forty-one chapters teach how to split data, training, models, inference, coordination, and intelligence across machines. A capstone or a course project turns that knowledge into a measured result, and a measured result needs three things this appendix collects in one place: real data large or heterogeneous enough that one machine genuinely struggles, a benchmark that fixes a comparable number so your claim can be checked, and a framework that handles the mechanics the chapters explained from first principles. The entries below are real, canonical projects, each annotated with its scale, the distribution axis it stresses, and the chapter that uses it, so you can pick the resource that makes the bottleneck you chose to attack actually bind. We name the canonical projects rather than pin specific download links; access points move, but the names and the axes they exercise are stable.

This appendix is organized in four sections that mirror the life of a project. Section D.1 lists datasets, grouped by the axis each one stresses, because the first decision is what to point the cluster at. Section D.2 lists benchmarks, the agreed measuring sticks that turn a result into a comparable claim. Section D.3 lists frameworks and tools, grouped by the stage of the pipeline they own, so the from-scratch methods of the chapters have a production counterpart. Section D.4 lists venues, surveys, and communities, and points back at the book's own chapters as the entry points that send you here. Everything is cross-linked to the chapter that develops it, so a resource and the method that consumes it are never more than one click apart.

Key Insight: Choose the Dataset That Makes Your Axis Bind

The most common failure of a distribution project is a dataset that fits comfortably on one machine, so the distributed version is slower than the baseline and the speedup curve points the wrong way (Section 1.1 showed why an unforced split only buys communication cost). Before picking a dataset, name the ceiling you intend to hit: data volume that overflows one disk (Common Crawl, Criteo 1TB), model size that overflows one accelerator (a foundation-model corpus like The Pile or FineWeb feeding a billion-parameter model), request throughput that overflows one server (a recommendation or retrieval workload), or heterogeneity and privacy that forbid centralization (MIMIC across hospitals, federated EHR). Then choose the dataset whose scale forces exactly that ceiling. A dataset that exercises the wrong axis produces a clean experiment that proves nothing about the claim you set out to make.
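The ceiling check can be made concrete before any cluster time is spent. The sketch below is a minimal illustration, not a sizing tool: the `Machine` capacities and the 70-billion-parameter model are hypothetical values chosen for the example, the 825 GB figure is the size quoted for The Pile in this appendix, and the default of 2 bytes per parameter assumes 16-bit weights only (training also needs gradients and optimizer state, so real footprints are larger).

```python
from dataclasses import dataclass

GB = 10**9
TB = 10**12


@dataclass(frozen=True)
class Machine:
    """Hypothetical single-node capacities; substitute your own hardware."""
    disk_bytes: int
    accelerator_bytes: int


def binding_ceilings(dataset_bytes, n_params, machine, bytes_per_param=2):
    """Return the single-machine ceilings this workload overflows.

    bytes_per_param=2 is an assumption (16-bit weights only).
    """
    ceilings = []
    if dataset_bytes > machine.disk_bytes:
        ceilings.append("data volume")
    if n_params * bytes_per_param > machine.accelerator_bytes:
        ceilings.append("model size")
    return ceilings


# Assumed node: 4 TB of local disk and one 80 GB accelerator.
node = Machine(disk_bytes=4 * TB, accelerator_bytes=80 * GB)

# A sub-gigabyte dataset with a small model: no ceiling binds.
small = binding_ceilings(dataset_bytes=1 * GB, n_params=10_000_000, machine=node)

# An 825 GB corpus with a 70-billion-parameter model: the weights alone
# need 140 GB, which overflows the 80 GB accelerator.
large = binding_ceilings(dataset_bytes=825 * GB, n_params=70_000_000_000, machine=node)

print(small)  # []
print(large)  # ['model size']
```

An empty list is the warning sign described above: if nothing overflows, the distributed version has no ceiling to beat and will mostly measure communication overhead.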

D.1 Datasets for Distributed AI Projects (Beginner)

The datasets below are grouped by the domain and the distribution axis they most naturally stress. Text corpora at web scale force the data axis (Part II) and feed the training axis (Parts III and IV); recommendation and graph datasets force the embedding and parameter-server axis (Chapter 11, Chapter 13); medical and federated datasets force the decentralization axis (Chapter 14); and RL environments force the actor-learner and multi-agent axis (Chapter 20, Chapter 30). Each entry names its scale and the axis it exercises so the choice is deliberate.

Web-Scale Text (Data and Training Axes)

Common Crawl. commoncrawl.org

A petabyte-scale, monthly-refreshed crawl of the open web in WARC format; the raw input behind most large text corpora. Stresses the data axis hard: deduplicating, filtering, and tokenizing this volume is a genuine distributed-data job, exactly the pipeline of Chapter 6 and the case study in Chapter 36.

🗄 Dataset

C4 (Colossal Clean Crawled Corpus). Raffel et al., 2020. tensorflow.org/datasets

A roughly 750 GB cleaned slice of Common Crawl built for the T5 work; a manageable web-text corpus for a data-pipeline or data-parallel training project that still does not fit in one machine's memory. Pairs with the data-loading work of Chapter 8.

🗄 Dataset

The Pile. Gao et al., 2020. arXiv:2101.00027

An 825 GB diverse English corpus assembled from 22 sources (academic, code, web, dialogue); a standard pretraining mixture for language models below the very largest scale. The training corpus that makes the data-parallel loop of Chapter 15 worth distributing.

🗄 Dataset

FineWeb. Penedo et al., 2024. huggingface.co

A 15-trillion-token, carefully filtered and deduplicated web corpus released with its full processing recipe; a current reference for what a state-of-the-art data pipeline produces. Its scale exercises both the data axis (the filtering pipeline) and the training axis (foundation-model pretraining in Chapter 19).

🗄 Dataset

Recommendation and Embeddings (Parameter-Server Axis)

MovieLens. GroupLens, University of Minnesota. grouplens.org

Rating datasets ranging from 100 K to 25 M interactions; the standard starting point for a recommendation model. Small enough to prototype on one machine, which makes it the place to validate a sharded-embedding design before scaling it up. Pairs with the embedding tables of Chapter 11 and the recommendation case study in Chapter 38.

🗄 Dataset

Criteo 1TB Click Logs. ailab.criteo.com

A terabyte of click-through records with high-cardinality categorical features; the canonical large-scale click-prediction benchmark. Its sparse feature space produces embedding tables too large for one accelerator, forcing the model-parallel embedding sharding of Chapter 11 and the recommendation engineering of Chapter 38.

🗄 Dataset

Vision (Data and Data-Parallel Axes)

ImageNet (ILSVRC). Deng et al., 2009. image-net.org

1.28 M labeled training images across 1,000 classes; the reference image-classification dataset and a long-standing distributed-training benchmark (the MLPerf ResNet-50 task trains on it). The right scale to demonstrate data-parallel training and the input pipeline of Chapter 8 without web-scale data engineering.

🗄 Dataset

LAION-5B. Schuhmann et al., 2022. laion.ai

5.85 billion image-text pairs; the open dataset behind contrastive vision-language and diffusion training. Its volume forces a real distributed-data pipeline (download, filter, shard, stream) on top of data-parallel training, combining the axes of Part II and Chapter 15.

🗄 Dataset

Graphs (Graph-Partitioning Axis)

Open Graph Benchmark (OGB) and OGB-LSC. Hu et al., 2020. ogb.stanford.edu

Standardized graph datasets with leaderboards; the large ones (ogbn-papers100M with 111 M nodes, and the OGB-LSC MAG240M and WikiKG90Mv2 challenges) do not fit in one machine. They force graph partitioning, neighbor sampling across machines, and the distributed GNN training of Chapter 13.

🗄 Dataset

Federated and Medical (Decentralization Axis)

MIMIC-IV and MIMIC-CXR. Johnson et al., PhysioNet. physionet.org

De-identified intensive-care records (and paired chest radiographs) under credentialed access; the standard clinical dataset. Because real patient data cannot leave the institution that holds it, splitting MIMIC-like data across simulated hospitals is the natural way to drive the federated-learning design of Chapter 14 and the federated medical case study in Chapter 37.

🗄 Dataset

LEAF Federated Benchmark and Synthea Synthetic EHR. Caldas et al., 2018; leaf.cmu.edu, synthetichealth.github.io

LEAF supplies naturally partitioned, non-IID datasets (FEMNIST, Shakespeare, Sentiment140) built for federated experiments; Synthea generates realistic synthetic patient records with no privacy barrier. Together they let a federated project model client heterogeneity and the non-IID data splits that Chapter 14 treats.

🗄 Dataset
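When a dataset has no natural client partition (LEAF's datasets do, by writer or user), a common way to simulate label-skewed, non-IID clients is to draw each class's per-client shares from a Dirichlet distribution. The sketch below shows that generic technique using only the standard library; it is not LEAF's or Synthea's own partitioning, and `dirichlet_partition`, `alpha`, and the toy labels are illustrative assumptions. Smaller values of `alpha` produce stronger skew.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with label skew.

    For each class, per-client shares are drawn from Dirichlet(alpha),
    sampled here as normalized Gamma(alpha, 1) variates.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(n_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        # Cumulative shares become cut points into this class's indices.
        start, cumulative = 0, 0.0
        for client, g in enumerate(gammas):
            cumulative += g / total
            if client == n_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            clients[client].extend(indices[start:end])
            start = end
    return clients


# Toy labels (assumption): 1,000 samples over 10 balanced classes.
labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(labels, n_clients=5, alpha=0.3)
print([len(p) for p in parts])  # client sizes are uneven and depend on the seed
```

Every index lands on exactly one client, so the union of the partitions is the original dataset; only the per-client label mix changes.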

RL and Multi-Agent Environments (Actor-Learner Axis)

Gymnasium and Arcade Learning Environment (Atari). gymnasium.farama.org

The maintained successor to OpenAI Gym, with the Atari suite as the classic single-agent RL benchmark. Its environments are cheap to step in parallel, which is exactly why they expose the distributed actor-learner architecture (many parallel rollout workers, one learner) of Chapter 20.

🗄 Environment

PettingZoo and the StarCraft Multi-Agent Challenge (SMAC). Terry et al., 2021; Samvelyan et al., 2019. pettingzoo.farama.org

PettingZoo is the standard multi-agent environment API (the multi-agent counterpart to Gymnasium); SMAC is the benchmark for cooperative multi-agent control. They drive the distributed multi-agent training and coordination of Chapter 30 and the swarm robotics case study in Chapter 39.

🗄 Environment

D.2 Benchmarks (Intermediate)

A dataset gives a project something to scale; a benchmark gives it a comparable number. The benchmarks below fix what to measure and how to report it, so a capstone result can be checked against a community baseline rather than asserted in isolation. That is the standard the methodology of Chapter 41 and the evaluation chapter (Chapter 5) demand. Each entry names what it measures.

Training and Inference Performance

MLPerf Training and MLPerf Inference. MLCommons; Mattson et al., 2020. mlcommons.org

The industry-standard suites for time-to-train (Training) and latency and throughput at fixed accuracy (Inference). They measure exactly the quantities a distributed project must report: scaling efficiency, wall-clock to a target, and queries per second under a latency bound. The template for stating a capstone result so others can compare it (Chapter 5, Chapter 41).

📊 Benchmark
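The three quantities named in the MLPerf entry reduce to short formulas. The sketch below uses the usual definitions (speedup is single-worker time over N-worker time, efficiency is speedup divided by N) and a nearest-rank percentile for the latency bound. The timings and latencies are made-up illustrative numbers, not MLPerf results, and the official MLPerf rules define their scenarios more precisely than this.

```python
import math


def speedup(t_one, t_n):
    """Time on one worker divided by time on N workers."""
    return t_one / t_n


def scaling_efficiency(t_one, t_n, n_workers):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t_one, t_n) / n_workers


def qps_under_latency_bound(latencies_s, window_s, bound_s, percentile=99):
    """Queries per second over the window, or None if the tail latency
    (nearest-rank percentile) exceeds the bound."""
    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    if ordered[rank - 1] > bound_s:
        return None
    return len(ordered) / window_s


# Illustrative numbers only: 1200 minutes on 1 worker, 200 minutes on 8.
s = speedup(1200, 200)
e = scaling_efficiency(1200, 200, 8)
print(s, e)  # 6.0 0.75

# 100 requests observed in a 2-second window; one slow outlier.
latencies = [0.010] * 99 + [0.200]
print(qps_under_latency_bound(latencies, window_s=2.0, bound_s=0.050))  # 50.0
print(qps_under_latency_bound(latencies, 2.0, 0.050, percentile=100))   # None
```

Reporting efficiency alongside raw speedup matters because a 6x speedup on 8 workers and a 6x speedup on 64 workers are very different results.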

Megatron-LM and DeepSpeed throughput references. NVIDIA; Microsoft. github.com/NVIDIA/Megatron-LM

Published model-FLOPs-utilization (MFU) and tokens-per-second-per-GPU figures for large-model training; the de-facto reference points for whether a parallelism configuration is efficient. They measure the fraction of peak compute a sharded training run actually realizes, the number the model-parallel methods of Chapter 16 and Chapter 19 are tuned against.

📊 Benchmark
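MFU is a ratio of achieved to peak FLOP/s. The sketch below relies on the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (forward plus backward, ignoring the attention term). That approximation, the function name, and every number in the example are assumptions for illustration, not figures published by Megatron-LM or DeepSpeed.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Achieved model FLOP/s divided by aggregate peak FLOP/s.

    Assumes ~6 FLOPs per parameter per trained token (dense transformer,
    attention term ignored).
    """
    achieved = 6 * n_params * tokens_per_second
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical run: 7e9 parameters on 64 GPUs at 2,000 tokens/s per GPU,
# with an assumed peak of 312e12 FLOP/s per GPU.
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_second=2000 * 64,
    n_gpus=64,
    peak_flops_per_gpu=312e12,
)
print(f"{mfu:.1%}")  # 26.9%
```

Because the peak term scales with the GPU count, MFU stays comparable across cluster sizes, which is why it is a more useful headline number than raw tokens per second.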

Retrieval, Embeddings, and RAG

BEIR. Thakur et al., 2021. github.com/beir-cellar/beir

A heterogeneous zero-shot information-retrieval benchmark across 18 datasets; measures how well a retriever generalizes. The standard yardstick for the dense-retrieval and vector-search systems of Chapter 25 and the retrieval half of the RAG case study in Chapter 36.

📊 Benchmark

MTEB (Massive Text Embedding Benchmark). Muennighoff et al., 2022. github.com/embeddings-benchmark/mteb

Evaluates text-embedding models across eight task types (including retrieval, clustering, classification, and reranking) on 50-plus datasets; measures embedding quality, which sets the recall ceiling any distributed vector index can deliver. The quality baseline behind the index-sharding decisions of Chapter 25.

📊 Benchmark

RAGAS. Es et al., 2023. github.com/explodinggradients/ragas

A reference-light evaluation framework for retrieval-augmented generation; measures faithfulness, answer relevance, and context precision. It scores the end-to-end RAG pipeline, the system-level metric the distributed RAG case study of Chapter 36 and the agentic applications of Chapter 40 report against.

📊 Benchmark

Ranking and Online Evaluation

A/B testing and ranking metrics (NDCG, MAP, AUC; interleaving and sequential testing). Standard methodology.

The offline ranking metrics (NDCG, MAP, AUC) and the online experimentation protocols (A/B tests, interleaving, sequential and CUPED-adjusted analysis) that decide whether a recommendation or retrieval change is real. The measurement discipline the recommendation case study of Chapter 38 and the MLOps chapter (Chapter 26) build on.

📊 Benchmark
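Of the offline metrics above, NDCG is the one most often implemented by hand. The sketch below uses the linear-gain form (gain equal to the relevance grade, discount of log2(rank + 1) for 1-based ranks). Some libraries use the exponential gain 2^rel - 1 instead, so check which form a published baseline used before comparing numbers. The relevance grades are toy values.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results, linear gain."""
    # enumerate() is 0-based, so rank + 2 gives log2(1-based rank + 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of six results, in the order the system ranked them.
ranked = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked, k=6)
print(round(score, 3))  # 0.961

print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0, already in ideal order
```

Note that the ideal ordering here is built only from the judged results that were returned; an evaluation that knows about relevant items the system missed should build the ideal list from the full set of judgments.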

D.3 Frameworks and Tools (Intermediate)

Every chapter that builds a distributed primitive from scratch also names the production tool that does it in a few lines (the "right tool" principle of the Preface). This section collects those tools, grouped by the stage of the pipeline they own: data processing, distributed training, model parallelism, inference and serving, retrieval, federated learning, orchestration, and experiment tracking. Each entry links the chapter that uses it, so the from-scratch method and its industrial counterpart sit side by side.

Distributed Data (Data Axis)

Apache Spark. spark.apache.org

The standard engine for distributed dataframes and large-scale ETL; handles partitioning, the shuffle, and fault-tolerant re-execution. The production realization of the MapReduce model of Chapter 6 and the subject of Chapter 7.

🛠 Tool

Dask. dask.org

Parallel computing in native Python with familiar NumPy and pandas APIs; the lighter-weight path to out-of-core and multi-node data processing when a Spark cluster is more than the job needs. A complement to the data-loading work of Chapter 8.

🛠 Tool

Distributed Training and Model Parallelism (Training and Model Axes)

PyTorch DistributedDataParallel (DDP) and Fully Sharded Data Parallel (FSDP). pytorch.org

DDP automates the gradient all-reduce of Section 1.1; FSDP shards parameters, gradients, and optimizer state to fit models larger than one accelerator. The core training tools of Chapter 15 and Chapter 16.

🛠 Tool

DeepSpeed. Microsoft. deepspeed.ai

The ZeRO family of memory-optimization stages plus pipeline parallelism and offload; the toolkit that makes billion-parameter training fit on commodity clusters. The production engine behind the sharded-parallelism methods of Chapter 16 and foundation-model training in Chapter 19.

🛠 Tool

Megatron-LM. NVIDIA. github.com/NVIDIA/Megatron-LM

Tensor and pipeline model parallelism for transformer training at the largest scale; the reference implementation of the 3D-parallelism the foundation-model chapter assembles. Used throughout Chapter 16, Chapter 17, and Chapter 19.

🛠 Tool

TorchRec. Meta. pytorch.org/torchrec

A PyTorch domain library for large-scale recommendation, with sharded embedding tables and the model-parallel plumbing for high-cardinality features. The production counterpart to the distributed-embedding design of Chapter 11 and the recommendation case study in Chapter 38.

🛠 Tool

Orchestration, Tuning, and RL (Coordination Axis)

Ray (Train, Serve, RLlib, Tune). Anyscale. ray.io

A unified framework whose libraries cover distributed training (Train), online serving (Serve), reinforcement learning (RLlib), and hyperparameter search (Tune). The connective tissue across several axes: the actor-learner RL infrastructure of Chapter 20, the distributed HPO of Chapter 21, and serving in Chapter 23.

🛠 Tool

Kubernetes, Volcano, and Slurm. kubernetes.io, volcano.sh, slurm.schedmd.com

The schedulers that place distributed jobs on a cluster: Kubernetes for general orchestration, Volcano for gang-scheduled batch AI workloads on top of it, and Slurm for HPC-style allocation. The cluster-management substrate of Chapter 33 and the companion cluster lab in Appendix B.

🛠 Tool

Inference, Serving, and Retrieval (Inference Axis)

vLLM. github.com/vllm-project/vllm

A high-throughput LLM serving engine built on paged-attention KV-cache management and continuous batching; the reference system for serving large models efficiently across a fleet. The production engine behind the distributed LLM serving of Chapter 24, building on the per-node economics of Chapter 22.

🛠 Tool

FAISS and ScaNN. Meta; Google. github.com/facebookresearch/faiss, github.com/google-research/scann

The standard libraries for approximate nearest-neighbor search over billion-scale embedding indexes; the building blocks of a sharded vector store. The retrieval engines of Chapter 25 and the index behind the RAG case study in Chapter 36.

🛠 Tool

Federated Learning (Decentralization Axis)

Flower and NVIDIA FLARE. flower.ai, github.com/NVIDIA/NVFlare

Frameworks for federated learning across decentralized clients: Flower is framework-agnostic and research-friendly, while NVIDIA FLARE targets production deployments with privacy and security features. The tooling for the federated methods of Chapter 14 and the federated medical case study in Chapter 37.

🛠 Tool

Experiment Tracking

MLflow and Weights & Biases. mlflow.org, wandb.ai

Tools for logging runs, metrics, artifacts, and model versions across a distributed cluster; they make a multi-node experiment reproducible and a result auditable. The tracking backbone of the MLOps chapter (Chapter 26) and the reproducibility package the capstone assembles in Chapter 41.

🛠 Tool

D.4 Further Reading and Communities (Advanced)

A project that hits the frontier needs the literature and the communities that maintain it. The venues below are where distributed-AI systems work is published; the surveys and docs are the entry points into a subfield; and the book's own chapters are the map that sent you to every resource on this page. When a method here outgrows the book, these are the next places to read.

Key Venues

Systems venues: OSDI, SOSP, NSDI, and MLSys. usenix.org, mlsys.org

OSDI and SOSP (operating systems and distributed systems), NSDI (networked systems), and MLSys (machine learning systems) are where the infrastructure this book teaches is first published: MapReduce, Spark, parameter servers, and most large-scale training and serving systems appeared at these venues. The primary literature for Parts II, IV, V, and VII.

🎓 Venue

Machine-learning venues: NeurIPS, ICML, and ICLR. neurips.cc, icml.cc

Where the algorithms that the systems run are published: distributed optimization, federated learning, gradient compression, and multi-agent reinforcement learning. The primary literature for Parts III and VI and the research-frontier callouts throughout the book.

🎓 Venue

Surveys and Foundational References

Dean, J., Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004. research.google

The paper that made cluster-scale data processing routine; the conceptual root of the data axis. The starting point for the lineage that Chapter 6 traces forward to the all-reduce of distributed training.

📄 Paper

Kleppmann, M. "Designing Data-Intensive Applications." O'Reilly, 2017. dataintensive.net

The standard treatment of partitioning, replication, consistency, and fault tolerance; the four-question vocabulary the distributed-systems foundations of Chapter 2 and the capstone design in Chapter 41 draw on directly.

📖 Book

Kairouz, P., McMahan, H. B., et al. "Advances and Open Problems in Federated Learning." 2021. arXiv:1912.04977

The comprehensive survey of federated learning, from systems to privacy to open problems; the reference map for the decentralization axis that Chapter 14 condenses.

📄 Survey

Documentation and Communities

Framework documentation and engineering blogs: PyTorch, Ray, vLLM, DeepSpeed, Hugging Face. huggingface.co/blog

The official docs and engineering blogs of the tools in Section D.3 are the most current source for distributed-training and serving recipes; they move faster than any printed text. The practical companion to the chapters that introduce each tool.

📚 Docs

This book as an entry point. Table of Contents

Every chapter's own bibliography card (8 to 15 annotated, hyperlinked entries grouped by category) is the curated reading list for that topic. To go deeper on any axis, start from the chapter that owns it in the chapter map, then follow its bibliography outward into the venues and surveys above.

📖 Book

Looking Back, and the Road You Built

This is the last page of the book. It closes the way Section 1.1 opened, on the move from one machine to many, but the pen is now in your hands. The chapters gave you the methods: split the data (Part II), distribute the training (Part III), shard the model (Part IV), serve from a fleet (Part V), coordinate many agents (Part VI), and run it all on real infrastructure (Part VII). The appendices gave you the apparatus: the mathematics in Appendix A, the cluster in Appendix B, the words in Appendix C, and now, in this appendix, the data, the benchmarks, and the tools. What remains is yours to build. Pick a dataset whose scale makes a ceiling bind, name the axis that ceiling lives on, reach for the framework that owns it, measure against the benchmark that makes the claim comparable, and defend the number. The journey from a single process on a single computer to a fleet of cooperating machines is complete; the next system is the one you point all of this at. The cluster is waiting.