Part II: Distributed Data Processing for AI
Chapter 7: Spark and Distributed DataFrames

From MapReduce to Spark

"MapReduce and I agreed on the answer every time. We just disagreed about how many times we should drive to the disk to find it."

An Iterative Job That Read the Same Data Ninety Times
Big Picture

MapReduce proved that a cluster of cheap machines can process unbounded data reliably, but it bought that reliability by writing every intermediate result to disk, which makes the iterative and multi-stage pipelines that dominate AI ruinously slow. Spark keeps the same partition-and-recombine model and the same fault tolerance, then changes one thing that changes everything: it keeps the working set in memory across stages, expresses the whole pipeline as one directed acyclic graph of transformations, and recovers lost partitions by recomputing them from a recorded lineage instead of re-reading replicated copies. For the iterative machine learning and multi-stage feature pipelines that fill the rest of this book, that single change is the difference between a job that finishes overnight and one that finishes before lunch. This section explains why the disk round-trip was fatal, what Spark replaced it with, and why the answer stays exactly the same.

In Chapter 6 we built the MapReduce model from first principles: split the input across machines, apply a map to each record, shuffle records by key, and apply a reduce to each group. That model is one of the most consequential ideas in distributed computing, and the discipline it teaches (think in terms of partitioned data, key-based shuffles, and idempotent re-execution) carries directly into everything that follows. But by the end of that chapter, in Section 6.9, one limitation had become impossible to ignore, and it is the limitation that brought Spark into existence. Every MapReduce stage reads its input from disk and writes its output back to disk before the next stage may begin. For a single pass over the data, that cost is paid once and amortized away. For the workloads AI actually runs, it is paid again and again.

Those workloads are iterative. Training a model by gradient descent revisits the same dataset every epoch. PageRank multiplies the same graph by the same vector until it converges, often a hundred times. k-means reassigns the same points to centroids round after round. A feature pipeline cleans, joins, aggregates, and encodes the same records through a dozen sequential stages. Under MapReduce, each of those rounds is a separate job that re-reads the entire working set from disk, performs a little computation, and writes the result back out for the next round to re-read. The useful work is dwarfed by the bookkeeping. Spark was designed to delete that bookkeeping, and Figure 7.1.1 shows the contrast that motivated it.

MapReduce: every iteration round-trips to disk DISK working set Iteration 1 map + reduce Iteration 2 map + reduce Iteration 3 map + reduce Iteration N map + reduce read, compute, write back, repeat: the disk is touched 2N times Spark: read once, then run the DAG entirely in memory DISK working set Cached in memory (RDD) read from disk once Iter 1 map Iter 2 map Iter 3 map Iter N final action
Figure 7.1.1: Why iterative jobs broke MapReduce. In the top row, each of the $N$ iterations is a separate job that reads the working set from disk and writes its output back, so the disk is touched on the order of $2N$ times and the useful compute (the small boxes) is a sliver of the total. In the bottom row, Spark reads the working set from disk once into a cached in-memory dataset (the orange box), then runs every iteration of its directed acyclic graph against memory; only the first read and the final action touch disk. The orange arrow is the one disk read Spark keeps; the black arrows in the bottom row never leave memory.

1. The Disk Round-Trip That MapReduce Could Not Avoid Beginner

The disk round-trip in MapReduce is not an implementation accident; it is load-bearing. Writing every intermediate result to a replicated distributed file system is exactly what makes the model fault tolerant. If a worker dies mid-job, the scheduler re-runs that worker's task on another machine, and the inputs it needs are sitting safely on disk where the previous stage left them. Reliability and the disk round-trip are, in the original design, the same decision. The cost of that decision is invisible for a single pass over the data and crippling for anything that loops.

Put a number on it. Reading a large partition from a distributed file system runs in the tens to hundreds of megabytes per second per stream; the actual transformation applied to those bytes, a multiply-add or a comparison, runs orders of magnitude faster. For a one-pass aggregation, the read happens once and disappears into the noise. For an iterative algorithm that runs the same data through one hundred rounds, the same bytes are read from disk one hundred times, and the job spends the overwhelming majority of its wall-clock moving data it already had, not computing anything new. The Chapter 3 performance models make this precise: when the per-iteration data-movement cost dominates the per-iteration compute, total time scales with the number of iterations times the cost of re-reading, and adding machines does not help, because each machine still re-reads its own partition every round.

Key Insight: The Bottleneck Is Re-Reading, Not Computing

MapReduce materializes every stage to disk so that a dead worker's output can be recovered without recomputation. For a single pass that price is paid once. For an iterative or multi-stage AI pipeline it is paid every round, so the job's wall-clock is governed by how many times the working set is re-read from disk, not by how much arithmetic it performs. Spark's central bet is that you can keep the working set in memory between stages and still recover from failure, by recording how each partition was computed rather than where its bytes were stored.

2. Spark's Three Moves: Memory, DAG, and Lineage Beginner

Spark keeps almost everything MapReduce got right. Data is still partitioned across machines, work is still expressed as transformations applied independently to each partition, key-based shuffles still move records between machines, and failed work is still re-executed rather than mourned. What changes is three connected design moves, and they are worth naming separately because each later section of this chapter develops one of them.

The first move is to keep working sets in memory across stages. Spark's core abstraction, the resilient distributed dataset, is a partitioned collection that can be held in cluster memory and reused by many operations without ever touching disk between them. The second move is to express the whole pipeline, not one stage at a time, as a directed acyclic graph of transformations. Because Spark sees the entire graph before it runs anything, it can fuse adjacent operations, avoid materializing intermediates, and schedule the work as one optimized plan rather than a chain of independent jobs. The third move is to recover from failure through lineage. Instead of relying on replicated on-disk copies, Spark records the sequence of transformations that produced each partition; if a partition is lost, it is recomputed from its inputs by replaying that recorded recipe. Reliability survives, but the disk round-trip that paid for it does not.

These three are one idea seen from three sides. Holding data in memory is only safe if you can recover it after a crash, which is what lineage provides; lineage is only cheap to record if the computation is a graph of deterministic transformations, which is what the DAG provides; and the DAG is only worth optimizing if its intermediate results can live in memory, which is what the resilient distributed dataset provides. The next three sections of this chapter take them in turn: resilient distributed datasets in Section 7.2, the DataFrame and SQL layer built on top of them in Section 7.3, and lazy DAG execution in Section 7.4.

Thesis Thread: Same Partition-and-Recombine Spine, Cheaper Between Rounds

Spark does not overturn the distribution model of this book; it sharpens it. The partition-the-data, transform-each-shard, shuffle-by-key, recombine spine is exactly the one introduced for MapReduce in Chapter 6 and for data-parallel gradients in Section 1.1. What Spark removes is the disk barrier between successive rounds of that spine, which is precisely the barrier that made iterative distributed machine learning impractical. When data-parallel deep learning arrives in Chapter 15, the same instinct (keep the working set resident and synchronize only what must be shared) reappears as keeping parameters on the accelerators and exchanging only gradients. Spark is where this book first makes the round-trip to slow storage optional.

3. Modeling the Speedup From First Principles Intermediate

The argument so far is qualitative; let us make it quantitative with a cost model and a runnable demonstration. Consider an iterative job of $T$ iterations over a fixed working set. Let $r_{\text{disk}}$ be the time to load that working set from disk, $r_{\text{mem}}$ the time to touch it when it already lives in memory, and $c$ the useful compute per iteration. Both engines pay one cold read $r_{\text{disk}}$ to bring the data in the first time. After that, the MapReduce-style job pays $r_{\text{disk}}$ again at every subsequent iteration, while the Spark-style job pays only $r_{\text{mem}}$. The two total times are

$$T_{\text{MR}} = r_{\text{disk}} + (T-1)\,(r_{\text{disk}} + c) + c, \qquad T_{\text{Spark}} = r_{\text{disk}} + (T-1)\,(r_{\text{mem}} + c) + c.$$

Because $r_{\text{disk}} \gg r_{\text{mem}}$, the speedup grows with the number of iterations and with how much the disk read dominates the compute. The script below models exactly this, running a small real per-iteration loop so the timings are concrete rather than asserted, and reports the modeled wall-clock for each engine, their ratio, and the fraction of time each one wastes outside of useful compute.

import time

# Model an iterative job over a fixed working set. Each iteration does the same
# fixed amount of CPU "compute"; the only difference is where the working set lives
# at the START of each iteration.
#
#   MapReduce style: every stage materializes its output to disk, so the NEXT
#   iteration must re-read the whole working set back from disk.
#   Spark style:     the working set is cached in memory after the first read,
#   so every later iteration skips the disk read entirely.

N_ROWS = 2_000_000          # rows in the working set
ITERS = 10                  # iterations of an iterative algorithm (PageRank, k-means, GD)
DISK_READ_S_PER_ITER = 0.90 # cost to re-load the working set from disk, one iteration
MEM_READ_S_PER_ITER = 0.03  # cost to touch the same set already resident in memory
COMPUTE_S_PER_ITER = 0.20   # the actual useful work per iteration (identical for both)

def run(reload_cost):
    # Simulate one iterative job; return modeled wall-clock seconds.
    total = 0.0
    workset = [i & 0xFF for i in range(N_ROWS)]   # the working set, built once (first read)
    first = True
    for _ in range(ITERS):
        if first:
            total += DISK_READ_S_PER_ITER          # both pay one initial cold read
            first = False
        else:
            total += reload_cost                   # MapReduce re-reads; Spark hits memory
        s = 0
        for v in workset[:50_000]:                 # a small real loop so timing is non-trivial
            s += v
        total += COMPUTE_S_PER_ITER                # the same per-iteration useful work
    return total

mapreduce_s = run(DISK_READ_S_PER_ITER)   # re-materialize to disk every round
spark_s = run(MEM_READ_S_PER_ITER)        # cache once, stay in memory

print(f"iterations                : {ITERS}")
print(f"working set rows          : {N_ROWS:,}")
print(f"MapReduce (disk each iter): {mapreduce_s:6.2f} modeled seconds")
print(f"Spark (cached in memory)  : {spark_s:6.2f} modeled seconds")
print(f"speedup                   : {mapreduce_s / spark_s:5.2f}x")
Code 7.1.1: A pure-Python model (no Spark, no Java) of one iterative job run two ways. The only line that differs between the engines is reload_cost: MapReduce pays the full disk read every round, Spark pays the in-memory cost after the first read.
iterations                : 10
working set rows          : 2,000,000
MapReduce (disk each iter):  11.00 modeled seconds
Spark (cached in memory)  :   3.17 modeled seconds
speedup                   :  3.47x
Output 7.1.1: Over ten iterations the modeled Spark job finishes about 3.5 times faster, and the gap widens with every additional iteration because MapReduce pays the disk read $T$ times while Spark pays it once. Real iterative machine learning runs hundreds of iterations, where the same arithmetic yields the order-of-magnitude speedups Spark reported in practice.

The numbers are modeled, not measured on a cluster, but the structure is exactly right: the only difference between the two runs is whether the working set is re-read from disk each iteration or kept in memory, and that single difference produces the speedup. Notice what did not change. Both engines perform identical per-iteration compute and reach the identical answer; Spark is not approximating anything, just as the data-parallel gradient in Section 1.1 was exact. The win is entirely in the data movement that surrounds the compute, which is why this section sits at the head of a chapter about a faster engine and not a different one.

Fun Note: Sorting a Petabyte to Make the Point

In 2014 Spark won the Gray Sort benchmark by sorting 100 terabytes in 23 minutes on 206 machines, beating the previous MapReduce record of 72 minutes on roughly 2,100 machines, and then sorted a full petabyte for good measure. Sorting is a single shuffle-heavy pass, the case least favorable to Spark's in-memory advantage, and it still won decisively. The lesson the benchmark advertised was not that Spark cheats on iteration; it was that a well-engineered DAG executor beats a chain of separate jobs even when there is only one stage to run.

4. The Payoff for AI: Iteration and One Unified Engine Intermediate

Two consequences of these design moves matter most for the AI practitioner. The first is speed on exactly the workloads AI runs. Iterative optimization, graph algorithms such as PageRank, and multi-stage feature engineering all reuse the same working set many times, and all of them see the order-of-magnitude improvement that Output 7.1.1 sketches in miniature. A feature pipeline that joins, aggregates, and encodes a dataset through a dozen stages no longer writes and re-reads the data a dozen times; it streams through the whole graph with intermediates held in memory. The classical distributed machine learning of Chapter 12 is built directly on this capability.

The second consequence is unification. Because resilient distributed datasets and the DAG executor are general, Spark grew a family of libraries that share one engine and one data abstraction: SQL and DataFrames for structured queries, structured streaming for unbounded data, MLlib for machine learning, and GraphX for graphs. A practitioner can read raw logs with Spark SQL, engineer features with DataFrame operations, train a model with MLlib, and stream fresh data through the same cluster, without stitching together four separate systems with four separate data formats. That unified engine is the reason Spark became the default substrate for data preparation in machine learning, and it is why streaming gets its own treatment in Chapter 9 rather than living as an afterthought.

Practical Example: The Nightly Feature Pipeline That Stopped Missing Its Window

Who: A data engineer on the recommendations team at a streaming-media company.

Situation: A nightly pipeline turned a day of raw clickstream logs into the feature tables that the next morning's model training consumed.

Problem: Built as a chain of nine MapReduce jobs, the pipeline wrote and re-read the full dataset to the distributed file system between every stage, and at growing data volumes it ran past its 6 a.m. deadline, delaying training.

Dilemma: Add more machines to the MapReduce chain, which sped up each stage but did nothing about the eight inter-stage disk round-trips that dominated the wall-clock, or rewrite the pipeline on a different engine.

Decision: They rewrote it as a single Spark job, because the binding cost was the repeated materialization between stages, exactly the cost Spark removes, not the per-stage compute that more machines would address.

How: The nine stages became one DAG of DataFrame transformations with the shared intermediate cached in memory; Spark fused adjacent operations and materialized only the final feature tables to disk.

Result: The pipeline went from roughly five hours to under one, comfortably inside the window, on the same cluster, and the feature tables were bit-for-bit identical to the MapReduce output.

Lesson: When a pipeline's cost lives in the round-trips between stages, the fix is an engine that keeps intermediates in memory, not more workers re-reading the same disk.

Library Shortcut: The Same Iterative Job in PySpark

Code 7.1.1 modeled the cost difference by hand. In real PySpark you express the iterative job as transformations on a DataFrame or resilient distributed dataset and ask Spark to hold the reused working set in memory with a single cache() call; the engine then keeps it resident across iterations instead of re-reading from disk. The snippet below is illustrative and is not run here, because it requires a Java virtual machine and a Spark runtime that the from-scratch model deliberately avoids:

# Illustrative PySpark; needs a JVM + Spark runtime, not run in this section.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative").getOrCreate()
points = spark.read.parquet("s3://bucket/working_set")  # read from disk ONCE
points.cache()                                          # keep it resident in memory

centroids = init_centroids(points)
for _ in range(100):                                    # 100 iterations of k-means
    # every pass reads `points` from MEMORY, not disk, because of cache()
    centroids = (points
                 .withColumn("c", nearest(points, centroids))
                 .groupBy("c").mean()
                 .collect())
Code 7.1.2: The work modeled in Code 7.1.1, expressed in PySpark. One cache() call replaces the entire MapReduce pattern of writing and re-reading intermediates; Spark handles partitioning, the in-memory store, and lineage-based recovery, which Section 7.2 unpacks. PySpark itself is the subject of Section 7.8.

5. What This Chapter Builds Beginner

The rest of Chapter 7 turns the three design moves of Section 2 into working knowledge, in the order the ideas depend on one another. Table 7.1.1 is the map. Resilient distributed datasets come first because everything else is built on them; DataFrames and Spark SQL add a structured, optimizable layer on top; lazy evaluation explains why Spark waits until an action before running anything and how the DAG is planned; transformations and actions, partitioning and caching, and joins and shuffles are the day-to-day vocabulary of writing efficient Spark; and the chapter closes with PySpark for AI workloads and the performance tuning that turns a correct job into a fast one.

Table 7.1.1: The road through Chapter 7. Each section develops one part of the in-memory, DAG-based, lineage-recovered model introduced here.
SectionWhat it addsWhy it comes when it does
7.2 Resilient Distributed DatasetsThe in-memory, lineage-recovered core abstractionEverything else is built on it
7.3 DataFrames and Spark SQLA structured, query-optimizable layerThe interface most AI code actually uses
7.4 Lazy Evaluation and DAG ExecutionWhy nothing runs until an action, and how the plan is builtExplains the optimizer that makes 7.2 and 7.3 fast
7.5 Transformations and ActionsThe operation vocabulary and the lazy/eager splitThe verbs you write once the model is clear
7.6 Partitioning and CachingControlling data placement and what stays in memoryWhere the speedups of this section are actually claimed
7.7 Joins, Shuffles, and Data SkewThe expensive cross-machine operations and their failure modesThe shuffle from Chapter 6 returns, with skew
7.8 PySpark for AI WorkloadsThe Python API and machine learning patternsHow an AI practitioner actually drives Spark
7.9 Spark Performance TuningTurning a correct job into a fast oneCloses the loop on the cost models of this section

One caution sets the honest scope. Keeping the working set in memory wins only when it fits, or mostly fits, in cluster memory and is reused enough times to amortize the first read. A genuinely single-pass job over data far larger than memory gains little from Spark's central trick, and Spark spills to disk gracefully when the data does not fit, which costs back some of the advantage. The skill, as always in this book, is matching the tool to the workload: Spark earns its keep on iterative and multi-stage pipelines, which is most of what AI data preparation consists of, and the partitioning and caching choices of Section 7.6 are where you tell it which working sets are worth keeping resident.

Research Frontier: Where the Spark Engine Is Going (2024 to 2026)

The in-memory DAG idea is being pushed hard in three directions that an AI practitioner will meet. First, vectorized native execution: Apache's Spark integration with Gluten and the Velox engine, alongside Databricks' Photon, replace the row-at-a-time Java virtual machine path with a columnar, SIMD-accelerated native one, reporting multi-fold query speedups without changing the DataFrame API. Second, decoupled clients: Spark Connect (stable since Spark 3.4 and central to Spark 4.0, released 2024 to 2025) splits the user's program from the cluster with a thin gRPC protocol, so a notebook or a service can drive a remote Spark engine without bundling the whole runtime. Third, the lakehouse: open table formats such as Apache Iceberg and Delta Lake give Spark transactional, time-traveling tables over object storage, and competing engines (Trino, DuckDB, Ray Data) now read the same tables, turning Spark from a monolith into one participant in a shared-storage ecosystem. The through-line is that the resilient, in-memory, DAG-scheduled model introduced here is being made faster and more modular, not replaced.

We now have the thesis of the chapter (keep the working set in memory, plan the whole pipeline as a graph, recover by lineage rather than re-reading) and a cost model that shows why it matters for the iterative workloads AI runs. The next section makes the abstraction concrete by building the resilient distributed dataset itself: what it is, how its partitions and lineage work, and why it can sit in memory and still survive a worker's death. That construction begins in Section 7.2.

Exercise 7.1.1: When Does Spark Actually Help? Conceptual

For each job, state whether Spark's in-memory model offers a large advantage over MapReduce, a small one, or essentially none, and justify it in one sentence using the cost model of Section 3: (a) a single pass that counts word frequencies over a 50-terabyte corpus that does not fit in cluster memory; (b) training logistic regression by gradient descent for 200 epochs over a dataset that fits in memory; (c) a graph PageRank run to convergence over about 80 iterations; (d) a one-time join of two tables, each read once, written once. Identify which property (iteration count, working-set reuse, fit in memory) decides each answer.

Exercise 7.1.2: Scale the Model to Real Iteration Counts Coding

Extend Code 7.1.1 to sweep the iteration count $T$ over the values $\{1, 5, 25, 100, 200\}$ and, for each, print the modeled MapReduce time, the modeled Spark time, and the speedup. Plot or tabulate the speedup against $T$ and confirm it grows toward a ceiling set by $r_{\text{disk}} / (r_{\text{mem}} + c)$. Then add a third "Spark, working set does not fit, spills to disk" mode whose per-iteration cost is halfway between $r_{\text{mem}}$ and $r_{\text{disk}}$, and discuss how spilling erodes the advantage and what that implies for the caching decisions of Section 7.6.

Exercise 7.1.3: Lineage Versus Replication for Recovery Analysis

MapReduce recovers a lost partition by reading a replicated on-disk copy; Spark recovers it by recomputing it from its recorded lineage. Argue, for a 30-iteration job, when each strategy is cheaper to recover from after a single worker failure, in terms of the cost to recompute a partition's lineage versus the cost to store and read replicas. Identify one situation where pure lineage recovery is expensive (a hint: a long chain with a wide shuffle in the middle) and explain why Spark offers checkpointing to disk as a deliberate escape hatch. Relate your answer to the fault-tolerance-by-re-execution principle introduced for MapReduce in Chapter 6.