Section 16.10: Choosing and Tuning a Parallelism Strategy

"They gave me four ways to split a model and told me to pick the right one. I picked all four, in the wrong order, and spent a week watching GPUs wait on each other."
A Shard That Believes It Is the Whole Model

Big Picture

There is no single best way to parallelize a model; there is a best way for a specific model on a specific cluster, and you find it by composing a small number of axes in a fixed order, each matched to the interconnect it can afford. This chapter built four ways to cross the memory wall: tensor parallelism splits a layer, pipeline parallelism splits the depth, sharded data parallelism splits the optimizer state, and activation checkpointing trades compute to recover memory. This closing section turns those four tools into one decision procedure. Start with the cheapest option that works, add an axis only when a concrete ceiling forces it, place each axis on the fastest link it needs, and measure Model FLOPs Utilization rather than guessing. The result is a strategy you can defend with numbers instead of folklore.

By now you can implement every form of parallelism in this chapter in isolation. The harder skill, and the one that separates a working training run from a wasteful one, is deciding which forms to use and in what combination. A 1.3-billion-parameter model on eight GPUs and a 175-billion-parameter model on hundreds of GPUs are not the same problem with a bigger number; they sit in different regions of a design space, and the right answer in one region is an expensive mistake in the other. The previous section (Section 16.9) showed how data, tensor, pipeline, and expert parallelism compose into 3D and 4D configurations. Now we build the procedure that tells you which configuration to reach for, and why each step is forced rather than chosen for elegance.

We proceed in three moves. First, a decision tree that walks from "the model fits" to "frontier scale," adding one axis at each branch. Second, the guiding principle that makes the tree correct: every axis must live on an interconnect fast enough to hide its traffic, the same alpha-beta reasoning introduced in Section 4.9. Third, the metric that tells you whether your choice actually worked, Model FLOPs Utilization, which ties back to the evaluation discipline of Section 5.5. The section, and the chapter, close with a runnable decision function, a key takeaway, and project ideas.

1. The Decision Tree Beginner

The framework is a sequence of yes-or-no questions, each answered by a concrete resource limit. You stop at the first configuration that fits, because every axis you add costs communication, complexity, and a slice of efficiency. Figure 16.10.1 draws the whole tree; the prose below walks it branch by branch.

Figure 16.10.1: The parallelism decision tree. Green boxes are terminal recommendations; orange boxes are axes you add to the running configuration; the purple box is applied at every level. You descend the central spine only as far as a real ceiling forces you, and each rightward branch places an axis on the interconnect it can afford. The runnable function in Code 16.10.1 implements exactly this spine.

The first question is the cheapest: does one full training replica, meaning the weights plus gradients plus optimizer state, fit in one accelerator's memory? If yes, you are done before you start the chapter: use plain data parallelism from Chapter 15, replicate the model, and synchronize gradients with all-reduce. Nothing in this chapter is needed, and reaching for tensor or pipeline parallelism here would only slow you down. The mixed-precision rule of thumb is roughly $16$ bytes per parameter for the full replica (2 bytes of bf16 weights, 2 of gradients, and about 12 of fp32 optimizer state: a master copy of the weights plus the two Adam moments, 4 bytes each), so a 1.3-billion-parameter model needs about 21 GB, comfortable on a 40 GB accelerator.

If the replica does not fit, the next move is the one with the best simplicity-to-payoff ratio: sharded data parallelism (FSDP or ZeRO-3, from Sections 16.4 and 16.5). It keeps the single-program mental model of data parallelism but partitions the weights, gradients, and optimizer state across all workers, gathering each layer's parameters only for the moment they are used. Per-GPU memory now scales as roughly $\frac{16 P}{N}$ bytes for $P$ parameters across $N$ GPUs, so a 13-billion-parameter model that needs about 208 GB as one replica fits in about 26 GB per GPU across eight workers. Sharded data parallelism is the default first answer whenever the model does not fit, because it asks the least of you and degrades gracefully.
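The two estimates above are easy to check in a few lines. The sketch below assumes the 16-bytes-per-parameter rule of thumb from the text and deliberately ignores activations, communication buffers, and fragmentation; the function names are illustrative and do not come from any library.

```python
# Rule of thumb from the text: bf16 weights + bf16 grads + fp32 optimizer state.
BYTES_PER_PARAM = 2 + 2 + 12

def replica_bytes(num_params: float) -> float:
    """Model-state bytes for one unsharded training replica (no activations)."""
    return num_params * BYTES_PER_PARAM

def sharded_bytes_per_gpu(num_params: float, num_gpus: int) -> float:
    """Model-state bytes per GPU when the state is split evenly over num_gpus."""
    return replica_bytes(num_params) / num_gpus

GB = 1e9  # decimal gigabytes, matching the prose
print(f"1.3B replica        : {replica_bytes(1.3e9) / GB:.1f} GB")
print(f"13B replica         : {replica_bytes(13e9) / GB:.1f} GB")
print(f"13B sharded, 8 GPUs : {sharded_bytes_per_gpu(13e9, 8) / GB:.1f} GB per GPU")
```

The numbers are in decimal GB; a tool that reports GiB will show values about 7% smaller for the same byte count.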

You add tensor parallelism (Section 16.2) when a single layer is too large to live on one GPU even after sharding, or when activation memory, not parameters, is the binding constraint and you want to cut it by splitting each matmul. Tensor parallelism communicates inside every forward and backward pass, an all-reduce per layer, so it must stay within a single high-bandwidth island; we return to why in the next subsection. You add pipeline parallelism (Section 16.3) when the model is so deep that it spans many nodes, staging consecutive layer groups across nodes and passing only activations between stages, which is cheap enough to cross the slower inter-node fabric. You add sequence or context parallelism (Section 16.7) when the context is long enough that activations, which grow with sequence length, dominate memory regardless of parameter count. And through every level you apply activation checkpointing (Section 16.8), recomputing activations in the backward pass to trade a little compute for a large memory saving. At frontier scale all of these compose into the 3D and 4D configurations of Section 16.9.

Key Insight: Add an Axis Only When a Ceiling Forces It

Every parallelism axis pays a tax in communication and complexity, so the correct strategy is the least parallelism that fits, not the most. Walk the tree top to bottom and stop at the first configuration where the per-GPU state fits the memory budget. Add sharded data parallelism before tensor parallelism, tensor before pipeline, and pipeline before sequence, because that order adds the cheapest, most local communication first and reserves the expensive cross-node patterns for when nothing simpler suffices. A team that starts with 4D parallelism on a model that would have fit with FSDP alone has bought itself weeks of tuning to run slower than the simple answer.

2. Match Each Axis to Its Interconnect Intermediate

The decision tree's branch order is not arbitrary; it follows from the bandwidth each axis demands. A cluster is a hierarchy of links: inside one node, accelerators talk over NVLink or a similar fabric at hundreds of gigabytes per second; between nodes, they talk over InfiniBand or Ethernet at a fraction of that, with higher latency. The alpha-beta cost model of Section 4.9 makes the consequence precise: a collective of message size $m$ over a link with latency $\alpha$ and per-byte cost $\beta$ takes about $\alpha + \beta m$ per step, and the axis that fires most often must sit on the link with the smallest $\beta$.

Tensor parallelism issues an all-reduce on the order of $\mathcal{O}(b \cdot s \cdot h)$ activation bytes for every single layer, both directions, where $b$ is batch, $s$ is sequence length, and $h$ is hidden size. That traffic is enormous and constant, so tensor parallelism only pays off when $\beta$ is tiny, which means it must stay inside one NVLink island; stretched across nodes it stalls every layer waiting on the slow fabric. Pipeline parallelism, by contrast, sends only a stage's boundary activations once per micro-batch, a far smaller and less frequent message, so it tolerates the inter-node link well. Sharded data parallelism overlaps its all-gather and reduce-scatter with computation, hiding much of the cost, and ordinary data-parallel gradient all-reduce happens once per step and can also cross nodes. The rule that falls out is the one practitioners encode in their launch configs:

Table 16.10.1: Which interconnect each parallelism axis can afford, and how often it communicates. Place the most frequent, highest-volume traffic on the fastest link.

Axis	Communication pattern	Frequency	Belongs on
Tensor (16.2)	all-reduce of activations	every layer, both passes	intra-node (NVLink)
Sequence / context (16.7)	all-gather / all-to-all of keys and values	per attention block	intra-node preferred
Sharded data (16.4 / 16.5)	all-gather + reduce-scatter of shards	per layer, overlapped	intra- or inter-node
Pipeline (16.3)	point-to-point activations	per micro-batch boundary	inter-node tolerable
Data (Ch 15)	all-reduce of gradients	once per step	inter-node tolerable

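This is why the standard layout for a frontier run nests the axes by locality: tensor parallelism within a node, pipeline parallelism across a small number of nodes, and data or sharded-data parallelism across the outermost ring of replicas. Read top to bottom, Table 16.10.1 runs from the axis that needs the fastest link to the axes that tolerate the slowest. That is not the order in which the decision tree adds axes, because the two answer different questions: the tree says when to add an axis, as each ceiling is hit, and the table says where to place it once added. They are consistent for the same reason: the traffic that fires most often stays on the most local link.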
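A rough calculation makes the placement rule tangible. The sketch below applies the $\alpha + \beta m$ estimate to one tensor-parallel activation message on two links. The latencies and bandwidths are assumed, illustrative values rather than measurements, and the model treats the all-reduce as a single message of $b \cdot s \cdot h$ elements; a real ring all-reduce moves closer to twice that volume, so read the output as a ratio, not as a prediction.

```python
def collective_time(message_bytes: float, alpha_s: float, bandwidth_bytes_per_s: float) -> float:
    """Alpha-beta estimate: latency plus bytes times per-byte cost (beta = 1 / bandwidth)."""
    return alpha_s + message_bytes / bandwidth_bytes_per_s

def tp_message_bytes(batch: int, seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Size of the activation tensor that tensor parallelism all-reduces (bf16 by default)."""
    return batch * seq_len * hidden * bytes_per_elem

# Assumed link parameters, for illustration only.
INTRA_NODE = {"alpha_s": 5e-6, "bandwidth_bytes_per_s": 300e9}   # NVLink-class link
INTER_NODE = {"alpha_s": 20e-6, "bandwidth_bytes_per_s": 25e9}   # InfiniBand-class link

m = tp_message_bytes(batch=1, seq_len=4096, hidden=8192)
t_intra = collective_time(m, **INTRA_NODE)
t_inter = collective_time(m, **INTER_NODE)

print(f"message size : {m / 2**20:.0f} MiB")
print(f"intra-node   : {t_intra * 1e3:.3f} ms per message")
print(f"inter-node   : {t_inter * 1e3:.3f} ms per message")
print(f"slowdown     : {t_inter / t_intra:.1f}x")
```

Under these assumptions the same message costs roughly an order of magnitude more across nodes, and tensor parallelism pays that cost in every layer of both passes.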

Library Shortcut: Frameworks Take the Topology as a Few Integers

Hand-placing ranks onto NVLink islands and InfiniBand rails is exactly the bookkeeping that production frameworks absorb. In DeepSpeed or Megatron-LM you declare the 3D or 4D shape as a handful of integers and the library builds the process groups, assigns ranks to the right links, and schedules the collectives:

# Megatron-style launch: the parallel shape is just four numbers.
# 256 GPUs = TP(8, in-node) x PP(8, cross-node) x DP(4, outer ring)
#   --tensor-model-parallel-size 8     each layer split across one node's 8 GPUs (NVLink)
#   --pipeline-model-parallel-size 8   8 pipeline stages spanning nodes
#   --sequence-parallel                split activations along the sequence dim
#   --use-distributed-optimizer        ZeRO-style sharding of the outer data-parallel ring
#   --recompute-activations            activation checkpointing throughout
# Keep comments off the continued lines: text after a trailing backslash breaks the continuation.
torchrun --nproc_per_node=8 --nnodes=32 pretrain.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --sequence-parallel \
  --use-distributed-optimizer \
  --recompute-activations

Code 16.10.2: The entire 4D layout reduced to flags. What would be hundreds of lines of manual process-group construction and rank-to-link assignment collapses to five arguments; the framework derives the data-parallel size ($256 / (8 \times 8) = 4$), builds the NCCL groups, and overlaps the collectives with compute.
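The data-parallel size that the framework derives is plain integer arithmetic, and it is worth checking before you launch. The helper below is a hypothetical sketch, not part of Megatron-LM or DeepSpeed; its second check encodes this section's placement rule (keep each tensor-parallel group inside one node) rather than anything a framework enforces.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, gpus_per_node: int) -> int:
    """Return the data-parallel degree implied by a (TP, PP) choice, or raise if the shape is invalid."""
    if world_size % (tp * pp) != 0:
        raise ValueError(f"TP*PP = {tp * pp} does not divide world size {world_size}")
    # Placement rule from this section, not a framework requirement:
    # a tensor-parallel group should fit inside one node's fast interconnect.
    if tp > gpus_per_node or gpus_per_node % tp != 0:
        raise ValueError(f"TP = {tp} would straddle nodes of {gpus_per_node} GPUs")
    return world_size // (tp * pp)

# The 256-GPU layout from the launch command above.
print(data_parallel_size(world_size=256, tp=8, pp=8, gpus_per_node=8))  # prints 4
```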

3. Measure, Do Not Guess: Model FLOPs Utilization Intermediate

A strategy that fits in memory is not yet a good strategy; it might leave the accelerators idle most of the time, waiting on communication or pipeline bubbles. The single number that tells you whether your configuration is efficient is Model FLOPs Utilization (MFU), the fraction of the hardware's peak floating-point throughput that your training actually spends on useful model math. Following Section 5.5, if a step does $C$ model FLOPs of real work in wall-clock time $T$ on $N$ accelerators each rated at peak $F_{\text{peak}}$, then

$$\text{MFU} = \frac{C / T}{N \cdot F_{\text{peak}}}.$$

For a dense transformer the per-step work is well approximated by $C \approx 6 \, P \, D$, where $P$ is the parameter count and $D$ is the number of tokens in the step (the factor six counts about 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass, which performs two matmuls for every forward one). A well-tuned large-model run reaches an MFU in the rough range of 0.35 to 0.55; a poorly chosen parallelism layout can drop it below 0.15, which is equivalent to leaving more than five accelerators in six idle. MFU is the objective you tune the decision tree against: if it is low, the diagnosis is usually a pipeline bubble too large for the micro-batch count, a tensor-parallel group spilling across nodes, or communication that failed to overlap with compute. The metric turns "is this configuration good?" from an opinion into a measurement, exactly as the evaluation discipline of Part I insists.
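The formula translates directly into a helper you can call from a training loop. This is a minimal sketch under the $6PD$ approximation for a dense transformer. The run described in the example is hypothetical, and the peak figure of 312 TFLOP/s is the nominal dense bf16 rating of an A100, used here only as an assumed value for $F_{\text{peak}}$.

```python
def model_flops_per_step(num_params: float, tokens_per_step: float) -> float:
    """Dense-transformer estimate C = 6 * P * D (forward plus backward)."""
    return 6.0 * num_params * tokens_per_step

def mfu(num_params: float, tokens_per_step: float, step_time_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    achieved = model_flops_per_step(num_params, tokens_per_step) / step_time_s
    return achieved / (num_gpus * peak_flops_per_gpu)

# Hypothetical run: 13B parameters, 4M tokens per step, 40 s per step, 64 GPUs.
value = mfu(num_params=13e9, tokens_per_step=4e6, step_time_s=40.0,
            num_gpus=64, peak_flops_per_gpu=312e12)
print(f"MFU = {value:.3f}")
```

Only the step time is measured; everything else is known before the run starts, which is why MFU is cheap to log on every step.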

Thesis Thread: One Primitive, Placed Where It Can Afford to Run

Every axis in the decision tree is, underneath, a collective from Chapter 4: tensor parallelism is all-reduce, sharded data parallelism is reduce-scatter plus all-gather, pipeline parallelism is point-to-point, and the next chapter's expert parallelism (Chapter 17) will be all-to-all. Choosing a parallelism strategy is therefore not really about the model; it is about placing each collective on a link fast enough to hide it. The gradient all-reduce that Section 1.1 performed by hand has, by the end of this chapter, fanned out into a whole family of placed-and-scheduled collectives. Scale-out is the art of deciding which collective runs where.

The decision procedure is mechanical enough to write as a function. Code 16.10.1 implements the central spine of Figure 16.10.1: given a model size, its largest layer, the context length, and the cluster topology, it estimates per-GPU memory under sharding and recommends a strategy with a feasibility verdict. It is a planning aid, not a substitute for measuring MFU on real hardware, but it gets you to a sensible starting configuration in milliseconds instead of a day of trial and error.

def gib(x):
    """Convert bytes to GiB (2**30 bytes)."""
    return x / (1024 ** 3)

def recommend(name, params_b, layer_params_b, seq_len,
              gpus_per_node, num_nodes, mem_per_gpu_gib,
              bytes_per_param=2, optimizer_bytes_per_param=12):
    """Recommend a parallelism strategy and estimate feasibility.

    params_b        : total parameters (billions)
    layer_params_b  : parameters of the single largest layer (billions)
    seq_len         : training context length (tokens)
    gpus_per_node   : accelerators per node (one NVLink island)
    num_nodes       : nodes in the cluster
    mem_per_gpu_gib : memory per accelerator (GiB)
    """
    total_gpus = gpus_per_node * num_nodes
    P, Lp = params_b * 1e9, layer_params_b * 1e9
    budget = mem_per_gpu_gib * 0.85 * (1024 ** 3)        # ~15% runtime headroom
    # weights + grads (bytes_per_param each) + optimizer state: 2 + 2 + 12 = 16 B/param
    replica_state = P * (2 * bytes_per_param + optimizer_bytes_per_param)
    sharded_state = replica_state / total_gpus           # if split over all GPUs
    layer_weights = Lp * bytes_per_param

    # Q1: does one full replica fit on one GPU? -> plain data parallelism (Ch 15)
    if replica_state <= budget:
        return _report(name, "Data parallelism (DDP)", total_gpus, budget,
                       replica_state, sharded_state, True,
                       ["Full replica fits; replicate and all-reduce gradients."])

    axes = ["sharded-data (FSDP/ZeRO-3)"]                 # Q1 no -> shard first (16.4/16.5)
    notes = ["Replica exceeds one GPU; shard weights+grads+optimizer."]

    tp = 1
    if layer_weights > budget * 0.10:                    # Q2: heavy layer -> tensor (16.2)
        tp = gpus_per_node                               # keep TP inside the NVLink island
        axes.append(f"tensor (TP={tp}, intra-node/NVLink)")
        notes.append("Largest layer is heavy; split it on NVLink.")
    pp = 1
    if num_nodes >= 4 and params_b >= 70:                 # Q3: spans nodes -> pipeline (16.3)
        pp = min(num_nodes, 8)
        axes.append(f"pipeline (PP={pp}, inter-node)")
        notes.append("Model spans many nodes; stage the depth.")
    if seq_len >= 16384:                                  # Q4: long context -> sequence (16.7)
        axes.append("sequence/context parallel")
        notes.append("Long context inflates activations; split the sequence dim.")
    notes.append("Apply activation checkpointing throughout to fit.")

    # Feasible if the sharded state fits and the TP-split largest layer uses under half the budget.
    feasible = sharded_state <= budget and (layer_weights / tp) <= budget * 0.5
    if not feasible:
        notes.append("WARNING: sharded state or largest layer still exceeds budget; add GPUs or offload.")
    return _report(name, " + ".join(axes), total_gpus, budget,
                   replica_state, sharded_state, feasible, notes)

def _report(name, strategy, total_gpus, budget, replica, sharded, feasible, notes):
    """Print the plan and return (strategy, feasible) so callers can use the result."""
    print(f"=== {name} ===")
    print(f"  cluster GPUs    : {total_gpus}")
    print(f"  per-GPU budget  : {gib(budget):6.1f} GiB usable")
    print(f"  one replica     : {gib(replica):6.1f} GiB")
    print(f"  sharded per GPU : {gib(sharded):6.1f} GiB")
    print(f"  strategy        : {strategy}")
    print(f"  feasible        : {'YES' if feasible else 'NO (needs more GPUs/offload)'}")
    for n in notes:
        print(f"    - {n}")
    print()
    return strategy, feasible

# Four scenarios spanning the tree, from a model that fits to frontier 4D scale.
recommend("A. 1.3B model, 8xA100-40G",  1.3, 0.05, 2048,  8,  1, 40)
recommend("B. 13B model, 8xA100-80G",    13,  0.3, 4096,  8,  1, 80)
recommend("C. 70B model, 8 nodes",       70,  4.5, 8192,  8,  8, 80)
recommend("D. 175B model, 32 nodes",    175,  9.0, 32768, 8, 32, 80)

Code 16.10.1: A pure-Python implementation of the decision tree in Figure 16.10.1. Each if corresponds to one branch of the tree, and the order of the branches encodes the "least parallelism that fits" rule of the chapter.

=== A. 1.3B model, 8xA100-40G ===
  cluster GPUs    : 8
  per-GPU budget  :   34.0 GiB usable
  one replica     :   19.4 GiB
  sharded per GPU :    2.4 GiB
  strategy        : Data parallelism (DDP)
  feasible        : YES
    - Full replica fits; replicate and all-reduce gradients.

=== B. 13B model, 8xA100-80G ===
  cluster GPUs    : 8
  per-GPU budget  :   68.0 GiB usable
  one replica     :  193.7 GiB
  sharded per GPU :   24.2 GiB
  strategy        : sharded-data (FSDP/ZeRO-3)
  feasible        : YES
    - Replica exceeds one GPU; shard weights+grads+optimizer.
    - Apply activation checkpointing throughout to fit.

=== C. 70B model, 8 nodes ===
  cluster GPUs    : 64
  per-GPU budget  :   68.0 GiB usable
  one replica     : 1043.1 GiB
  sharded per GPU :   16.3 GiB
  strategy        : sharded-data (FSDP/ZeRO-3) + tensor (TP=8, intra-node/NVLink) + pipeline (PP=8, inter-node)
  feasible        : YES
    - Replica exceeds one GPU; shard weights+grads+optimizer.
    - Largest layer is heavy; split it on NVLink.
    - Model spans many nodes; stage the depth.
    - Apply activation checkpointing throughout to fit.

=== D. 175B model, 32 nodes ===
  cluster GPUs    : 256
  per-GPU budget  :   68.0 GiB usable
  one replica     : 2607.7 GiB
  sharded per GPU :   10.2 GiB
  strategy        : sharded-data (FSDP/ZeRO-3) + tensor (TP=8, intra-node/NVLink) + pipeline (PP=8, inter-node) + sequence/context parallel
  feasible        : YES
    - Replica exceeds one GPU; shard weights+grads+optimizer.
    - Largest layer is heavy; split it on NVLink.
    - Model spans many nodes; stage the depth.
    - Long context inflates activations; split the sequence dim.
    - Apply activation checkpointing throughout to fit.

Output 16.10.1: The four scenarios trace the decision tree end to end: a model that fits gets plain DDP, a 13B model needs only sharding, a 70B run composes sharded plus tensor plus pipeline (3D), and a 175B long-context run adds sequence parallelism (4D). The per-GPU memory estimates explain each escalation.

Notice how the recommendation escalates exactly with the binding ceiling: scenario A never leaves the first branch, B stops at sharding, C reaches 3D parallelism, and D the full 4D stack, each driven by a number the function computed rather than a preference. This is the framework in action; on real hardware you would then launch the recommended shape, measure MFU, and tune the micro-batch count and group sizes until the bubble shrinks and the collectives overlap.
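Tuning the micro-batch count has a simple first-order model behind it. The sketch below uses the standard idealized bubble estimate for a non-interleaved pipeline schedule, $(p - 1) / (m + p - 1)$ for $p$ stages and $m$ micro-batches, assuming every stage takes the same time per micro-batch. Interleaved schedules shrink the bubble further, so treat this as an upper-level guide rather than the exact form any particular framework implements.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idealized idle fraction of a non-interleaved pipeline with equal-time stages."""
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (micro_batches + stages - 1)

# An 8-stage pipeline, as in scenarios C and D, at increasing micro-batch counts.
for m in (8, 32, 128):
    print(f"PP=8, micro-batches={m:3d} -> bubble = {bubble_fraction(8, m):.1%}")
```

Since the bubble is pure idle time, it caps MFU at roughly one minus this fraction before any communication cost is counted, which is why raising the micro-batch count is usually the first knob to turn on a pipelined run.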

Practical Example: The 30B Run That Stopped Fighting Its Interconnect

Who: An ML platform engineer at a startup pretraining a 30-billion-parameter model on four nodes of eight 80 GB GPUs.

Situation: The first working configuration used tensor parallelism of size 32, spanning all four nodes, because that was the layout a tutorial happened to show.

Problem: Measured MFU sat at 0.13; profiling showed every layer's tensor-parallel all-reduce stalling on the inter-node InfiniBand link.

Dilemma: Keep the simple single-axis layout that at least ran, or restructure into a nested 3D shape that risked new bugs in process-group setup and pipeline scheduling.

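Decision: They restructured, confining tensor parallelism to size 8 inside each node (on NVLink) and using pipeline parallelism of size 4 across the nodes, exactly the placement Table 16.10.1 prescribes. Because 8 × 4 already accounts for all 32 GPUs, the data-parallel degree in this layout is 1; a sharded data-parallel outer ring would come into play only on a larger cluster.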

How: The change was four integers in the launch command (Code 16.10.2) plus enabling activation checkpointing; no model code changed.

Result: MFU rose from 0.13 to 0.41, more than tripling training throughput on the identical hardware, and the per-step time stopped being dominated by communication.

Lesson: The axes were never the problem; their placement was. Matching each axis to the interconnect it can afford, and measuring MFU to confirm it, turned a stalled run into an efficient one.

Fun Note: The Configuration That Looked Impressive

There is a recognizable failure mode where a team, proud of having built every axis, switches them all on for a model that would have fit with FSDP alone. The dashboard looks magnificent: tensor, pipeline, sequence, and data parallelism all lit up. The MFU sits at 0.11. The model trains slower than it would have on a single well-configured node, and the postmortem reads "we parallelized the parallelism." The decision tree exists to prevent precisely this kind of enthusiastic over-engineering.

4. Research Frontier and the Chapter in One Idea Advanced

Choosing a parallelism strategy by hand, even with a decision tree, is increasingly something the system does for you. The frontier is automation of the very search this section formalizes.

Research Frontier: Automated Parallelization (2024 to 2026)

The decision tree is a human-readable approximation of an optimization problem, and a growing line of work solves that problem directly. Compiler-style systems in the lineage of Alpa search the joint space of inter-operator (pipeline) and intra-operator (tensor) parallelism automatically, and PyTorch's native APIs have moved the same direction: FSDP2 and the DTensor-based device-mesh abstraction (2024 to 2025) let you express a multi-dimensional parallel layout declaratively and have the runtime place the collectives, while torch.distributed pipelining and async tensor parallelism push communication-computation overlap further. Auto-tuners that profile a few candidate shapes and pick the highest-MFU configuration are now standard in large-model training stacks, and recent work folds long-context sequence parallelism and activation-recomputation choices into the same search. The trajectory is clear: the engineer specifies the model and the cluster, and the system returns the placed, scheduled 4D layout. Understanding the tree by hand, as you now do, is what lets you trust, debug, and override those tools when their cost model disagrees with your measured MFU.

Step back and the whole chapter compresses to a single picture. A modern model hits a memory wall: the weights, gradients, optimizer state, and activations of one training replica no longer fit in one accelerator. There are exactly four ways to cross that wall, and you have now built all four. Tensor parallelism splits a layer across devices. Pipeline parallelism splits the depth into stages. Sharded data parallelism splits the optimizer state across the data-parallel ring. Activation checkpointing recovers memory by recomputing rather than storing. Each one is a collective from Chapter 4 placed on a link it can afford, and at frontier scale they compose into the 3D and 4D configurations of Section 16.9. The decision tree of this section is the procedure that picks the right combination, and MFU is the meter that tells you whether the choice was right.

Key Takeaway: The Memory Wall and the Four Ways Across It

When one training replica no longer fits on one accelerator, you cross the memory wall with four composable tools: tensor parallelism (split a layer, on NVLink), pipeline parallelism (split the depth, across nodes), sharded data parallelism (split the optimizer state, FSDP/ZeRO), and activation checkpointing (recompute instead of store), combined into 3D and 4D parallelism for frontier scale. Choose the combination by the decision tree: use the least parallelism that fits, add an axis only when a concrete ceiling forces it, place each axis on the interconnect it can afford (tensor intra-node, pipeline and data inter-node), and confirm the choice by measuring Model FLOPs Utilization rather than guessing.

Exercise 16.10.1: Read the Tree Conceptual

For each system, walk the decision tree in Figure 16.10.1 and name the strategy it lands on, justifying each branch you take or skip: (a) a 7-billion-parameter model fine-tuned on a single node of eight 80 GB GPUs at a 4096-token context; (b) a 70-billion-parameter model on four nodes where every individual layer fits on one GPU but the full replica does not; (c) a 30-billion-parameter model trained at a 128k-token context on two nodes. For (c), explain why long context can force an axis even when the parameter count alone would not.

Exercise 16.10.2: Search the Configuration Space Coding

Extend Code 16.10.1 so that, instead of returning one recommendation, it enumerates every valid $(\text{TP}, \text{PP}, \text{DP})$ factorization of the cluster's GPU count and estimates per-GPU memory for each. Add a crude MFU proxy that penalizes a tensor-parallel group that spills across nodes (set $\beta$ high when $\text{TP} > \text{gpus\_per\_node}$) and penalizes a pipeline with too few micro-batches relative to its stage count (a larger bubble). Rank the configurations and print the top three. Compare the winner to what the original decision tree recommended for scenario C, and discuss any disagreement.

Exercise 16.10.3: Where Does the Time Go? Analysis

Using the alpha-beta model of Section 4.9, estimate the per-layer tensor-parallel all-reduce time for a model with hidden size $h = 12288$, batch $b = 4$, and sequence length $s = 8192$ in bf16, first on an intra-node link with $\beta = 1/(300\,\text{GB/s})$ and then on an inter-node link with $\beta = 1/(25\,\text{GB/s})$. Given a layer compute time of about 2 milliseconds, compute the MFU ceiling each placement imposes, and use the two numbers to explain in one paragraph why Table 16.10.1 confines tensor parallelism to a single node.

Project Ideas

1. A parallelism autopilot. Turn the decision tree into a real planner. Given a model config (layer count, hidden size, parameter count, context length) and a cluster description (nodes, GPUs per node, HBM per GPU, intra- and inter-node bandwidths), search the full $(\text{TP}, \text{PP}, \text{DP})$ space, score each candidate with a memory check plus an alpha-beta communication estimate and a pipeline-bubble term, and emit the predicted-best launch command for DeepSpeed or Megatron. Validate the predictions against a published MFU table for a known model, and report how often the planner's top choice matches the configuration the original team actually used.

2. Measure the wall. On whatever GPUs you can access (even two), train a small transformer under each strategy in turn: plain DDP, FSDP, tensor parallelism, and pipeline parallelism, holding the model and global batch fixed. Instrument each run for MFU and peak memory, then plot the memory-versus-throughput trade-off curve. Confirm empirically that sharded data parallelism gives the best simplicity-to-payoff ratio and that tensor parallelism across the slow link collapses MFU, reproducing the claims of Table 16.10.1 on real hardware.

3. Stress-test the tree. Construct adversarial scenarios where the decision tree's greedy "least parallelism that fits" rule gives a suboptimal answer: for example, a case where adding tensor parallelism earlier than the tree suggests would shrink activation memory enough to raise the achievable batch size and net higher MFU. Quantify the gap with your planner from idea 1, and propose a refinement to the branch ordering that closes it without sacrificing the tree's simplicity.