"I hold layers nine through sixteen. Most of the time I am simply waiting for layers one through eight to finish having opinions."
A Pipeline Stage Between Forward and Backward
Pipeline parallelism splits a model along its depth: consecutive layers become stages that live on different devices, and a batch flows forward through the stages and backward again like work moving down an assembly line. Where tensor parallelism (the previous section) cuts each layer crosswise and pays an all-reduce inside every layer, pipeline parallelism cuts between layers and pays only a cheap point-to-point handoff at each stage boundary. That cheap communication is what lets a pipeline stretch across separate nodes. The price is a new failure mode all its own: if you feed the pipeline one batch at a time, only one stage works while the rest stand idle, an inefficiency called the bubble. This section shows where the bubble comes from, derives the formula that predicts it, and shows how micro-batching and smarter schedules shrink it until the assembly line runs nearly full.
A model too large for one device must be cut somewhere. Tensor parallelism, developed in Section 16.2, cuts each layer's weight matrices across devices and stitches the pieces back together with an all-reduce on every forward and backward pass. That all-reduce is frequent and bandwidth-hungry, so tensor parallelism wants the fattest interconnect you have and rarely crosses a node boundary. Pipeline parallelism makes the opposite cut. Instead of splitting a layer, it keeps each layer whole and assigns a contiguous block of layers to each device. The devices form stages in sequence: stage one holds the first block of layers, stage two the next block, and so on, with the last stage producing the loss. A device only ever talks to its immediate neighbor, handing forward the activations it computed and receiving back the gradients that flow the other way. Figure 16.3.1 lays out this depth-ordered relay across four devices.
1. Splitting a Model by Depth (Beginner)
The idea is mechanical. Number the layers of a network from input to output, choose $S-1$ cut points so the layers fall into $S$ contiguous groups, and place group $s$ on device $s$. A forward pass now runs as a relay: device one computes the activations of its layers and sends the final activation tensor to device two, which continues, and so on until the last device computes the loss. The backward pass runs the relay in reverse: the last device computes the gradient of the loss with respect to its inputs and sends that gradient back to the previous device, which continues backpropagation through its own layers, and so on to the first device. Each device holds only the parameters, optimizer state, and activations of its own stage, so a model whose parameters are $S$ times too large for one device now fits, $1/S$ of it per device.
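The cut itself can be sketched in a few lines. The helper name `partition_layers` and the even split are assumptions made for illustration; real systems often balance stages by compute or memory rather than by layer count.

```python
def partition_layers(num_layers, num_stages):
    """Split layers 1..num_layers into num_stages contiguous groups.

    Returns a list of (first_layer, last_layer) pairs, 1-indexed and
    inclusive. Group sizes differ by at most one layer, with earlier
    stages taking the extras. The even split is an illustrative
    assumption, not a rule from the text.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, first = [], 1
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append((first, first + size - 1))
        first += size
    return stages


# A hypothetical 32-layer model on 4 devices: S - 1 = 3 cut points.
for device, (lo, hi) in enumerate(partition_layers(32, 4), start=1):
    print(f"device {device}: layers {lo}-{hi}")
```

With these hypothetical sizes, device two ends up holding layers nine through sixteen, the stage that speaks in the epigraph.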
The communication between stages is the cheapest in this chapter. At each boundary a stage sends exactly one activation tensor forward (during the forward pass) and one gradient tensor backward (during the backward pass). These are point-to-point transfers, a send from one device matched by a recv on its neighbor, not a collective that touches every device. The volume is the size of one activation tensor, independent of how many parameters the stage holds. This is the structural reason pipeline parallelism, unlike tensor parallelism, tolerates a slower link and is the standard tool for crossing node boundaries, a trade-off we make precise in Section 5.
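A rough size comparison makes the point concrete. The sketch below assumes a transformer whose boundary activation has shape [micro-batch, sequence, hidden] in a 2-byte format, and uses the common approximation of about $12h^2$ weights per transformer layer; the shapes and both function names are hypothetical.

```python
def activation_bytes(micro_batch, seq_len, hidden, bytes_per_elem=2):
    """Bytes in one [micro_batch, seq_len, hidden] activation tensor."""
    return micro_batch * seq_len * hidden * bytes_per_elem


def stage_param_bytes(layers_in_stage, hidden, bytes_per_elem=2):
    """Rough weight bytes for a stage, assuming ~12 * hidden^2 per layer."""
    return layers_in_stage * 12 * hidden * hidden * bytes_per_elem


# Hypothetical shapes: micro-batch of 1, 4096 tokens, hidden size 8192.
handoff = activation_bytes(micro_batch=1, seq_len=4096, hidden=8192)

# The handoff does not depend on how many layers the stage holds.
for layers in (4, 8, 16):
    weights = stage_param_bytes(layers, hidden=8192)
    print(f"{layers:>2} layers/stage: weights {weights / 2**30:5.1f} GiB, "
          f"handoff {handoff / 2**20:.0f} MiB")
```

Under these assumptions the stage's weights grow from 6 to 24 GiB as layers are added, while the tensor crossing the boundary stays at 64 MiB.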
Cutting a model between layers means each device exchanges only a single activation or gradient tensor with one neighbor, the lightest communication of any model-parallel method. But a strictly sequential dependency comes with that cut: stage $s$ cannot start until stage $s-1$ has handed it something to work on. Run one batch at a time and most stages sit idle most of the time. The entire craft of pipeline parallelism is keeping that cheap-to-communicate structure while filling the idle time, and the lever for doing so is feeding many small micro-batches through the pipeline at once.
2. The Bubble Problem (Beginner)
Push a single batch through a four-stage pipeline and watch the timeline. At the first time step only stage one is busy; the other three have nothing to do because the data has not reached them yet. At the second step the batch has moved on to stage two, and stage one, with nothing new to process, falls idle alongside stages three and four. Only at the fourth step does the last stage get any work, just in time for the forward pass to end and the backward pass to retrace the same path in reverse. That fill-and-drain idle time, where stages wait with no work, is the pipeline bubble. With one batch and $S$ stages, exactly one stage is busy at any moment, so utilization is $1/S$: four stages running at one-quarter efficiency is a poor return on four devices.
The fix that defines modern pipeline parallelism is micro-batching, introduced by the GPipe system. Instead of pushing one batch of $B$ examples through the pipeline as a single indivisible unit, split it into $M$ micro-batches of $B/M$ examples each and feed them in one after another. While stage two works on micro-batch one, stage one can already start micro-batch two; while stage three handles micro-batch one, stage two handles micro-batch two and stage one micro-batch three. After a short fill, every stage is busy on a different micro-batch at once, and the bubble shrinks to the fill-and-drain at the very start and end. The gradients from all $M$ micro-batches are accumulated and applied as one optimizer step, so micro-batching changes the schedule, not the math: the update matches processing the full batch at once, up to floating-point rounding and provided no layer (batch normalization, for example) computes statistics across the batch. Figure 16.3.2 contrasts the two schedules on a four-stage pipeline.
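The claim that accumulation leaves the update unchanged can be checked on a toy problem. The sketch below assumes a one-parameter model $\hat{y} = wx$ with mean-squared-error loss and randomly generated data; all names and sizes are illustrative. The one detail that matters is that each micro-batch's gradient is scaled by the full batch size $B$, not by the micro-batch size.

```python
import math
import random


def grad_sum(w, xs, ys):
    """Sum over examples of d/dw (w*x - y)^2 for a one-parameter model."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


random.seed(0)
B, M, w = 32, 4, 0.5  # batch size, micro-batch count, current weight
xs = [random.uniform(-1, 1) for _ in range(B)]
ys = [random.uniform(-1, 1) for _ in range(B)]

# Full batch: gradient of the mean loss over all B examples.
full = grad_sum(w, xs, ys) / B

# Micro-batched: accumulate M partial gradients, each scaled by 1/B.
size = B // M
accum = 0.0
for m in range(M):
    lo = m * size
    accum += grad_sum(w, xs[lo:lo + size], ys[lo:lo + size]) / B

# Equal up to floating-point summation order.
print(math.isclose(full, accum, rel_tol=1e-12, abs_tol=1e-12))
```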
3. The Bubble Fraction, Exactly (Intermediate)
How much idle time remains, and how fast does it vanish as we add micro-batches? Count in units of one micro-batch's work on one stage. With $S$ stages and $M$ micro-batches, each stage does $M$ units of useful work in the forward pass, while the forward pass as a whole lasts $M + S - 1$ units, because the last stage cannot start until the first micro-batch has crossed the $S-1$ stages ahead of it. Every stage therefore idles for $S-1$ units, split between waiting for its first micro-batch to arrive and waiting after its last one has left. The backward pass mirrors this exactly. The useful work is therefore proportional to $M$ and the wasted bubble to $S-1$, giving the bubble fraction
$$\text{bubble fraction} = \frac{S - 1}{M + S - 1}, \qquad \text{utilization} = 1 - \frac{S - 1}{M + S - 1} = \frac{M}{M + S - 1}.$$

Read off the two limits that matter. When $M = 1$ (one batch, no micro-batching) the fraction is $(S-1)/S$, so a four-stage pipeline wastes three quarters of its devices, exactly the $1/S$ utilization of Section 2. As $M$ grows, the $S-1$ in the numerator stays fixed while the denominator grows, so the bubble fraction falls toward zero like $1/M$. The practical rule that follows is direct: choose the number of micro-batches $M$ to be several times the number of stages $S$. At $M = 4S$ the bubble is already under a fifth; at $M = 8S$ it is under a ninth (and under a tenth for pipelines of eight or fewer stages). The formula also exposes the cost of adding stages: more stages let you fit a larger model but enlarge the $S-1$ bubble, so depth and efficiency pull against each other, and $M$ is the knob that buys back the efficiency.
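The formula can also be turned around to ask how many micro-batches a target bubble requires. Rearranging $(S-1)/(M+S-1) \le f$ gives $M \ge (S-1)(1-f)/f$. The sketch below uses exact rational arithmetic so that boundary cases are not decided by rounding; the function names are illustrative, and "at most $f$" is the convention assumed here.

```python
import math
from fractions import Fraction


def bubble_fraction(S, M):
    """Exact bubble fraction (S-1)/(M+S-1) as a rational number."""
    return Fraction(S - 1, M + S - 1)


def min_microbatches(S, target):
    """Smallest M whose bubble fraction is at most `target`.

    Rearranging (S-1)/(M+S-1) <= f gives M >= (S-1)*(1-f)/f.
    """
    f = Fraction(str(target))  # "0.1" parses to exactly 1/10
    if not 0 < f < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    return max(1, math.ceil((S - 1) * (1 - f) / f))


S = 4
for target in (0.2, 0.1, 0.05):
    M = min_microbatches(S, target)
    print(f"S={S} target={target:.2f} -> M={M} "
          f"bubble={float(bubble_fraction(S, M)):.4f}")
```

For a four-stage pipeline this gives $M = 12$, $27$, and $57$: each halving of the target bubble roughly doubles the micro-batches required.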
The code below models a GPipe schedule by counting busy device-slots against total device-slots to measure utilization, and checks the measurement against the formula above. It then sweeps $M$ at a fixed $S = 4$ to show the bubble collapsing.
def gpipe_schedule(S, M):
    """Busy and total device-slots for a synchronous GPipe schedule.

    Each (stage, micro-batch) pair costs one forward unit and one backward
    unit of work, so the useful work is 2*S*M slot-units. The per-device
    timeline runs for the forward fill-drain plus the backward fill-drain,
    each of length M+S-1, so the makespan is 2*(M+S-1) units."""
    work_units = 2 * S * M          # one F and one B per stage per micro-batch
    makespan = 2 * (M + S - 1)      # length of each device's timeline
    total_slots = S * makespan      # device-slots offered over the whole run
    return work_units, total_slots  # busy, total


def bubble_fraction(S, M):
    return (S - 1) / (M + S - 1)


print(f"{'S':>3} {'M':>4} {'util_measured':>14} {'util_formula':>13} {'bubble':>9}")
S = 4
for M in (1, 2, 4, 8, 16, 32, 64):
    busy, total = gpipe_schedule(S, M)
    util = busy / total             # measured from the schedule
    bub = bubble_fraction(S, M)
    print(f"{S:>3} {M:>4} {util:>14.4f} {1.0 - bub:>13.4f} {bub:>9.4f}")
  S    M  util_measured  util_formula    bubble
  4    1         0.2500        0.2500    0.7500
  4    2         0.4000        0.4000    0.6000
  4    4         0.5714        0.5714    0.4286
  4    8         0.7273        0.7273    0.2727
  4   16         0.8421        0.8421    0.1579
  4   32         0.9143        0.9143    0.0857
  4   64         0.9552        0.9552    0.0448
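The counts above come from closed-form expressions, so it is worth cross-checking them by placing every unit of work on an explicit timeline. The sketch below is one way to do that, under stated assumptions: each forward and each backward costs one time unit, transfers are free, and all forwards finish before any backward begins. The function names are illustrative.

```python
def gpipe_timeline(S, M):
    """Return one string per stage: 'F', 'B', or '.' (idle) per time unit."""
    slots = {}
    for s in range(S):
        for m in range(M):
            # Micro-batch m reaches stage s after s hops.
            slots[(s, s + m)] = "F"
    backward_start = M + S - 1  # first slot after the last stage's final forward
    for s in range(S):
        for m in range(M):
            # Gradients travel from the last stage back toward the first.
            slots[(s, backward_start + (S - 1 - s) + m)] = "B"
    assert len(slots) == 2 * S * M  # no two units landed on the same slot
    makespan = 1 + max(t for _, t in slots)
    return ["".join(slots.get((s, t), ".") for t in range(makespan))
            for s in range(S)]


def measured_utilization(rows):
    """Busy slots divided by all slots in the drawn timeline."""
    busy = sum(len(row) - row.count(".") for row in rows)
    return busy / (len(rows) * len(rows[0]))


rows = gpipe_timeline(S=4, M=8)
for stage, row in enumerate(rows, start=1):
    print(f"stage {stage}: {row}")
print(f"utilization = {measured_utilization(rows):.4f}")
```

Each row shows the dots of the bubble directly: the first stage idles in the middle of its timeline, the last stage at the two ends. For $S = 4$, $M = 8$ the counted utilization is $0.7273$, matching the table.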
The bubble is the same problem a factory faces when it starts a new product line. Before the line fills, the workers downstream stand around waiting for the first piece to reach them; after the last piece passes, the upstream workers have nothing left to do. The cure is identical too: keep many items in flight so every station always has something on the belt. Pipeline parallelism is a 1913 idea wearing a 2019 paper's clothes, and the bubble fraction $(S-1)/(M+S-1)$ is just the ratio of ramp-up-and-down time to the length of the production run.
4. Better Schedules: 1F1B and Interleaving (Advanced)
GPipe shrinks the bubble but pays for it in memory. Because it runs all $M$ forward passes before any backward pass begins, every stage must keep the activations of all $M$ in-flight micro-batches alive until their gradients are computed, and peak activation memory grows with $M$, the very knob you wanted to turn up. The one-forward-one-backward schedule, known as 1F1B and used by PipeDream and Megatron-LM, breaks this tension. After a short warm-up that fills the pipeline, each stage alternates: do one forward for a new micro-batch, then immediately do one backward for the oldest micro-batch whose gradient is ready. A micro-batch's activations are freed as soon as its backward pass runs, so at steady state a stage holds activations for only about $S$ micro-batches, not all $M$. The bubble fraction is the same $(S-1)/(M+S-1)$, but the peak memory no longer scales with $M$, which is what lets you raise $M$ high enough to make the bubble negligible.
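The memory difference shows up in the order of operations alone. The sketch below lists the forward (F) and backward (B) operations one stage executes under each schedule and tracks how many micro-batches have run forward but not yet backward. It assumes the usual non-interleaved 1F1B convention of $S - 1 - s$ warm-up forwards on 0-indexed stage $s$, which the text does not spell out, and it ignores timing and idle gaps entirely; the function names are illustrative.

```python
def gpipe_ops(M):
    """GPipe order on any stage: all forwards, then all backwards."""
    return ["F"] * M + ["B"] * M


def one_f_one_b_ops(S, M, stage):
    """1F1B order on one 0-indexed stage: warm-up, steady state, drain."""
    warmup = min(S - 1 - stage, M)  # forwards issued before the first backward
    steady = M - warmup             # alternating forward/backward pairs
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_live(ops):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak


S, M = 4, 8
print(f"GPipe  any stage: {''.join(gpipe_ops(M))}  peak={peak_live(gpipe_ops(M))}")
for stage in range(S):
    ops = one_f_one_b_ops(S, M, stage)
    print(f"1F1B   stage {stage + 1}:   {''.join(ops)}  peak={peak_live(ops)}")
```

Under these assumptions the first stage peaks at $S$ live micro-batches and the last stage at one, regardless of $M$, while GPipe's peak is $M$ on every stage.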
Interleaved (virtual-stage) scheduling, also from the Megatron-LM line, attacks the $S-1$ in the numerator. Instead of giving each device one contiguous block of layers, give it several smaller, non-contiguous chunks, called virtual stages, scattered through the model's depth. With $v$ virtual stages per device the fill-and-drain shrinks by a factor of $v$, so the bubble fraction becomes roughly $\frac{1}{v} \cdot \frac{S-1}{M+S-1}$. The cost is more frequent, smaller point-to-point transfers, since a micro-batch now visits each device $v$ times per pass instead of once. Modern large-model training stacks combine 1F1B for memory with interleaving for a smaller bubble, and then compose the whole pipeline with data and tensor parallelism, the subject of this chapter's later sections on 3D parallelism.
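A quick calculation shows how much interleaving buys. The sketch below takes the premise above at face value: the fill-and-drain shrinks to $(S-1)/v$ units while the useful work stays at $M$ units per device. Dividing idle time by total time then gives $(S-1)/(vM + S - 1)$, which the "roughly" form above approximates when $M$ is much larger than $S$. The function names and the example sizes are illustrative.

```python
def interleaved_bubble_fraction(S, M, v=1):
    """Idle share when fill-and-drain costs (S-1)/v units against M of work."""
    idle = (S - 1) / v
    return idle / (M + idle)


def interleaved_bubble_approx(S, M, v=1):
    """The rough form: the plain bubble fraction scaled by 1/v."""
    return (1 / v) * (S - 1) / (M + S - 1)


S, M = 8, 32  # hypothetical pipeline
for v in (1, 2, 4):
    exact = interleaved_bubble_fraction(S, M, v)
    rough = interleaved_bubble_approx(S, M, v)
    print(f"v={v}: bubble={exact:.4f} (rough form {rough:.4f})")
```

With $v = 1$ both forms reduce to the ordinary bubble fraction; for larger $v$ the rough form slightly understates the bubble because it leaves the denominator unchanged.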
Code 16.3.1 only simulates the timeline; writing a correct 1F1B schedule by hand, with its send and recv ordering, warm-up, and gradient accumulation, is hundreds of lines and a notorious source of deadlocks. PyTorch's torch.distributed.pipelining package takes a model split into stages and a chosen schedule and runs it for you:
# Run with: torchrun --nproc_per_node=4 thisfile.py
# Assumes stage_module, rank, dev, loss_fn, x, and y are already defined on each rank.
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe, Schedule1F1B

stage = PipelineStage(stage_module, stage_index=rank, num_stages=4, device=dev)
schedule = Schedule1F1B(stage, n_microbatches=32, loss_fn=loss_fn)  # or ScheduleGPipe

# One call runs warm-up, steady-state 1F1B, drain, and gradient accumulation:
if rank == 0:
    schedule.step(x)  # first stage feeds the input micro-batches
elif rank == 3:
    losses = []
    schedule.step(target=y, losses=losses)  # last stage gets targets, fills losses
else:
    schedule.step()  # middle stages only relay activations and gradients
Swapping ScheduleGPipe for Schedule1F1B changes the entire execution plan, memory profile and all, with no other edits. The library handles micro-batch splitting, the send/recv choreography, and accumulation that Code 16.3.1 only counts, collapsing hundreds of lines of scheduling logic into two.

Who: An ML systems engineer at a startup training a 30-billion-parameter language model.
Situation: The model needed roughly 480 GB for parameters, gradients, and optimizer state, more than the 8 GPUs (640 GB total) of a single node could hold once activations were counted, and the team had two such nodes connected by a 100 Gb/s Ethernet link.
Problem: Tensor parallelism alone could not span the two nodes; its per-layer all-reduce over the slow inter-node link would have dominated the step time and crushed throughput.
Dilemma: Keep everything on one node and shrink the model to fit, accepting a weaker result, or span both nodes with a parallelism whose communication survives a 100 Gb/s link.
Decision: They used tensor parallelism within each node (over the fast NVLink) and pipeline parallelism across the two nodes, so the only inter-node traffic was the single activation tensor at the one pipeline stage boundary that crossed nodes.
How: Two pipeline stages, one per node; inside each stage, 8-way tensor parallelism; micro-batches set to $M = 16$ so the bubble fraction $(2-1)/(16+2-1) \approx 0.059$ stayed under six percent, and a 1F1B schedule kept activation memory flat.
Result: The model trained at about 94 percent pipeline utilization, the slow Ethernet link carried only activations and never a per-layer all-reduce, and the run fit comfortably across the two nodes.
Lesson: Use the cheap-to-communicate cut where the link is slow. Pipeline parallelism crosses the node boundary; tensor parallelism stays inside it where bandwidth is plentiful.
5. Pipeline versus Tensor Parallelism (Intermediate)
The two model-parallel methods of this chapter cut the model along different axes and therefore communicate very differently, and the practical example above already hints at how they compose. Table 16.3.1 lays the trade-offs side by side. The governing idea is simple: tensor parallelism communicates a lot but only over the fastest links, so it belongs inside a node; pipeline parallelism communicates little but introduces a bubble, so it belongs across nodes where bandwidth is scarce. The communication-cost reasoning behind this split is the alpha-beta model from Chapter 3, and the collectives that tensor parallelism leans on are the all-reduce family built in Chapter 4.
| Property | Pipeline (inter-layer) | Tensor (intra-layer) |
|---|---|---|
| What is cut | The model's depth, into contiguous layer stages | Each layer's weight matrices, crosswise |
| Communication | One point-to-point send/recv per stage boundary | An all-reduce inside every layer, each pass |
| Volume per step | One activation tensor per boundary | Grows with hidden size, every layer |
| Tolerates slow links | Yes; standard for crossing nodes | No; wants NVLink-class bandwidth, stays in a node |
| Main inefficiency | The pipeline bubble (fixable with large $M$) | All-reduce latency on the critical path |
| Memory note | 1F1B keeps activation memory flat in $M$ | Splits parameter and activation memory per layer |
Neither method replaces data parallelism, the subject of Chapter 15; they extend it. A real large-model run typically wraps data parallelism around the outside (replicating the whole pipeline-and-tensor group across more devices and all-reducing gradients between replicas), giving the three-dimensional parallelism that the rest of this chapter assembles. Pipeline parallelism contributes the dimension that scales across the cheap links, and the bubble fraction you derived here is the number that tells you how many micro-batches that dimension needs to stay efficient. Section 16.4 turns to a different way of fitting a model that is too large: sharding the optimizer state and parameters of a data-parallel job rather than partitioning the model graph.
The bubble fraction $(S-1)/(M+S-1)$ has driven a steady stream of schedule innovations. Zero-bubble pipeline parallelism (Qi et al., 2024) splits the backward pass into its input-gradient and weight-gradient halves and reorders the weight-gradient work to fill the drain, reaching a near-zero bubble without raising $M$, and its successor work pushes peak activation memory down further still. DeepSeek-V3's training stack (2024 to 2025) introduced the DualPipe schedule, which runs forward and backward computation streams in both directions at once and overlaps the inter-stage communication so completely that the pipeline bubble nearly vanishes on its hardware. A parallel line studies how pipeline depth interacts with the optimizer when micro-batches grow very large, since the bubble-shrinking trick of raising $M$ eventually collides with the large-batch generalization limits studied in Chapter 3. The throughline: the schedule, not just the cut, is now a first-class design surface.
A team runs an 8-stage pipeline. (a) Using the bubble fraction $(S-1)/(M+S-1)$, how many micro-batches $M$ are needed to push the bubble below 10 percent? (b) They later split the model into 16 stages to fit a larger version. Holding the same $M$ you found in (a), what is the new bubble fraction, and what value of $M$ restores the 10 percent target? (c) Explain in one or two sentences why doubling the stage count more than doubles the micro-batches needed to hold a fixed efficiency.
Extend Code 16.3.1 so it also reports peak activation memory in units of "micro-batches held alive at once." For the GPipe schedule, this peak is $M$ (all forwards precede all backwards). For a 1F1B schedule, model it as approximately $S$ at steady state. Print both alongside utilization for $S = 4$ and $M \in \{4, 8, 16, 32\}$, and confirm that GPipe's memory grows with $M$ while 1F1B's stays flat. Explain in a comment why 1F1B can therefore use a much larger $M$ than GPipe on the same device.
You have two nodes joined by a 100 Gb/s link; inside each node the GPUs share a 600 GB/s NVLink fabric. A transformer layer's per-pass tensor-parallel all-reduce moves about 200 MB, while one pipeline stage boundary hands off a single activation tensor of about 8 MB. Estimate the inter-node transfer time for each option using only bandwidth (ignore latency). Argue from these two numbers which method should cross the node boundary and which should stay inside a node, and connect your answer to the alpha-beta cost model of Chapter 3.