Section 16.6: DeepSpeed and Megatron-LM

"I have been sharded by ZeRO, sliced by tensor parallelism, and staged by the pipeline. Ask me who I am and I will hand you a config file."
A Shard That Believes It Is the Whole Model

Big Picture

The parallelism techniques of this chapter are rarely written by hand; they reach production through two frameworks, DeepSpeed and Megatron-LM, that each own a different slice of the problem and are most often used together. Megatron-LM contributes the high-throughput tensor and pipeline parallelism that splits one transformer across the fast links inside and between a few nodes. DeepSpeed contributes ZeRO, which shards the optimizer state, gradients, and parameters across the data-parallel replicas, plus offload paths that push state down the memory hierarchy to CPU and NVMe when even sharding is not enough. Neither framework is the whole answer. The skill this section builds is reading a model and a cluster, deciding how much of each kind of parallelism to apply, and understanding that this choice, not the code, is the hard and consequential part. That choice is where the next several sections, culminating in 3D parallelism, all converge.

The previous five sections built the components separately. Tensor parallelism (Section 16.2) split each layer's matrices across devices. Pipeline parallelism (Section 16.3) split the stack of layers into stages. ZeRO (Section 16.4) sharded the training state across data-parallel workers, and PyTorch FSDP (Section 16.5) packaged the ZeRO idea natively. Each on its own addresses one ceiling. A real frontier-scale training run hits several ceilings at once, so it needs several of these techniques composed in one job. Assembling that composition correctly, with the right collectives on the right network links, is more than a weekend of plumbing, and it is exactly what the two dominant frameworks exist to provide.

This section is about what those frameworks own, how they fit together, and why the configuration is the genuinely difficult artifact. We diagram the composed stack, run a demo that prices the memory and communication of two contrasting strategies, and show illustratively how the choice is expressed in a config. The deep treatment of stacking all the axes at once is Section 16.9 on 3D and 4D parallelism; here we set it up by naming the tools and the trade-off they manage.

1. Two Frameworks, Two Jobs Beginner

DeepSpeed, from Microsoft, and Megatron-LM, from NVIDIA, are often spoken of in the same breath, which obscures that they solve different problems and were designed around different parts of the memory-and-communication budget. Keeping their responsibilities distinct is the first step to using them well.

Megatron-LM is a transformer training system whose signature contribution is efficient tensor parallelism: it lays out the attention and feed-forward matrices so that the per-layer all-reduce lands on the fast intra-node interconnect, and it pairs this with a carefully scheduled pipeline parallelism across nodes. Its reason to exist is throughput; it is the reference implementation for getting high hardware utilization out of a transformer that has been cut into pieces. Megatron does not, by itself, shard the optimizer state across data-parallel replicas; within a model-parallel group each replica still holds a full optimizer for its slice.

DeepSpeed is a training-optimization library whose signature contribution is ZeRO, the sharding of optimizer state, gradients, and parameters across the data-parallel workers so that no single device holds the full training state (Section 16.4). On top of ZeRO it adds offload paths that move state off the GPU entirely, and it integrates pipeline and tensor parallelism so that ZeRO data parallelism can wrap around a model-parallel core. DeepSpeed's reason to exist is memory: it lets a given cluster train a model larger than the naive arithmetic says should fit.

Key Insight: The Frameworks Partition the Problem, Not Just the Model

Megatron owns the within-model split (tensor and pipeline parallelism on fast links), and DeepSpeed owns the across-replica split (ZeRO sharding and offload of training state). They compose because they act on orthogonal axes: Megatron decides how one copy of the model is cut across a handful of tightly coupled devices, and DeepSpeed decides how the many copies share their optimizer state and, when needed, spill it down the memory hierarchy. Choosing how much of each to apply is the configuration problem this section is really about.

2. Offload and Infinity: Renting the Memory Hierarchy Intermediate

ZeRO shards the training state across GPUs, but the total still has to live somewhere in aggregate GPU memory. DeepSpeed's ZeRO-Offload and ZeRO-Infinity break that constraint by treating the whole memory hierarchy, not just HBM, as the place where state can live. ZeRO-Offload moves the fp32 optimizer state and the optimizer step itself to CPU memory, so the GPU holds only what it needs for the forward and backward pass. ZeRO-Infinity goes further and streams parameters and optimizer state to and from NVMe solid-state storage, letting a single node, or a small cluster, train a model whose state would never fit in its GPUs.

This is the memory hierarchy from Chapter 2 used as a deliberate design lever rather than an accident of hardware. Each step down the hierarchy, HBM to CPU DRAM to NVMe, multiplies capacity and divides bandwidth, so offload buys you the ability to fit a model at the cost of slower per-step data movement. It is the right tool when the alternative is not training the model at all, and the wrong tool when you have enough GPUs that staying in HBM keeps the accelerators busy. The reliability machinery that makes long offloaded runs survivable, checkpointing and elastic restart, is the subject of Chapter 18.

Fun Note: The GPU That Sent Its Homework to the Hard Drive

ZeRO-Infinity lets a lone GPU train a model whose optimizer state lives mostly on an NVMe drive, fetched layer by layer just in time. It works, in the sense that the model trains, the way a student with a tiny desk works by keeping all but the current page in a filing cabinet across the room. Every page arrives when needed; you just would not call the throughput inspiring. Offload is a capability switch, not a speed switch.

3. The Composed Stack Intermediate

The canonical large-model recipe stacks the two frameworks along orthogonal axes: Megatron supplies tensor parallelism within a node and pipeline parallelism across a few nodes, forming one complete model replica, and DeepSpeed wraps ZeRO data parallelism around that replica so many replicas share their sharded optimizer state. This is the arrangement that trained a long line of well-known large models, and it is the concrete realization of the "a primitive returns, scaled out" thread: the all-reduce of Chapter 4 appears here three times over, once inside tensor parallelism, once as the pipeline's point-to-point exchange, and once as ZeRO's reduce-scatter and all-gather. Figure 16.6.1 shows the composition.

Figure 16.6.1: The composed training stack. Each Megatron replica (left and right) is one model cut by tensor parallelism inside a stage (fast intra-node all-reduce, orange) and pipeline parallelism between stages (point-to-point across nodes). DeepSpeed ZeRO (green, dashed) wraps the replicas, synchronizing corresponding shards with reduce-scatter and all-gather over the global fabric and optionally offloading each shard down the memory hierarchy. The configuration's job is to pick the three parallel degrees that keep the heaviest traffic on the fastest links.

Reading Figure 16.6.1 from the inside out gives the design rule the whole stack obeys: put the highest-volume, most-frequent communication on the fastest link. Tensor parallelism all-reduces activations on every layer, so it goes on NVLink inside a node. Pipeline parallelism exchanges only stage-boundary activations, far less traffic, so it can cross slower node-to-node links. ZeRO's parameter and gradient traffic, large but once per step, spreads over the global fabric. Get that ordering wrong, for example by spreading tensor parallelism across slow inter-node links, and the accelerators starve waiting on communication.

4. Pricing Two Strategies: A Runnable Demo Advanced

The reason configuration is hard is that memory and communication pull in opposite directions, and the only way to see the tension is to put numbers on it. The program below takes one model and one cluster and prices two strategies: pure ZeRO-3, which shards everything across all sixty-four GPUs, and a 3D-parallel layout that combines tensor, pipeline, and a small data-parallel degree. For each it reports the per-device memory broken into training state and activations, whether it fits, and the dominant communication volume per step. The arithmetic is deliberately simplified to expose the trade-off, not to replace a profiler.

# Compare per-device memory and dominant communication for parallelism strategies.
# Model: a transformer of P parameters trained in mixed precision with Adam.
# Cluster: G GPUs, each with M_gpu bytes of usable HBM.

P = 70e9          # parameters (70B model)
G = 64            # total GPUs
M_gpu = 80e9      # 80 GB per GPU (bytes)
H = 8192          # hidden size
L = 80            # transformer layers
B = 2             # micro-batch sequences per device
S = 4096          # sequence length

# Per-parameter training-state bytes (mixed precision + Adam):
#  fp16 weight 2 + fp16 grad 2 + fp32 master 4 + fp32 m 4 + fp32 v 4 = 16 bytes.
state_per_param = 16

def gb(x):
    return x / 1e9

print(f"Model: {P/1e9:.0f}B params | Cluster: {G} GPUs x {gb(M_gpu):.0f} GB")
print(f"Full training state if undivided: {gb(P*state_per_param):.0f} GB "
      f"(needs >= {P*state_per_param/M_gpu:.0f} GPUs just to hold state)\n")

# Activation memory per device (rough): bytes for stored activations of local layers.
act_const = 12
def act_mem(layers_on_device):
    return act_const * 2 * B * S * H * layers_on_device

# --- Config A: pure ZeRO-3 (sharded data parallel across all G GPUs) ---
zero3_state = P * state_per_param / G        # training state sharded 1/G
zero3_act = act_mem(L)                        # full stack of layers present locally
zero3_total = zero3_state + zero3_act
zero3_comm = 3 * 2 * P / G    # all-gather x2 + reduce-scatter, bytes/device/step (global fabric)

# --- Config B: tensor(T) x pipeline(Pp) x data(D) = 3D parallelism ---
T, Pp, D = 8, 4, 2                            # 8*4*2 = 64 = G
assert T * Pp * D == G
params_per_dev = P / (T * Pp)                 # tensor + pipeline both shrink the local params
tp_pp_state = params_per_dev * state_per_param
tp_pp_act = act_mem(L / Pp) / T               # fewer layers (pipeline), tensor-split width
tp_pp_total = tp_pp_state + tp_pp_act
tp_comm = 4 * (L / Pp) * 2 * B * S * H        # per-layer TP all-reduce, mostly intra-node

print("Config A  pure ZeRO-3 (sharded data parallel, degree 64)")
print(f"  per-device state : {gb(zero3_state):6.1f} GB")
print(f"  per-device activ : {gb(zero3_act):6.1f} GB")
print(f"  per-device TOTAL : {gb(zero3_total):6.1f} GB   fits={'YES' if zero3_total<M_gpu else 'NO'}")
print(f"  dominant comm    : {gb(zero3_comm):6.2f} GB/device/step over the GLOBAL group (slow links)\n")

print(f"Config B  3D parallel  T={T} x P={Pp} x D={D}")
print(f"  per-device state : {gb(tp_pp_state):6.1f} GB")
print(f"  per-device activ : {gb(tp_pp_act):6.1f} GB")
print(f"  per-device TOTAL : {gb(tp_pp_total):6.1f} GB   fits={'YES' if tp_pp_total<M_gpu else 'NO'}")
print(f"  dominant comm    : {gb(tp_comm):6.2f} GB/device/step, mostly INTRA-node (fast NVLink)\n")

print("Trade-off summary")
print(f"  ZeRO-3 frees memory by sharding state {G}x but pays {gb(zero3_comm):.2f} GB/step on the global fabric.")
print(f"  3D parallel keeps the heavy traffic on fast links, at the cost of a more complex config and")
print(f"  larger per-device state ({gb(tp_pp_state):.0f} vs {gb(zero3_state):.1f} GB).")

Code 16.6.1: A back-of-the-envelope pricer that contrasts pure ZeRO-3 against a tensor x pipeline x data layout on the same 70B model and 64-GPU cluster, reporting per-device memory (split into state and activations) and the dominant per-step communication for each.

Model: 70B params | Cluster: 64 GPUs x 80 GB
Full training state if undivided: 1120 GB (needs >= 14 GPUs just to hold state)

Config A  pure ZeRO-3 (sharded data parallel, degree 64)
  per-device state :   17.5 GB
  per-device activ :  128.8 GB
  per-device TOTAL :  146.3 GB   fits=NO
  dominant comm    :   6.56 GB/device/step over the GLOBAL group (slow links)

Config B  3D parallel  T=8 x P=4 x D=2
  per-device state :   35.0 GB
  per-device activ :    4.0 GB
  per-device TOTAL :   39.0 GB   fits=YES
  dominant comm    :  10.74 GB/device/step, mostly INTRA-node (fast NVLink)

Trade-off summary
  ZeRO-3 frees memory by sharding state 64x but pays 6.56 GB/step on the global fabric.
  3D parallel keeps the heavy traffic on fast links, at the cost of a more complex config and
  larger per-device state (35 vs 17.5 GB).

Output 16.6.1: Pure ZeRO-3 shards the optimizer state beautifully (17.5 GB per device) yet still does not fit, because every device holds the activations of the full 80-layer stack (128.8 GB); sharding state does not shard activations. The 3D layout fits in 39 GB by cutting both layers (pipeline) and layer width (tensor), and it keeps its heavier communication on the fast intra-node links.

The lesson in Output 16.6.1 is the one that surprises newcomers: ZeRO-3 alone is not enough at this scale, and the reason is activations, not optimizer state. Sharding the state across sixty-four GPUs drives it down to a comfortable 17.5 GB, but each GPU still runs the entire forward pass, so it stores the activations of all eighty layers, and those dominate. The 3D layout wins not by sharding state harder but by splitting the layers themselves across pipeline stages and tensor shards, which cuts the activation footprint along with it. This is precisely why activation checkpointing (Section 16.8) and tensor or pipeline parallelism are not optional luxuries on top of ZeRO; they attack the term ZeRO cannot reach.

Thesis Thread: The Collective Decides the Layout

Every strategy in Output 16.6.1 is ultimately a decision about which collective runs on which link. Pure ZeRO-3 routes all of its all-gather and reduce-scatter traffic over the global fabric; the 3D layout pushes the high-frequency tensor all-reduce onto NVLink and reserves the global fabric for the once-per-step ZeRO synchronization. The frameworks are, at heart, machinery for binding the collectives of Chapter 4 to a physical network so the fast links carry the heavy traffic. When you later meet 3D parallelism in Section 16.9 and expert parallelism's all-to-all in Chapter 17, ask the same question first: which collective, on which link, how often?

5. The Configuration Is the Hard Part Advanced

Once the frameworks are in place, the engineering does not become typing; it becomes choosing. The number of parameters, the cluster topology, the sequence length, and the batch size jointly determine a feasible region of (tensor, pipeline, data, ZeRO-stage, offload) settings, and within that region throughput can vary by large factors. The config file looks innocuous and is the most consequential artifact in the run. Below is an illustrative DeepSpeed configuration paired with the launch flags that select Megatron's model-parallel degrees; the point is not the exact keys but that a handful of integers encode the entire layout.

// ds_config.json (illustrative): DeepSpeed owns ZeRO sharding and offload
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 1,                      // ZeRO-1: shard optimizer state only,
                                     //   because Megatron already shards the model
    "offload_optimizer": {           // spill the optimizer shard to CPU if needed
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,            // hide reduce-scatter behind the backward pass
    "reduce_bucket_size": 5e8
  }
}

Code 16.6.2: An illustrative DeepSpeed ZeRO config for a job whose model is already tensor- and pipeline-parallel under Megatron. Because Megatron shards the model, ZeRO-1 (optimizer state only) is often the right stage; pushing to ZeRO-3 would all-gather parameters that Megatron already split, duplicating work.

# Megatron launch flags (illustrative): these pick the model-parallel degrees.
# Total GPUs = tensor x pipeline x data must equal the world size.
torchrun --nproc_per_node=8 --nnodes=8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \    # within a node, on NVLink
    --pipeline-model-parallel-size 4 \  # across nodes, stage to stage
    --sequence-parallel \               # split activations along the sequence too (16.7)
    --deepspeed --deepspeed_config ds_config.json
# data-parallel degree is implied: 64 GPUs / (8 * 4) = 2

Code 16.6.3: The Megatron side of the same job. Three integers, the tensor, pipeline, and implied data degrees, fix the layout that Figure 16.6.1 draws; --sequence-parallel previews the long-context techniques of Section 16.7.

Two integers in Code 16.6.3 that multiply wrong, or a ZeRO stage in Code 16.6.2 that fights the model parallelism instead of complementing it, can halve throughput or break the run. Choosing them well requires the communication-cost reasoning of Chapter 3 applied to the specific topology, plus a few profiling runs. The systematic method for navigating this space is the subject of Section 16.10; what matters here is the recognition that the frameworks have moved the difficulty from implementation to decision.

Library Shortcut: DeepSpeed and Megatron Replace Thousands of Lines

Writing tensor parallelism, pipeline scheduling, and ZeRO sharding by hand is thousands of lines of collective plumbing, memory bookkeeping, and overlap scheduling, and getting the overlap right alone is a research project. DeepSpeed reduces the ZeRO and offload side to one deepspeed.initialize call plus the JSON of Code 16.6.2; Megatron-LM reduces the model-parallel side to the launch flags of Code 16.6.3 over its transformer implementation. The combined frameworks handle process-group construction, collective scheduling, gradient bucketing, parameter all-gather, optimizer-state offload, and the placement of each collective on the correct network link. Your job shrinks to choosing a few integers and providing the data, which is why every section after this one is about choosing those integers well rather than implementing what they trigger.

import deepspeed

# model already wrapped with Megatron tensor + pipeline parallelism upstream
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(),
    config="ds_config.json")        # the JSON of Code 16.6.2 drives ZeRO + offload

for batch in data_loader:
    loss = model_engine(batch)      # forward across pipeline stages
    model_engine.backward(loss)     # reduce-scatter gradients (ZeRO), overlapped
    model_engine.step()             # optimizer step on the (possibly offloaded) shard

Code 16.6.4: The entire DeepSpeed training loop for a model that Megatron has already cut along tensor and pipeline axes. The three calls hide ZeRO's reduce-scatter, the parameter all-gather, and the offloaded optimizer step; the layout itself lives in the config and launch flags, not the loop.

Practical Example: The 40B Model That Would Not Fit, Then Did

Who: An ML platform engineer at a startup standing up a 64-GPU cluster (eight nodes, eight 80 GB GPUs each, NVLink inside a node, slower Ethernet between).

Situation: The research team wanted to pretrain a 40-billion-parameter transformer and had started with the simplest thing that could work, pure ZeRO-3 across all sixty-four GPUs.

Problem: The job ran out of memory on the first step despite ZeRO-3 sharding the optimizer state thirty-two-fold; the profiler showed activations, not state, filling the GPU, exactly the pattern in Output 16.6.1.

Dilemma: Add aggressive activation checkpointing to make ZeRO-3 fit but pay recomputation cost and still route all traffic over slow Ethernet, or introduce Megatron tensor and pipeline parallelism to shrink the activation footprint at the cost of a more intricate config.

Decision: They moved to a composed stack: Megatron tensor parallelism of degree eight inside each node, pipeline degree four across nodes, ZeRO-1 with optimizer offload wrapping a data-parallel degree of two.

How: Tensor parallelism stayed on intra-node NVLink, pipeline stage boundaries crossed the slower links where their light traffic could tolerate it, and ZeRO-1 sharded only the optimizer state because Megatron already sharded the model, the reasoning behind Code 16.6.2.

Result: Per-device memory dropped under the 80 GB ceiling with room to spare, the heavy tensor all-reduce never touched Ethernet, and step time fell by more than half compared to the checkpointed ZeRO-3 attempt.

Lesson: When activations dominate, sharding state harder cannot save you; split the model so the activations shrink, and place each collective on the link that can afford it.

6. What Composes, and Where It Goes Next Intermediate

The two frameworks compose because they act on orthogonal axes, and that orthogonality is what the next sections exploit and extend. Megatron's tensor and pipeline parallelism cut a single model copy; DeepSpeed's ZeRO and offload manage the many copies and the memory hierarchy beneath them. Stack them and you have three parallel degrees to choose, the setup for the full 3D parallelism treatment in Section 16.9. Add the sequence and context parallelism of Section 16.7 and a fourth axis appears, the 4D layout. Add expert parallelism's all-to-all (Chapter 17) and a fifth. Each new axis is another integer in the config and another collective to place on the network.

Research Frontier: Automating the Parallelism Choice (2024 to 2026)

Because the configuration is the hard part, a research line is trying to remove the human from it. Auto-parallelization systems in the lineage of Alpa search the space of tensor, pipeline, and data degrees to fit a model on a cluster automatically, and the idea has matured into the planners shipping inside production stacks. NVIDIA's Megatron-Core and the NeMo framework, together with the maturing of FSDP2 and PyTorch's DTensor and the torch.distributed tensor-parallel API, are converging the once-separate DeepSpeed and Megatron capabilities into composable building blocks with cost-model-driven defaults. Recent work also pushes offload further with NVMe-aware schedulers that overlap solid-state-storage traffic with compute so ZeRO-Infinity-style training keeps the accelerators busier. The throughline is that choosing the parallel degrees, the skill this section names as the hard one, is itself becoming an optimization problem the frameworks solve, with the human supplying constraints rather than integers. We return to the cost models that make such search tractable in Chapter 3.

We now have the production frameworks and a clear division of labor: Megatron within the model, DeepSpeed across the replicas and down the memory hierarchy, and a configuration that ties them to the physical network. The remaining ceiling this chapter has not addressed is the sequence dimension, the activations that grow with context length and that even tensor and pipeline parallelism leave on each device. Splitting the sequence itself across machines is the subject of Section 16.7.

Exercise 16.6.1: Who Owns Which Axis? Conceptual

For each responsibility, state whether it belongs to Megatron-LM, to DeepSpeed, or to both acting together, and justify it in one sentence: (a) splitting a single attention layer's weight matrix across four GPUs; (b) sharding the fp32 optimizer state across data-parallel replicas; (c) streaming optimizer state to NVMe when GPU and CPU memory are exhausted; (d) scheduling pipeline micro-batches to keep stage idle time low; (e) placing the per-layer activation all-reduce on the fast intra-node link. Then explain why a job can use both frameworks at once without the axes conflicting.

Exercise 16.6.2: Re-Price the Cluster Coding

Starting from Code 16.6.1, add a third configuration that applies activation checkpointing to the pure ZeRO-3 case by reducing the stored activation term (for example, dividing act_const by a checkpointing factor of eight to approximate storing only layer boundaries). Recompute whether ZeRO-3-plus-checkpointing now fits in 80 GB, and compare its per-step communication and its recomputation cost (one extra forward pass) against the 3D layout. Then sweep the tensor and pipeline degrees of Config B over the factorizations of 64 and print, for each, the per-device total memory; identify which layouts fit and which keep tensor parallelism within a single eight-GPU node.

Exercise 16.6.3: The Stage That Fights the Sharding Analysis

Code 16.6.2 chooses ZeRO-1 rather than ZeRO-3 for a model that Megatron already shards. Explain, in terms of the collectives each stage runs (Section 16.4), why combining ZeRO-3's parameter all-gather with Megatron's tensor parallelism can duplicate communication and memory work, and under what cluster conditions ZeRO-3 over a model-parallel core would nonetheless be the right call. Quantify your argument using the per-step communication terms from Output 16.6.1, and state what you would measure on a real cluster to settle the choice.