Section 24.2: Tensor-Parallel Inference

"They cut me into eight slices and told me I would feel faster. They were right, until layer eighty-one, when I spent more time gossiping than thinking."
A Weight Matrix Sharded One Slice Too Far

Big Picture

Tensor-parallel inference takes one model that does not fit, or does not run fast enough, on a single GPU and splits every layer's weights across several GPUs, so that the GPUs jointly compute each token of a single request and combine their partial results with an all-reduce on every layer. It is the training technique of Section 16.2 turned toward serving: the same column-and-row weight split, the same per-layer collective, now run forward-only to answer a prompt. It buys two things at once, smaller weight memory per GPU and lower per-token latency, and it charges one price, an all-reduce per layer per token. That price is only bearable on the fastest interconnect, which is why tensor parallelism almost always stays inside a single node. This section shows the split, proves it equals the single-GPU answer, and models exactly where the all-reduce overhead overtakes the speedup.

In the previous section we saw why a large language model can outgrow one GPU on two separate fronts: its weights may not fit, and even when they fit, a single GPU may compute each token too slowly for the latency budget. Replication, the subject of Chapter 23, answers neither problem, because a replica is a full copy of the model on one device; if the model does not fit on one device, you cannot make a replica at all, and adding replicas never makes a single request faster. The remedy is to make one logical model span several GPUs. Tensor parallelism is the first and most latency-friendly way to do that, and it is, almost beat for beat, the inference reuse of a method this book already developed for training.

1. The Same Split, Now Forward-Only Intermediate

Recall the Megatron-style split from Section 16.2. A transformer layer is, at its core, two large matrix multiplications back to back: an up-projection that widens the hidden vector, a nonlinearity, and a down-projection that narrows it again. Tensor parallelism partitions those matrices along their shared inner dimension. The first weight matrix is split by columns, so each of the $T$ GPUs holds a vertical slice and computes only a slice of the widened activation. The second weight matrix is split by rows to match, so each GPU multiplies its activation slice by its own row slice and produces a partial copy of the full-width output. No GPU holds the whole layer; each holds a $1/T$ slice of both weight matrices, and the attention block splits the same way, by attention heads.

The partial outputs are the key. Each GPU computes a full-shape output vector that is correct except that it is missing the contributions of the slices held by the other GPUs. Because the down-projection is a sum over the inner dimension, and a sum decomposes exactly the way the gradient sum did in Section 1.1, adding the $T$ partial outputs reconstructs the exact single-GPU result. That addition is an all-reduce: every GPU contributes one vector, the vectors are summed, and every GPU receives the total so it can feed the next layer. This is the same all-reduce collective introduced in Section 4.3, now fired once on the forward pass of each layer rather than once on the backward pass of each step. Figure 24.2.1 lays out this split across three GPUs, with the column-and-row weight slices, the head-sharded KV cache, and the per-layer all-reduce that combines the partial outputs.

Figure 24.2.1: One transformer layer served tensor-parallel across three GPUs. Each GPU holds a column slice of the up-projection ($W_1$) and the matching row slice of the down-projection ($W_2$), and produces a full-shape partial output. The three partials are summed by an all-reduce (green) to recover the exact single-GPU output, and this collective fires once per layer for every token. The KV cache (dashed) is sharded by attention head across the same three GPUs, so no GPU stores the whole cache.

Two consequences follow immediately, and they are exactly why tensor parallelism is used at inference rather than only at training. First, the weight memory per GPU drops by a factor of $T$, because each GPU stores only its slice; a model that needs four GPUs' worth of memory becomes servable on four GPUs that no single one could host. Second, the per-token latency drops, because the two big matrix multiplications that dominate each layer are now done $T$ ways in parallel, with each GPU doing $1/T$ of the arithmetic. Replication can give you neither: it cannot fit a model that does not fit, and it cannot make one token faster. Tensor parallelism makes ONE replica span GPUs; replication, the contrast we draw in Section 5, adds MORE replicas. They are orthogonal, and a real serving stack uses both.

2. Verifying the Split Equals One GPU Intermediate

The claim that summing partial outputs reconstructs the single-GPU result is the whole correctness argument, and it is worth seeing run rather than asserted. The code below builds one transformer feed-forward block, computes its output the ordinary single-GPU way, then computes it again as $T$ sharded GPUs that each hold a column slice of the first weight, the matching row slice of the second weight, and contribute one partial output to an all-reduce. It checks that the two agree for several values of $T$.

import numpy as np

rng = np.random.default_rng(7)
d_model, d_ff, S = 256, 1024, 16      # hidden dim, FFN inner dim, sequence length

x  = rng.standard_normal((S, d_model)) / np.sqrt(d_model)
W1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

def gelu(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Single-GPU reference: y = gelu(x @ W1) @ W2
ref = gelu(x @ W1) @ W2

# Tensor-parallel across T simulated GPUs.
# W1 column-parallel (slice the d_ff axis), W2 row-parallel (same axis).
# Each GPU returns a full-shape partial; summing them is the all-reduce.
def tensor_parallel(T):
    cols = np.array_split(np.arange(d_ff), T)        # how the inner dim is sliced
    partials = []
    for c in cols:                                    # each c lives on one GPU
        W1_shard = W1[:, c]                            # column slice of W1
        W2_shard = W2[c, :]                            # matching row slice of W2
        h = gelu(x @ W1_shard)                         # local activation (S, |c|)
        partials.append(h @ W2_shard)                 # partial (S, d_model) output
    return np.sum(partials, axis=0)                   # all-reduce SUM across GPUs

for T in (1, 2, 4, 8):
    out = tensor_parallel(T)
    print(f"T={T}  max|TP - single| = {np.max(np.abs(out - ref)):.2e}")

Code 24.2.1: A transformer feed-forward block served tensor-parallel across $T$ simulated GPUs. The column-parallel and row-parallel slices are chosen so that summing the per-GPU partial outputs is exactly an all-reduce, and the result is compared against the single-GPU output ref.

T=1  max|TP - single| = 0.00e+00
T=2  max|TP - single| = 2.50e-16
T=4  max|TP - single| = 2.64e-16
T=8  max|TP - single| = 2.50e-16

Output 24.2.1: Across $T = 1, 2, 4, 8$ GPUs the tensor-parallel output matches the single-GPU output to floating-point rounding ($\sim 2 \times 10^{-16}$). The split changes the answer by nothing that matters; it changes only where the work runs.

The agreement reported in Output 24.2.1 is at the level of double-precision rounding, so tensor parallelism is, like data parallelism before it, an exact reorganization and not an approximation. The reader who served a request expecting a degraded answer gets the identical answer. What changed is purely operational: the arithmetic that one GPU would have done alone is now spread across $T$ GPUs, with one all-reduce per layer stitching their partial outputs back into the single coherent token the model would have produced.

Key Insight: Tensor Parallelism Trades Compute for a Per-Layer Collective

Splitting a layer across $T$ GPUs divides its matrix-multiply work by $T$ but adds one all-reduce of the hidden vector to every layer. The compute saving is large and shrinks as $1/T$; the all-reduce cost is small per call but is paid on every one of the model's layers for every single token generated. Whether the trade is worth it depends entirely on how fast the GPUs can exchange that hidden vector, which is why tensor parallelism lives or dies by the interconnect.

3. Why It Must Stay on the Fastest Interconnect Advanced

The all-reduce per layer is the entire reason tensor parallelism is confined, in practice, to a single node. A token-generation step walks every layer of the model in sequence, and each layer ends with an all-reduce that every GPU must finish before the next layer can begin. For an 80-layer model generating one token, that is 80 all-reduces standing directly on the critical path, with no useful computation to hide behind them, because the next layer literally depends on the summed result. This is the topology argument of Section 4.9 made concrete: a collective on the critical path must run on the fastest links available, and within a node those links are NVLink, which moves data between GPUs at hundreds of gigabytes per second, an order of magnitude faster than the network between nodes.

Push the tensor-parallel group across a node boundary and every one of those 80 per-token all-reduces now crosses the slow inter-node fabric. The collective that was a rounding error in the layer's time budget becomes the dominant term, and per-token latency rises instead of falls. This is why practical tensor-parallel degree is capped, usually to the number of GPUs in one node (8 on a typical server), and why spanning more machines is the job of pipeline parallelism in Section 24.3, which communicates far less often. Tensor parallelism is the within-node tool; the moment you leave the node, its economics invert.

Fun Note: The Chattiest Method in the Book

Data-parallel training in Section 1.1 all-reduces once per training step, which might be once every few hundred milliseconds. Tensor-parallel decoding all-reduces once per layer per token, which for an 80-layer model answering at 50 tokens per second is 4,000 all-reduces every second. It is, by a wide margin, the most talkative collective pattern you will meet, and it is precisely that talkativeness that nails it to NVLink.

4. The Latency-Versus-Overhead Trade-Off Advanced

If more GPUs always made each token faster, you would use as many as you could buy. They do not, and the reason is the all-reduce. We can model the per-token latency directly. Let the per-layer compute on one GPU cost $c_0$, falling as $c_0/T$ when split across $T$ GPUs. Let the per-layer all-reduce cost grow with $T$ as a ring or tree collective does, roughly $a + b\log_2 T$ for $T > 1$, where $a$ is a fixed latency floor and $b$ scales the per-step transfer term. Then the per-token latency for an $L$-layer model is

$$\text{latency}(T) = L \left( \frac{c_0}{T} + a + b\log_2 T \right), \qquad T > 1.$$

The first term falls with more GPUs; the second and third rise. Their sum has a minimum and then climbs, so adding GPUs helps until the all-reduce overhead overtakes the shrinking compute, after which more GPUs make each token slower, not faster. The same code that verified correctness also evaluates this model in Code 24.2.2, so the diminishing returns are measured rather than claimed.

# Latency model: compute falls as 1/T, all-reduce overhead grows with T.
# Units are milliseconds; values are illustrative of an NVLink-class node.
c0, a, b, L = 1.00, 0.04, 0.06, 80     # per-layer compute, allreduce floor, allreduce slope, layers
print(f"{'T':>3} {'compute':>9} {'allreduce':>10} {'total/layer':>12} {'token(ms)':>10} {'speedup':>8}")
base = None
for T in (1, 2, 4, 8, 16):
    compute = c0 / T
    allred  = (a + b * np.log2(T)) if T > 1 else 0.0   # no collective with one GPU
    per_layer = compute + allred
    token = per_layer * L                              # L layers on the critical path
    base = token if base is None else base
    print(f"{T:>3} {compute:>9.3f} {allred:>10.3f} {per_layer:>12.3f} {token:>10.2f} {base/token:>7.2f}x")

Code 24.2.2: The per-token latency model evaluated for $T = 1$ to $16$. The compute column shrinks as $1/T$ while the allreduce column grows as $a + b\log_2 T$, and their sum across $L = 80$ layers is the per-token latency whose speedup over a single GPU is reported in the last column.

  T   compute  allreduce  total/layer  token(ms)  speedup
  1     1.000      0.000        1.000      80.00      1.00x
  2     0.500      0.100        0.600      48.00      1.67x
  4     0.250      0.160        0.410      32.80      2.44x
  8     0.125      0.220        0.345      27.60      2.90x
 16     0.062      0.280        0.342      27.40      2.92x

Output 24.2.2: Going from 1 to 8 GPUs cuts per-token latency from 80 ms to 27.6 ms, a $2.9\times$ speedup, far short of the ideal $8\times$ because the all-reduce term has grown. Doubling again to 16 GPUs buys almost nothing ($2.90\times \to 2.92\times$): the all-reduce now dominates and the compute saving has run out.

The numbers tell the practical story crisply. Two GPUs give a healthy $1.67\times$; eight give $2.9\times$, already well below the ideal $8\times$ because every halving of compute is partly eaten by a growing collective. Sixteen GPUs are essentially pointless here, gaining 0.02 of a multiplier while doubling the hardware bill, and on real hardware where 16 GPUs would straddle two nodes the curve would bend the other way and latency would rise. This is the quantitative form of the rule from Section 3: there is a sweet spot for the tensor-parallel degree, it is usually within one node, and past it you are paying for GPUs that make the model slower.

Thesis Thread: A Collective Returns, Now on the Hot Path

The all-reduce proved exact by hand in Section 1.1 and specified as a primitive in Chapter 4 returns here in its most demanding role. In data-parallel training it fired once per step; in tensor-parallel serving it fires once per layer per token, directly on the latency critical path. The same operation that made distributed training exact now makes distributed serving fast, and the same cost model that bounded its training overhead (Chapter 23's fleet sizing rests on these per-node numbers) bounds its serving overhead. Scale-out reuses one primitive across the whole book; what changes is how often, and how urgently, it must run.

5. The Sharded KV Cache, and the Contrast with Replication Intermediate

Weights are not the only thing tensor parallelism shards. The KV cache, the per-token attention state whose per-node economics Chapter 22 developed as a prerequisite, is also split across the tensor-parallel GPUs. Because attention is partitioned by head, each GPU computes and stores the keys and values only for its own heads, so the cache for a long context is divided $T$ ways alongside the weights. This is the multiplication promised at the end of Chapter 22: per-node KV-cache pressure, already the dominant memory cost of long-context serving, is here relieved by the same factor $T$ that relieves the weight memory, which is part of why tensor parallelism is what makes very long contexts servable at all. No extra collective is needed for the cache itself; each GPU's heads stay on that GPU, and the per-layer all-reduce that combines the attention output already carries the cross-head coupling.

It helps to state plainly how this differs from replication, the subject of Chapter 23. Replication makes copies: each replica is a complete model on its own GPU (or its own tensor-parallel group), and adding replicas raises how many requests per second the fleet can serve. Tensor parallelism makes one model bigger than one GPU: it spans a single logical model across several GPUs to fit it and to speed up each token. The two compose cleanly and are the two axes of a serving deployment, often written together as a configuration like "tensor-parallel size 8, with 6 replicas." You reach for tensor parallelism when one model will not fit or one token is too slow; you reach for replication when one request stream is more than the fleet can keep up with. Confusing them is a common and expensive mistake: adding replicas will never fix a model that does not fit, and adding tensor-parallel GPUs past the sweet spot will never raise throughput.

Practical Example: Fitting the 70B That Would Not Fit

Who: An inference platform engineer at a company rolling out an internal coding assistant.

Situation: The chosen 70-billion-parameter model needed roughly 140 GB in half precision, but the available GPUs held 80 GB each, so the model did not fit on one device at all.

Problem: Replication, their usual scaling lever, was useless here: you cannot make a replica of a model that will not load onto a single GPU.

Dilemma: Quantize aggressively to squeeze onto one GPU and risk quality, or span the model across GPUs and decide how many, knowing that too few would still not fit and too many would waste hardware on a flat part of the latency curve.

Decision: They set tensor-parallel size to 4 within a single 8-GPU node, which fit the model comfortably (about 35 GB plus cache per GPU) and put each token's matrix multiplies on four GPUs over NVLink.

How: They launched vLLM with tensor_parallel_size=4, leaving the framework to place the column-and-row weight shards, shard the KV cache by head, and schedule the per-layer all-reduces on NVLink; the remaining four GPUs in the node held a second replica.

Result: The model fit and served at the latency target, and the two replicas (each tensor-parallel-4) doubled throughput, exactly the two orthogonal axes from this section working together.

Lesson: Tensor parallelism is the lever for "does not fit" and "too slow per token"; replication is the lever for "not enough requests per second." A real deployment sets both, and keeps the tensor-parallel group inside one node.

Library Shortcut: vLLM Does the Sharding and the All-Reduces for You

Code 24.2.1 placed the weight slices and summed the partials by hand, which for a real model means orchestrating column-and-row shards across every layer, a head-sharded KV cache, and the NVLink all-reduce schedule, hundreds of lines of careful collective code. A production serving engine reduces all of it to one argument:

from vllm import LLM, SamplingParams

# tensor_parallel_size=4 shards every layer's weights and the KV cache across
# 4 GPUs on this node, and inserts the per-layer all-reduce on NVLink.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)

out = llm.generate(["Explain tensor parallelism in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

Code 24.2.3: The same split as Code 24.2.1, now as one tensor_parallel_size argument to vLLM. The engine handles weight sharding, the head-sharded KV cache, and the per-layer NVLink all-reduce that Section 3 insists must stay within a node; the manual collective code collapses to a single keyword.

6. When Tensor Parallelism Is the Right Tool Intermediate

Tensor parallelism is the right answer to two questions and the wrong answer to a third. It is right when a model does not fit on one GPU, because it divides the weight and cache memory by $T$. It is right when a single token must be produced faster than one GPU can manage, because it divides the per-layer compute by $T$, up to the sweet spot the latency model exposed. It is the wrong answer when the model already fits and runs fast enough and you simply need more requests per second; that is replication's job, and reaching for tensor parallelism there just adds all-reduces to a model that did not need them. Keep the group inside one node, find the tensor-parallel degree at the bottom of the latency curve rather than the largest you can afford, and let pipeline parallelism (Section 24.3) carry the model across nodes when one node is genuinely not enough.

Research Frontier: Cutting the Per-Layer Collective (2024 to 2026)

Because the all-reduce per layer is tensor parallelism's defining cost, a vigorous research line attacks it. Computation-communication overlap fuses the matrix multiply with the collective so the all-reduce hides behind the GEMM rather than stalling after it; NVIDIA's TransformerEngine and the Flux kernels (Chang et al., 2024) push this fine-grained overlap into production decoding. Sequence-parallel and Ulysses-style schemes (Jacobs et al., 2023, with 2024 to 2025 serving variants) rearrange where the collective falls so that activation memory shrinks alongside the weights. A separate thread asks whether the collective must be a full all-reduce at all, exploring lower-precision (FP8) communication and quantized activations on the wire to shrink the bytes per call, echoing the communication-avoiding training work of Chapter 10. The throughline is the one this section measured: the field treats the per-layer all-reduce as a cost to engineer down, because it is the term that caps the tensor-parallel degree.

We now have the within-node tool for serving a model too large or too slow for one GPU: shard every layer's weights and KV cache across $T$ GPUs, pay one all-reduce per layer on NVLink, and stop at the degree where the collective overtakes the speedup. When one node is not enough, the model must be cut a different way, across nodes, with far less frequent communication. That is pipeline parallelism, and Section 24.3 takes it up.

Exercise 24.2.1: Why Not Just Add Replicas? Conceptual

A teammate proposes serving a 140 GB model on 80 GB GPUs by "running more replicas." Explain in two or three sentences why this cannot work, what tensor parallelism does instead, and the one situation in which their replication instinct is exactly right. Then state, for a deployment described as "tensor-parallel size 8, 6 replicas," how many GPUs it uses and what each of the two numbers buys you.

Exercise 24.2.2: Shard the Attention Block Coding

Code 24.2.1 sharded a feed-forward block. Extend it to a single-head-group attention block: split a multi-head attention layer by heads across $T$ simulated GPUs, where each GPU computes the attention output for its own heads and stores only its heads' keys and values (the sharded KV cache of Section 5). Concatenate the per-GPU head outputs, apply the row-parallel output projection with an all-reduce, and verify the result matches a single-GPU multi-head attention forward pass to floating-point rounding. Report the per-GPU KV-cache size as a function of $T$ for a context of $C$ tokens.

Exercise 24.2.3: Find the Sweet Spot Analysis

Using the latency model $\text{latency}(T) = L\left(c_0/T + a + b\log_2 T\right)$ from Section 4, treat $T$ as continuous and differentiate to find the $T$ that minimizes per-token latency in terms of $c_0$, $a$, $b$. With the section's values ($c_0 = 1$, $a = 0.04$, $b = 0.06$), compute that optimal $T$ and compare it to the discrete results in Output 24.2.2. Then explain what happens to the optimum if $b$ grows tenfold, the situation when the all-reduce crosses a node boundary onto a slow fabric, and connect your answer to why the group is capped to one node.