Part I: Foundations of Distributed AI
Chapter 5: Evaluating Distributed AI Systems

Communication-to-Computation Ratio

"They kept adding workers and asking me to go faster. I kept spending more of my day on the phone. Nobody asked which of those two facts caused the other."

A GPU Idling on a Communication Barrier
Big Picture

When a distributed job refuses to go faster as you add machines, one number usually explains it: the ratio of time spent communicating to time spent computing. Speedup curves (Section 5.2) and throughput numbers (Section 5.3) tell you that a job has stopped scaling; they do not tell you why. The communication-to-computation ratio is the diagnostic that does. A ratio near zero means each worker is busy with useful arithmetic and the cluster will keep rewarding more hardware. A ratio near or above one means the workers spend as much time waiting on the network as doing math, and adding machines only buys more waiting. This section defines the ratio, shows how to measure it from a profiled training step, and turns it into a decision rule for the four levers you actually pull: more workers, bigger batches, gradient compression, or better overlap.

The previous sections of this chapter gave you outcome metrics. Speedup and efficiency say how far a job is from linear scaling; goodput and tail latency say how much useful work survives contention and stragglers. Those are the symptoms. To treat the disease you need a metric that points at the mechanism, and for distributed training the dominant mechanism is the cost of moving gradients between machines. Chapter 4 opens with the thesis that communication, not computation, is what ultimately bounds large-scale training; this section turns that thesis into a quantity you can read off a profiler and act on. The communication-to-computation ratio, written $R$ throughout, is the single most useful diagnostic in your evaluation toolkit precisely because it converts a vague complaint ("it does not scale") into a measured cause.

1. Defining the Ratio (Beginner)

Decompose one step of a distributed worker into the time it spends on useful arithmetic and the time it spends moving data. Let $T_{\text{comp}}$ be the compute time of a step (forward and backward passes, the actual floating-point work) and let $T_{\text{comm}}$ be the communication time exposed on the critical path, meaning the network time that is not hidden behind compute. The communication-to-computation ratio is their quotient,

$$R = \frac{T_{\text{comm}}}{T_{\text{comp}}}.$$

The wall-clock time of a step, when communication is not overlapped, is $T_{\text{step}} = T_{\text{comp}} + T_{\text{comm}}$. The fraction of that step doing useful work is the per-step scaling efficiency,

$$E = \frac{T_{\text{comp}}}{T_{\text{comp}} + T_{\text{comm}}} = \frac{1}{1 + R}.$$

This little identity is the whole reason the ratio matters. Efficiency is a falling function of $R$: at $R = 0$ the step is pure compute and $E = 100\%$; at $R = 1$ the worker splits its time evenly and $E = 50\%$; at $R = 4$ it spends four fifths of every step on the network and $E = 20\%$. The word "exposed" in the definition of $T_{\text{comm}}$ is load-bearing, and Section 5 returns to it: modern frameworks overlap much of the all-reduce with the backward pass, so the number that belongs in $R$ is the communication that overlap failed to hide, not the raw transfer time.
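The identity is easy to tabulate in both directions. The following is a minimal sketch; the helper names `efficiency_from_ratio` and `ratio_from_efficiency` are ours for illustration and do not come from any library. It assumes the non-overlapped step model above, where $T_{\text{step}} = T_{\text{comp}} + T_{\text{comm}}$.

```python
def efficiency_from_ratio(r):
    """Per-step scaling efficiency E = 1 / (1 + R) for a non-overlapped step."""
    if r < 0:
        raise ValueError("the ratio R cannot be negative")
    return 1.0 / (1.0 + r)


def ratio_from_efficiency(e):
    """Invert the identity: the R implied by an observed efficiency E."""
    if not 0.0 < e <= 1.0:
        raise ValueError("efficiency must lie in (0, 1]")
    return 1.0 / e - 1.0


# The three reference points quoted in the text: R = 0, 1, 4.
for r in (0.0, 1.0, 4.0):
    print(f"R = {r:3.1f}  ->  E = {efficiency_from_ratio(r) * 100:5.1f} %")
```

The inverse is the direction you use in practice: given an efficiency read off a scaling curve, `ratio_from_efficiency` tells you how large the exposed communication term must be relative to compute.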

Key Insight: The Ratio Is the Why Behind the Speedup Curve

A flattening speedup curve is a symptom; the communication-to-computation ratio is the cause. Because per-step efficiency is exactly $E = 1/(1+R)$, every point on a sub-linear scaling curve can be traced to a value of $R$ at that worker count. When you double the workers and the speedup barely moves, measure $R$: if it climbed from $0.2$ to near $1$, communication is now the bottleneck, and no amount of faster arithmetic will help until you shrink the network term.

2. Why the Ratio Grows With Scale (Intermediate)

The ratio is not a constant of the job; it grows as you add workers, which is exactly why scaling stalls. The communication-cost model of Section 3.8 gives the shape. For a ring all-reduce of a gradient with $P$ parameters across $K$ workers, the bytes each worker sends are roughly $2(K-1)/K$ times the gradient size, and under the $\alpha$-$\beta$ model the time is approximately

$$T_{\text{comm}} \approx 2(K-1)\,\alpha + 2\,\frac{K-1}{K}\,P\,\beta,$$

where $\alpha$ is the per-message latency and $\beta$ is the per-byte transfer cost. Because $\beta$ is a per-byte cost, the gradient size in the bandwidth term must be in bytes: read $P$ there as the parameter count times the bytes per parameter. The bandwidth term saturates as $K$ grows (the factor $(K-1)/K$ approaches one), but the latency term $2(K-1)\alpha$ keeps climbing linearly in $K$. Meanwhile, if you hold the global batch fixed and add workers, each worker's compute $T_{\text{comp}}$ shrinks because it processes fewer examples. A growing numerator over a shrinking denominator is a ratio that rises fast, which is the analytic reason that fixed-batch strong scaling hits a wall. Chapter 4 develops this bound for every collective; here the point is only that $R$ is a function of scale, so you must measure it at the worker count you actually run.
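To see the shape of that growth, the cost model can be evaluated directly. The sketch below is illustrative only. The gradient size, latency, bandwidth, and single-worker compute time are hypothetical values chosen for the example, the function names are ours, and the model ignores overlap, so it gives the raw rather than the exposed ratio.

```python
def ring_allreduce_ms(k, grad_bytes, alpha_s, bytes_per_s):
    """Alpha-beta estimate of ring all-reduce time, in ms, across k workers."""
    if k < 2:
        return 0.0  # a single worker has nothing to reduce
    latency_s = 2 * (k - 1) * alpha_s                        # grows linearly in k
    bandwidth_s = 2 * (k - 1) / k * grad_bytes / bytes_per_s  # saturates in k
    return (latency_s + bandwidth_s) * 1e3


# Hypothetical job parameters (assumptions, not measurements).
GRAD_BYTES = 400e6          # 400 MB of gradients
ALPHA_S = 10e-6             # 10 microseconds per message
BYTES_PER_S = 25e9          # 25 GB/s, so beta = 1 / BYTES_PER_S
SINGLE_WORKER_MS = 2000.0   # compute for the whole global batch on one worker


def ratio_at(k):
    """Raw communication-to-computation ratio at k workers, fixed global batch."""
    t_comp = SINGLE_WORKER_MS / k   # per-worker compute shrinks as 1/k
    t_comm = ring_allreduce_ms(k, GRAD_BYTES, ALPHA_S, BYTES_PER_S)
    return t_comm / t_comp


for k in (2, 8, 32, 128):
    print(f"K = {k:4d}   R = {ratio_at(k):6.3f}")
```

Under these assumed numbers the ratio passes through every regime between 2 and 128 workers. The all-reduce time itself barely changes; almost all of the growth comes from the shrinking compute term.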

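[Figure: horizontal bars of step time (ms, axis 0 to 200) for three regimes, each split into compute and exposed communication. Baseline: compute 180 ms, exposed comm 120 ms, R = 0.67. + Overlap: compute 180 ms, exposed comm 36 ms, R = 0.20. + Compress: compute 180 ms, exposed comm 9 ms, R = 0.05.]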
Figure 5.4.1: One training step, broken into compute (blue) and exposed communication (orange), under three regimes. The compute bar never moves because the arithmetic is the same; what shrinks is the orange tail. Overlap hides most of the all-reduce behind the backward pass, dropping the ratio $R$ from $0.67$ to $0.20$; compressing the exposed remainder drops it to $0.05$. The numbers are the measured outputs of Code 5.4.1.

3. Measuring It From a Real Step (Intermediate)

You do not estimate $R$ from a formula in practice; you measure it. The procedure is always the same: profile one steady-state training step, attribute each slice of wall-clock to compute or to communication, then take the quotient. Three tools cover almost every case. The PyTorch profiler (torch.profiler) tags each operator as compute or as an NCCL collective and gives you a per-category time breakdown directly. CUDA events placed around the backward pass and around the all-reduce give you the two raw numbers with microsecond resolution when you want a lightweight in-loop measurement. And nccl-tests, run standalone on the same interconnect, gives you the pure collective time for a gradient of your size, which is the $T_{\text{comm}}$ you would see with zero overlap. Read all three the same way: the only quantities you need are the compute time and the communication time left on the critical path.

Library Shortcut: torch.profiler and nccl-tests Hand You the Two Numbers

You rarely instrument timers by hand. The PyTorch profiler separates compute kernels from communication kernels for you, and a single trace yields both terms of the ratio:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_one_step(model, batch)          # forward, backward, all-reduce, optimizer
    torch.cuda.synchronize()

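# Group kernels: NCCL collectives are communication, every other GPU kernel is compute.
# Sum *self* GPU time so a parent op does not double-count its child kernels.
# (Recent PyTorch releases name this attribute self_device_time_total.)
ev = prof.key_averages()
comm = sum(e.self_cuda_time_total for e in ev if "nccl" in e.key.lower())
comp = sum(e.self_cuda_time_total for e in ev) - comm
print(f"R = {comm / comp:.3f}   efficiency = {comp / (comp + comm) * 100:.1f}%")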
Code 5.4.2: NCCL kernels are identifiable by name in the trace, so the ratio is a one-line quotient over a category sum. For the standalone all-reduce time on your fabric, nccl-tests (the all_reduce_perf binary) reports bandwidth and time for any message size with no training loop to set up. The profiler replaces dozens of lines of manual CUDA-event bookkeeping. One caveat: a category sum counts NCCL kernel time whether or not it ran concurrently with compute, so when overlap is enabled it yields the raw communication time, which is an upper bound on the exposed term; read the exposed portion from the trace timeline.

The runnable demo below skips the GPU and works directly from a profiled breakdown, because the arithmetic of the ratio is what we want to make concrete, not the act of capturing a trace. Given a compute time and a raw all-reduce time, it computes $R$ and the resulting efficiency, then shows how two levers, overlap (Section 4.10) and gradient compression, move both numbers. These are the same numbers plotted in Figure 5.4.1.

def scaling_efficiency(t_compute, t_exposed_comm):
    """Fraction of step time spent on useful compute; the rest is comm stall."""
    step = t_compute + t_exposed_comm
    ratio = t_exposed_comm / t_compute          # communication-to-computation ratio
    eff = t_compute / step                      # = 1 / (1 + ratio)
    return ratio, eff


# Measured per-step times in milliseconds, from a profiler breakdown of one
# training step on a worker: compute (forward + backward) and the raw all-reduce.
t_compute = 180.0          # ms of useful FLOPs per step
t_comm_raw = 120.0         # ms the all-reduce takes if nothing hides it

print("Baseline: no overlap, no compression")
ratio, eff = scaling_efficiency(t_compute, t_comm_raw)
print(f"  exposed comm        : {t_comm_raw:6.1f} ms")
print(f"  comm/compute ratio  : {ratio:6.3f}")
print(f"  scaling efficiency  : {eff*100:5.1f} %")

print("\nWith overlap: 70% of the all-reduce hides behind the backward pass")
overlap = 0.70
t_comm_overlapped = t_comm_raw * (1.0 - overlap)
ratio, eff = scaling_efficiency(t_compute, t_comm_overlapped)
print(f"  exposed comm        : {t_comm_overlapped:6.1f} ms")
print(f"  comm/compute ratio  : {ratio:6.3f}")
print(f"  scaling efficiency  : {eff*100:5.1f} %")

print("\nWith overlap + 4x gradient compression on the exposed remainder")
t_comm_compressed = t_comm_overlapped / 4.0
ratio, eff = scaling_efficiency(t_compute, t_comm_compressed)
print(f"  exposed comm        : {t_comm_compressed:6.1f} ms")
print(f"  comm/compute ratio  : {ratio:6.3f}")
print(f"  scaling efficiency  : {eff*100:5.1f} %")
Code 5.4.1: From a step breakdown to the ratio and the efficiency it implies. The same scaling_efficiency function is applied three times as overlap and compression peel time off the exposed-communication term while the compute term stays fixed.
Baseline: no overlap, no compression
  exposed comm        :  120.0 ms
  comm/compute ratio  :  0.667
  scaling efficiency  :  60.0 %

With overlap: 70% of the all-reduce hides behind the backward pass
  exposed comm        :   36.0 ms
  comm/compute ratio  :  0.200
  scaling efficiency  :  83.3 %

With overlap + 4x gradient compression on the exposed remainder
  exposed comm        :    9.0 ms
  comm/compute ratio  :  0.050
  scaling efficiency  :  95.2 %
Output 5.4.1: The baseline ratio of $0.667$ means the worker wastes 40% of every step waiting on the network. Overlap alone cuts the ratio to $0.20$ and lifts efficiency to 83%; adding compression on the leftover takes it to $0.05$ and 95%. The compute term never changed; only the exposed-communication term did.

4. Reading the Ratio: A Decision Rule (Intermediate)

The value of $R$ tells you which lever to pull, and pulling the wrong one wastes money. Table 5.4.1 is the rule the rest of this book quietly follows whenever it sizes a job. The guiding principle is simple: when $R$ is small you are compute-bound and the cluster will reward more hardware, so scale out; when $R$ is large you are communication-bound and more hardware only adds more waiting, so first attack the communication term itself.

Table 5.4.1: Reading the communication-to-computation ratio $R$ and choosing the lever that actually helps. The thresholds are rules of thumb, not laws; measure at your real worker count.
| Regime | What it means | The lever to pull |
| --- | --- | --- |
| $R \lesssim 0.1$ | Compute-bound; the network is nearly free. | Add workers or shrink the batch per worker; scaling out is efficient. |
| $0.1 \lesssim R \lesssim 0.5$ | Healthy; communication is visible but overlap can hide most of it. | Maximize overlap (Section 4.10); only then add workers. |
| $0.5 \lesssim R \lesssim 1$ | Communication is eating into scaling; efficiency is below 67%. | Raise compute per step with a bigger local batch, or compress gradients. |
| $R \gtrsim 1$ | Communication-bound; the worker waits more than it works. | Stop adding workers; compress, batch larger, or reduce sync frequency. |

Map the four levers onto the formula and each one becomes obvious. Adding workers lowers $T_{\text{comp}}$ per worker (fixed global batch) and raises $T_{\text{comm}}$, so it increases $R$ and is only safe when $R$ starts small. A bigger local batch raises $T_{\text{comp}}$ while leaving the gradient size, and hence $T_{\text{comm}}$, unchanged, so it decreases $R$ directly; this is why large-batch training is the first reflex of anyone fighting a communication wall. Gradient compression cuts the bytes in $T_{\text{comm}}$ and so cuts $R$ at the source. Overlap cuts the exposed fraction of $T_{\text{comm}}$ without touching the bytes at all, which is why it is the cheapest lever and the one to exhaust first.
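The rows of Table 5.4.1 can be sketched as a small lookup. This is an illustration, not a prescribed tool. The function name `classify_ratio` is ours, and the hard cut-offs at 0.1, 0.5, and 1.0 are a simplifying assumption, since the table states its thresholds only as rules of thumb.

```python
def classify_ratio(r):
    """Map a measured exposed ratio R to (regime, lever) following Table 5.4.1.

    The sharp boundaries are an assumption made for this sketch; the table's
    thresholds are approximate.
    """
    if r < 0:
        raise ValueError("the ratio R cannot be negative")
    if r <= 0.1:
        return ("compute-bound", "add workers; scaling out is efficient")
    if r <= 0.5:
        return ("healthy", "maximize overlap, then add workers")
    if r <= 1.0:
        return ("communication eating into scaling",
                "bigger local batch or gradient compression")
    return ("communication-bound",
            "stop adding workers; compress, batch larger, or sync less often")


# The three ratios from Output 5.4.1 plus one communication-bound case.
for r in (0.05, 0.20, 0.667, 1.4):
    regime, lever = classify_ratio(r)
    print(f"R = {r:5.3f}  {regime:35s} -> {lever}")
```

Feed it the ratio measured at the worker count you actually run; a value estimated at a smaller scale will usually land in a more optimistic row.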

Practical Example: The Cluster That Got Slower at 64 Workers

Who: An ML platform engineer scaling a vision-transformer pretraining job on a cloud GPU cluster.

Situation: Going from 16 to 32 workers nearly doubled throughput, but going from 32 to 64 added only 10%, and 128 workers were slower in wall-clock than 64.

Problem: The team's instinct was to request a larger quota and push past 128, assuming the plateau was a scheduling artifact.

Dilemma: Buy more workers on a hunch, which is fast to try but expensive if wrong, or stop and measure the communication-to-computation ratio at each worker count to find the actual cause.

Decision: They profiled one step at 32 and 64 workers with torch.profiler and found $R$ had jumped from $0.35$ to $1.4$: the global batch was fixed, so per-worker compute had halved while the all-reduce latency term kept climbing.

How: Rather than add workers, they doubled the per-worker batch (restoring $T_{\text{comp}}$) and turned on gradient bucketing with computation overlap, then measured again.

Result: At 64 workers $R$ fell back to $0.3$, efficiency rose to 77%, and throughput scaled almost linearly to 128 workers, at lower cost than the larger quota they had nearly bought.

Lesson: A plateau in the speedup curve is a question, and the ratio answers it. They were communication-bound, so the fix was to shrink communication and grow compute per step, not to add machines that would only wait longer.
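The numbers in this example can be cross-checked against the step-time model. The sketch below works in normalized, hypothetical units. It assumes per-worker compute is 1.0 at 32 workers and exactly halves at 64, that step time is $T_{\text{comp}}(1+R)$, and that throughput is samples per step divided by step time. None of these figures comes from a real trace.

```python
def step_time(t_comp, r):
    """Non-overlapped step time: T_comp + T_comm = T_comp * (1 + R)."""
    return t_comp * (1.0 + r)


# Normalized units (assumption): per-worker compute is 1.0 at 32 workers.
t32 = step_time(1.0, 0.35)   # 32 workers, R = 0.35
t64 = step_time(0.5, 1.4)    # 64 workers, compute halved, R = 1.4

# Fixed global batch: throughput is inversely proportional to step time.
gain_32_to_64 = t32 / t64 - 1.0

# After the fix: per-worker batch doubled (compute back to 1.0) and R = 0.3.
t64_fixed = step_time(1.0, 0.3)
eff_fixed = 1.0 / (1.0 + 0.3)
# Each step now processes twice the samples of the 32-worker configuration.
speedup_vs_32 = (2.0 / t64_fixed) / (1.0 / t32)

print(f"throughput gain 32 -> 64 (before fix): {gain_32_to_64 * 100:.1f} %")
print(f"efficiency at 64 workers (after fix) : {eff_fixed * 100:.1f} %")
print(f"throughput vs 32 workers (after fix) : {speedup_vs_32:.2f}x")
```

Under these assumptions the model predicts a gain of about 12% from doubling the workers before the fix, in line with the roughly 10% the team observed, and a little over 2x the 32-worker throughput afterwards.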

5. Overlap, and Why "Exposed" Is the Word That Matters (Advanced)

Everything above hinges on $T_{\text{comm}}$ being the communication left on the critical path, not the raw transfer time. The reason is the overlap trick of Section 4.10: gradients become ready layer by layer during the backward pass, so a framework can start the all-reduce of the last layer's gradients while still computing the earlier layers' gradients. The network work then happens under the compute, and only the part that does not fit behind compute is exposed. If the raw all-reduce takes $120$ ms and overlap hides $70\%$ of it, the exposed term is $36$ ms, and that $36$ is what belongs in $R$, as Output 5.4.1 shows. Using the raw $120$ ms instead would give $R = 0.67$ and place the job in the third row of Table 5.4.1, where communication is eating into scaling, when the exposed ratio of $0.20$ puts it in the healthy band, and you would reach for the wrong lever.
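One way to turn a trace timeline into the exposed term is to subtract the compute intervals from the communication intervals. The sketch below assumes you have already extracted `(start, end)` times in milliseconds for compute kernels and for collectives. The function names and the example timeline are hypothetical, chosen so that the result reproduces the 120 ms raw and 36 ms exposed figures above.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exposed_comm(comm_intervals, compute_intervals):
    """Total communication time not covered by any compute interval."""
    compute = merge(compute_intervals)
    exposed = 0.0
    for c_start, c_end in merge(comm_intervals):
        cursor = c_start                      # first instant not yet accounted for
        for k_start, k_end in compute:
            if k_end <= cursor:
                continue                      # compute ended before this point
            if k_start >= c_end:
                break                         # compute starts after the collective
            if k_start > cursor:
                exposed += k_start - cursor   # gap with no compute running
            cursor = max(cursor, k_end)
            if cursor >= c_end:
                break
        if cursor < c_end:
            exposed += c_end - cursor         # tail after the last compute kernel
    return exposed


# Hypothetical timeline: compute runs 0-180 ms, the all-reduce runs 96-216 ms.
compute = [(0.0, 180.0)]
comm = [(96.0, 216.0)]
t_exposed = exposed_comm(comm, compute)
print(f"raw comm = 120.0 ms, exposed = {t_exposed:.1f} ms, R = {t_exposed / 180.0:.3f}")
```

The same function handles fragmented timelines, where many short collectives interleave with many kernels, which is the usual shape of a bucketed all-reduce.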

This is also why the ratio and the overlap trick must be read together. Overlap does not reduce the bytes moved, so it does not change the bandwidth term of the cost model; it changes only how much of that term lands on the critical path. There is a ceiling: once compute is fully packed with hidden communication, any further all-reduce time spills into the open and overlap can hide no more. Past that point the only remaining moves are the ones that change the terms themselves: compression, which cuts the bytes; a larger local batch, which raises compute per step and with it the room to hide communication; or synchronizing less often, which is where local-update methods enter. Measuring $R$ with the exposed term tells you exactly where on this progression a job sits.
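The ceiling can be written as a one-line model: whatever communication does not fit inside the window of compute available for hiding is exposed. The sketch below is a simplification under stated assumptions. The function name is ours and the 84 ms window is hypothetical; real frameworks cannot start communicating until the first gradients are ready, so the usable window is shorter than the full backward pass.

```python
def exposed_after_overlap(t_comm_raw, t_overlap_window):
    """Communication left on the critical path when at most t_overlap_window
    milliseconds of it can run concurrently with compute."""
    return max(0.0, t_comm_raw - t_overlap_window)


T_COMPUTE = 180.0   # ms of compute per step (same figure as Code 5.4.1)
WINDOW = 84.0       # ms of compute usable for hiding (hypothetical)

for t_raw in (60.0, 84.0, 120.0, 240.0):
    t_exp = exposed_after_overlap(t_raw, WINDOW)
    print(f"raw = {t_raw:5.1f} ms   exposed = {t_exp:5.1f} ms   "
          f"R = {t_exp / T_COMPUTE:.3f}")
```

Below the window the exposed ratio is zero; above it, every additional millisecond of all-reduce lands directly on the critical path.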

Research Frontier: Driving the Exposed Ratio Toward Zero (2024 to 2026)

Because the exposed communication term is what caps scaling, recent work attacks it from three directions at once. Local-update training in the DiLoCo lineage (Douillard et al., 2024) lets workers take many optimizer steps between syncs, slashing how often the all-reduce runs and pushing the effective ratio low enough for genuinely over-the-internet and geo-distributed training; the streaming variant (Douillard et al., 2025) overlaps even those infrequent syncs. On the compression side, methods descended from PowerSGD and the work on 1-bit and 4-bit optimizers cut the bytes per collective with little accuracy loss, shrinking $T_{\text{comm}}$ at the source. And compute-communication overlap is now scheduled at the kernel and compiler level (for example the fine-grained overlap in DeepSeek-V3's training stack, 2024 to 2025), so that more of the communication hides behind matrix multiplies. We give these methods the evaluation machinery they deserve in Chapter 10; the through-line is that the field now treats the exposed ratio as the primary quantity to engineer down, and reports it alongside accuracy.
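The effect of synchronizing less often can be sketched with a back-of-envelope amortization. This is a deliberately crude model and not an implementation of DiLoCo or any published method. It assumes each sync costs the same as one per-step all-reduce, that nothing is overlapped, and that local steps cost the same compute as synchronized ones. The function name is ours.

```python
def amortized_ratio(t_comp_ms, t_sync_ms, steps_between_syncs):
    """Effective communication-to-computation ratio when one sync of cost
    t_sync_ms is paid every `steps_between_syncs` local steps."""
    if steps_between_syncs < 1:
        raise ValueError("need at least one step between syncs")
    return t_sync_ms / (steps_between_syncs * t_comp_ms)


# Same illustrative step as Code 5.4.1: 180 ms compute, 120 ms raw all-reduce.
for h in (1, 10, 100):
    r = amortized_ratio(180.0, 120.0, h)
    print(f"sync every {h:3d} steps   effective R = {r:.4f}   "
          f"efficiency = {100.0 / (1.0 + r):.1f} %")
```

The model says nothing about the accuracy cost of fewer syncs, which is the question the cited papers actually address.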

Fun Note: The Office-Hours Test

A quick way to feel the ratio: imagine each worker is a student doing homework who must phone a study group to compare answers after every problem. If the problems take an hour and the calls take a minute, $R$ is tiny and adding students helps. If the problems take a minute and the calls take an hour, the group spends its night on the phone, and inviting more students just adds more callers to the line. Distributed training is the same study group, and overlap is doing the next problem while still on the call.

The ratio is the diagnostic; the next section turns the time you measure into money and energy. A worker that spends 40% of every step waiting on the network is not just slow, it is burning power and rented GPU-hours while idle, and Section 5.5 shows how to put a dollar and a joule figure on exactly that waste, closing the loop from "it does not scale" to "here is what the stall costs."

Exercise 5.4.1: From Curve to Cause (Conceptual)

A team reports a strong-scaling efficiency of 50% at 64 workers with a fixed global batch. Using $E = 1/(1+R)$, state the value of $R$ this implies. Then explain, in terms of $T_{\text{comp}}$ and $T_{\text{comm}}$, why holding the global batch fixed while quadrupling the workers from 16 to 64 is almost guaranteed to push $R$ up, and name which single lever from Table 5.4.1 would most directly restore compute per step without changing the gradient size.

Exercise 5.4.2: Measure the Levers (Coding)

Extend Code 5.4.1 into a function sweep(t_compute, t_comm_raw, overlap, compress) that returns $R$ and efficiency for any overlap fraction and any compression factor. Sweep overlap from 0 to 0.9 in steps of 0.1 at a fixed 4x compression, and plot or print efficiency against overlap. Identify the overlap fraction at which efficiency first exceeds 90%, and explain why, past full overlap of the compressible remainder, further overlap stops helping. Confirm your baseline matches Output 5.4.1.

Exercise 5.4.3: Predict the Wall (Analysis)

Take the ring all-reduce time $T_{\text{comm}} \approx 2(K-1)\alpha + 2\frac{K-1}{K}P\beta$ from Section 2 with $P = 10^9$ parameters at 2 bytes each, $\alpha = 5\ \mu s$, and $\beta$ corresponding to 50 GB/s. Assume a fixed global batch so that per-worker compute is $T_{\text{comp}} = 4000/K$ ms. Compute $R$ at $K = 8, 32, 128, 512$ workers, find the worker count where $R$ first exceeds 1, and discuss how 70% overlap shifts that crossover. Relate your crossover point to the threshold rows of Table 5.4.1.