Section 41.3: Building a Single-Machine Baseline

"A single process, honest about what it cannot hold. I am slow, I admit it. But every number you brag about later is measured against me, so choose your insults carefully."
A Baseline That Knows It Is the Denominator

Big Picture

Before you are allowed to distribute anything, you must build the smallest correct version that runs on one machine, measure it rigorously, and keep that measurement; every scale-out claim you make for the rest of the project is a ratio whose denominator is that single number. A speedup is meaningless without a $T_1$ to divide by, a correctness claim is meaningless without a ground-truth answer to check against, and a choice of distribution axis is guesswork without a profile of where the one-machine version actually spends its time. The baseline is not a warm-up exercise you skip when you are confident; it is the instrument that converts "the cluster version feels fast" into "the cluster version is $4.7\times$ faster at identical accuracy." This section builds that instrument, measures it on the same metric and configuration you will later compare against, and shows why a baseline you trust is the most valuable artifact in the whole capstone.

The previous section taught you to profile a workload and read off the axis along which it should be distributed. That profile presumes something concrete to profile: a working, end-to-end, single-machine implementation. This section builds it. The temptation, once you have settled on "this is a data-parallel training problem" or "this is a sharded-serving problem," is to start writing the distributed version immediately, because that is where the interesting engineering lives. Resisting that temptation for one more step, long enough to build and measure a serial baseline, is the single most important discipline in this book's evaluation philosophy, and it is the discipline that most capstone projects skip and most regret skipping.

The reason is arithmetic. Every quantity that makes a distributed system worth building is defined relative to the one-machine case. Speedup is the baseline time divided by the distributed time. Efficiency is that speedup divided by the machine count. "Scales well" means efficiency stays high as you add machines. None of these has a value, not even a wrong value, until you have measured the baseline. A capstone that reports "our system processes 10,000 documents per second" without a baseline has reported a number, not a result; the reviewer's first question, and yours, is "compared to what?"

1. Why the Baseline Is Non-Negotiable (Beginner)

The single-machine baseline does three distinct jobs, and it helps to keep them separate because a baseline that does one well can still fail at the others. The first job is to be the denominator of speedup. Section 3.1 defined speedup and efficiency as functions of one measured quantity, the time a job takes; the baseline is where that quantity is pinned down for the single-machine case. Without a measured $T_1$, the speedup $S_p = T_1 / T_p$ is neither small nor large; it is undefined, and any speedup figure you quote is fabricated.

The second job is to establish correctness ground truth. The distributed version is more complex, and complexity is where bugs hide: a shard boundary that drops the last batch, an all-reduce that averages with the wrong weights, a race that corrupts an accumulator. The serial baseline, being simple, is the version you can convince yourself is correct by reading it. Its output then becomes the reference that the distributed version must reproduce, exactly for deterministic computations like the gradient identity of Section 1.1, or within a stated tolerance for stochastic ones. A faster wrong answer is not a speedup; it is a regression you have not noticed yet.
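The reference check described above can be sketched in a few lines. This is a minimal illustration that assumes the results are plain numbers; the helper name `matches_reference` and the convention that `rel_tol=None` means "exact" are hypothetical choices, not something the chapter prescribes.

```python
import math

def matches_reference(reference, candidate, rel_tol=None):
    """Compare a distributed result against the serial ground truth.

    rel_tol=None demands exact equality (deterministic computations);
    a float rel_tol accepts agreement within that relative tolerance
    (stochastic computations). Both conventions are illustrative.
    """
    if rel_tol is None:
        return candidate == reference
    return math.isclose(candidate, reference, rel_tol=rel_tol)

# Deterministic case: an integer count must match exactly.
print(matches_reference(999621, 999621))
# Stochastic case: accept a value within 1% of the reference.
print(matches_reference(0.4120, 0.4135, rel_tol=0.01))
```

Whatever tolerance you choose, state it next to the result, so that "matches the baseline" has a checkable meaning.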

The third job is to reveal the real bottleneck, the input that Section 41.2 needs to pick a distribution axis. You cannot profile a system you have not built. The baseline is the thing the profiler attaches to, and its time breakdown (data loading versus compute versus serialization) is what tells you whether to distribute the data, the model, or the serving, rather than guessing from the architecture diagram.
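One lightweight way to get that stage-level breakdown is to wrap each stage in a timing context manager. The sketch below uses toy stand-ins for loading, compute, and serialization; the names `phase`, `phase_seconds`, and `run_baseline` are hypothetical, and a real baseline would substitute its own stages.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)   # accumulated wall-clock per phase

@contextmanager
def phase(name):
    """Add the wall-clock spent inside the `with` block to phase_seconds[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start

def run_baseline(n):
    # Toy stand-ins for the three stages named in the text.
    with phase("load"):
        items = list(range(n))
    with phase("compute"):
        scores = [(i * i) % 7 for i in items]
    with phase("serialize"):
        payload = ",".join(map(str, scores))
    return payload

def breakdown():
    """Return each phase's share of the total measured time."""
    total = sum(phase_seconds.values())
    return {name: secs / total for name, secs in phase_seconds.items()}

payload = run_baseline(200_000)
for name, share in breakdown().items():
    print(f"{name:<10s} {share:6.1%}")
```

The shares, not the absolute times, are what feed the choice of axis: a stage that takes a small fraction of the serial time cannot yield a large speedup however well it is distributed.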

Key Insight: You Cannot Claim a Speedup You Never Measured Against

Speedup, efficiency, and scalability are all ratios with $T_1$ in the denominator. A distributed system reported without a measured single-machine baseline has no speedup, only a throughput number floating free of any comparison. The baseline is not optional supporting evidence; it is the literal denominator of the project's central claim. Build it first, measure it carefully, and never overwrite the measurement, because every later number points back to it.

2. The Anatomy of a Trustworthy Baseline (Beginner)

A good baseline is the smallest correct end-to-end version of the system, and each word in that phrase is load-bearing. Smallest, because the baseline exists to be understood and trusted, not to be fast; every optimization you add to it is a place a bug can hide and a way the comparison can become unfair. Correct, because its whole value is as a reference, and a reference you do not trust is worse than none. End-to-end, because a baseline that measures only the inner loop hides the data loading, preprocessing, and output that often dominate the real cost and that the distributed version must also pay. Figure 41.3.1 shows the pipeline this section follows: build the serial version, instrument it, measure the four quantities that matter, and freeze the result as the reference point against which the distributed version is judged.

Figure 41.3.1: The baseline pipeline. A serial, end-to-end implementation is instrumented and measured on four quantities; the resulting wall-clock $T_1$ becomes the fixed reference point that every later speedup ratio divides by. The loop-back arrow stresses that the distributed system is judged against this measurement, not against a fresh, separately tuned single-machine run.

Four quantities are worth measuring, and you should measure all four because a distributed system can win on one while losing on another. Wall-clock time is the headline: how long the end-to-end job takes, which becomes $T_1$. Peak memory tells you which ceiling from Section 1.1 the single machine is closest to, and whether the binding constraint is time or space. Throughput, items processed per second, is the metric that scales most cleanly across machine counts and the one serving systems are usually judged by. Accuracy, or whatever quality metric the task defines, is the quantity the distributed version must hold fixed; a speedup bought by silently degrading the answer is not a speedup.

Thesis Thread: The Baseline Is Where Scale-Out Becomes Falsifiable

This book's thesis is that distributing the essential work across machines pays off, and the baseline is the only place that claim can be tested rather than asserted. Every chapter has argued that scale-out beats scale-up for a specific workload; the capstone is where you have to prove it with a number, and that number is a ratio against $T_1$. Without the baseline, "we scaled out" is a description of effort. With it, "we scaled out to five machines and achieved $S_5 = 4.7$ at $94\%$ efficiency on the identical metric" is a result. The denominator is what turns the thesis from a slogan into a measurement.

3. Instrumenting the Baseline (Intermediate)

Measuring a baseline sounds trivial but is full of traps, most of which inflate or deflate $T_1$ in ways that make the eventual speedup a fiction. The cardinal rule is that the instrument must not change what it measures. The clearest example: a memory tracer records every allocation, and that recording is itself work, so timing a run while the tracer is attached can inflate $T_1$ by an order of magnitude. The fix is to measure time and memory in separate passes, timing the clean run and tracing a second run, so that neither measurement taxes the other. Code 41.3.1 below does exactly this, and the difference is large enough to turn an honest sub-linear speedup into an apparently super-linear one if you get it wrong.
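To see the perturbation on your own machine, you can time the same function once without the tracer and once with it attached. The sketch below uses a small allocation-heavy toy workload; `allocate_heavy` and `clean_vs_traced` are hypothetical names, and the size of the inflation will vary with the workload and the Python version.

```python
import time
import tracemalloc

def allocate_heavy(n):
    """Allocation-heavy toy workload: builds many small strings."""
    return [str(i) for i in range(n)]

def timed(fn, *args):
    """Wall-clock seconds for one call of fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def clean_vs_traced(fn, *args):
    clean = timed(fn, *args)          # pass 1: no tracer attached; this is T1
    tracemalloc.start()
    try:
        traced = timed(fn, *args)     # pass 2: tracer running; never report this as T1
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return clean, traced, peak

clean, traced, peak = clean_vs_traced(allocate_heavy, 200_000)
print(f"clean  : {clean:.4f} s")
print(f"traced : {traced:.4f} s  ({traced / clean:.1f}x the clean time)")
print(f"peak   : {peak / 1024:.1f} KiB")
```

The traced pass is still useful, because it is where the peak-memory figure comes from; it is only its timing that must be discarded.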

Three Python tools cover almost every baseline. The time.perf_counter function provides a high-resolution monotonic clock for timing a region. The tracemalloc module reports peak memory by tracing allocations, in a pass you keep separate from the timed one. For finding the bottleneck rather than just the total, the cProfile module breaks the wall-clock down by function, which is the per-function view Section 41.2 reads to pick an axis; because the profiler itself slows the run, use it for the breakdown and never for $T_1$. The library shortcut below packages the timing piece so you stop hand-rolling it.

Library Shortcut: timeit and tracemalloc Replace Hand-Rolled Stopwatches

Hand-written timing with a single perf_counter pair is fine for a coarse $T_1$, but it measures one noisy run and ignores warm-up and garbage-collection effects. The standard library does the careful version for you. For a stable per-call time, timeit disables the garbage collector while it times, and timeit.repeat runs the measurement several times so that you can report the minimum, which is the convention for reproducible micro-measurements:

import timeit, tracemalloc

# Assumes baseline_serial and N are defined in this module (see Code 41.3.2).
# Stable wall-clock: best of 5 repeats of one call each; GC is disabled while timing.
secs = min(timeit.repeat("baseline_serial(N)", globals=globals(), repeat=5, number=1))

# Peak memory in a separate, bracketed pass; no manual high-water bookkeeping.
tracemalloc.start()
baseline_serial(N)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"T1 ~ {secs:.3f} s, peak {peak/1024:.1f} KiB")

Code 41.3.1: The timeit and tracemalloc pair. Roughly a dozen lines of manual best-of-N timing and high-water-mark tracking collapse to two calls; the library handles GC suppression, repetition, and the allocation high-water mark that a hand-rolled stopwatch silently gets wrong.
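The third tool, cProfile, is used the same way but answers a different question: not how long the run took, but where the time went. The sketch below profiles a toy three-stage pipeline; the functions `load`, `score_all`, `serialize`, and `toy_baseline` are hypothetical stand-ins for a real baseline's stages. Because the profiler slows the run, read only the relative breakdown from it.

```python
import cProfile
import io
import pstats

def load(n):
    return list(range(n))

def score_all(items):
    return [(i * 2654435761) & 0xFFFFFFFF for i in items]

def serialize(scores):
    return ",".join(map(str, scores))

def toy_baseline(n):
    return serialize(score_all(load(n)))

profiler = cProfile.Profile()
profiler.enable()
toy_baseline(200_000)
profiler.disable()

# Sort by cumulative time so the stages that dominate appear first.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(8)
print(buf.getvalue())
```

The cumulative-time column is the one to read for choosing an axis, since it attributes to each stage everything that stage calls.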

4. Strong-Scaling and Weak-Scaling Baselines Are Different (Intermediate)

There is not one baseline but two, and which one you build depends on the question your capstone answers. The distinction is the strong-versus-weak scaling split from Section 3.3, and it determines what you hold fixed when you measure. A strong-scaling study asks "can I solve the same problem faster with more machines?" Here the problem size is frozen, the baseline runs that fixed problem on one machine for time $T_1$, and the speedup is

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p \, T_p},$$

where $T_p$ is the time on $p$ machines, $S_p$ is the speedup, and $E_p$ is the efficiency: the fraction of each added machine that turned into real speed. Perfect strong scaling is $S_p = p$ and $E_p = 1$; the communication tax of Chapter 3 is what pulls both below their ideal.
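As a quick check of these two definitions, the helper below computes both from measured times. It is a minimal sketch; the function name `strong_scaling` is an illustrative choice, and the example numbers are the timings reported later in Output 41.3.2.

```python
def strong_scaling(t1, tp, p):
    """Return (speedup, efficiency) for a fixed problem run on p machines."""
    if t1 <= 0 or tp <= 0 or p < 1:
        raise ValueError("times must be positive and p must be at least 1")
    speedup = t1 / tp            # S_p = T_1 / T_p
    return speedup, speedup / p  # E_p = S_p / p

s, e = strong_scaling(t1=5.335, tp=2.931, p=2)
print(f"S_p = {s:.2f}, E_p = {e:.2f}")   # S_p = 1.82, E_p = 0.91
```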

A weak-scaling study asks a different question: "can I solve a proportionally bigger problem in the same time if I add machines in proportion?" Here the work per machine is frozen and the total problem grows with $p$. The baseline is one machine handling one unit of work in time $T_1$, and the relevant quantity is weak-scaling efficiency,

$$E^{\text{weak}}_p = \frac{T_1}{T_p},$$

where now $T_p$ is the time for $p$ machines to handle $p$ units of work. Ideal weak scaling keeps $T_p = T_1$, so $E^{\text{weak}}_p = 1$ means a $p$-times-bigger job finished in the same wall-clock. The two studies need two different baselines: a fixed-problem $T_1$ for strong scaling, a fixed-work-per-machine $T_1$ for weak scaling. Building the wrong one quietly invalidates every ratio you compute, because the denominator answers a question you were not asking.
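The weak-scaling ratio can be computed the same way, with the important difference that each $T_p$ comes from a run whose total work grew with $p$. The timings in this sketch are hypothetical placeholders chosen only to illustrate the arithmetic; they are not measurements from this chapter.

```python
def weak_scaling_efficiency(t1, tp):
    """E_weak = T_1 / T_p, where T_p is the time for p machines on p units of work."""
    if t1 <= 0 or tp <= 0:
        raise ValueError("times must be positive")
    return t1 / tp

# Hypothetical wall-clock seconds for p machines handling p units of work.
hypothetical_runs = {1: 60.0, 2: 62.0, 4: 66.0, 8: 75.0}

t1 = hypothetical_runs[1]   # the weak-scaling baseline: one machine, one unit
for p, tp in hypothetical_runs.items():
    e_weak = weak_scaling_efficiency(t1, tp)
    print(f"p={p}: total work = {p} units, T_p = {tp:.1f} s, E_weak = {e_weak:.2f}")
```

Note that no division by $p$ appears: the growth in problem size already accounts for the added machines.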

Key Insight: Measure the Baseline on the Exact Metric and Config You Will Compare Against

The speedup is only meaningful if $T_1$ and $T_p$ are construct-matched: the same task, the same quality metric, the same data, the same hardware class, computed in one consistent measurement, not stitched together from two separately tuned runs. A baseline timed on a smaller dataset, or measured at a different accuracy target, or run with a different batch size, produces a denominator that does not correspond to the numerator, and the resulting "speedup" is a comparison of two different things. Co-compute the matched pair on one configuration and save it as one artifact; a number-by-number audit of your capstone passes only when each ratio's top and bottom came from the same measured ground.
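One way to enforce this matching mechanically is to refuse to compute a ratio unless the two runs declare the same configuration. The sketch below is an assumption about how such a guard might look: the record layout, the field list in `MATCH_KEYS`, and the name `checked_speedup` are all hypothetical, and the 0.5 s timing for the mismatched run is invented for illustration.

```python
MATCH_KEYS = ("task", "dataset", "metric", "quality_target",
              "batch_size", "hardware_class")

def checked_speedup(baseline, distributed):
    """Return T_1 / T_p, but only if both runs agree on every field in MATCH_KEYS."""
    differing = [k for k in MATCH_KEYS
                 if baseline["config"].get(k) != distributed["config"].get(k)]
    if differing:
        raise ValueError(f"runs are not construct-matched; differing fields: {differing}")
    return baseline["seconds"] / distributed["seconds"]

config = {"task": "toy-scoring", "dataset": "range(2_000_000)", "metric": "count",
          "quality_target": "exact", "batch_size": None, "hardware_class": "same-host"}
baseline = {"config": config, "seconds": 5.335}
distributed = {"config": dict(config), "seconds": 2.931}
print(f"speedup: {checked_speedup(baseline, distributed):.2f}x")

# A baseline timed on a smaller dataset is rejected instead of silently divided.
smaller = {"config": {**config, "dataset": "range(200_000)"}, "seconds": 0.5}
try:
    checked_speedup(smaller, distributed)
except ValueError as err:
    print(err)
```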

5. A Baseline, Measured, and a First Speedup (Intermediate)

Code 41.3.2 makes the whole discipline concrete on a deliberately small workload: scoring two million items with a fixed, deterministic, CPU-bound function. It builds the serial baseline as the smallest correct end-to-end version, times it cleanly to get $T_1$, measures peak memory in a separate pass so the tracer never taxes the timed run, records the result as correctness ground truth, then runs a two-worker version on the identical workload and metric and reports the strong-scaling speedup and efficiency against the measured baseline. The worker code and the baseline code share the same inner function, so the comparison is construct-matched by construction.

import time, tracemalloc
from multiprocessing import Pool

N = 2_000_000

def score(i):                       # a fixed, deterministic per-item cost
    x = (i * 2654435761) & 0xFFFFFFFF
    for _ in range(8):
        x ^= x >> 13
        x = (x * 1274126177) & 0xFFFFFFFF
        x ^= (x << 7) & 0xFFFFFFFF   # explicit parentheses; << already binds tighter than &
    return x & 1

def chunked(n, parts):              # split [0, n) into 'parts' contiguous ranges
    step = -(-n // parts)
    return [range(k, min(k + step, n)) for k in range(0, n, step)]

def worker_sum(rng):                # the SAME inner loop the baseline runs
    return sum(score(i) for i in rng)

def baseline_serial(n):             # smallest correct end-to-end version
    return sum(worker_sum(rng) for rng in chunked(n, 1))

if __name__ == "__main__":
    # T1: time the CLEAN run, with no tracer attached to slow it down.
    t0 = time.perf_counter()
    ground_truth = baseline_serial(N)
    t1 = time.perf_counter() - t0

    # Peak memory: a SEPARATE pass, so tracing never taxes the timed run.
    tracemalloc.start()
    baseline_serial(N)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"baseline T1 wall-clock     : {t1:.3f} s")
    print(f"baseline peak memory       : {peak/1024:.1f} KiB")
    print(f"baseline throughput        : {N / t1:,.0f} items/s")
    print(f"correctness ground truth   : {ground_truth}")

    # Strong scaling: SAME fixed problem, now on p workers, SAME metric.
    P = 2
    t0 = time.perf_counter()
    with Pool(P) as pool:
        parts = pool.map(worker_sum, chunked(N, P))
    tp = time.perf_counter() - t0

    S = t1 / tp                     # speedup relative to the MEASURED T1
    print(f"\nworkers p                  : {P}")
    print(f"parallel Tp wall-clock     : {tp:.3f} s")
    print(f"result matches ground truth: {sum(parts) == ground_truth}")
    print(f"speedup  S = T1/Tp         : {S:.2f}x")
    print(f"efficiency E = S/p         : {S / P:.2f}")

Code 41.3.2: A complete baseline-and-first-speedup harness. The baseline and the workers call the identical worker_sum, so $T_1$ and $T_p$ measure the same computation; memory is traced in a separate pass; the two-worker speedup is reported as a ratio against the measured $T_1$, not against a re-tuned run.

baseline T1 wall-clock     : 5.335 s
baseline peak memory       : 0.7 KiB
baseline throughput        : 374,881 items/s
correctness ground truth   : 999621

workers p                  : 2
parallel Tp wall-clock     : 2.931 s
result matches ground truth: True
speedup  S = T1/Tp         : 1.82x
efficiency E = S/p         : 0.91

Output 41.3.2: The measured baseline and a real first speedup. The two-worker version reproduces the ground-truth answer exactly and runs $1.82\times$ faster than the measured $T_1$, an efficiency of $0.91$: sub-linear, because process spawn and inter-process transfer eat part of the second worker, exactly the communication tax Chapter 3 predicts. The $5.335$ s here is the clean time; tracing it in the same pass inflated it by roughly $9\times$ before the measurement was separated.

Two lessons survive the toy scale. First, the efficiency is $0.91$, not $1.0$, because even two local processes pay a coordination cost; this is the communication tax made visible at the smallest possible scale, and it only grows with $p$. Second, and more important for the capstone, the speedup of $1.82\times$ exists only because there is a measured $T_1$ of $5.335$ seconds to divide by. Had the baseline been timed with the tracer still running, $T_1$ would have read roughly $9\times$ larger, and the same two-worker run would have reported an absurd, fabricated speedup. The denominator is not a formality; it is the result.
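To put a rough number on that fabrication, using only the figures in Output 41.3.2 and the approximate $9\times$ inflation quoted there:

$$T_1^{\text{traced}} \approx 9 \times 5.335\ \text{s} \approx 48\ \text{s}, \qquad S^{\text{fabricated}} \approx \frac{48}{2.931} \approx 16.4, \qquad E^{\text{fabricated}} \approx \frac{16.4}{2} \approx 8.2.$$

An efficiency far above $1$ on two workers is the giveaway: when a ratio claims more than ideal scaling by that margin, suspect the denominator before celebrating the numerator.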

Practical Example: The Speedup That Evaporated Under a Fair Baseline

Who: A graduate student presenting a distributed document-classification capstone.

Situation: The eight-worker Spark pipeline processed a corpus in nine minutes, and the draft claimed a "roughly $40\times$ speedup over single-machine."

Problem: The $40\times$ came from a single-machine baseline that loaded the data from a slow network mount, re-parsed JSON on every record, and ran an unvectorized Python loop, none of which the Spark version did.

Dilemma: Keep the flattering number from a deliberately weak baseline, or rebuild the baseline to match the distributed version's data path and per-item work and report whatever honest ratio survived.

Decision: They rebuilt the baseline construct-matched, same local data, same parser, same feature code, then measured $T_1$ on the identical accuracy target.

How: One clean serial pass for $T_1$, a separate tracemalloc pass for memory, and a cProfile run that revealed the real serial bottleneck was tokenization, not I/O.

Result: The honest speedup was $5.3\times$ at $66\%$ efficiency, and the profile pointed at the axis to push next; the $40\times$ had been measuring a bad baseline against a good cluster.

Lesson: An impressive speedup over a weak baseline is a measurement of the baseline's weakness. Match the baseline to the distributed version, then report what is left.

6. Freezing the Baseline as the Project's Reference (Advanced)

The baseline's value compounds only if you preserve it. Save $T_1$, the peak memory, the throughput, the accuracy, the exact dataset, the hardware, and the metric definition as one artifact, committed alongside the code, so that every later speedup ratio in the capstone can be traced to the same measured ground. This is the construct-matched, co-computed discipline applied across the life of the project: the metrics that Section 41.6 computes for the finished distributed system are all ratios against this frozen baseline, and they are only auditable because the denominator was fixed once and never quietly re-measured. When the evaluation methodology of Chapter 5 asks you to defend a number, the frozen baseline is the evidence.
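A frozen baseline can be as simple as one JSON file written exactly once. The sketch below is one possible layout, not a prescribed format: the function `freeze_baseline`, the field names, and the dataset and metric descriptions are hypothetical, while the numeric values are taken from Output 41.3.2. Opening the file in exclusive-creation mode (`"x"`) makes an accidental re-measurement fail loudly instead of overwriting the reference.

```python
import json
import os
import platform
import tempfile

def freeze_baseline(path, *, t1_seconds, peak_kib, items, ground_truth, dataset, metric):
    """Write the baseline measurements to `path` once; refuse to overwrite."""
    record = {
        "t1_seconds": t1_seconds,
        "peak_kib": peak_kib,
        "items": items,
        "throughput_items_per_s": items / t1_seconds,
        "ground_truth": ground_truth,
        "dataset": dataset,
        "metric": metric,
        "hardware": {"machine": platform.machine(), "cpu_count": os.cpu_count()},
        "python": platform.python_version(),
    }
    # Mode "x" raises FileExistsError if the artifact already exists.
    with open(path, "x", encoding="utf-8") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return record

# A temporary directory stands in for the project repository here.
artifact = os.path.join(tempfile.mkdtemp(), "baseline.json")
record = freeze_baseline(
    artifact,
    t1_seconds=5.335, peak_kib=0.7, items=2_000_000, ground_truth=999621,
    dataset="integers 0..N-1 (synthetic)",
    metric="count of items with score(i) == 1",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

In a real project the file would be committed next to the code, and any deliberate re-measurement would go into a new, separately named artifact so that the history of denominators stays visible.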

Research Frontier: Honest Baselines and the Reproducibility Push (2024 to 2026)

The systems-ML community has converged on the view that weak baselines are the most common source of overstated speedups, and recent work tries to make the baseline a first-class, checked artifact. MLPerf's training and inference suites (Mattson et al.; Reddi et al.) fix the task, the quality target, and the reference implementation precisely so that a reported speedup divides matched numerators and denominators; a result that does not hit the quality bar is simply not a valid entry. Reproducibility efforts around artifact evaluation at MLSys and NeurIPS now ask authors to ship the single-machine reference alongside the distributed code, and tooling such as scalene and py-spy has pushed low-overhead profiling toward the point where the instrument genuinely does not perturb $T_1$. The frontier question is automating the fairness check itself: detecting when a quoted baseline differs in data path, precision, or quality target from the system it is compared against, before the speedup is believed. We connect these evaluation standards to the capstone's own metrics in Section 41.6.

Fun Note: The Baseline Always Wins the Argument

There is a recurring moment in capstone reviews where a student insists the cluster version is obviously faster, gestures at the architecture diagram, and quotes a throughput with no comparison. The baseline does not argue back. It simply sits there as a number, and the moment someone divides by it, the conversation ends one way or the other. A serial loop you can read in thirty seconds has settled more performance disputes than any benchmark slide deck, precisely because nobody can claim it is cheating.

You now have the discipline that the rest of the capstone depends on: build the smallest correct single-machine version, instrument it without perturbing it, measure wall-clock, memory, throughput, and accuracy on the exact configuration you will compare against, choose the strong or weak baseline that matches your question, and freeze the result as the denominator of every speedup you will report. The profile from Section 41.2 told you which axis to distribute; the baseline you just built tells you whether the distribution paid off. Section 41.4 turns to designing the distributed version itself, the numerator that this denominator will judge.

Exercise 41.3.1: One Denominator, Two Questions (Conceptual)

A team reports "$8\times$ speedup on 16 machines." Without more information, explain why you cannot tell whether this is a strong-scaling or a weak-scaling result, and what the baseline $T_1$ would have to measure in each case. Then state, for each interpretation, whether $8\times$ on 16 machines is good or bad news, and what efficiency value it implies under the definitions in subsection 4 of this section. Conclude with the one sentence the team must add to their report to make the number interpretable.

Exercise 41.3.2: Make the Instrument Lie, Then Fix It (Coding)

Take Code 41.3.2 and deliberately break the instrumentation: time the baseline with tracemalloc still running inside the timed region, the way a careless harness would. Record the inflated $T_1$, then recompute the two-worker speedup against it and observe the fabricated value. Now restore the separated passes and report the honest speedup. Quantify, as a ratio, how much the tracer inflated $T_1$ on your machine, and write two sentences explaining why "the instrument must not change what it measures" is not optional advice but a correctness requirement for any speedup claim.

Exercise 41.3.3: Profile to an Axis (Analysis)

Extend Code 41.3.2 with a realistic end-to-end step: before scoring, load the $N$ items from a file you generate on disk, and after scoring, write the per-item results back out. Run cProfile on the full serial baseline and report the wall-clock fraction spent in loading, scoring, and writing. Based on the profile, argue which distribution axis from Section 41.2 you would push first, and predict, using the strong-scaling efficiency definition from subsection 4 of this section, an upper bound on the speedup that distributing only the scoring step could achieve while the I/O stays serial. Connect your bound to Amdahl's law from Chapter 3.