Section 3.2: Horizontal and Vertical Scaling

"They asked me to handle twice the traffic. I requested a twin. They sent me a bigger version of myself instead, and charged four times the rent."
A Single Node Negotiating Its Renewal

Big Picture

There are exactly two ways to give a system more capacity: make one machine bigger (scale up, vertical) or add more machines (scale out, horizontal), and the choice is governed by two curves that cross. Vertical scaling is simple and keeps everything in one address space, but its price climbs superlinearly as you reach the top of the product line and then stops entirely at a hard ceiling: the largest machine that exists. Horizontal scaling has a near-linear price and no ceiling worth naming, but it charges a communication tax that grows with the number of nodes and sits on top of a fixed coordination cost. Section 1.3 made this distinction conceptually; this section makes it numeric. We build a cost-and-throughput model, find the exact target at which scaling out becomes the cheaper way to hit a throughput goal, and show why real systems scale each node up to a sweet spot and only then scale out.

In Section 3.1 we defined what it means to scale at all: to add resources and get more useful work out, ideally in proportion. We left open the most basic engineering question that follows from that definition, namely where the extra resources come from. A bigger single machine and a larger fleet of ordinary machines both add resources, but they add them along different axes, with different price curves, different limits, and different failure behavior. Treating them as interchangeable is the first mistake a capacity plan can make, and it is a mistake you can avoid with a small amount of arithmetic.

This section is deliberately quantitative. The conceptual version of vertical versus horizontal lives in Section 1.3, where the point was to define the two directions and explain why this book leads with scale-out. Here the point is sharper: given a throughput target and a price list, which direction is cheaper, and at what target does the answer flip? We will end with a runnable model that computes that flip, the crossover point, from a handful of parameters you can read off a cloud provider's pricing page.

1. Two Directions, Two Price Curves (Beginner)

Vertical scaling, or scaling up, replaces one machine with a more capable one: more cores, more memory, a faster accelerator, a faster interconnect inside the box. Nothing about the program's structure changes, because there is still one machine and one address space; the same single-process code runs faster or holds more. This is the path of least resistance, and for a workload that comfortably fits on an affordable machine it is usually the right one. Its trouble is entirely on the price and availability side.

Horizontal scaling, or scaling out, keeps each machine the same and adds more of them, dividing the work across the fleet. The price of the fleet grows in proportion to its size, which is the good news, but the work now has to be partitioned, the partial results have to be combined, and the machines have to coordinate, all of which cost communication that did not exist when everything lived in one process. That communication is the tax this whole book is about, introduced as the seed of distributed training in Section 1.1 and modeled formally in the alpha-beta cost analysis of Section 3.8.

Figure 3.2.1: The two cost curves that govern the choice. Vertical scaling (orange) is cheapest at small targets but bends upward superlinearly and ends at a hard wall, the largest single machine that can be bought. Horizontal scaling (green) starts more expensive because of the fixed coordination cost and the communication tax, yet rises only near-linearly and continues past the wall. Their intersection is the crossover point: to the left, scale up; to the right, scale out. The demo in Section 5 computes this crossover for concrete numbers.

The shapes in Figure 3.2.1 are the whole story in miniature, and the rest of this section is an effort to put numbers on them. Two facts make the picture lopsided in a way that favors scaling out at the high end. First, the vertical curve does not merely get expensive; it terminates, because at some point you have specified a machine that no vendor sells. Second, the horizontal curve, for all its overhead, keeps going. The interesting region is where they cross, and the engineering judgment lives in knowing which side of the crossover your workload sits on.

2. The Cost of Going Up: Superlinear Price and a Hard Ceiling (Intermediate)

Through the low and middle tiers of a product line, price tracks capability roughly linearly: a machine with twice the cores often costs about twice as much. The trouble begins at the top of the range, where you pay a premium for the engineering required to put many cores, vast memory, and the fastest interconnect into one coherent box. Let $s$ be the size multiplier of a single machine relative to a commodity baseline, so $s = 1$ is one ordinary node and $s = 4$ is a node with four times the throughput. A useful model for the price of that bigger node, one that smooths the knee into a single curvature, is superlinear,

$$C_{\text{up}}(s) = C_1 \, s^{\,1 + \alpha}, \qquad \alpha > 0,$$

where $C_1$ is the price of the baseline node and $\alpha$ is a curvature that captures how much the top of the range overcharges relative to perfect proportionality. With $\alpha = 0$ price would be exactly proportional to size; a realistic $\alpha$ in the range of $0.5$ to $1.0$ means that doubling the size of a single node multiplies its price by between $2^{1.5} \approx 2.8$ and $2^{2} = 4$. That premium buys real value, namely a single address space with no communication tax, which is exactly why the smallest targets in Figure 3.2.1 favor scaling up.
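
To make the curvature tangible, here is a minimal sketch that evaluates the price ratio implied by $C_{\text{up}}(s) = C_1 s^{1+\alpha}$ when a node doubles in size. The helper name `price_multiplier` and the sampled values of $\alpha$ are our own illustrative choices, not figures from any price list.

```python
def price_multiplier(size_ratio: float, alpha: float) -> float:
    """Factor by which C_up grows when node size grows by `size_ratio`.

    Follows C_up(s) = C1 * s**(1 + alpha); C1 cancels in the ratio.
    """
    return size_ratio ** (1.0 + alpha)

# Illustrative curvatures: 0.0 is perfect proportionality, the rest span
# the 0.5 to 1.0 range discussed in the text.
for alpha in (0.0, 0.5, 0.9, 1.0):
    factor = price_multiplier(2.0, alpha)
    print(f"alpha = {alpha:.1f}: doubling the node multiplies its price by {factor:.2f}")
```

Because $C_1$ cancels, the multiplier depends only on the size ratio and $\alpha$, so the same helper can be applied to any two adjacent sizes on a real price list.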

The decisive feature is not the premium but the ceiling. There is a largest machine that exists, call its size $s_{\max}$, and the throughput it delivers, $s_{\max} \, T_1$, is the most a single node can ever do for this workload. Past that point the vertical price is not high; it is undefined, because you are specifying hardware nobody builds. Memory is usually the first wall to be hit in AI workloads: a model plus its optimizer state and activations can exceed the memory of even the largest accelerator, which is precisely the pressure that forces model sharding in Chapter 16. Throughput is the second wall, the one this section's model focuses on.

Key Insight: Vertical Scaling Fails Two Ways, and Only One Is About Money

Scaling up has a price problem and an existence problem, and they are different. The price problem is the superlinear term $s^{1+\alpha}$: the bigger node costs more than its share of capability, so beyond a sweet size it is simply a bad deal. The existence problem is the ceiling $s_{\max}$: above it there is no node to buy at any price. You can argue with a price; you cannot argue with a ceiling. The moment a workload's requirement exceeds $s_{\max} \, T_1$ in throughput or the largest accelerator's memory, scaling up is off the table and the only remaining direction is out.

3. The Cost of Going Out: Near-Linear Price, a Communication Tax, and a Floor (Intermediate)

Scaling out inverts both of vertical scaling's weaknesses. The price of $n$ identical commodity nodes is $n \, C_1$, linear by construction, plus a fixed coordination cost $F$ for the load balancer, scheduler, or service mesh that ties the fleet together. There is no superlinear premium and no ceiling: you can keep adding nodes far beyond the largest single machine. What you cannot escape is that $n$ nodes do not deliver $n$ times the throughput, because partitioning, combining, and coordinating cost communication, and that cost grows with $n$. We model the effective throughput of a cluster of $n$ nodes as

$$T_{\text{eff}}(n) = \frac{n \, T_1}{1 + \tau \,(n - 1)},$$

where $T_1$ is one node's throughput and $\tau \ge 0$ is the communication tax: the fractional overhead each node incurs for every other node it must coordinate with. When $\tau = 0$ the cluster is perfectly linear, $T_{\text{eff}} = n \, T_1$. For $\tau > 0$ the per-node throughput $T_{\text{eff}}(n)/n$ decays as the cluster grows, so reaching a target $T^\star$ requires more than the naive $T^\star / T_1$ nodes. This functional form has the same shape as Amdahl's law, which Section 3.5 derives from a serial fraction, with $\tau$ playing the role of that fraction; here the diminishing return arises from coordination rather than from a serial section.
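
A few evaluations of the formula show how quickly the tax compounds. The sketch below is only an illustration of the model above: the function name `effective_throughput` and the values $T_1 = 1000$ req/s and $\tau = 0.03$ are assumptions chosen for the example.

```python
def effective_throughput(n: int, t1: float, tau: float) -> float:
    """T_eff(n) = n * T1 / (1 + tau * (n - 1))."""
    return n * t1 / (1.0 + tau * (n - 1))

T1 = 1000.0   # assumed single-node throughput, req/s
TAU = 0.03    # assumed communication tax

for n in (1, 2, 4, 8, 16, 32, 64):
    eff = effective_throughput(n, T1, TAU)
    print(f"n = {n:>2}: T_eff = {eff:8.0f} req/s, per-node efficiency = {eff / (n * T1):.0%}")

# As n grows, T_eff approaches T1 / tau from below.
print(f"saturation limit T1 / tau = {T1 / TAU:.0f} req/s")
```

The last line is worth noting: dividing numerator and denominator by $n$ shows that $T_{\text{eff}}(n)$ approaches $T_1/\tau$ as $n$ grows, so under this model a fleet with $\tau > 0$ has a throughput limit of its own, just a much more distant one than a single machine's.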

Inverting the throughput model gives the node count needed for a target, and the price follows directly. The total horizontal price to reach throughput $T^\star$ is

$$C_{\text{out}}(T^\star) = n(T^\star)\, C_1 + F, \qquad n(T^\star) = \min\{\, n : T_{\text{eff}}(n) \ge T^\star \,\}.$$
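
The inversion can also be written in closed form. The following is our own algebra on the throughput model above, valid only for targets below $T_1/\tau$ (so that $T_1 - \tau\,T^\star > 0$):

$$\frac{n\,T_1}{1 + \tau\,(n-1)} \ge T^\star \;\Longleftrightarrow\; n\,(T_1 - \tau\,T^\star) \ge (1-\tau)\,T^\star \;\Longleftrightarrow\; n \ge \frac{(1-\tau)\,T^\star}{T_1 - \tau\,T^\star},$$

so that

$$n(T^\star) = \left\lceil \frac{(1-\tau)\,T^\star}{T_1 - \tau\,T^\star} \right\rceil \qquad \text{for } 0 < T^\star < T_1/\tau .$$

As a check with illustrative values $T_1 = 1000$, $\tau = 0.03$, and $T^\star = 8000$: the bound is $7760 / 760 \approx 10.2$, so eleven nodes are needed where the naive count $T^\star / T_1$ would say eight.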

Two numbers shape this curve. The floor $F$ lifts the whole horizontal line upward, which is why scaling out is rarely cheapest for tiny targets: you pay for a coordinator even to run a handful of nodes. The tax $\tau$ tilts the line, making each additional unit of throughput cost more than the last as coordination overhead accumulates. Neither produces a hard wall of the vertical kind: you can always add another node, so the line in Figure 3.2.1 keeps climbing to the right while the vertical curve has already hit its wall. The tax does eventually bite, since under this model $T_{\text{eff}}(n)$ can never exceed $T_1/\tau$, but for small $\tau$ that limit lies far to the right of the single-machine ceiling.

Fun Note: The Tax Nobody Itemizes

A cloud invoice lists the nodes you rented, never the communication tax you paid. It hides inside wall-clock time: the cluster that should have done $10\times$ the work does $7\times$, and the missing $3\times$ quietly shows up as a longer, pricier run. The tax is real money; it just refuses to appear as a line item, which is exactly why you have to model it before you provision rather than discover it on the bill.

4. Where Each Wins, and Why Real Systems Do Both (Intermediate)

Putting the two curves together, the rule is short. For small throughput targets, scale up: one modest node has no communication tax and no coordination floor, so it undercuts a cluster that must pay both. For large targets, scale out: the vertical curve has either grown superlinearly expensive or run into its ceiling, while the horizontal curve, taxed but unbounded, keeps pace. The crossover point is the throughput at which the two prices are equal, and it is the single most useful number a capacity plan can compute, because it converts a vague architectural preference into a threshold you compare your actual target against.

Real systems do not pick one direction and commit to it; they compose the two in a specific order. The recipe that production teams converge on is to scale each node up to a sweet spot, the size just before the superlinear premium turns a bigger box into a bad deal, and then to scale out across many such right-sized nodes. This makes each unit of the fleet as capable as it can be without overpaying, which keeps the node count $n$ small, which in turn keeps the communication tax $\tau\,(n-1)$ small, since the tax grows with the number of nodes. Scaling up first is therefore not a rival to scaling out; it is the move that makes scaling out cheaper. Choosing that per-node sweet spot is the subject of per-node efficiency in Chapter 22, and packing the right-sized nodes onto real hardware is the job of cluster scheduling in Chapter 33.
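
The recipe amounts to a search over a small grid of node size and node count. The sketch below illustrates that search under loud assumptions: the `CATALOG` price list is hypothetical (invented to be roughly proportional up to size 4 and steeply priced at size 8), the throughput model and the values of `T1`, `TAU`, and `FLOOR` are the illustrative ones used elsewhere in this section, and we assume a single node needs no coordinator.

```python
T1, TAU, FLOOR = 1000.0, 0.03, 1.5   # illustrative baseline values

# HYPOTHETICAL price list (size multiplier -> $/hour), invented for illustration:
# roughly proportional up to size 4, then a steep premium for the top size.
CATALOG = {1: 2.0, 2: 4.0, 4: 8.4, 8: 24.0}

def nodes_needed(target: float, size: int) -> int:
    """Smallest n whose taxed throughput meets `target`, using nodes of `size`."""
    if target >= size * T1 / TAU:
        raise ValueError("target is beyond the model's saturation limit")
    n = 1
    while n * size * T1 / (1.0 + TAU * (n - 1)) < target:
        n += 1
    return n

def plans(target: float):
    """Yield (total $/hour, node size, node count) for every catalog size."""
    for size, price in CATALOG.items():
        n = nodes_needed(target, size)
        floor = FLOOR if n > 1 else 0.0   # assumption: one node needs no coordinator
        yield (n * price + floor, size, n)

TARGET = 16000.0
for total, size, n in sorted(plans(TARGET)):
    print(f"size {size}x: {n:>2} nodes, {total:6.2f} $/hour")

best_total, best_size, best_n = min(plans(TARGET))
print(f"cheapest plan: {best_n} nodes of size {best_size}x")
```

With these made-up prices the size-4 node wins at 16000 req/s: the smallest node needs so many replicas that the tax inflates the fleet, while the largest node pays too steep a premium. A different price list can move the sweet spot, which is why the search is worth rerunning rather than memorizing.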

Practical Example: The Embedding Service That Stopped Buying Bigger Boxes

Who: A platform engineer running a text-embedding service behind a product search bar.

Situation: Query volume had tripled in two quarters, and the service, running on a single large GPU instance, was saturating during peak hours and shedding requests.

Problem: The team's habit was to move up one instance size each time load grew, and the next size up cost roughly $2.4\times$ the current one for about $1.8\times$ the throughput, a clearly superlinear deal.

Dilemma: Keep climbing the vertical curve toward the largest GPU instance the provider offered, simple but increasingly overpriced and only one or two sizes from the ceiling, or re-architect the stateless service to run as many smaller replicas behind a load balancer.

Decision: They scaled out, because the service was stateless and embarrassingly parallel, so its communication tax $\tau$ was tiny: replicas shared nothing but a load balancer and a model cache.

How: They right-sized to a mid-tier GPU instance (the sweet spot before the price premium), containerized the service, and ran twelve replicas behind an autoscaler keyed to queue depth.

Result: At the new peak target the fleet cost roughly a third of what the equivalent single large instance would have, matching the crossover the model in Section 5 predicts, and capacity now grew by adding replicas rather than by waiting for a bigger instance type to exist.

Lesson: When the workload is stateless and the tax is low, the crossover arrives early, and climbing the vertical curve past it is paying a premium for a convenience you no longer need.
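
The quote in this example can be turned into a curvature estimate. As a rough sketch, assuming the power-law price model of Section 2 applies between the two adjacent instance sizes, the helper below (`implied_alpha` is a name introduced here) solves $2.4 = 1.8^{\,1+\alpha}$ for $\alpha$.

```python
import math

def implied_alpha(price_ratio: float, throughput_ratio: float) -> float:
    """Solve price_ratio = throughput_ratio ** (1 + alpha) for alpha."""
    return math.log(price_ratio) / math.log(throughput_ratio) - 1.0

alpha = implied_alpha(price_ratio=2.4, throughput_ratio=1.8)
premium = 2.4 / 1.8   # price paid per unit of throughput, relative to the current size

print(f"implied alpha              : {alpha:.2f}")
print(f"cost per request, next size: {premium:.2f}x the current size")
```

For this quote the implied curvature comes out near $0.49$, at the lower edge of the $0.5$ to $1.0$ range from Section 2, and each request served on the bigger instance costs about a third more than on the current one.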

5. Computing the Crossover (Intermediate)

The model so far has three moving parts: a superlinear vertical price with a hard ceiling, a near-linear horizontal price with a floor, and a communication tax that erodes cluster throughput. The code below assembles all three and sweeps a range of throughput targets, reporting the cheaper direction at each target and locating the crossover where the answer flips from vertical to horizontal. Every parameter is something you can estimate from a price list and a small benchmark: the baseline node's throughput and price, the largest size sold, the price curvature, the coordination floor, and the measured per-node efficiency loss.

import math

# ---- Per-node baseline (one commodity node) ----------------------------------
T1    = 1000.0   # throughput of one commodity node, requests/sec
C1    = 2.0      # price of one commodity node, $/hour
TAU   = 0.03     # communication tax: per-node efficiency lost per added node
FLOOR = 1.5      # fixed coordination cost (scheduler, load balancer), $/hour

# ---- Vertical: a bigger single node ------------------------------------------
ALPHA = 0.9      # superlinear price curvature (0 = perfectly proportional)
SMAX  = 8.0      # largest single node sold: 8x base throughput, then the wall

def vertical_price(target):
    """$/hour to hit `target` req/s on ONE bigger node, or inf past the ceiling."""
    s = target / T1                       # size multiplier needed
    if s <= 1.0:
        return C1
    if s > SMAX:
        return math.inf                   # the wall: no single node is this big
    return C1 * s ** (1.0 + ALPHA)        # superlinear price

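def horizontal_nodes(target):
    """Smallest node count whose taxed throughput meets `target`."""
    # Under this model cluster throughput saturates at T1 / TAU as n grows,
    # so a target at or above that limit would make the search loop forever.
    if TAU > 0 and target >= T1 / TAU:
        raise ValueError(f"target {target} req/s is unreachable: model saturates at {T1 / TAU:.0f} req/s")
    n = 1
    while True:
        eff = n * T1 / (1.0 + TAU * (n - 1))   # taxed cluster throughput
        if eff >= target:
            return n
        n += 1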

def horizontal_price(target):
    return horizontal_nodes(target) * C1 + FLOOR   # near-linear hardware + floor

# ---- Sweep targets and find the crossover ------------------------------------
print(f"{'target req/s':>12} | {'vert $/h':>9} | {'horiz $/h':>10} | {'nodes':>5} | winner")
print("-" * 60)
crossover, prev = None, None
for target in range(1000, 16001, 1000):
    vp, hp, n = vertical_price(target), horizontal_price(target), horizontal_nodes(target)
    winner = "vertical" if vp < hp else "horizontal"
    if prev == "vertical" and winner == "horizontal" and crossover is None:
        crossover = target
    prev = winner
    vp_s = "inf" if math.isinf(vp) else f"{vp:8.2f}"
    print(f"{target:>12} | {vp_s:>9} | {hp:10.2f} | {n:>5} | {winner}")   # widths match the header

print("-" * 60)
print(f"vertical ceiling (hard wall)     : {SMAX * T1:.0f} req/s")
print(f"crossover (vertical -> horizontal): {crossover} req/s")

Code 3.2.1: A self-contained cost model. vertical_price encodes the superlinear curve and its ceiling; horizontal_nodes inverts the taxed-throughput model to find the node count; horizontal_price adds the coordination floor. The sweep prints both prices per target and flags the crossover.

target req/s |  vert $/h |  horiz $/h | nodes | winner
------------------------------------------------------------
        1000 |      2.00 |       3.50 |     1 | vertical
        2000 |      7.46 |       7.50 |     3 | vertical
        3000 |     16.13 |       9.50 |     4 | horizontal
        4000 |     27.86 |      11.50 |     5 | horizontal
        5000 |     42.57 |      13.50 |     6 | horizontal
        6000 |     60.19 |      17.50 |     8 | horizontal
        7000 |     80.67 |      19.50 |     9 | horizontal
        8000 |    103.97 |      23.50 |    11 | horizontal
        9000 |       inf |      25.50 |    12 | horizontal
       10000 |       inf |      29.50 |    14 | horizontal
       11000 |       inf |      33.50 |    16 | horizontal
       12000 |       inf |      39.50 |    19 | horizontal
       13000 |       inf |      43.50 |    21 | horizontal
       14000 |       inf |      49.50 |    24 | horizontal
       15000 |       inf |      55.50 |    27 | horizontal
       16000 |       inf |      61.50 |    30 | horizontal
------------------------------------------------------------
vertical ceiling (hard wall)     : 8000 req/s
crossover (vertical -> horizontal): 3000 req/s

Output 3.2.1: The real run. On this 1000 req/s grid, vertical scaling is cheaper at 1000 and 2000 req/s (by only about $0.04$ per hour at 2000), and horizontal wins from 3000 req/s onward, so the sweep reports a crossover of 3000 req/s. A single node can still be bought at 8000 req/s; past that, vertical hits its hard wall (price becomes inf: no single node is large enough), while horizontal keeps scaling smoothly by adding nodes.

The output makes the two curves of Figure 3.2.1 concrete. At the low end of the sweep, one bigger node wins because it avoids the coordination floor of $1.50$ per hour that the cluster pays before it does any useful work. The two prices are almost tied at 2000 requests per second ($7.46$ against $7.50$), so the answer really flips in the neighborhood of 2000; the reported crossover of 3000 is simply the first point on the 1000 req/s grid at which horizontal wins. Above the crossover the superlinear vertical price pulls away from the near-linear fleet, and the gap widens fast: by 8000 requests per second the single node costs over $100$ per hour against the cluster's $23.50$, and one step further the single node does not exist at all. Notice also how the communication tax shows up in the node count, which climbs from 11 nodes at 8000 req/s to 30 nodes at 16000 req/s: throughput doubles while the fleet nearly triples, because each added node coordinates with more peers. That superlinear node growth is the horizontal curve's own version of a price premium, and Section 3.5 quantifies the limit it imposes.
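
Because the horizontal price moves in whole-node steps, a 1000 req/s grid can hide where the winner actually changes. The sketch below reuses the parameters and pricing rules of Code 3.2.1 (re-declared so it runs on its own; `winner` and `flips` are names introduced here for illustration) and walks the targets one request per second at a time, recording every change of winner.

```python
import math

# Same parameters as Code 3.2.1.
T1, C1, TAU, FLOOR = 1000.0, 2.0, 0.03, 1.5
ALPHA, SMAX = 0.9, 8.0

def vertical_price(target):
    s = target / T1
    if s <= 1.0:
        return C1
    if s > SMAX:
        return math.inf
    return C1 * s ** (1.0 + ALPHA)

def horizontal_price(target):
    n = 1
    while n * T1 / (1.0 + TAU * (n - 1)) < target:
        n += 1
    return n * C1 + FLOOR

def winner(target):
    return "vertical" if vertical_price(target) < horizontal_price(target) else "horizontal"

# Walk the targets 1 req/s at a time and record every change of winner.
flips = []
prev = winner(1000)
for target in range(1001, 4001):
    current = winner(target)
    if current != prev:
        flips.append((target, current))
        prev = current

for target, new_winner in flips:
    print(f"at {target} req/s the cheaper direction becomes {new_winner}")
```

With these parameters the winner changes more than once between 1000 and 3000 req/s: the stepwise fleet price briefly undercuts the smooth vertical curve, loses again when a third node is needed, and then wins for good. When the fleet is only a few nodes, a crossover is therefore better reported as a band than as a single number.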

Library Shortcut: Autoscalers Walk This Curve for You

Code 3.2.1 finds the node count by hand for a fixed target. In production you do not recompute it yourself as load shifts; a horizontal autoscaler measures a signal (queue depth, GPU utilization) and adjusts the replica count to track demand, walking up and down the horizontal curve automatically. On Kubernetes the entire policy is a short manifest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: embed-svc }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: embed-svc }
  minReplicas: 2          # never drop below a warm floor
  maxReplicas: 40         # the budget ceiling, not a hardware one
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

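Code 3.2.2: A Kubernetes HorizontalPodAutoscaler. The hand-written search in horizontal_nodes from Code 3.2.1 becomes a declarative target utilization; the control plane adds and removes replicas to hold it, which is the node-count half of the model running continuously instead of once. The manifest targets CPU utilization because the built-in Resource metric type covers only CPU and memory; scaling on queue depth or GPU utilization requires a custom or external metric. Cluster scheduling internals are the subject of Chapter 33.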
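
The control loop behind such a manifest is small enough to sketch. The function below is a simplified rendering of the replica rule described in the Kubernetes documentation, desired = ceil(current × observed / target), with a tolerance band and the min/max clamp from Code 3.2.2. It is an illustration only: the name `desired_replicas` is ours, the 10% tolerance is assumed to match the controller's default, and the real controller also handles missing metrics, pod readiness, and stabilization windows that are omitted here.

```python
import math

def desired_replicas(current: int, observed: float, target: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 40,
                     tolerance: float = 0.1) -> int:
    """Simplified autoscaler step: scale the replica count by observed/target."""
    ratio = observed / target
    if abs(ratio - 1.0) <= tolerance:
        return current                      # close enough: do nothing
    proposed = math.ceil(current * ratio)   # round up so the target is met
    return max(min_replicas, min(max_replicas, proposed))

print(desired_replicas(12, observed=95.0))   # overloaded: scale out
print(desired_replicas(12, observed=72.0))   # within tolerance: hold
print(desired_replicas(12, observed=20.0))   # idle: scale in
print(desired_replicas(30, observed=140.0))  # capped by max_replicas
```

The tolerance band is what keeps the fleet from oscillating when the metric hovers near its target, and the clamp is where the budget ceiling from the manifest enters the loop.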

6. The Limits of the Model (Advanced)

The model in Code 3.2.1 is a planning tool, not a law of nature, and its assumptions are worth naming so you know when to distrust it. The single tax parameter $\tau$ compresses a great deal of physics into one number: real communication cost depends on message size, network topology, and which collective the workload uses, all of which Section 3.8 and Chapter 4 model in proper detail. A stateless inference replica and a tightly synchronized training step have wildly different $\tau$ values, and using one figure for both will mislead you. The model also assumes a node fails independently and rarely, whereas at large $n$ the probability that something is broken at any instant approaches certainty, which adds a reliability cost that Chapter 18 turns into its own design problem.
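
The reliability point is easy to quantify under the same independence assumption. In the sketch below, the per-node unavailability `P_DOWN` is a hypothetical figure chosen only to show the trend, and `prob_any_node_down` is a name introduced here.

```python
def prob_any_node_down(n: int, p_down: float) -> float:
    """P(at least one of n nodes is down), assuming independent failures."""
    return 1.0 - (1.0 - p_down) ** n

P_DOWN = 0.001   # HYPOTHETICAL: each node is unavailable 0.1% of the time

for n in (1, 10, 100, 1000, 16000):
    print(f"n = {n:>5}: P(something is broken) = {prob_any_node_down(n, P_DOWN):.4f}")
```

Even with each node available 99.9% of the time, a fleet of a thousand has something broken more often than not, and at the scale of frontier training runs a fully healthy cluster is the exception. Correlated failures, which this sketch ignores, only make the picture worse.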

Two refinements matter most in practice. First, the communication tax is rarely linear in $n$; for collective operations it often grows more gently, like $\log n$ for a well-implemented all-reduce, which lowers the horizontal price, tends to pull the crossover toward smaller targets, and flattens the node-count growth seen in Output 3.2.1. Second, vertical and horizontal scaling are not pure alternatives but composable layers: the real design space is a grid of (node size $\times$ node count), and the sweet-spot recipe of Section 4 is a heuristic for searching it, not a proof of optimality. The honest use of the crossover model is to get the order of magnitude right and the direction of the decision right, then to measure, because the parameters $\tau$, $\alpha$, and $s_{\max}$ are all empirical.

Research Frontier: Scaling Out Where Vertical Has No Answer (2024 to 2026)

The clearest evidence that the vertical ceiling is real comes from the systems built to climb over it. Frontier-model training now spans tens of thousands of accelerators precisely because no single node holds a trillion-parameter model, and the 2024 to 2025 literature on this regime is explicitly about taming the horizontal communication tax. Llama 3's training report (Dubey et al., 2024) documents four-dimensional parallelism across 16,000 GPUs and the network engineering required to keep the tax bounded at that scale. On the inference side, prefill and decode disaggregation, popularized by systems such as DistServe and Mooncake (2024), splits a single request across machines specialized for different phases, a horizontal move that a bigger box cannot replicate. A parallel thread pushes the floor down: serverless and scale-to-zero GPU platforms shrink the coordination cost $F$ so that scaling out is economical even for bursty, low-volume services. The common thread is that once a workload clears the vertical ceiling, the entire engineering problem becomes managing $\tau$, which is the agenda of Parts IV and V.

We now have a numeric grip on the most basic scaling decision: a cost model, a crossover, and the recipe of scaling up to a sweet spot before scaling out. What this section has not done is distinguish the two reasons you might add machines in the first place, namely to finish a fixed job faster or to take on a proportionally bigger job in the same time. Those are strong scaling and weak scaling, and they obey different laws and reward different strategies. Section 3.3 draws that distinction and shows why the answer to "did adding machines help?" depends entirely on which question you were asking.

Exercise 3.2.1: Read the Crossover Off the Curves (Conceptual)

Using only the shapes in Figure 3.2.1 and the rules in Sections 2 and 3, answer the following without any arithmetic. (a) If a competitor announces a single machine twice as large as today's biggest, which way does the vertical wall move, and what happens to the crossover point? (b) If your workload's communication tax $\tau$ doubles (for example, you move from a stateless service to a synchronized one), does the crossover move left or right, and why? (c) If the cloud provider drops the coordination floor $F$ to nearly zero with a serverless offering, which end of the horizontal curve changes, and how does that affect very small targets?

Exercise 3.2.2: Make the Tax Sublinear (Coding)

Modify Code 3.2.1 so the communication tax grows like $\log n$ instead of linearly: replace the effective-throughput denominator with $1 + \tau \log_2(n)$. Re-run the sweep and report the new crossover and the node count needed for 16000 req/s. Then, in two or three sentences, explain why a well-implemented ring or tree all-reduce (the subject of Chapter 4) makes the logarithmic model more realistic than the linear one for many training workloads, and what that implies for how far you can scale out before the tax dominates.

Exercise 3.2.3: Find the Per-Node Sweet Spot (Analysis)

Section 4 claims that scaling each node up to a sweet spot before scaling out minimizes total cost by keeping the node count, and therefore the tax, small. Make this precise. Suppose a node of size $s$ has throughput $s\,T_1$ and price $C_1 s^{1+\alpha}$, and you must hit a fixed target $T^\star$ using $n = \lceil T^\star / (s\,T_1) \rceil$ such nodes with total price $n\,C_1 s^{1+\alpha} + F$ (ignore the tax for this part). Show analytically that, dropping the ceiling function, the total price is approximately $(C_1 T^\star / T_1)\, s^{\alpha} + F$: raising $s$ trades fewer nodes for a superlinearly pricier node, the total rises with $s$, and so without the tax the cheapest node is the smallest one, not the largest available. Then describe how reintroducing the communication tax $\tau$ shifts that optimum upward to an interior value. Relate your answer to why production teams size nodes above the commodity baseline but below the top of the product line.