Section 41.5: Selecting Tools and Infrastructure

"They asked me to hide the hard parts. I hide most of them. The ones I leave showing are the ones you actually signed up to learn, and the ones I never warned you about are in the release notes you did not read."
A Framework, Promising to Hide the Hard Parts (Most of Them)

Big Picture

Once the distribution design exists, tool selection is mostly a lookup, not an invention: the design names a communication pattern, and a mature framework already implements that pattern better than you could in one term. The previous section turned your capstone goal into a distribution design, a choice of which axis to spread across machines and which collective carries the data between them. This section turns that design into a concrete stack. The discipline is the same "right tool" principle the book has pressed since Chapter 1: do not reinvent the all-reduce, the shuffle, or the request scheduler; pick the proven framework that already owns your pattern, size the smallest cluster that clears your ceilings, and decide with open eyes which pieces to build and which to buy. Reproducibility (Section 41.8) then demands that whatever you pick is pinned, so this choice outlives the demo.

A capstone fails far more often on its plumbing than on its idea. A team spends three weeks writing a gradient-synchronization loop that PyTorch ships in one line, or stands up a Kubernetes cluster to serve a model that runs comfortably on a single node, and the term ends before the actual research question is touched. The antidote is to treat tool selection as the direct consequence of the design you finished in Section 41.4, not as a fresh open-ended question. The design already told you which essential activity is distributed and which collective moves the bytes; your job here is to name the framework that implements exactly that, provision the smallest infrastructure that clears the binding ceiling, and stop. This section gives you the mapping, the sizing arithmetic, and the build-versus-buy decision that together produce a stack you can actually finish on.

Figure 41.5.1: The tool-selection map. The design from Section 41.4 names a need on the left (a distribution axis and the collective it relies on); each need maps to a mature framework in the middle that already implements that collective; the framework then runs on one of three infrastructure tiers on the right. The arrows into the tiers are illustrative defaults: data and serving often start local or managed, model-parallel training escalates to a self-managed cluster. The arithmetic that places a given project on a tier is the subject of Code 41.5.1.

1. The Design Dictates the Tool Beginner

The whole book has argued that a distributed AI method is, at its core, a choice of which activity to split and which collective recombines it. That single sentence is also the tool-selection rule. Your design in Section 41.4 ended on a communication pattern; every mature framework in the ecosystem is organized around implementing one such pattern superbly. So the mapping is nearly mechanical. If the design distributes data and leans on a shuffle, you reach for Spark, Ray Data, or Dask (the toolkit of Part II). If it distributes training across replicas that synchronize gradients by all-reduce, you reach for PyTorch DistributedDataParallel or DeepSpeed (Chapter 15). If the model itself does not fit on one device and must be sharded with reduce-scatter and all-gather, you reach for FSDP or Megatron-LM (Chapter 16). If the work is serving a fleet under a latency budget, you reach for vLLM, Ray Serve, or Triton (Part V).

Two cross-cutting needs sit underneath all four axes. Something must schedule the work and recover it when a machine dies, which is the province of Ray, Kubernetes, and Airflow from Part VII; and something must record what you ran so the result is believable, which is experiment tracking with Weights and Biases or MLflow from Chapter 26. Table 41.5.1 collects the mapping in one place so the design-to-tool step is a lookup you can do at a glance.

Table 41.5.1: From the distribution design to the mature framework that implements it. The collective in the second column is the load-bearing reason the framework exists; matching it to your design is the entire selection step.

Need from the design	Collective it relies on	Mature framework	Where the book builds it
Distribute data	shuffle	Spark, Ray Data, Dask	Part II
Distribute training	all-reduce	PyTorch DDP, DeepSpeed	Part III, Part IV
Distribute the model	reduce-scatter, all-gather	FSDP, Megatron-LM	Part IV
Distribute inference	request fan-out, all-reduce (tensor-parallel)	vLLM, Ray Serve, Triton	Part V
Coordinate the cluster	scheduling, failure recovery	Ray, Kubernetes, Airflow	Part VII
Track experiments	artifact and metric logging	W&B, MLflow	Chapter 26

Key Insight: Pick the Framework That Owns Your Collective

A framework is mature precisely because thousands of engineers have already paid the cost of making one communication pattern fast and reliable: gradient bucketing and overlap in DDP, fault-tolerant shuffle in Spark, continuous batching and KV-cache paging in vLLM. When your design's collective matches a framework's reason for existing, you inherit all of that for free. When it does not, you are about to reinvent a collective, which is the one thing a one-term capstone can never afford. So selection reduces to a single question asked of the design: which collective does this lean on, and who already owns it?

2. Build Versus Buy, Managed Versus Self Intermediate

Choosing the framework settles what runs; it does not settle where it runs or who keeps it alive. That is the build-versus-buy decision, and for a capstone it almost always tilts toward buy. You are not graded on having operated a Kubernetes control plane; you are graded on the distributed-AI result. Every hour spent debugging node networking, broken NCCL rendezvous, or a flaky autoscaler is an hour not spent on the question that motivated the project. The managed-versus-self axis from Section 33.9 applies directly: a managed cluster or training service hands you a working scheduler, a network that is already configured, and recovery you did not write, at a premium over raw cloud instances. For most capstones that premium buys back exactly the weeks you do not have.

The decision is not free of judgment, because managed convenience can collide with reproducibility and cost. We can make the cost side explicit. Let a self-managed cluster of $g$ accelerators cost $c$ dollars per accelerator-hour and run for $h$ hours; a managed service charges a premium factor $m > 1$ for the same compute but removes a fixed operations burden you would otherwise pay in your own time. Writing $T_{\text{ops}}$ for the hours you would spend running the cluster yourself and $v$ for the value you place on an hour of your team's term, the two totals are

$$ C_{\text{self}} = g\,c\,h + v\,T_{\text{ops}}, \qquad C_{\text{managed}} = m\,g\,c\,h. $$

Buying is the rational choice whenever $C_{\text{managed}} \le C_{\text{self}}$, which rearranges to $(m - 1)\,g\,c\,h \le v\,T_{\text{ops}}$: the dollar premium of managed compute must be no larger than the value of the operations time it saves you. For a capstone, $v\,T_{\text{ops}}$ is enormous (the operations time is measured in the same scarce term-weeks as the research), so the inequality is usually satisfied with room to spare. The cases where it flips, very long training runs where $h$ is large and the premium compounds, are exactly the cases Code 41.5.1 flags for a self-managed cluster.

Practical Example: The Capstone That Bought Its Cluster Back

Who: A two-person graduate team building a distributed retrieval-augmented assistant for their capstone.

Situation: Their design distributed inference (vLLM behind Ray Serve) and a nightly index rebuild (Ray Data), a serving and a data axis, no custom training.

Problem: Their first instinct was to self-host on raw cloud virtual machines to "learn the infrastructure", and they lost the first two weeks to networking and autoscaler failures.

Dilemma: Keep operating the cluster themselves, accumulating systems experience but burning term-weeks, or move to a managed Ray service and spend the saved time on retrieval quality, the thing they were actually graded on.

Decision: They moved to the managed service, because the premium over raw instances was a few hundred dollars while the operations time it saved was measured in the weeks the inequality above makes decisive.

How: They kept the same vLLM and Ray Serve code, pointed it at a managed cluster, and pinned every version for Section 41.8 reproducibility.

Result: The plumbing stabilized in a day, and the remaining weeks went to the retrieval evaluation that carried the grade.

Lesson: A capstone buys the cluster and builds the idea. Reserve "build" for the one component that is your contribution; buy everything that merely has to work.

3. Start Small, Then Scale the Tier Intermediate

The infrastructure tier is the last column of Figure 41.5.1, and the rule for choosing it is to start at the smallest tier that clears your binding ceiling and escalate only when forced. The smallest tier is a single multi-GPU node. A surprising fraction of capstone designs never need to leave it: a single box with eight accelerators runs data-parallel training, sharded training across those eight devices, and most serving workloads without a single inter-node hop. Single-node multi-GPU is where you debug your distributed code, because the collectives behave the same as on a cluster but the failure surface is a tenth the size. Only when the model no longer fits in one node, the throughput target exceeds one node, or the dataset forces a genuinely distributed shuffle should you escalate to multi-node, and there you prefer a managed cluster (Section 33.9) over a self-built one for the reasons of the previous section.

Sizing the tier is arithmetic, not intuition. The memory a training job needs is roughly the parameter count times the bytes each parameter consumes once you include gradients and optimizer state. For mixed-precision training with Adam, that overhead is about sixteen bytes per parameter, so a model with $P$ parameters needs on the order of $16P$ bytes of accelerator memory, and the number of accelerators to hold it is

$$ g_{\text{model}} = \left\lceil \frac{16\,P}{M_{\text{GPU}}} \right\rceil, $$

where $M_{\text{GPU}}$ is the memory of one accelerator. A serving design sizes instead by throughput: if one replica sustains $r$ requests per second and you must serve $Q$, you need $g_{\text{qps}} = \lceil Q / r \rceil$ replicas. The provisioned count is the larger of the two, and the node count follows by dividing by the accelerators per node and rounding up. Code 41.5.1 turns this arithmetic and the managed-versus-self test into a small helper that takes a problem profile and prints a recommended framework, accelerator count, and tier.

"""Tool-and-infrastructure sizing helper for a capstone project.

Given a problem profile (model size, dataset size, throughput target, budget,
and the dominant communication pattern dictated by the design), recommend a
distribution axis, a mature framework that implements it, and an infrastructure
tier (local single node, cloud multi-node, or managed service).
"""

# Per-axis framework map. The design's communication pattern picks the row.
AXIS_TOOLS = {
    "data":      ("Spark / Ray Data / Dask",      "shuffle"),
    "training":  ("PyTorch DDP, DeepSpeed",        "all-reduce"),
    "model":     ("FSDP, Megatron-LM",             "reduce-scatter / all-gather"),
    "serving":   ("vLLM, Ray Serve, Triton",       "request fan-out"),
}

GPU_MEM_GB = 80.0          # one modern data-center accelerator
LOCAL_NODE_GPUS = 8        # a single multi-GPU box
CLOUD_HOURLY_PER_GPU = 2.5 # on-demand $/GPU-hour (illustrative)
MANAGED_PREMIUM = 1.6      # managed service multiplier over raw cloud


def recommend(name, params_b, data_tb, qps, axis, train_hours, budget_usd):
    bytes_per_param = 16          # params + grads + Adam state, mixed precision
    model_mem_gb = params_b * bytes_per_param
    gpus_for_model = max(1, -(-int(model_mem_gb) // int(GPU_MEM_GB)))  # ceil
    # Throughput axis wants replicas; size by a nominal 40 qps per GPU replica.
    gpus_for_qps = max(1, -(-qps // 40)) if qps else 0
    gpus = max(gpus_for_model, gpus_for_qps, 1)

    tool, collective = AXIS_TOOLS[axis]
    nodes = max(1, -(-gpus // LOCAL_NODE_GPUS))
    raw_cost = gpus * CLOUD_HOURLY_PER_GPU * train_hours

    if gpus <= LOCAL_NODE_GPUS and data_tb < 1.0:
        tier = "Local single node (multi-GPU)"
    elif raw_cost * MANAGED_PREMIUM <= budget_usd and nodes <= 4:
        tier = "Managed cluster (Ch 33.9)"
    else:
        tier = "Self-managed cloud cluster (Ch 33)"

    print(f"{name}")
    print(f"  binding axis     : {axis:<9}  collective: {collective}")
    print(f"  framework        : {tool}")
    print(f"  GPUs (model/qps) : {gpus_for_model} / {gpus_for_qps}  ->  provision {gpus} GPU(s), {nodes} node(s)")
    print(f"  raw cloud cost   : ${raw_cost:,.0f}  (budget ${budget_usd:,.0f})")
    print(f"  infra tier       : {tier}")
    print()


if __name__ == "__main__":
    print("Capstone tool-and-infrastructure sizing helper\n" + "=" * 48 + "\n")
    # (name, params_B, data_TB, qps, axis, train_hours, budget_$)
    recommend("A. Fine-tune a 7B chat model on 20 GB of dialogue",
              7, 0.02, 0, "training", 30, 5_000)
    recommend("B. Pre-train a 70B model from a 4 TB corpus",
              70, 4.0, 0, "model", 400, 200_000)
    recommend("C. Serve a 13B assistant at 600 qps",
              13, 0.0, 600, "serving", 24 * 30, 40_000)
    recommend("D. Dedup + tokenize a 30 TB web crawl",
              0, 30.0, 0, "data", 12, 2_000)

Code 41.5.1: A sizing helper that maps a problem profile to a framework (via the design's axis), a provisioned accelerator and node count (via the memory and throughput formulas above), and an infrastructure tier (via the managed-versus-self cost test of Section 2). It encodes defaults, not laws; a real project replaces the illustrative constants with measured per-replica throughput and current quoted prices.

Capstone tool-and-infrastructure sizing helper
================================================

A. Fine-tune a 7B chat model on 20 GB of dialogue
  binding axis     : training   collective: all-reduce
  framework        : PyTorch DDP, DeepSpeed
  GPUs (model/qps) : 2 / 0  ->  provision 2 GPU(s), 1 node(s)
  raw cloud cost   : $150  (budget $5,000)
  infra tier       : Local single node (multi-GPU)

B. Pre-train a 70B model from a 4 TB corpus
  binding axis     : model      collective: reduce-scatter / all-gather
  framework        : FSDP, Megatron-LM
  GPUs (model/qps) : 14 / 0  ->  provision 14 GPU(s), 2 node(s)
  raw cloud cost   : $14,000  (budget $200,000)
  infra tier       : Managed cluster (Ch 33.9)

C. Serve a 13B assistant at 600 qps
  binding axis     : serving    collective: request fan-out
  framework        : vLLM, Ray Serve, Triton
  GPUs (model/qps) : 3 / 15  ->  provision 15 GPU(s), 2 node(s)
  raw cloud cost   : $27,000  (budget $40,000)
  infra tier       : Self-managed cloud cluster (Ch 33)

D. Dedup + tokenize a 30 TB web crawl
  binding axis     : data       collective: shuffle
  framework        : Spark / Ray Data / Dask
  GPUs (model/qps) : 1 / 0  ->  provision 1 GPU(s), 1 node(s)
  raw cloud cost   : $30  (budget $2,000)
  infra tier       : Managed cluster (Ch 33.9)

Output 41.5.1: The helper run on four representative capstone profiles. Profile A fits one node and stays local; B and C escalate to multi-node because the model or the throughput exceeds a single box, and C tips into self-managed because its long serving horizon makes the managed premium exceed the budget; D needs only modest compute but a genuinely distributed shuffle, so it buys a managed data cluster. Each recommendation is the smallest tier that clears the profile's binding ceiling.

Library Shortcut: One Launch Line Per Tier

The reason starting small costs almost nothing is that every framework offers a launcher that hides the rendezvous, process-group setup, and placement that Chapter 33 unpacks. The same code moves from one node to many by changing only the launch line, not the model. Below, the three canonical launchers for the three tiers: torchrun for single-node multi-GPU, ray up for a managed or self-managed Ray cluster, and sbatch for a Slurm multi-node job.

# Tier 1: single node, 8 GPUs. torchrun spawns the processes and the rendezvous.
torchrun --standalone --nproc_per_node=8 train.py --fsdp

# Tier 2: bring up a Ray cluster from one config, then submit. ray up does the
# provisioning, autoscaling, and head/worker wiring you would otherwise script.
ray up cluster.yaml -y
ray submit cluster.yaml serve_app.py        # vLLM behind Ray Serve, e.g.

# Tier 3: multi-node training on a Slurm cluster. sbatch queues the gang-scheduled
# job; srun fans torchrun across the allocated nodes.
sbatch --nodes=4 --gpus-per-node=8 --wrap="srun torchrun \
  --nnodes=4 --nproc_per_node=8 --rdzv_backend=c10d train.py --fsdp"

Code 41.5.2: The launch line is the only thing that changes across tiers. The model code wrapped in DDP or FSDP is identical whether torchrun runs it on one node or srun fans it across four, which is exactly why debugging on the single-node tier transfers cleanly to the cluster.

Thesis Thread: The Capstone Picks an Axis and Lets the Tool Follow

From Chapter 1 the spine of this book has been that scaling out means choosing which activity to distribute and which collective recombines it. The capstone is where that choice becomes a committed decision rather than a worked example. You name the binding axis, the all-reduce or shuffle or fan-out it depends on follows, the framework that owns that collective follows from it, and the infrastructure tier follows from the size. Tool selection is not a separate discipline bolted onto the design; it is the design read off in the vocabulary of frameworks and clusters. Defend the axis well in Section 41.4 and the stack in this section is nearly forced.

4. Match Tool Maturity to the Term Intermediate

A one-term timeline is a hard constraint on tool choice, and it argues for maturity over novelty at every juncture. A framework that has shipped stable releases for years, carries documentation and a community, and has been run at scales far beyond yours is one whose failures are already known and worked around. A framework released last month may implement your collective more elegantly, but you will be its bug-finder, and a capstone has no slack for that role. The rule is blunt: do not reinvent the collectives, and do not bet the term on an unproven framework when a proven one covers your pattern. The all-reduce in Chapter 15 took the field years to make fast and correct; you inherit that by calling DDP, and you forfeit it the moment you decide to write your own.

Maturity also serves the reproducibility you will be held to in Section 41.8. A pinned version of a mature framework still installs and still behaves the same way when your grader runs it weeks later; a bleeding-edge dependency may have moved, broken its API, or vanished. Pinning is only meaningful when the thing you pin is stable, so choosing mature tools and pinning them are two halves of one decision. The tracking layer from Chapter 26 closes the loop: log the framework versions, the cluster shape, and the launch command alongside every run, so the stack that produced a number is recoverable from the experiment record, not from memory.

Research Frontier: Compilers and Auto-Parallelizers (2024 to 2026)

The selection step in this section is manual: you read the collective off the design and pick the framework that owns it. A research line is trying to automate exactly this mapping. torch.compile and the broader PyTorch 2 compiler stack fuse and schedule operations that practitioners once hand-tuned, and distributed extensions in the lineage of Alpa and the GSPMD partitioner search over data, tensor, and pipeline splits to synthesize a parallelization plan from an unannotated model. Newer auto-sharding work and the DTensor abstraction push toward declaring a device mesh and letting the framework choose the collectives. For a capstone today these tools sit alongside, not in place of, an explicit design: they reduce the hand-tuning once the axis is chosen, but the choice of axis, the decision this section turns into a stack, remains yours. Watching whether auto-parallelizers eventually subsume that choice is one of the more consequential open questions in distributed-AI tooling.

Fun Note: The Two-Week Collective

There is a rite of passage in which a team, distrustful of black boxes, writes its own ring all-reduce "to really understand it". Two weeks later they have a correct, slow, single-failure-prone implementation and a deep respect for the one line of torch.distributed.all_reduce they replaced. Understanding the collective is worth a weekend with the math in Chapter 4; shipping your own in a capstone is worth a missed deadline. Read the source if you must, then call the library.

Exercise 41.5.1: Read the Tool Off the Design Conceptual

For each capstone sketch, state the binding distribution axis, the collective it relies on, the mature framework you would select, and the smallest infrastructure tier you would start on: (a) a model that needs 240 GB of accelerator memory to train, served on 80 GB devices; (b) a recommendation pipeline that must shuffle and join two multi-terabyte logs nightly with no custom model; (c) a chat assistant that must answer 50 requests per second from a 7-billion-parameter model that fits on one device. For each, name one thing you would refuse to build yourself and explain why the term does not allow it.

Exercise 41.5.2: Extend the Sizing Helper Coding

Modify Code 41.5.1 so the memory estimate distinguishes inference (about 2 bytes per parameter, weights only) from training (about 16 bytes per parameter), and add a fifth profile: serving a 70-billion-parameter model at 200 requests per second. Confirm that the inference memory path now lets a model fit on far fewer accelerators than the training path would, and report the framework, accelerator count, and tier the helper recommends. Explain why mixing up the two byte-per-parameter constants would mis-size a serving cluster.

Exercise 41.5.3: When Does Self-Managed Win? Analysis

Using the cost model $C_{\text{self}} = g\,c\,h + v\,T_{\text{ops}}$ and $C_{\text{managed}} = m\,g\,c\,h$, take $g = 16$ accelerators, $c = \$2.5$ per accelerator-hour, a managed premium $m = 1.6$, and operations time $T_{\text{ops}} = 40$ hours valued at $v = \$60$ per hour. Find the training duration $h$ at which self-managed becomes cheaper than managed. Interpret the result for a capstone whose total compute fits in a few hundred accelerator-hours, and state what would have to change about a project for self-managed to be the rational choice.