"They spun me up on Tuesday, ran a fine-tune, and forgot me by Friday. I held a thousand gradients and never learned my own provider's name. It was, in its way, a perfect life."
A Managed Cluster, Spun Up and Forgotten by Friday
A managed platform is the whole of this chapter, Sections 33.1 through 33.8, sold as an API: you hand it a job and it provisions the nodes, gang-schedules the workers, bids on spot, autoscales the serving fleet, mounts the storage, and tears it all down when you are done. The previous sections taught you to build a cluster substrate by hand: the resource model and the scheduler, gang scheduling so distributed jobs start atomically, topology-aware placement, spot and preemption survival, GPU partitioning, and fair sharing across teams. A managed platform automates every one of those mechanisms behind a few lines of client code, and the price of that automation is the central decision of this section. You trade control, a per-hour markup, and a degree of lock-in for not staffing a platform team. This section frames managed platforms as a build-versus-buy choice, walks the three that dominate practice (Databricks, Amazon SageMaker, and Google Vertex AI), and gives you a break-even calculation that tells you, in accelerator-hours per year, which side of the trade you are on. It then closes the chapter.
Everything in Chapter 33 so far has assumed you operate the cluster. You wrote the resource request, the scheduler matched it to nodes, you reasoned about gang admission and topology and spot reclamation, and you accepted the operational burden that comes with all of it: a control plane to keep alive, an on-call rotation when the scheduler wedges, a quota policy to arbitrate between teams, and the slow accretion of tooling that any production cluster grows. For a large fraction of organizations that burden is not the work they want to do. They want to train a model and serve it, not to run Kubernetes and Slurm and a spot-bidding daemon underneath. A managed platform exists precisely to absorb that burden: it presents the cluster as a service, hides the scheduler behind a job-submission API, and bills you for the result. The question is never whether managed platforms remove work; they plainly do. The question is whether what they remove is worth what they charge and what they take away, and that question has a numerical answer.
1. The Build-Versus-Buy Line, and What Crosses It Beginner
Every cluster decision in this chapter can be read as choosing where to draw one horizontal line in Figure 33.9.1. Below the line is what you operate; above it is what someone operates for you. A bare-metal cluster with your own scheduler draws the line at the floor: you own the resource model, gang admission, placement, spot logic, partitioning, quotas, and storage, which is the entire content of Sections 33.1 through 33.8. A managed platform draws the line near the ceiling: you own a training script and a serving function, and the provider owns the rest. The two are not a binary; the line can sit anywhere, and most real organizations place it somewhere in the middle, running a managed control plane over hardware they reserve, or a self-managed scheduler over a managed Kubernetes service. What moves when the line moves is always the same triple: control, cost, and lock-in.
You trade away control. A self-managed cluster lets you tune the scheduler's gang-admission policy, pick your exact spot-bidding strategy, and lay out a job on the interconnect by hand, as Chapter 4 motivates for topology-sensitive collectives. A managed platform makes those choices for you, and usually makes them well, but it makes them, and when its defaults are wrong for your workload you have limited recourse. You trade away a degree of cost transparency and often pay a per-hour markup over raw hardware, because the provider's margin and its operational staff are folded into the bill. And you accept lock-in: your training jobs speak the platform's SDK, your data sits in its managed store, and your pipelines assume its orchestration, so leaving is a migration, not a configuration change. In return, you delete the platform team, the on-call burden, and the months of engineering that Sections 33.1 to 33.8 represent if you build them yourself.
The mechanisms in Sections 33.1 to 33.8, gang scheduling, topology placement, spot survival, GPU partitioning, fair sharing, and managed storage, are exactly what a managed platform packages and resells. Buying the platform does not free you from understanding them; it makes understanding them more valuable, because every line you drew in this chapter is now a setting you must reason about through the platform's abstraction. A team that buys Databricks or SageMaker without knowing what gang scheduling is will misconfigure a distributed job and pay for idle gangs; a team that knows this chapter will read the platform's documentation and recognize each knob as a mechanism it already understands. Managed platforms raise the abstraction; they do not remove the need to know what sits beneath it.
2. Databricks: Spark, MLflow, and Clusters You Do Not Provision Beginner
Databricks began as the commercial home of Apache Spark, the engine that Chapter 7 develops, and its core proposition is that you never provision a Spark cluster yourself. You declare a job or open a notebook, name a cluster configuration (instance types, a minimum and maximum worker count, a spot policy), and the platform creates the cluster on demand, runs the work, and terminates it. The autoscaling that the batch schedulers of Section 33.4 described as a control loop you would build is here a checkbox: Databricks watches the Spark scheduler's pending-task backlog and adds or removes workers within the bounds you set, and its spot integration requests the cheap capacity of Section 33.8 and falls back to on-demand instances when spot capacity is unavailable or reclaimed. The cluster you reasoned about by hand becomes a transient object that exists for the life of a job.
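The control loop behind that checkbox can be pictured with a toy rule. The sketch below is an illustration only, not Databricks' actual autoscaling algorithm: the function name desired_workers and the tasks_per_worker capacity figure are assumptions introduced here. It sizes the cluster to drain the pending-task backlog and clamps the answer to the configured minimum and maximum.

```python
import math


def desired_workers(pending_tasks: int, tasks_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Toy scaling rule: enough workers to drain the backlog, clamped to bounds."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


# Same bounds as a cluster configured with min_workers=2, max_workers=16.
for backlog in (0, 40, 1000):
    workers = desired_workers(backlog, tasks_per_worker=8,
                              min_workers=2, max_workers=16)
    print(f"backlog={backlog:>5} -> workers={workers}")
```

An empty backlog holds the cluster at its floor and a very large one saturates at the ceiling, which is why the two bounds, not the rule itself, are what you actually control on a managed platform.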
Around that compute, Databricks adds two pieces that matter for AI work. MLflow, originating at Databricks, tracks experiments, registers models, and stages them toward deployment, which is the experiment-tracking and model-registry half of the MLOps discipline that Chapter 26 treats in full. Delta Lake and the Unity Catalog provide the managed, transactional storage and governance layer that Section 33.1 described as the cluster's storage tier and data-loading substrate, now with versioning and access control handled for you. The shape of a Databricks AI workflow is therefore familiar from this book: a Spark job (Chapter 7) prepares features into managed storage (Section 33.1), a training run logs to MLflow (Chapter 26), and the registered model is promoted to a serving endpoint, all without you ever running the scheduler underneath.
In a self-managed world you would request nodes from your scheduler (Section 33.4), wait for a gang to be admitted (Section 33.5), run Spark, and release the nodes. The Databricks Jobs API collapses that into a single declarative submission: you describe the cluster you want and the task to run, and the platform creates the cluster, runs the task, and tears it down. The roughly hundred lines of provisioning, autoscaling, and cleanup logic this chapter implied become one declarative request, written here with the Python SDK:
# Submit a Spark + ML training job that provisions and tears down its own cluster.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()  # auth from env / profile

run = w.jobs.submit(
    run_name="nightly-ranker-train",
    tasks=[jobs.SubmitTask(
        task_key="train",
        new_cluster=compute.ClusterSpec(
            spark_version="15.4.x-gpu-ml-scala2.12",  # GPU + MLflow preinstalled
            node_type_id="g5.2xlarge",
            autoscale=compute.AutoScale(min_workers=2, max_workers=16),  # 33.4
            aws_attributes=compute.AwsAttributes(  # spot with on-demand fallback
                availability=compute.AwsAvailability.SPOT_WITH_FALLBACK),  # 33.8
        ),
        notebook_task=jobs.NotebookTask(notebook_path="/Repos/team/train_ranker"),
    )],
).result()  # blocks until the run finishes

print("run state:", run.state.result_state)
3. SageMaker: Managed Training Jobs, Distributed Training, and Endpoints Intermediate
Amazon SageMaker draws the line at the level of the job rather than the cluster. Its central object is the training job: you give it a container or a framework Estimator, an instance type, an instance count, and a path to data, and SageMaker provisions exactly those instances, pulls the data, runs your script to completion, uploads the model artifacts, and then releases the instances. You are billed for the seconds the job ran, not for a cluster you keep alive, which is the ephemeral pattern of the epigraph: a gang spun up, used, and forgotten. When you ask for several instances, SageMaker sets up the distributed environment for you, wiring the process group that Chapter 15 builds by hand, so a multi-node data-parallel run is a parameter, not an infrastructure project. For the largest models it offers managed sharded and pipeline parallelism in the lineage of Chapter 16. Its managed spot training supports the checkpoint-and-resume preemption survival of Section 33.8 and the elastic recovery of Chapter 18: the platform persists your checkpoint directory to object storage and restarts the job after a reclaim, so your script only has to save checkpoints and reload the latest one on startup.
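The checkpoint-and-resume pattern that spot survival relies on can be sketched in a few lines. This is a minimal, framework-free illustration under stated assumptions: the train function, its fail_at argument (which simulates a spot reclaim), and the state.json file name are all hypothetical, and the checkpoint directory stands in for whatever local path the platform persists between restarts.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_dir, every=100, fail_at=None):
    """Toy training loop that resumes from the latest checkpoint if one exists."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, "state.json")
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(path):                  # resume after a restart
        with open(path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated spot reclaim")
        state["step"] += 1
        state["loss"] *= 0.999                # stand-in for a real optimizer step
        if state["step"] % every == 0:
            tmp = path + ".tmp"               # write then rename, so a reclaim
            with open(tmp, "w") as f:         # never leaves a half-written file
                json.dump(state, f)
            os.replace(tmp, path)
    return state


with tempfile.TemporaryDirectory() as d:
    try:
        train(500, d, fail_at=250)            # interrupted after the step-200 checkpoint
    except RuntimeError as err:
        print("interrupted:", err)
    final = train(500, d)                     # second launch resumes and finishes
    print("finished at step", final["step"])
```

The work between the last checkpoint and the reclaim is lost and redone, which is the cost that the checkpoint interval trades against checkpoint overhead.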
The serving half is symmetric. A SageMaker endpoint takes a registered model and stands up an autoscaling, load-balanced fleet of inference replicas, which is the managed-serving story that Chapter 24 develops for large models and Chapter 23 for inference systems generally. The endpoint scales its replica count against request load using the kind of autoscaling control loop the batch schedulers of Section 33.4 introduced, batches requests, and can split a large model across several accelerators behind one address. The chapters that showed you how to replicate a serving fleet by hand describe exactly what the endpoint does for you; the difference is that you declare a target and the platform runs the loop.
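"Declare a target and the platform runs the loop" can be made concrete with a proportional scaling rule. The sketch below is a generic illustration, not SageMaker's exact policy; the function name target_replicas and the per-replica load metric are assumptions introduced here. It rescales the replica count so that the observed per-replica load returns to the declared target, within fixed bounds.

```python
import math


def target_replicas(current: int, load_per_replica: float,
                    target_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Proportional rule: scale replicas by observed load over target load."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    desired = math.ceil(current * load_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))


# A demo spike: 2 replicas each seeing 300 req/s against a 50 req/s target.
print(target_replicas(2, 300.0, 50.0, min_replicas=2, max_replicas=12))
# A quiet night: 12 replicas each seeing 5 req/s against the same target.
print(target_replicas(12, 5.0, 50.0, min_replicas=2, max_replicas=12))
```

Unlike a backlog-driven batch rule, this loop reacts to a live per-replica metric, so the target value you declare is the one serving knob that most directly trades latency against cost.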
To launch a four-node data-parallel training run yourself you would provision a gang (Section 33.5), establish the process group (Chapter 15), and arrange checkpointing for spot survival (Section 33.8, Chapter 18). The SageMaker PyTorch Estimator expresses all of it as constructor arguments; the distribution, use_spot_instances, and checkpoint_s3_uri arguments turn on multi-node distribution and managed spot training with checkpoints persisted for resume:
# A managed 4-node data-parallel training job on spot capacity.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                       # your ordinary DDP training script
    role="arn:aws:iam::...:role/SageMakerRole",
    framework_version="2.3", py_version="py311",
    instance_type="ml.p4d.24xlarge",
    instance_count=4,                             # 4-node gang; SageMaker wires the group
    distribution={"torch_distributed": {"enabled": True}},  # sets up DDP (Ch 15)
    use_spot_instances=True,                      # managed spot training (Sec 33.8)
    max_wait=36000, max_run=24000,                # seconds: total budget incl. spot wait, run budget
    checkpoint_s3_uri="s3://my-bucket/ckpts",     # checkpoints synced here for resume (Ch 18)
)
estimator.fit({"train": "s3://my-bucket/train", "val": "s3://my-bucket/val"})

predictor = estimator.deploy(                     # stand up a serving endpoint (Ch 24)
    initial_instance_count=2, instance_type="ml.g5.2xlarge")
4. Vertex AI: Managed Training, Serving, and Pipelines Intermediate
Google Vertex AI offers the same three primitives, training, serving, and orchestration, under one console and SDK. A Vertex custom training job mirrors the SageMaker pattern: you specify a container, a machine type, an accelerator type, and a replica count, and Vertex provisions the workers, including the multi-worker setup for distributed training, runs the job, and releases the resources. Its serving side deploys a model to an endpoint with autoscaling and traffic splitting, the same managed-fleet abstraction of Chapter 24, with built-in canary and blended-traffic deployment for the progressive rollouts that Chapter 26 treats as core MLOps practice.
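Traffic splitting for a canary rollout reduces to mapping each request onto a weighted set of model versions. The sketch below is a hypothetical illustration, not Vertex AI's routing implementation: the route function, the version names, and the choice of hashing the request ID (so a given request always lands on the same version) are assumptions made here, and a real platform may instead pick a version at random per request.

```python
import hashlib


def route(request_id: str, split: dict[str, int]) -> str:
    """Pick a model version for a request, given percentage weights summing to 100."""
    if sum(split.values()) != 100:
        raise ValueError("weights must sum to 100")
    # Hash the request ID into one of 100 buckets, then walk the cumulative weights.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    upper = 0
    for version, weight in split.items():
        upper += weight
        if bucket < upper:
            return version
    raise AssertionError("unreachable when weights sum to 100")


# Canary: send roughly 10% of traffic to the new version.
split = {"ranker-v1": 90, "ranker-v2": 10}
counts = {version: 0 for version in split}
for i in range(10_000):
    counts[route(f"req-{i}", split)] += 1
print({version: n / 10_000 for version, n in counts.items()})
```

Promoting the canary is then a configuration change to the weights, with no change to the serving fleet itself.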
Vertex's distinctive emphasis is the pipeline. Vertex Pipelines, built on the Kubeflow Pipelines and TFX lineage, lets you express an end-to-end workflow, data preparation, training, evaluation, and deployment, as a directed acyclic graph of containerized steps, each scheduled onto managed compute with its dependencies tracked. This is the orchestration layer of MLOps (Chapter 26) delivered as a managed service: the gang scheduling and fair sharing you would run across these steps yourself (Sections 33.5 and 33.4) are handled by the underlying managed cluster, and you reason about the workflow rather than the scheduler. Across all three platforms the pattern is identical, which is the point: a managed training job, a managed serving endpoint, and a managed pipeline, each absorbing one column of Figure 33.9.1.
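The core of any such pipeline is a dependency graph executed in topological order. The sketch below uses only Python's standard-library graphlib (Python 3.9 or later) and is not the Vertex Pipelines or Kubeflow SDK; the step names and the run_pipeline helper are hypothetical stand-ins for containerized steps launched on managed compute.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical step names).
pipeline = {
    "prepare_data": set(),
    "train": {"prepare_data"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_pipeline(dag, run_step):
    """Run every step once, only after all of its dependencies have finished."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()                      # raises graphlib.CycleError if not a DAG
    order = []
    while sorter.is_active():
        for step in sorter.get_ready():   # steps returned together could run in parallel
            run_step(step)
            order.append(step)
            sorter.done(step)             # unblocks the steps that depend on this one
    return order


print(run_pipeline(pipeline, lambda step: None))
# → ['prepare_data', 'train', 'evaluate', 'deploy']
```

A managed pipeline service adds what this loop omits: launching each step as a container on its own compute, retrying failures, and recording the artifacts that pass between steps.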
Who: A four-person machine learning team at an early-stage startup, with no platform or infrastructure engineer.
Situation: They needed to fine-tune open models weekly and serve them behind a low-latency endpoint, with traffic that spiked unpredictably during customer demos.
Problem: Building the cluster substrate of Sections 33.1 to 33.8 (a scheduler, gang admission, spot logic, an autoscaling serving fleet) would have consumed their entire engineering budget before shipping a product.
Dilemma: Hire a platform engineer and self-manage for control and lower per-hour cost, or buy a managed platform at a markup and ship the product with the people they had.
Decision: They bought managed training jobs and autoscaling endpoints, accepting the markup and the lock-in, because their total accelerator-hours were far below any plausible break-even.
How: Weekly fine-tunes ran as managed training jobs on spot capacity with automatic checkpoint resume; serving used an autoscaling endpoint that scaled to two replicas overnight and to twelve during demos.
Result: They shipped in weeks, never paged anyone for a wedged scheduler, and their managed bill stayed well under the fully-loaded cost of the platform engineer they did not hire.
Lesson: Below the break-even, the markup a managed platform charges is smaller than the salary a self-managed cluster requires. The next section makes that crossover a number.
5. The Break-Even: When Buying Stops Being Cheaper Intermediate
The build-versus-buy decision has a clean economic core, and writing it as a total-cost-of-ownership comparison turns intuition into a threshold. Self-managing costs you raw hardware plus a fixed platform team, and your reserved capacity exceeds your useful work because some of it sits idle (the utilization gap that gang scheduling and bin-packing in Sections 33.5 and 33.6 exist to shrink). Buying costs you a per-hour markup over raw hardware but no platform team, and the provider's tighter packing raises your effective utilization. Let $u$ be the useful accelerator-hours per year you actually need, $r$ the raw hardware cost per hour, $\eta_s$ and $\eta_m$ the self-managed and managed utilizations, $m$ the managed markup, and $T$ the fixed annual cost of the platform team. The two totals are
$$C_{\text{self}}(u) = \frac{u}{\eta_s}\,r + T, \qquad C_{\text{buy}}(u) = \frac{u}{\eta_m}\,m\,r.$$
Self-managing carries a large fixed term $T$ and a cheaper marginal rate; buying carries no fixed term and a marginal rate inflated by the markup $m$ but deflated by the better utilization $\eta_m$. Setting $C_{\text{self}}(u^\star) = C_{\text{buy}}(u^\star)$ and solving for the crossover gives
$$u^\star = \frac{T}{r\left(\dfrac{m}{\eta_m} - \dfrac{1}{\eta_s}\right)},$$
which exists only when the bracket is positive, that is, when the managed marginal rate genuinely exceeds the self-managed one; otherwise buying is cheaper at every scale and there is no crossover. Below $u^\star$ the fixed platform team dominates and buying wins; above it the markup accumulates with scale and building wins. The program below evaluates both curves on a realistic parameter set and reports the crossover.
# Build-vs-buy break-even: self-managed cluster vs managed platform (TCO/year).
raw_hour = 12.0        # $/hr of raw accelerator capacity
eng_salary = 220_000   # fully-loaded $/yr per platform engineer
team_size = 2.0        # engineers to run scheduling/scaling/storage (33.1-33.8)
util_self = 0.55       # self-managed utilization (idle reserved capacity)
markup = 1.6           # managed $/hr = markup * raw (the "buy" premium)
util_managed = 0.80    # provider autoscaling + bin-packing runs tighter
hours_year = 24 * 365


def tco_self(u):
    return u / util_self * raw_hour + eng_salary * team_size


def tco_buy(u):
    return u / util_managed * (markup * raw_hour)


print(f"{'useful acc-hr/yr':>16} | {'self-managed $':>16} | {'managed $':>14} | cheaper")
print("-" * 70)
for u in [2_000, 10_000, 50_000, 150_000, 400_000, 1_000_000, 3_000_000]:
    s, b = tco_self(u), tco_buy(u)
    print(f"{u:>16,} | {s:>16,.0f} | {b:>14,.0f} | {'managed' if b < s else 'self'}")

# Analytic crossover u* where the two totals meet.
coef = raw_hour * (markup / util_managed - 1.0 / util_self)  # marginal rate gap
u_star = (eng_salary * team_size) / coef
print("-" * 70)
print(f"break-even useful acc-hr/yr : {u_star:,.0f}")
print(f" ... about {u_star / hours_year:,.1f} accelerators running flat-out all year")
tco_self carries the fixed platform-team term $T$; tco_buy carries only the marked-up marginal rate. The analytic crossover u_star is the $u^\star$ of the equation, the useful accelerator-hours per year at which the two costs meet.
useful acc-hr/yr |   self-managed $ |      managed $ | cheaper
----------------------------------------------------------------------
           2,000 |          483,636 |         48,000 | managed
          10,000 |          658,182 |        240,000 | managed
          50,000 |        1,530,909 |      1,200,000 | managed
         150,000 |        3,712,727 |      3,600,000 | managed
         400,000 |        9,167,273 |      9,600,000 | self
       1,000,000 |       22,258,182 |     24,000,000 | self
       3,000,000 |       65,894,545 |     72,000,000 | self
----------------------------------------------------------------------
break-even useful acc-hr/yr : 201,667
 ... about 23.0 accelerators running flat-out all year
The threshold is not a universal constant; it moves with every parameter. A cheaper or smaller platform team (a lower $T$) pushes the crossover down and favors building sooner. A larger markup also pushes it down, because it widens the gap between the managed and self-managed marginal rates. A wider utilization advantage for the provider narrows that gap, pushes the crossover up, and favors buying longer. The honest use of this calculation is not to memorize $201{,}667$ but to plug in your own numbers, because the structure is robust even when the constants are not: a fixed people cost that buying eliminates, against a marginal markup that building eliminates, crossing at a scale you can compute. A team well below its crossover that builds anyway is paying salaries to save pennies; a team well above it that buys anyway is paying a markup on an enormous base to avoid a hire it should have made.
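The direction in which each parameter moves the crossover can be read off the formula by differentiating it. In the short derivation below, $B$ is shorthand introduced here for the bracket $m/\eta_m - 1/\eta_s$, and the signs assume $B > 0$ so that a crossover exists:

$$\begin{aligned}
u^\star &= \frac{T}{rB}, \qquad B = \frac{m}{\eta_m} - \frac{1}{\eta_s} > 0,\\
\frac{\partial u^\star}{\partial T} &= \frac{1}{rB} > 0,\\
\frac{\partial u^\star}{\partial m} &= -\frac{T}{rB^2}\cdot\frac{1}{\eta_m} < 0,\\
\frac{\partial u^\star}{\partial \eta_m} &= -\frac{T}{rB^2}\cdot\left(-\frac{m}{\eta_m^2}\right) = \frac{T\,m}{rB^2\,\eta_m^2} > 0,\\
\frac{\partial u^\star}{\partial \eta_s} &= -\frac{T}{rB^2}\cdot\frac{1}{\eta_s^2} < 0.
\end{aligned}$$

A costlier platform team or a better-utilized provider raises the crossover and extends the range in which buying wins; a higher markup or a better-utilized self-managed cluster lowers it. Because $B$ appears squared in the denominator, the crossover becomes very sensitive to $m$, $\eta_m$, and $\eta_s$ as the two marginal rates approach each other.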
This chapter built the cluster as the substrate on which every distributed method in the book runs: the scheduler that admits a gang so a data-parallel job (Chapter 15) can start its all-reduce, the topology placement that keeps a collective fast (Chapter 4), the spot logic that lets training survive preemption (Chapter 18), and the fair sharing that lets many teams coexist. A managed platform does not change the thesis; it relocates the boundary of who operates the substrate. The scale-out reasoning of this book is exactly as necessary when you buy the cluster as when you build it, because the platform's knobs are this chapter's mechanisms wearing an API. Distribution is forced by a ceiling whether you run the scheduler or rent it; the platform only decides whose pager rings when the gang fails to admit.
The managed-platform frontier is pushing the build-versus-buy line still higher, toward not provisioning anything at all. Serverless inference offerings now scale endpoints to zero between requests and bill per token or per second, so the autoscaling control loop the schedulers of Section 33.4 introduced disappears entirely from the user's view; the open question is whether cold-start latency can be hidden well enough for interactive workloads, and systems work on snapshot-and-restore fast-start is active. Still on the serving side, managed disaggregated serving in the lineage of the prefill-decode split (DistServe and Splitwise, 2024) is moving from research into platform features, letting a provider place the compute-bound and memory-bound phases of Chapter 24 on different managed hardware automatically. A parallel thread is managed multi-cloud and spot-arbitrage schedulers (in the lineage of SkyPilot) that bid Section 33.8's spot markets across providers at once, treating the entire public cloud as one elastic cluster and shrinking the utilization gap that drives the break-even of Output 33.9.3. The constant across all three is that the platform keeps absorbing mechanisms this chapter built by hand, and the buyer's job keeps shifting from operating the substrate to reasoning about its economics.
6. Chapter Summary: The Cluster as Substrate Beginner
This section closes Chapter 33, and the chapter has one spine worth stating plainly. A distributed AI workload does not run on machines; it runs on a cluster, and the cluster is a substrate that must be scheduled, placed, shared, and paid for before any all-reduce can fire. We began with the anatomy of the cluster and the compute that fills it (Sections 33.1 and 33.2): how a node is wired, how an accelerator imposes its ceilings, and how a control plane turns submitted work into placements. Containers and Kubernetes followed (Section 33.3), making a heterogeneous cluster addressable, and the batch schedulers of Section 33.4 (Slurm, Kubernetes batch, Volcano) showed how a job declares what it needs and how a scheduler matches that declaration to nodes. Gang scheduling and collective-aware placement (Section 33.5) confronted the demand that makes AI distinctive: a data-parallel or sharded job is all-or-nothing, so the scheduler must admit the whole gang or none of it, and it must land that gang where the collectives of Chapter 4 stay cheap. Multi-tenant GPU sharing (Section 33.6) packed many small jobs onto one device through MIG, MPS, and time-slicing; Ray (Section 33.7) gave a Python-native substrate for fine-grained work; and spot and preemptible scheduling (Section 33.8) made the cluster cheap and unreliable at once, tying the scheduler to the elastic, fault-tolerant training of Chapter 18. This final section folded all of it into the build-versus-buy choice: every mechanism the chapter built can be bought as a managed service, and the break-even of Output 33.9.3 tells you which side of the trade your scale puts you on.
The cluster is the substrate every distributed method runs on, and scheduling it well is six decisions. (1) The resource model and scheduler decide what a job asks for and where it lands. (2) Gang scheduling admits a distributed job atomically so no worker idles on a barrier. (3) Topology-aware placement keeps the gang's collectives fast by respecting the interconnect. (4) Spot and preemption trade reliability for cost and demand checkpoint-and-resume survival. (5) GPU partitioning and fair sharing pack many jobs and many teams onto shared hardware without starvation. (6) Managed platforms package all five behind an API, and the build-versus-buy break-even (at $u^\star = T / [r(m/\eta_m - 1/\eta_s)]$ useful accelerator-hours per year) decides whether you operate the substrate or rent it. Know the mechanisms either way: a bought cluster is this chapter's knobs wearing a different name.
Each prompt is sized to a weekend and exercises one mechanism from this chapter on hardware you already have or can rent cheaply. Carry one all the way through and it becomes a strong capstone seed for Chapter 41.
1. A gang-aware mini-scheduler. Write a simulator that receives a stream of jobs, each requesting either one node or a gang of $k$ nodes, against a fixed pool. Implement first-fit placement first, observe distributed jobs starving while their partial gangs idle, then add gang admission (reserve all $k$ or none) and measure the change in idle-node-hours and average gang wait. Report the trade between utilization and gang latency.
2. A spot-preemption survival simulator. Model a training run on $n$ spot nodes where each node is independently reclaimed with hazard rate $\lambda$ per hour, and a checkpoint costs $c$ minutes taken every $\tau$ minutes. Simulate many runs, sweep the checkpoint interval $\tau$, and find the interval that minimizes expected wall-clock to a target step count. Connect the optimum back to the elastic-training logic of Chapter 18.
3. A MIG-packing benchmark. If you have a partitionable accelerator, split it into several instances and benchmark the aggregate throughput of many small inference jobs packed across the partitions against one large job using the whole device. Find the job size at which partitioning stops helping, and explain the result with the per-node economics of Chapter 22.
4. Your own break-even. Take Code 33.9.3, replace every constant with quotes for your actual hardware, a real platform-engineer salary in your market, and a managed price list from one of the three platforms in this section, and compute your true $u^\star$. Then estimate your project's real useful accelerator-hours per year and state, with the number, which side of the line you are on.
For each of the following managed-platform features, name the self-managed mechanism from Sections 33.1 to 33.8 that it automates, and state one thing you lose by letting the platform decide it: (a) Databricks autoscaling between a minimum and maximum worker count; (b) a SageMaker training job with use_spot_instances=True and a checkpoint_s3_uri; (c) a Vertex endpoint with traffic splitting across two model versions; (d) a managed pipeline that runs four containerized steps with tracked dependencies. For at least two, describe a workload whose defaults the platform would get wrong, and what you would do about it.
Extend Code 33.9.3 into a function break_even(raw_hour, eng_salary, team_size, util_self, markup, util_managed) that returns $u^\star$. First reproduce the $201{,}667$-hour crossover, then answer with numbers: (a) how far does $u^\star$ move if the platform team shrinks from two engineers to one; (b) how far if the managed markup rises from $1.6$ to $2.2$; (c) at what managed utilization $\eta_m$ does the bracket $m/\eta_m - 1/\eta_s$ go non-positive, meaning managed is cheaper at every scale and no break-even exists. Explain in two sentences which single parameter your own organization could most realistically change to shift the decision.
Pick one real system you know or have read about and decide where it should draw the build-versus-buy line of Figure 33.9.1. Estimate its useful accelerator-hours per year, compare against a plausible $u^\star$ from Exercise 33.9.2, and name the one mechanism from Sections 33.1 to 33.8 you would self-manage even if you bought everything else (for example, keeping topology placement in-house for a latency-critical collective). Argue from the break-even and from the control-cost-lock-in triple why a fully managed or fully self-managed choice would be worse than your hybrid boundary.