"They trained me once, deployed me everywhere, and then acted surprised when the world kept moving without asking my permission."
A Model Drifting Quietly in Production
Everything the previous twenty-five chapters built (the distributed data pipelines, the parallel training jobs, the sharded models, and the serving fleets) has to keep running in production for months without a human watching every node. MLOps is the operational discipline that makes that possible. A model is not a binary you ship once; it is a living function of data, code, and configuration that starts to decay the moment the world it was trained on changes. Operating it at fleet scale means treating the whole machine learning lifecycle, from raw data to a monitored prediction and back to a retrained model, as one continuous, automated loop spanning thousands of machines. This section frames that loop, names what is genuinely harder about closing it at scale, and previews how the rest of the chapter builds each stage. By the end you will see why a deployed model is a process to be operated, not an artifact to be archived.
For most of this book the goal has been to make one stage of the machine learning lifecycle work across many machines: process the data in Part II, train the model in Parts III and IV, serve it from a fleet in Chapters 23 to 25. Each of those was treated as a problem you solve once and that then stays solved. In production none of them stays solved. Data arrives continuously and its distribution shifts; a model that was accurate at launch grows stale; a new model version must be rolled out across the fleet without dropping requests; a bad release must be detected and reverted before it harms users. The discipline that keeps all of these stages running, connected, and automated is machine learning operations, or MLOps, and at the scale this book cares about it is itself a distributed-systems problem.
The defining idea is that the lifecycle is a loop, not a line. Data feeds training, training produces a versioned model, the model is deployed to the serving fleet, the fleet is monitored, and the monitoring signal decides when to gather fresh data and retrain, closing the circle. Figure 26.1.1 lays this loop over the distributed stack so you can see which part of the book owns each stage. The whole of this chapter is a walk around that loop, one stage per section.
1. The Lifecycle Is a Loop, Not a Handoff (Beginner)
A useful way to read Figure 26.1.1 is to follow a single training example all the way around. It enters through the data pipeline, where it is ingested, cleaned, validated, and turned into features, the distributed-data machinery of Chapter 8 extended into a recurring production schedule in Section 26.2. It then contributes to a gradient inside a distributed training job, the exact computation proved correct back in Section 1.1. The trained weights are stamped with a version and written to a model registry, the subject of Section 26.3, so that any prediction can later be traced back to the precise model that produced it. The registered model is deployed onto the serving fleet of Chapter 24, where it answers live requests while the monitoring layer watches its inputs and outputs. The instant those inputs start to look unlike the training data, the loop must close: gather fresh data and train again.
The reason this must be a loop rather than a one-time handoff is that the deployed model is the only fixed thing in a world that keeps changing. Users behave differently, upstream data sources shift their formats, adversaries adapt, and seasons turn. A model trained on last quarter's distribution is, by next quarter, answering questions about a world it never saw. The monitoring-to-retrain edge in Figure 26.1.1 is what converts this slow decay from a silent failure into an automatic correction. A team that ships a model and walks away has not deployed a system; it has scheduled an outage.
The file containing your trained weights is the least interesting part of a production AI system. What keeps the system useful is the loop around that file: the pipeline that keeps feeding it fresh data, the registry that lets you say exactly which weights answered which request, the monitoring that notices when the weights no longer fit the world, and the automation that retrains and redeploys before anyone files a complaint. Operate the loop, not the file. The moment you treat the model as a finished artifact to archive rather than a running process to supervise, decay begins and no one is watching for it.
2. What Changes at Fleet Scale (Intermediate)
Every team that ships software runs some version of a build-deploy-monitor loop, so it is fair to ask what is genuinely new here. Three things change when the loop carries machine learning instead of ordinary code, and each one is amplified by distribution.
The first is that the artifacts are enormous. A web application deploys a few megabytes of code; a foundation model is tens or hundreds of gigabytes of weights, and the dataset behind it is measured in terabytes or petabytes. You cannot email a model around or diff two datasets by eye. Versioning, transferring, and reproducing these artifacts is a storage and bandwidth problem in its own right, which is why model registries and data versioning, rather than a git repository alone, sit at the center of the loop.
The second is that the systems are distributed end to end. The data pipeline runs on a cluster, training spans many accelerators, and serving spans a fleet, so monitoring and deployment are not actions on one box but coordinated operations across thousands of nodes that fail independently, the same reliability reality that Chapter 2 introduced and that haunts every later part. Rolling out a new model version without dropping a single request is a distributed-coordination problem, and observing the health of a prediction service means aggregating signals from every replica at once.
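One way to picture the rollout problem is a rolling update: replicas are taken out of service in small batches, upgraded, and returned, so a minimum number keep answering requests at every step. The sketch below is a single-process simulation under stated assumptions, not the behavior of any particular orchestrator; `Replica`, `rolling_update`, and `max_unavailable` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    version: str
    serving: bool = True

def rolling_update(fleet, new_version, max_unavailable=2):
    """Upgrade replicas in batches; return the lowest serving count observed."""
    min_serving = sum(r.serving for r in fleet)
    for start in range(0, len(fleet), max_unavailable):
        batch = fleet[start:start + max_unavailable]
        for r in batch:              # drain: stop routing traffic to this batch
            r.serving = False
        min_serving = min(min_serving, sum(r.serving for r in fleet))
        for r in batch:              # swap the weights, then rejoin the pool
            r.version = new_version
            r.serving = True
    return min_serving

# Hypothetical fleet of ten replicas, all starting on v1.
fleet = [Replica(f"replica-{i}", "v1") for i in range(10)]
lowest = rolling_update(fleet, "v2", max_unavailable=2)
print(f"all on v2: {all(r.version == 'v2' for r in fleet)}, "
      f"lowest serving count: {lowest}/10")
```

With ten replicas and batches of two, the serving count never falls below eight. A real fleet adds what this sketch leaves out: health checks before a replica rejoins, in-flight request draining, and independent node failures in the middle of the rollout.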
The third change is the most conceptual, and it reframes everything. In ordinary software the deployed behavior is a function of the code alone, so versioning the code captures the system. A model is a function of three inputs at once: the data it was trained on, the code that defined the training, and the configuration (hyperparameters, random seeds, feature definitions) that parameterized the run. Writing this as a reproducibility relation,
$$\text{model} = f(\text{data},\ \text{code},\ \text{config}),$$
makes the obligation explicit: to reproduce or audit a model, all three arguments must be pinned, not just the code. Pin only the code and a rerun on shifted data or a different seed yields different weights, and you can no longer explain why a given prediction happened. This is the same reproducibility discipline that Section 5.7 established for evaluation and that the data-versioning machinery of Section 8.9 makes practical for terabyte datasets; MLOps is where the two are joined so that every deployed model carries a complete, replayable provenance record.
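A minimal sketch of what "pinning all three arguments" can look like in practice: hash the data snapshot identifier, the code commit, and the configuration together into one provenance ID, so that a change to any leg changes the ID. The `fingerprint` function, the commit string, and the snapshot digests below are hypothetical stand-ins, not part of any registry's API.

```python
import hashlib
import json

def fingerprint(data_digest: str, code_commit: str, config: dict) -> str:
    """Hash all three arguments of f(data, code, config) into one short ID."""
    record = {"data": data_digest, "code": code_commit, "config": config}
    canonical = json.dumps(record, sort_keys=True)   # stable key order
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical stand-ins for a dataset snapshot digest and a git commit.
data_v1 = hashlib.sha256(b"training snapshot, October").hexdigest()
data_v2 = hashlib.sha256(b"training snapshot, November").hexdigest()
commit = "3f9c2ab"
config = {"lr": 3e-4, "seed": 7}

base = fingerprint(data_v1, commit, config)
same = fingerprint(data_v1, commit, {"seed": 7, "lr": 3e-4})   # key order differs
new_data = fingerprint(data_v2, commit, config)
new_seed = fingerprint(data_v1, commit, {**config, "seed": 8})

print(base == same)        # True: identical inputs give the identical ID
print(base != new_data)    # True: same code, different data
print(base != new_seed)    # True: same code and data, different seed
```

The last two lines are the point of the relation: the commit is identical in all four calls, yet the provenance ID changes as soon as the data or the seed does.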
Pin the code but forget the data and config, and your "reproducible" model is a three-legged stool missing two legs. It looks like furniture right up until someone sits on it during an audit. Teams discover the missing legs at the worst possible moment: when a regulator, a customer, or a postmortem asks "why did the model say that?" and the only honest answer is "we are not entirely sure."
3. MLOps Versus DevOps, and the LLMOps Additions (Beginner)
It helps to place MLOps next to the DevOps practice it grew out of. DevOps automates the build, test, deployment, and monitoring of code, and MLOps inherits all of that machinery. The decisive difference is that in MLOps, data and models are first-class entities with their own versioning, testing, and lineage, not just inputs to a code pipeline. A DevOps test asks "does the code do what the test expects?"; an MLOps pipeline must also ask "is the incoming data within the distribution the model expects?" and "is the new model better than the one in production?", questions that have no analog in pure software delivery. Table 26.1.1 lays the contrast out concept by concept.
| Concern | DevOps | MLOps adds | Developed in |
|---|---|---|---|
| Versioned unit | Code | Code + data + config + model weights | §26.3 |
| Test gate | Unit and integration tests | Data validation and model-quality evaluation | §26.4 |
| What you track | Builds and releases | Experiments, metrics, and lineage | §26.5 |
| Monitoring target | Latency, errors, saturation | + input drift and prediction quality | §26.6, §26.7 |
| Trigger to redeploy | A code commit | + a drift signal or quality regression | §26.7 |
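The "Test gate" row of the table can be made concrete with a small sketch. It assumes a deliberately simple data check (is the batch mean within a few standard errors of the expected mean?) and a simple quality check (does the candidate beat production by a minimum margin?); the function names, the tolerance, and `min_gain` are all hypothetical choices for illustration, and Section 26.4 develops the real gate.

```python
def data_in_range(batch, expected_mean, expected_std, tolerance=3.0):
    """Data validation: is the batch mean within `tolerance` standard errors?"""
    n = len(batch)
    mean = sum(batch) / n
    stderr = expected_std / n ** 0.5
    return abs(mean - expected_mean) <= tolerance * stderr

def better_than_production(candidate_loss, production_loss, min_gain=0.005):
    """Model-quality gate: require a real improvement, not a tie."""
    return production_loss - candidate_loss >= min_gain

def promotion_gate(batch, expected_mean, expected_std,
                   candidate_loss, production_loss):
    """Return (passed, per-check results); every check must pass to promote."""
    checks = {
        "data_valid": data_in_range(batch, expected_mean, expected_std),
        "model_better": better_than_production(candidate_loss, production_loss),
    }
    return all(checks.values()), checks

batch = [0.1, -0.2, 0.05, 0.3, -0.15, 0.0, 0.2, -0.1]
ok, checks = promotion_gate(batch, expected_mean=0.0, expected_std=1.0,
                            candidate_loss=0.170, production_loss=0.182)
print(ok, checks)

shifted = [x + 2.0 for x in batch]     # simulated upstream shift
ok_shifted, checks_shifted = promotion_gate(shifted, 0.0, 1.0, 0.170, 0.182)
print(ok_shifted, checks_shifted)
```

The first call passes both checks; the second fails on `data_valid` even though the candidate model is better, which is exactly the kind of question a code-only test suite never asks.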
When the model in question is a large language model, the loop grows three further responsibilities that the rest of this chapter weaves in, sometimes called LLMOps. Prompts become a versioned artifact in their own right, because the same weights behave very differently under different instructions, so a prompt change is a deployment that must be tracked exactly like a code change. Evaluation shifts from a single accuracy number to a suite of behavioral evals, since "is this generation good?" has no closed-form answer and must be judged by held-out tests, model graders, and human review. And guardrails, the filters that catch unsafe or off-policy outputs, become a monitored, updatable part of the serving path rather than an afterthought. These additions do not replace the loop of Figure 26.1.1; they thicken its monitoring and registry stages, and we flag them as they arise.
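To illustrate "a prompt change is a deployment," the sketch below assigns a new version whenever the prompt text changes and records that version next to the model version. It is a hypothetical in-memory registry (`PromptRegistry` and the example prompts are invented for illustration), not the interface of any LLMOps tool.

```python
import hashlib

class PromptRegistry:
    """Hypothetical in-memory store: each distinct prompt text gets a version."""
    def __init__(self):
        self._versions = []          # list of (version, digest, text)

    def register(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:10]
        for version, known, _ in self._versions:
            if known == digest:      # unchanged prompt: reuse its version
                return version
        version = f"p{len(self._versions) + 1}"
        self._versions.append((version, digest, text))
        return version

prompts = PromptRegistry()
a = prompts.register("You are a concise shopping assistant.")
b = prompts.register("You are a concise shopping assistant.")
c = prompts.register("You are a concise shopping assistant. Never discuss prices.")

# Each served generation would be logged against both identities.
deployment = {"model_version": "v3", "prompt_version": c}
print(a, b, c, deployment)
```

Re-registering identical text returns `p1` again, while the edited prompt becomes `p2`, so the pair (`v3`, `p2`) identifies the deployed behavior even though the weights did not change.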
You will not build the lifecycle loop from scratch in production. Two open-source platforms package most of it. MLflow gives you experiment tracking, a model registry with stage transitions (staging, production, archived; recent MLflow releases favor version aliases for the same purpose), and a packaging format, so registering and promoting a versioned model is a few API calls rather than a home-grown service. Kubeflow runs the pipeline itself on Kubernetes, expressing the data-to-train-to-deploy stages of Figure 26.1.1 as a directed graph of containerized steps that the cluster schedules and retries for you. A minimal registration with MLflow is about three calls:
```python
import mlflow

# `model` is assumed to be an already-fitted scikit-learn estimator.
# `name=` is the MLflow 3.x argument; MLflow 2.x spells it `artifact_path=`.
with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "seed": 7})      # the 'config' leg of f(data, code, config)
    mlflow.log_metric("val_loss", 0.182)            # what experiment tracking (§26.5) records
    mlflow.sklearn.log_model(model, name="ranker",  # write the weights to the registry (§26.3)
                             registered_model_name="ranker")
```
4. The Loop in Code: A Drift Signal Triggers a Retrain (Intermediate)
The cleanest way to feel why the lifecycle must close is to run it. The program below is a deliberately tiny but complete model of Figure 26.1.1 in pure Python. The "model" is a single number, the mean its training data was centered on, which is enough to make drift visible. Each cycle the program serves the deployed model, lets the world's true data center move on (the world does not wait for us), then monitors the gap between what the model was trained for and what the fleet now sees. When that gap crosses a drift budget, the monitor fires a retrain trigger that gathers fresh data, trains a new model, registers a new version, and redeploys it, exactly the dashed red edge in the figure.
```python
import random

random.seed(7)


class Registry:                      # the model registry of §26.3
    def __init__(self):
        self.versions = []

    def register(self, model):
        v = f"v{len(self.versions) + 1}"
        self.versions.append((v, model))
        return v


def collect_batch(center):
    """One window of fleet data. 'center' drifts over time as the world changes."""
    return [random.gauss(center, 1.0) for _ in range(2000)]


def train(batch):
    """Fit the model: its parameter is the mean the data was centered on."""
    return sum(batch) / len(batch)


def deploy(version, model):
    print(f" deploy : serving {version} (mu={model:+.3f}) across the fleet")


def monitor(model, live_center, threshold):
    """Compare what serving sees now against what the model was trained for."""
    live = collect_batch(live_center)
    live_mean = sum(live) / len(live)
    drift = abs(live_mean - model)
    flag = "DRIFT" if drift > threshold else "ok"
    print(f" monitor : trained mu={model:+.3f} live mu={live_mean:+.3f} "
          f"|delta|={drift:.3f} [{flag}]")
    return drift > threshold


reg = Registry()
world_center = 0.0                   # the true data distribution the fleet observes
threshold = 0.25                     # drift budget before retrain is triggered
deployed = None

for cycle in range(1, 6):
    print(f"cycle {cycle}")
    if deployed is None:             # first turn around the loop: train and deploy
        batch = collect_batch(world_center)
        ver = reg.register(train(batch))
        deploy(ver, reg.versions[-1][1])
        deployed = reg.versions[-1][1]
    world_center += 0.18             # the world keeps moving; the model stays fixed
    if monitor(deployed, world_center, threshold):
        print(" TRIGGER : drift exceeds budget -> launch retrain on fresh data")
        ver = reg.register(train(collect_batch(world_center)))
        deploy(ver, reg.versions[-1][1])
        deployed = reg.versions[-1][1]
    print()

print(f"registry : {len(reg.versions)} model versions, "
      f"latest = {reg.versions[-1][0]} (mu={reg.versions[-1][1]:+.3f})")
```
```text
cycle 1
 deploy : serving v1 (mu=+0.017) across the fleet
 monitor : trained mu=+0.017 live mu=+0.163 |delta|=0.146 [ok]

cycle 2
 monitor : trained mu=+0.017 live mu=+0.381 |delta|=0.364 [DRIFT]
 TRIGGER : drift exceeds budget -> launch retrain on fresh data
 deploy : serving v2 (mu=+0.378) across the fleet

cycle 3
 monitor : trained mu=+0.378 live mu=+0.524 |delta|=0.145 [ok]

cycle 4
 monitor : trained mu=+0.378 live mu=+0.719 |delta|=0.340 [DRIFT]
 TRIGGER : drift exceeds budget -> launch retrain on fresh data
 deploy : serving v3 (mu=+0.735) across the fleet

cycle 5
 monitor : trained mu=+0.735 live mu=+0.912 |delta|=0.176 [ok]

registry : 3 model versions, latest = v3 (mu=+0.735)
```
Read the output as a story about the feedback edge. In cycle 1 the freshly trained model fits the world it was trained on, and the drift is small. By cycle 2 the world center has moved twice and the gap crosses the budget, so the trigger fires and version 2 is born already aligned to the new center. The same thing happens again at cycle 4. The registry ends with three versions, a complete deployment history, which is exactly what Section 26.3 argues every fleet needs so that any past prediction can be traced to the model that made it. Delete the monitor-and-trigger block and the deployed mean stays at $+0.017$ while the live mean marches off toward $+0.9$: a model growing quietly, catastrophically wrong with no alarm. That silent divergence is the failure MLOps exists to prevent, and Section 26.7 develops the drift detection that makes the trigger statistically sound rather than a hand-tuned threshold.
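The counterfactual in the paragraph above can be checked with a few lines. The sketch below is a noise-free simplification of the program (an assumption made so the numbers are exact rather than sampled): the world center moves by the same 0.18 per cycle, and the hypothetical `run` function either applies the 0.25 drift budget or ignores it.

```python
def run(cycles, step, threshold, retrain_enabled):
    """Noise-free loop: return the final |live - deployed| gap and retrain count."""
    world = 0.0
    deployed = world                 # v1 is trained on the initial world
    retrains = 0
    for _ in range(cycles):
        world += step                # the world moves every cycle
        if retrain_enabled and abs(world - deployed) > threshold:
            deployed = world         # retrain on fresh data and redeploy
            retrains += 1
    return abs(world - deployed), retrains

gap_off, n_off = run(5, 0.18, 0.25, retrain_enabled=False)
gap_on, n_on = run(5, 0.18, 0.25, retrain_enabled=True)
print(f"no trigger  : gap={gap_off:.2f} retrains={n_off}")   # gap=0.90 retrains=0
print(f"with trigger: gap={gap_on:.2f} retrains={n_on}")     # gap=0.18 retrains=2
```

With the trigger the gap never stays above the budget and two retrains fire, matching the two new registry versions in the output above; without it the gap grows by 0.18 every cycle and nothing ever signals it.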
Who: An ML platform engineer responsible for the recommendation models at a large online retailer.
Situation: A click-through model served from a fleet of hundreds of replicas, retrained on a fixed monthly schedule regardless of what the data was doing.
Problem: In late November the shopping distribution shifted hard toward holiday gifting, but the next scheduled retrain was three weeks away, so the live model kept recommending as if it were still autumn.
Dilemma: Keep the simple monthly cron, which is predictable and cheap but blind to the calendar, or move to drift-triggered retraining, which catches shifts early but costs unscheduled distributed training runs and demands trustworthy monitoring.
Decision: They added a drift trigger on the input feature distribution, keeping the monthly job only as a floor, so retraining fires whenever the world moves enough to matter, exactly the edge in Code 26.1.2.
How: The serving fleet logged feature statistics to a monitoring store; a job compared the live distribution against the training distribution each hour and, on a sustained exceedance, launched a distributed retrain and a registry promotion behind a canary.
Result: The holiday shift was caught within hours instead of weeks, recommendation quality held through the peak, and the team retired the habit of "we will catch it at the next scheduled run."
Lesson: A schedule retrains on the calendar; a loop retrains on reality. When the data can move faster than your cron, the trigger has to come from monitoring, not from a clock.
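The hourly comparison in the "How" step needs some distance between the training distribution and the live one. The case study does not say which statistic the team used; the sketch below swaps in the population stability index (PSI) over fixed bins as one common choice, on simulated Gaussian features. The bin edges, sample sizes, and the size of the simulated shift are assumptions for illustration.

```python
import math
import random

def psi(expected, actual, edges):
    """Population stability index over fixed bin edges (higher = more shift)."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1   # index of the bin v falls in
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # training-time feature
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]      # live, no shift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]   # live, simulated holiday shift

edges = [-1.5, -0.5, 0.5, 1.5]
print(f"same distribution   : PSI={psi(train, same, edges):.3f}")
print(f"shifted distribution: PSI={psi(train, shifted, edges):.3f}")
```

The unshifted window scores close to zero and the shifted one scores far higher. A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but that cutoff is a convention rather than a guarantee, which is why Section 26.7 treats the choice of detector and threshold rigorously.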
5. A Map of the Chapter (Beginner)
The remaining sections walk the loop of Figure 26.1.1 stage by stage, then handle the operational realities of running it. Section 26.2 turns the data pipeline into a recurring, distributed production schedule that feeds training continuously. Section 26.3 builds the model registry and the versioning that gives every deployed model a traceable identity. Section 26.4 develops CI/CD for distributed ML, where the test gate includes data validation and model quality, not just code tests. Section 26.5 covers distributed experiment tracking, the lineage record that makes $f(\text{data},\ \text{code},\ \text{config})$ replayable. Section 26.6 builds fleet-wide monitoring and observability across thousands of serving nodes, and Section 26.7 makes the drift detection behind the retrain trigger rigorous. Section 26.8 covers A/B testing and shadow deployment at scale, the safe way to compare a new model against the incumbent on live traffic. Section 26.9 closes with rollbacks and incident response, what to do when a release goes wrong across a fleet. Together they turn the one-paragraph loop of this section into an operable system.
MLOps is not a new layer bolted on top of the book; it is the operational supervisor of everything that came before, and it inherits their distributed character wholesale. The data stage runs on the storage and loading machinery of Chapter 8; the training stage is the data-parallel all-reduce of Chapter 15; the serving stage is the distributed inference fleet of Chapter 24; the monitoring and trigger stages demand the fault-tolerance thinking of Chapter 2. The contribution of this chapter to the book's spine is the realization that distributing each stage is necessary but not sufficient: at fleet scale you also have to distribute the supervision of the loop that connects them, and keep it running without a human in the inner loop.
The operational loop is being stretched in two directions by current work. The first is LLMOps: as large language models moved into production, the community built tooling for the prompt-eval-guardrail additions named above, with frameworks such as LangSmith for tracing and evaluating prompt chains, and a rapidly maturing literature on LLM-as-judge evaluation and online guardrailing (for example the Llama Guard line of input/output classifiers, Inan et al., 2023, and its 2024 successors). The second, newer frontier is agent operations, sometimes "AgentOps": when the deployed system is not a single model but a multi-step agent that calls tools, retrieves context, and plans (the systems of Chapter 32), monitoring must follow a whole trajectory of decisions rather than one prediction, and "drift" can mean a tool whose behavior changed underneath the agent. Work in 2024 to 2026 on agent observability, trajectory-level evaluation, and automated regression suites for agents is extending every stage of Figure 26.1.1 from one-shot predictions to long, branching interactions, and it is among the least settled areas in the whole field.
We now have the frame. The lifecycle is a closed loop; what is hard about closing it at scale is the size of the artifacts, the distribution of the systems, and the three-way reproducibility of $f(\text{data},\ \text{code},\ \text{config})$; and MLOps is the discipline that automates the loop where DevOps would only automate the code. Section 26.2 starts the walk where the loop itself starts, at the data, building the distributed data and training pipelines that keep the whole circle fed.
A colleague says their training is "fully reproducible" because the training script lives in git and every commit is tagged. Using the relation $\text{model} = f(\text{data},\ \text{code},\ \text{config})$, describe two concrete situations in which rerunning the exact tagged commit nonetheless produces a different deployed model. For each, name what additional artifact would have to be versioned (tie your answer to the data versioning of Section 8.9 and the reproducibility discipline of Section 5.7), and explain why pinning the code alone cannot fix it.
Code 26.1.2 retrains the instant the drift crosses the threshold for a single cycle, which on noisy real data would cause it to retrain on every transient spike. Modify the program so a retrain fires only after the drift has exceeded the budget for two consecutive cycles, and add a second, larger "emergency" threshold that retrains immediately on a single exceedance. Run it with a noisier world (increase the per-cycle drift and add randomness to the step) and report how many retrains each policy triggers. Explain the cost tradeoff between retraining too eagerly and too late, and relate it to the distributed-training expense each trigger incurs.
Consider operating the loop of Figure 26.1.1 for a 40-gigabyte model retrained on a 2-terabyte dataset. Suppose a drift trigger fires and the loop must: read the dataset for training, write the new model to the registry, and replicate the new model to a serving fleet of 500 replicas. Using a network that moves 5 gigabytes per second, estimate the bytes moved and the wall-clock time for each of the three transfers, ignoring compute. Argue from these numbers alone why frequent retraining is expensive at fleet scale, why the replication from the registry to the fleet (not the training read) often dominates when the fleet is large, and what this implies for how aggressive the drift threshold of Section 26.7 should be.