"My tests were green, my linter was happy, my build was reproducible. Then the holdout set looked at me and asked who said I was any good."
A Candidate Model Awaiting Promotion
Continuous integration and delivery for machine learning keeps the familiar shape of software CI/CD, a pipeline triggered by a change that builds, tests, and ships an artifact, but two boxes are swapped: the build is a distributed training job that can run for hours across many machines, and the artifact is a model whose quality is a measured number, not a passing test. Because of that swap, the decisive gate is not "do the tests pass" but "does the new model beat the one in production on a held-out evaluation, and does it clear the safety bar." This section adapts each stage of the pipeline (CI, continuous training, and CD) to that reality, and then builds the evaluation gate in pure Python so you can watch it promote a good model and block two bad ones for two different reasons.
By this point in the chapter you have a fleet that runs models (Section 26.1), pipelines that produce data and trained checkpoints (Section 26.2), and a registry that versions every artifact (Section 26.3). What is still missing is the machinery that decides, automatically and repeatedly, whether a freshly trained checkpoint deserves to replace the one currently serving traffic. That machinery is continuous integration and delivery, and for models it is a different animal than the green-checkmark pipeline a software team is used to. A code change still triggers the pipeline, but the pipeline's central act is to train a candidate on a distributed cluster and then judge it, and judgment of a model is a comparison of measured quality, not a binary pass or fail.
1. The Build Is a Training Job, the Artifact Is a Model Beginner
In software CI/CD a commit triggers a pipeline that compiles the code, runs a test suite, and, if everything is green, publishes a binary. The contract is crisp: the tests are deterministic, they finish in seconds to minutes, and their verdict is binary. Map that pipeline onto machine learning and every one of those properties bends. The compile step becomes a training job that may occupy hundreds of accelerators for hours, the kind of data-parallel run developed in Chapter 15. The test suite becomes an evaluation suite whose verdict is a quality score on held-out data, a continuous quantity rather than a pass or fail. And the published binary becomes a model checkpoint registered with its lineage in the model registry of Section 26.3.
This is why an ML pipeline keeps the ordinary software tests but cannot stop there. On any change to code, data, or configuration, a sensible pipeline first runs the cheap, fast checks: unit tests on the data-processing and model code, schema and distribution validation on the input data (a column changed type, a feature's mean drifted, a label set lost a class), and a training smoke test that runs a handful of steps on a tiny sample to confirm the loss decreases and nothing throws. These cheap stages exist to fail fast, before the pipeline commits an expensive cluster to a full run. Only when they pass, and only when the change warrants it, does the pipeline trigger the full distributed training job whose output is the candidate model.
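The cheap stages described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production validator: the function names `validate_batch` and `smoke_test`, the reference statistics, and the `max_shift` threshold are all assumptions introduced here for the example.

```python
import numpy as np

def validate_batch(X, y, ref_mean, ref_std, n_classes, max_shift=0.5):
    """Return a list of problems found in the batch; an empty list means it passes.
    (Hypothetical helper; the 0.5-std shift threshold is an arbitrary choice.)"""
    problems = []
    # Schema check: a column that changed type often shows up as a non-float array.
    if not np.issubdtype(X.dtype, np.floating):
        problems.append(f"features have dtype {X.dtype}, expected a float type")
    # Distribution check: per-feature mean shift, in units of the reference std.
    shift = np.abs(X.mean(axis=0) - ref_mean) / ref_std
    for j in np.flatnonzero(shift > max_shift):
        problems.append(f"feature {j} mean shifted by {shift[j]:.2f} reference std")
    # Label check: every expected class must still be present.
    present = set(np.unique(y).astype(int).tolist())
    missing = sorted(set(range(n_classes)) - present)
    if missing:
        problems.append(f"label set lost classes {missing}")
    return problems

def smoke_test(X, y, steps=20, lr=0.1):
    """Run a handful of logistic-regression steps; pass if the loss is finite and fell."""
    def loss(w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        eps = 1e-12
        return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
    w = np.zeros(X.shape[1])
    first = loss(w)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)
    last = loss(w)
    return bool(np.isfinite(last) and last < first)

# A clean batch and a deliberately corrupted one.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = (X[:, 0] > 0).astype(float)
ref_mean, ref_std = np.zeros(4), np.ones(4)

X_bad = X.copy()
X_bad[:, 2] += 3.0            # one feature's mean drifts
y_bad = np.zeros_like(y)      # the positive class disappears

print(validate_batch(X, y, ref_mean, ref_std, n_classes=2))
print(validate_batch(X_bad, y_bad, ref_mean, ref_std, n_classes=2))
print(smoke_test(X, y))
```

A pipeline would launch the full training job only when the problem list is empty and the smoke test returns true, so a bad batch costs seconds rather than cluster hours.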
Software CI promotes a build when the tests pass. ML CI promotes a model when it is measurably better than the model already in production and clears explicit quality and safety bars. Passing unit tests and a smoke run is necessary but nowhere near sufficient: a model can be perfectly well-formed, fully reproducible, and lint-clean while being worse than what it would replace. The decisive comparison is always candidate-versus-incumbent on a frozen holdout set, which means your pipeline must hold a reference to the current production model and re-score both, not just inspect the new one in isolation.
2. Continuous Training: Retraining as a Triggered Event Beginner
Software pipelines fire on a commit. Model pipelines have a second, distinctly ML trigger: new data and changing data. A model's quality decays as the world drifts away from its training distribution, so beyond the human-initiated "someone changed the code" path there is an automated "the data changed" path, and wiring that path into the pipeline is what the field calls continuous training. A continuous-training pipeline retrains the model on a schedule (every night on the day's fresh logs), on a volume threshold (whenever a million new labeled examples have accumulated), or on a drift signal raised by the monitoring system, the detector built in Section 26.7. The important design point is that continuous training does not bypass the gate; a drift-triggered retrain produces a candidate that must still beat the incumbent on the holdout before it ships. Automation decides when to train, never whether to deploy.
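The three triggers, and the rule that a trigger never bypasses the gate, can be sketched as follows. Every name here (`TriggerState`, `retrain_reason`, `run_pipeline`) and every threshold is hypothetical; in a real system the drift score would come from the monitoring system and the training and gate functions would be full jobs.

```python
from dataclasses import dataclass

@dataclass
class TriggerState:
    hours_since_last_train: float
    new_labeled_examples: int
    drift_score: float  # assumed: a scalar emitted by the monitoring system

def retrain_reason(s, max_hours=24.0, min_examples=1_000_000, drift_threshold=0.2):
    """Return which trigger fired ('drift', 'volume', 'schedule'), or None."""
    if s.drift_score >= drift_threshold:
        return "drift"
    if s.new_labeled_examples >= min_examples:
        return "volume"
    if s.hours_since_last_train >= max_hours:
        return "schedule"
    return None

def run_pipeline(state, train_fn, gate_fn):
    """The trigger decides WHEN to train; only the gate decides WHETHER to deploy."""
    reason = retrain_reason(state)
    if reason is None:
        return "idle"
    candidate = train_fn()
    verdict = "promoted" if gate_fn(candidate) else "blocked"
    return f"{reason}: {verdict}"

# A drift-triggered retrain whose candidate fails the gate still does not ship.
drifted = TriggerState(hours_since_last_train=2.0, new_labeled_examples=10, drift_score=0.9)
print(run_pipeline(drifted, train_fn=lambda: "candidate", gate_fn=lambda c: False))
```

Keeping `retrain_reason` separate from the gate call makes the design point structural: there is no code path from a trigger to deployment that skips `gate_fn`.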
Continuous training is where the distributed nature of the pipeline bites hardest, because the trigger and the training run live on different timescales and different machines. The drift signal is computed by a streaming monitor watching the live serving fleet (Chapter 23); the retrain it kicks off is a multi-node job scheduled on a training cluster; the resulting candidate is evaluated by yet another job and, if promoted, rolled out across the inference fleet. Each handoff crosses a machine boundary and therefore needs the same care (idempotency, retries, lineage tracking) that the rest of this chapter brings to fleet operations. A continuous-training pipeline that cannot tell you exactly which data snapshot produced which candidate is a pipeline you cannot debug.
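One way to get both idempotent retries and lineage from the same mechanism is to key every training run by a hash of its inputs. The sketch below is an in-memory stand-in, assuming a hypothetical `LineageStore` class; a real pipeline would persist these records in the model registry of Section 26.3.

```python
import hashlib
import json

def run_key(code_commit, data_snapshot, config):
    """Deterministic key for a training run: same inputs always give the same key."""
    payload = json.dumps(
        {"code": code_commit, "data": data_snapshot, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

class LineageStore:
    """Hypothetical in-memory record of which inputs produced which candidate."""
    def __init__(self):
        self._runs = {}

    def submit(self, code_commit, data_snapshot, config, launch):
        """Launch training at most once per distinct (code, data, config) triple."""
        key = run_key(code_commit, data_snapshot, config)
        if key not in self._runs:
            self._runs[key] = {
                "code": code_commit,
                "data": data_snapshot,
                "config": config,
                "candidate": launch(),
            }
        return key, self._runs[key]

# A retried trigger with identical inputs does not start a second job.
launches = []
def fake_training_job():
    launches.append(1)
    return f"candidate-{len(launches)}"

store = LineageStore()
k1, rec1 = store.submit("abc123", "logs-2024-01-01", {"lr": 0.1}, fake_training_job)
k2, rec2 = store.submit("abc123", "logs-2024-01-01", {"lr": 0.1}, fake_training_job)
k3, rec3 = store.submit("abc123", "logs-2024-01-02", {"lr": 0.1}, fake_training_job)
print(k1 == k2, len(launches), rec3["data"])
```

Because the record is stored under the same key that deduplicates retries, answering "which data snapshot produced this candidate" is a dictionary lookup rather than an investigation.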
3. Continuous Delivery: Rolling a Model to the Fleet Intermediate
Once a candidate clears the evaluation gate, continuous delivery packages it and rolls it out to the serving fleet of Chapter 23. For a distributed serving system, packaging means more than copying a checkpoint: it means producing a deployable bundle (the weights, the tokenizer or feature transforms, the runtime image, the serving config) and registering it so every replica in the fleet pulls the identical version. The offline evaluation gate, however good, was measured on a frozen holdout set, and a holdout set is never quite the live traffic. A model that wins offline can still regress online because the live distribution shifted, because latency under real load is worse, or because some slice of users the holdout under-represented gets hurt. So continuous delivery never flips the whole fleet at once.
Instead it uses progressive delivery: the new model first serves a small canary slice of traffic, or runs in shadow mode scoring real requests whose answers are discarded, while online metrics are compared against the incumbent. This is the canary-and-shadow machinery developed in Section 26.8, and it is the online complement to the offline gate. If the canary's live metrics hold, the rollout widens in stages until the new model owns the fleet; if a regression appears, the system automatically rolls back to the previous version, the incident-response reflex of Section 26.9. Offline evaluation gates promotion to a canary; online evaluation gates promotion to the fleet; automatic rollback is the safety net under both.
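The staged widening and automatic rollback can be sketched as a loop over traffic fractions. This is a simplified illustration under stated assumptions: `online_metric` is a hypothetical callable standing in for a live metrics query, the stage fractions and tolerance are arbitrary, and higher metric values are taken to be better.

```python
def progressive_rollout(online_metric, baseline_metric,
                        stages=(0.01, 0.10, 0.50, 1.00), tolerance=0.01):
    """Widen traffic stage by stage; roll back on the first online regression.

    online_metric(fraction) -> the candidate's live metric at that traffic share.
    Returns (status, final_traffic_fraction, history).
    """
    history = []
    for fraction in stages:
        observed = online_metric(fraction)
        history.append((fraction, observed))
        if observed < baseline_metric - tolerance:
            # Regression: the incumbent takes back all traffic.
            return "rolled_back", 0.0, history
    return "fully_rolled_out", stages[-1], history

# A candidate that holds up at every stage.
healthy = lambda fraction: 0.81
# A candidate that looks fine as a canary but regresses under wider load.
fragile = lambda fraction: 0.81 if fraction < 0.5 else 0.70

print(progressive_rollout(healthy, baseline_metric=0.80))
print(progressive_rollout(fragile, baseline_metric=0.80))
```

The second call stops at the 50 percent stage and returns the candidate's share to zero, which is why the later stages exist: some regressions only appear once the model sees enough real traffic.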
You rarely build the orchestration by hand. A common open stack is a CI runner (GitHub Actions, GitLab CI) that launches the distributed training job, MLflow to log metrics and register the candidate in the model registry, and CML (Continuous Machine Learning) to post the candidate-versus-baseline comparison straight into the pull request as a report. The gate logic itself is a few lines that read both scores from the tracking server and set the job's exit status:
# .github/workflows/train.yml (excerpt)
jobs:
  train-and-gate:
    runs-on: gpu-cluster
    steps:
      - uses: actions/checkout@v4
      - run: torchrun --nnodes=4 --nproc_per_node=8 train.py   # the distributed build
      - run: python evaluate.py --candidate runs/latest --baseline registry:prod
      - run: |
          cml comment create report.md                # candidate-vs-baseline table in the PR
          python gate.py runs/latest registry:prod    # exit 0 promotes, exit 1 blocks
torchrun runs the distributed build, MLflow versions the artifact, CML renders the comparison into the pull request, and a short gate.py (the logic of Code 26.4.2) decides promotion by its exit code. The dozen lines of scheduling, logging, and reporting glue that you would otherwise maintain by hand collapse into off-the-shelf actions.
4. The Evaluation Gate in Pure Python Intermediate
The heart of the pipeline is small enough to write from scratch, and writing it makes the contract unmistakable. The gate trains a candidate, scores it and the production baseline on the same frozen holdout set, and promotes the candidate only if it improves accuracy by at least a margin and passes a safety check. Formally, with candidate accuracy $a_c$, baseline accuracy $a_b$, a promotion margin $\delta$, a fairness gap $g$, and a safety threshold $\tau$, the gate promotes exactly when
$$\text{promote} \iff (a_c - a_b \ge \delta) \ \wedge \ (g \le \tau).$$
The conjunction is the whole point: both conditions must hold. A model that improves accuracy but opens a fairness gap is blocked, and so is a model that is perfectly fair but no better than what it replaces. The code below implements exactly this on a logistic-regression candidate, where the "safety check" is the largest accuracy gap between two protected groups. The three candidate builds are constructed to land in three different verdicts.
import numpy as np

rng = np.random.default_rng(7)

# A held-out evaluation set the gate scores both models on.
N, d = 4000, 12
X_holdout = rng.standard_normal((N, d))
w_star = rng.standard_normal(d)
y_holdout = (X_holdout @ w_star + 0.3 * rng.standard_normal(N) > 0).astype(float)

def train(seed, n_examples):
    """Stand-in for a distributed training job: fit a logistic model."""
    r = np.random.default_rng(seed)
    Xt = r.standard_normal((n_examples, d))
    yt = (Xt @ w_star + 0.3 * r.standard_normal(n_examples) > 0).astype(float)
    w = np.zeros(d)
    for _ in range(300):
        p = 1.0 / (1.0 + np.exp(-(Xt @ w)))
        w -= 0.1 * (Xt.T @ (p - yt)) / n_examples
    return w

def accuracy(w):
    p = 1.0 / (1.0 + np.exp(-(X_holdout @ w)))
    return float(((p > 0.5).astype(float) == y_holdout).mean())

def max_disparity(w, bias=0.0):
    """Safety check: largest accuracy gap across two protected groups.
    `bias` simulates a model that quietly degrades on one group while keeping
    high overall accuracy: exactly the regression a global metric would miss."""
    g = (X_holdout[:, 0] > 0)
    logits = (X_holdout @ w) - bias * g    # push group-A toward the wrong class
    p = (1.0 / (1.0 + np.exp(-logits)) > 0.5).astype(float)
    acc_a = (p[g] == y_holdout[g]).mean()
    acc_b = (p[~g] == y_holdout[~g]).mean()
    return float(abs(acc_a - acc_b))

def gate(candidate, baseline, bias=0.0, min_gain=0.005, max_gap=0.10):
    """Promote only if candidate BEATS baseline AND passes the safety check."""
    cand_acc, base_acc = accuracy(candidate), accuracy(baseline)
    gap = max_disparity(candidate, bias)
    beats = (cand_acc - base_acc) >= min_gain
    safe = gap <= max_gap
    promote = beats and safe
    reason = ("promoted" if promote else
              "blocked: no eval gain" if not beats else
              "blocked: failed safety check")
    return promote, cand_acc, base_acc, gap, reason

# Production baseline trained on a smaller, older snapshot.
baseline = train(seed=1, n_examples=600)
print(f"production baseline accuracy : {accuracy(baseline):.4f}\n")

# Three candidate builds: (model, safety_bias). build-C is accurate overall
# but quietly regresses on one protected group.
candidates = {
    "build-A (more data)"    : (train(seed=2, n_examples=8000), 0.0),
    "build-B (stale shard)"  : (train(seed=99, n_examples=120), 0.0),
    "build-C (group regress)": (train(seed=2, n_examples=8000), 3.0),
}
for name, (cand, bias) in candidates.items():
    promote, ca, ba, gap, reason = gate(cand, baseline, bias)
    flag = "PROMOTE" if promote else "BLOCK  "
    print(f"{flag} {name:24s} acc={ca:.4f} (base {ba:.4f}) gap={gap:.3f} -> {reason}")
production baseline accuracy : 0.9647
PROMOTE build-A (more data) acc=0.9730 (base 0.9647) gap=0.009 -> promoted
BLOCK build-B (stale shard) acc=0.9157 (base 0.9647) gap=0.016 -> blocked: no eval gain
BLOCK build-C (group regress) acc=0.9730 (base 0.9647) gap=0.333 -> blocked: failed safety check
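Build-C is the case that justifies the whole apparatus. Its reported overall accuracy ties the promoted build-A to four decimal places, so a pipeline that gated on accuracy alone would have shipped it. (In this toy the tie is exact by construction: build-C reuses build-A's weights, and the simulated bias is applied only inside max_disparity, so accuracy never sees it. In a real system the same effect arises when the degraded group is small enough that the aggregate barely moves.) The safety check catches what the aggregate metric hides, a model that is excellent on average while quietly failing one group of users. This is why the gate is a conjunction and why ML CI/CD keeps a suite of metrics, not one scalar; the multi-metric evaluation discipline of Chapter 5 is exactly what feeds this gate.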
Who: An ML platform engineer running the continuous-training pipeline for a content-ranking model at a streaming service.
Situation: A nightly job retrained the ranker on the day's interaction logs across a 16-GPU cluster and auto-deployed any candidate whose offline AUC beat the incumbent.
Problem: One night a holiday traffic spike skewed the logs; the retrained candidate's aggregate AUC rose, but its ranking quality for a small non-holiday user segment collapsed.
Dilemma: Keep the simple single-metric gate that had shipped fine for months and ship the candidate, or add per-segment and safety gates that would occasionally block a model with higher headline AUC.
Decision: They made the gate a conjunction: a candidate had to beat the baseline on aggregate AUC and on every monitored segment and clear a guardrail on a fairness metric, exactly the structure of Code 26.4.2.
How: The evaluation step scored candidate and incumbent on a frozen holdout plus per-segment slices, posted the comparison to the pull request with CML, and set the deploy job's exit status from the conjunction.
Result: The very next skew-day candidate was blocked on the segment gate and never reached production; the incumbent kept serving, and the on-call engineer reviewed the diff in the morning instead of firefighting a live regression.
Lesson: An aggregate win can hide a slice-level loss. A gate that scores only one number will eventually promote a model that is better on average and worse where it matters.
5. Why ML CI/CD Is Harder Than Software CI/CD Intermediate
It is worth naming directly why the pipeline above is harder to build and operate than its software cousin, because each difficulty drives a concrete design choice. Four properties separate them. First, nondeterminism: two training runs from the same code can produce different models because of random initialization, data-loader shuffling, and nondeterministic floating-point reductions across GPUs, so "rerun and compare" is not the clean equality check it is for a deterministic build. Second, expensive long builds: a software build retries cheaply, but a training run can cost hours of cluster time, which makes failing fast on the cheap checks and reusing artifacts essential rather than nice to have. Third, data dependence: the pipeline's output depends on data that changes underneath it, so a green pipeline yesterday says nothing about today, and the data itself must be validated and versioned as carefully as the code. Fourth, quality is not binary: there is no pass or fail, only a measured score against a moving baseline, which is why the gate is a comparison and a threshold rather than an assertion.
A software engineer's nightmare is a flaky test that fails one run in fifty. An ML engineer's nightmare is a flaky test that passes one run in fifty: the candidate that beat the baseline by luck of the random seed, sailed through the gate, and now has to be explained in the postmortem. Nondeterminism does not just break your equality checks; it occasionally lies to your face in the friendly direction.
These four properties also explain why the gate must be statistically honest. Because quality is a noisy measured score, "candidate beat baseline by 0.0003" may be within the run-to-run noise of the nondeterminism above, and promoting on it is promoting on chance. Mature gates therefore set the promotion margin $\delta$ above the measured noise floor, or require the improvement to clear a confidence interval, rather than reacting to any positive difference. The evaluation rigor of Chapter 5 is what tells you where that floor sits.
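One simple way to set the margin above the noise floor is to retrain the same configuration under several seeds and measure how much the score moves. The sketch below assumes that approach; the function names, the rerun scores (illustrative values, not measurements), and the choice of two standard deviations are all assumptions made for the example.

```python
import numpy as np

def noise_floor(rerun_scores, z=2.0):
    """Estimate how large a score DIFFERENCE can be from seed noise alone.

    If one run's score has standard deviation s across seeds, the difference of
    two independent runs has standard deviation s * sqrt(2). We take z of those.
    """
    s = float(np.std(rerun_scores, ddof=1))
    return z * s * np.sqrt(2.0)

def honest_margin(rerun_scores, min_gain=0.005, z=2.0):
    """Promotion margin: the business-level minimum gain, but never below the noise."""
    return max(min_gain, noise_floor(rerun_scores, z))

def clears_noise(cand_acc, base_acc, rerun_scores, min_gain=0.005, z=2.0):
    return (cand_acc - base_acc) >= honest_margin(rerun_scores, min_gain, z)

# Illustrative holdout accuracies from five reruns of one configuration
# with different seeds (made-up numbers for the example).
reruns = [0.9641, 0.9652, 0.9638, 0.9660, 0.9645]

print(f"noise floor   : {noise_floor(reruns):.4f}")
print(f"honest margin : {honest_margin(reruns):.4f}")
print(clears_noise(0.9650, 0.9647, reruns))   # a 0.0003 "win"
print(clears_noise(0.9730, 0.9647, reruns))   # a 0.0083 win
```

This measures only run-to-run training noise; sampling noise from a finite holdout set is a separate source that a confidence interval on the accuracy difference would address.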
6. The LLMOps Version: Eval Suites and Guardrails as Gates Advanced
For large language models and the agentic systems of Chapter 24, the candidate that flows through CI/CD is often not a freshly trained model at all but a changed prompt, a swapped base model, a new retrieval index, or a tweaked tool definition. The pipeline shape is identical, but the gate's contents change: instead of a single accuracy number, the gate runs an evaluation suite. Offline evaluation scores the candidate on a curated set of tasks; an LLM-as-judge step uses a separate, strong model to grade open-ended outputs for helpfulness and correctness where no exact-match key exists; and red-team and guardrail tests probe for jailbreaks, prompt injection, toxic or unsafe completions, and policy violations. A candidate prompt or model is promotable only if it improves the task scores and passes every guardrail, the same conjunction as Code 26.4.2 with a richer right-hand side.
This matters because an LLM change has no compiler to catch it. A reworded system prompt that lifts answer quality can simultaneously open a new injection vector, and nothing but an explicit red-team gate in the pipeline will catch that before users do. Treating the eval suite and the guardrail tests as required CI gates, run on every prompt and model change, is what turns "we tried the new prompt and it seemed better in the playground" into a defensible, repeatable release decision.
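The LLM gate has the same conjunction shape as the accuracy gate, with a dictionary of task scores on one side and a dictionary of guardrail verdicts on the other. The sketch below is a hypothetical `llm_gate` function with made-up task names, scores, and guardrail names; it assumes the eval suite and red-team tests have already run and produced these results.

```python
def llm_gate(cand_scores, base_scores, guardrail_results, min_gain=0.01):
    """Promote only if mean task score improves AND every guardrail passes.

    cand_scores / base_scores: dict of task name -> score (higher is better).
    guardrail_results: dict of guardrail name -> True if the candidate passed.
    """
    tasks = sorted(base_scores)
    missing = [t for t in tasks if t not in cand_scores]
    if missing:
        return False, f"blocked: candidate missing tasks {missing}"
    gain = sum(cand_scores[t] - base_scores[t] for t in tasks) / len(tasks)
    failed = sorted(name for name, ok in guardrail_results.items() if not ok)
    if gain < min_gain:
        return False, "blocked: no eval gain"
    if failed:
        return False, f"blocked: failed guardrails {failed}"
    return True, "promoted"

# Illustrative scores for a reworded system prompt (made-up values).
base = {"qa": 0.70, "summarize": 0.60}
cand = {"qa": 0.78, "summarize": 0.64}

all_pass = {"jailbreak": True, "prompt_injection": True, "toxicity": True}
one_fail = {"jailbreak": True, "prompt_injection": False, "toxicity": True}

print(llm_gate(cand, base, all_pass))
print(llm_gate(cand, base, one_fail))
print(llm_gate(base, base, all_pass))
```

The second call is the case the paragraph above warns about: the prompt scores better on every task and is still blocked, because a single failed guardrail is enough to make the conjunction false.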
Two research lines are reshaping the LLM gate. The first is eval-driven CI: open frameworks such as OpenAI Evals, EleutherAI's lm-evaluation-harness, and judge-based libraries make a versioned evaluation suite a first-class pipeline artifact, with active work on the reliability of LLM-as-judge scoring (calibrating judge models, measuring their bias and agreement with humans) so the gate's verdict can be trusted. The second is automated red-teaming: rather than hand-writing adversarial prompts, methods in the lineage of Anthropic's constitutional and automated red-teaming work and Meta's open guardrail models (Llama Guard) generate and continuously expand attack suites, and standardized probes such as the MLCommons AILuminate safety benchmark (2024 to 2025) are turning safety from a one-off audit into a regression test that runs on every change. The frontier question is how to keep these eval and red-team suites themselves from going stale as models learn to pass yesterday's tests, which pushes toward adversarially generated, self-refreshing gates.
CI/CD for models is not a one-way pipeline; it is a feedback loop spread across machines. The drift detector on the serving fleet (Section 26.7) triggers a retrain on the training cluster, whose candidate is judged by an evaluation gate, rolled out by a canary on the inference fleet (Section 26.8), and rolled back by the incident system (Section 26.9) when it regresses. Each arrow crosses a machine boundary, so the same primitives that move gradients and shard models (reliable messaging, lineage tracking, idempotent retries) here move decisions. Scale-out is not only how the model trains and serves; it is how the system decides what to ship.
For each of the four properties from Section 5 (nondeterminism, expensive long builds, data dependence, quality not binary), name one concrete change you would make to the pipeline of Figure 26.4.1 to cope with it, and say which stage of the figure your change lives in. Then explain why none of these changes would be necessary in an ordinary software CI pipeline that compiles a library and runs unit tests.
Extend Code 26.4.2 so the gate accounts for evaluation noise. Score each model on several bootstrap resamples of the holdout set, estimate a confidence interval on the accuracy difference $a_c - a_b$, and promote only when the lower bound of that interval clears the margin $\delta$. Construct a candidate whose point estimate beats the baseline by a hair but whose interval includes zero, and confirm your stricter gate blocks it where the original Code 26.4.2 would have promoted it. Explain how this connects to the nondeterminism property of Section 5.
Suppose the cheap checks (unit tests, data validation, smoke test) take 3 minutes and catch a defect in 20 percent of changes, while the full distributed training run takes 4 hours on a cluster that costs \$120 per hour. Compare the expected cluster cost per change of two orderings: running the cheap checks first and gating the training run on them, versus always launching the training run and checking afterward. Generalize to an arbitrary cheap-check catch rate $p$ and state the rule for when fail-fast ordering pays off. Relate your answer to the "expensive long builds" property of Section 5.