Part V: Distributed Inference and Serving
Chapter 26: MLOps for Distributed AI

Model and Prompt Registries

"Half the fleet was serving version 7, the other half version 9, and nobody could tell me which one was production. We did not have a model; we had a rumor."

A Serving Replica Loading the Wrong Weights
Big Picture

A model registry is the system of record for an ML platform: a consistency-critical shared service that names every trained artifact, records exactly which data and code produced it, and holds the single agreed answer to the question every serving node keeps asking, "which version am I supposed to run?" Once a model is served from more than one machine, "the production model" stops being a file on someone's laptop and becomes shared distributed state that the whole fleet must read and agree on. Getting two replicas to disagree about which weights are live is not a cosmetic bug; it splits your traffic across two different models and quietly poisons every metric downstream. This section treats the registry as what it is, a small control-plane service with a consensus problem at its heart, and builds one from scratch so the moving parts (versions, lineage, stage transitions, and one authoritative production pointer) are concrete before you reach for a managed tool.

The previous section built the distributed data and training pipelines that turn raw data into trained weights. Those pipelines produce artifacts, and an artifact with no record of where it came from is a liability rather than an asset. A multi-gigabyte checkpoint sitting in a bucket tells you nothing on its own: not which dataset version trained it, not which code commit, not what accuracy it reached, not whether it is allowed to serve real traffic. The model registry is the service that attaches all of that to the artifact and makes it queryable, auditable, and, above all, agreed upon across every machine that serves the model. In a single-process notebook none of this matters. Across a fleet it is the difference between an ML platform and a pile of files.

We treat the registry as a distributed-systems object for two concrete reasons. First, the production pointer (the mapping from a model name to the one version that should be live) is shared mutable state that many serving nodes read concurrently and that a deployment must update atomically, which is the control-plane consensus problem of Chapter 2 wearing an MLOps hat. Second, the data it manages is split by nature: the heavy artifacts live in object storage (Chapter 8) while the lightweight metadata and lineage live in a database, so a single logical "model version" is really a coordinated record spanning two very different storage systems.

1. The Registry as the System of Record Beginner

A registry stores three things that a bare artifact store does not. It stores versioned metadata: for every model version, the training data snapshot, the code commit, the hyperparameters, the evaluation metrics, and the timestamp and author. It stores a lifecycle: each version moves through stages, typically staging, then production, then archived, and those transitions are gated by approvals rather than happening by accident. And it stores a pointer: the single authoritative binding from a model name to the version currently serving production traffic. The artifact bytes themselves usually do not live in the registry at all; the registry holds a reference (a URI into object storage) and owns the metadata, the lifecycle, and the pointer.

This separation is deliberate and it mirrors the storage split from Chapter 8. Model checkpoints are large, immutable, and written once; object storage is exactly the right home for blobs with those properties. Metadata is small, queried constantly ("show me every production model trained on data older than thirty days"), and mutated on every stage transition; a transactional database is exactly the right home for that. The registry is the service that keeps these two stores consistent so that a version record never points at a checkpoint that does not exist, and a checkpoint is never silently promoted without a corresponding metadata update.

Key Insight: The Production Pointer Is Consensus, Not Configuration

The most important field in the entire registry is one tiny pointer: model name to production version. It looks like a configuration value, but it behaves like distributed consensus. Every serving replica reads it, and a deployment writes it; if two replicas can observe different values at the same instant, your traffic is silently split across two different models and every metric you collect becomes an average of two systems. Updating that pointer is therefore an atomic, agreed, control-plane operation (Chapter 2), not a casual write. Treat "which version is production?" with the same seriousness you would treat a leader election, because functionally it is one.

2. Lineage: Pinning a Model to Its Exact Inputs Intermediate

Reproducibility and audit both reduce to one requirement: given a model version, you can recover the exact inputs that produced it. We capture lineage by hashing the things a model depends on. If a version records the content hash of its training-data snapshot and the commit hash of its training code, then "reproduce version 7" becomes a deterministic recipe rather than an archaeology project. Let a version $v$ depend on a data snapshot $D_v$ and a code state $C_v$. We store the pair

$$\text{lineage}(v) = \big(\,h(D_v),\; h(C_v)\,\big),$$

where $h$ is a collision-resistant hash. Two versions are reproducibly identical in their inputs exactly when both hashes match, and any drift in data or code shows up immediately as a changed hash. This is the same data-versioning-and-lineage discipline developed in Chapter 8, now anchored to a deployable model rather than to a dataset. The probability that two genuinely different snapshots collide to the same hash is negligible: for a $b$-bit hash it is on the order of $2^{-b}$, so a 256-bit hash makes accidental confusion a non-issue.

Lineage is what turns a registry from a convenience into a governance instrument. When a regulator, a customer, or your own incident review asks "what produced this prediction?", a registry with lineage answers with a data hash, a code commit, a metric set, and an approval trail; a registry without it shrugs. Figure 26.3.1 shows how the pieces fit together and where the fleet plugs in.

Object storage (Ch 8) checkpoint v1 (3.2 GB) checkpoint v2 (3.2 GB) checkpoint v3 (3.4 GB) Registry metadata + lineage (DB) v2 → data_hash, code_commit params, metrics, stage = production v3 → data_hash, code_commit stage = staging (awaiting approval) v1 → stage = archived artifact URI references Stage transitions (approval-gated) staging production archived production pointer ranker → v2 Serving fleet node 1 loads v2 node 2 loads v2 node K loads v2 every node reads the same authoritative pointer
Figure 26.3.1: Anatomy of a model registry. Large artifacts live in object storage (left); the registry database (center top) holds one metadata-and-lineage record per version, each referencing its artifact by URI. Versions move through approval-gated stages (center). A single production pointer (red) names the live version, and every node in the serving fleet (right) resolves that one pointer, so the whole fleet loads identical weights.

3. A Registry From Scratch Intermediate

The clearest way to see why the production pointer is the heart of the design is to build a registry with nothing but a dictionary and a hash function. The implementation below registers three versions (each with a data hash and a code commit for lineage), exposes stage transitions, and keeps exactly one production pointer that a simulated fleet resolves. It then performs a promotion sequence and a rollback, the two operations every real deployment depends on. The key invariant: whatever the production pointer says, every node in fleet_resolve reads the same value, so the fleet never splits.

import hashlib, json

def _h(s):                                  # 12 hex chars of SHA-256 is plenty for a demo
    return hashlib.sha256(s.encode()).hexdigest()[:12]

class ModelRegistry:
    def __init__(self, name):
        self.name = name
        self.versions = {}                  # version -> metadata + lineage record
        self._next = 1
        self.production = None              # THE single source of truth pointer

    def register(self, data_snapshot, code_commit, params, metrics):
        v = self._next; self._next += 1
        self.versions[v] = {
            "version": v, "stage": "staging",
            "data_hash": _h(data_snapshot),  # lineage: which data produced this
            "code_commit": code_commit,      # lineage: which code produced this
            "params": params, "metrics": metrics,
        }
        return v

    def promote(self, version):             # atomic stage transition
        if self.production is not None:
            self.versions[self.production]["stage"] = "archived"
        self.versions[version]["stage"] = "production"
        self.production = version            # one write flips the live model

    def rollback(self, to_version):         # point production back at a known-good version
        if self.production is not None:
            self.versions[self.production]["stage"] = "archived"
        self.versions[to_version]["stage"] = "production"
        self.production = to_version

    def serving_version(self):              # what every fleet node must agree to load
        return self.production

def fleet_resolve(reg, fleet_size):         # each node independently asks the registry
    return [reg.serving_version() for _ in range(fleet_size)]

reg = ModelRegistry("ranker")
v1 = reg.register("corpus@2026-05-01", "git:a1b9f3", {"lr": 3e-4, "layers": 12}, {"auc": 0.901})
v2 = reg.register("corpus@2026-06-10", "git:c7d201", {"lr": 3e-4, "layers": 12}, {"auc": 0.917})
v3 = reg.register("corpus@2026-06-14", "git:e4f880", {"lr": 1e-4, "layers": 16}, {"auc": 0.908})
print("registered versions :", sorted(reg.versions))

reg.promote(v1)
print("after promote v1     : production =", reg.serving_version(),
      "| fleet loads", set(fleet_resolve(reg, 8)))
reg.promote(v2)
print("after promote v2     : production =", reg.serving_version(),
      "| v1 stage =", reg.versions[v1]["stage"])

# v3 trained but never approved; a canary on v2 regresses, so we roll back to v1.
reg.rollback(v1)
print("after rollback to v1 : production =", reg.serving_version(),
      "| fleet loads", set(fleet_resolve(reg, 8)))
print("lineage of v2        :", json.dumps({
    "version": reg.versions[v2]["version"], "data_hash": reg.versions[v2]["data_hash"],
    "code_commit": reg.versions[v2]["code_commit"], "metrics": reg.versions[v2]["metrics"]}))
print("stages               :", {v: reg.versions[v]["stage"] for v in sorted(reg.versions)})
Code 26.3.1: A model registry in one class. Lineage is two hashes per version, the lifecycle is a stage string, and the entire notion of "the live model" is the single self.production pointer that fleet_resolve reads on behalf of every node.
registered versions : [1, 2, 3]
after promote v1     : production = 1 | fleet loads {1}
after promote v2     : production = 2 | v1 stage = archived
after rollback to v1 : production = 1 | fleet loads {1}
lineage of v2        : {"version": 2, "data_hash": "d44dc25ac185", "code_commit": "git:c7d201", "metrics": {"auc": 0.917}}
stages               : {1: 'production', 2: 'archived', 3: 'staging'}
Output 26.3.1: Promotion moves the pointer forward and archives the version it replaced; rollback moves it back to a prior known-good version. The fleet set is always a singleton ({1}), which is the visible proof that every node agreed on one model. Version 3 was trained but never promoted, so it sits in staging untouched.

Three details in Output 26.3.1 carry the whole lesson. The fleet always resolves to a single value, never a split, because there is exactly one pointer. Promotion is reversible: the rollback to version 1 is just another pointer write, so recovering from a bad deploy is as cheap and atomic as the deploy was. And the lineage of version 2 survives every stage change untouched, so the answer to "what produced this model?" is stable even after the model is archived. A production registry adds durability, access control, and a real consensus store behind the pointer, but the shape of the thing is exactly Code 26.3.1.

Thesis Thread: One Pointer the Whole Fleet Reads

The registry is where this book's spine resurfaces in operations. The serving fleet of Chapter 24 is many machines that must behave as one model, and the thing that makes them one model rather than many is a single agreed pointer, the same control-plane consensus that elected a leader in Chapter 2. Scaling out the serving did not remove the need for a single source of truth; it concentrated that need into one tiny, heavily-read field. Whenever you see a fleet behaving coherently, look for the small piece of agreed state that makes the coherence possible.

4. Prompts Are Code Too: The Prompt Registry Intermediate

The LLMOps era adds a second class of artifact that behaves exactly like a model and is too often treated like a sticky note: the prompt. A prompt or prompt template is a deployable artifact whose change alters system behavior just as surely as new weights do. Editing the system prompt of a customer-facing assistant is a deployment, with the same blast radius as shipping a new checkpoint, yet many teams still paste prompts inline in code or, worse, edit them live in a console with no version, no test, and no way back. A prompt registry fixes this by giving prompts the same treatment models get: every prompt template carries a version, a hash, a test result, and a stage, and "the production prompt" is a single pointer the fleet resolves, exactly as in Code 26.3.1.

The payoff is symmetry. A prompt regression (a reworded instruction that quietly raises the refusal rate or breaks JSON formatting) becomes a one-line rollback to the previous prompt version instead of a frantic console edit under incident pressure. Pinning a model version and a prompt version together as a single deployed unit is what makes an LLM application reproducible: the same weights and the same prompt give you the same behavior, and changing either is a tracked, reversible event. The registry abstraction does not care whether the artifact is three gigabytes of weights or three hundred bytes of text; both need a version, lineage, a stage, and one production pointer.

Fun Note: The Hotfix Heard Round the Fleet

The classic prompt incident: someone "just tweaks" the system prompt at 2 a.m. to fix one weird answer, the new wording trips the model into ignoring its output format, and by morning every downstream parser in the company is choking on malformed responses with no record of what changed. A prompt registry turns that ghost story into a boring rollback("prompt-v8") and a diff you can read over coffee. Prompts are code; uncommitted code that runs in production is just a future incident with good intentions.

5. Reaching for a Real Registry Beginner

You build Code 26.3.1 once to understand the moving parts, then you adopt a managed registry that provides the durability, access control, and consensus-backed pointer you do not want to implement yourself. The major options share the shape we built. MLflow Model Registry gives you named models, numbered versions, stage transitions, and a Python and REST API over a backing database and artifact store. Weights & Biases offers artifact lineage and a registry with rich experiment links. Hugging Face Hub versions models (and increasingly prompts and datasets) with git-style history and model cards. The cloud platforms, Vertex AI Model Registry and SageMaker Model Registry, integrate the registry with their managed serving so the production pointer drives the deployed endpoint directly.

Library Shortcut: MLflow, W&B, and the Hugging Face Hub

The roughly fifty lines of Code 26.3.1 become a handful of calls against a durable, multi-user service. The same register-promote-resolve flow in MLflow, with the backing database and artifact store handling persistence and the atomic pointer:

import mlflow
from mlflow.tracking import MlflowClient

# Register an artifact already logged to the tracking server's artifact store (Ch 8).
mlflow.register_model(model_uri="runs:/<run_id>/model", name="ranker")   # -> version N

client = MlflowClient()
client.transition_model_version_stage("ranker", version=2, stage="Production")  # promote
client.transition_model_version_stage("ranker", version=1, stage="Archived")    # rollback target kept

# Every serving node resolves the SAME production pointer, no split possible:
prod = client.get_model_version_by_alias("ranker", "production")   # -> the one live version
model = mlflow.pyfunc.load_model(f"models:/ranker@production")     # load by the pointer, not a path
Code 26.3.2: The register-promote-resolve cycle from Code 26.3.1 in MLflow. The library owns the database transaction behind the stage change, the artifact reference into object storage, and the access control, leaving you the same three verbs. Weights & Biases (wandb.Artifact with aliases) and the Hugging Face Hub (huggingface_hub with revisions) expose the identical register-version-promote pattern.

Use the from-scratch version to reason about correctness and the managed version in production. The line-count reduction is real, but the durability and the consensus-backed pointer are the parts you were never going to reimplement well.

Practical Example: The Split-Brain Recommendation Fleet

Who: An ML platform engineer at a streaming service running a recommendation model across forty serving replicas.

Situation: A new model was rolled out by copying the checkpoint to each replica and restarting it, replica by replica, over twenty minutes.

Problem: During the rollout window, half the fleet served the new model and half served the old one, and an experiment running at the same time reported nonsensical, drifting metrics that nobody could reproduce.

Dilemma: Speed up the rollout so the inconsistent window shrinks (still leaving a window, and still split-brained for minutes), or stop deploying by file copy entirely and route every replica through one agreed production pointer.

Decision: They moved to a registry-backed deploy: replicas resolve the production version from MLflow at load time and on a refresh signal, so the live model is whatever the single pointer says, not whatever file happens to be on local disk.

How: Promotion became a single registry stage transition; replicas watched the pointer and hot-reloaded; rollback was the same transition pointed at the prior version, completing in seconds instead of a twenty-minute reverse copy.

Result: The split-brain window closed, the experiment metrics became reproducible, and the next bad model was rolled back fleet-wide in under a minute by flipping one pointer, exactly the behavior of the rollback in Output 26.3.1.

Lesson: Deploy by changing the agreed pointer, never by racing files onto machines. A registry makes "which model is live?" a single answer the whole fleet reads, which is the only way a multi-replica deploy stays consistent.

6. Governance and the Research Frontier Advanced

A registry that records lineage is one step from a registry that enforces governance. Because every version already carries its data hash, code commit, metrics, and stage, the registry is the natural place to attach a model card (the documented intended use, evaluation results, and known limitations of a version) and to require an approval before a version may enter production. The stage transition becomes a policy gate: no card, no approval, no promotion. This closes the loop between the lineage of Section 2 and the operational control of Section 1, and it is increasingly a compliance requirement rather than a nicety.

Research Frontier: Versioned Prompts and Governed Model Cards (2024 to 2026)

Two threads are reshaping registries right now. The first is treating prompts as first-class registered artifacts: tools such as LangSmith, Langfuse, and PromptLayer (2024 to 2025) add prompt versioning, evaluation, and rollback, and MLflow's prompt registry brings prompt templates under the same stage-and-version machinery as models, making a prompt change a tracked deployment rather than a console edit. The second is governance as code: building on the model-cards line (Mitchell et al., 2019) and the model-card-toolkit lineage, registries now bind cards and approval policies to stage transitions, sharpened by the documentation and risk-management expectations of the EU AI Act (in force 2024, phased through 2026) and the NIST AI Risk Management Framework. The research and tooling question is how to make lineage and approval verifiable and tamper-evident, so a model card and its data-and-code provenance are auditable claims rather than self-reported text. We connect the retrieval and provenance side of this story to web-scale RAG in Chapter 25.

The registry is now in place as the platform's shared source of truth: every model and prompt has a version, a lineage, a lifecycle, and one authoritative production pointer the fleet agrees on. What it does not yet have is an automated path from a passing test to a promoted version. That path, the pipeline that builds, validates, and ships a model change with the same rigor software gets, is continuous integration and delivery for distributed ML, and it is the subject of Section 26.4.

Exercise 26.3.1: What Must Lineage Capture? Conceptual

Code 26.3.1 records two lineage hashes: the training-data snapshot and the code commit. Name at least three further inputs that can change a trained model's behavior even when those two hashes are identical (consider randomness, environment, and dependencies), and for each, state what you would hash or pin to capture it. Then argue which of your additions are essential for bit-for-bit reproducibility versus merely helpful for an audit, and why a registry might choose to record the audit-only items anyway.

Exercise 26.3.2: Add Approval Gates and a Prompt Registry Coding

Extend Code 26.3.1 so that promote raises an error unless the version has an attached approval (add an approve(version, approver) method and a check). Then add a parallel PromptRegistry with the same register/promote/rollback/resolve interface for prompt templates, and a DeployedUnit that pins one model version and one prompt version together. Show a promotion of the unit and a rollback of only the prompt while the model stays fixed. Print the fleet-resolved (model, prompt) pair before and after to confirm it stays a singleton.

Exercise 26.3.3: The Cost of a Stale Pointer Analysis

Suppose a fleet of $K$ replicas caches the production pointer and refreshes it every $T$ seconds, so after a promotion a replica may serve the old model for up to $T$ seconds. If a fraction $f$ of replicas have refreshed at a given instant during the window, write the fraction of traffic served by the new model as a function of $f$, and explain why any $0 < f < 1$ is a split-brain state. Then reason about the trade-off: shrinking $T$ reduces the inconsistency window but increases read load on the registry. Relate this to the consensus and staleness discussion of Chapter 2 and propose one mechanism (for example, a push notification or a version barrier) that closes the window without polling harder.