Section 35.7: Auditability and Governance Across a Fleet

"I am an audit log, remembering what ten thousand nodes would rather forget. They each kept only the last hour; I keep the question someone will ask in a year."
An Append-Only Ledger With a Long Memory

Big Picture

Governing a single model is a paperwork problem; governing a fleet is a distributed-systems problem. When one model runs in one process, you can answer "which data trained it, who changed it, and how is it behaving?" by inspecting one box. When thousands of model replicas run across regions, tenants, and edge devices, no single box holds the answer. Governance becomes the same engineering act as everything else in this book: instrument every node, emit evidence as it happens, ship that evidence to a place where it cannot be quietly rewritten, and roll it up into a fleet-level view that a regulator, an incident responder, or a future engineer can trust. This section treats accountability itself as a workload that must scale out, collected from every node and aggregated, with the same care for cost, tamper-resistance, and completeness that we gave to gradients and to checkpoints.

The previous sections of this chapter hardened the fleet against failure and attack: reliable training, secure aggregation, and the privacy controls of Section 35.6. Reliability and security answer "is the system working and is it safe?" Governance answers a different and longer-lived question: "can we prove, after the fact, what the system did and why?" That question is asked by auditors, by regulators enforcing a right to explanation, by an incident review after a bad deployment, and by the next engineer who inherits the system. None of them can log into ten thousand nodes. They need the evidence already collected, already linked, and already verifiable. Building that evidence pipeline is the subject of this section, and it is distributed engineering through and through.

1. Governance as a Distributed-Systems Problem Beginner

The instinct from single-machine ML is to treat governance as documentation: write down the dataset, note the hyperparameters, file a report. That instinct fails at fleet scale for the same reason a print statement fails as a distributed tracing system. The facts you must govern are scattered. The training data lived on a data lake (Chapter 8); the training run executed across many workers (Chapter 15); the resulting model was registered, versioned, and promoted through an MLOps pipeline (Chapter 26); and the deployed replicas now serve from a fleet of nodes across several regions and tenants. A governance question touches all of these, and the answer must be assembled from evidence each layer emitted independently.

Reframing governance this way makes its hard parts the familiar hard parts of distributed systems. You need a consistent naming and versioning scheme so that a fact emitted on one node refers to the same model and the same dataset as a fact emitted on another. You need an append-only, tamper-evident store so that evidence written under load cannot be silently edited later. You need aggregation that rolls per-node observations into a fleet view without losing the per-node detail an investigator will eventually want. And you need all of it to keep working while nodes crash, restart, and get preempted, exactly the conditions Chapters 18 and 33 taught us to expect. Figure 35.7.1 shows the shape of the pipeline: every node emits, a durable lineage and audit store collects, and a governance dashboard rolls the evidence up.

Figure 35.7.1: Governance as a scale-out pipeline. Each node in the fleet emits signed log records and metrics as it runs; a durable append-only store collects them into a lineage DAG and a hash-chained audit log alongside the model and data cards; a governance view rolls the evidence into fleet-level drift summaries, policy-violation reports, and provenance queries. No single node holds the answer to an audit question; the answer is assembled from evidence every node contributed.

2. Provenance and Lineage: Which Data Trained Which Model Intermediate

The foundational governance question is provenance: for any deployed model replica, which exact data, code, and configuration produced it? Answering it requires that every artifact in the pipeline carry an immutable identity and that the pipeline record the edges between them. A dataset snapshot gets a content hash; a training run records that hash, the code commit, the hyperparameters, and the environment; the resulting model version records the run that produced it; and each deployment records the model version it serves. Chained together, these edges form a directed acyclic graph, the lineage DAG, that lets you walk backward from a serving replica to the precise bytes it learned from. This is the same registry-and-versioning discipline that Chapter 26 builds for MLOps; governance simply insists that the DAG be complete and queryable, because an auditor's first question is almost always "where did this come from?"
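The backward walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the schema of any particular registry: the artifact names, the `produced_from` mapping, and the `content_id` and `trace` helpers are all hypothetical.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content-addressed identity: the ID is derived from the bytes themselves."""
    return hashlib.sha256(data).hexdigest()[:12]

# Hypothetical lineage edges: each artifact maps to the artifacts it was produced from.
produced_from = {
    "replica:eu-1/ranker": ["model:ranker:v37"],
    "model:ranker:v37": ["run:r912"],
    "run:r912": ["dataset:d44", "code:c8f1", "config:seed=7"],
}

def trace(artifact, edges):
    """Walk backward from an artifact to every upstream input (depth-first)."""
    seen, stack, order = set(), [artifact], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(edges.get(node, []))
    return order

# From a serving replica back to the data, code, and configuration behind it.
lineage = trace("replica:eu-1/ranker", produced_from)
print(lineage)

# Identical bytes always map to the same ID; any change yields a different one.
print(content_id(b"snapshot bytes"), content_id(b"snapshot bytes, edited"))
```

Storing edges as data rather than prose is the design choice that matters here: the same `trace` call works for every replica in the fleet, whereas a written description has to be reread for each one.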

Lineage is what makes reproducibility possible across a distributed pipeline, and reproducibility is the operational core of governance. If a deployed model behaves badly, you must be able to rebuild it from its recorded inputs to investigate, and rebuilding requires that the recorded inputs fully determine the output. In a distributed training run that is harder than it sounds: nondeterministic reduction order in collective operations, asynchronous data loading, and floating-point non-associativity can make two runs from identical inputs diverge. Governance-grade reproducibility therefore pins not just the data and code but the seeds, the device topology, the library versions, and where necessary deterministic kernels, so that the lineage edge "this run produced this model" is a fact you can re-derive rather than merely a claim you logged.
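Two of the points above are easy to see in plain Python: floating-point addition depends on grouping, and a pinned set of inputs can be reduced to a single identity. The sketch below is illustrative only; the `manifest` fields, their values, and the `manifest_id` helper are assumptions made for the example, not a standard format.

```python
import hashlib
import json

# Floating-point addition is not associative: the grouping changes the result,
# which is why a reduction that sums in a different order can diverge.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False

# A hypothetical run manifest pinning everything that can change the output.
manifest = {
    "data_snapshot": "d44",
    "code_commit": "c8f1",
    "seed": 7,
    "world_size": 8,                     # device topology (assumed value)
    "libraries": {"numpy": "1.26.4"},    # illustrative version pin
    "deterministic_kernels": True,
}

def manifest_id(m):
    """Hash a canonical JSON encoding so equal manifests get equal IDs."""
    canonical = json.dumps(m, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run_id = manifest_id(manifest)
changed = manifest_id({**manifest, "seed": 8})
print(run_id != changed)   # True: changing any pinned input changes the ID
```

Recording `run_id` on the lineage edge lets a later rebuild confirm that it started from the same pinned inputs before anyone compares outputs.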

Key Insight: Provenance Is an Edge, Not a Note

The difference between governance that survives an audit and governance that does not is whether provenance is recorded as machine-readable edges in a versioned graph or as prose in a wiki. An edge ("model v37 was produced by run r912, which read dataset snapshot d44 at commit c8f1") can be queried, traversed, and verified automatically across the whole fleet. A note can only be read by a human who already knows where to look. At fleet scale the human cannot look everywhere, so only the graph scales. Record every artifact with a content-addressed identity and every transformation as an edge, and the question "which data trained which deployed model?" becomes a graph traversal instead of an investigation.

3. Audit Logging and Tamper-Evidence Intermediate

Provenance records how an artifact came to be; an audit log records what happened to it afterward: who accessed it, who changed a policy, which request it served, and when. For governance the audit log has a property ordinary application logs lack: it must be tamper-evident, so that no one, including a privileged insider, can quietly alter the record after the fact. The standard construction is a hash chain. Each log entry stores the hash of the previous entry together with a cryptographic hash computed over that previous hash and its own contents, so the records form a chain in which changing any past entry changes every hash after it. A periodically published chain head (or a Merkle-tree root anchored to an external timestamp) then lets an auditor verify that the prefix they saw last week is still a prefix of the log they see today, even if an insider recomputes the later hashes after an edit. This is the same idea that underlies blockchains and transparency logs, borrowed here for the specific AI operation of proving a deployment history was not rewritten.

At fleet scale the audit log is itself distributed: each node appends to its own local segment, and the segments are shipped to the durable store and ordered there. The hash chain makes this tractable because verification is local to a segment and the segment heads can be cross-signed, so you do not need a single global lock on every append, which would throttle the fleet. The cost model is the familiar one from Chapter 4: appends are cheap and local, the expensive operation is the periodic aggregation of segment heads, and you tune its frequency against how quickly you need cross-fleet tamper-evidence. Code 35.7.2 in the Library Shortcut below shows the hash-chain append in a dozen lines, and the demo in Section 5 shows the metric side of the same emit-and-aggregate pattern.
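One way to picture the periodic aggregation of segment heads is a Merkle root computed over the latest head reported by each node. The sketch below is a simplified illustration under stated assumptions: the node names and segment contents are made up, and duplicating the last digest on odd levels is just one common convention. Production transparency logs use more careful constructions, such as the domain-separated leaf and node hashing of RFC 6962.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves):
    """Fold a list of hex digests pairwise into one root digest."""
    if not leaves:
        return h(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])   # pad odd levels by repeating the last digest
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical segment heads: the latest chain hash reported by each node.
segment_heads = {
    "node-eu-1": h(b"eu-1 segment"),
    "node-us-1": h(b"us-1 segment"),
    "node-ap-1": h(b"ap-1 segment"),
}

# A fixed node order makes the root reproducible by any verifier.
ordered = [segment_heads[n] for n in sorted(segment_heads)]
root = merkle_root(ordered)

# If any node's head is rewritten, the published root no longer matches.
tampered = {**segment_heads, "node-us-1": h(b"rewritten")}
tampered_root = merkle_root([tampered[n] for n in sorted(tampered)])
print(root != tampered_root)   # True
```

Publishing only `root` keeps the cross-fleet step small: nodes keep appending locally, and the aggregation frequency sets how stale the fleet-wide tamper-evidence can become.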

Library Shortcut: MLflow Lineage and a Tamper-Evident Append in a Few Lines

You rarely build the lineage store or the audit chain from scratch. A model registry such as MLflow records the run-to-model edge when you log a run and register the model it produced; deployment tooling then adds the edge to the serving replica, and a model card can travel with the registered version. The lineage side is one block:

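import mlflow
from sklearn.linear_model import LogisticRegression

# Stand-in estimator so the block runs; in practice this is your trained model.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

with mlflow.start_run() as run:
    mlflow.log_param("data_snapshot", "d44@c8f1")     # content-addressed input
    mlflow.log_param("seed", 7)                         # pin for reproducibility
    mlflow.log_metric("val_auc", 0.913)
    mlflow.sklearn.log_model(
        model,
        name="ranker",                    # MLflow 3.x; older releases use artifact_path=
        registered_model_name="ranker",   # without this the model is logged but not registered
    )
# The registry now holds the edge: this run -> this model version. A deployment
# record extends it to the serving replica, so lineage is a registry call, not a wiki search.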

Code 35.7.1: The lineage DAG of Section 2, recorded by the registry rather than by hand. Each logged run becomes a node with edges to its inputs and its output model version; deployment records close the chain to the serving replica.

The tamper-evident audit log is just as compact. Each entry carries the hash of the previous one, so any later edit breaks the chain:

import hashlib, json

def append(log, event):
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(event, sort_keys=True)
    h = hashlib.sha256((prev + body).encode()).hexdigest()  # chain to predecessor
    log.append({"event": event, "prev": prev, "hash": h})
    return log

audit = []
append(audit, {"actor": "svc-deployer", "action": "promote", "model": "ranker:v37"})
append(audit, {"actor": "ops-7", "action": "set_policy", "tenant": "eu", "pii": "deny"})
# Verifying the chain is O(n) and detects any silent edit to any past entry.

Code 35.7.2: A hash-chained audit log. The prev field links each record to its predecessor, so altering an earlier entry changes every subsequent hash and the tampering is detectable on a single linear pass.

4. Documentation, Access Control, and Multi-Tenant Policy Intermediate

Provenance and audit logs are machine-facing evidence; model cards and data cards are the human-facing layer that makes a model governable by people who did not build it. A model card states a model's intended use, its evaluation results, its known limitations, and the populations on which it was and was not validated; a data card documents a dataset's collection process, consent basis, and composition. In a single-model world a card is a document. In a fleet it is structured metadata attached to a registered version and rolled up across deployments, so that a governance query like "which serving replicas run a model whose card declares it unvalidated for medical use?" can be answered without reading thousands of documents. The evaluation results a card cites are themselves fleet-rolled-up metrics of the kind Chapter 5 computes; governance reuses that evaluation machinery rather than inventing a parallel one.
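Treating the card as structured metadata is what turns the governance question above into a query. The following is a minimal sketch: the `ModelCard` fields, the example versions, and the `replicas_unvalidated_for` helper are hypothetical and stand in for whatever schema a real registry exposes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelCard:
    model_version: str
    intended_use: str
    validated_for: frozenset = frozenset()
    not_validated_for: frozenset = frozenset()

# Hypothetical cards, keyed by the registered model version they describe.
cards = {
    "ranker:v37": ModelCard("ranker:v37", "credit-risk scoring",
                            frozenset({"finance"}), frozenset({"medical"})),
    "triage:v4": ModelCard("triage:v4", "clinical triage",
                           frozenset({"medical"})),
}

# Hypothetical deployment records: which replica serves which version.
deployments = [
    {"replica": "node-eu-1", "model_version": "ranker:v37"},
    {"replica": "node-eu-2", "model_version": "triage:v4"},
    {"replica": "node-us-1", "model_version": "ranker:v37"},
]

def replicas_unvalidated_for(domain, deployments, cards):
    """Replicas whose model card declares the model unvalidated for a domain."""
    return [d["replica"] for d in deployments
            if domain in cards[d["model_version"]].not_validated_for]

print(replicas_unvalidated_for("medical", deployments, cards))
# ['node-eu-1', 'node-us-1']
```

The join runs over deployment records and card fields, so it scales with the number of registered versions rather than with the number of documents a person would otherwise have to read.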

Access control and policy enforcement are where governance meets the multi-tenant, multi-region reality of a real fleet. A single store may hold models and data belonging to many tenants and subject to different regional rules; data residency law may forbid a European tenant's data from being processed on a node in another region. Enforcing this means policy is not a setting on one box but an invariant maintained across the fleet: every node must agree on who may invoke which model on which data in which region, and every access must be checked and logged against that policy. The scalable pattern is to express policy declaratively in one authoritative place and distribute it to every node as a versioned artifact, so the rule and its version appear in the audit log next to each decision. When the European policy in Code 35.7.2 says pii: deny, every node in scope must enforce it identically, and the audit log must show that it did.
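A declarative, versioned policy with a logged decision might look like the sketch below. Everything in it is an assumption made for illustration: the `POLICY` layout, the version string, the tenant rules, and the `check_access` and `decide_and_log` helpers. The audit list is a plain Python list here; in practice each record would go through a tamper-evident append such as the one in Code 35.7.2.

```python
# Hypothetical policy artifact, authored in one place and shipped to every node.
POLICY = {
    "version": "policy-v12",
    "rules": {
        "eu": {"allowed_regions": ["eu"], "pii": "deny"},
        "us": {"allowed_regions": ["us", "eu"], "pii": "allow"},
    },
}

def check_access(policy, tenant, node_region, uses_pii):
    """Return (allowed, reason) for one request under the given policy."""
    rule = policy["rules"].get(tenant)
    if rule is None:
        return False, "no rule for tenant"   # default deny
    if node_region not in rule["allowed_regions"]:
        return False, "region not permitted"
    if uses_pii and rule["pii"] == "deny":
        return False, "pii denied"
    return True, "ok"

def decide_and_log(audit, policy, tenant, node_region, uses_pii):
    """Check the request and record the decision next to the policy version."""
    allowed, reason = check_access(policy, tenant, node_region, uses_pii)
    audit.append({
        "tenant": tenant,
        "region": node_region,
        "allowed": allowed,
        "reason": reason,
        "policy_version": policy["version"],
    })
    return allowed

audit = []
print(decide_and_log(audit, POLICY, "eu", "eu", uses_pii=False))   # True
print(decide_and_log(audit, POLICY, "eu", "us", uses_pii=False))   # False: residency
print(decide_and_log(audit, POLICY, "eu", "eu", uses_pii=True))    # False: pii denied
print(audit[-1]["policy_version"])                                 # policy-v12
```

Because the decision record carries `policy_version`, an auditor can later tell which rule set a node was enforcing at the time, even after the policy has been revised.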

Practical Example: The Audit Nobody Could Answer From One Box

Who: An ML platform team running a credit-risk scoring service across three regions for several banking tenants.

Situation: A regulator invoked the right to explanation and asked, for a specific declined application from eight months earlier, which model version scored it, which data trained that version, and whether the tenant's data had stayed in-region.

Problem: The serving nodes kept only a rolling week of logs; the training cluster had been torn down and rebuilt many times; the dataset had been refreshed since. No single machine still held the answer.

Dilemma: Reconstruct the history by hand from scattered backups and hope it was complete and defensible, or concede that the system could not explain its own decision, a finding with regulatory teeth.

Decision: Because the team had earlier wired the fleet to emit signed records into an append-only lineage and audit store, they ran a provenance query instead of an investigation.

How: The audit log resolved the request to model v37; the lineage DAG walked v37 back to the dataset snapshot and code commit; the residency policy version logged beside each access proved the tenant's data never left its region.

Result: The full chain, data to model to decision to residency proof, was produced in under a day and verified against the published chain head, satisfying the regulator.

Lesson: Governance evidence must be collected continuously from every node while the system runs, because you cannot reconstruct from a torn-down fleet what you did not record. Accountability has to scale out alongside the workload it accounts for.

5. Distributed Monitoring and Drift Detection Across the Fleet Advanced

A model that was correct at deployment can become wrong as the world shifts under it, and across a fleet each replica sees its own slice of the world. A model serving European traffic may stay calibrated while the same model serving a different region drifts, because the input distribution there has moved. Governance therefore demands continuous, distributed monitoring: each node measures the distribution of the features (or predictions) it actually sees, compares it to the reference distribution the model was validated against, and emits a drift signal that the fleet view rolls up. The reference is exactly the distribution recorded on the model card when the version was signed off, which is why the card, the evaluation rollup, and the monitor are one connected system rather than three.

The standard fleet-friendly drift signal is the Population Stability Index. Bin the reference distribution into $B$ buckets (commonly its own deciles, so each bucket holds equal reference mass) with fractions $r_b$, measure the live fractions $\ell_b$ a node sees in those same buckets, and compute

$$\mathrm{PSI} = \sum_{b=1}^{B} (\ell_b - r_b)\,\ln\!\frac{\ell_b}{r_b}.$$

The PSI is a symmetrized relative-entropy quantity: it equals the sum $D_{\mathrm{KL}}(\ell \,\|\, r) + D_{\mathrm{KL}}(r \,\|\, \ell)$ of the Kullback-Leibler divergence $D_{\mathrm{KL}}(\ell \,\|\, r) = \sum_b \ell_b \ln(\ell_b / r_b)$ taken in both directions, so swapping $\ell$ and $r$ leaves it unchanged. Like the KL divergence it is undefined when a bucket is empty on either side, which is why implementations clip the fractions to a small floor before taking the logarithm. A widely used rule of thumb flags a material shift at $\mathrm{PSI} > 0.2$ and a minor one in $[0.1, 0.2]$. The signal is cheap to compute per node and its inputs are additive: a node ships its bucket counts, the fleet view computes each node's PSI and then a fleet rollup (mean and max PSI, count of flagged nodes), and an investigator drills from the rollup back into the offending node. Code 35.7.3 computes exactly this across a small fleet, flags the drifted nodes, and rolls the per-node evidence into one audit summary.
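A small numeric check makes the link between PSI and the KL divergence concrete. The four-bucket fractions below are made up for illustration, and the `kl`, `psi`, and `clip` helpers are a sketch written for this example using only the standard library.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) over matching buckets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def psi(r, l):
    """Population Stability Index between reference r and live l."""
    return sum((li - ri) * math.log(li / ri) for ri, li in zip(r, l))

def clip(p, eps=1e-6):
    """Floor each fraction so empty buckets do not hit log(0)."""
    return [max(x, eps) for x in p]

r = [0.25, 0.25, 0.25, 0.25]   # reference fractions (illustrative)
l = [0.10, 0.20, 0.30, 0.40]   # live fractions (illustrative)

print(round(psi(r, l), 4))
print(math.isclose(psi(r, l), kl(l, r) + kl(r, l)))   # True: PSI is KL in both directions
print(math.isclose(psi(r, l), psi(l, r)))             # True: symmetric in r and l

# An empty live bucket would make log(0) fail; clipping keeps the value finite.
empty = [0.0, 0.3, 0.3, 0.4]
print(math.isfinite(psi(r, clip(empty))))             # True
```

The floor `eps` is a tuning choice: a smaller floor makes an empty bucket contribute a larger term, so the same value should be used on every node for the PSIs to be comparable.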

Because a fleet emits far more data than you can store forever, monitoring usually samples. If a node logs a fraction $p$ of its requests for audit, then to retain an expected $m$ audited records for a slice you must serve about $m / p$ requests in that slice, so rare slices need a higher sampling rate to stay auditable. Choosing $p$ per slice rather than globally is how you keep audit coverage uniform across a skewed traffic distribution without paying to store everything, a direct application of the sampling reasoning from Chapter 5.
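The per-slice arithmetic can be sketched directly from the relation $n \approx m / p$, rearranged to $p = m / n$. The slice names, request counts, and target $m$ below are invented for illustration, and `required_rate` is a hypothetical helper.

```python
def required_rate(m, slice_requests):
    """Sampling fraction p so that the expected audited count p * n reaches m (capped at 1)."""
    if slice_requests <= 0:
        raise ValueError("slice must receive at least one request")
    return min(1.0, m / slice_requests)

# Illustrative traffic: requests served per slice over the retention window.
slice_requests = {"common": 1_800_000, "regional": 180_000, "rare": 20_000}
m = 500   # target expected audited records per slice (assumed)

rates = {name: required_rate(m, n) for name, n in slice_requests.items()}
for name, p in rates.items():
    print(f"{name:<9} p = {p:.6f}  expected audited = {p * slice_requests[name]:.0f}")

# Applying the common slice's rate everywhere would starve the rare slice.
global_p = rates["common"]
print(f"rare slice under global rate: {global_p * slice_requests['rare']:.1f} records")
```

With these numbers the rare slice needs a sampling rate about ninety times higher than the common one, and a single global rate set for the common slice would leave it with only a handful of audited records.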

import numpy as np

rng = np.random.default_rng(7)

# Reference distribution: the feature distribution captured when each model
# in the fleet was validated and its model card was signed off.
ref = rng.normal(0.0, 1.0, size=200_000)

# Edges from the reference quantiles: 10 equal-mass bins on the reference.
edges = np.quantile(ref, np.linspace(0.0, 1.0, 11))
edges[0], edges[-1] = -np.inf, np.inf

def binned_fraction(x, edges):
    counts, _ = np.histogram(x, bins=edges)
    return counts / counts.sum()

def psi(p_ref, p_live, eps=1e-6):
    p_ref = np.clip(p_ref, eps, None)
    p_live = np.clip(p_live, eps, None)
    return float(np.sum((p_live - p_ref) * np.log(p_live / p_ref)))

p_ref = binned_fraction(ref, edges)
THRESHOLD = 0.2  # industry rule of thumb: PSI > 0.2 is a material shift

# A fleet of edge/region nodes, each serving the SAME deployed model but
# seeing its own local input stream. Two have drifted (shift + scale change).
nodes = {
    "node-eu-1":  rng.normal(0.00, 1.00, 40_000),
    "node-eu-2":  rng.normal(0.05, 1.02, 40_000),
    "node-us-1":  rng.normal(0.60, 1.15, 40_000),   # drifted
    "node-ap-1":  rng.normal(-0.03, 0.98, 40_000),
    "node-ap-2":  rng.normal(1.10, 1.40, 40_000),   # drifted hard
}

print(f"{'node':<11}{'PSI':>9}   status")
flagged, psis = [], []
for name, x in nodes.items():
    s = psi(p_ref, binned_fraction(x, edges))
    psis.append(s)
    status = "DRIFT" if s > THRESHOLD else "ok"
    if s > THRESHOLD:
        flagged.append(name)
    print(f"{name:<11}{s:>9.4f}   {status}")

# Roll the per-node evidence up into one fleet-level audit summary.
print("-" * 34)
print(f"fleet nodes audited  : {len(nodes)}")
print(f"fleet mean PSI       : {np.mean(psis):.4f}")
print(f"fleet max  PSI       : {np.max(psis):.4f}")
print(f"nodes flagged        : {len(flagged)}  {flagged}")
print(f"fleet verdict        : {'INVESTIGATE' if flagged else 'stable'}")

Code 35.7.3: Per-node PSI drift detection rolled up into a fleet audit summary. Each node compares its live feature distribution to the reference recorded on the model card, computes a PSI, and is flagged past the $0.2$ threshold; the fleet view then aggregates the per-node signals into mean PSI, max PSI, a flagged-node list, and a single verdict.

node             PSI   status
node-eu-1     0.0003   ok
node-eu-2     0.0027   ok
node-us-1     0.3013   DRIFT
node-ap-1     0.0019   ok
node-ap-2     0.8158   DRIFT
----------------------------------
fleet nodes audited  : 5
fleet mean PSI       : 0.2244
fleet max  PSI       : 0.8158
nodes flagged        : 2  ['node-us-1', 'node-ap-2']
fleet verdict        : INVESTIGATE

Output 35.7.3: Two of the five nodes breach the $0.2$ threshold, with node-ap-2 at a severe $\mathrm{PSI} = 0.82$. The fleet mean of $0.22$ would by itself flag the fleet, but the per-node detail is what an investigator needs: the drift is concentrated in two nodes, not spread evenly, and the rollup preserves that fact rather than averaging it away.

Key Insight: Roll Up the Signal, Keep the Detail

A fleet-level governance number is only useful if you can drill from it back to the node that produced it. The fleet mean PSI in Output 35.7.3 says "something is wrong"; the per-node breakdown says "on these two nodes, by this much." An aggregation that discards the per-node terms to save space destroys the very evidence an audit needs. The discipline is to roll up for the dashboard and retain (at a sampled rate) for the investigation, so accountability scales out without collapsing into a single uninterpretable average.

6. Regulatory Pressure and the Limits of Distributed Auditability Advanced

The engineering above exists because external rules increasingly demand it. A right to explanation requires that an automated decision be traceable to the model and inputs that produced it, which is precisely the provenance query of Section 2. Data-residency law requires that a tenant's data be processed and stored only in permitted regions, which is the multi-region policy enforcement of Section 4 plus the audit log that proves compliance. Model-risk-management regimes, originally written for financial models and now extended toward AI systems, require documented validation, ongoing monitoring, and independent review, which map onto the model card, the fleet drift monitor, and the tamper-evident audit log respectively. The recurring lesson is that compliance is not a document you produce at the end; it is a property you must have instrumented the fleet to demonstrate continuously.

Decentralized settings push auditability to its hardest form. In federated learning (Chapter 14) the training data never leaves the participating devices by design, so the data lineage edge "this model learned from this data" cannot be a content hash of data the coordinator is permitted to see. Governance there must prove properties of training (that secure aggregation ran, that a differential-privacy budget was respected as in Section 35.6, that no single client dominated) without inspecting the data itself, and doing so in general remains an open problem. The same tension appears at the edge fleet of Chapter 34, where intermittently connected devices cannot ship a continuous audit stream and must instead carry a local tamper-evident log that reconciles when connectivity returns. Auditing a system whose whole premise is that you cannot see inside it is the hardest case fleet governance faces.

Research Frontier: Verifiable and Privacy-Preserving Audit (2024 to 2026)

Two research lines are converging on auditability that does not require trusting the operator. Cryptographic provenance and supply-chain attestation for ML, in the lineage of the SLSA framework and model-signing efforts (Sigstore-style signing of model artifacts, now adopted by major model registries in 2024 to 2025), aim to make a model's training-to-deployment chain independently verifiable rather than self-reported. In parallel, zero-knowledge machine learning (zkML) and proof-of-inference work let a node prove it ran a specific model on a specific input and produced a specific output without revealing the model weights or the input, which would let a fleet emit cryptographically verifiable audit records even across mutually distrusting tenants. A third thread builds privacy-preserving drift and fairness monitoring over federated fleets, computing the rollups of Section 5 without centralizing the raw distributions. The open question that ties them together is whether governance evidence can be made trustworthy enough to satisfy a regulator while remaining cheap enough to emit from every node in a fleet that already counts its all-reduces, a cost-versus-trust trade-off in the spirit of Chapter 4.

Thesis Thread: Accountability Must Scale Out Too

The book's spine is that the essential activities of an AI system, data, computation, model, inference, and decision-making, are distributed across many machines and must be coordinated to act as one. This section adds accountability to that list. You cannot inspect a single box to govern a fleet any more than you can train a foundation model on one. The same moves return: emit from every node, ship the evidence to a durable place, and roll it up correctly, with the same attention to communication cost and fault tolerance the rest of the book demanded of gradients and checkpoints. Governance is not a postscript to scale-out engineering; it is one more axis of it.

Exercise 35.7.1: What the Lineage DAG Must Hold Conceptual

A regulator asks you to prove, for a single declined loan decision made nine months ago, the complete chain from training data to the served decision. List the artifacts and the edges between them that your lineage DAG and audit log must contain to answer this, starting from the dataset snapshot and ending at the served request. For each edge, state what would make the chain unverifiable if it were missing, and explain why a wiki page describing the pipeline could not substitute for the recorded edges at fleet scale.

Exercise 35.7.2: Tamper-Evidence and Sampled Coverage Coding

Extend the hash-chain append of Code 35.7.2 with a verify(log) function that recomputes every hash and returns the index of the first tampered entry (or -1 if intact); demonstrate it by mutating one entry's body in place and confirming detection. Then, separately, suppose a node samples a fraction $p$ of its requests for audit and you need an expected $m = 200$ audited records in a slice that receives only $0.5\%$ of traffic. Using the relation that retaining $m$ records needs about $m/p$ served requests in the slice, compute the per-slice sampling rate that keeps this rare slice auditable when the node serves one million requests, and explain why a single global sampling rate would under-cover it.

Exercise 35.7.3: Reading the Rollup Analysis

In Output 35.7.3 the fleet mean PSI is $0.22$, already past the $0.2$ threshold, yet three of the five nodes are well below $0.01$. Construct a second fleet of five nodes whose individual PSIs all lie in the minor-shift band $[0.1, 0.2]$, with a mean of about $0.15$, so that the fleet mean stays under the threshold and no node is flagged. Contrast the governance response it warrants with the response warranted by Output 35.7.3. Argue from these two cases why a fleet-level governance dashboard must expose the distribution of per-node signals (or at least the max and the flagged count) rather than the mean alone, and connect this to the "keep the detail" insight of Section 5.