Training Data Provenance for Enterprise Fine-Tuning

Fine-tuning corpora are supply chain artifacts. We cover the provenance signals, attestations, and drift controls enterprises need before pushing weights to prod.

Nayan Dey
Senior Security Engineer
5 min read

A regional bank we audited in October 2025 had fine-tuned a Llama 3.1 70B checkpoint on 14 million internal tickets to power an internal assistant. Six months after the fine-tune, nobody could answer a basic question: which tickets were in the training set? The data engineering team had pulled from a Snowflake view that had been redefined twice; the view definition wasn't versioned; the underlying tables had GDPR-driven deletions applied after training. The model had almost certainly memorized content that should no longer exist, and there was no way to prove it hadn't without re-running the extraction, which was impossible because the source state was gone.

This is the default posture for enterprise fine-tuning today. The corpus is treated as a one-shot input, not a versioned artifact with a provenance chain. When a regulator asks "what's in the model," the honest answer is "we don't know."

Why is fine-tuning data a supply chain problem?

Because a fine-tuned model is a derived artifact whose behavior is a function of inputs that are almost never signed, never hashed, and rarely content-addressed. Pre-training gets the attention — the Pile, RefinedWeb, FineWeb, C4 — and those corpora at least have published checksums and data cards. Enterprise fine-tuning corpora are ad-hoc SQL dumps, SharePoint exports, and CSVs passed between teams. The SLSA v1.1 provenance model, the in-toto attestation spec, and the emerging ML-BOM schema from CycloneDX 1.6 all have the primitives to describe training inputs. Few teams use them. The result is that when a prompt injection later extracts a verbatim customer record, nobody can tell whether the record was in the training set, the RAG index, or the prompt.

What provenance fields actually matter for fine-tuning?

The minimum viable set is: source system identifier, query or extraction definition, extraction timestamp, row count, content hash of the serialized corpus, transformations applied, and the identity that performed the extraction. CycloneDX 1.6 adds formulation blocks that carry this cleanly; MLflow 2.17's dataset lineage maps to the same fields. The important property is that the content hash covers the final, post-transform corpus — the exact bytes fed to the trainer. Hashing the source query is insufficient because two runs of the same query against a mutable source produce different corpora. We've seen teams rely on S3 object versioning, which is better than nothing but doesn't cover in-memory transformations in a Ray or Spark job that never persists intermediate state.
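
A minimal sketch of that field set, assuming the post-transform corpus has been serialized to a JSONL file. The field names, source URI, and transform names here are illustrative, not the CycloneDX or MLflow schema; the point is that the hash is computed over the exact bytes handed to the trainer.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the final serialized corpus, chunk by chunk."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(corpus_path: Path, source_system: str,
                   extraction_def: str, row_count: int,
                   transforms: list[str]) -> dict:
    return {
        "source_system": source_system,
        "extraction_definition": extraction_def,      # the SQL itself, not just its hash
        "extraction_timestamp": datetime.now(timezone.utc).isoformat(),
        "row_count": row_count,
        "transforms": transforms,                     # ordered, named transform steps
        "content_sha256": sha256_file(corpus_path),   # covers the post-transform bytes
        "extracted_by": getpass.getuser(),
    }

# Illustrative values; the Snowflake URI and transform names are hypothetical.
manifest = build_manifest(
    Path("corpus.jsonl"),
    source_system="snowflake://prod/analytics/tickets_v3",
    extraction_def="SELECT id, subject, body FROM tickets WHERE created_at < '2025-01-01'",
    row_count=14_000_000,
    transforms=["pii_scrub_v3", "minhash_dedup", "jsonl_serialize"],
)
Path("corpus.manifest.json").write_text(json.dumps(manifest, indent=2))
```

A manifest like this maps directly onto a CycloneDX formulation block or an MLflow dataset record; the exact format matters less than committing to the post-transform content hash.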

How do you handle deletions after the model ships?

You can't delete from weights, but you can maintain an obligation ledger. When a record is deleted from the source system, an obligation is written: "record X was present in training corpus Y with hash Z, and was deleted from the source on date D." At audit time, you can answer "what's in the model that shouldn't be?" by joining the obligation ledger against the corpus manifest. This doesn't unlearn the data, but it is what regulators under the EU AI Act and the GDPR-AI guidance published in September 2025 are starting to ask for. Unlearning research through 2025 has produced techniques like SISA's sharded retraining and approximate unlearning, but the machine-verifiable audit trail matters more for compliance than the unlearning itself.
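
The ledger itself can be simple. A minimal sketch, assuming each training corpus ships with a per-record manifest; the SQLite schema, table names, and record IDs are illustrative, not a product schema.

```python
import sqlite3

db = sqlite3.connect("provenance.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS corpus_manifest (
    corpus_id TEXT, record_id TEXT, record_sha256 TEXT
);
CREATE TABLE IF NOT EXISTS deletion_obligations (
    record_id TEXT, deleted_from_source_on TEXT, reason TEXT
);
""")

def record_deletion(record_id: str, deleted_on: str, reason: str) -> None:
    """Write an obligation the moment the source-system delete lands."""
    db.execute("INSERT INTO deletion_obligations VALUES (?, ?, ?)",
               (record_id, deleted_on, reason))
    db.commit()

def audit(corpus_id: str) -> list[tuple]:
    """What is in this corpus (and hence likely in the model) that should not be?"""
    return db.execute("""
        SELECT m.record_id, m.record_sha256, o.deleted_from_source_on, o.reason
        FROM corpus_manifest m
        JOIN deletion_obligations o ON o.record_id = m.record_id
        WHERE m.corpus_id = ?
    """, (corpus_id,)).fetchall()

# Hypothetical IDs for demonstration.
db.execute("INSERT INTO corpus_manifest VALUES (?, ?, ?)",
           ("tickets-2025-01", "ticket-48213", "ab12..."))
record_deletion("ticket-48213", "2025-06-02", "GDPR erasure request")
print(audit("tickets-2025-01"))
```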

What about synthetic data and LLM-generated augmentation?

Synthetic data inherits the provenance of the generator. If you use GPT-5 or Claude 4 to augment a classification dataset, your corpus provenance now includes a model identifier, a model version, the system prompt, the sampling parameters, and ideally the inference timestamp. OpenAI's November 2025 data policy change means outputs generated via the API are not used for training, but it does not prevent those outputs from carrying copyrighted content memorized from the generator's own training data. A customer of ours found that roughly 0.7% of GPT-4o-generated product descriptions contained verbatim strings from competitor sites — measurable only because they had hashed the synthetic corpus and compared it against a web snapshot. Without the provenance, the leak would have shipped into their fine-tuned model.
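
A minimal sketch of per-row generator provenance; the metadata fields, the pinned version string, and the wrapper function are assumptions for illustration, not any vendor's API.

```python
import hashlib
import json
from datetime import datetime, timezone

SYSTEM_PROMPT = "You write concise product descriptions."

# Pin the generator's exact identity, never a floating alias like "latest".
GENERATOR = {
    "model": "gpt-4o",
    "model_version": "2024-08-06",      # hypothetical pinned version
    "system_prompt_sha256": hashlib.sha256(SYSTEM_PROMPT.encode()).hexdigest(),
    "temperature": 0.7,
    "top_p": 1.0,
}

def wrap_synthetic(text: str) -> dict:
    """Attach generator provenance and a content hash to one synthetic row."""
    return {
        "text": text,
        "text_sha256": hashlib.sha256(text.encode()).hexdigest(),
        "generator": GENERATOR,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

row = wrap_synthetic("A rugged 20L daypack with a padded laptop sleeve.")
print(json.dumps(row, indent=2))
```

Per-row hashes are what make a leak check like the one above cheap: hash every synthetic row once at generation time, and comparing against a hashed web snapshot becomes a set intersection instead of a crawl.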

How does reachability apply to training data?

Reachability here means: which rows in the corpus materially affected which model behaviors? Exact answers require expensive influence-function work (TRAK, from MIT's CSAIL in 2023, remains the best open technique), but coarse answers are cheap. Tag every training row with a topic cluster at ingest; measure model behavior change on held-out probes before and after fine-tuning; correlate behavior shifts with cluster presence. When the model starts confidently answering questions about a product line you didn't train on, the provenance index tells you which cluster seeped in. We've used this pattern to catch a 3,100-row subset of a support corpus that carried internal pricing data the model then surfaced verbatim under specific prompts.
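
The coarse version fits in a few lines. A sketch, assuming you track one scalar probe metric per topic cluster; the cluster names, counts, and scores are invented for illustration.

```python
# Rows per topic cluster, tagged at corpus ingest (hypothetical counts).
corpus_clusters = {"billing": 120_000, "pricing": 3_100, "outages": 54_000}

# Probe accuracy per topic on held-out probes, measured before and after
# fine-tuning (hypothetical values).
probe_before = {"billing": 0.62, "pricing": 0.18, "outages": 0.55}
probe_after  = {"billing": 0.81, "pricing": 0.74, "outages": 0.70}

def suspicious_shifts(threshold: float = 0.30) -> list[str]:
    """Flag clusters whose behavior moved sharply and were present in the corpus."""
    flagged = []
    for topic, before in probe_before.items():
        delta = probe_after[topic] - before
        if delta >= threshold and corpus_clusters.get(topic, 0) > 0:
            flagged.append(f"{topic}: +{delta:.2f} across {corpus_clusters[topic]:,} rows")
    return flagged

print(suspicious_shifts())
# ['pricing: +0.56 across 3,100 rows'] — a large shift on a cluster you never
# intended to train on points straight at the subset that caused it.
```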

Who signs the corpus and the resulting weights?

The same entities that sign your container images. Sigstore's cosign, extended with the model-signing attestation format published by the OpenSSF AI/ML Security SIG in June 2025, lets you produce an in-toto statement binding the corpus hash, the training script commit, the trainer identity, and the output weight hash. We've deployed this on Kubeflow and Vertex AI pipelines; it adds perhaps 45 seconds to a training job and gives you a cryptographic chain from SQL query to safetensors file. When incident responders later ask "did this checkpoint come from an authorized pipeline," the attestation answers in seconds instead of days.
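
A hand-rolled sketch of the shape of such a statement, signed as a blob with cosign. The predicate type URI, field names, commit, and URIs below are placeholders, not the OpenSSF model-signing schema; note also that keyless signing triggers an interactive OIDC flow unless the pipeline supplies an identity token.

```python
import hashlib
import json
import subprocess
from pathlib import Path

def sha256_hex(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# An in-toto v1 Statement binding the output weights to corpus and code.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{
        "name": "adapter.safetensors",
        "digest": {"sha256": sha256_hex("adapter.safetensors")},
    }],
    "predicateType": "https://example.com/ml-training/v0.1",   # placeholder URI
    "predicate": {
        "corpus": {
            "uri": "s3://corpora/tickets-2025-01.jsonl",       # hypothetical
            "digest": {"sha256": sha256_hex("corpus.jsonl")},
        },
        "trainingScriptCommit": "9f3c2e1",                     # hypothetical commit
        "builder": {"id": "kubeflow-pipeline://fine-tune/v3"}, # trainer identity
    },
}
Path("statement.json").write_text(json.dumps(statement, indent=2))

# Sign the serialized statement with Sigstore; the bundle carries the
# signature and verification material for later offline checks.
subprocess.run(
    ["cosign", "sign-blob", "--bundle", "statement.bundle", "statement.json"],
    check=True,
)
```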

How Safeguard Helps

Safeguard treats fine-tuning corpora as first-class artifacts in the AI-BOM, capturing source extractions, transformation lineage, and post-transform content hashes alongside the resulting weight artifacts. Griffin AI correlates corpus clusters with model-behavior drift across evaluation runs, so regression in one topic surfaces against the data subset that likely caused it. Policy gates block training pipelines that produce weights without a valid provenance attestation, and the reachability view shows which downstream products consume which checkpoints — so when a deletion obligation lands, you know which systems need to be re-evaluated.
