Almost every AI safety claim in 2026 rests on a benchmark result. Your model scores X on HarmBench, Y on AdvBench, Z on the latest BIG-bench safety subset, or some number on a proprietary internal eval. The claim lands in a model card, a regulator's filing, a procurement questionnaire, or a board deck, and everyone moves on. What the claim almost never carries is the provenance of the dataset that produced the number.
I spent a chunk of Q4 2025 auditing how the industry manages safety eval data, and I came away convinced that the eval corpora are a supply chain attack surface in the exact sense we usually reserve for code dependencies. They are pulled from remote sources, they are cached, they are vendored, they are trusted, and they are almost never attested. Every vulnerability pattern we know from code supply chain, including typosquatting, version pinning drift, maintainer compromise, and silent contamination, has a clean analogue in eval data. This post is the map.
Why the eval corpus is a dependency
A safety eval dataset is, functionally, a dependency of your release pipeline. Your model card says "we tested against X" and X is code plus data. The code lives in a repo. The data lives wherever the original authors hosted it, which is usually Hugging Face Datasets, GitHub, or a university server.
Treating this as a supply chain question means asking the same things you ask of a pip package. Who publishes it. What is their signing key. What version are you pinned to. How do you detect when the published version changes. Do you verify the hash. Have you reviewed the code and data at the version you are using. Almost nobody I audited in late 2025 could answer those for their primary safety benchmarks. The data is loaded by name, not by hash, from the default datasets hub, at whatever revision was live at the time of the run.
That is the same posture pip users had in 2015, and we all know how that went.
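For concreteness, here is a minimal sketch of that default posture, using the Hugging Face datasets API; the repo id is illustrative, not a real benchmark path:

```python
from datasets import load_dataset

# The common posture: pull the eval corpus by name, at whatever revision is
# live on the hub when the job runs. No commit pin, no hash check, no record
# of what you actually evaluated against.
eval_set = load_dataset("some-org/jailbreak-eval", split="test")  # repo id is illustrative
```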
Dataset poisoning as a deliberate attack
A motivated attacker who wants a specific behaviour to pass safety eval has a direct path: contribute to, or influence, the eval dataset itself. HarmBench, TruthfulQA, BIG-bench, RealToxicityPrompts, the Anthropic HH corpus, and dozens of specialised corpora live in public repositories and dataset-hub spaces that take contributions through PRs and uploads. A contribution that looks like a legitimate edge case but is carefully chosen to be easy for the attacker's own model to pass is effectively a backdoored benchmark.
The impact compounds because downstream users rarely read the diff on a dataset revision. They pull the latest, run the eval, compare to the published baseline, and publish their score. An attacker who gets a single adversarial sample into a high-profile benchmark moves the baseline for the entire industry.
The most interesting 2025 precedent was the MMLU contamination discussion that ran through the research community from late 2024 into mid-2025. MMLU is not a safety eval, but the contamination pattern was identical. Test questions leaked into training corpora, and models that saw the test questions during pretraining were scoring higher than models that did not. The baseline itself became meaningless without contamination analysis, and the whole field spent six months trying to reconstruct "clean" MMLU variants.
Safety evals are more vulnerable to this because they are more specialised and have fewer contributors watching for poisoning. A subtle change in a jailbreak eval is much harder to spot than a subtle change in a math eval.
Contamination is the boring cousin that does more damage
Deliberate poisoning is the scary version. Contamination is the boring version, and it does more damage in aggregate because it is ambient. Every major safety benchmark has leaked into pretraining corpora by now. The Common Crawl snapshots that feed most pretraining pipelines include GitHub, arXiv, and Hugging Face, all of which host the eval questions and reference answers.
The effect is that a 2026 model pretrained on recent web data has almost certainly seen most of the questions on most of the major safety benchmarks. The benchmark is no longer measuring generalisation. It is measuring something closer to recall with a thin layer of reformulation on top.
The research community knows this. Papers on MMLU, HumanEval, and MATH contamination have been landing steadily since 2023. The safety-eval side has been slower to confront it, but the contamination is the same. If your 2026 release candidate scores 96 on a safety benchmark that was published in 2022 and has been in every web corpus since, the 96 does not mean what you think it means.
The quiet version is silent dataset drift
The third failure mode is version drift. A benchmark is published, you run it in March, and you get a score. You rerun it in November and get a different score. You debug for a week, chasing evaluator bugs, prompt-format changes, and decoding-parameter changes, until you realise the dataset itself changed. The maintainers accepted three PRs between March and November that added 40 samples, removed 12, and reworded 80. Your score difference is entirely explained by dataset drift.
This happened to me twice in 2025 on production releases. Both times the drift was benign, but both times it cost a week of investigation and delayed the release. If the drift had been malicious, we would have caught it only because we pinned hashes going forward, not because of anything in the dataset-hub UX.
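The pinning that caught it is not sophisticated. Here is a minimal sketch of the kind of content fingerprint we now record and re-check before every run, assuming the benchmark loads as a list of JSON-serialisable records; the sample record is a toy stand-in:

```python
import hashlib
import json

def corpus_fingerprint(records):
    """Deterministic content hash over an eval corpus: serialise each record
    with sorted keys, sort the serialised records, and hash the stream, so the
    value depends only on content, not on dict or load order."""
    h = hashlib.sha256()
    for line in sorted(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records):
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

# Toy usage: in practice `records` is the loaded benchmark split, and the
# pinned value lives in version control next to the model card.
records = [{"prompt": "example jailbreak prompt", "label": "refuse"}]
pinned = corpus_fingerprint(records)          # recorded at the March run
assert corpus_fingerprint(records) == pinned  # re-checked at the November run
```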
What to do before your 2026 release cycle
Pin dataset revisions by hash, not by name or tag. Hugging Face Datasets supports revision pinning via commit SHA. Pip has taught us this lesson, and there is no reason to relearn it on eval data.
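A minimal sketch of what the pin looks like with the Hugging Face datasets API; the repo id and SHA are placeholders for your own pinned values:

```python
from datasets import load_dataset

# Pin to an exact commit SHA on the dataset repo, not a branch or tag that
# can move underneath you. Repo id and SHA below are placeholders.
EVAL_REPO = "some-org/jailbreak-eval"
EVAL_REVISION = "abc123def4567890abc123def4567890abc123de"  # full commit SHA

eval_set = load_dataset(EVAL_REPO, split="test", revision=EVAL_REVISION)
```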
Maintain an internal mirror of the eval corpora you rely on. If your compliance story depends on a benchmark, you cannot let the only copy live on a third-party server that can rewrite history. Mirror it, sign it, and store the signature next to the model card.
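A sketch of the mirror-and-manifest step, assuming huggingface_hub's snapshot_download and a placeholder repo id; the signing step is whatever tooling you already trust:

```python
import hashlib
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# 1. Mirror the pinned revision to storage you control.
local_dir = snapshot_download(
    repo_id="some-org/jailbreak-eval",                # placeholder repo id
    repo_type="dataset",
    revision="abc123def4567890abc123def4567890abc123de",  # the SHA your pipeline pins
)

# 2. Produce a per-file sha256 manifest that ships next to the model card.
manifest = {
    str(p.relative_to(local_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
    for p in sorted(Path(local_dir).rglob("*"))
    if p.is_file()
}
Path("eval-manifest.json").write_text(json.dumps(manifest, indent=2))

# 3. Sign the manifest with your existing signing tooling
#    (e.g. `gpg --detach-sign eval-manifest.json`) and store the signature
#    alongside the model card.
```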
Run contamination analysis as part of your eval pipeline. Anthropic's approach of checking eval samples against a pretraining-corpus embedding index is the right shape. If you find high-similarity matches between your eval and your training data, treat the score from that eval as a floor, not a ceiling.
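The shape of that check is simple even if the production version is not. A toy sketch, with a stand-in embed() function and a brute-force similarity scan; a real pipeline would use an actual sentence-embedding model and an approximate-nearest-neighbour index over the pretraining corpus:

```python
import numpy as np

def embed(texts):
    """Stand-in for an embedding model; returns unit-normalised vectors.
    Placeholder only: real pipelines use a trained sentence encoder."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def contamination_flags(eval_samples, pretraining_chunks, threshold=0.9):
    """Flag eval samples whose nearest pretraining chunk exceeds the
    cosine-similarity threshold -- candidates for memorised answers."""
    eval_vecs = embed(eval_samples)
    corpus_vecs = embed(pretraining_chunks)
    sims = eval_vecs @ corpus_vecs.T               # cosine similarity (unit vectors)
    return [
        (sample, float(best))
        for sample, best in zip(eval_samples, sims.max(axis=1))
        if best >= threshold
    ]

flagged = contamination_flags(
    ["example eval prompt"],                       # eval samples
    ["scraped web text that may contain the benchmark"],  # pretraining chunks
)
```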
Review the contributor list and recent changes for every safety benchmark you rely on. If an unknown contributor added 20 samples to a jailbreak eval last month, that is a supply chain event worth reviewing, not a "community contribution" worth trusting by default.
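A sketch of how to surface that history, assuming huggingface_hub's list_repo_commits and its GitCommitInfo fields; the repo id is a placeholder:

```python
from huggingface_hub import HfApi

api = HfApi()
REPO = "some-org/jailbreak-eval"  # placeholder repo id

# Surface recent changes to the eval corpus: who committed what, and when.
# Anything you don't recognise since your last pinned revision is a supply
# chain event that gets a human review before the next eval run.
for commit in api.list_repo_commits(REPO, repo_type="dataset")[:20]:
    print(commit.commit_id[:8], commit.created_at, commit.title, commit.authors)
```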
Publish an eval SBOM with your model card. The model card should carry not just "we tested on HarmBench" but "we tested on HarmBench at commit abc123 with the following contamination analysis and the following sample-level hash manifest." This is the standard I expect regulators to start asking for through 2026 as the EU AI Act conformity assessments mature.
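A sketch of one eval SBOM entry, written as a plain dict dumped to JSON; the field names are illustrative, not a formal schema:

```python
import json

# One entry per benchmark: enough to reproduce the number and to audit the
# data that produced it.
eval_sbom_entry = {
    "benchmark": "HarmBench",
    "repo": "some-org/HarmBench",              # placeholder repo id
    "revision": "abc123",                      # full commit SHA in practice
    "sample_manifest": "sha256:...",           # hash of the per-sample manifest
    "contamination_report": "reports/harmbench-vs-pretraining.json",
    "score": 0.94,
    "run_id": "2026-03-eval-17",
}
print(json.dumps(eval_sbom_entry, indent=2))
```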
The broader point
The AI safety field has been treating eval datasets as scientific artifacts rather than as software supply chain dependencies. The two framings carry different security postures, and the scientific framing is losing under adversarial pressure. The 2026 model release cycle is going to include at least one high-profile incident where a safety benchmark score turns out to have been produced against a subtly manipulated dataset, and the industry will spend six months reconstructing trusted versions of the major corpora.
If you do not want to be the case study, treat eval data the way you treat pip dependencies. Pin, verify, mirror, sign, and audit. The tooling is already there. The mindset is the missing piece.
How Safeguard Helps
Safeguard extends AI-BOM to cover eval datasets, not just models and weights. We inventory every safety benchmark, evaluator, and reference corpus in your release pipeline, capture their commit-level provenance, and verify hashes at eval time against signed manifests. Safeguard runs contamination analysis between your eval samples and your training indexes to flag where scores may be inflated by memorisation, and our policy gates refuse release when eval data has drifted from the pinned revision without review. When regulators or auditors ask how you know your safety claim is defensible, the answer is a signed SBOM with sample-level attestation, not a benchmark name.