When a model's training data includes the test cases used to benchmark it, the benchmark measures memorisation rather than capability. Contamination is widespread in general ML benchmarks; for AI-for-security benchmarks specifically, the problem is worse because public CVE databases, advisory corpora, and security research are heavily represented in training data. A model asked about a well-known CVE has likely seen the full advisory during training. The benchmark reports accuracy that doesn't generalise.
How contamination happens in security evals
Three common paths:
- CVE database ingestion. NVD, GitHub Advisory Database, and vendor advisory pages are scraped into training corpora.
- Writeup availability. Bug bounty writeups, CTF solutions, and security research blogs are public and well-represented.
- Code repository inclusion. Public vulnerable-by-design applications are in training data.
Any benchmark constructed from public data therefore has a contamination floor: some fraction of its items will have appeared in training corpora no matter how carefully the benchmark is curated.
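One rough way to gauge that floor is an n-gram overlap check between benchmark item text and a public corpus. The sketch below is illustrative, not Griffin AI's actual implementation; the 8-word shingle size and the toy corpus are assumptions chosen for the example:

```python
def ngrams(text, n=8):
    """Set of n-word shingles from `text`, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(item_text, public_corpus, n=8):
    """Flag a benchmark item that shares any n-gram with the public corpus."""
    return bool(ngrams(item_text, n) & ngrams(public_corpus, n))

# Hypothetical advisory text standing in for a scraped public corpus.
corpus = ("the parser fails to validate the length field before copying "
          "attacker controlled data into a fixed buffer")

print(is_contaminated("fails to validate the length field before "
                      "copying attacker controlled data", corpus))   # True
print(is_contaminated("integer overflow in the image decoder allows "
                      "remote code execution", corpus))              # False
```

In practice the corpus side would be a large-scale index rather than a string, but the principle is the same: verbatim overlap with public text is a lower bound on contamination.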
How to detect it
Three techniques:
- Temporal cutoff. Use CVEs disclosed after the model's training cutoff. If the model cannot have seen them during training, performance on them reflects capability rather than recall; because published cutoff dates are often approximate, leave a margin after the stated cutoff.
- Private datasets. Construct benchmarks from private code and findings not in public training data.
- Paraphrase variance. Compare the model's accuracy on the original description vs a paraphrase. Large gaps signal memorisation.
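The paraphrase-variance check reduces to comparing two accuracy figures. A minimal sketch, assuming per-item correctness is already recorded as booleans (the function name and the example numbers are hypothetical):

```python
def paraphrase_gap(original_correct, paraphrase_correct):
    """Accuracy gap between original and paraphrased prompts.

    Each argument is a list of booleans, one per benchmark item,
    recording whether the model answered that item correctly.
    """
    acc_orig = sum(original_correct) / len(original_correct)
    acc_para = sum(paraphrase_correct) / len(paraphrase_correct)
    return acc_orig - acc_para

# 80% on original CVE descriptions vs 40% on paraphrases of the same
# flaws: a gap this large suggests the model is matching memorised
# surface text rather than reasoning about the vulnerability.
gap = paraphrase_gap(
    [True, True, True, True, False],
    [True, False, False, True, False],
)
print(round(gap, 2))  # 0.4
```

The paraphrases must preserve every technical fact; otherwise the gap measures prompt quality, not memorisation.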
Griffin AI uses all three approaches in its eval harness.
How to mitigate it
Four practices:
- Keep some eval sets private. Don't publish the dataset, only the methodology and numbers.
- Use temporal separation rigorously. Each model's eval uses only post-training-cutoff data.
- Include synthetic vulnerabilities. Cases not derived from any public source.
- Disclose contamination risk explicitly. Methodology documents should acknowledge it.
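Rigorous temporal separation is mostly a filtering problem: drop every item whose disclosure date falls before the model's cutoff, plus a safety margin for fuzzy cutoff dates. A minimal sketch; the cutoff date and margin below are illustrative assumptions, not any real model's:

```python
from datetime import date, timedelta

def post_cutoff_cves(cves, training_cutoff, margin_days=90):
    """Keep CVEs disclosed safely after the model's stated training cutoff.

    `cves` is an iterable of (cve_id, disclosure_date) pairs; the margin
    guards against understated or approximate cutoff dates.
    """
    threshold = training_cutoff + timedelta(days=margin_days)
    return [cve_id for cve_id, disclosed in cves if disclosed > threshold]

records = [
    ("CVE-2021-44228", date(2021, 12, 10)),  # Log4Shell: long public
    ("CVE-2024-3094", date(2024, 3, 29)),    # xz backdoor
]
print(post_cutoff_cves(records, training_cutoff=date(2023, 4, 30)))
# ['CVE-2024-3094']
```

Because each model has its own cutoff, the eligible set differs per model, which is why the filter has to run per evaluation rather than once per benchmark.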
What customers should evaluate
Three questions:
- What is the contamination mitigation strategy for the eval benchmarks quoted?
- Are results stable across paraphrased variants?
- Is post-training-cutoff data included?
How Safeguard helps
Safeguard's Griffin AI eval harness combines all three detection techniques: post-training-cutoff data, private datasets, and paraphrase variance testing. Published numbers carry as little contamination as these methods allow, and the methodology documents state the residual risk. For customers whose procurement evaluation weighs benchmark reliability, that methodological discipline is the differentiator.