AI Security

Benchmark Contamination Concerns In Security Evals

When the test set is in the training set, the benchmark is broken. Security eval contamination is widespread, and the mitigations are specific.

Nayan Dey
Senior Security Engineer
2 min read

When a model's training data includes the test cases used to benchmark it, the benchmark measures memorisation rather than capability. Contamination is widespread in general ML benchmarks; for AI-for-security benchmarks specifically, the problem is worse because public CVE databases, advisory corpora, and security research are heavily represented in training data. A model asked about a well-known CVE has likely seen the full advisory during training. The benchmark reports accuracy that doesn't generalise.

How contamination happens in security evals

Three common paths:

  • CVE database ingestion. NVD, GitHub Advisory Database, and vendor advisory pages are scraped into training corpora.
  • Writeup availability. Bug bounty writeups, CTF solutions, and security research blogs are public and well-represented.
  • Code repository inclusion. Public vulnerable-by-design applications (e.g. OWASP Juice Shop, DVWA, WebGoat) are in training data.

Any benchmark constructed from public data has a contamination floor: some fraction of its cases was already in the training corpus before the benchmark existed.
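One way to estimate that floor is an n-gram overlap check between each benchmark item and the public document it was derived from. A minimal sketch in Python; the function names are ours, and the 13-gram window is one heuristic used in published contamination analyses, not a fixed standard:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; a 13-gram window is one common choice
    in published contamination analyses."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, public_doc: str) -> float:
    """Fraction of the item's n-grams that also appear in a public
    document, e.g. the advisory the case was derived from."""
    item = ngrams(benchmark_item)
    return len(item & ngrams(public_doc)) / len(item) if item else 0.0
```

A high ratio doesn't prove the model memorised the item, but it tells you the item was available to be memorised.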

How to detect it

Three techniques:

  • Temporal cutoff. Use CVEs disclosed after the model's training cutoff. If the model can't have seen them, the result is uncontaminated.
  • Private datasets. Construct benchmarks from private code and findings not in public training data.
  • Paraphrase variance. Compare the model's accuracy on the original description vs a paraphrase. Large gaps signal memorisation rather than capability. (The temporal and paraphrase checks are sketched after this list.)
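A minimal sketch of the temporal and paraphrase checks, assuming the model is exposed as a plain prompt-in, answer-out callable. EvalCase, contamination_checks, and the 0.10 gap threshold are illustrative, not part of any standard harness:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass
class EvalCase:
    cve_id: str
    disclosed: date
    original: str    # advisory text as published
    paraphrase: str  # same facts, reworded
    answer: str      # expected answer, e.g. the vulnerability class

Model = Callable[[str], str]  # prompt in, answer out

def accuracy(model: Model, cases: list[EvalCase], *, paraphrased: bool) -> float:
    """Fraction of cases answered correctly."""
    if not cases:
        return 0.0
    hits = sum(
        model(c.paraphrase if paraphrased else c.original).strip().lower()
        == c.answer.lower()
        for c in cases
    )
    return hits / len(cases)

def contamination_checks(model: Model, cases: list[EvalCase],
                         training_cutoff: date,
                         gap_threshold: float = 0.10) -> dict:
    # Temporal cutoff: only CVEs disclosed after training ended are
    # guaranteed unseen.
    post_cutoff = [c for c in cases if c.disclosed > training_cutoff]

    # Paraphrase variance: a large accuracy drop on reworded
    # descriptions suggests memorised advisory text, not capability.
    gap = (accuracy(model, cases, paraphrased=False)
           - accuracy(model, cases, paraphrased=True))

    return {
        "post_cutoff_accuracy": accuracy(model, post_cutoff, paraphrased=False),
        "paraphrase_gap": gap,
        "memorisation_suspected": gap > gap_threshold,
    }
```

The private-dataset check needs no code: it is a sourcing decision, not a measurement.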

Griffin AI uses all three approaches in its eval harness.

How to mitigate it

Four practices:

  • Keep some eval sets private. Publish the methodology and the numbers, not the dataset.
  • Use temporal separation rigorously. Each model's eval uses only data disclosed after that model's own training cutoff (see the sketch after this list).
  • Include synthetic vulnerabilities. Cases not derived from any public source.
  • Disclose contamination risk explicitly. Methodology documents should acknowledge it.
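For the temporal separation practice, a minimal sketch of what "rigorously" means in code: each model is keyed to its own cutoff, and the harness refuses to report a number when no post-cutoff cases exist. The model names and dates here are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkCase:
    cve_id: str
    disclosed: date

# Hypothetical cutoffs; real values come from each vendor's model card.
TRAINING_CUTOFFS = {
    "model-a": date(2024, 4, 1),
    "model-b": date(2024, 10, 1),
}

def temporally_separated(model_name: str,
                         cases: list[BenchmarkCase]) -> list[BenchmarkCase]:
    """Keep only cases disclosed after this model's training cutoff,
    so every scored case is guaranteed unseen by this model."""
    cutoff = TRAINING_CUTOFFS[model_name]
    selected = [c for c in cases if c.disclosed > cutoff]
    if not selected:
        # Fail loudly rather than silently fall back to
        # possibly-contaminated data.
        raise ValueError(f"no post-cutoff cases for {model_name}")
    return selected
```

The consequence is that eval sets shrink as models retrain on newer data, which is exactly why the benchmark has to be refreshed continuously.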

What customers should evaluate

Three questions:

  1. What is the contamination mitigation strategy for the eval benchmarks quoted?
  2. Are results stable across paraphrased variants?
  3. Is post-training-cutoff data included?

How Safeguard Helps

Safeguard's Griffin AI eval harness includes post-training-cutoff data, private datasets, and paraphrase variance testing. Published numbers are as uncontaminated as those methods allow. For customers whose procurement evaluation includes benchmark reliability, this methodological discipline is the differentiator.
