Benchmark Contamination Concerns In Security Evals

When the test set is in the training set, the benchmark is broken. Security eval contamination is widespread and the mitigations are specific.

Nayan Dey
Senior Security Engineer

When a model's training data includes the test cases used to benchmark it, the benchmark measures memorisation rather than capability. Contamination is widespread in general ML benchmarks. For AI-for-security benchmarks the problem is worse, because public CVE databases, advisory corpora, and security research are heavily represented in training data. A model asked about a well-known CVE has likely seen the full advisory during training, so the accuracy the benchmark reports does not generalise to vulnerabilities the model has never seen.

How contamination happens in security evals

Three common paths:

  • CVE database ingestion. NVD, GitHub Advisory Database, and vendor advisory pages are scraped into training corpora.
  • Writeup availability. Bug bounty writeups, CTF solutions, and security research blogs are public and well-represented.
  • Code repository inclusion. Public vulnerable-by-design applications are in training data.

Any benchmark constructed from public data therefore has a contamination floor: some share of its test cases must be assumed to be in the training data already.

How to detect it

Three techniques:

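  • Temporal cutoff. Use CVEs disclosed after the model's training cutoff. A model that cannot have seen the advisory cannot have memorised it, provided the cutoff date is reliably known and the vulnerability was not discussed publicly (for example in a fix commit or issue thread) before disclosure.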
  • Private datasets. Construct benchmarks from private code and findings not in public training data.
  • Paraphrase variance. Compare the model's accuracy on the original description vs a paraphrase. Large gaps signal memorisation.

Griffin AI uses all three approaches in its eval harness.
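
The paraphrase variance check can be sketched in a few lines of Python. This is a minimal illustration, not Griffin AI's implementation. The PairedResult record and the paraphrase_gap and memorisation_suspects functions are hypothetical, as is the assumption that each case has already been graded once on its original wording and once on a paraphrase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One benchmark case graded on two wordings (hypothetical record)."""
    case_id: str
    correct_on_original: bool
    correct_on_paraphrase: bool


def paraphrase_gap(results):
    """Return (accuracy on originals, accuracy on paraphrases, gap)."""
    if not results:
        raise ValueError("at least one paired result is required")
    n = len(results)
    acc_original = sum(r.correct_on_original for r in results) / n
    acc_paraphrase = sum(r.correct_on_paraphrase for r in results) / n
    return acc_original, acc_paraphrase, acc_original - acc_paraphrase


def memorisation_suspects(results):
    """Cases answered correctly as published but missed once reworded."""
    return [
        r.case_id
        for r in results
        if r.correct_on_original and not r.correct_on_paraphrase
    ]


# Illustrative data only; the case identifiers are made up.
sample = [
    PairedResult("case-a", True, True),
    PairedResult("case-b", True, False),
    PairedResult("case-c", True, True),
    PairedResult("case-d", True, False),
]

acc_original, acc_paraphrase, gap = paraphrase_gap(sample)
print(f"original={acc_original:.2f} paraphrase={acc_paraphrase:.2f} gap={gap:.2f}")
print("suspects:", memorisation_suspects(sample))
```

A gap near zero is consistent with the model reasoning about the vulnerability rather than recalling its description. How large a gap counts as a signal depends on benchmark size and grading noise, so the threshold is left to the reader.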

How to mitigate it

Four practices:

  • Keep some eval sets private. Don't publish the dataset, only the methodology and numbers.
  • Use temporal separation rigorously. Each model's eval uses only post-training-cutoff data.
  • Include synthetic vulnerabilities. Cases not derived from any public source.
  • Disclose contamination risk explicitly. Methodology documents should acknowledge it.
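
Per-model temporal separation can be sketched as a date filter. The following is an illustration under stated assumptions rather than a description of any particular harness. EvalCase, post_cutoff_cases, the case identifiers, and the dates are hypothetical. The 90-day buffer is an arbitrary safety margin for imprecise cutoff dates and for details that circulate before formal disclosure.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EvalCase:
    """A benchmark case with its public disclosure date (hypothetical record)."""
    case_id: str
    disclosed: date


def post_cutoff_cases(cases, training_cutoff, buffer_days=90):
    """Keep only cases disclosed after the cutoff plus a safety buffer.

    buffer_days is an assumed margin, not a standard value.
    """
    if buffer_days < 0:
        raise ValueError("buffer_days must be non-negative")
    earliest_allowed = training_cutoff + timedelta(days=buffer_days)
    return [c for c in cases if c.disclosed > earliest_allowed]


# Illustrative data only.
cases = [
    EvalCase("case-old", date(2024, 5, 1)),
    EvalCase("case-borderline", date(2024, 8, 15)),
    EvalCase("case-new", date(2024, 12, 1)),
]

# Each model under test gets its own filtered set, keyed by its own cutoff.
model_cutoffs = {
    "model-x": date(2024, 6, 30),
    "model-y": date(2024, 3, 31),
}
eval_sets = {
    name: post_cutoff_cases(cases, cutoff)
    for name, cutoff in model_cutoffs.items()
}
for name, selected in eval_sets.items():
    print(name, [c.case_id for c in selected])
```

Because the filter is applied per model, two models with different cutoffs are scored on different case sets, so their numbers are only directly comparable on the cases both sets share.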

What customers should evaluate

Three questions:

  1. What is the contamination mitigation strategy for the eval benchmarks quoted?
  2. Are results stable across paraphrased variants?
  3. Is post-training-cutoff data included?

How Safeguard Helps

Safeguard's Griffin AI eval harness includes post-training-cutoff data, private datasets, and paraphrase variance testing, which together reduce contamination in published numbers as far as is practical. For customers whose procurement evaluation weighs benchmark reliability, that methodological discipline is the differentiator.
