AI Security

Benchmark Contamination Concerns In Security Evals

When the test set is in the training set, the benchmark is broken. Security eval contamination is widespread, and the mitigations are specific.

Nayan Dey
Senior Security Engineer
2 min read

When a model's training data includes the test cases used to benchmark it, the benchmark measures memorisation rather than capability. Contamination is widespread in general ML benchmarks; for AI-for-security benchmarks specifically, the problem is worse because public CVE databases, advisory corpora, and security research are heavily represented in training data. A model asked about a well-known CVE has likely seen the full advisory during training. The benchmark reports accuracy that doesn't generalise.

How contamination happens in security evals

Three common paths:

  • CVE database ingestion. NVD, GitHub Advisory Database, and vendor advisory pages are scraped into training corpora.
  • Writeup availability. Bug bounty writeups, CTF solutions, and security research blogs are public and well-represented.
  • Code repository inclusion. Public vulnerable-by-design applications (e.g. OWASP Juice Shop, DVWA, WebGoat) are in training data.

Any benchmark constructed from public data has a contamination floor: some fraction of its cases was already in the training corpus before the benchmark existed.
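One way to estimate that floor is an n-gram overlap check between each benchmark item and the public document it was derived from. A minimal sketch in Python; the function names are ours, and the 13-gram window is one heuristic used in published contamination analyses, not a fixed standard:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; a 13-gram window is one common choice
    in published contamination analyses."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, public_doc: str) -> float:
    """Fraction of the item's n-grams that also appear in a public
    document, e.g. the advisory the case was derived from."""
    item = ngrams(benchmark_item)
    return len(item & ngrams(public_doc)) / len(item) if item else 0.0
```

A high ratio doesn't prove the model memorised the item, but it tells you the item was available to be memorised.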

How to detect it

Three techniques:

  • Temporal cutoff. Use CVEs disclosed after the model's training cutoff. If the model can't have seen them, the result is uncontaminated.
  • Private datasets. Construct benchmarks from private code and findings not in public training data.
  • Paraphrase variance. Compare the model's accuracy on the original description vs a paraphrase. Large gaps signal memorisation rather than capability. (The temporal and paraphrase checks are sketched after this list.)
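A minimal sketch of the temporal and paraphrase checks, assuming the model is exposed as a plain prompt-in, answer-out callable. EvalCase, contamination_checks, and the 0.10 gap threshold are illustrative, not part of any standard harness:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass
class EvalCase:
    cve_id: str
    disclosed: date
    original: str    # advisory text as published
    paraphrase: str  # same facts, reworded
    answer: str      # expected answer, e.g. the vulnerability class

Model = Callable[[str], str]  # prompt in, answer out

def accuracy(model: Model, cases: list[EvalCase], *, paraphrased: bool) -> float:
    """Fraction of cases answered correctly."""
    if not cases:
        return 0.0
    hits = sum(
        model(c.paraphrase if paraphrased else c.original).strip().lower()
        == c.answer.lower()
        for c in cases
    )
    return hits / len(cases)

def contamination_checks(model: Model, cases: list[EvalCase],
                         training_cutoff: date,
                         gap_threshold: float = 0.10) -> dict:
    # Temporal cutoff: only CVEs disclosed after training ended are
    # guaranteed unseen.
    post_cutoff = [c for c in cases if c.disclosed > training_cutoff]

    # Paraphrase variance: a large accuracy drop on reworded
    # descriptions suggests memorised advisory text, not capability.
    gap = (accuracy(model, cases, paraphrased=False)
           - accuracy(model, cases, paraphrased=True))

    return {
        "post_cutoff_accuracy": accuracy(model, post_cutoff, paraphrased=False),
        "paraphrase_gap": gap,
        "memorisation_suspected": gap > gap_threshold,
    }
```

The private-dataset check needs no code: it is a sourcing decision, not a measurement.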

Griffin AI uses all three approaches in its eval harness.

How to mitigate it

Four practices:

  • Keep some eval sets private. Publish the methodology and the numbers, not the dataset.
  • Use temporal separation rigorously. Each model's eval uses only data disclosed after that model's own training cutoff (see the sketch after this list).
  • Include synthetic vulnerabilities. Cases not derived from any public source.
  • Disclose contamination risk explicitly. Methodology documents should acknowledge it.
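For the temporal separation practice, a minimal sketch of what "rigorously" means in code: each model is keyed to its own cutoff, and the harness refuses to report a number when no post-cutoff cases exist. The model names and dates here are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkCase:
    cve_id: str
    disclosed: date

# Hypothetical cutoffs; real values come from each vendor's model card.
TRAINING_CUTOFFS = {
    "model-a": date(2024, 4, 1),
    "model-b": date(2024, 10, 1),
}

def temporally_separated(model_name: str,
                         cases: list[BenchmarkCase]) -> list[BenchmarkCase]:
    """Keep only cases disclosed after this model's training cutoff,
    so every scored case is guaranteed unseen by this model."""
    cutoff = TRAINING_CUTOFFS[model_name]
    selected = [c for c in cases if c.disclosed > cutoff]
    if not selected:
        # Fail loudly rather than silently fall back to
        # possibly-contaminated data.
        raise ValueError(f"no post-cutoff cases for {model_name}")
    return selected
```

The consequence is that eval sets shrink as models retrain on newer data, which is exactly why the benchmark has to be refreshed continuously.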

What customers should evaluate

Three questions:

  1. What is the contamination mitigation strategy for the eval benchmarks quoted?
  2. Are results stable across paraphrased variants?
  3. Is post-training-cutoff data included?

How Safeguard Helps

Safeguard's Griffin AI eval harness includes post-training-cutoff data, private datasets, and paraphrase variance testing. Published numbers are as uncontaminated as those methods allow. For customers whose procurement evaluation includes benchmark reliability, this methodological discipline is the differentiator.
