The Reproducibility Crisis In AI Security Evals

ML research has a reproducibility crisis. AI security evaluation inherits it. Vendors publishing numbers that can't be reproduced are the norm — not the exception.

Shadab Khan
Security Engineer
2 min read

Academic ML has a well-documented reproducibility crisis. AI-for-security evaluation inherits the problem and adds its own. Vendors quote accuracy numbers that customers cannot reproduce. Dataset details are private. Methodology documents read like marketing. The pattern is common enough to call a crisis in the narrow sense: the ability to independently verify a vendor's claims is rare. Customers who have adapted to this reality follow a specific set of evaluation practices.

What reproducibility requires

Five components:

  • Dataset with provenance.
  • Methodology document.
  • Version pinning of model and tooling.
  • Scoring code that can be rerun.
  • Variance reporting across runs.

Missing any of these breaks reproducibility.
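To make the five components concrete, here is a minimal sketch of what a reproducible eval package could look like: a manifest pinning dataset provenance and versions, scoring code anyone can rerun, and a variance summary across runs. Every name, field, and version string below is hypothetical, not an established format.

```python
import statistics

# Hypothetical manifest tying the five components together.
# Field names and versions are illustrative only.
MANIFEST = {
    "dataset": {
        "name": "eval-set-v1",                    # frozen dataset artifact
        "sha256": "<hash of the dataset file>",   # provenance check
        "construction": "docs/methodology.md",    # how samples were built
    },
    "versions": {"model": "detector-2.1.0", "scorer": "1.0.2"},  # pinned
}

def score_run(predictions, labels):
    """Scoring code a third party can rerun: plain accuracy."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def report(run_scores):
    """Variance reporting: summarise repeated runs, not one point estimate."""
    return {
        "mean": statistics.mean(run_scores),
        "stdev": statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0,
        "n_runs": len(run_scores),
    }
```

With three repeated runs scoring 0.80, 0.90 and 1.00, `report` returns a mean of 0.90 with a standard deviation of 0.10, a spread that a single headline number hides.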

What vendor numbers usually provide

Scored against those five components, a typical vendor eval looks like this:

  • Dataset: usually private.
  • Methodology: often partial.
  • Version pinning: sometimes missing.
  • Scoring code: rarely shared.
  • Variance: frequently absent.

The distance between those five requirements and what vendors actually ship is the reproducibility gap.

What to demand

Three asks for any AI-for-security vendor:

  1. Dataset description with construction methodology.
  2. Scoring code or pseudocode.
  3. Variance numbers across repeated runs.

If the vendor can't provide all three, the numbers they quote are less reliable than they appear.
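The first ask becomes checkable in practice when the vendor ships a frozen dataset alongside a published hash: provenance can then be verified byte-for-byte before any rerun. A minimal sketch, where the file path and expected hash stand in for whatever the vendor actually publishes:

```python
import hashlib

def verify_dataset(path: str, expected_sha256: str) -> bool:
    """Confirm the eval dataset is byte-identical to the one the vendor scored."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large dataset files don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

A mismatch means the dataset on hand is not the one the quoted numbers were produced on, and any rerun is measuring something else.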

How Safeguard Helps

Safeguard publishes all five reproducibility components for its eval benchmarks. Customers can rerun key benchmarks in their own environment. For organisations whose procurement process requires defensible benchmark numbers, reproducibility is the baseline — not the bonus.
