The Reproducibility Crisis In AI Security Evals

ML research has a reproducibility crisis. AI security evaluation inherits it. Vendors publishing numbers that can't be reproduced are the norm — not the exception.

Shadab Khan
Security Engineer

Academic ML has a well-documented reproducibility crisis. AI-for-security evaluation inherits the problem and adds its own: vendors quote accuracy numbers that customers cannot reproduce, dataset details stay private, and methodology documents read as marketing. The pattern is common enough to call a crisis in the narrow sense that independent verification of claims is rare. Customers who have adapted to this reality rely on a small set of specific evaluation practices.

What reproducibility requires

Five components:

  • Dataset with provenance.
  • Methodology document.
  • Version pinning of model and tooling.
  • Scoring code that can be rerun.
  • Variance reporting across runs.

Missing any of these breaks reproducibility.
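
One way to make the five-part checklist concrete is to record each component as a yes/no field and list whatever is absent. The sketch below is illustrative only: the `EvalDisclosure` class and its field names are hypothetical, chosen to mirror the list above, and do not come from any vendor's tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class EvalDisclosure:
    """Hypothetical record of what a published benchmark result includes."""
    dataset_with_provenance: bool
    methodology_document: bool
    version_pinning: bool
    rerunnable_scoring_code: bool
    variance_reporting: bool


def missing_components(disclosure: EvalDisclosure) -> list[str]:
    """Return the names of the components the disclosure does not provide."""
    return [f.name for f in fields(disclosure) if not getattr(disclosure, f.name)]


def is_reproducible(disclosure: EvalDisclosure) -> bool:
    """All five components must be present; one gap is enough to fail."""
    return not missing_components(disclosure)
```

Under this framing, a disclosure with a dataset, methodology, and pinned versions but no scoring code or variance figures still fails the check, which matches the all-or-nothing rule stated above.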

What vendor numbers usually provide

Measured against the same five components, the typical vendor disclosure looks like this:

  • Dataset: usually private.
  • Methodology: often partial.
  • Version pinning: sometimes missing.
  • Scoring code: rarely shared.
  • Variance: frequently absent.

The distance between what reproducibility requires and what vendors typically disclose is the reproducibility gap.

What to demand

Three asks for any AI-for-security vendor:

  1. Dataset description with construction methodology.
  2. Scoring code or pseudocode.
  3. Variance numbers across repeated runs.

If the vendor can't provide all three, the numbers they quote are less reliable than they appear.
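
The third ask, variance across repeated runs, is easy to check once per-run scores are available. The following is a minimal sketch, assuming each run yields a single accuracy-style score between 0 and 1. The function name `summarize_runs` and the output keys are hypothetical, not part of any vendor's scoring code.

```python
import statistics


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Summarize scores from repeated runs of the same benchmark.

    Assumes one score per run. Uses the sample standard deviation,
    which needs at least two runs to be defined.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "runs": float(len(scores)),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(summarize_runs([0.90, 0.92, 0.94]))
```

A single quoted figure with no spread gives a customer no way to tell whether a difference between two vendors is larger than the run-to-run noise of either one.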

How Safeguard Helps

Safeguard publishes all five reproducibility components for its eval benchmarks. Customers can rerun key benchmarks in their own environment. For organisations whose procurement process requires defensible benchmark numbers, reproducibility is the baseline — not the bonus.
