A vendor claim of "98% accuracy" with no published dataset, no methodology, and no way for a customer to rerun the benchmark is, technically speaking, marketing. A benchmark with a published dataset, documented methodology, and a reproducibility command an evaluator can run on their own infrastructure is evidence. The space between the two is where most AI-for-security vendor claims live, and the gap matters more in 2026 than it did when AI tooling was new and customers were willing to take vendor claims at face value. Griffin AI publishes reproducible benchmarks; Mythos-class general-purpose tools largely do not, and the procurement conversation is sharper as a result.
What reproducibility actually requires
Five components, all of which need to be present (a structural sketch follows the list):
- A published dataset with provenance. Where the test cases came from, how they were curated, what the ground truth is.
- A documented methodology. What metric is being computed, how scoring works, and how edge cases are handled.
- A reproducibility command. A specific way for an evaluator to run the benchmark on their own infrastructure with their own access.
- Version pinning. What model version, engine version, and dataset version produced the published numbers.
- Confidence intervals or variance. A single number is a snapshot. A range is a real measurement.
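To make the checklist concrete, here is a minimal sketch of what a complete publication record might look like as a data structure. The field names and the `BenchmarkPublication` type are illustrative assumptions, not any vendor's actual schema:

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple

# Hypothetical record of one published benchmark result.
# Field names are illustrative, not any vendor's real schema.
@dataclass
class BenchmarkPublication:
    dataset_uri: Optional[str] = None       # published dataset plus provenance notes
    methodology_doc: Optional[str] = None   # metric definitions, scoring, edge cases
    repro_command: Optional[str] = None     # command an evaluator runs on their infra
    model_version: Optional[str] = None     # pinned model behind the numbers
    engine_version: Optional[str] = None    # pinned engine version
    dataset_version: Optional[str] = None   # pinned dataset version
    reported_range: Optional[Tuple[float, float]] = None  # range, not a point estimate

def is_complete(pub: BenchmarkPublication) -> bool:
    """True only when every component of the publication is present."""
    return all(getattr(pub, f.name) is not None for f in fields(pub))
```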
Vendors who provide all five are running a real benchmark program. Vendors who provide some-but-not-all are doing benchmark theatre.
Why benchmark reproducibility matters now
Three reasons specific to 2026:
- Model drift is real. A benchmark from twelve months ago may not reflect the model that is running today. Reproducibility lets evaluators rerun on the current model.
- Customer environments differ from vendor benchmarks. 98% accuracy on the vendor's curated dataset can drop to 60% on the customer's actual codebase. Reproducibility lets the customer verify on their own data.
- Audit and regulatory pressure is increasing. EU AI Act enforcement and similar regimes increasingly require documented evaluation methodology. Vendors who cannot reproduce their own claims will struggle with this.
A vendor that publishes reproducible benchmarks is signalling that they have the testing infrastructure to defend their claims. A vendor that doesn't is asking for trust without evidence.
Where Griffin AI sits
Five published benchmark families:
- Exploit hypothesis accuracy. 400 reachable taint paths drawn from real CVEs with known exploit conditions. Methodology, scoring, and dataset provenance documented. Published 81% full-agreement, 94% partial-credit accuracy.
- Remediation PR correctness. 250 dependency vulnerability scenarios with human-verified fixes. Pass criterion: compile + existing tests pass. Published 73% pass-unchanged, 87% pass-with-minor-edits.
- Advisory summarisation. 500 real security advisories with human-written summaries. Scored on factual accuracy and embedding-space similarity. Published 96% factual accuracy, 0.89 similarity.
- Cross-finding correlation. 300 scenarios with known correlation ground truth. Published 88% precision, 82% recall.
- Adversarial resistance. 150 prompts including jailbreak attempts, leakage probes, scope-violation attempts. Published 100% canary-leak resistance, 98% jailbreak refusal.
Each family ships with the dataset (where licensable), the scoring script, the model version, and the engine version. Customers can rerun the benchmarks against their own environment.
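As a rough illustration of the difference between the full-agreement and partial-credit metrics in the first family, here is a sketch of how such scoring might work. This is an assumed condition-matching scheme, not Griffin AI's actual scoring script:

```python
from typing import Dict, List, Tuple

def score_case(predicted: List[str], expected: List[str]) -> float:
    """Fraction of the known exploit conditions the hypothesis reproduced."""
    if not expected:
        return 1.0
    return sum(1 for cond in expected if cond in predicted) / len(expected)

def aggregate(cases: List[Tuple[List[str], List[str]]]) -> Dict[str, float]:
    """Full agreement counts only perfect cases; partial credit averages fractions."""
    scores = [score_case(pred, exp) for pred, exp in cases]
    return {
        "full_agreement": sum(s == 1.0 for s in scores) / len(scores),
        "partial_credit": sum(scores) / len(scores),
    }
```

Under this kind of scheme, a hypothesis that identifies two of three exploit conditions scores zero toward full agreement but two-thirds toward partial credit, which is why the two published numbers diverge.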
What Mythos-class vendors typically publish
The pattern across general-purpose AI-for-security tools varies but rarely matches the discipline above. Common gaps:
- Marketing-grade numbers without methodology. "Our tool finds 95% of vulnerabilities" with no definition of what counts as a finding or which vulnerabilities are in the test set.
- Cherry-picked customer testimonials. N=1 or N=3 case studies that don't generalise.
- Demo-environment claims. Numbers from a controlled demo dataset that doesn't resemble production codebases.
- Unpinned model versions. Claims based on a specific model version that has since been deprecated.
- No variance reporting. A single number with no indication of run-to-run variability.
This is not an indictment of any specific vendor; it is a description of an industry-wide pattern. The vendors who break the pattern are the ones with mature evaluation programs.
How a customer can run their own benchmark
Three steps that any procurement evaluation should include:
Step 1 — Establish ground truth. Pick 50–100 findings from the customer's existing security backlog. Manually classify each as actionable or not. This is the customer's reference dataset.
Step 2 — Run candidate platforms. Send the same set through each vendor's platform. Collect the platform's output for each.
Step 3 — Score against ground truth. Compute precision (of findings the platform marked actionable, what percentage actually were?) and recall (of findings the customer classified as actionable, what percentage did the platform surface?). Compute cost per actionable finding.
This produces customer-specific numbers that any vendor's published claims have to defend against.
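A minimal sketch of the Step 3 scoring, assuming findings are keyed by ID and labelled actionable or not — the function and field names here are hypothetical, chosen only to make the arithmetic explicit:

```python
from typing import Dict

def score_platform(ground_truth: Dict[str, bool],
                   platform_output: Dict[str, bool],
                   run_cost: float) -> Dict[str, float]:
    """Score one platform against the customer's hand-labelled reference set.

    Keys are finding IDs; values mark a finding actionable (True) or not.
    """
    flagged = {fid for fid, v in platform_output.items() if v}
    actionable = {fid for fid, v in ground_truth.items() if v}
    true_positives = flagged & actionable
    return {
        "precision": len(true_positives) / len(flagged) if flagged else 0.0,
        "recall": len(true_positives) / len(actionable) if actionable else 0.0,
        "cost_per_actionable": (run_cost / len(true_positives)
                                if true_positives else float("inf")),
    }
```

Running this once per candidate platform over the same 50–100 findings yields directly comparable numbers, including a cost-per-actionable-finding figure that vendor marketing rarely publishes.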
What reproducibility signals about the rest of the platform
Reproducibility is a leading indicator of engineering discipline. Vendors who run reproducible benchmark programs typically also have:
- Stable APIs (because they run regression tests on them).
- Documented degraded-mode behaviour (because they test failure modes).
- Honest support for edge cases (because they hit them in benchmarks before customers do).
The inverse is also true. Vendors who publish unreproducible numbers tend to also have unstable APIs, undocumented failure modes, and surprise edge cases. The discipline correlates.
What to evaluate
Three concrete checks during procurement:
- Ask for the most recent benchmark dataset, methodology, and reproducibility command. If any are missing, the published numbers are aspirational.
- Ask whether the published numbers will hold under a model upgrade in the next six months. The answer should reference the eval harness, not "we'll see."
- Run your own benchmark on a representative subset of your codebase and compare numbers.
The answers separate vendors with evidence from vendors with marketing.
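For the model-upgrade question, the kind of eval-harness answer worth hearing can be sketched as a regression gate: rerun the pinned benchmarks against the candidate model and block the upgrade on any regression beyond tolerance. The family names and 2-point threshold below are illustrative assumptions:

```python
def gate_model_upgrade(baseline: dict, candidate: dict,
                       max_regression: float = 0.02) -> bool:
    """Block the upgrade if any benchmark family regresses past tolerance."""
    return all(
        candidate.get(family, 0.0) >= score - max_regression
        for family, score in baseline.items()
    )

# e.g. gate_model_upgrade({"exploit_accuracy": 0.81, "pr_correctness": 0.73},
#                         {"exploit_accuracy": 0.82, "pr_correctness": 0.70})
# -> False: pr_correctness dropped more than the 2-point tolerance
```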
How Safeguard Helps
Safeguard's published benchmarks are reproducible by design — dataset provenance, scoring script, model and engine versions, and confidence intervals are all part of the publication. Customers can rerun the benchmarks against their own environment as part of evaluation. The eval harness that gates internal Griffin AI releases is the same harness that produces the published numbers, which is why those numbers don't drift between releases. For procurement processes that need defensible evidence rather than marketing-grade claims, Safeguard's discipline is a differentiator that pays back as long as the contract runs.