Safeguard Griffin AI: Eval Benchmarks Published

Griffin AI's evaluation harness results, published for the first time: benchmark methodology, comparison against baselines, and what the numbers mean for production use.

Shadab Khan
Security Engineer

Safeguard's Griffin AI reasoning layer has been running in production across customer environments for over a year, and in April 2026 we are publishing the first public evaluation benchmark results for it. The publication is intentional. The AI security space has too many capability claims unsupported by measurable benchmarks, and we think vendors asking customers to trust AI-augmented security analysis should be willing to be measured on the same basis as anyone else. The numbers below are from the same eval harness that gates every Griffin AI release internally, run against publicly replicable test sets where possible and against curated internal sets where necessary. This post walks through the methodology, the results, and — importantly — what the numbers do and do not mean for production use.

What does Griffin AI actually do?

Griffin AI is the reasoning layer inside the Safeguard platform that handles tasks the deterministic engine cannot: exploit condition hypothesis on reachability paths, remediation PR drafting, advisory summarization, policy explanation, and cross-finding correlation. It is not a standalone security model — it operates on the structured outputs the engine produces (call graphs, taint paths, SBOM data) rather than on raw code. That architectural choice is what makes the eval methodology feasible; we can construct inputs in a structured way and grade outputs against known correct answers.

What does the eval harness look like?

The harness runs five eval families, each with its own dataset:

  1. Exploit hypothesis accuracy — 400 reachability paths drawn from real CVEs, with known exploit conditions. Griffin is asked to predict exploit conditions; output is graded against ground truth.
  2. Remediation PR correctness — 250 dependency vulnerability scenarios, with human-verified correct fixes. Griffin is asked to draft a fix PR; output is graded for compilation, test pass, and fix effectiveness.
  3. Advisory summarization — 500 real security advisories with human-written summaries. Griffin summarizes; output is graded for factual accuracy and semantic similarity to ground truth.
  4. Cross-finding correlation — 300 scenarios with multiple findings where the correct correlation is known. Griffin is asked to identify related findings; output is graded against ground truth.
  5. Adversarial resistance — 150 prompts including jailbreak attempts, leakage probes, and scope-violation attempts. Output is graded on refusal behavior and non-leakage.
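For readers who think in data structures, the five families above can be written down as a small registry. This is only an illustrative sketch: the `EvalFamily` class, the field names, and the identifiers are hypothetical and are not Safeguard's actual harness code. Only the family names, dataset sizes, and grading criteria come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalFamily:
    """One eval family: a name, a dataset size, and what the output is graded on."""
    name: str                    # hypothetical identifier, not an official name
    dataset_size: int            # number of cases, as stated in the post
    graded_on: tuple[str, ...]   # grading criteria, as stated in the post


EVAL_FAMILIES = (
    EvalFamily("exploit_hypothesis", 400,
               ("agreement with known exploit conditions",)),
    EvalFamily("remediation_pr", 250,
               ("compilation", "test pass", "fix effectiveness")),
    EvalFamily("advisory_summarization", 500,
               ("factual accuracy", "semantic similarity")),
    EvalFamily("cross_finding_correlation", 300,
               ("agreement with known correlations",)),
    EvalFamily("adversarial_resistance", 150,
               ("refusal behavior", "non-leakage")),
)


def total_cases(families: tuple[EvalFamily, ...]) -> int:
    """Sum the dataset sizes across all families."""
    return sum(family.dataset_size for family in families)


if __name__ == "__main__":
    for family in EVAL_FAMILIES:
        print(f"{family.name}: {family.dataset_size} cases")
    print("total:", total_cases(EVAL_FAMILIES))
```

Taken together, the five datasets described above add up to 1,600 graded cases per run.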

Each eval family runs on every Griffin AI release. Regressions of more than one standard deviation block the release.
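A release gate of this kind could look roughly like the sketch below. The post does not say how the standard deviation is computed or what the candidate is compared against, so this example makes two assumptions: the spread is the sample standard deviation of a family's scores over recent releases, and the comparison point is the most recent released score. The function name and parameters are hypothetical.

```python
from statistics import stdev


def blocks_release(history: list[float], candidate: float,
                   threshold_sd: float = 1.0) -> bool:
    """Return True if the candidate score regressed by more than
    `threshold_sd` standard deviations relative to the latest release.

    Assumptions (not from the post): `history` holds one family's scores
    for prior releases, oldest first; the spread is the sample standard
    deviation of that history; the baseline is the last entry.
    """
    if len(history) < 2:
        raise ValueError("need at least two prior scores to estimate spread")
    baseline = history[-1]
    spread = stdev(history)
    drop = baseline - candidate  # positive means the candidate got worse
    return drop > threshold_sd * spread


if __name__ == "__main__":
    prior = [0.79, 0.80, 0.81, 0.80]    # made-up scores for illustration
    print(blocks_release(prior, 0.78))   # large drop
    print(blocks_release(prior, 0.795))  # small drop, within one deviation
```

Under these assumptions an improvement never blocks a release, because only a positive drop is compared against the threshold.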

What are the published numbers?

On the April 2026 release:

  • Exploit hypothesis accuracy: 81% full agreement with ground truth; 94% when partial credit is given for a correct CWE classification even though the predicted exploit condition differs.
  • Remediation PR correctness: 73% of generated PRs compile and pass existing tests unchanged; an additional 14% compile and pass with minor edits.
  • Advisory summarization: 0.89 semantic similarity (embedding-space) to human-written summaries; 96% factual accuracy on statement-level grading.
  • Cross-finding correlation: 88% precision, 82% recall on the 300-scenario set.
  • Adversarial resistance: 100% on the canary-leakage subset, 98% on jailbreak refusal, 100% on scope-violation refusal.
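To make two of these metrics concrete, here is a minimal sketch of how full versus partial agreement and precision versus recall could be computed. It is not the harness's actual grader. It assumes exact string matching on exploit conditions (a real grader is likely more lenient) and assumes correlations are represented as unordered pairs of finding IDs. All class, function, and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExploitCase:
    predicted_condition: str
    predicted_cwe: str
    true_condition: str
    true_cwe: str


def exploit_agreement(cases: list[ExploitCase]) -> tuple[float, float]:
    """Return (full agreement rate, rate including partial credit).

    Full: the predicted exploit condition matches ground truth.
    Partial: the condition differs but the CWE classification is correct.
    """
    if not cases:
        raise ValueError("no cases to grade")
    full = sum(c.predicted_condition == c.true_condition for c in cases)
    partial = sum(
        c.predicted_condition != c.true_condition
        and c.predicted_cwe == c.true_cwe
        for c in cases
    )
    return full / len(cases), (full + partial) / len(cases)


def precision_recall(predicted: set[frozenset[str]],
                     expected: set[frozenset[str]]) -> tuple[float, float]:
    """Precision and recall over sets of correlated finding pairs."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def pair(a: str, b: str) -> frozenset[str]:
    """Unordered pair of finding IDs, so (a, b) equals (b, a)."""
    return frozenset((a, b))


if __name__ == "__main__":
    predicted = {pair("F1", "F2"), pair("F3", "F2"), pair("F4", "F5")}
    expected = {pair("F1", "F2"), pair("F2", "F3"),
                pair("F3", "F6"), pair("F7", "F8")}
    print(precision_recall(predicted, expected))
```

Read this way, the gap between the two exploit hypothesis figures is the share of cases where the vulnerability class was identified correctly but the specific exploit condition was not.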

Numbers will continue to evolve release over release. These are the April 2026 release figures and are posted as the baseline against which future changes will be measured.

How do these compare to baselines?

Two baselines are worth comparing:

Pure pattern-based tooling (no LLM) on the exploit hypothesis task: roughly 35–45% accuracy. Pattern scanners do not attempt exploit condition prediction at all, so they earn credit only when a matching CVE pattern already exists. This is the "lower bound" comparison.

Frontier LLM direct application (GPT-4, Claude, etc. asked to perform the same task without engine-produced structured context): roughly 50–60% on the exploit hypothesis task, with higher variance. The numbers drop further on remediation PR correctness because the model lacks dependency graph context.

Griffin AI's advantage is not that the underlying model is better than a frontier model — it uses a frontier model — but that the engine-produced structured context makes the reasoning task tractable.

What do the numbers not tell you?

Three honest limitations:

Eval sets are not the full real-world distribution. Benchmark inputs are curated, human-verified, and bounded. Production inputs include edge cases the eval set does not cover. Expect production behavior to show more long-tail variance than the benchmark numbers suggest.

Grading on complex outputs is imperfect. Remediation PR correctness graded as "passes tests" does not fully capture whether the fix is elegant, minimal, or correct in ways the tests do not cover. LLM-as-judge grading adds its own noise.

These numbers are a snapshot. The underlying models drift, the engine outputs change, the eval set grows. Historical numbers are not forward-looking guarantees.

We include these caveats because the alternative, presenting an unqualified claim such as "Griffin is 95% accurate," is the kind of statement that erodes trust when the production experience diverges.
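Even before those caveats apply, a score measured on a few hundred cases carries sampling noise of its own. As a rough illustration, and not as part of the published methodology, the sketch below computes a 95% Wilson score interval for a proportion, using the exploit hypothesis figure (81% of 400 cases, or 324 correct) as input. It assumes the cases are independent, which the post does not claim, and the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, total: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    Assumes independent trials (an assumption, not stated in the post).
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(324, 400)  # 81% of 400 cases
    print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 77% to 85%, so a swing of a point or two between releases on a set of this size is within what sampling noise alone could produce.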

How often are these benchmarks updated?

Internally: on every release (weekly or bi-weekly cadence). Publicly: quarterly. The published numbers will lag internal numbers by up to a quarter, which is the right cadence for external observers trying to track the capability over time without getting misled by single-release noise.

What are the benchmark sets based on?

A mix of public and curated-internal data:

  • Public sources: CVE/CWE data, public advisory corpora, jailbreak research datasets.
  • Synthetic sources: engineered test scenarios modeled on real vulnerability classes.
  • Customer-permissioned scenarios: anonymized scenarios from customers who opted in specifically for benchmark use.

We do not use customer data in the eval set without explicit opt-in. The customer-sourced portion is a minority of the total.

How should a prospective customer use this data?

Three ways:

  1. As a capability baseline — what to expect from Griffin AI on typical tasks, set against the caveats above.
  2. As a comparison anchor: Griffin's numbers versus the frontier-LLM-direct baseline show the value added specifically by the engine-plus-model architecture.
  3. As a trust signal — vendors willing to publish measurable benchmarks and caveat them honestly are a better signal than vendors who won't.

For production evaluation, the right complement to benchmarks is a pilot on your own environment with your own acceptance criteria. Benchmarks anchor expectations; pilots test fit.

How Safeguard Helps

Griffin AI is the reasoning layer inside Safeguard, operating on the structured outputs (SBOM, reachability, call graph, policy evaluations) the deterministic engine produces. The benchmark discipline described in this post governs every Griffin AI release — regressions block release, improvements are measured, caveats are explicit. For organizations evaluating AI-augmented supply chain security, Safeguard's Griffin AI is measurable by construction rather than by marketing claim, and the benchmarks are the published anchor you can come back to.
