AI Security

Safeguard Griffin AI: Eval Benchmarks Published

Griffin AI's evaluation harness results published for the first time. Benchmark methodology, comparison against baselines, and what the numbers mean for production use.

Shadab Khan
Security Engineer
6 min read

Safeguard's Griffin AI reasoning layer has been running in production across customer environments for over a year, and in April 2026 we are publishing the first public evaluation benchmark results for it. The publication is intentional. The AI security space has too many capability claims unsupported by measurable benchmarks, and we think vendors asking customers to trust AI-augmented security analysis should be willing to be measured on the same basis as anyone else. The numbers below are from the same eval harness that gates every Griffin AI release internally, run against publicly replicable test sets where possible and against curated internal sets where necessary. This post walks through the methodology, the results, and — importantly — what the numbers do and do not mean for production use.

What does Griffin AI actually do?

Griffin AI is the reasoning layer inside the Safeguard platform that handles tasks the deterministic engine cannot: exploit condition hypothesis on reachability paths, remediation PR drafting, advisory summarization, policy explanation, and cross-finding correlation. It is not a standalone security model — it operates on the structured outputs the engine produces (call graphs, taint paths, SBOM data) rather than on raw code. That architectural choice is what makes the eval methodology feasible; we can construct inputs in a structured way and grade outputs against known correct answers.
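
To make that concrete, here is a minimal sketch of what one structured eval case and its grading function could look like. The schema, field names, and naive string-matching grader below are illustrative assumptions on our part, not Griffin's actual internal format:

```python
from dataclasses import dataclass


@dataclass
class ReachabilityEvalCase:
    """One exploit-hypothesis eval case: structured engine output in, graded output out.

    Field names are illustrative, not Griffin's real schema.
    """
    cve_id: str                     # CVE the reachability path was drawn from
    taint_path: list[str]           # ordered call sites from entry point to vulnerable sink
    sbom_component: str             # affected package@version from the SBOM
    expected_conditions: list[str]  # human-verified exploit conditions (ground truth)


def grade_hypothesis(predicted: list[str], case: ReachabilityEvalCase) -> float:
    """Fraction of ground-truth conditions the prediction covers (naive string
    matching here; a real grader would normalize and judge semantics)."""
    truth = {c.strip().lower() for c in case.expected_conditions}
    hits = {c.strip().lower() for c in predicted} & truth
    return len(hits) / len(truth) if truth else 0.0
```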

What does the eval harness look like?

The harness runs five eval families, each with its own dataset:

  1. Exploit hypothesis accuracy — 400 reachability paths drawn from real CVEs, with known exploit conditions. Griffin is asked to predict exploit conditions; output is graded against ground truth.
  2. Remediation PR correctness — 250 dependency vulnerability scenarios, with human-verified correct fixes. Griffin is asked to draft a fix PR; output is graded for compilation, test pass, and fix effectiveness.
  3. Advisory summarization — 500 real security advisories with human-written summaries. Griffin summarizes; output is graded for factual accuracy and semantic similarity to ground truth.
  4. Cross-finding correlation — 300 scenarios with multiple findings where the correct correlation is known. Griffin is asked to identify related findings; output is graded against ground truth.
  5. Adversarial resistance — 150 prompts including jailbreak attempts, leakage probes, and scope-violation attempts. Output is graded on refusal behavior and non-leakage.

Each eval family runs on every Griffin AI release. Regressions of more than one standard deviation block the release.
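
As a rough sketch of that gating logic, assuming the regression threshold is measured against the mean and standard deviation of recent release scores (that framing, and the function name, are ours for illustration rather than the harness's actual implementation):

```python
import statistics


def release_blocked(history: list[float], current: float) -> bool:
    """Block the release if the current score regresses more than one
    standard deviation below the mean of prior release scores.

    `history` holds one eval-family score per recent release;
    `current` is the score from the candidate release.
    """
    if len(history) < 2:
        return False  # not enough history to estimate variance
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return current < mean - stdev


# Example: exploit-hypothesis accuracy over the last five releases.
prior_scores = [0.79, 0.80, 0.82, 0.81, 0.81]
print(release_blocked(prior_scores, 0.77))  # True: more than one stdev below the mean
print(release_blocked(prior_scores, 0.80))  # False: within tolerance
```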

What are the published numbers?

On the April 2026 release:

  • Exploit hypothesis accuracy: 81% full agreement with ground truth, rising to 94% with partial credit for correct CWE classification even when the exploit condition differs.
  • Remediation PR correctness: 73% of generated PRs compile and pass existing tests unchanged; an additional 14% compile and pass with minor edits.
  • Advisory summarization: 0.89 semantic similarity (embedding-space) to human-written summaries; 96% factual accuracy on statement-level grading.
  • Cross-finding correlation: 88% precision, 82% recall on the 300-scenario set (see the grading sketch below this list).
  • Adversarial resistance: 100% on the canary-leakage subset, 98% on jailbreak refusal, 100% on scope-violation refusal.
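
For the cross-finding correlation numbers, the simplest interpretation of precision and recall is over predicted related-finding pairs. The pair representation and finding IDs below are our own illustration, not the harness's actual format:

```python
def correlation_precision_recall(
    predicted_pairs: set[tuple[str, str]],
    truth_pairs: set[tuple[str, str]],
) -> tuple[float, float]:
    """Precision and recall over finding pairs the model marked as related.

    Pairs are unordered, so (a, b) and (b, a) count as the same pair.
    """
    def norm(pairs: set[tuple[str, str]]) -> set[tuple[str, str]]:
        return {tuple(sorted(p)) for p in pairs}

    pred, truth = norm(predicted_pairs), norm(truth_pairs)
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Example: three predicted pairs, two correct, one ground-truth pair missed.
pred = {("F-101", "F-204"), ("F-101", "F-307"), ("F-204", "F-450")}
truth = {("F-204", "F-101"), ("F-307", "F-101"), ("F-512", "F-101")}
print(correlation_precision_recall(pred, truth))  # (0.666..., 0.666...)
```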

Numbers will continue to evolve release over release. These are the April 2026 release figures and are posted as the baseline against which future changes will be measured.

How do these compare to baselines?

Two baselines worth comparing:

Pure pattern-based tooling (no LLM) on the exploit hypothesis task: roughly 35–45% accuracy, because pattern scanners do not attempt exploit condition prediction in the first place and get credit only when a CVE pattern already exists. This is the "lower bound" comparison.

Frontier LLM applied directly (GPT-4, Claude, and similar models asked to perform the same task without engine-produced structured context): roughly 50–60% on the exploit hypothesis task, with higher variance. The numbers drop further on remediation PR correctness because the model lacks dependency-graph context.

Griffin AI's advantage is not that the underlying model is better than a frontier model — it uses a frontier model — but that the engine-produced structured context makes the reasoning task tractable.
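
A toy sketch of the difference between the two prompting modes. The prompt wording and parameters are invented for illustration; Griffin's actual prompts and context format are not public:

```python
def direct_prompt(cve_id: str, repo_url: str) -> str:
    """Baseline: ask a frontier model to hypothesize exploit conditions from scratch."""
    return (
        f"Is {cve_id} exploitable in the application at {repo_url}? "
        "Describe the conditions under which it could be exploited."
    )


def engine_grounded_prompt(cve_id: str, taint_path: list[str], component: str) -> str:
    """Engine-plus-model: the deterministic engine has already established
    reachability, so the model reasons over a bounded, structured input."""
    path = " -> ".join(taint_path)
    return (
        f"{cve_id} affects {component}. Static analysis shows this call path from "
        f"an application entry point to the vulnerable function:\n  {path}\n"
        "Given this path, state the conditions an attacker would need to satisfy "
        "to exploit the vulnerability."
    )
```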

What do the numbers not tell you?

Three honest limitations:

Eval sets are not the full real-world distribution. Benchmark inputs are curated, human-verified, and bounded. Production inputs include edge cases the eval set does not cover. Expect production behavior to show more long-tail variance than the benchmark numbers suggest.

Grading on complex outputs is imperfect. Remediation PR correctness graded as "passes tests" does not fully capture whether the fix is elegant, minimal, or correct in ways the tests do not cover. LLM-as-judge grading adds its own noise.

These numbers are a snapshot. The underlying models drift, the engine outputs change, the eval set grows. Historical numbers are not forward-looking guarantees.

We include these caveats because the alternative — presenting numbers as unqualified "Griffin is 95% accurate" — is the kind of claim that erodes trust when the production experience diverges.

How often are these benchmarks updated?

Internally: on every release (weekly or bi-weekly cadence). Publicly: quarterly. The published numbers will therefore lag internal numbers by up to a quarter; quarterly is the right cadence for external observers trying to track capability over time without being misled by single-release noise.

What are the benchmark sets based on?

A mix of public and curated-internal data:

  • Public sources: CVE/CWE data, public advisory corpora, jailbreak research datasets.
  • Synthetic sources: engineered test scenarios modeled on real vulnerability classes.
  • Customer-permissioned scenarios: anonymized scenarios from customers who opted in specifically for benchmark use.

We do not use customer data in the eval set without explicit opt-in. The customer-sourced portion is a minority of the total.

How should a prospective customer use this data?

Three ways:

  1. As a capability baseline — what to expect from Griffin AI on typical tasks, set against the caveats above.
  2. As a comparison anchor — Griffin's numbers vs the frontier-LLM-direct baseline show the value added by the engine-plus-model architecture specifically.
  3. As a trust signal — vendors willing to publish measurable benchmarks and caveat them honestly are a better signal than vendors who won't.

For production evaluation, the right complement to benchmarks is a pilot on your own environment with your own acceptance criteria. Benchmarks anchor expectations; pilots test fit.

How Safeguard Helps

Griffin AI is the reasoning layer inside Safeguard, operating on the structured outputs (SBOM, reachability, call graph, policy evaluations) the deterministic engine produces. The benchmark discipline described in this post governs every Griffin AI release — regressions block release, improvements are measured, caveats are explicit. For organizations evaluating AI-augmented supply chain security, Safeguard's Griffin AI is measurable by construction rather than by marketing claim, and the benchmarks are the published anchor you can come back to.
