Remediation PR Quality: Griffin AI vs Mythos

Griffin AI produces draft PRs with taint paths, exploit hypotheses, and disproof attempts. Mythos-class pure-LLM tools skip those anchors, and PR quality suffers.

Shadab Khan
Security Engineer
7 min read

Every auto-remediation pipeline eventually gets judged on one thing: does the pull request it opens actually land? Reviewers do not care how clever the model was, how large the context window was, or how many chain-of-thought tokens were spent. They care whether the diff compiles, whether the fix matches the vulnerability, and whether the explanation gives them enough confidence to click merge.

At Safeguard we have been measuring remediation PR quality across Griffin AI and a representative set of Mythos-class pure-LLM remediation tools. The structural difference in how each approach produces a PR drives most of the quality gap we see in the field.

What a Griffin AI remediation PR contains

When Griffin AI opens a remediation PR, the draft is assembled from four grounded artifacts rather than a single free-form generation step.

The first is the taint path. Griffin walks from the vulnerable sink back to every reachable source, annotating the intermediate frames with the call graph evidence that justifies reachability. That path is attached to the PR description, so a reviewer can scan from attacker-controlled input to the affected function without rebuilding the flow in their head.
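As a rough illustration of that backward walk, the sketch below runs a breadth-first search over a reversed call graph, starting at the sink and recording one shortest path for each source that can reach it. The graph shape, the function names, and the `taint_paths` helper are all hypothetical. This is not Griffin's implementation, which also attaches call graph evidence to each frame.

```python
from collections import deque


def taint_paths(callers, sink, sources):
    """Return one shortest source-to-sink path per reachable source.

    `callers` maps a function name to the functions that call it,
    i.e. a reversed call graph. All names are illustrative.
    """
    # parent[x] is the next hop from x toward the sink.
    parent = {sink: None}
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in parent:
                parent[caller] = node
                queue.append(caller)

    paths = {}
    for source in sources:
        if source in parent:  # unreachable sources are skipped
            path, node = [], source
            while node is not None:
                path.append(node)
                node = parent[node]
            paths[source] = path
    return paths


# Hypothetical call graph: who calls whom, keyed by callee.
callers = {
    "db.execute": ["build_query"],
    "build_query": ["search_handler", "report_job"],
    "search_handler": ["http_router"],
}
paths = taint_paths(callers, "db.execute", ["http_router", "cron_entry"])
print(paths)
```

Here `cron_entry` never reaches the sink, so it does not appear in the result. Only the `http_router` path would be attached to the PR description.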

The second is the exploit hypothesis. Griffin states, in plain terms, the class of exploit the advisory suggests and the specific precondition that would need to be true in this codebase for the flaw to be triggered. This is a testable claim, not a marketing paragraph.

The third is the disproof attempt. Griffin actively tries to refute the exploit hypothesis using the project's own guards, sanitizers, and runtime invariants. If the attempt succeeds, the finding is downgraded and the PR is not opened at all. If it fails, the failure is recorded next to the hypothesis so reviewers see exactly which defensive assumption did not hold.
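The gating logic in that step can be sketched as follows, assuming each guard is modeled as a simple predicate over the taint path. The `Hypothesis` class, the guard names, and the `decide` function are hypothetical stand-ins, not Griffin's API. A real disproof attempt would reason about sanitizers and runtime invariants rather than match on function names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    claim: str
    precondition: str
    # Names of the defensive assumptions that did not hold.
    failed_disproofs: List[str] = field(default_factory=list)


def decide(hypothesis: Hypothesis,
           guards: Dict[str, Callable[[List[str]], bool]],
           taint_path: List[str]) -> str:
    """Return 'downgrade' if any guard refutes the hypothesis,
    otherwise 'open_draft_pr' with the failed attempts recorded."""
    for name, guard in guards.items():
        if guard(taint_path):
            return "downgrade"  # hypothesis refuted, no PR is opened
        hypothesis.failed_disproofs.append(name)
    return "open_draft_pr"


# Hypothetical guards: each returns True if it neutralizes the taint.
guards = {
    "input_is_escaped": lambda path: "escape_sql" in path,
    "query_is_parameterized": lambda path: "bind_params" in path,
}

unguarded = Hypothesis("SQL injection", "user input reaches db.execute unescaped")
unguarded_decision = decide(
    unguarded, guards,
    ["http_router", "search_handler", "build_query", "db.execute"],
)

guarded = Hypothesis("SQL injection", "user input reaches db.execute unescaped")
guarded_decision = decide(
    guarded, guards,
    ["http_router", "escape_sql", "build_query", "db.execute"],
)
print(unguarded_decision, unguarded.failed_disproofs, guarded_decision)
```

In the unguarded case both disproof attempts fail, and their names are kept next to the hypothesis so a reviewer can see which assumptions did not hold.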

The fourth is the human merge gate. Griffin never self-merges. The PR lands in the normal code review flow with CODEOWNERS, required checks, and branch protection rules intact.

How Mythos-class tools structure their PRs

Pure-LLM remediation tools in the Mythos class typically take a different path. The model is handed the advisory text, a window of source code, and a prompt asking it to produce a patch. The output is a diff and a short rationale.

That rationale is usually fluent. It references CVE identifiers, names the vulnerable function, and proposes what looks like a reasonable change. What it does not contain is a traceable connection between the specific advisory and the specific code. There is no taint path, because tracing taint requires a program analysis layer the model does not have. There is no disproof attempt, because the model has no mechanism to falsify its own hypothesis against the running code. And there is rarely an explicit merge gate because the tool is marketed on its automation, so human review is framed as friction rather than as a feature.

Published numbers versus marketing numbers

Griffin AI's published benchmarks show 73 percent of auto-PRs compile clean on first push and 87 percent pass CI with minor edits from a reviewer. Those numbers come from instrumented evaluation against real repositories with real test suites, not synthetic benchmarks.

Mythos-class tools typically do not publish comparable figures. The marketing usually focuses on volume, such as number of fixes suggested per scan, rather than on the harder question of how many of those fixes actually land. When independent teams have measured compile rates on pure-LLM patches against non-trivial codebases, the results cluster well below Griffin's bands. The structural reason is straightforward: a patch written without grounded context about the project's types, imports, and call graph is generating plausible code, not code that fits this repository.

Why grounded context raises compile rates

A compile-clean patch is a patch that respects the project's real shape. It uses the types that are in scope, imports that already exist or that the build system can resolve, and language features compatible with the target runtime.

Griffin's PR generation pulls the real call site, the real type signatures, and the real import graph into the context. When the generation step proposes a change, the change is constrained by the same facts a compiler would later check. Anything that violates those facts can be caught before the PR is even opened, because Griffin runs a local compile and targeted test pass on the candidate diff as part of the draft step.
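A pre-open validation gate of this kind can be sketched in a few lines. The `validate_candidate` helper and the step names are hypothetical, and the commands are stand-ins so the sketch runs anywhere. A real project would substitute its own build and targeted test commands.

```python
import subprocess
import sys


def validate_candidate(steps, cwd="."):
    """Run each (name, command) step in order; stop at the first failure."""
    for name, command in steps:
        result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
        if result.returncode != 0:
            return {"ok": False, "failed_step": name, "output": result.stderr}
    return {"ok": True, "failed_step": None, "output": ""}


# Stand-in commands: the first succeeds, the second simulates a failing test.
steps = [
    ("compile", [sys.executable, "-c", "compile('x = 1', 'patch.py', 'exec')"]),
    ("tests", [sys.executable, "-c", "import sys; sys.exit(1)"]),
]
outcome = validate_candidate(steps)
print(outcome["ok"], outcome["failed_step"])
```

Under this sketch, the draft PR is opened only when `outcome["ok"]` is true. A failing step is reported by name so the candidate diff can be revised or dropped.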

A pure-LLM tool without that grounding has to infer types and imports from whatever source text it happened to see. Inference works often enough to look impressive in demos. It fails often enough on real repositories to produce the compile-fail rates that teams eventually notice.

The reviewer experience

PR quality is not only about the diff. It is also about what a reviewer experiences when the PR lands in their queue.

A Griffin PR arrives with a structured header that says what was vulnerable, how the taint reaches it, what exploit was hypothesized, what the disproof found, and what the proposed change does. A reviewer can accept, reject, or adjust the change in minutes because the reasoning is visible and checkable.
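One way to picture that header is as a small record with one field per question the reviewer needs answered. The field names, the rendering format, and the sample values below are assumptions for illustration. The article does not specify Griffin's actual header layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RemediationHeader:
    vulnerability: str
    taint_path: List[str]
    exploit_hypothesis: str
    disproof_result: str
    proposed_change: str

    def render(self) -> str:
        """Render the header as plain text for a PR description."""
        lines = [
            f"Vulnerability: {self.vulnerability}",
            "Taint path: " + " -> ".join(self.taint_path),
            f"Exploit hypothesis: {self.exploit_hypothesis}",
            f"Disproof result: {self.disproof_result}",
            f"Proposed change: {self.proposed_change}",
        ]
        return "\n".join(lines)


# Illustrative values only.
header = RemediationHeader(
    vulnerability="SQL injection in build_query",
    taint_path=["http_router", "search_handler", "build_query", "db.execute"],
    exploit_hypothesis="Unescaped query parameter reaches db.execute",
    disproof_result="No escaping or parameter binding found on the path",
    proposed_change="Use bound parameters in build_query",
)
description = header.render()
print(description)
```

Because every field is required, a draft with a missing taint path or disproof result fails at construction time instead of reaching the reviewer incomplete.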

A Mythos-class PR tends to arrive with a short paragraph and a diff. The reviewer has to rebuild the taint path from memory or from the advisory, decide whether the change is scoped correctly, and verify that no unrelated code was touched. That work takes longer, and when teams are busy, the PR gets deprioritized or closed.

Failure modes worth naming

The quality gap shows up in specific patterns. Pure-LLM PRs will sometimes fix a non-vulnerable call site because it looks similar to the advisory's example. They will update a dependency in a way that breaks a peer constraint because the patcher did not check the lockfile. They will add defensive code around a sink that is already guarded, bloating the diff without improving safety.

Griffin's disproof attempt catches most of the first case, because a false-positive exploit hypothesis fails to hold, and it also addresses the third, because a sink that is already guarded refutes the hypothesis before any patch is drafted. The compile-and-test step catches the second case, because the resulting diff either fails to build or breaks a test. These are not sophisticated guardrails. They are simply checks that a grounded pipeline can perform because the pipeline has access to the artifacts the checks need.

What to measure if you are evaluating

Teams looking to evaluate remediation PR quality should not trust vendor compile rates in isolation. The useful measurements are repo-specific. Pick ten open advisories in your own code, have each tool open PRs, and record compile outcome, test outcome, reviewer time to decision, and merge rate. Repeat the measurement a month later, after the model has been updated, to see whether the results hold.
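The bookkeeping for that exercise is simple enough to script. The sketch below assumes one record per PR with hypothetical field names. The sample values are placeholders to show the arithmetic, not measured results for any tool.

```python
from statistics import median


def summarize(records):
    """Aggregate per-PR outcomes into the four metrics worth tracking."""
    total = len(records)
    return {
        "compile_rate": sum(r["compiled"] for r in records) / total,
        "test_pass_rate": sum(r["tests_passed"] for r in records) / total,
        "merge_rate": sum(r["merged"] for r in records) / total,
        "median_minutes_to_decision": median(
            r["minutes_to_decision"] for r in records
        ),
    }


# Placeholder records for a single hypothetical tool.
records = [
    {"compiled": True, "tests_passed": True, "merged": True, "minutes_to_decision": 5},
    {"compiled": True, "tests_passed": False, "merged": False, "minutes_to_decision": 20},
    {"compiled": False, "tests_passed": False, "merged": False, "minutes_to_decision": 30},
    {"compiled": True, "tests_passed": True, "merged": False, "minutes_to_decision": 10},
]
summary = summarize(records)
print(summary)
```

Running the same script on each tool's PRs, and again after a model update, gives comparable numbers on your own code.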

When we run that exercise on mixed stacks, Griffin tends to sit in the published bands and Mythos-class tools tend to sit noticeably below. The gap is wider on larger codebases, narrower on single-file vulnerabilities, and widest of all on patches that span multiple files.

The structural argument

A remediation PR is a structured artifact that has to survive contact with a compiler, a test suite, and a reviewer. Tools that generate the PR from grounded context and validate it before opening tend to produce PRs that survive. Tools that generate the PR from a prompt and hope for the best tend to produce PRs that do not. The grounded pipeline is harder to build. It is also what the numbers say actually works.
