AI Security

Remediation PR Quality: Griffin AI vs Mythos

Griffin AI produces draft PRs with taint paths, exploit hypotheses, and disproof attempts. Mythos-class pure-LLM tools skip those anchors, and PR quality suffers.

Shadab Khan
Security Engineer
7 min read

Every auto-remediation pipeline eventually gets judged on one thing: does the pull request it opens actually land? Reviewers do not care how clever the model was, how large the context window, or how many chain-of-thought tokens were spent. They care whether the diff compiles, whether the fix matches the vulnerability, and whether the explanation gives them enough confidence to click merge.

At Safeguard we have been measuring remediation PR quality across Griffin AI and a representative set of Mythos-class pure-LLM remediation tools. The structural difference in how each approach produces a PR drives most of the quality gap we see in the field.

What a Griffin AI remediation PR contains

When Griffin AI opens a remediation PR, the draft is assembled from four grounded artifacts rather than a single free-form generation step.

The first is the taint path. Griffin walks from the vulnerable sink back to every reachable source, annotating the intermediate frames with the call graph evidence that justifies reachability. That path is attached to the PR description, so a reviewer can scan from attacker-controlled input to the affected function without rebuilding the flow in their head.
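To make the shape of that artifact concrete, here is a minimal sketch of how a source-to-sink path could be represented and rendered into a PR body. The class, field, and file names are illustrative, not Griffin's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaintStep:
    """One frame on the path from attacker-controlled input to the sink."""
    file: str
    function: str
    line: int
    evidence: str  # the call-graph or data-flow fact that justifies reachability

def render_taint_path(steps: list[TaintStep]) -> str:
    """Render the source-to-sink path as a readable block for the PR description."""
    lines = ["Taint path (source -> sink):"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"  {i}. {step.file}:{step.line} {step.function}() -- {step.evidence}")
    return "\n".join(lines)

# Hypothetical example: a user-supplied archive flowing to an extraction sink.
path = [
    TaintStep("api/upload.py", "handle_upload", 42, "request body is attacker-controlled"),
    TaintStep("core/ingest.py", "store_archive", 88, "called from handle_upload with the raw upload"),
    TaintStep("core/extract.py", "extract_all", 17, "member names passed to the extraction call (sink)"),
]
print(render_taint_path(path))
```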

The second is the exploit hypothesis. Griffin states, in plain terms, the class of exploit the advisory suggests and the specific precondition that would need to be true in this codebase for the flaw to be triggered. This is a testable claim, not a marketing paragraph.

The third is the disproof attempt. Griffin actively tries to refute the exploit hypothesis using the project's own guards, sanitizers, and runtime invariants. If the attempt succeeds, the finding is downgraded and the PR is not opened at all. If it fails, the failure is recorded next to the hypothesis so reviewers see exactly which defensive assumption did not hold.
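The hypothesis and the disproof attempt pair naturally as data. The sketch below, with hypothetical names throughout, shows the decision rule: if a project guard invalidates the precondition, the finding is downgraded; if nothing does, the failed disproof is recorded and the PR proceeds.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExploitHypothesis:
    exploit_class: str   # the class of flaw the advisory describes
    precondition: str    # what must hold in this codebase for the flaw to trigger

@dataclass
class Guard:
    location: str        # where the guard, sanitizer, or runtime invariant lives
    neutralizes: str     # the precondition this guard invalidates

def attempt_disproof(hypothesis: ExploitHypothesis, guards: list[Guard]) -> Optional[Guard]:
    """Return the guard that refutes the precondition, or None if the hypothesis stands."""
    for guard in guards:
        if guard.neutralizes == hypothesis.precondition:
            return guard
    return None

# Hypothetical finding and the guards discovered in the project.
hypothesis = ExploitHypothesis(
    exploit_class="path traversal via archive extraction",
    precondition="unnormalized member names reach the extraction call",
)
project_guards = [
    Guard("core/extract.py:15", "archive size is bounded before extraction"),
]

refuting_guard = attempt_disproof(hypothesis, project_guards)
if refuting_guard:
    print(f"Refuted by {refuting_guard.location} -> downgrade finding, no PR opened")
else:
    print("Disproof failed -> record it next to the hypothesis and draft the PR")
```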

The fourth is the human merge gate. Griffin never self-merges. The PR lands in the normal code review flow with CODEOWNERS, required checks, and branch protection rules intact.

How Mythos-class tools structure their PRs

Pure-LLM remediation tools in the Mythos class typically take a different path. The model is handed the advisory text, a window of source code, and a prompt asking it to produce a patch. The output is a diff and a short rationale.

That rationale is usually fluent. It references CVE identifiers, names the vulnerable function, and proposes what looks like a reasonable change. What it does not contain is a traceable connection between the specific advisory and the specific code. There is no taint path, because tracing taint requires a program analysis layer the model does not have. There is no disproof attempt, because the model has no mechanism to falsify its own hypothesis against the running code. And there is rarely an explicit merge gate because the tool is marketed on its automation, so human review is framed as friction rather than as a feature.

Published numbers versus marketing numbers

Griffin AI's published benchmarks show 73 percent of auto-PRs compile clean on first push and 87 percent pass CI with minor edits from a reviewer. Those numbers come from instrumented evaluation against real repositories with real test suites, not synthetic benchmarks.

Mythos-class tools typically do not publish comparable figures. The marketing usually focuses on volume, such as the number of fixes suggested per scan, rather than on the harder question of how many of those fixes actually land. When independent teams have measured compile rates on pure-LLM patches against non-trivial codebases, the results cluster well below Griffin's bands. The structural reason is straightforward: a model patching without grounded context about the project's types, imports, and call graph generates plausible code, not code that fits this repository.

Why grounded context raises compile rates

A compile-clean patch is a patch that respects the project's real shape. It uses the types that are in scope, imports that already exist or that the build system can resolve, and language features compatible with the target runtime.

Griffin's PR generation pulls the real call site, the real type signatures, and the real import graph into the context. When the generation step proposes a change, the change is constrained by the same facts a compiler would later check. Anything that violates those facts can be caught before the PR is even opened, because Griffin runs a local compile and targeted test pass on the candidate diff as part of the draft step.
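A rough sketch of that pre-open validation step, assuming a Python project and using placeholder commands in place of whatever build system and test runner the repository actually declares:

```python
import subprocess

def validate_candidate_diff(repo_dir: str, diff_path: str, test_target: str) -> bool:
    """Apply the candidate diff, then run a build step and the targeted tests.

    The commands here are stand-ins; a real pipeline would detect the project's
    own build and test tooling rather than assume compileall and pytest.
    """
    steps = [
        ["git", "apply", "--check", diff_path],      # does the diff even apply?
        ["git", "apply", diff_path],
        ["python", "-m", "compileall", "-q", "."],   # stand-in for a real build
        ["python", "-m", "pytest", "-q", test_target],
    ]
    for cmd in steps:
        result = subprocess.run(cmd, cwd=repo_dir)
        if result.returncode != 0:
            subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)  # roll back
            return False
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)  # leave the tree clean
    return True

# Only open the PR if the candidate survives compile and targeted tests.
if validate_candidate_diff("/path/to/repo", "candidate.diff", "tests/test_extract.py"):
    print("Candidate diff is clean -> open draft PR")
else:
    print("Candidate diff fails locally -> regenerate before any reviewer sees it")
```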

A pure-LLM tool without that grounding has to infer types and imports from whatever source text it happened to see. Inference works often enough to look impressive in demos. It fails often enough on real repositories to produce the compile-fail rates that teams eventually notice.

The reviewer experience

PR quality is not only about the diff. It is also about what a reviewer feels when the PR lands in their queue.

A Griffin PR arrives with a structured header that says what was vulnerable, how the taint reaches it, what exploit was hypothesized, what the disproof found, and what the proposed change does. A reviewer can accept, reject, or adjust the change in minutes because the reasoning is visible and checkable.
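As a rough illustration of what that header can look like once assembled, here is a small helper that builds a PR body from the five pieces. The section names are made up for this post, not Griffin's actual template.

```python
def build_pr_body(vulnerable: str, taint_path: str, hypothesis: str,
                  disproof_result: str, change_summary: str) -> str:
    """Assemble a structured PR description that answers the reviewer's questions up front."""
    sections = [
        ("What was vulnerable", vulnerable),
        ("How taint reaches it", taint_path),
        ("Exploit hypothesis", hypothesis),
        ("What the disproof attempt found", disproof_result),
        ("Proposed change", change_summary),
    ]
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections)

# Hypothetical values, continuing the archive-extraction example from earlier.
print(build_pr_body(
    vulnerable="core/extract.py: extract_all() passes raw member names to the filesystem",
    taint_path="handle_upload -> store_archive -> extract_all (full path in the PR header)",
    hypothesis="path traversal if unnormalized member names reach the extraction call",
    disproof_result="no guard normalizes member names on this path; disproof failed",
    change_summary="normalize member names and reject any that escape the target directory",
))
```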

A Mythos-class PR tends to arrive with a short paragraph and a diff. The reviewer has to rebuild the taint path from memory or from the advisory, decide whether the change is scoped correctly, and verify that no unrelated code was touched. That work takes longer, and when teams are busy, the PR gets deprioritized or closed.

Failure modes worth naming

The quality gap shows up in specific patterns. Pure-LLM PRs will sometimes fix a non-vulnerable call site because it looks similar to the advisory's example. They will update a dependency in a way that breaks a peer constraint because the patcher did not check the lockfile. They will add defensive code around a sink that is already guarded, bloating the diff without improving safety.

Griffin's disproof attempt catches most of the first case, because a false-positive exploit hypothesis fails to hold. The compile-and-test step catches the second and third cases, because the resulting diff either fails to build or breaks an unrelated test. These are not sophisticated guardrails. They are simply checks that a grounded pipeline can perform because the pipeline has access to the artifacts the checks need.

What to measure if you are evaluating

Teams looking to evaluate remediation PR quality should not trust vendor compile rates in isolation. The useful measurements are repo-specific. Pick ten open advisories in your own code, have each tool open PRs, and record compile outcome, test outcome, reviewer time to decision, and merge rate. Do the same measurement a month later after the model has been updated.
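If you want a starting point for the bookkeeping, the sketch below records one row per advisory per tool and summarizes the rates that matter. The field names are ours; adjust them to whatever your review tooling exports.

```python
from dataclasses import dataclass, asdict
import csv

@dataclass
class PrTrial:
    """One advisory x one tool: the outcomes that actually decide PR quality."""
    advisory_id: str
    tool: str
    compiled: bool
    tests_passed: bool
    reviewer_minutes: float
    merged: bool

def summarize(trials: list[PrTrial]) -> dict[str, dict[str, float]]:
    """Per-tool compile rate, test pass rate, merge rate, and mean reviewer time."""
    summary: dict[str, dict[str, float]] = {}
    for tool in {t.tool for t in trials}:
        rows = [t for t in trials if t.tool == tool]
        n = len(rows)
        summary[tool] = {
            "compile_rate": sum(t.compiled for t in rows) / n,
            "test_pass_rate": sum(t.tests_passed for t in rows) / n,
            "merge_rate": sum(t.merged for t in rows) / n,
            "avg_reviewer_minutes": sum(t.reviewer_minutes for t in rows) / n,
        }
    return summary

def export(trials: list[PrTrial], path: str) -> None:
    """Write the raw rows to CSV so the same measurement can be repeated later."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(trials[0]).keys()))
        writer.writeheader()
        writer.writerows(asdict(t) for t in trials)
```

Run it on the same ten advisories before and after a model update and the month-over-month comparison falls straight out of the CSV.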

When we run that exercise on mixed stacks, Griffin tends to sit in the published bands and Mythos-class tools tend to sit noticeably below. The gap is wider on larger codebases, narrower on single-file vulnerabilities, and widest of all on patches that span multiple files.

The structural argument

A remediation PR is a structured artifact that has to survive contact with a compiler, a test suite, and a reviewer. Tools that generate the PR from grounded context and validate it before opening tend to produce PRs that survive. Tools that generate the PR from a prompt and hope for the best tend to produce PRs that do not. The grounded pipeline is harder to build. It is also what the numbers say actually works.
