
Auto-Fix Compile Rates: Griffin AI vs Mythos

Griffin AI's auto-fixes compile clean 73 percent of the time and pass CI with minor edits 87 percent of the time. Mythos-class pure-LLM patches rarely come close to those numbers, and there is a structural reason why.

Nayan Dey
Senior Security Engineer
6 min read

Compile rate is the first filter an auto-remediation tool has to clear. If the patched code does not build, nothing else about the fix matters. The reviewer cannot test it, the CI cannot validate it, and the pipeline cannot merge it. A high compile rate is not a luxury metric. It is the gate.

Griffin AI publishes two numbers we hold ourselves to: 73 percent of auto-fix PRs compile clean on first push, and 87 percent pass CI after minor reviewer edits. Mythos-class pure-LLM remediation tools rarely publish compile rates at all. When independent teams measure them, the numbers land well below those bands. The gap is not a coincidence and it is not a model quality issue. It is structural.

What a compile-clean patch requires

A patch compiles clean only if it respects the project's types, imports, visibility rules, language version, and build configuration. Each of those is a constraint the patcher has to satisfy exactly, not approximately. A close miss on any one of them produces a build failure.

Types have to match the signatures at the call sites touched. Imports have to resolve against the project's module graph and dependency constraints. Visibility rules, such as internal and package-private markers, have to permit the code motion the patch performs. Language version has to match so that the patcher does not use a feature unavailable on the target. Build configuration, including generated sources, code generation steps, and native toolchains, has to survive the edit.

Those constraints live in the repository, not in the vulnerability advisory. A tool that does not read them will violate them.
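
To make that concrete, here is a minimal sketch in Python of two of those checks: does the patched file parse under the project's minimum language version, and does every import it references resolve against an installed dependency. The helper names are illustrative, and a real pipeline would resolve imports against the project's lockfile and module graph rather than the local environment; this is a sketch of the idea, not any vendor's implementation.

```python
import ast
import importlib.util
from pathlib import Path

# Illustrative pre-flight checks. Real build systems add many more constraints
# (visibility rules, generated sources, native toolchains).

def parses_for_target(source: str, feature_version: tuple[int, int]) -> bool:
    """Best-effort check that the patch avoids syntax newer than the target version."""
    try:
        ast.parse(source, feature_version=feature_version)
        return True
    except SyntaxError:
        return False

def unresolved_imports(source: str) -> list[str]:
    """Top-level imports in the patched file that do not resolve in this environment.

    Here the local environment stands in for the project's resolved dependencies;
    a real pipeline would check against the project's own lockfile.
    """
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing

patched = Path("src/app/handlers.py").read_text()  # hypothetical patched file
if not parses_for_target(patched, (3, 9)) or unresolved_imports(patched):
    print("candidate patch violates a project constraint; do not open a PR")
```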

Griffin AI's grounding chain

Griffin builds every auto-fix against a resolved view of the project. Before the model generates a diff, Griffin has already extracted the relevant call sites, the enclosing type signatures, the import graph of affected files, and the module and build settings. The model's task is constrained rather than open-ended: make the smallest change that removes the taint, using the types already in scope and imports already resolved.
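
Pictured loosely as a data structure, that resolved view travels with every generation request. The shape below is a simplification for illustration, not Griffin's actual schema; the field names are made up.

```python
from dataclasses import dataclass

# Illustrative only: a simplified stand-in for the resolved project view
# assembled before any diff is generated.

@dataclass
class GroundedContext:
    call_sites: list[str]               # file:line locations reached by the taint path
    signatures: dict[str, str]          # enclosing functions/types and their exact signatures
    import_graph: dict[str, list[str]]  # imports already resolved for each affected file
    module_settings: dict[str, str]     # build config: language level, generated dirs, module deps
    advisory_id: str = ""

def build_fix_request(ctx: GroundedContext, vulnerable_snippet: str) -> str:
    """Frame the model's task as a constrained edit, not an open-ended rewrite."""
    return (
        f"Advisory: {ctx.advisory_id}\n"
        f"Call sites in scope: {ctx.call_sites}\n"
        f"Signatures that must not change: {ctx.signatures}\n"
        f"Imports already available: {ctx.import_graph}\n"
        f"Build constraints: {ctx.module_settings}\n"
        "Task: make the smallest change that removes the taint, "
        "using only the types and imports listed above.\n\n"
        f"{vulnerable_snippet}"
    )
```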

After generation, Griffin runs a local compile on the candidate diff. If compilation fails, the diff is regenerated with the failure output fed back as additional context. Targeted tests are run next, either existing unit tests that touch the affected code or synthesized tests that exercise the taint path. Only diffs that pass both checks are opened as PRs.
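
The control flow amounts to a loop with two gates. The sketch below is illustrative rather than Griffin's implementation; the callables stand in for whatever model client, build runner, and PR tooling surround the loop, and the attempt limit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepResult:
    ok: bool
    output: str   # compiler or test output, reusable as feedback

def remediate(
    finding,
    grounded_context,
    generate_diff: Callable[..., str],
    run_compile: Callable[[str], StepResult],
    run_targeted_tests: Callable[[str], StepResult],
    open_pr: Callable[..., str],
    max_attempts: int = 3,          # illustrative limit
) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        diff = generate_diff(finding, grounded_context, extra_context=feedback)

        compiled = run_compile(diff)
        if not compiled.ok:
            feedback = compiled.output    # build failure becomes regeneration context
            continue

        tested = run_targeted_tests(diff)
        if not tested.ok:
            feedback = tested.output      # failing targeted tests also feed back
            continue

        return open_pr(diff, evidence=(compiled, tested))

    return None  # no candidate cleared both gates; nothing reaches a reviewer
```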

That chain is why 73 percent of auto-fix PRs compile clean on first push. It is not a property of the underlying model. It is a property of the pipeline around the model. The 87 percent pass-with-minor-edits figure captures the cases where a reviewer had to touch whitespace, a naming convention, or a style comment, but the fix itself was correct.

Why pure-LLM pipelines undershoot

A Mythos-class pure-LLM tool usually takes a different shape. The model gets an advisory, a code window, and a prompt. It produces a diff. The diff is not compiled before the PR is opened. There is no feedback loop from the build system.
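
Reduced to a sketch, that flow looks roughly like this; the names are illustrative and the point is what is missing, not what is present.

```python
# Rough shape of an ungrounded pure-LLM remediation flow (illustrative, not any
# specific vendor's code): nothing sits between the model output and the PR.

def remediate_ungrounded(advisory_text, code_window, model, open_pr):
    prompt = f"{advisory_text}\n\nFix the vulnerability in:\n{code_window}"
    diff = model(prompt)      # the model guesses types, imports, and project layout
    return open_pr(diff)      # never compiled, never tested before a human sees it
```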

The model has to guess at the project's shape. It will often guess well on small, isolated files. It will often guess badly on large repositories with generated code, internal DSLs, or non-standard build layouts. Those guesses surface as import errors, type errors, and references to functions that do not exist in this codebase.

Even when the guesses happen to be correct, they are not verified. The tool does not know whether the patch compiles until a human applies it and watches the build break. By that point, the human's time has already been spent.

The numbers teams have reported

Internal and external benchmarks we have seen on pure-LLM remediation tools, run against real OSS and enterprise repositories, tend to show compile rates in the 30 to 50 percent range for non-trivial patches. Single-file trivial patches can clear 70 percent, but those are the patches teams were least worried about to begin with.

Griffin's 73 and 87 percent numbers are measured against the same kind of non-trivial repositories, including multi-module Java projects, Python monorepos, and Node workspaces. The gap widens, rather than narrows, as repository complexity increases.

Build systems are where pure-LLM patches die

The specific failure mode we see most often in Mythos-class tools is build-system mismatch. The model generates a diff that assumes a standard project layout. The real project uses a non-standard layout because of a generator, a plugin, or a legacy convention. The diff lands, the build fails, and the PR closes with no merge.

Griffin avoids this because the build system is part of the grounded context. We read the build configuration, we know where generated sources live, and we know which modules depend on which. The patcher is told those facts rather than inferring them.
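
A rough sketch of that step, using a Python project and pyproject.toml as the example. The keys read here are illustrative and the tool.codegen table is invented for the example; the point is that the patcher is handed these facts rather than left to infer them.

```python
import tomllib            # Python 3.11+; use the tomli backport on older interpreters
from pathlib import Path

# Illustrative sketch, not Griffin's implementation: pull the facts the patcher
# needs out of the build configuration instead of letting a model infer them.

def read_build_facts(repo_root: Path) -> dict:
    config = tomllib.loads((repo_root / "pyproject.toml").read_text())
    project = config.get("project", {})
    return {
        # Language level the patch must respect.
        "requires_python": project.get("requires-python", ""),
        # Declared dependencies: the only imports a patch may assume will resolve.
        "dependencies": project.get("dependencies", []),
        # Layout: src/ vs flat changes where edits and new files belong.
        "layout": "src" if (repo_root / "src").is_dir() else "flat",
        # Generated sources (protobuf stubs, client SDKs) usually sit behind
        # tool-specific tables; "tool.codegen" is a made-up example key.
        "generated_dirs": config.get("tool", {}).get("codegen", {}).get("out", []),
    }
```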

Partial patches and their cost

A subtler failure is the partial patch. The model fixes the obvious call site and misses a second site in the same file or a related file. The diff compiles. The vulnerability is still exploitable through the missed path.

Griffin's taint analysis finds every reachable site before generation. If the vulnerable pattern appears at three locations, all three are included in the patch scope. If only two are reachable from real sources, only two are patched and the third is annotated with why it was left alone. A reviewer can see the coverage decision rather than having to reconstruct it.
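
A simplified sketch of that scoping decision, with made-up file paths: every occurrence either enters the patch scope or carries an annotation explaining why it did not.

```python
from dataclasses import dataclass

# Illustrative coverage scoping: each occurrence of the vulnerable pattern is
# either placed in the patch scope or annotated with why it was left alone.

@dataclass
class Site:
    location: str        # file:line
    reachable: bool      # does taint analysis connect it to a real source?

def scope_patch(sites: list[Site]) -> tuple[list[str], dict[str, str]]:
    to_patch, skipped = [], {}
    for site in sites:
        if site.reachable:
            to_patch.append(site.location)
        else:
            skipped[site.location] = "no taint path from an external source"
    return to_patch, skipped

# Example: three occurrences, two reachable -> two patched, one annotated.
sites = [
    Site("api/upload.py:88", reachable=True),
    Site("api/admin.py:41", reachable=True),
    Site("scripts/backfill.py:17", reachable=False),
]
patch_scope, annotations = scope_patch(sites)
```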

Pure-LLM tools without taint analysis routinely miss sites that are not in the immediate code window. The fix looks correct, passes a review, and leaves the vulnerability half-open.

What reviewers should ask

If you are considering an auto-remediation tool, the useful questions are about the pipeline, not about the model. Does the tool compile its candidate diffs before opening the PR? Does it run at least a targeted test pass? Does it feed compile failures back into regeneration rather than give up or hallucinate? Does it scope the patch against a taint analysis or only against the advisory text? Does it read your build configuration?

When the answers are yes, compile rates in the Griffin bands are achievable. When the answers are no, compile rates stay in the pure-LLM bands regardless of how good the underlying model is.

The bottom line

Compile rate is a function of how the pipeline is constructed, not of how clever the generator is. Griffin AI's 73 and 87 percent figures reflect a pipeline that reads the repository, generates under constraint, compiles the candidate, and only opens PRs that pass. Mythos-class pure-LLM approaches skip the grounding and the verification, so the diffs they open look reasonable and fail to build. The patches that do not compile cannot fix anything. The patches that compile usually can.
