Regression Testing on Fixes: Griffin AI vs Mythos

A remediation PR is only useful if it does not break anything else. Griffin AI runs targeted regression tests before opening a PR; Mythos-class tools usually do not.

Shadab Khan
Security Engineer
6 min read

A security patch that fixes a CVE and silently breaks the login flow is not a win. It is a different incident. Any serious auto-remediation tool has to prove that its fixes do not introduce regressions, and that proof has to happen before the PR lands in a reviewer's queue.

Griffin AI integrates targeted regression testing into the PR draft step. Mythos-class pure-LLM tools typically do not, because they do not have the grounded context required to select or synthesize the relevant tests. The gap shows up as two different experiences on merge day.

What regression testing on a remediation PR means

A remediation regression test is not the same as a general CI run. Its job is narrower: confirm that the specific behaviors the patched code is supposed to preserve still hold, and confirm that the specific behaviors the vulnerability allowed are now blocked.

Preservation testing covers the existing functionality touched by the diff. If the patch modifies a parser, the existing parser tests have to keep passing. If the patch changes an authentication helper, the happy-path login tests have to keep passing. These are not new tests. They already exist in the repository and have to be identified and run.

Exploit-blocking testing covers the vulnerable behavior. If the advisory describes an injection through a particular input, the patched code has to reject that input. If the advisory describes a traversal, the patched code has to refuse the traversal path. These are often new tests, synthesized from the exploit hypothesis.

Both halves have to pass for the PR to be useful. Either alone is insufficient.
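
To make the two halves concrete, here is a minimal pytest sketch. The module path, `login` helper, and `ValidationError` exception are all hypothetical, invented for illustration rather than taken from Griffin's output or any real repository.

```python
import pytest

from app.auth import ValidationError, login  # hypothetical app module

# Preservation half: this test already exists in the repository
# and must keep passing after the patch.
def test_login_accepts_valid_credentials():
    session = login("alice@example.com", "correct-horse-battery")
    assert session.is_authenticated

# Exploit-blocking half: synthesized from the exploit hypothesis.
# The advisory describes an injection through the username field,
# so the patched helper must reject that input class.
def test_login_rejects_injection_in_username():
    with pytest.raises(ValidationError):
        login("alice' OR '1'='1", "irrelevant")
```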

Griffin AI's targeted regression pass

Griffin builds a regression set for every PR. The preservation half is selected by tracing which existing tests exercise the functions and modules in the diff. That trace uses the project's test instrumentation history and call graph, so the selected set is small and directly relevant rather than the entire suite.
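
One way to approximate that selection, sketched here under the assumption that a prior instrumented run wrote a `coverage-map.json` mapping each test id to the source files it exercises (coverage.py's dynamic contexts can produce this kind of data; the file name and shape are invented):

```python
import json
import subprocess

def changed_files(base: str = "origin/main") -> set[str]:
    """Source files touched by the remediation diff, via git."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line.endswith(".py")}

def select_preservation_tests(map_path: str = "coverage-map.json") -> list[str]:
    """Return only the test ids whose recorded coverage intersects
    the diff, instead of scheduling the entire suite."""
    with open(map_path) as f:
        coverage_map: dict[str, list[str]] = json.load(f)
    touched = changed_files()
    return [
        test_id
        for test_id, files in coverage_map.items()
        if touched.intersection(files)
    ]
```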

The exploit-blocking half is synthesized from the exploit hypothesis. Griffin already has that hypothesis as part of its grounded context, including the specific input class and expected behavior. The synthesized test asserts that the patched code rejects or sanitizes that class of input.
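
What that synthesis might consume and emit, again with invented names (the hypothesis fields and the `app.files` module are illustrative; Griffin's internal schema is not public):

```python
import pytest

from app.files import PathError, read_user_file  # hypothetical module

# The exploit hypothesis as structured context a grounded pipeline
# might carry. Field names are illustrative.
HYPOTHESIS = {
    "entry_point": "app.files.read_user_file",
    "input_class": "path traversal",
    "sample_payload": "../../etc/passwd",
    "expected": "reject inputs that escape the user directory",
}

# The synthesized test targets a real function in the repo and
# asserts that the payload class is now rejected.
def test_read_user_file_blocks_traversal():
    with pytest.raises(PathError):
        read_user_file(HYPOTHESIS["sample_payload"])
```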

Both sets are run locally before the PR opens. If either fails, the diff is regenerated with the failure output fed back as context. The PR only ships when both sets pass.
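
The local run itself does not need to be exotic. A minimal sketch, assuming the preservation ids selected above and a synthesized test file at a hypothetical path; on failure, the captured output is exactly what gets fed back into regeneration:

```python
import subprocess

def run_regression_set(preservation_ids: list[str]) -> tuple[bool, str]:
    """Run the selected preservation tests plus the synthesized
    exploit-blocking tests locally, before any PR exists."""
    proc = subprocess.run(
        ["pytest", "-q", *preservation_ids, "tests/test_exploit_block.py"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr
```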

This is why a Griffin PR lands with a green regression status in its description. The reviewer does not need to trust the tool on faith. The test run is attached to the PR.

How Mythos-class tools approach regression

Pure-LLM remediation tools in the Mythos class rarely run regression tests before opening PRs. That is not an oversight. They cannot select the right tests without the grounded context a program analysis layer provides.

Some tools run the full test suite in CI after the PR is opened. That catches obvious failures but does not catch behavioral regressions that the existing suite does not cover. It also shifts the discovery cost to the reviewer, who now has to wait for CI to fail before knowing whether the patch is safe.

Some tools generate tests from the advisory text using the LLM. These tests are often plausible-looking but untethered from the repository. They assert against functions that do not exist, use fixtures that are not present, or check behaviors the codebase never implemented. They pass or fail for reasons unrelated to the actual fix.
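
An invented example of that failure mode, not drawn from any real tool's output. The test reads plausibly, but nothing in the repository defines the module or the fixture it depends on, so it dies at collection time for reasons that have nothing to do with the patch:

```python
# Plausible-looking but untethered: no app.security module exists
# in the repo, and no conftest.py defines malicious_payloads.
from app.security import sanitize_sql  # ImportError at collection

def test_sanitize_blocks_injection(malicious_payloads):  # missing fixture
    for payload in malicious_payloads:
        assert sanitize_sql(payload) == ""
```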

The practical result is that Mythos-class PRs rely on human reviewers and CI to catch regressions that a grounded pipeline could have caught before the PR was opened.

The specific regression failures we see

Three patterns show up most often in pure-LLM remediation tools.

The first is the input-validation regression. The patch tightens input validation to block an injection, but the tighter validation rejects legitimate inputs the application relied on. The authentication helper that used to accept email-plus-tag addresses now rejects them. The parser that used to tolerate trailing whitespace now errors. These breakages are invisible in the diff review and surface as production incidents.
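
A toy illustration of the pattern, with invented regexes rather than code from a real patch. The post-patch pattern drops `+` from the local part to block a payload class, and silently rejects plus-addressed emails along with it:

```python
import re

# Before the patch: permissive enough to accept plus-addressing.
EMAIL_BEFORE = re.compile(r"^[\w.+-]+@[\w.-]+\.\w+$")

# After a hasty tightening: "+" is gone from the character class.
EMAIL_AFTER = re.compile(r"^[\w.]+@[\w.-]+\.\w+$")

addr = "alice+billing@example.com"
assert EMAIL_BEFORE.match(addr)          # legitimate input, accepted
assert EMAIL_AFTER.match(addr) is None   # same input, now rejected
```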

The second is the dependency-upgrade regression. The patch bumps a dependency to a version that fixes the CVE, but the new version changes an API signature the rest of the codebase used. The build compiles because types happen to line up superficially, but a runtime call path produces the wrong behavior.
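
A minimal stand-in for how that can happen, with an invented signing API rather than a real library. Both parameters are bytes, so a call site written against the old order still builds and type-checks after the bump:

```python
from hashlib import sha256

# libfoo 1.x exposed sign(key, message); the CVE-fixing 2.x release
# swapped the order to sign(message, key). Hypothetical API.
def sign_v2(message: bytes, key: bytes) -> str:
    return sha256(key + b":" + message).hexdigest()

KEY = b"server-secret"

tag = sign_v2(KEY, b"user-payload")    # 1.x-era call site, silently swapped
good = sign_v2(b"user-payload", KEY)   # what the caller intended
assert tag != good  # behavior changed at runtime; nothing failed to build
```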

The third is the coverage regression. The patch fixes one vulnerable site but misses another that the existing test suite never reaches. The tool reports all-green because every covered path still works. The unreached path is still exploitable, and no one notices until the next scan.
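
A sketch of the shape of that failure, with invented paths and helpers. Two call sites reach the same sink; the patch hardens the one the suite exercises, and the untested export path keeps the raw join:

```python
import os.path

def _safe_join(base: str, name: str) -> str:
    """The patched helper: normalize, then refuse escapes."""
    path = os.path.normpath(os.path.join(base, name))
    if not path.startswith(base + os.sep):
        raise ValueError("path traversal blocked")
    return path

def read_report(name: str) -> str:    # patched site, covered by tests
    return open(_safe_join("/srv/reports", name)).read()

def export_report(name: str) -> str:  # second site, no test reaches it
    return open(os.path.join("/srv/reports", name)).read()  # still raw
```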

Griffin's taint analysis and synthesized exploit tests catch all three patterns before the PR opens. Pure-LLM tools without those mechanisms cannot.

The reviewer contract

Regression testing on fixes is really about what contract the remediation tool offers the reviewer. Griffin's contract is that the PR has been compiled, preservation-tested, and exploit-blocked before the reviewer sees it. The reviewer's job is to confirm scope, style, and organizational fit.

The Mythos-class contract is thinner. The PR contains a plausible diff and a short rationale. The reviewer's job is to confirm scope, style, organizational fit, whether the patch compiles, whether existing functionality still works, and whether the vulnerability is actually blocked. That is a different job, and it takes longer.

Teams that try to run auto-remediation at the volume vendors promise find out quickly which contract they have signed. The thinner contract does not scale, because the reviewer workload per PR is too high.

What to look for in evaluation

When evaluating a remediation tool, ask whether the regression set it runs is targeted or opportunistic. Targeted means the tool traced the diff's impact to specific tests and ran those tests locally before opening the PR. Opportunistic means the tool hopes CI will catch problems after the fact.

Ask whether the exploit-blocking test was synthesized from the actual code or from the advisory text. Code-grounded synthesis means the test runs against real functions and fixtures. Advisory-grounded synthesis produces tests that look right and do not exercise anything.

Ask how the tool handles failures during regression. Does it regenerate the diff, or does it open the PR anyway with a note saying the test failed? The former is honest engineering. The latter is passing the cost to the reviewer.

Why this is a structural problem

The regression gap is not something a better model closes. It is closed by having program analysis, a compile step, and test selection infrastructure around the model. That is the infrastructure Griffin invests in and that Mythos-class tools skip on the theory that a sufficiently large model makes the infrastructure unnecessary.

So far, the numbers from real repositories do not support that theory. Fixes that do not go through regression testing fail regression testing downstream. The work happens either before the PR opens or after it merges. Before is cheaper.
