AI Security

Breaking Change Awareness: Griffin AI vs Mythos

An auto-fix that closes a vulnerability and breaks the build is not a fix. Breaking-change awareness separates auto-PRs that ship from auto-PRs that get reverted.

Shadab Khan
Security Engineer
6 min read

The fastest way to lose engineering trust in an automated remediation tool is to ship a fix that breaks the build. The second fastest is to ship a fix that compiles but breaks production behaviour. Both happen routinely with auto-remediation tools that don't reason about breaking changes. Griffin AI and Mythos-class remediation tools approach this problem with different architectural choices, and the result shows up directly in the percentage of auto-PRs that merge versus the percentage that get reverted within a week.

What breaking-change awareness requires

Three capabilities, all of which need to work together:

  • Recognise breaking-change patterns in the proposed fix. Renamed function signatures, removed parameters, changed return types, modified behaviour for existing inputs.
  • Verify the fix against the consumer surface. Does any code in the project call the modified API? In what contexts? With what arguments?
  • Provide a migration path when the breaking change is unavoidable, or fall back to a non-breaking alternative when possible.

A platform that gets the first capability but not the others produces compile-clean PRs that fail at runtime. A platform that gets all three produces PRs that merge cleanly the first time.
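One way to picture the three capabilities working together is as a single data flow from proposed fix to merge decision. A minimal sketch in TypeScript, using hypothetical types (`CallSite`, `MigrationStep`, and `FixCandidate` are illustrative, not any platform's actual API):

```typescript
// Hypothetical types modelling the three capabilities. Illustrative only.
interface CallSite {
  file: string;
  line: number;
  args: string[]; // argument expressions found at the call site
}

interface MigrationStep {
  callSite: CallSite;
  rewrite: string; // suggested replacement expression for that caller
}

interface FixCandidate {
  diff: string;
  breakingPatterns: string[];  // capability 1: recognised breaking-change patterns
  affectedCallers: CallSite[]; // capability 2: consumer-surface verification
  migration: MigrationStep[];  // capability 3: path when breakage is unavoidable
}

// A fix is auto-mergeable only when no caller is affected, or when every
// affected caller has a concrete migration step attached.
function autoMergeable(fix: FixCandidate): boolean {
  return (
    fix.affectedCallers.length === 0 ||
    fix.migration.length === fix.affectedCallers.length
  );
}
```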

Where pure-LLM remediation lands

Mythos-class general-purpose models can produce diffs that look correct in isolation. The model sees the function and proposes a change to it. The change compiles. The change handles the original input correctly.

The failure mode is at the consumer surface. The model has not consistently reasoned about the rest of the codebase that calls the modified function. A change that adds a required parameter to a function used in 14 places will compile in the defining file and fail at every one of the 14 call sites.

The result is a remediation PR that merges into a feature branch, fails the test suite, and either gets fixed manually (consuming the time the auto-fix was supposed to save) or gets reverted entirely.

How Griffin AI handles it

Three deterministic steps before the model proposes a fix:

Caller enumeration. The engine identifies every call site of the function or API being modified. The taint path that produced the finding is one entry; all other call sites are also collected.
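For concreteness, here is what a naive caller-enumeration pass might look like over a TypeScript codebase, using the public TypeScript compiler API. This is an illustration of the idea, not Griffin AI's engine: it matches by identifier name, where a production engine would resolve symbols so that shadowed and re-exported names are handled correctly.

```typescript
import * as ts from "typescript";

// Collect every call site of a named function across a set of source files.
// Name matching is a simplification; a real engine resolves the symbol.
function enumerateCallers(
  fileNames: string[],
  targetName: string
): { file: string; line: number; argCount: number }[] {
  const program = ts.createProgram(fileNames, {});
  const callSites: { file: string; line: number; argCount: number }[] = [];

  for (const sourceFile of program.getSourceFiles()) {
    if (sourceFile.isDeclarationFile) continue;

    const visit = (node: ts.Node): void => {
      if (
        ts.isCallExpression(node) &&
        ts.isIdentifier(node.expression) &&
        node.expression.text === targetName
      ) {
        const { line } = sourceFile.getLineAndCharacterOfPosition(node.getStart());
        callSites.push({
          file: sourceFile.fileName,
          line: line + 1,
          argCount: node.arguments.length,
        });
      }
      ts.forEachChild(node, visit);
    };
    visit(sourceFile);
  }
  return callSites;
}
```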

Constraint analysis. For each caller, the engine determines what assumptions the caller makes about the function's signature, return type, and behaviour. Adding a required parameter where one caller passes only positional arguments is flagged as a breaking change.
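Continuing the sketch, constraint analysis for the signature-change case reduces to checking each enumerated caller against the proposed parameter counts. The `ProposedSignature` shape below is hypothetical:

```typescript
// Hypothetical description of a proposed signature change.
interface ProposedSignature {
  requiredParams: number; // parameters every caller must now supply
  optionalParams: number; // parameters with safe defaults
}

interface EnumeratedCaller {
  file: string;
  line: number;
  argCount: number; // from the caller-enumeration step
}

// A caller breaks when it supplies fewer arguments than the new required
// count, or more arguments than the new signature can accept at all.
function breakingCallers(
  callers: EnumeratedCaller[],
  proposed: ProposedSignature
): EnumeratedCaller[] {
  const max = proposed.requiredParams + proposed.optionalParams;
  return callers.filter(
    (c) => c.argCount < proposed.requiredParams || c.argCount > max
  );
}
```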

Fix selection. Where multiple possible fixes exist, Griffin AI ranks them by breaking-change impact. A fix that adds an optional parameter with a safe default is preferred over a fix that requires a parameter. A fix that introduces a new function and migrates one caller is preferred over a fix that modifies an existing function used by twenty callers.
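The ranking itself can be simple once the earlier steps have produced per-candidate numbers: order by broken callers, then by diff size as a rough blast-radius proxy. Again a sketch, assuming a non-empty candidate list; real ranking would also weigh behavioural and schema impact:

```typescript
interface RankedFix {
  description: string;
  brokenCallers: number; // from the constraint-analysis step
  diffSize: number;      // lines changed, a rough proxy for blast radius
}

// Prefer the fix that breaks the fewest callers; break ties on diff size.
// Assumes a non-empty candidate list.
function selectFix(candidates: RankedFix[]): RankedFix {
  return [...candidates].sort(
    (a, b) => a.brokenCallers - b.brokenCallers || a.diffSize - b.diffSize
  )[0];
}
```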

The output is a remediation PR that comes with a breaking-change report: zero callers affected, or N callers requiring migration with the migration steps included.

A concrete example

A finding shows that a getUser(id) function does not validate that the requesting user has permission to view the target user. The naive fix changes the signature to getUser(requestingUser, id) with a permission check.

A pure-LLM tool produces a PR that modifies getUser to take both parameters. The PR fails because 23 call sites in the codebase call getUser(id) and now break.
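In TypeScript terms, the naive fix and its failure look roughly like this (a sketch of the failure mode, not any tool's output; `User` and `db` are stand-in types):

```typescript
interface User {
  id: string;
  canView(targetId: string): boolean;
}

// Stand-in for whatever data layer the codebase uses.
declare const db: { users: { find(id: string): User } };

// Before the fix, the vulnerable API was getUser(id: string): User,
// called from 23 places as getUser(someId).

// After the naive fix: a second required parameter plus the permission check.
function getUser(requestingUser: User, id: string): User {
  if (!requestingUser.canView(id)) throw new Error("forbidden");
  return db.users.find(id);
}

// Every one of the 23 existing call sites now fails to type-check:
//   const user = getUser(someId);
//   error TS2554: Expected 2 arguments, but got 1.
```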

Griffin AI's analysis identifies the 23 call sites and ranks the fix options:

  1. Option A: rename getUser to getUserUnsafe and mechanically update all 23 call sites to the renamed function (a behaviour-preserving rewrite), introduce getUser(requestingUser, id) as the new safe API, and deprecate the unsafe version with a migration window. Breaking-change risk: zero immediate; flagged for follow-up.

  2. Option B: modify getUser(id) to accept an optional requestingUser parameter that defaults to the current request context. Breaking-change risk: zero. Suitable for codebases where the request context is reliably available.

  3. Option C: change the signature directly. Breaking-change risk: 23 call sites. Surfaced as a follow-up task with the affected call sites listed.

The auto-PR ships option A or B; option C is a tracked follow-up with explicit migration scope. Engineers see a fix that merges cleanly and a tracked task for the broader migration. No reverts.
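For illustration, option B might look like the following in TypeScript, assuming the request context reliably exposes the current user; `currentRequestUser` is a hypothetical helper, not an API from the article:

```typescript
interface User {
  id: string;
  canView(targetId: string): boolean;
}

// Stand-in for whatever data layer the codebase uses.
declare const db: { users: { find(id: string): User } };

// Hypothetical request-scoped helper. Assumes the current user is reliably
// available from the request context (e.g. via AsyncLocalStorage in Node).
declare function currentRequestUser(): User;

// Option B: the new parameter is optional and defaults to the request
// context, so all 23 existing getUser(id) call sites keep compiling.
function getUser(id: string, requestingUser: User = currentRequestUser()): User {
  if (!requestingUser.canView(id)) throw new Error("forbidden");
  return db.users.find(id);
}
```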

Coverage across change classes

Breaking-change awareness applies to:

  • API signature changes — added/removed/reordered parameters, type changes, return-type changes.
  • Behavioural changes — same signature, different output for the same input.
  • Configuration changes — environment variables added, defaults changed, options deprecated.
  • Database schema changes — added/removed columns, type changes, constraint changes.
  • Wire-protocol changes — new required headers, changed payload formats.
  • Dependency-version changes — major-version bumps with documented breaking changes.

Each follows the same caller-enumeration → constraint-analysis → fix-ranking pattern.
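Put differently, it is the same pipeline parameterised by change class; only the definition of "consumer" varies. A sketch with illustrative types:

```typescript
// Illustrative only: the shared pipeline input, parameterised by class.
type ChangeClass =
  | "api-signature"
  | "behavioural"
  | "configuration"
  | "db-schema"
  | "wire-protocol"
  | "dependency-version";

interface Consumer {
  location: string; // call site, config reader, schema query, protocol client
  breaks: boolean;  // result of the constraint-analysis step for this consumer
}

interface Candidate {
  changeClass: ChangeClass;
  consumers: Consumer[]; // caller enumeration, generalised per change class
}

// The ranking input is identical whether the "callers" are functions reading
// an environment variable or services speaking a wire protocol.
function blastRadius(c: Candidate): number {
  return c.consumers.filter((consumer) => consumer.breaks).length;
}
```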

A measurable outcome

Griffin AI's published benchmarks: 73% of auto-PRs compile and pass existing tests unchanged; 87% pass with minor edits. The remaining 13% are explicitly flagged as requiring engineering review before merge.

These numbers are a direct consequence of the breaking-change awareness pipeline. Without it, the merge rates would be substantially lower, and the auto-PRs that did pass tests would still create downstream incidents at runtime.

What to evaluate

Three concrete checks:

  1. Show the platform a finding where the obvious fix changes a widely used API. Does the auto-PR break callers, or does it choose a non-breaking alternative?
  2. Show a finding where the fix requires a behavioural change. Is the behavioural impact surfaced?
  3. Look at five recent auto-PRs the platform produced. What percentage merged unchanged?

The third check is the operational truth. A platform whose auto-PRs land at 70%+ unchanged is a platform engineers will use. A platform whose auto-PRs land at 30% is one engineers will turn off.

How Safeguard Helps

Safeguard's auto-remediation pipeline includes breaking-change awareness as a first-class step. Caller enumeration, constraint analysis, and fix ranking happen before the auto-PR is generated, so the PR that lands is the lowest-blast-radius option that still closes the finding. The published auto-PR benchmarks reflect this discipline. For engineering teams whose trust in automated remediation has been damaged by previous tools, this is the architectural choice that rebuilds it.
