Griffin AI drafts remediation PRs for reachable vulnerabilities — including zero-day candidates — with the taint path, exploit hypothesis, and disproof attempt attached. A human approves before merge. Frontier models matched to the tasks they are actually good at, with eval-gated fallbacks and a published benchmark.
AI remediation without structure produces unreliable PRs. With structure, it ships.
Using the same frontier model for every task wastes budget on trivial work and hits reasoning limits on complex work. Different security tasks want different model families.
If your team can't tell you the current refusal rate, citation accuracy, and regression delta against last release — you don't have an AI workflow, you have a demo.
A patch for a vulnerability that isn't actually reachable from user input is wasted work. Fixes for reachable vulnerabilities, with the taint path shown, are triage-ready.
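Reachability can be treated as a graph search: start at the user-input sources, follow data flows, and check whether the vulnerable sink is ever reached. The path the search takes is the taint path. The sketch below is a simplified illustration, not Griffin AI's analysis. It assumes a hypothetical data-flow graph given as an adjacency dict, and it ignores sanitizers, aliasing and interprocedural context, which real taint analysis has to handle.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def taint_path(flows: Dict[str, List[str]], sources: Set[str], sink: str) -> Optional[List[str]]:
    """Breadth-first search from user-input sources to a sink.

    `flows` is a hypothetical data-flow graph: node -> nodes its data flows into.
    Returns the shortest source-to-sink path, or None if the sink is unreachable.
    """
    parent: Dict[str, Optional[str]] = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node == sink:
            # Walk parent links back to the source to rebuild the taint path.
            path: List[str] = []
            cur: Optional[str] = node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in flows.get(node, []):
            if nxt not in parent:  # not yet visited
                parent[nxt] = node
                queue.append(nxt)
    return None
```

A `None` result means no path from user input was found, so a patch for that sink would be deprioritized. A returned path is the evidence a reviewer can check.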
Auto-merging AI-generated fix PRs is where programs blow up publicly. Every remediation needs a human checkpoint before it reaches main.
Reasoning-heavy triage routes to Opus-class models; high-throughput scanning uses Haiku-class to keep cost down; multi-step remediation uses Sonnet-class to balance reasoning depth against cost. Each task family has its own eval-gated fallback.
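One way to read "eval-gated fallbacks per task family" is as an ordered chain of model tiers per family, where a tier is used only if its latest eval score for that family clears a threshold. The sketch below assumes that interpretation. The routing table, tier labels, score keys and threshold are hypothetical placeholders, and no model API is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical routing table: task family -> model tiers, primary first.
ROUTES: Dict[str, List[str]] = {
    "triage": ["opus-class", "sonnet-class"],
    "scanning": ["haiku-class", "sonnet-class"],
    "remediation": ["sonnet-class", "opus-class"],
}


@dataclass(frozen=True)
class EvalGate:
    """Minimum eval score a tier must hold for a task family (hypothetical)."""
    min_score: float


def route(task_family: str, eval_scores: Dict[Tuple[str, str], float], gate: EvalGate) -> str:
    """Return the first tier in the family's chain whose latest eval score passes the gate."""
    chain = ROUTES.get(task_family)
    if chain is None:
        raise KeyError(f"no route for task family {task_family!r}")
    for tier in chain:
        score = eval_scores.get((task_family, tier))
        # A tier with no recorded eval score is treated as failing the gate.
        if score is not None and score >= gate.min_score:
            return tier
    raise RuntimeError(f"no tier passes the eval gate for {task_family!r}")
```

For example, if the Haiku-class tier scores 0.71 on scanning evals and the gate is 0.8, scanning falls back to the Sonnet-class tier. Failing closed when no tier passes keeps a degraded model from silently handling the task.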
Every LLM gets exactly the tools it needs — via scoped service identities. Irreversible actions require out-of-band confirmation. Tool-call distribution shifts trigger alerts.
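The three controls can be sketched together: a per-identity allowlist of tools, an extra out-of-band confirmation check for irreversible tools, and a drift alert that compares tool-call frequencies against a baseline. This is an illustration under stated assumptions, not Griffin AI's implementation. The identity names, tool names, the set of irreversible tools, the use of total variation distance as the shift measure, and the 0.2 threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical scopes: service identity -> tools it may call.
SCOPES: Dict[str, Set[str]] = {
    "triage-agent": {"read_code", "query_sast"},
    "remediation-agent": {"read_code", "open_pr", "delete_branch"},
}
# Hypothetical set of tools treated as irreversible.
IRREVERSIBLE: Set[str] = {"delete_branch"}


class ToolDenied(Exception):
    pass


def authorize(identity: str, tool: str, confirmed_out_of_band: bool = False) -> None:
    """Raise ToolDenied unless the identity is scoped for the tool (and confirmed, if irreversible)."""
    if tool not in SCOPES.get(identity, set()):
        raise ToolDenied(f"{identity} is not scoped for {tool}")
    if tool in IRREVERSIBLE and not confirmed_out_of_band:
        raise ToolDenied(f"{tool} requires out-of-band confirmation")


def total_variation(baseline: Counter, current: Counter) -> float:
    """Total variation distance between two tool-call frequency distributions."""
    b_total, c_total = sum(baseline.values()), sum(current.values())
    if b_total == 0 or c_total == 0:
        raise ValueError("both windows need at least one tool call")
    tools = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline[t] / b_total - current[t] / c_total) for t in tools)


def shift_alert(baseline: Counter, current: Counter, threshold: float = 0.2) -> bool:
    """True when the tool-call mix has drifted further than the (hypothetical) threshold."""
    return total_variation(baseline, current) > threshold
```

If a remediation agent that usually spends 80% of its calls reading code suddenly splits 50/50 between reads and PRs, the distance is 0.3 and the alert fires.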
Griffin AI drafts the fix PR carrying the taint path, exploit hypothesis, disproof attempt, and ranked evidence. A human reviewer approves before merge. The workflow ships, not just demos.
Five eval families run on every release: exploit-hypothesis accuracy, remediation-PR correctness, advisory summarization, cross-finding correlation, and adversarial resistance. Regressions of more than one standard deviation block the release. The published benchmark numbers are the same numbers our internal pipeline gates against — no marketing gloss, no cherry-picking.
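The text does not say which standard deviation the gate uses. The sketch below assumes it is the spread of each eval family's scores across previous releases: a family blocks the release when the candidate score falls more than one standard deviation below that family's historical mean. The family keys and scores are hypothetical, and a family with no candidate score is treated as blocking so a skipped eval cannot pass silently.

```python
import statistics
from typing import Dict, List


def blocking_regressions(
    history: Dict[str, List[float]],
    candidate: Dict[str, float],
    k: float = 1.0,
) -> List[str]:
    """Return eval families whose candidate score regresses more than k standard deviations.

    `history` maps an eval family (e.g. "remediation_pr") to its scores on prior releases.
    """
    blocked: List[str] = []
    for family, scores in history.items():
        if family not in candidate:
            blocked.append(family)  # missing eval run: fail closed
            continue
        if len(scores) < 2:
            continue  # sample standard deviation needs at least two prior releases
        mean = statistics.mean(scores)
        sd = statistics.stdev(scores)
        if mean - candidate[family] > k * sd:
            blocked.append(family)
    return blocked
```

With prior remediation-PR scores of 0.80, 0.82 and 0.81 (mean 0.81, standard deviation 0.01), a candidate at 0.78 regresses by 0.03 and blocks the release.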
See the current benchmark numbers. Run the eval harness on your workloads. Ship fixes you can defend in a review.