AI Security

Hypothesis Quality: Griffin AI vs Mythos

Two AI bug hunters can both generate hypotheses. Only one can defend them. A field study of grounded versus ungrounded hypothesis generation in zero-day discovery.

Nayan Dey
Senior Security Engineer
6 min read

A hypothesis is a promise. When an AI scanner tells me "this function is vulnerable to command injection at line 412," it is promising that, if I spend my next thirty minutes reading, I will find a real primitive, or at least a real near-miss. The quality of an AI bug hunter is basically the distribution of those promises. How often are they kept? When they are not, how spectacular is the failure? A tool whose hypotheses are mostly wrong teaches the team to ignore it. A tool whose hypotheses are mostly right, even when the finding is ultimately downgraded, teaches the team to read carefully.

I have spent most of the last eighteen months watching both Griffin AI and a rotating cast of Mythos-class tools produce hypotheses against the same C, Java, Python, and Go codebases. The populations of hypotheses are genuinely different, and the difference is not subtle.

What a grounded hypothesis looks like

Griffin's hypotheses arrive with scaffolding. Before the language model writes anything, the static engine has already identified a reachable taint path: a concrete source, a concrete sink, a concrete inter-procedural sequence, and a set of constraints that the path implies. The model's job is not to invent a bug; it is to explain the bug the engine has already located and to propose what an attacker would need to do to exercise it. Because the path exists in the program graph, the hypothesis inherits reachability for free. Because the CWE classification is constrained by the source-sink pair, it drifts far less than it does in free-form generation.

A grounded hypothesis for a CWE-89 SQL injection on a REST handler looks like this: request body field user.search flows, unsanitised, through a helper named buildQueryFragment, into a raw string concatenation at the ORM boundary on line 402 of repository/reports.go. The attacker must send a POST that bypasses the JSON schema validator (which only constrains length, not content). The hypothesised exploit condition is that search contains an unescaped single quote followed by a UNION clause. Everything in that sentence is anchored to a specific line, a specific function, a specific validator.
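As a concrete illustration, here is a minimal Go sketch of the pattern that hypothesis describes. Only the helper name comes from the hypothesis; the query shape, column names, and payload are my invention, not the actual codebase:

```go
package main

import "fmt"

// Hypothetical reconstruction of the flagged pattern: the attacker-controlled
// search term is concatenated into raw SQL with no escaping between source
// and sink. Table and column names are illustrative.
func buildQueryFragment(search string) string {
	return fmt.Sprintf("SELECT id, title FROM reports WHERE title = '%s'", search)
}

func main() {
	// The hypothesised exploit condition: an unescaped single quote followed
	// by a UNION clause, which a length-only validator does not stop.
	payload := "' UNION SELECT username, password FROM users --"
	fmt.Println(buildQueryFragment(payload))

	// A parameterised query would close the gap, e.g.
	// db.QueryContext(ctx, "SELECT id, title FROM reports WHERE title = $1", search)
}
```

Note how every element of the grounded hypothesis maps onto something checkable in the sketch: the source (the search argument), the sink (the concatenation), and the sanitiser gap (no escaping anywhere on the path).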

A Mythos-class hypothesis for the same target often reads the function, notes that it handles user input and does database things, and narrates a bug. Sometimes the bug is real. More often the bug it narrates is subtly wrong: it invokes a sanitiser that the function does not call, or a sink that the ORM routes around, or an input field that the controller never actually exposes. The grammar of the hypothesis is correct. The referents are hallucinated.

The FP distribution is bimodal

When you plot the false-positive rate by CWE class across a Mythos-class run, the distribution is bimodal in a painful way. CWE classes with extremely strong textual priors (CWE-89 SQL injection, CWE-78 OS command injection, CWE-22 path traversal) show false-positive rates in the 55 to 75 percent range because the model has read thousands of real examples and has strong intuitions about what the code "should" look like. CWE classes with weaker priors (CWE-362 race conditions, CWE-787 out-of-bounds writes, CWE-617 reachable assertions) show false-positive rates above 90 percent because the model resorts to pattern-matching on surface features and invents causality.
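The measurement itself is straightforward: a per-class false-positive rate is a grouped ratio over triaged findings. A minimal sketch, with illustrative data (the `Finding` shape and the sample labels are mine, not the study's records):

```go
package main

import "fmt"

// Finding is a minimal triaged record: its CWE class and whether human
// review confirmed it. The sample data below is illustrative only.
type Finding struct {
	CWE  string
	Real bool
}

// fpRateByCWE returns false positives / total findings per CWE class.
func fpRateByCWE(findings []Finding) map[string]float64 {
	total := map[string]int{}
	fps := map[string]int{}
	for _, f := range findings {
		total[f.CWE]++
		if !f.Real {
			fps[f.CWE]++
		}
	}
	rates := map[string]float64{}
	for cwe, n := range total {
		rates[cwe] = float64(fps[cwe]) / float64(n)
	}
	return rates
}

func main() {
	findings := []Finding{
		{"CWE-89", true}, {"CWE-89", false}, {"CWE-89", false},
		{"CWE-362", false}, {"CWE-362", false},
	}
	for cwe, r := range fpRateByCWE(findings) {
		fmt.Printf("%s: %.0f%% FP\n", cwe, 100*r)
	}
}
```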

Griffin's distribution is flatter, because the engine does the same grounding work for every class. Race conditions, bounds errors, and logic bugs still have lower recall than injection classes, because they are genuinely harder to ground, but the bugs that survive into hypotheses are defensible. The 2024 DARPA AIxCC retrospective and the follow-up analyses at IEEE S&P 2025 make a similar point: pipelines that pair program analysis with LLM reasoning show much flatter precision curves across CWE classes than LLM-only systems.

Reading the explanations

The tell is in the explanation field. Ask a Mythos-class tool why it believes a finding, and you will typically get a paragraph of fluent prose that reads like a blog post. It cites a CWE, paraphrases a textbook, and waves at the code. Ask Griffin the same question, and you get a taint path, a list of constraints, a sanitiser gap, and the hypothesised exploit conditions expressed as constraints on attacker-controllable inputs. One is a narrative; the other is an argument. Narratives do not falsify. Arguments do.

This matters for the second pass. If the hypothesis is a narrative, the disproof step has nothing concrete to attack, so the model tends to re-narrate the same story and confirm itself. If the hypothesis is an argument with explicit conditions, the disproof step has discrete claims to break, and the pass is productive. Hypothesis quality and disproof quality are not independent properties. The first determines whether the second is possible at all.

A small experiment

Last quarter I ran a small, unscientific experiment on a single Go microservice. I had Griffin and two Mythos-class tools analyse the same HEAD. Griffin produced 31 hypotheses; after triage, 27 were real bugs or defensible near-misses. The two Mythos tools produced 284 and 412 hypotheses respectively; after triage, 34 and 29 were real. The raw true-positive counts are comparable. The ratios are not: Griffin's precision was 87 percent; the Mythos tools came in at 12 percent and 7 percent. The triage cost to extract the same true-positive volume was roughly an order of magnitude higher.
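The ratios fall straight out of the raw counts. A minimal sketch (the helper names are mine; the counts are the ones from the run above):

```go
package main

import "fmt"

// precision is true positives over total hypotheses emitted.
func precision(truePositives, total int) float64 {
	return float64(truePositives) / float64(total)
}

// triageCost is hypotheses a reviewer must read per real bug found.
func triageCost(truePositives, total int) float64 {
	return float64(total) / float64(truePositives)
}

func main() {
	// Counts from the experiment: (real, total) per tool.
	fmt.Printf("Griffin:  %.0f%% precision, %.1f hypotheses per real bug\n",
		100*precision(27, 31), triageCost(27, 31))
	fmt.Printf("Mythos A: %.0f%% precision, %.1f hypotheses per real bug\n",
		100*precision(34, 284), triageCost(34, 284))
	fmt.Printf("Mythos B: %.0f%% precision, %.1f hypotheses per real bug\n",
		100*precision(29, 412), triageCost(29, 412))
}
```

Reading roughly one hypothesis per real bug versus eight to fourteen is where the order-of-magnitude triage gap comes from.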

That order-of-magnitude gap is the real story of hypothesis quality. It is not that pure LLMs cannot find bugs. It is that their hypotheses arrive unanchored to the program and therefore cannot be defended without a human doing the anchoring work by hand.

How Safeguard Helps

Safeguard exposes Griffin AI's hypothesis artefacts directly in the finding UI. Reviewers see the taint path, the constraints, and the exploit conditions as first-class objects, not buried in a prose blob. When a hypothesis is downgraded by the disproof pass, the reason is captured and surfaced. The effect, across dozens of customer deployments, is that the signal-to-noise ratio on the review queue approaches what a principal security engineer would tolerate from a human red team, instead of the 10-to-1 noise ratio that characterises Mythos-class output.
