The first time I sat down to triage a batch of "AI-discovered" zero-days, I stopped counting at 180. Eighteen of them were real. The rest were hallucinations dressed in convincing prose: invented function names, sinks that did not exist, exploit preconditions that would require the attacker to already be root. That evening convinced me that the interesting question in 2026 is not whether an LLM can "find a bug." Almost any frontier model can narrate a bug. The question is whether the system around the model can tell you, with evidence, which narrations correspond to reachable, weaponisable primitives in the target program.
That is the lens through which I want to compare two architectures I have spent the last year poking at: Griffin AI's zero-day discovery pipeline, and the broader family of pure-LLM bug hunters we have taken to calling Mythos-class. I will stick to methodology rather than inventing vendor internals I cannot verify.
The shape of a zero-day pipeline
A zero-day pipeline, at its most honest, does three things. It decides where to look. It decides what to claim. And it decides what it is willing to defend. Griffin AI splits these into three discrete stages, and the separation is load-bearing.
The first stage is a static analysis engine that surfaces reachable taint paths. This is boring, well-understood, and extremely useful. It is boring because it relies on constructions the program analysis community has been refining since the early 2000s, descendants of the work collected in the SOAP and ISSTA proceedings between 2012 and 2019. It is useful because it gives the downstream model a finite, grounded set of candidate flows rather than the entire program. Each candidate is concrete: a source, a sink, an inter-procedural path, and the set of constraints that must hold for data to traverse it. Nothing is hypothetical yet.
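To make that concrete, here is a minimal sketch of what a grounded candidate flow might look like as data. The field names and example values are mine, not Griffin's schema; the point is that every field is something the engine can prove from the code.

```python
from dataclasses import dataclass, field

@dataclass
class TaintPath:
    """One grounded candidate flow emitted by the static stage."""
    source: str                  # where attacker-controlled data enters
    sink: str                    # the dangerous operation it can reach
    call_chain: list[str]        # inter-procedural path from source to sink
    constraints: list[str] = field(default_factory=list)  # branch conditions that must hold en route

# The downstream model reasons over a finite worklist of records like this,
# not over the entire program.
candidate = TaintPath(
    source="request.args['q'] (search_view, views.py:41)",
    sink="cursor.execute(query) (run_query, db.py:88)",
    call_chain=["search_view", "build_query", "run_query"],
    constraints=["len(q) > 0", "feature flag 'raw_search' enabled"],
)
```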
The second stage is where Griffin does its hypothesising. Given a candidate taint path, it asks what an attacker would need to do to convert the flow into an exploit. Which sanitisers are actually sanitisers? Which branches are feasible? What CWE class does this pattern belong to (CWE-89 SQL injection, CWE-502 deserialisation of untrusted data, CWE-78 OS command injection, CWE-918 SSRF, CWE-1236 formula injection, and so on)? The model is not generating findings out of thin air. It is reasoning over a scaffold the engine already built and can already prove exists.
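A rough sketch of how that second stage can be framed, building on the TaintPath record above. The ExploitHypothesis fields and the `complete_structured` helper are assumptions of mine for illustration, not Griffin internals or a real client API; the essential property is that the prompt is anchored to a path the engine has already proved reachable.

```python
from dataclasses import dataclass

@dataclass
class ExploitHypothesis:
    """What the hypothesis stage claims about one TaintPath; nothing is verified yet."""
    path: TaintPath
    cwe: str                        # e.g. "CWE-89" for SQL injection
    bypassed_sanitisers: list[str]  # sanitisers the model argues do not cover this sink
    preconditions: list[str]        # what must hold in production for the exploit to fire
    rationale: str                  # the model's argument, kept for the disproof pass

def hypothesise(path: TaintPath, llm) -> ExploitHypothesis:
    # `llm` stands in for whatever model client is in use; `complete_structured`
    # is a hypothetical structured-output helper, not a real library call.
    prompt = (
        "This flow is provably reachable. State which CWE it maps to, which "
        f"sanitisers fail to cover the sink, and what an attacker needs:\n{path}"
    )
    return llm.complete_structured(prompt, schema=ExploitHypothesis)
```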
The third stage is disproof. This is the part most pipelines skip. Griffin deliberately runs a second pass whose job is to kill the hypothesis. It re-examines sanitiser coverage, type narrowing, framework-level escaping, and the preconditions that would have to hold for the exploit to fire in production. A finding only survives if the disproof attempt fails. Everything else becomes an internal downgrade or a discard.
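The disproof pass can be sketched the same way. The individual checks below mirror the ones named above (sanitiser coverage, branch feasibility, framework-level escaping), but the engine methods are placeholders I invented; the structural point is that a finding is emitted only when every attempt to kill it fails.

```python
def attempt_disproof(h: ExploitHypothesis, engine) -> list[str]:
    """Return the reasons the hypothesis dies; an empty list means it survived."""
    reasons = []
    for s in h.bypassed_sanitisers:
        if engine.sanitiser_covers(s, h.path):          # placeholder check
            reasons.append(f"sanitiser {s} provably covers the sink")
    for c in h.path.constraints:
        if not engine.branch_feasible(c):               # placeholder check
            reasons.append(f"constraint '{c}' is infeasible in production config")
    if engine.framework_escapes(h.path.sink):           # placeholder check
        reasons.append("framework-level escaping neutralises the sink")
    return reasons

def triage(h: ExploitHypothesis, engine):
    killed = attempt_disproof(h, engine)
    # Survive the disproof attempt, or become a downgrade/discard rather than a report.
    return ("FINDING", h) if not killed else ("DISCARDED", killed)
```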
Why Mythos-class tools struggle
Mythos-class tools, as a category, tend to collapse those three stages into a single LLM call or a loose chain of calls without a grounded engine underneath. The results, in aggregate, are familiar to anyone who has triaged them: between 60 and 95 percent false positives depending on the target language and the class of bug. Published evaluations of LLM-only vulnerability detection from late 2023 onward (for example the "Do Language Models Learn Semantics of Code?" line of work, and follow-up studies at ICSE 2024 and USENIX Security 2025) repeatedly land in that band. The bugs the model invents are not random, which is what makes them so seductive. They follow the grammar of real bugs. They just do not correspond to the bytes on disk.
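For contrast, the collapsed architecture looks roughly like this. It is a deliberate caricature, and `llm.complete` is a stand-in for any chat-style API rather than a specific vendor call, but it captures the structural problem: nothing before the call establishes reachability, and nothing after it attempts disproof.

```python
def mythos_style_scan(source_file: str, llm) -> str:
    # The entire pipeline is one ungrounded call. The model returns a narrative,
    # and a human inherits the job of falsifying it.
    with open(source_file) as f:
        code = f.read()
    return llm.complete(
        "List any vulnerabilities in this file, with exploit details:\n" + code
    )
```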
The failure mode is almost always the same. The model reads a function, notices a pattern that resembles a historical CVE, and narrates a vulnerability consistent with that pattern. It has no mechanism to verify reachability, because it never asked the engine whether the taint actually flows. It has no mechanism to verify exploitability, because it never attempted disproof. It produces a plausible story and hands it to a human to disprove instead.
The cost of a false positive
It is tempting to wave off false positives as a minor nuisance. In a product security org, they are the dominant cost. A realistic triage budget is 15 to 40 minutes per finding. At a 70 percent FP rate on a batch of 500 reports, that is 350 junk findings, and you are spending between 87 and 233 engineer-hours disproving nothing. That is most of a sprint. The secondary cost is worse: teams that spend a quarter drowning in hallucinations stop reading the reports at all, and the real findings die in the noise.
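The arithmetic behind those numbers is worth making explicit, since it is what a security lead actually budgets against:

```python
reports = 500
fp_rate = 0.70
triage_minutes = (15, 40)                 # realistic per-finding triage budget

junk = int(reports * fp_rate)             # 350 findings that lead nowhere
low, high = (junk * m / 60 for m in triage_minutes)
print(f"{junk} false positives cost {low:.1f} to {high:.1f} engineer-hours")
# -> 350 false positives cost 87.5 to 233.3 engineer-hours
```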
Griffin's published posture is to keep the FP rate under 10 percent on the classes it actually supports, and to refuse to emit a finding rather than emit a speculative one. I have seen enough of its output to believe the first half of that claim in practice, and I have come to trust the second half because the disproof step is observable in the explanations it returns.
Where the comparison gets interesting
Two architectures can both be described as "AI zero-day hunters" and still occupy completely different points in the precision-recall plane. Mythos-class tools optimise for recall and hope a human closes the gap. Griffin optimises for precision and accepts that it will miss bugs whose shape falls outside the engine's reachability model. In a mature product security programme, I would rather have the tool that is quiet and correct than the tool that is loud and wrong, because my team's attention is the scarce resource.
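To put numbers on that trade-off, here is a toy comparison. The counts are invented for illustration, not measured, but they show why a quiet, precise tool can cost less attention per real bug than a loud one, even when it misses more.

```python
def summarise(name: str, reported: int, true_positives: int, real_bugs: int = 40) -> None:
    precision = true_positives / reported
    recall = true_positives / real_bugs
    junk = reported - true_positives
    print(f"{name}: precision {precision:.0%}, recall {recall:.0%}, {junk} junk reports to read")

summarise("Mythos-class", reported=500, true_positives=30)  # loud: high recall, low precision
summarise("Griffin-style", reported=22, true_positives=20)  # quiet: high precision, lower recall
```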
The honest trade-off is that Griffin will not tell you about a bug it cannot ground. A Mythos-class tool might, and very occasionally that matters. But the evidence from two years of field deployments is that the volume of junk swamps the occasional lucky hit.
How Safeguard helps
Safeguard runs Griffin AI against your monorepo on every merge and surfaces only the findings that survived the disproof pass. Each finding ships with its taint path, the CWE class it belongs to, the hypothesised exploit conditions, and the disproof attempt that failed. Triagers see the reasoning, not just the verdict. Teams that move from Mythos-class scanners to Safeguard typically cut their weekly triage load by 60 to 80 percent within the first month, because the pipeline is doing the disproof work before a human ever reads the report.