If you have followed the bug hunter market over the last three years, you have probably noticed a divergence in how vendors describe what their tools do. One camp talks about "AI-powered vulnerability discovery" and demonstrates output that consists of an LLM reading a function and narrating what could be wrong with it. The other camp talks about "engine-plus-LLM pipelines" and shows output that includes a reachable taint path, a CWE-grounded hypothesis, and a disproof attempt. The two camps are sometimes described as if they were on a spectrum. They are not. They are different architectures with different failure modes, and the difference matters more than the marketing language usually conveys.
This piece is an attempt to write down the architectural difference, the empirical track record on each side, and why I no longer think of pure-LLM bug hunters as a serious option for production triage in 2026.
What a pure-LLM bug hunter is doing
A pure-LLM bug hunter, in its most common form, is a chain of model calls operating over source code. The chain might start with a model that reads a file and identifies suspicious functions, then moves to a model that reads the suspicious function in detail and writes up a vulnerability report. There are variations: some pipelines retrieve related code via embeddings, some use multi-agent setups where one agent proposes and another critiques, some pass the suggested vulnerability through a synthetic exploit generator. None of them have a grounded model of the program. The whole pipeline operates on the model's reading of the source.
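To make the shape concrete, here is a minimal sketch of such a chain in Python. The `llm` function is a hypothetical stand-in for any chat-completion call, not a real API; the point is structural: every stage consumes model text and produces model text, and nothing ever consults the program itself.

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call; not a real API."""
    raise NotImplementedError("wire up a model provider here")

def pure_llm_hunt(source_file: str) -> list[str]:
    with open(source_file) as f:
        code = f.read()

    # Stage 1: the model nominates suspicious functions from a plain read.
    suspects = llm(f"List the functions in this file that look vulnerable:\n{code}")

    reports = []
    for name in suspects.splitlines():
        # Stage 2: the model narrates a vulnerability for each suspect.
        report = llm(f"Write a vulnerability report for `{name}` in:\n{code}")

        # Stage 3, the 'critique' pass: another model call grades the first.
        # Nothing here consults the actual program; the critic can only check
        # the narrative against its own reading of the same text.
        verdict = llm(f"Is this report plausible? Answer YES or NO.\n{report}")
        if verdict.strip().upper().startswith("YES"):
            reports.append(report)
    return reports
```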
This works surprisingly well for narrating a vulnerability. It works poorly for verifying one. The model has no internal mechanism to distinguish a function it understood from a function it imagined, and no internal mechanism to check whether the data flow it described actually exists. Its training has exposed it to a great deal of vulnerable code and the vulnerability literature, which means it can generate plausible vulnerability narratives in the syntax of the target language. Plausible vulnerability narratives are not vulnerabilities. They look like vulnerabilities, which is the problem.
What an engine-plus-LLM bug hunter is doing
The engine-plus-LLM architecture inserts a static analysis layer below the model. The engine is responsible for the parts of the problem the model is bad at: parsing the program correctly, building the call graph, computing inter-procedural reachability, identifying source and sink pairs, and producing a finite set of candidate flows that are grounded in the actual code on disk. The model is responsible for the parts the engine is bad at: reasoning about whether a sanitiser actually sanitises, whether a branch is feasible under attacker-controllable inputs, what CWE class the flow corresponds to, and what exploit conditions would have to hold.
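A sketch of that division of labour, assuming a hypothetical `engine` object and `llm` client (neither is a real library; `taint_flows` and `judge` are illustrative names):

```python
from dataclasses import dataclass

@dataclass
class CandidateFlow:
    """One engine-proven source-to-sink path. Every element refers to
    code that actually exists on disk; the model never has to guess."""
    source: str            # e.g. "request.args['q']"
    sink: str              # e.g. "cursor.execute(sql)"
    path: list[str]        # inter-procedural call chain from the call graph
    sanitisers: list[str]  # sanitiser calls the engine observed on the path

def hunt(engine, llm, repo_path: str) -> list[dict]:
    # Engine work: parse, build the call graph, compute inter-procedural
    # reachability, pair sources with sinks. Mechanical, and grounded in
    # the code on disk.
    flows: list[CandidateFlow] = engine.taint_flows(repo_path)  # hypothetical

    findings = []
    for flow in flows:
        # Model work: the semantic judgments the engine cannot make.
        hypothesis = llm.judge(flow, questions=[  # hypothetical call
            "Does any sanitiser on this path neutralise this sink?",
            "Is the path feasible under attacker-controllable input?",
            "Which CWE class does this flow fall under?",
        ])
        if hypothesis is not None:
            findings.append({"flow": flow, "hypothesis": hypothesis})
    return findings
```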
The split is not arbitrary. It is the same split that has worked in formal verification for decades, where a solver does the mechanical work and a human (or now a model) does the high-level reasoning. The model never has to decide whether a function exists, because the engine has already proven it does. The engine never has to decide whether a sanitiser is adequate, because it cannot evaluate that question; it hands the question to the model.
The third stage in the architecture is disproof. After the model has hypothesised a vulnerability over a candidate flow, a second pass tries to falsify the hypothesis. It checks the sanitiser coverage one more time, examines framework-level escaping the engine might have missed, considers whether the path is feasible only under inputs that no realistic attacker controls, and so on. A finding that survives the disproof pass is reported. A finding that does not is silently dropped.
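Continuing the sketch, the disproof stage might look like the following; `llm.argue_against` is again a hypothetical call, and the checks mirror the list above:

```python
# Hypothetical disproof stage: each finding is handed to an adversarial
# pass whose only job is to falsify it. Survivors get reported; the rest
# are dropped without ever reaching a triager.

DISPROOF_CHECKS = [
    "Does a sanitiser on the path cover this sink's output context?",
    "Does framework-level escaping the engine missed protect the sink?",
    "Is the path feasible only under inputs no realistic attacker controls?",
]

def survives_disproof(finding, llm) -> bool:
    for check in DISPROOF_CHECKS:
        rebuttal = llm.argue_against(finding, check)  # hypothetical call
        if rebuttal.holds:
            return False  # falsified: drop the finding silently
    return True

def report(findings, llm):
    # Only findings that withstood every falsification attempt are reported.
    return [f for f in findings if survives_disproof(f, llm)]
```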
What this looks like in numbers
Published evaluations of pure-LLM bug hunters consistently show false positive rates in the 60 to 95 percent band, depending on the target language and the bug class. The variation is not random. Memory safety bugs in C are easier to verify mechanically, so the FP rate is at the lower end. Web framework bugs in dynamic languages are harder, and the FP rate is at the upper end. Across the spectrum, the modal pure-LLM tool produces a queue in which a human has to read most of the reports just to disprove them.
Engine-plus-LLM pipelines, evaluated honestly, land in a different regime. Single-digit to low-teens FP rates are typical for the bug classes the engine can ground. The classes the engine cannot ground (memory safety in unsafe Rust, certain C++ template-heavy patterns, dynamic dispatch through deep eval chains) are not reported at all, because the disproof pass has nothing to compare against and the pipeline refuses to speculate. The trade-off is that recall is bounded by what the engine can reach. You will miss bugs that fall outside that boundary. You will not, however, drown in noise.
The cost of the precision-recall trade
It is fashionable to argue that recall matters more than precision because "you can always filter the false positives later." This argument fails on contact with a real triage queue. Filtering false positives is the most expensive thing a security engineer does, and the time it consumes is unrecoverable. A queue of 500 reports at 70 percent FP, with 20 minutes per report, is 117 engineer-hours of disprove-the-hallucination work. That is most of a sprint, and the emotional cost is worse than the time cost. Engineers stop reading reports that come from sources they have learned not to trust.
A queue of 80 reports at 8 percent FP, with the same triage time, is about 27 engineer-hours, of which the bulk is real work on real findings. The engineers learn that the source is trustworthy, and they engage with the queue rather than avoiding it. The throughput of the security programme goes up, even though the raw report count went down.
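The arithmetic from the two scenarios, worked explicitly:

```python
# Triage cost for the two queues described above.

def triage_hours(reports: int, fp_rate: float, minutes_each: int = 20):
    total = reports * minutes_each / 60              # whole queue
    wasted = reports * fp_rate * minutes_each / 60   # disproving noise
    return total, wasted

total, wasted = triage_hours(500, 0.70)
print(f"Pure-LLM queue:   {total:.0f} h total, {wasted:.0f} h disproving noise")
# -> Pure-LLM queue:   167 h total, 117 h disproving noise

total, wasted = triage_hours(80, 0.08)
print(f"Engine+LLM queue: {total:.0f} h total, {wasted:.0f} h disproving noise")
# -> Engine+LLM queue: 27 h total, 2 h disproving noise
```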
Why this is not a tuning question
I am sometimes asked whether a pure-LLM tool could be tuned to match an engine-plus-LLM tool's precision by adding more critique steps, more retrieval, or a stronger model. The answer is no, for the same reason a model cannot be tuned to do floating-point arithmetic faster than a CPU. The work the engine does is qualitatively different from the work a language model does. Adding more model calls produces more elaborate narrations, not more grounded ones. The architecture has to include the engine for the precision to be reachable.
This is the part the marketing language tends to obscure. A vendor that ships an LLM with retrieval and a critique pass is not building a different category of tool from a vendor that ships an LLM with three retrieval passes and two critique steps. A vendor that ships a static engine, an LLM, and a disproof stage is building a different thing entirely.
How Safeguard helps
Safeguard runs the engine-plus-Griffin AI pipeline that this piece describes. The static engine surfaces reachable taint paths across your first-party code and transitive dependencies, the Griffin layer hypothesises CWE-grounded bug classes over those paths, and the disproof pass drops the hypotheses it cannot defend. Each finding ships with the taint path, the exploit conditions, and the failed disproof attempt, so triagers see the reasoning rather than just the verdict. Teams that move from pure-LLM bug hunters to Safeguard typically cut their weekly triage load by 60 to 80 percent within the first month, because the noise the pure-LLM tool produced was structural, not tunable.
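For illustration only, a finding that carries its own evidence might be shaped like this; the field names and values are hypothetical, not Safeguard's actual output schema:

```python
# Illustrative shape of an evidence-carrying finding. Every field name
# and value here is hypothetical, chosen to show the three artefacts a
# triager receives: the path, the conditions, and the disproof record.
finding = {
    "cwe": "CWE-89",                        # SQL injection
    "taint_path": [
        "routes/search.py:handle()",        # attacker-controlled source
        "services/query.py:build_sql()",    # string concatenation
        "db/session.py:execute()",          # sink
    ],
    "exploit_conditions": "query param reaches execute() unparameterised",
    "failed_disproof": "no sanitiser or ORM escaping found on the path",
}
```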