The 2023-2024 wave of pure-LLM vulnerability scanners promised something compelling: feed code to a model, get findings out. By mid-2026, the production track record exists, the failure modes are documented, and the architecture lessons have crystallised. Pure-LLM scanning works for narrow demonstrations and fails at production scale, because LLMs without grounding hallucinate findings, fail at multi-hop reasoning, and produce false positive rates that no triage queue can absorb. Reachability analysis remains the backbone of credible vulnerability discovery; LLMs add genuine value when they reason over reachability output rather than over raw code.
What pure-LLM scanners do well
Three workloads:
- Pattern recognition of common vulnerability shapes: SQL injection in its obvious forms (see the snippet after this list), hardcoded secrets, unsafe deserialisation.
- Code explanation when an engineer is investigating a finding.
- Quick prototype validation during early development.
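As an illustration of the first workload, the "obvious form" these models catch reliably looks something like this (a hypothetical snippet, not drawn from any real codebase):

```python
# String-built SQL with attacker-controlled input: the 'obvious form'
# of injection that pure-LLM scanners flag reliably.
def get_user(db, username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"  # injectable
    return db.execute(query)
    # The parameterised fix:
    # db.execute("SELECT * FROM users WHERE name = ?", (username,))
```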
For these workloads, the model's output is fast and accurate enough to be genuinely useful.
Where pure-LLM scanners fail at production scale
Four documented failure modes:
Hallucination at scale. Asked to find vulnerabilities across a whole codebase, models confidently report vulnerabilities that do not exist. Independent research reports false positive rates of 30-70%.
Multi-hop reasoning failure. A vulnerability whose exploit path crosses six or more function calls in different files is unreliable territory: the model either misses the cross-file flow or invents one.
Context window saturation. Real codebases don't fit: at a rough 10 tokens per line of code, a five-million-line enterprise codebase runs to roughly 50M tokens. Even a 1M-token context holds only a fraction of the relevant call graph.
Non-determinism. The same code produces different findings on different runs. Triage workflows assume deterministic findings; pure-LLM output offers no such guarantee.
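The determinism complaint is straightforward to quantify: rerun the scanner on the same commit and compare the finding sets. A minimal sketch, assuming a hypothetical finding format whose file, sink-line, and vulnerability-class fields identify a finding:

```python
from itertools import combinations

def fingerprint(finding: dict) -> tuple:
    # Stable identity for a finding; these field names are assumptions.
    return (finding["file"], finding["sink_line"], finding["vuln_class"])

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def determinism_score(runs: list[list[dict]]) -> float:
    """Mean pairwise Jaccard similarity of finding sets across two or more
    repeated runs on the same commit. 1.0 means every run agreed exactly."""
    sets = [{fingerprint(f) for f in run} for run in runs]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A deterministic engine scores 1.0 on this metric by construction; pure-LLM scanners do not.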
Each failure mode has been documented in production deployments through 2025-2026.
What the engine-plus-LLM architecture does differently
Three architectural choices:
The deterministic engine produces structured grounding. Call graph, taint paths, version-aware CVE mapping: all computed deterministically. The model is never handed raw code in unstructured form.
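Concretely, the grounding handed to the model might look like a handful of typed records rather than raw source. A sketch; the field names are illustrative assumptions, not Safeguard's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallSite:
    file: str
    function: str
    line: int

@dataclass(frozen=True)
class TaintPath:
    source: CallSite                   # where attacker-controlled data enters
    sink: CallSite                     # where it reaches a dangerous operation
    hops: tuple[CallSite, ...]         # intermediate call-graph steps
    sanitisers_seen: tuple[str, ...]   # sanitising calls observed on the path

@dataclass(frozen=True)
class Grounding:
    taint_path: TaintPath
    dependency: str                    # package@version resolved from the lockfile
    cve_ids: tuple[str, ...]           # CVEs mapped to that exact version
```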
The model reasons over the structured grounding. It is asked "given this taint path, what is the exploit hypothesis?", a much narrower question than "find vulnerabilities in this code", and it is far more reliable at the narrow question.
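Continuing the sketch above, the narrow question is mechanical to construct from the grounding (the prompt wording is illustrative, not Safeguard's actual prompt):

```python
def exploit_hypothesis_prompt(g: Grounding) -> str:
    """Frame one narrow question from deterministic engine output."""
    p = g.taint_path
    hops = " -> ".join(f"{c.file}:{c.function}:{c.line}"
                       for c in (p.source, *p.hops, p.sink))
    return (
        f"A taint path was found: {hops}.\n"
        f"Sanitisers observed on the path: {list(p.sanitisers_seen) or 'none'}.\n"
        f"Affected dependency: {g.dependency} (CVEs: {', '.join(g.cve_ids) or 'none'}).\n"
        "Given only this path, state the exploit hypothesis and the input an "
        "attacker would need to control. If no exploit is plausible, say why."
    )
```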
A second model pass tries to disprove each finding. Findings that survive the disproof attempt reach the queue; findings that don't are filtered out before they consume triage time.
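The disproof pass inverts the framing: the model argues against the finding rather than for it. A sketch, assuming a hypothetical llm(prompt) callable and the types from the earlier sketch:

```python
def survives_disproof(g: Grounding, hypothesis: str, llm) -> bool:
    """Second pass: ask the model to refute the finding. Only findings the
    adversarial pass fails to refute reach the triage queue."""
    verdict = llm(
        "You are trying to disprove a vulnerability finding.\n"
        f"Hypothesis: {hypothesis}\n"
        f"Evidence:\n{exploit_hypothesis_prompt(g)}\n"
        "If a sanitiser, type constraint, or unreachable path refutes the "
        "hypothesis, answer REFUTED with the reason; otherwise answer STANDS."
    )
    return verdict.strip().upper().startswith("STANDS")
```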
The combined system has measurably lower false positive rates and higher reasoning accuracy than either component alone.
What customer comparisons show
Customer reports comparing pure-LLM tools with the engine-plus-LLM platform (Safeguard) on the same codebases show a consistent pattern:
- False positive rate: pure-LLM ~50-70%; engine-plus-LLM ~5-15%.
- Multi-hop accuracy: pure-LLM ~30-50%; engine-plus-LLM ~75-85%.
- Run-to-run determinism: pure-LLM low (output varies); engine-plus-LLM high (engine deterministic, LLM gated by eval harness).
The numbers are why mature deployments converged on the architecture.
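A note on the "gated by eval harness" entry in the determinism row: the idea is that the model component is regression-tested against a labelled benchmark before any model or prompt change ships. A minimal sketch of such a gate; the threshold and interface are assumptions:

```python
def eval_gate(model_verdicts: dict[str, bool], labels: dict[str, bool],
              min_precision: float = 0.85) -> bool:
    """Block a model or prompt change if precision on a labelled benchmark
    of past findings regresses below the threshold."""
    tp = sum(1 for k, kept in model_verdicts.items() if kept and labels[k])
    fp = sum(1 for k, kept in model_verdicts.items() if kept and not labels[k])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision >= min_precision
```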
What the future probably looks like
Three predictions:
- Pure-LLM scanners will continue to ship for specific niches but will not displace structured analysis.
- Frontier models will get better at narrow reasoning tasks; engine-plus-LLM architectures will benefit by giving the model better reasoning targets.
- The benchmark industry will mature; vendors who publish comparable numbers will gain procurement advantage.
How Safeguard Helps
Safeguard's engine-plus-LLM architecture is built around exactly this lesson: reachability and call-graph grounding produce the structured context that makes LLM reasoning reliable. Griffin AI runs at high-leverage decision points, with the deterministic engine's output as evidence. For organisations whose vulnerability scanning programme has been disrupted by pure-LLM tools that didn't survive contact with production, the engine-grounded architecture is the path back to operational sanity.