Security tools die not from missing vulnerabilities but from crying wolf. Every false positive costs a developer fifteen minutes and a unit of trust, and the trust budget runs out faster than people expect. I spent two weeks at a healthtech customer watching their engineers triage the backlog from their incumbent AI security product — a pure-LLM tool in the Mythos class. Of the 317 findings reviewed that sprint, 241 were false positives. The team had built a Slack channel whose only purpose was to mark things "not real." When we ran Safeguard's Griffin AI against the same commits, the finding count dropped by 70% and the false-positive share dropped by roughly 85%. The architecture explains both numbers. Griffin's engine eliminates a whole class of false positives before the LLM ever sees them; the model then reasons only over findings the engine has already confirmed. Pure-LLM systems have to manufacture the confirmation step, and they manufacture it imperfectly.
What causes false positives in pure-LLM security tools?
Three categories, in order of frequency. Pattern false positives: the model sees a call that looks dangerous (eval, exec, system, os.popen) and flags it regardless of whether the argument is user-controlled. Version false positives: the model matches a CVE to a package name without checking whether the installed version falls in the vulnerable range. Reachability false positives: the CVE applies to a function in a library that your code imports but never calls. Griffin AI eliminates all three structurally. Pattern flags pass through a taint check that requires a source-to-sink path. Version matches use the actual resolved purl, not the import string. Reachability is answered by the call graph, not by vibes. Pure-LLM systems can approximate each of these checks, but approximation is not verification, and it shows up in the noise floor.
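To make the version case concrete, here is a minimal sketch of a version-aware match using Python's packaging library; the package name, purl, and vulnerable range are hypothetical, and this illustrates the idea rather than Safeguard's implementation.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Hypothetical advisory: "example-lib" is vulnerable below 1.5.0.
VULNERABLE_RANGE = SpecifierSet("<1.5.0")

def is_affected(resolved_purl: str) -> bool:
    # A resolved purl pins the exact installed version,
    # e.g. "pkg:pypi/example-lib@1.4.2" — not just the import name.
    _, _, version = resolved_purl.rpartition("@")
    return Version(version) in VULNERABLE_RANGE

print(is_affected("pkg:pypi/example-lib@1.4.2"))  # True: version inside the range
print(is_affected("pkg:pypi/example-lib@1.6.0"))  # False: name matches, version does not
```

A name-only match would flag both projects; checking the resolved version removes the second finding before it exists.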
How big is the delta in practice?
Across nine benchmarked tenants in our internal evaluation — ranging from a 40-service Node.js monorepo to a 2,800-repo Java enterprise — Griffin AI's precision on high/critical findings averaged 0.82. On the same codebases, pure-LLM products averaged 0.34 precision, with a long tail of fabricated CVEs. In false-positive terms that is roughly 18% noise versus 66%, close to a four-to-one delta, and the gap is consistent across ecosystems. It is most dramatic on Java (deep reflection and dependency-injection patterns confuse similarity-based retrieval) and least dramatic on Go (stricter typing and simpler build graphs narrow the gap), but the direction is invariant. The published 81% hypothesis accuracy number is the engine-constrained cousin of that precision metric; because the engine pre-filters hypotheses, the LLM's false-positive contribution is bounded.
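As a back-of-envelope check on those averages (illustrative arithmetic only, using the numbers quoted above):

```python
engine_precision = 0.82    # Griffin AI, high/critical findings
pure_llm_precision = 0.34  # pure-LLM average on the same codebases

# The share of reported findings that are false positives is 1 - precision.
engine_fp_share = 1 - engine_precision      # 0.18
pure_llm_fp_share = 1 - pure_llm_precision  # 0.66

print(round(pure_llm_fp_share / engine_fp_share, 1))  # ~3.7x the noise per reported finding
```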
Which specific CWEs produce the most noise?
CWE-78 (OS command injection) and CWE-89 (SQL injection) are the top two noise generators in pure-LLM tooling. Both show up on every grep for exec or query, and pure-LLM systems struggle to tell which calls have a real untrusted source. CWE-502 (insecure deserialization) is a close third; the word "deserialize" appears across safe and unsafe contexts, and a language model cannot reliably distinguish Jackson's default typing behavior from a locked-down configuration without running the actual object-mapper config through analysis. CWE-918 (SSRF) lights up on any HTTP client construction. Griffin AI handles each of these by requiring the taint analyzer to confirm the source-to-sink path; findings that pass the check are high-signal, and findings that fail never reach the developer.
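To illustrate the CWE-78 case, here is a hypothetical pair of Python functions. A pattern match flags both because each calls the same sink; a source-to-sink taint check confirms only the first.

```python
import subprocess
import sys

def ping_user_host() -> None:
    host = sys.argv[1]  # source: untrusted command-line input
    # sink: the tainted value is interpolated into a shell command (CWE-78)
    subprocess.run(f"ping -c 1 {host}", shell=True)

def ping_localhost() -> None:
    # same sink API, but the argument is a constant; no source-to-sink path exists
    subprocess.run("ping -c 1 localhost", shell=True)
```

The second function is the shape of finding that fills a triage backlog when the tool reasons from the sink alone.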
Do false positives scale with repo size?
They scale superlinearly in pure-LLM systems. Larger repos mean more lookalike code, more imports, more retrieval candidates — and the retrieval layer doesn't know which candidates are reachable. One customer migrating from a pure-LLM tool reported 1,140 findings on a 2.1M-line Java monorepo in a single scan, of which roughly 85% were not real. After migration to Safeguard, Griffin AI reported 143 findings on the same codebase, with an independently verified precision above 0.85. That's not a scaling problem; that's an architecture problem presenting as a scaling problem. Engine-grounded reasoning scales linearly with the reachable set; retrieval-based reasoning scales with the superficially similar set, which is much larger.
How does false-positive density affect remediation velocity?
The 73% auto-PR compile rate Griffin publishes is downstream of the false-positive story. If the system is hallucinating findings, it is also hallucinating patches — patches that fix problems that don't exist, or patches that modify code the team never intended to change. A pure-LLM auto-PR that is grounded in a false positive is worse than no PR at all; it wastes review bandwidth and teaches developers to close every AI-generated PR on sight. We measure compile rate specifically to prevent that failure mode. A patch that doesn't compile in our build sandbox never reaches a human reviewer. Combined with the false-positive suppression upstream, the net result is that Griffin AI's PRs are reviewable artifacts rather than noise.
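A minimal sketch of that gate, assuming a hypothetical checkout directory and build command; the point is only that a candidate patch must build cleanly before anyone is asked to review it.

```python
import subprocess

def patch_passes_build_gate(repo_dir: str, build_cmd: list[str]) -> bool:
    """Return True only if the project still builds with the candidate patch applied.
    build_cmd is whatever compiles the project, e.g. ["./gradlew", "compileJava"]
    or ["npm", "run", "build"]; it varies per repository."""
    result = subprocess.run(build_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

# A patch that fails the gate is discarded before a PR is ever opened.
# if patch_passes_build_gate("/tmp/patched-checkout", ["./gradlew", "compileJava"]):
#     open_pull_request(...)  # hypothetical downstream step
```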
What about false negatives — does the engine hide real vulnerabilities?
Legitimate question, and one we watch carefully. False negatives would appear if the engine's reachability analysis pruned too aggressively, marking a reachable path as unreachable. Griffin AI addresses this by using an over-approximating call graph: it errs on the side of including a path when in doubt. The price is a slightly larger candidate set for the LLM to reason over. The benefit is that the engine rarely hides a true vulnerability from the reasoning layer. In red-team evaluations, Griffin detected 94% of injected vulnerabilities in a controlled test corpus; the 6% miss rate was dominated by vulnerabilities introduced through runtime configuration we hadn't indexed, which is a fixable gap. Pure-LLM systems in the same test scored between 71% and 83%, with misses concentrated in deeply transitive dependencies where retrieval failed to surface the relevant chunks.
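A toy sketch of what "err on the side of including a path" means for dynamic dispatch; the structure and names are illustrative, not Griffin's actual analysis.

```python
from collections import defaultdict

def build_call_graph(call_sites, implementations):
    """Over-approximating call graph: when a call target cannot be resolved
    statically (e.g. an interface method), add an edge to every known
    implementation instead of dropping the edge. Reachable paths are never
    hidden; the cost is a larger candidate set downstream."""
    graph = defaultdict(set)
    for caller, callee in call_sites:
        if callee in implementations:  # unresolved dynamic call
            graph[caller].update(implementations[callee])
        else:                          # statically resolved call
            graph[caller].add(callee)
    return graph

graph = build_call_graph(
    call_sites=[("handler", "Repository.save")],
    implementations={"Repository.save": {"SqlRepository.save", "InMemoryRepository.save"}},
)
print(graph["handler"])  # both implementations are kept as reachable
```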
How does this affect audit readiness?
Auditors do not love long finding lists. They love traceability. A Griffin finding comes with a call-graph citation, a taint trace, a CVE reference, and a patch diff with build status — all artifacts the engine maintains. A pure-LLM finding comes with a paragraph of explanation. Both might describe the same vulnerability. Only one survives the "prove it" phase of a SOC 2 Type II or FedRAMP Moderate review. We have watched customers migrate off pure-LLM tooling specifically because the audit narrative was untenable: "the model said so" is not a control.
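The difference is easiest to see in the shape of the artifact itself. Here is a hypothetical, schematic finding record; the field names and values are illustrative, not Safeguard's actual schema.

```python
finding = {
    "rule": "CWE-78",
    "cve": "CVE-0000-00000",  # placeholder identifier
    "taint_trace": [
        "api/handlers.py:42  request.args['host']      (source)",
        "api/handlers.py:57  subprocess.run(cmd, ...)  (sink)",
    ],
    "call_graph_path": ["main", "route_ping", "run_ping"],
    "dependency": "pkg:pypi/example-lib@1.4.2",
    "patch": {"diff": "0001-sanitize-host.patch", "build_status": "passed"},
}
```

Every field is something an auditor can independently check; a paragraph of model-generated explanation offers nothing to check against.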
How Safeguard Helps
Safeguard's engine is the reason Griffin AI can publish a credible 81% hypothesis accuracy number; the hypotheses are filtered against reachability, dataflow, and version resolution before the LLM opens its mouth. The false-positive rate you experience from a security platform is a function of how much of the verification happens deterministically versus probabilistically. If your current tool is drowning you in noise, measure precision per developer-hour — that's the number that tells you whether an engine is underneath the model or not.
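One rough way to operationalize that metric, using the fifteen-minutes-per-finding figure from the opening paragraph (a back-of-envelope sketch, not a formal benchmark):

```python
def real_findings_per_triage_hour(precision: float, minutes_per_finding: float = 15) -> float:
    # At ~15 minutes of triage per finding, an hour covers four findings;
    # precision determines how many of those were worth the developer's time.
    return (60 / minutes_per_finding) * precision

print(real_findings_per_triage_hour(0.82))  # ~3.3 real issues confirmed per triage hour
print(real_findings_per_triage_hour(0.34))  # ~1.4
```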