AI Security

False Positive Cost: Griffin AI vs Mythos

A false positive is not free. It costs engineer attention, trust in the tool, and eventually the security programme's credibility. We price the difference.

Shadab Khan
Security Engineer
7 min read

A false positive is a finding the scanner reports as exploitable when it is not. Vendors treat false positives as a minor line in the data sheet. Customers pay for them in engineer attention, eroding tool trust, and eventually the credibility of the security programme. An engine-plus-LLM architecture like Griffin AI produces a materially lower false positive rate than Mythos-class pure-LLM tools, and this post explains both why and what the cost difference looks like once it is priced correctly.

The three costs of a false positive

A false positive costs an organisation three distinct things, and all three have to be counted to get the full picture.

The first cost is engineer time. Every false positive has to be triaged, and a triage that concludes the finding is not real still consumes the same ten or twenty minutes a true positive would have consumed. At a few hundred findings a week, a false positive rate of twenty percent costs a four-engineer AppSec team roughly a full day of triage work per week on findings with no real content. Over a quarter, that is more than two engineer-weeks spent verifying that the tool was wrong.
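
To make the arithmetic concrete, here is a back-of-envelope version of that calculation in Python. The specific numbers are assumptions chosen to sit at the conservative end of the ranges above, not measurements; substitute your own queue volume and triage times.

```python
# Back-of-envelope triage cost. All inputs are illustrative assumptions,
# not measured values.
findings_per_week = 250        # "a few hundred findings a week"
false_positive_rate = 0.20     # one in five findings is not real
minutes_per_triage = 10        # low end of the ten-to-twenty-minute range

fp_per_week = findings_per_week * false_positive_rate
hours_per_week = fp_per_week * minutes_per_triage / 60
weeks_per_quarter = hours_per_week * 13 / 40   # 13-week quarter, 40-hour week

print(f"{fp_per_week:.0f} wasted triages per week")
print(f"{hours_per_week:.1f} engineer-hours per week")       # ~8.3 h, about one working day
print(f"{weeks_per_quarter:.1f} engineer-weeks per quarter")  # ~2.7 engineer-weeks
```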

The second cost is trust. Engineers who work through a queue where one in five findings is a false positive eventually stop treating each finding as a priority. The queue becomes a grind rather than a signal, and attention to the real findings drops in proportion. The unsurprising consequence is that true positives start getting triaged slowly too, because the engineer has learned to distrust the first read. The false positive rate has undermined the true positive response.

The third cost is credibility. When the AppSec team escalates a finding to a development lead and the lead discovers it is a false positive, the next escalation carries less weight. After enough of these cycles, development teams start pushing back on security findings by default, because experience has taught them that the default assumption should be scepticism. The security programme now has to spend political capital to get attention on findings that should have had attention immediately.

Why Griffin AI has a lower false positive rate

The Griffin AI engine applies several filters before a finding enters the queue. Reachability analysis is the first: if the vulnerable function is not reachable from any entry point in the actual call graph, the finding is suppressed with a VEX not_affected statement. This alone removes a large block of findings that pure-LLM tools would flag, because list-based scanning against a lockfile has no concept of what the application actually calls.
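
As an illustration of the shape of that filter, the sketch below assumes a set of symbols already computed as reachable from the application's entry points. The function and field names are hypothetical, not Griffin AI's API; the status and justification strings follow the standard VEX vocabulary.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve_id: str
    vulnerable_symbol: str   # e.g. the specific function the advisory points at

def reachability_filter(finding: Finding, reachable_symbols: set[str]) -> dict:
    """Suppress findings whose vulnerable code is not on any execution path."""
    if finding.vulnerable_symbol in reachable_symbols:
        # Structurally reachable: the finding goes to the queue for triage.
        return {"vulnerability": finding.cve_id, "status": "under_investigation"}
    # Not reachable from any entry point: suppress before it reaches the queue.
    return {
        "vulnerability": finding.cve_id,
        "status": "not_affected",
        "justification": "vulnerable_code_not_in_execute_path",
    }
```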

The second filter is VEX history. If a finding has been triaged and rejected in the same project or in a peer project within the tenant, the engine propagates the prior verdict forward. The assumption is that a finding rejected once for a specific, recorded reason should not come back to the queue unless the evidence changes. Griffin AI records the reason explicitly, so if the evidence does change the engine can open the finding again with a specific pointer to what shifted.
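
A minimal sketch of that propagation logic, assuming prior verdicts are keyed by CVE and component and carry an evidence fingerprint; the record shapes and field names are hypothetical, not Griffin AI's data model.

```python
# Hypothetical verdict-propagation check.
def propagate_prior_verdict(finding: dict, vex_history: dict, evidence_hash: str):
    """Reuse a prior not_affected verdict unless the recorded evidence changed."""
    key = (finding["cve_id"], finding["component"])
    prior = vex_history.get(key)
    if prior is None or prior["status"] != "not_affected":
        return None                                   # no reusable verdict: triage normally
    if prior["evidence_hash"] == evidence_hash:
        return {**prior, "note": f"propagated from {prior['project']}"}
    # The evidence shifted (new call path, bumped dependency): reopen the finding
    # with a pointer to the previously recorded reason so the triager sees what changed.
    return {"status": "reopened", "changed_from": prior["justification"]}

verdict = propagate_prior_verdict(
    {"cve_id": "CVE-2024-0001", "component": "pkg:pypi/pyyaml@5.3"},
    {("CVE-2024-0001", "pkg:pypi/pyyaml@5.3"): {
        "status": "not_affected", "project": "billing-api",
        "justification": "vulnerable_code_not_in_execute_path",
        "evidence_hash": "sha256:abc"}},
    evidence_hash="sha256:abc",
)
```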

The third filter is policy evaluation. Organisations have policies that describe what they care about, and most of those policies have explicit scope: dev dependencies below severity medium do not need triage, for example, or vulnerabilities in vendored third-party documentation directories are out of scope. The engine evaluates these policies before the finding reaches the queue, and findings that fall outside the policy envelope are closed automatically with a policy reference.
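
In practice that evaluation reduces to a scope-and-severity match applied before queue insertion. The policy format in the sketch below is an assumption made for illustration, not Griffin AI's policy syntax; it encodes the two example policies from the paragraph above.

```python
# Illustrative policy evaluation; scope names and fields are assumptions.
SEVERITIES = ["low", "medium", "high", "critical"]

policies = [
    # Dev dependencies below severity medium do not need triage.
    {"scope": "dev_dependency", "max_severity": "low", "ref": "POL-12"},
    # Vendored third-party documentation directories are out of scope entirely.
    {"scope": "vendored_docs", "max_severity": "critical", "ref": "POL-31"},
]

def evaluate_policies(finding: dict) -> dict | None:
    """Return a closure record if the finding falls outside the policy envelope."""
    for policy in policies:
        if finding["scope"] != policy["scope"]:
            continue
        if SEVERITIES.index(finding["severity"]) <= SEVERITIES.index(policy["max_severity"]):
            return {"status": "closed", "policy_ref": policy["ref"]}
    return None   # inside the envelope: the finding proceeds to the queue
```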

Pure-LLM tools implement versions of these filters, but the implementations run through the model rather than through verifiable structure. The model sometimes flags a finding as reachable when the actual call graph disagrees, because it is reasoning from textual evidence rather than structural evidence. It sometimes ignores the VEX history because the prior context is not consistently fed in. It sometimes misinterprets a policy because the policy is expressed in natural language and the model reads it slightly differently from the engineer who wrote it. Each of those failures is a false positive that the model-based filter was supposed to catch.

What the false positive rate looks like on real workloads

We have benchmarked Griffin AI and two Mythos-class pure-LLM tools on the same set of repositories with the same CVE set. Griffin AI's false positive rate on the benchmark, defined as findings that a human triager ultimately marked not-affected, was around five percent. The pure-LLM tools sat between twenty and thirty percent, with the worst cases concentrated in projects that use heavy reflection, frameworks with dynamic routing, or vendored code that the model struggled to differentiate from first-party code.

Five percent versus twenty-five percent is a five-fold difference in the false positive tax. Over a year, on the hundred-developer organisation scenario that other posts in this series have used, that difference is the equivalent of an additional full-time triage engineer, absorbed entirely into verifying findings the tool should have filtered out in the first place.

The hidden second-order effect

The more insidious cost of a high false positive rate is that it pushes organisations into suppressing entire classes of findings to make the queue manageable. A team that cannot triage twenty findings a day might simply suppress all medium-severity findings on non-production services, or silence the scanner on vendored directories entirely. The suppressions start as practical workarounds and gradually become permanent holes in the coverage.

Griffin AI's lower false positive rate means the team can keep the coverage envelope wide. Medium-severity findings on non-production services stay in scope because they are rare enough to handle, and vendored directories can be scanned normally because the engine correctly distinguishes them from first-party code. The coverage is wider, the signal is higher, and the programme ends up with better insight than it would on a tool whose false positive rate forced it to narrow the scope defensively.

Pricing the difference

If you assign a conservative thirty minutes of engineer time to each false positive triage, and the organisation produces three hundred findings per week, a five percent false positive rate is roughly seven and a half hours of engineering per week. A twenty-five percent rate is nearly forty hours, which is a full-time engineer. Multiply by a loaded engineering cost, and the gap between Griffin AI and a Mythos-class tool is a six-figure annual delta in false positive triage alone.
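
The same arithmetic in a few lines, with an assumed loaded engineering cost (the hourly figure is illustrative, not a quoted rate):

```python
# Annual false positive triage cost; rate and volume from the scenario above,
# the loaded hourly cost is an assumption for illustration.
findings_per_week = 300
hours_per_false_positive = 0.5      # "a conservative thirty minutes"
loaded_cost_per_hour = 120.0        # assumed fully loaded engineering cost

def annual_fp_cost(false_positive_rate: float) -> float:
    weekly_hours = findings_per_week * false_positive_rate * hours_per_false_positive
    return weekly_hours * 52 * loaded_cost_per_hour

engine_plus_llm = annual_fp_cost(0.05)   # ~7.5 h/week  -> about $47k/year
pure_llm = annual_fp_cost(0.25)          # ~37.5 h/week -> about $234k/year
print(f"annual delta: ${pure_llm - engine_plus_llm:,.0f}")   # ~$187k, a six-figure gap
```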

Add the trust erosion and credibility cost, and the number grows. Both are harder to price directly, but they show up as slower response times on true positives and as friction in every conversation between security and engineering. The accumulated friction over a year is usually worth more than the subscription cost of either tool.

The architectural point is simple. Structural evidence filters false positives better than model reasoning does, because structure is verifiable and reasoning is probabilistic. An engine that produces a verified call graph is not a nice-to-have. It is the mechanism that keeps the false positive rate low enough that the tool remains trusted. Pure-LLM tools are stuck at a higher false positive floor because they have no way to verify their own reasoning, and the cost of that floor is paid every week by the engineers working the queue.
