Cost-Per-Verified-Finding: How Agentic AI Breaks Vulnerability Triage

Agentic AI can generate findings faster than any team can read them. The metric that survives that flood is not cost-per-finding but cost-per-verified-finding. Here's why verification is now the bottleneck.

For two decades, vulnerability programs optimized for the wrong thing: coverage. Find more. Scan deeper. Add another tool. The implicit metric was cost-per-finding, and on that axis we have won decisively. In 2025, 48,185 CVEs were published, a 20.6 percent jump over 2024's 39,962, according to NVD data summarized across multiple year-end reviews. That is roughly 132 new CVEs every single day, and it does not count the per-organization findings your scanners generate on top of the public catalog.

Now agentic AI has arrived, and it changes the supply side completely. A general-purpose model with a security harness, or a purpose-built one like Anthropic's Mythos or OpenAI's Daybreak, can read code and emit plausible findings at a rate no human team can keep pace with. The marginal cost of generating a finding is collapsing toward zero. When supply is effectively infinite, the price of the thing you are producing is no longer the interesting number. The interesting number is what it costs to know which findings are real.

That is the metric I want to argue for here: cost-per-verified-finding. Not how cheaply you can produce a finding, but how cheaply you can produce a finding that a human can act on with confidence. On that axis, most AI-assisted programs are getting worse, not better, even as their dashboards get more impressive.

The Economics Just Inverted

Here is the uncomfortable arithmetic. Only a small minority of vulnerabilities are ever exploited in the wild. Research summarized by EPSS practitioners puts the share of CVEs that see real exploitation at roughly 2 to 7 percent, and FIRST's own analysis found that only about 2.3 percent of CVEs scored CVSS 7 or higher were ever observed in an exploitation attempt. The implication is brutal: if you remediate everything rated CVSS 7 and above, you do catch the large majority of what actually gets exploited, but FIRST estimates that something like 96 percent of that effort is spent on vulnerabilities that would never have been touched. Prioritizing by EPSS probability instead catches a smaller but still substantial share of real exploitation for a fraction of the work.
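To make the two quantities in that comparison concrete, here is a minimal Python sketch. As I use the terms here, coverage is the share of exploited vulnerabilities a strategy catches, and efficiency is the share of remediated vulnerabilities that were ever exploited. The records, the TOY identifiers, and both thresholds are made up for illustration; they do not reproduce FIRST's figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vuln:
    vuln_id: str     # placeholder identifier, not a real CVE
    cvss: float      # base severity score, 0 to 10
    epss: float      # estimated exploitation probability, 0 to 1
    exploited: bool  # whether exploitation was later observed (toy ground truth)


def coverage_and_efficiency(vulns, selected):
    """Return (coverage, efficiency) for a remediation strategy.

    coverage   = exploited vulns that were selected / all exploited vulns
    efficiency = exploited vulns that were selected / all selected vulns
    """
    chosen = [v for v in vulns if selected(v)]
    hits = sum(v.exploited for v in chosen)
    total_exploited = sum(v.exploited for v in vulns)
    coverage = hits / total_exploited if total_exploited else 0.0
    efficiency = hits / len(chosen) if chosen else 0.0
    return coverage, efficiency


# Hypothetical data, invented purely to exercise the two formulas.
SAMPLE = [
    Vuln("TOY-1", 9.8, 0.92, True),
    Vuln("TOY-2", 9.1, 0.02, False),
    Vuln("TOY-3", 8.8, 0.01, False),
    Vuln("TOY-4", 8.1, 0.04, True),
    Vuln("TOY-5", 7.5, 0.40, True),
    Vuln("TOY-6", 7.2, 0.03, False),
    Vuln("TOY-7", 7.0, 0.15, False),
    Vuln("TOY-8", 6.5, 0.05, True),
    Vuln("TOY-9", 5.3, 0.01, False),
    Vuln("TOY-10", 4.0, 0.01, False),
]

by_cvss = coverage_and_efficiency(SAMPLE, lambda v: v.cvss >= 7.0)
by_epss = coverage_and_efficiency(SAMPLE, lambda v: v.epss >= 0.10)

print(f"CVSS >= 7:    coverage={by_cvss[0]:.2f} efficiency={by_cvss[1]:.2f}")
print(f"EPSS >= 0.10: coverage={by_epss[0]:.2f} efficiency={by_epss[1]:.2f}")
```

On this toy set the CVSS rule catches more of what gets exploited but touches seven items to do it, while the EPSS rule catches less and touches three. The shape of the tradeoff is the point, not the specific values.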

Read that again, because it is the whole argument. Under the old cost-per-finding regime, the waste was tolerable because finding things was expensive, so you naturally produced fewer of them. Agentic AI removes that natural governor. It produces findings at machine speed, and the verification step, the part that decides whether a finding is real, reachable, and worth a human's attention, stays stubbornly human-speed. So the ratio of noise to signal does not improve. It explodes.

Verification Is The Bottleneck, And It Always Was

When a competent security engineer finds a suspicious pattern, the report is not the work. The work is everything after: can tainted input actually reach the sink, is there sanitization upstream, does the framework neutralize this at a higher layer, is the code path even reachable in this deployment, is there already a CVE for it. Producing the suspicion is cheap. Discharging it is expensive.

Agentic systems are extraordinarily good at the cheap part and structurally bad at the expensive part, because verification requires things a single inference pass does not have: cross-file data-flow analysis, knowledge of your runtime topology, and the discipline to say "I cannot confirm this." Models, even fine-tuned ones, still hallucinate. In vulnerability work the hallucination is especially dangerous because it arrives well-dressed: a real line number, a plausible CWE, a confident exploitation narrative for code that happens to be safe. These pass a five-second review, which is exactly how they waste an hour of investigation each.

This is why I keep returning to a single number. If your AI tool doubles the findings it produces, the number that turn out to be real barely moves, and you still verify them one engineer at a time, your cost-per-verified-finding went up, not down: the triage bill doubled while the verified count did not. You bought more haystack.
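The arithmetic behind that claim fits in a few lines. This is a sketch under stated assumptions: the function name, the flat per-finding triage time, the hourly rate, and every input value are hypothetical, chosen only to show how the ratio moves.

```python
def cost_per_verified_finding(raw_findings, verified_findings,
                              triage_hours_per_finding, hourly_rate,
                              tool_cost=0.0):
    """Total spend (tooling plus human triage) divided by verified findings.

    Every raw finding costs triage time, whether it is confirmed or dismissed.
    """
    if verified_findings <= 0:
        return float("inf")  # money spent, nothing actionable produced
    triage_cost = raw_findings * triage_hours_per_finding * hourly_rate
    return (tool_cost + triage_cost) / verified_findings


# Hypothetical scenario: the tool doubles raw output, but the number of
# findings that survive verification rises only from 10 to 12.
before = cost_per_verified_finding(100, 10, 0.5, 100.0)
after = cost_per_verified_finding(200, 12, 0.5, 100.0)

print(f"before: {before:.2f} per verified finding")
print(f"after:  {after:.2f} per verified finding")
```

The metric only improves when verified findings grow faster than total triage spend, which is why generating more raw findings cannot fix it on its own.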

Severity Without Deployment Context Is A Guess

The second failure mode is severity. A model assigns "Critical" based on the vulnerability class, not on whether anyone can reach the code. But severity is a property of deployment, not of source. A SQL injection in an internal CLI that requires local access is not the same risk as the identical pattern on an unauthenticated public endpoint, even though both will render as red on a dashboard.
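One way to picture the difference is a function that takes the class-based label and adjusts it for where the code actually runs. The sketch below is illustrative only: the four-level scale, the DeploymentContext fields, and the one-step-per-factor rule are assumptions of mine, not a standard scoring method.

```python
from dataclasses import dataclass

LEVELS = ["Low", "Medium", "High", "Critical"]


@dataclass(frozen=True)
class DeploymentContext:
    reachable: bool        # is the vulnerable code path deployed and callable?
    network_exposed: bool  # reachable over the network, or local access only?
    requires_auth: bool    # must an attacker authenticate first?


def contextual_severity(class_severity: str, ctx: DeploymentContext) -> str:
    """Adjust a class-based severity label using deployment context.

    Hypothetical rule: unreachable code drops to the lowest level; otherwise
    each mitigating factor lowers the label by one step.
    """
    if not ctx.reachable:
        return LEVELS[0]
    steps_down = 0
    if not ctx.network_exposed:
        steps_down += 1
    if ctx.requires_auth:
        steps_down += 1
    index = max(0, LEVELS.index(class_severity) - steps_down)
    return LEVELS[index]


# The same SQL injection pattern in two different deployments.
internal_cli = DeploymentContext(reachable=True, network_exposed=False, requires_auth=False)
public_endpoint = DeploymentContext(reachable=True, network_exposed=True, requires_auth=False)

print(contextual_severity("Critical", internal_cli))
print(contextual_severity("Critical", public_endpoint))
```

The exact downgrade rule matters less than the input: the function cannot run without facts about the deployment, and those are facts the source code alone does not contain.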

The industry already learned this lesson the hard way with CVSS, and the data is unambiguous. Sonatype's 2026 Software Supply Chain Report found that roughly 65 percent of open-source CVEs lack an NVD-assigned CVSS score at all, and when the report's authors scored those orphans themselves, about 46 percent turned out to be High or Critical. Severity is not a label you can trust off the shelf, and an AI that infers it from code alone is reproducing the same context-free mistake at higher volume.

The situation got materially harder in 2026. On April 15, 2026, NIST moved the NVD to a risk-based enrichment model, leaving a large backlog of CVEs reclassified as "Not Scheduled." Industry analysis, including a Cloud Security Alliance research note, warns that a substantial share of incoming CVEs may now ship without the CPE identifiers, CVSS scores, and CWE classifications that scanners have quietly depended on for years. The free severity signal we all leaned on is thinning out at precisely the moment AI is multiplying the raw finding count. Context you used to get for free, you now have to compute yourself.

False-Positive Economics, Made Concrete

Let me make the cost real without inventing numbers. Suppose an agentic scanner returns a thousand findings against your dependency tree. You do not know your false-positive rate in advance, but you know two things from the data above: the overwhelming majority of findings will never be exploited, and a meaningful fraction will be hallucinated or context-irrelevant. Every one of them still costs verification time before it can be dismissed or actioned.

The leverage point is obvious once you frame it this way. Adding another finding-generator to this pipeline makes the dashboard busier and the economics worse. The only thing that improves cost-per-verified-finding is moving verification earlier and making it cheaper, so that humans spend their scarce attention only on findings that already survived machine scrutiny. That is an architecture problem, not a model problem. No single model, however well fine-tuned, can be trusted to verify its own output, because the same blind spot that produced a wrong finding will rate it confidently.
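Architecturally, that means a gate between generation and the human queue. Here is a minimal sketch of the shape, assuming each check is an independent function that did not produce the finding in the first place. The Finding fields, the check names, and the evidence flags are all hypothetical; in a real pipeline the evidence would come from separate analysis such as data-flow tracing, not from the generator's own report.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Finding:
    finding_id: str
    source: str
    # Evidence gathered by independent analysis, keyed by a hypothetical name.
    evidence: Dict[str, bool] = field(default_factory=dict)


Check = Callable[[Finding], bool]


def verification_gate(findings, checks: Dict[str, Check]):
    """Split findings into those ready for human review and those set aside.

    A finding reaches a human only if every check passes. Set-aside findings
    keep the names of the failed checks so the decision can be audited.
    """
    for_review = []
    set_aside = []
    for finding in findings:
        failed = [name for name, check in checks.items() if not check(finding)]
        if failed:
            set_aside.append((finding, failed))
        else:
            for_review.append(finding)
    return for_review, set_aside


# Placeholder checks. Missing evidence counts as "cannot confirm" and fails.
CHECKS = {
    "sink_reachable": lambda f: f.evidence.get("sink_reachable", False),
    "no_upstream_sanitizer": lambda f: not f.evidence.get("sanitized_upstream", True),
}

FINDINGS = [
    Finding("F-1", "agent_a", {"sink_reachable": True, "sanitized_upstream": False}),
    Finding("F-2", "agent_a", {"sink_reachable": True, "sanitized_upstream": True}),
    Finding("F-3", "agent_b"),
]

for_review, set_aside = verification_gate(FINDINGS, CHECKS)
print([f.finding_id for f in for_review])
print([(f.finding_id, failed) for f, failed in set_aside])
```

The design choice worth noting is that an unconfirmed finding is recorded rather than silently dropped, so the gate's own error rate can be sampled and measured later.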

How To Actually Measure It

If you want to adopt this metric, instrument three things. First, track verified findings as a distinct stage in your workflow, separate from raw findings, so you can watch the ratio. Second, log triage time per finding by source, so you can see which tools generate cheap-to-verify findings and which generate expensive noise. Third, weight by exploitability using a probabilistic signal such as EPSS alongside, not instead of, your own reachability and deployment context. A finding that is real, reachable, and likely to be exploited is worth a hundred that are merely plausible.
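Those three measurements can start as a very small triage log. The sketch below assumes one record per triaged finding; the field names, source names, and sample values are hypothetical, and the EPSS score is assumed to be looked up elsewhere and stored on the record.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageRecord:
    source: str            # which tool or agent produced the finding
    triage_minutes: float  # human time spent confirming or dismissing it
    verified: bool         # did it survive verification?
    reachable: bool = False
    epss: float = 0.0      # exploitation probability, assumed looked up elsewhere


def per_source_stats(records):
    """Raw count, verified count, verified ratio, and triage minutes per
    verified finding, grouped by source."""
    totals = defaultdict(lambda: {"raw": 0, "verified": 0, "minutes": 0.0})
    for r in records:
        t = totals[r.source]
        t["raw"] += 1
        t["verified"] += int(r.verified)
        t["minutes"] += r.triage_minutes
    stats = {}
    for source, t in totals.items():
        stats[source] = {
            "raw": t["raw"],
            "verified": t["verified"],
            "verified_ratio": t["verified"] / t["raw"],
            "minutes_per_verified": (
                t["minutes"] / t["verified"] if t["verified"] else float("inf")
            ),
        }
    return stats


def priority(record):
    """EPSS alongside, not instead of, verification and reachability."""
    return record.epss if (record.verified and record.reachable) else 0.0


# Hypothetical log entries.
LOG = [
    TriageRecord("scanner_a", 10, False),
    TriageRecord("scanner_a", 20, False),
    TriageRecord("scanner_a", 30, False),
    TriageRecord("scanner_a", 40, True, reachable=True, epss=0.30),
    TriageRecord("scanner_b", 15, True, reachable=True, epss=0.05),
    TriageRecord("scanner_b", 25, True, reachable=False, epss=0.90),
]

STATS = per_source_stats(LOG)
for source, s in STATS.items():
    print(source, s)
print([priority(r) for r in LOG])
```

Minutes per verified finding, broken out by source, is the number that tells you which tools are earning their place in the pipeline.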

Done honestly, this exercise reorders your tooling budget. Tools that produce a torrent of unverified findings start to look like liabilities, because they consume the resource that actually matters: engineer attention. Tools that produce fewer, pre-verified, context-aware findings start to look like the bargain they are.

How Safeguard Helps

Safeguard is built around exactly this metric. Our Multi-Agent TAOR Deep Think AI Engine treats finding generation and finding verification as separate jobs, running specialized agents that cross-check, trace data flow, and reason about exploitability before anything reaches a human. We are model-agnostic by design, so engines like Mythos or Daybreak plug in as components while the reliability lives in the verification and orchestration layer above them. Benchmarks like CyberGym, where even the strongest agent combinations clear only about a fifth of real-world vulnerabilities, show that this layer is the hard part. Through our AIBOM, vendor scorecards, and policy gates, per-package findings get translated into your actual deployment context, so severity reflects reachability rather than the worst-case CWE label. The result is a lower cost-per-verified-finding, which is the only vulnerability metric that still means something in the agentic era. If you want to see it run against your own dependency tree, reach out.