AI Security

Zero-Day Discovery Economics: Cost Per Find

The economics of zero-day discovery have been opaque for too long. Here is the actual cost structure of finding a real, defensible bug, and how to think about it.

Nayan Dey
Senior Security Engineer
7 min read

The vendor pitch decks I have seen for AI bug hunters in the last two years all converge on the same kind of statistic. They report findings per hour, or findings per million lines of code, or findings per dollar of model spend. The number is always large and the slide is always confident. It is also always wrong, in the sense that it measures the wrong thing. A finding that does not survive triage is not a finding. It is a notification. Counting notifications and dividing by cost gets you a metric that is easy to compute and useless to budget against.

The metric that actually matters in a security programme is cost per defensible finding. A defensible finding is one that is grounded in a real reachable taint path, has survived a disproof pass, and is accepted by the triage team as a real vulnerability they will act on. The denominator in that ratio is the only thing that affects the security posture of the organisation. Everything in the numerator that does not contribute to a defensible finding is cost without value.

This piece is an attempt to write down what the cost per defensible finding actually looks like in 2026, broken down by architecture and by the operational structure around it.

The pieces of the cost stack

A discovery pipeline's cost stack has roughly five components; a short code sketch after the list puts them together.

The first is platform cost: the licence or subscription for the discovery tool itself. This number ranges from low five figures for a small SaaS deployment to mid six figures for a large self-hosted enterprise install. It is the most visible cost and usually the smallest one.

The second is compute cost: the model inference, the static analysis runtime, and the storage for intermediate artefacts. Pure-LLM tools tend to have high model inference cost because every analysis is an LLM call. Engine-plus-LLM tools tend to have lower model inference cost because the engine does most of the work and the LLM is invoked over a finite candidate set.

The third is triage cost: the engineer-hours spent reading, verifying, and disposing of findings. This is the largest cost in most programmes and the one that most decks omit. A back-of-envelope calculation: 25 minutes per finding at a fully loaded cost of $150 per engineer-hour is about $62 per finding triaged.

The fourth is remediation cost: the engineer-hours to fix accepted findings. This is paid only for defensible findings, not for hallucinated ones, so it scales with the precision of the pipeline. The cost per remediation varies enormously by bug class, from an hour for a clear-cut input validation fix to weeks for a subtle reachability bug that requires architectural changes.

The fifth is disclosure cost: the operational overhead of responsible disclosure to upstream maintainers when the bug is in a transitive dependency. This is small per finding but real, and it includes the cost of maintaining embargo discipline.
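To make the stack concrete, here is a minimal sketch. The only figures taken from the text are the 25-minute triage time and the $150 fully loaded engineer-hour; everything else is parameterised rather than assumed, and remediation and disclosure are left out because they are paid per accepted finding rather than per find.

```python
# A minimal model of the pre-remediation cost stack. Only the 25-minute
# triage time and the $150/hour loaded rate come from the text.

TRIAGE_MINUTES_PER_FINDING = 25   # from the back-of-envelope above
ENGINEER_RATE = 150               # fully loaded $/engineer-hour

def triage_cost(findings: int) -> float:
    """Engineer cost to read, verify, and dispose of a batch of findings."""
    return findings * TRIAGE_MINUTES_PER_FINDING / 60 * ENGINEER_RATE

def cost_per_defensible_find(platform: float, compute: float,
                             total_findings: int, fp_rate: float) -> float:
    """Pre-remediation cost per defensible finding."""
    defensible = total_findings * (1 - fp_rate)
    return (platform + compute + triage_cost(total_findings)) / defensible

print(triage_cost(1))  # 62.5 -- about $62 per finding triaged
```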

What the math looks like for a pure-LLM tool

A pure-LLM bug hunter producing 500 findings per month at a 70 percent false positive rate produces 150 defensible findings, before accounting for the real findings that get lost in triage. The 350 false positives consume triage time. At 25 minutes each, that is 146 engineer-hours per month, or roughly $22,000 per month at $150 per engineer-hour. The 150 defensible findings consume their own triage time, another 62 hours or $9,400 per month.

Total triage spend: $31,400 per month for 150 defensible findings, or about $209 per defensible find in triage cost alone.

Add the platform cost (let us assume $5,000 per month for a typical SaaS) and the compute cost (let us assume $4,000 per month for the LLM inference), and the cost per defensible find lands around $269 in pre-remediation cost. The remediation costs are paid on top, and they scale with precision.
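Here is that arithmetic written out, so the rounding is visible. All inputs are the figures quoted above; the small differences from the quoted totals come from rounding intermediate hours.

```python
# Pure-LLM scenario: all inputs are the figures quoted in the text.
findings, fp_rate = 500, 0.70
false_positives = int(findings * fp_rate)      # 350
defensible = findings - false_positives        # 150

triage = findings * 25 / 60 * 150              # $31,250/mo (text rounds to $31,400)
platform, compute = 5_000, 4_000               # the assumed SaaS and inference spend

print(triage / defensible)                     # ~$208 per defensible find, triage only (~$209 in text)
print((platform + compute + triage) / defensible)  # ~$268 pre-remediation (~$269 in text)
```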

This calculation is generous to the pure-LLM tool. It assumes the triage team does not burn out, the false positive rate does not poison the team's confidence in the queue, and the missed findings (false negatives) do not produce incident response costs downstream. In practice all three of those assumptions break, and the realised cost per defensible find ends up considerably higher.

What the math looks like for an engine-plus-Griffin AI pipeline

The engine-plus-Griffin pipeline's economics are different in shape. The same pipeline producing 100 findings per month at an 8 percent false positive rate produces 92 defensible findings. The 8 false positives consume 3 hours of triage time. The 92 defensible findings consume 38 hours.

Total triage spend: about $6,150 per month for 92 defensible findings, or about $67 per defensible find in triage cost.

The platform cost is comparable to the pure-LLM tool. The compute cost is split differently: more spend on the static engine, less on the LLM, but the totals are within a factor of two of each other for similar throughput.

Cost per defensible find lands around $130 in pre-remediation cost.
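The corresponding sketch for the engine-plus-Griffin side. The triage inputs come from the text; the compute line is my assumption, since the text gives no exact figure, and roughly $1,000 per month is what reproduces the ~$130 result.

```python
# Engine-plus-Griffin scenario. Triage inputs are from the text; the
# compute figure is an ASSUMPTION (the text gives no exact number).
findings, fp_rate = 100, 0.08
false_positives = int(findings * fp_rate)      # 8
defensible = findings - false_positives        # 92

triage = findings * 25 / 60 * 150              # $6,250/mo (text rounds hours to 3 + 38 -> $6,150)
platform = 5_000                               # comparable to the pure-LLM tool
compute = 1_000                                # assumed; reproduces the ~$130 figure below

print(triage / defensible)                     # ~$68 per defensible find in triage (~$67 in text)
print((platform + compute + triage) / defensible)  # ~$133 pre-remediation (~$130 in text)
```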

The ratio is roughly 2x in favour of the engine-plus-Griffin pipeline on this calculation. In practice it is wider, because the precision improvements compound through the funnel: triagers stay engaged, the queue is read carefully, real findings do not get missed in noise, and the team's overall throughput on security work is higher.

What the gross-output metrics miss

A vendor that reports findings per dollar of model spend is implicitly arguing that the cost of a finding is dominated by the model spend. The numbers above show why this is wrong. The dominant cost is triage, and triage scales with the false positive rate, not with the model spend. A pipeline that doubles model spend to halve the FP rate is making the right trade.
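A quick illustration of that trade, extending the pure-LLM scenario with hypothetical numbers: double the inference spend and halve the FP rate, and the cost per defensible find drops by roughly half, because the triage bill is unchanged while the defensible count more than doubles.

```python
# Hypothetical trade: double model spend, halve the FP rate.
def pre_remediation(platform, compute, findings, fp_rate):
    defensible = findings * (1 - fp_rate)
    triage = findings * 25 / 60 * 150      # triage scales with total findings
    return (platform + compute + triage) / defensible

print(pre_remediation(5_000, 4_000, 500, 0.70))  # ~$268 per defensible find
print(pre_remediation(5_000, 8_000, 500, 0.35))  # ~$136: the right trade
```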

What changes at scale

The economics shift further when you look at large codebases. A pure-LLM tool's findings per month scale roughly linearly with codebase size, which sounds appealing until you realise that the false positive count scales with it too. A 5x larger codebase produces 5x the noise, and the triage budget either scales with it or the queue collapses.

An engine-plus-LLM pipeline scales differently. The static engine prunes the candidate set to flows that have a chance of being reachable, so the candidate count grows sublinearly with codebase size. The LLM is invoked over the pruned set, the disproof pass is invoked over the survivors, and the finding count grows in line with the actual bug density in the code rather than in line with the volume. The triage burden does not collapse the team as the codebase grows.
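A toy model of the scaling claim. The exponents are assumptions chosen only to illustrate the shapes: linear growth for the pure-LLM tool, and a square-root curve standing in for an engine-pruned candidate set.

```python
# Toy scaling model. The 0.5 exponent is an ASSUMPTION standing in for
# "sublinear"; the real curve depends on the codebase and the engine.
for scale in (1, 5, 25):                     # codebase size multiplier
    pure_llm = 500 * scale                   # findings grow linearly, noise included
    engine = 100 * scale ** 0.5              # pruned candidate set grows sublinearly
    print(f"{scale}x codebase: "
          f"pure-LLM triage ${pure_llm * 25 / 60 * 150:>9,.0f}/mo, "
          f"engine ${engine * 25 / 60 * 150:>8,.0f}/mo")
```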

This is the part that makes the architectural choice load-bearing for organisations with large monorepos. The cost-per-find advantage compounds with scale.

The hidden cost of false negatives

One number is missing from both calculations above: the cost of bugs the pipeline does not find. This is genuinely hard to estimate, because the universe of unfound bugs is unknowable in any specific organisation. The honest framing is that engine-plus-LLM pipelines have bounded recall, limited by what the engine can reach, while pure-LLM pipelines have unbounded recall in principle but in practice produce so many speculative findings that the real ones drown.

Across the deployments I have observed, the missed-bug rate for engine-plus-Griffin pipelines is similar to or better than the missed-bug rate for pure-LLM tools, because the pure-LLM tools' missed bugs are usually buried under the false positives where nobody looks. The "high recall" of pure-LLM tools is mostly theoretical.

How Safeguard Helps

Safeguard publishes its cost per defensible find by deployment, broken down by platform, compute, and realised triage. Customers get the actual ratio, not the gross-output metric. The engine-plus-Griffin AI pipeline operates at the precision regime described above (single-digit FP rates on supported bug classes) and the platform tracks queue health, triage throughput, and remediation cost so the cost ratio is observable, not asserted. Teams can compare Safeguard against incumbent tools using the metric that actually matters to their budget rather than the one the incumbent vendor prefers to report.
