Vendors love to compare list price. Engineering teams love to compare token spend per scan. Both are the wrong unit. The unit that matters for the security program's actual budget is cost per actionable finding — the dollar amount, in token spend plus engineering time, required to produce one finding that engineering will actually fix. By that measure, Griffin AI and Mythos-class general-purpose AI-for-security tools have very different economics, and understanding the structural reason matters before signing any multi-year contract.
Why cost-per-finding is the right metric
Three reasons:
- It captures both inputs and outputs. Token spend matters; so does engineering review time. A platform with low token spend that produces unactionable findings is more expensive than the invoice suggests.
- It captures false positives correctly. A finding that gets investigated, determined to be a false positive, and closed costs the same as a finding that gets fixed — but produces no security value. Cost per actionable finding penalises false positives in the right way.
- It maps to the security program's outcomes. The program is judged by vulnerabilities prevented and incidents avoided, not by scan volumes. The metric should match the goal.
A serious procurement evaluation produces a cost-per-actionable-finding number for each vendor. That number varies by 3–5x across vendors evaluated at the same scan volume over the same surface.
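Written out, the metric is simple. Here is a minimal sketch, assuming you supply the per-call token cost, the fully-loaded hourly rate, and the actionable rate from your own pilot data rather than from vendor figures:

```python
def cost_per_actionable_finding(
    model_calls: int,           # number of model invocations in the period
    cost_per_call: float,       # average token spend per call, in dollars
    findings_reviewed: int,     # findings engineering actually looked at
    minutes_per_review: float,  # average review time per finding
    hourly_rate: float,         # fully-loaded engineering cost per hour
    actionable_rate: float,     # fraction of reviewed findings engineering will fix
) -> float:
    """Dollars spent (token spend plus review time) per finding that gets fixed."""
    token_spend = model_calls * cost_per_call
    review_cost = findings_reviewed * (minutes_per_review / 60) * hourly_rate
    actionable_findings = findings_reviewed * actionable_rate
    return (token_spend + review_cost) / actionable_findings
```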
Where engine-plus-LLM economics win
Three structural reasons:
The engine eliminates obvious noise before the model is called. Reachability analysis, version-aware resolution, sanitizer detection, and call-graph context all run deterministically, and the model is invoked only on candidates that pass these filters. Token spend per analysis decision is dramatically lower because the model is asked far fewer questions.
Model tiering matches model capability to task. Reasoning-heavy work (exploit hypothesis on a complex taint path) routes to a high-capability model; routine work (advisory summarisation, finding deduplication) routes to a smaller one. Cost-weighted model selection produces lower aggregate spend.
Eval gating prevents expensive regressions. When a model upgrade increases spend without improving outcomes, the eval harness catches it before customers absorb the cost; pure-LLM tools surface the same regression to customers as an opaque cost increase. A rough sketch of how the three mechanisms fit together follows.
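The sketch below is illustrative only: the engine object, its filter methods, and the model objects are hypothetical stand-ins for the pattern described above, not Griffin AI's actual API.

```python
def triage(raw_findings, engine, small_model, large_model):
    """Engine-first triage: deterministic filters first, then tiered model calls."""
    assessments = []
    for finding in raw_findings:
        # 1. Deterministic engine filters: unreachable, sanitized, or
        #    version-excluded findings never trigger a model call.
        if not engine.is_reachable(finding):
            continue
        if engine.is_sanitized(finding) or engine.version_excluded(finding):
            continue

        # 2. Model tiering: reasoning-heavy work (complex taint paths) routes to
        #    the high-capability model; routine assessment goes to the smaller one.
        model = large_model if engine.taint_path_is_complex(finding) else small_model
        assessments.append(model.assess(finding, context=engine.call_graph_context(finding)))
    return assessments


def gate_model_upgrade(baseline_eval: dict, candidate_eval: dict) -> bool:
    # 3. Eval gating: reject an upgrade that raises cost without improving outcomes.
    costs_more = candidate_eval["cost_per_actionable_finding"] > baseline_eval["cost_per_actionable_finding"]
    improves = candidate_eval["precision"] > baseline_eval["precision"]
    return improves or not costs_more
```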
A concrete cost model
Take a 200-engineer org with 30 services and 1,200 dependency vulnerabilities surfaced by raw software composition analysis (SCA) in a quarter. Both scenarios below are reproduced as a short code sketch after the lists.
With pure-LLM analysis (Mythos-class):
- Every finding is sent to the model for assessment: 1,200 model calls × ~$0.30 per call ≈ $360 in token spend.
- Engineering reviews each finding: 1,200 findings × ~15 minutes per review = 300 engineer-hours ≈ $45,000 at a fully-loaded rate of ~$150/hour.
- ~25% of findings (300) are actionable after review. Cost per actionable finding ≈ ($360 + $45,000) / 300 ≈ $151.
With engine-plus-LLM analysis (Griffin AI):
- The engine filters the 1,200 raw findings down to ~150 reachable candidates before the model is called.
- 150 model calls × ~$0.30 per call ≈ $45 in token spend.
- Engineering reviews 150 findings × ~10 minutes per review (lower because findings arrive pre-grounded with reachability evidence) = 25 engineer-hours ≈ $3,750 at the same rate.
- ~85% of findings (~128) are actionable after review. Cost per actionable finding ≈ ($45 + $3,750) / 128 ≈ $30.
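Plugging both scenarios into the cost_per_actionable_finding sketch from earlier reproduces the two figures; the ~$150/hour fully-loaded rate is the assumption implied by the $45,000 and $3,750 review costs.

```python
pure_llm = cost_per_actionable_finding(
    model_calls=1_200, cost_per_call=0.30,
    findings_reviewed=1_200, minutes_per_review=15,
    hourly_rate=150, actionable_rate=0.25,
)  # ≈ $151

engine_plus_llm = cost_per_actionable_finding(
    model_calls=150, cost_per_call=0.30,
    findings_reviewed=150, minutes_per_review=10,
    hourly_rate=150, actionable_rate=0.85,
)  # ≈ $30

print(f"ratio ≈ {pure_llm / engine_plus_llm:.1f}x")  # ≈ 5.1x
```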
The roughly 5x difference is structural, not a function of how well either deal was negotiated; it comes from the architecture choice.
Where Mythos-class economics can compete
Two specific cases:
Greenfield codebases with limited dependency exposure. When the dependency tree is small and clean, the engine doesn't have many findings to filter. Pure-LLM analysis on a small graph is cost-comparable.
Discovery-mode workflows where false positives are acceptable. Bug-bounty triage, threat-modelling, advisory drafting — workloads where the consumer will read every finding regardless. False-positive cost is lower because review is part of the task.
For routine production security workflows at enterprise scale, engine-plus-LLM dominates pure-LLM economics by 3–5x in our customer benchmarks.
Hidden costs in the cost-per-finding model
Three that surface in long-running deployments:
- Token-spend variability during incidents. Pure-LLM tools incur peak token spend exactly when budget visibility is lowest — during incident response. Engine-plus-LLM tools spike token spend more modestly because the engine handles the bulk of the analysis volume.
- Engineering trust erosion. Tools with high false-positive rates lose engineering trust over time. Future findings get reviewed less carefully. The dollar cost of trust erosion is not on the invoice but is real.
- Scope-restriction pressure. Teams faced with high cost-per-finding limit the scope of analysis to control budget. The unmonitored scope becomes the attack surface.
Each of these compounds the difference over a multi-year deployment.
What to evaluate
Three checks during procurement:
- Ask the vendor for cost-per-actionable-finding numbers from existing customers, not cost-per-scan.
- Run a 30-day pilot and measure actual review time per finding, not just finding counts.
- Project three-year cost under your actual scale, not the demo's scale; a simple projection sketch follows below.
The numbers diverge over three years more than over three months.
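A minimal projection sketch, assuming triage spend grows roughly 30% a year with the codebase; swap in your own growth rate and the quarterly totals from your pilot.

```python
def three_year_spend(quarterly_spend: float, annual_growth: float = 0.30) -> float:
    """Sum of quarterly triage spend over three years, growing year over year."""
    total = 0.0
    for _ in range(3):
        total += quarterly_spend * 4
        quarterly_spend *= 1 + annual_growth
    return total

# Quarterly totals from the cost model above ($360 + $45,000 vs $45 + $3,750):
print(three_year_spend(45_360))  # pure-LLM: ≈ $724k over three years
print(three_year_spend(3_795))   # engine-plus-LLM: ≈ $61k over three years
```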
How Safeguard Helps
Safeguard's engine-plus-LLM architecture is the structural reason cost-per-actionable-finding economics work. The engine eliminates reachability-irrelevant noise before any model call. Griffin AI runs at gated, high-leverage points — not on every finding. Model tiering routes routine work to smaller models. The eval harness catches cost regressions before they reach customers. For organisations whose AI-for-security spend is being scrutinised by finance, the architecture choice is upstream of the budget conversation.