There's a growing wave of tools using large language models to find vulnerabilities in source code. The pitch is compelling: point a powerful LLM at a codebase, ask it to find security issues, and get back a list of vulnerabilities. Some projects have claimed to find hundreds or thousands of vulnerabilities in open-source packages using this approach.
The results sound impressive until you look closely. When you dig into the actual findings, a pattern emerges: high false positive rates, shallow analysis, missed context, and vulnerability reports that wouldn't survive a human reviewer's scrutiny. We've spent months building and testing AI-powered vulnerability discovery at Safeguard, and we've learned — the hard way — that the model alone is not enough.
This post explains why, and what's actually needed to make AI vulnerability discovery work in production.
The Single-Model Approach: What It Does
The typical approach works like this:
- Take a source code file (or a package's entire source)
- Feed it into an LLM with a prompt like: "Analyze this code for security vulnerabilities. Report each vulnerability with its CWE, severity, and location."
- Parse the model's output into structured findings
- Repeat for every file in every package
Some implementations add refinements — system prompts with CWE definitions, few-shot examples of known vulnerabilities, or post-processing to filter obvious noise. But the core architecture is: one model, one pass, per file.
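In pseudocode, the whole architecture fits in a few lines. This is a deliberately minimal sketch, assuming a generic call_llm helper that stands in for whatever model API a given tool uses; none of the names below come from a specific product.

```python
import json
from pathlib import Path

PROMPT = (
    "Analyze this code for security vulnerabilities. "
    "Report each vulnerability as JSON with its CWE, severity, and location."
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever model API the scanner calls."""
    raise NotImplementedError

def scan_package(package_dir: str) -> list[dict]:
    findings = []
    for path in Path(package_dir).rglob("*.py"):      # one model, one pass, per file
        source = path.read_text(errors="ignore")
        raw = call_llm(f"{PROMPT}\n\nFile: {path}\n\n{source}")
        try:
            findings.extend(json.loads(raw))           # parse the model's output
        except json.JSONDecodeError:
            pass                                       # unparseable output is silently dropped
    return findings
```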
This approach does find real vulnerabilities. LLMs are genuinely good at recognizing common vulnerability patterns — they've been trained on millions of code examples, security advisories, and CWE descriptions. If there's an obvious SQL injection or a textbook path traversal, the model will probably flag it.
The problem is everything that isn't obvious.
Where Single-Model Scanning Breaks Down
1. Context Window Limits Kill Cross-File Analysis
Real vulnerabilities often span multiple files. A tainted input enters through an HTTP handler in routes.js, passes through a middleware in auth.js, gets processed in utils.js, and finally reaches a dangerous sink in db.js. No individual file contains the full vulnerability — you need to trace the data flow across the entire call chain.
A single-model approach that analyzes files independently will miss this entirely. It might flag the eval() call in db.js as suspicious, but without knowing that user input actually reaches it, the finding is speculative. And if it flags every eval() as a vulnerability regardless of context, the false positive rate becomes unusable.
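Here is the same failure mode in a minimal Python analog (the file and function names are hypothetical). A per-file scan sees each half of the flow in isolation and never the vulnerability itself:

```python
# --- routes.py: the only place untrusted input enters the application ---
def handle_request(request):
    # request.args["filter"] is fully attacker-controlled
    return run_report(request.args["filter"])

# --- db.py: the only place the dangerous sink lives ---
def run_report(filter_expr):
    # Viewed in isolation this is just "eval on a parameter": suspicious,
    # but whether attacker input ever reaches it is invisible from this file alone.
    return eval(filter_expr)
```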
Even models with large context windows (128K+ tokens) struggle here. A medium-sized package might have 50-100 source files. Concatenating them all can blow past the context limit, and even when it fits, the model's attention degrades significantly with context length. Important details in file 47 get lost while the model is processing file 48.
2. No Structured Reasoning About Exploitability
Finding suspicious code patterns is step one. Determining whether they're actually exploitable is step two — and it's dramatically harder.
Consider this Python code:
```python
def process_template(template_str, context):
    return eval(template_str, {"__builtins__": {}}, context)
```
A pattern-matcher (including a naive LLM pass) flags eval() immediately. But is it actually exploitable? The {"__builtins__": {}} argument restricts the eval sandbox significantly. Is the restriction sufficient? That depends on the Python version, the contents of context, and whether there are known sandbox escapes for the specific restricted builtins configuration.
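A concrete reason the question is hard: stripping __builtins__ does not remove attribute access, which is the usual starting point for the well-known eval sandbox escapes (the exact escape chain depends on the Python version and which classes happen to be loaded). A quick sketch:

```python
# Even with builtins stripped, eval still allows attribute access on literals,
# which is where the known sandbox-escape chains begin.
classes = eval(
    "().__class__.__base__.__subclasses__()",
    {"__builtins__": {}},
    {},
)
print(len(classes))   # every class loaded in the interpreter, despite the "sandbox"
```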
A single-model approach either:
- Flags it unconditionally (false positive if the sandbox is actually effective)
- Misses it entirely (false negative if the model is trained to recognize the sandbox pattern as "safe")
- Provides a vague assessment that doesn't help the developer decide what to do
What's needed is structured exploitability analysis — a dedicated reasoning step that considers the specific mitigations in place, known bypass techniques, and the actual attack surface. This is a different cognitive task than pattern recognition, and it benefits from a different agent with different expertise.
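One way to make that step concrete is to force the assessment into a structured record instead of free-form prose. The fields below are illustrative, not Safeguard's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExploitabilityAssessment:
    """Illustrative output of a dedicated exploitability-analysis step."""
    finding_id: str
    mitigations_present: list[str] = field(default_factory=list)  # e.g. restricted builtins
    known_bypasses: list[str] = field(default_factory=list)       # e.g. object-introspection escapes
    attacker_controls_input: bool = False
    verdict: str = "needs-review"   # "exploitable" | "mitigated" | "needs-review"
    rationale: str = ""
```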
3. Severity Estimation Is Unreliable
When a single model assigns severity (Critical, High, Medium, Low), it's essentially guessing based on the vulnerability type. SQL injection? Must be Critical. XSS? Probably High. Missing input validation? Medium.
But severity depends on context:
- An XSS in an admin-only dashboard with CSP headers might be Low
- An XSS in a public-facing login page without CSP is Critical
- A SQL injection behind an authentication barrier with WAF protection is different from one in a public API endpoint
Single-model approaches don't have the architectural capacity to reason about deployment context, network position, or defense-in-depth. They assign severity based on the vulnerability class alone, which often doesn't match reality.
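Even a crude sketch shows how far context can move a rating. The adjustment rules below are illustrative only; real scoring needs far more signals, but they capture reasoning that a per-file, class-based model never performs:

```python
def adjust_severity(base: str, *, public_facing: bool, auth_required: bool,
                    csp_enabled: bool) -> str:
    """Illustrative only: nudge a class-based severity using deployment context."""
    order = ["Low", "Medium", "High", "Critical"]
    level = order.index(base)
    if not public_facing or auth_required:
        level = max(level - 1, 0)                    # smaller attack surface
    if csp_enabled:
        level = max(level - 1, 0)                    # defense in depth blunts impact
    if public_facing and not auth_required and not csp_enabled:
        level = min(level + 1, len(order) - 1)
    return order[level]

# The same XSS class lands in very different places:
print(adjust_severity("High", public_facing=False, auth_required=True, csp_enabled=True))   # Low
print(adjust_severity("High", public_facing=True, auth_required=False, csp_enabled=False))  # Critical
```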
4. Deduplication and CVE Matching Are Afterthoughts
A significant portion of "vulnerabilities found by AI" are actually known vulnerabilities that already have CVE identifiers. Finding them isn't zero-day discovery — it's just a slower, less reliable version of database lookup.
Proper deduplication requires:
- Cross-referencing against NVD, OSV, GitHub Advisory, and other databases
- Understanding that the same root cause can manifest in different files and functions
- Recognizing when a "new" finding is actually a duplicate of another finding from the same analysis run
- Identifying when an upstream dependency's vulnerability is being attributed to the downstream package
This is a distinct capability that requires database access, structured matching, and domain-specific logic — not just model inference.
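The database side of this is the easy part. A minimal cross-reference against OSV's public query API might look like the sketch below; the schema is trimmed to the essentials and the error handling is not production-grade:

```python
import requests

def known_vulns(ecosystem: str, name: str, version: str) -> list[dict]:
    """Query OSV for advisories already published against this exact version."""
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"package": {"ecosystem": ecosystem, "name": name}, "version": version},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("vulns", [])

# If an "AI-discovered" finding matches one of these, it is a rediscovery, not a zero-day.
for vuln in known_vulns("PyPI", "pyyaml", "5.3"):
    print(vuln["id"], vuln.get("summary", ""))
```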
5. Language-Specific and Framework-Specific Blind Spots
LLMs trained on general code have broad but shallow knowledge of security patterns across languages. They know the textbook examples but miss the language-specific edge cases that real attackers exploit.
Examples:
- JavaScript prototype pollution — The model might flag `Object.assign()` but miss that `lodash.merge()` in specific versions has a known deep-merge vulnerability that allows `__proto__` injection
- Python pickle deserialization — The model flags `pickle.loads()` but misses that `yaml.load()` without `Loader=SafeLoader` is equally dangerous
- Java deserialization — The model might miss that `ObjectInputStream` used with Apache Commons Collections on the classpath creates a gadget chain even when the deserialized class itself looks safe
- Go's `html/template` vs `text/template` — Using the wrong template package in a web handler introduces XSS, but the code looks identical
A single generalist model handles these inconsistently. Some it knows from training data; others it misses completely. There's no systematic coverage of language-specific vulnerability patterns.
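The PyYAML case is easy to demonstrate. This sketch shows the asymmetry: the safe loader refuses Python-specific tags, while an unsafe loader would construct arbitrary objects from the same string (the dangerous call is left commented out):

```python
import yaml

untrusted = "!!python/object/apply:os.system ['echo pwned']"

# safe_load only builds plain data types and rejects python-specific tags
try:
    yaml.safe_load(untrusted)
except yaml.YAMLError as exc:
    print("safe_load rejected it:", type(exc).__name__)

# An unsafe loader constructs arbitrary Python objects from the same string,
# which here would mean calling os.system -- deliberately not executed:
# yaml.unsafe_load(untrusted)
```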
6. Hallucinated Vulnerabilities
This is the most dangerous failure mode. LLMs can and do fabricate vulnerabilities that don't exist — complete with convincing-sounding CWE classifications, line numbers, and explanations. They generate plausible-looking security findings for code that's actually secure.
When a security team triages 50 findings from an AI scan and discovers that 20 of them are hallucinations, trust erodes rapidly. After a few rounds of this, the tool gets ignored — and the real vulnerabilities it did find get ignored with it.
The hallucination problem is inherent to single-pass generation. The model is producing text that looks like a vulnerability report, and it's optimized for plausibility, not accuracy. Without verification steps, the output is unreliable.
What Actually Works: Multi-Agent Orchestration
At Safeguard, we've found that reliable AI vulnerability discovery requires a fundamentally different architecture than "ask the model." Safeguard's Multi-Agent TAOR Deep Think AI Engine — built on Tool-Augmented Orchestrated Reasoning — uses multiple specialized agents that reason deeply and check each other's work:
Separation of Concerns
Instead of one model doing everything, we separate the task into distinct phases:
- Pattern detection — Find suspicious code patterns (fast, broad, allowed to over-report)
- Data flow tracing — Verify that untrusted input actually reaches the suspicious pattern (eliminates most false positives)
- Exploitability assessment — Determine if the verified data flow is actually exploitable given the defenses in place
- Confidence scoring and deduplication — Cross-reference against known CVEs and score novelty
- Advisory generation — Produce structured, actionable findings only for confirmed issues
Each phase is handled by a different agent (or group of agents) with different specialization, different prompts, and different evaluation criteria. A finding must survive all phases to become an advisory.
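In outline, the control flow looks like the sketch below. This is illustrative pseudocode for the phase-gated shape, not TAOR's actual implementation; every stub stands in for a specialist agent or group of agents:

```python
def detect_patterns(source_tree) -> list[dict]:
    """Phase 1: broad pattern detection, deliberately allowed to over-report."""
    return []   # specialist detection agents go here

def trace_data_flow(candidate: dict) -> bool:
    """Phase 2: does untrusted input actually reach the flagged sink?"""
    return False

def assess_exploitability(candidate: dict) -> bool:
    """Phase 3: do the mitigations in place actually hold up?"""
    return False

def is_novel(candidate: dict) -> bool:
    """Phase 4: cross-reference against known CVEs and earlier findings in this run."""
    return False

def analyze(source_tree) -> list[dict]:
    advisories = []
    for candidate in detect_patterns(source_tree):
        if not trace_data_flow(candidate):
            continue                      # downgraded: no confirmed source-to-sink flow
        if not assess_exploitability(candidate):
            continue                      # closed: defenses hold
        if not is_novel(candidate):
            continue                      # dropped: duplicate or already-known CVE
        advisories.append(candidate)      # phase 5: only confirmed issues become advisories
    return advisories
```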
CWE-Specialized Sub-Agents
Instead of relying on a generalist model's understanding of all vulnerability types, we deploy sub-agents that are specialized by CWE class. A sub-agent focused on CWE-502 (deserialization) carries deep knowledge about:
- Every known deserialization vulnerability pattern across Java, Python, Ruby, PHP, and .NET
- Framework-specific serialization libraries and their known-unsafe configurations
- Gadget chain patterns and classpath requirements
- The distinction between safe and unsafe deserializers in each ecosystem
This specialization dramatically reduces both false positives (the agent knows what safe deserialization looks like) and false negatives (the agent knows obscure vulnerability variants that a generalist would miss).
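Mechanically, the routing is simple; the value is in what each specialist carries. A toy dispatch table (names hypothetical) makes the idea concrete:

```python
# Hypothetical routing table: each candidate goes to the sub-agent specialized in
# its CWE class, each with its own knowledge base, prompts, and evaluation criteria.
SPECIALISTS = {
    "CWE-502": "deserialization-specialist",   # gadget chains, unsafe loaders per ecosystem
    "CWE-89":  "sql-injection-specialist",     # raw queries vs. ORM parameterization
    "CWE-79":  "xss-specialist",               # template engines, auto-escaping, CSP context
}

def route(candidate: dict) -> str:
    return SPECIALISTS.get(candidate["cwe"], "generalist-review")
```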
Verification Through Redundancy
Critical findings get analyzed by multiple agents independently. If the Code Analysis agent flags a potential SQL injection but the Data Flow agent can't trace untrusted input to the query, the finding gets downgraded. If the Exploitability agent determines that the application's ORM layer parameterizes the query before it reaches the flagged code, the finding gets closed.
This multi-perspective verification is the key difference between "AI found something suspicious" and "AI confirmed a vulnerability with a traced data flow and exploitation path."
The False Positive Problem Is the Real Problem
A vulnerability scanner that reports 1,000 findings is useless if 800 of them are false positives. Security teams are already overwhelmed with alert fatigue from existing tools. Adding another source of noisy, unreliable findings doesn't improve security — it degrades it by wasting the team's most constrained resource: human attention.
The metric that matters isn't how many vulnerabilities you found — it's what percentage of your findings are actionable. A tool that reports 50 findings with a 90% true positive rate is vastly more valuable than one that reports 500 findings where only 100 are real.
This is why we invest so heavily in the verification pipeline. Every finding that TAOR produces has:
- A confirmed data flow from source to sink
- An exploitability assessment with attack complexity rating
- A confidence score based on cross-referencing against known vulnerability databases
- A generated remediation with before/after code patches
When a Safeguard customer receives an SGZ advisory, they can act on it immediately — not spend hours triaging to figure out if it's real.
The Supply Chain Dimension
Individual vulnerability scanning is only part of the picture. In a software supply chain context, you need to understand:
- Transitive exposure — If package A depends on package B which has a zero-day, package A's users are exposed even though A's own code is clean
- Version-specific analysis — A vulnerability might exist in version 2.1.0 but be fixed in 2.1.1. The analysis must be version-aware
- Dependency tree impact — When a zero-day is found in a widely-used utility package, the blast radius includes every package in its dependency tree
- Remediation path — Is there a fixed version? Is there a drop-in replacement? Will upgrading break compatibility?
Single-model approaches don't address any of this. They analyze code in isolation and report findings in isolation. A proper supply chain security platform integrates vulnerability discovery with dependency intelligence, impact analysis, and automated remediation.
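Transitive exposure, for example, is a graph problem before it is a model problem. A minimal blast-radius computation over a reverse-dependency graph (hypothetical data) looks like this:

```python
from collections import deque

def blast_radius(vulnerable_pkg: str, reverse_deps: dict[str, set[str]]) -> set[str]:
    """Every package that transitively depends on the vulnerable one."""
    affected, queue = set(), deque([vulnerable_pkg])
    while queue:
        pkg = queue.popleft()
        for dependent in reverse_deps.get(pkg, set()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

# Hypothetical graph: app depends on framework, which depends on tiny-utility
reverse_deps = {"tiny-utility": {"framework"}, "framework": {"app"}}
print(blast_radius("tiny-utility", reverse_deps))   # {'framework', 'app'}
```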
What We're Building Toward
The current state of AI vulnerability discovery is early. The approaches that treat it as a simple "prompt the model, parse the output" problem are producing results that look impressive in demos but don't hold up under scrutiny. The approaches that invest in structured reasoning, multi-agent verification, and deep CWE specialization are producing results that security teams can actually trust and act on.
At Safeguard, our Zero-Day Discovery engine — powered by Safeguard's Multi-Agent TAOR Deep Think AI Engine — is live and finding real vulnerabilities in production open-source packages. Not hundreds of noisy maybe-vulnerabilities — confirmed findings with traced data flows, exploitation assessments, and generated remediations. Available to Enterprise customers with responsible disclosure to upstream maintainers.
The model is a component. The multi-agent deep think architecture is what makes it work.
Safeguard's Zero-Day Discovery is available for Enterprise customers. To see how Safeguard's Multi-Agent TAOR Deep Think AI Engine compares to traditional scanning for your dependency tree, request a demo.