When we first ran a frontier LLM across a thousand lines of a payment-processing service, it produced fourteen findings. Two were real. Three were confabulated. The rest were style opinions dressed up as security issues. We spent six hours triaging the report and ended the day with slightly less confidence in LLM-augmented review than we had started with.
A year later, the picture is different. Not because the models got dramatically better — they did improve — but because we stopped trying to use them as replacement bug-finders and started treating them as a particular kind of augmentation tool with particular strengths and a long list of weaknesses. This post is that methodology.
What LLMs Are Actually Good At in Bug Discovery
Before committing to a workflow, it is worth being explicit about where LLMs add value and where they do not. Our empirical sense after a year of daily use:
They are good at recall breadth. Given a codebase and a prompt describing a class of bug, a capable model will point to dozens of candidate locations where that bug class could appear. Many will be false positives, but the recall is high and the review is tractable if the prompt is narrow enough.
They are good at cross-referencing patterns. "Show me every place that calls parseXML without configuring XXE protections" is a task that combines code search with domain knowledge, and modern LLMs handle it competently.
They are good at hypothesis generation for unfamiliar code. When landing in a codebase you have never seen, an LLM's summary of "here is what this module does and here are the places I would expect bugs to hide" is a reasonable starting point for directed manual review.
They are not reliable at precision on their own. Without grounding, they will invent vulnerabilities in correct code, particularly when the prompt primes them to find something.
They are not reliable at novel bug classes. If a bug class is not well-represented in training data, the model will not surface it. Novel research is still human work.
They are not reliable at confirming absence. "I reviewed this code and did not find the bug" from an LLM means almost nothing. Absence of evidence is not, in this case, evidence of absence.
The Methodology: Narrow Prompts, Grounded Outputs, Human Validation
The methodology we converged on has three pillars. Each one addresses a specific failure mode we hit early.
Narrow prompts. A prompt like "find security bugs in this code" produces noise. A prompt like "find places in this code where user input reaches a database query without going through the parameterised query API" produces tractable output. Narrow the bug class, narrow the input surface, and narrow the output format before you ask.
Grounded outputs. Every finding the model produces must include a specific file, specific line range, and an explanation that references code it has actually seen. Findings that wave at "there is probably an issue with authentication somewhere" are rejected without triage. This discipline alone cut our false positive rate by more than half.
Human validation on every finding. No LLM-generated finding enters a bug tracker without a human engineer confirming it. Not because we want to slow down the pipeline, but because the model's confidence is uncorrelated with correctness often enough to make unsupervised workflows dangerous.
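To make the grounding and validation requirements concrete, here is a minimal sketch of a finding record; the field names and the acceptance check are illustrative assumptions, not a schema the methodology prescribes.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One LLM-surfaced finding. Anything less specific than this is rejected without triage."""
    file: str                    # exact path the model was shown
    line_start: int              # specific line range, not "somewhere in authentication"
    line_end: int
    bug_class: str               # the narrow bug class the prompt asked about
    explanation: str             # must reference code the model actually saw
    preconditions: str           # what would need to be true for the bug to be exploitable
    validated_by: str | None = None  # engineer who confirmed it; None means it stays out of the bug tracker

def accept_for_triage(finding: Finding) -> bool:
    """Reject findings that wave at a file without pointing to lines and reasoning."""
    return bool(finding.file) and 0 < finding.line_start <= finding.line_end and bool(finding.explanation.strip())
```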
Stage One: Corpus Preparation
Before asking the model anything, we prepare the corpus. For a codebase of non-trivial size, "here is the whole repo" does not work — context windows run out, the model loses focus, and reviews become shallow. Our preparation steps:
Identify the trust boundaries in the service. For an HTTP API, that is the set of handlers that receive external input. For a message consumer, it is the deserialisation paths. The review focuses on these boundaries first because they are the shortest path from attacker-controlled input to interesting code.
Extract the files and functions on the boundary and their immediate callees, to a depth of two or three function calls; a sketch of this extraction appears after these steps. This gives the model a focused context where the signal-to-noise ratio is high.
Annotate the corpus with metadata the model will need: framework idioms, known-safe helper functions, team conventions. A model that does not know your custom sanitiser exists will flag every call to it as a potential issue.
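To make the extraction step concrete, here is a rough sketch that builds a name-based call graph with Python's ast module and keeps everything within two calls of the boundary handlers. It ignores imports, methods resolved at runtime, and async handlers, and the boundary function names are placeholders; a real pipeline would be framework-aware.

```python
import ast
from collections import defaultdict, deque
from pathlib import Path

def build_call_graph(repo: Path) -> dict[str, set[str]]:
    """Crude map from each function name to the names it calls, across all .py files."""
    calls: dict[str, set[str]] = defaultdict(set)
    for path in repo.rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                for sub in ast.walk(node):
                    if isinstance(sub, ast.Call):
                        func = sub.func
                        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
                        if name:
                            calls[node.name].add(name)
    return calls

def boundary_slice(calls: dict[str, set[str]], boundary: list[str], depth: int = 2) -> set[str]:
    """The boundary handlers plus their callees, down to the given call depth."""
    seen = set(boundary)
    queue = deque((fn, 0) for fn in boundary)
    while queue:
        fn, d = queue.popleft()
        if d == depth:
            continue
        for callee in calls.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, d + 1))
    return seen

# Hypothetical usage: slice a payment service around its externally reachable handlers.
# focus = boundary_slice(build_call_graph(Path("services/payments")), ["handle_charge", "handle_refund"])
```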
Stage Two: Hypothesis Generation
With a prepared corpus, we run a structured hypothesis generation pass. The prompt template: "Here is a module that handles [input type] from [source]. I am concerned about [specific bug class]. For each location where this bug class could occur, list: the file and lines, why you think this location is at risk, and what would need to be true for the bug to be exploitable."
The "what would need to be true" clause is what we have found to differentiate useful findings from bulk noise. A model that articulates preconditions is thinking; a model that produces a flat list of "suspicious locations" is pattern-matching. The former is worth triaging; the latter usually is not.
We run this pass for one bug class at a time. Broad prompts ("find all security bugs") produce shallow analyses. Narrow prompts ("find XXE vulnerabilities," "find path traversal," "find authorisation bypasses in admin endpoints") produce deeper, more specific output.
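A minimal sketch of how that pass can be assembled, one bug class at a time; the template wording follows the prompt above, the bug classes mirror the examples in this section, and ask_model stands in for whichever model client you already use.

```python
HYPOTHESIS_TEMPLATE = """Here is a module that handles {input_type} from {source}.
I am concerned about {bug_class}.
For each location where this bug class could occur, list:
- the file and lines,
- why you think this location is at risk,
- and what would need to be true for the bug to be exploitable.

{module_source}
"""

# One narrow bug class per pass; a broad "find all security bugs" prompt goes shallow.
BUG_CLASSES = ["XXE", "path traversal", "authorisation bypass in admin endpoints"]

def hypothesis_prompts(module_source: str, input_type: str, source: str):
    for bug_class in BUG_CLASSES:
        yield bug_class, HYPOTHESIS_TEMPLATE.format(
            input_type=input_type,
            source=source,
            bug_class=bug_class,
            module_source=module_source,
        )

# for bug_class, prompt in hypothesis_prompts(module_src, "XML uploads", "partner integrations"):
#     raw_findings = ask_model(prompt)  # placeholder: your model client of choice
```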
Stage Three: Corpus Search and Pattern Mining
LLMs are excellent at saying "here are the places that match a description." We use this for corpus-wide pattern searches that would be tedious to do by hand and fragile to encode as static-analysis rules:
"List every HTTP handler that accepts a redirect_url query parameter and describe how that parameter is used." — catches open-redirect and SSRF candidates.
"List every call to subprocess.run or os.system and describe the provenance of each argument." — catches command injection candidates.
"List every database query constructed by string concatenation." — catches SQL injection candidates and is a useful early warning for team discipline drift.
These searches produce candidate lists. The candidate lists go to human reviewers who check the actual behaviour. The LLM is a high-recall, low-precision scout; humans apply the precision.
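As a sketch of that scout/reviewer split, the searches above can be run as narrow prompts whose output lands in a triage queue rather than a bug tracker; the JSON output contract and the ask_model callable are assumptions for illustration.

```python
import json

SEARCHES = {
    "open redirect / SSRF": "List every HTTP handler that accepts a redirect_url query parameter and describe how that parameter is used.",
    "command injection": "List every call to subprocess.run or os.system and describe the provenance of each argument.",
    "SQL injection": "List every database query constructed by string concatenation.",
}

OUTPUT_CONTRACT = (
    "Respond with a JSON list of objects with keys: file, line_start, line_end, description. "
    "Only report locations you can point to in the code provided."
)

def run_pattern_searches(corpus: str, ask_model) -> list[dict]:
    """High-recall scouting pass; every candidate is marked for human review."""
    candidates = []
    for bug_class, search in SEARCHES.items():
        raw = ask_model(f"{search}\n{OUTPUT_CONTRACT}\n\n{corpus}")
        for item in json.loads(raw):
            item["bug_class"] = bug_class
            item["status"] = "needs_human_review"  # the model scouts; humans apply the precision
            candidates.append(item)
    return candidates
```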
Stage Four: Validation and Exploit Development
Once a candidate finding has a hypothesis — "this handler is vulnerable to X because Y" — we validate it. Validation is where LLMs are least reliable and humans are most necessary. The common failure modes we have seen:
Models assert that a bug exists based on code that looks similar to vulnerable code but behaves differently. "This looks like it uses eval on user input" can be a legitimate finding, or the code in question can turn out to be a call to a function named eval_expression that internally uses an AST walker. The model does not always distinguish between the two.
Models miss sanitisation paths that exist further up the call chain. A handler that passes user input to a dangerous function looks bad in isolation, but if a middleware upstream already validated the input, the finding is false. Models with limited context lose these paths.
Models sometimes generate proof-of-concept exploits that look plausible and do not actually work. We do not trust LLM-generated PoCs without running them end-to-end in an isolated environment. A surprising number of "confirmed exploits" fail when actually executed.
The validation stage is therefore always hands-on-keyboard work. Write the exploit, run it against a representative environment, observe the outcome, document the result.
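A minimal harness for that run-and-observe step, assuming the exploit is a standalone script pointed at an isolated, representative environment; the script path, target URL, and log file are placeholders.

```python
import json
import subprocess
import sys
import time
from pathlib import Path

def run_poc(poc_script: str, target_url: str, timeout_s: int = 60) -> dict:
    """Execute a proof-of-concept end-to-end and record what actually happened."""
    started = time.time()
    try:
        proc = subprocess.run(
            [sys.executable, poc_script, target_url],
            capture_output=True, text=True, timeout=timeout_s,
        )
        outcome = {"exit_code": proc.returncode, "stdout": proc.stdout[-2000:], "stderr": proc.stderr[-2000:]}
    except subprocess.TimeoutExpired:
        outcome = {"exit_code": None, "error": "timed out"}
    outcome["duration_s"] = round(time.time() - started, 1)
    # Document the result either way; a failed PoC is evidence too.
    with Path("validation_log.jsonl").open("a") as log:
        log.write(json.dumps({"poc": poc_script, "target": target_url, **outcome}) + "\n")
    return outcome
```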
Stage Five: Reporting and Knowledge Capture
Confirmed findings are reported the same way any other finding is reported: with full reproduction steps, impact assessment, and remediation guidance. We do include, in an internal note, that the finding was surfaced by an LLM-assisted review, because tracking hit rate over time tells us whether the methodology is still paying off.
The knowledge capture piece is worth calling out separately. Each confirmed finding becomes an input to the next review cycle. If the model missed something similar elsewhere in the codebase, we flag it in the corpus preparation for the next service. If the model over-reported on a particular pattern, we adjust the prompt for future runs.
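One way to keep that feedback loop honest is a hit-rate calculation over the internal notes; the JSON-lines ledger format here is an assumption, not something the workflow prescribes.

```python
import json
from collections import Counter

def hit_rates(ledger_path: str) -> dict[str, float]:
    """Confirmed findings divided by LLM-surfaced candidates, per bug class."""
    surfaced: Counter = Counter()
    confirmed: Counter = Counter()
    with open(ledger_path) as ledger:
        for line in ledger:
            entry = json.loads(line)              # one JSON object per candidate finding
            surfaced[entry["bug_class"]] += 1
            if entry.get("validated_by"):         # a human engineer confirmed it
                confirmed[entry["bug_class"]] += 1
    return {bc: confirmed[bc] / surfaced[bc] for bc in surfaced}

# A falling hit rate for a bug class usually means the prompt over-reports; adjust it before the next run.
```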
Common Failure Modes to Avoid
A short list of anti-patterns we have learned to avoid:
Chain-of-trust hallucination — the model asserts that a function is safe because it is "well-known to be safe," without actually checking. Always ground assertions in code, not training-data reputation.
Confirmation bias amplification — the model is told "we think this service has SQL injection" and dutifully finds some, even where none exist. Keep prompts neutral about what you expect to find.
Tool invocation theatre — a workflow that has the model call static analysis tools and summarise the output gives an impression of rigour without adding much. The underlying tools are doing the work; the model is a formatter. This is sometimes useful, but do not confuse it for LLM-driven analysis.
Overreliance on summary outputs — "the model said this file is fine" is not a review. The methodology requires the model to point to specific lines and specific reasoning.
How Safeguard Helps
Safeguard integrates LLM-augmented analysis into its vulnerability discovery pipeline with the discipline this methodology requires: narrow prompts mapped to specific bug classes, grounded outputs tied to exact lines, and human validation gates before any finding becomes an actionable alert. The platform maintains a corpus-aware memory of each codebase — trust boundaries, custom sanitisers, team conventions — so models are not flagging safe helpers as risky or missing sanitisation paths outside their immediate context. Validation runs in ephemeral execution environments so proof-of-concept exploits can be confirmed end-to-end before reaching developers, eliminating the confabulated-PoC problem that plagues naive LLM review. The outcome is a bug discovery workflow where LLMs add breadth without compromising precision, and where every reported finding is one a human engineer has confirmed.