AI Security

Zero-Day Discovery With LLM-Augmented Reachability: A Safeguard Engine Walkthrough

Pattern-matching scanners miss zero-days by definition. An engine that follows taint across package boundaries plus a model that hypothesizes exploit conditions can find what either would miss alone. Here is how that pipeline works end to end.

Nayan Dey
Senior Security Engineer
8 min read

Pattern-matching scanners find CVEs that are already in the database. By construction, they cannot find a zero-day — if a pattern for it existed, it would be a one-day. Pure-LLM scanners are the opposite failure mode: they speculate freely, produce high recall and low precision, and the resulting findings rarely survive a human reviewer's first pass. The interesting question is not which approach is better. It is whether there is a combination that outperforms both, and the answer that has emerged over the last eighteen months is yes, with a specific structure: a deterministic engine that follows taint across package boundaries, plus an LLM that reasons about exploit conditions on the candidates the engine surfaces. Each piece does the job the other cannot. This post walks through how that pipeline works in Safeguard's engine and why the division of labor matters.

Why doesn't pattern scanning find zero-days?

Because the patterns are derived from vulnerabilities that have already been reported. A scanner's rule set — whether it is a Semgrep pack, a custom YARA ruleset, or a commercial SAST engine's tuned checks — is a library of known-bad shapes. When a new vulnerability is disclosed, rules get written for it. That is the disclosure-to-detection loop, and it runs in one direction only. By the time a rule exists for shape X, someone has already published that X exists.
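
To make "library of known-bad shapes" concrete, here is a toy sketch of what such a rule boils down to. The shapes listed are illustrative, not pulled from any real rule pack:

```python
# Toy illustration of a "known-bad shape" check, the kind a pattern scanner
# encodes as a rule. The shapes below are illustrative only.
import ast

KNOWN_BAD = {"pickle.loads", "yaml.load", "eval"}

def flag_known_shapes(source: str) -> list[int]:
    """Return line numbers of calls whose dotted name matches a known-bad shape."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if ast.unparse(node.func) in KNOWN_BAD:
                findings.append(node.lineno)
    return findings

# A brand-new vulnerable shape -- say, an unsafe constructor in a niche
# library -- matches none of these strings, so the scanner stays silent.
```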

Pattern scanners find the long tail of unpatched instances of known vulnerabilities, which is genuinely valuable work. They do not find new vulnerability shapes. They cannot, because the rule does not exist yet.

Why doesn't pure-LLM scanning find them either?

Because the model, left to its own devices, does not know which call sites are actually exploitable and which are not. A model asked to "find vulnerabilities in this code" will flag sinks (uses of eval, subprocess.run, raw SQL construction, deserialization) based on local context. Most of those flags are false positives. In the runs we have analyzed, FP rates on pure-LLM vulnerability discovery over open source packages have landed anywhere from 60% to 95%, depending on the prompt and the code. The work of confirming which flagged sinks are actually reachable from an untrusted source is exactly the work the model skips — and it is the work that separates a theoretical vulnerability from an actual one.

Put another way: the model generates hypotheses. Without a grounding layer, the hypotheses are too cheap to be useful.

What does an engine contribute that a model cannot?

Three things, all deterministic and all measurable:

Cross-file and cross-package call-graph construction. The engine parses source, resolves symbol references, and builds a directed graph of who calls whom. At Safeguard, this graph spans package boundaries — a call from your application into a transitive dependency's internal function is a real edge. Models do not do this reliably. They approximate it from context, which breaks past two or three indirection hops.

Taint propagation with source/sink classification. Given a set of source types (HTTP parameters, CLI arguments, file contents, DB rows) and sink types (code execution, path operations, deserialization, privileged system calls), the engine computes which sinks are reachable from which sources, along which paths. This is classical program analysis, mature for three decades, and still the best way to answer "is this actually exploitable from an untrusted input?"

Version-aware symbol resolution. Different versions of a dependency expose different functions, with different arguments and different sink behaviors. The engine resolves the exact versions in the lock file and analyzes against those, not against HEAD of the upstream repo.

These outputs are the structured context an LLM cannot construct from source code alone, because it does not have the symbol table, the lock file, or the patience to walk a 50,000-node graph.
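
For a sense of what that structured context looks like, here is a minimal sketch of a cross-package call graph with typed nodes and edges. The names and fields are hypothetical, chosen for illustration rather than taken from Safeguard's schema:

```python
# Minimal sketch of the structures the engine produces -- hypothetical names,
# not Safeguard's actual schema. Nodes and edges are typed, and an edge can
# cross a package boundary.
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass(frozen=True)
class FunctionNode:
    package: str    # e.g. "your-app" or a transitive dependency
    version: str    # resolved from the lock file, not upstream HEAD
    qualname: str   # e.g. "sessions.Session.request"

@dataclass
class CallGraph:
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_call(self, caller: FunctionNode, callee: FunctionNode, kind: str) -> None:
        # kind: "direct", "dynamic_dispatch", "callback", "injection"
        self.edges[caller].add((callee, kind))

    def reachable(self, start: FunctionNode) -> set:
        """Every node reachable from `start`, regardless of package boundary."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for callee, _kind in self.edges.get(node, ()):
                if callee not in seen:
                    seen.add(callee)
                    stack.append(callee)
        return seen
```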

What does the LLM contribute that an engine cannot?

Reasoning about exploit conditions in natural language. Specifically:

  • "Given this taint path, under what inputs would the sink actually execute?"
  • "What CWE class does this match?"
  • "Is there an existing mitigation in the call chain that neutralizes the taint?"
  • "What would a proof-of-concept payload look like?"

These are not questions a static analyzer answers well. They require pattern-matching against exploit literature, natural-language interpretation of comments and docstrings, and flexible hypothesis generation. They are exactly what current-generation LLMs are good at — when given structured context to reason over.

The moment the LLM is operating on a specific taint path (not "the whole codebase") and is asked "given this path, is there an exploit?" — with the path's source, sinks, intermediate transforms, and sanitizer calls laid out for it — its FP rate drops by an order of magnitude. It is no longer hallucinating about unseen code. It is reasoning over a structured input.
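
As an illustration of what "laid out for it" means, a per-path brief might look something like the following. The field names and values are hypothetical, not Safeguard's actual wire format:

```python
# Hypothetical shape of the per-path brief handed to the model. All field
# names, locations, and versions here are made up for illustration.
path_brief = {
    "source": {"kind": "http_parameter", "location": "app/views/upload.py:41"},
    "sink": {"kind": "path_operation", "symbol": "os.path.join",
             "location": "vendor/archiver/extract.py:88"},
    "intermediate_transforms": [
        "filename = request.args['name']",
        "dest = normalize(filename)   # lowercases only, no traversal check",
    ],
    "sanitizers_on_path": [],          # none detected -- the path survived stage 2
    "package_versions": {"archiver": "2.4.1"},
    "question": "Given this path, under what concrete input does the sink execute "
                "with attacker-controlled data, and is it mitigated anywhere upstream?",
}
```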

How does Safeguard's engine stage the two pieces?

The pipeline runs in four stages. Each stage has an acceptance test, and findings that fail the test do not reach the next stage.

Stage 1 — Package intake. Given a repository or an uploaded package, the engine resolves the full dependency tree, downloads source for all transitive dependencies, and builds a call graph that spans package boundaries. Output: a graph with typed nodes (functions, methods, class initializers) and typed edges (direct call, dynamic dispatch, dependency injection, callback registration).

Stage 2 — Taint analysis. The engine marks all known source types and sink types across the graph. It runs a forward taint analysis, computing every path from any source to any sink. Sanitizer functions (detected by signature or by configuration) cut paths. Output: a set of source-to-sink paths, each annotated with the intermediate transformations applied to the taint.
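
A stripped-down version of that forward walk, with sanitizers cutting paths, might look like the sketch below. The graph encoding and category names are simplified for illustration:

```python
# Simplified forward taint walk: enumerate every source-to-sink path over a
# call/data-flow graph, dropping any path that passes through a sanitizer.
def taint_paths(graph, sources, sinks, sanitizers):
    """graph: {node: set(successor nodes)}; returns a list of paths (node lists)."""
    paths = []

    def walk(node, path):
        if node in sanitizers:          # a sanitizer cuts the path
            return
        path = path + [node]
        if node in sinks:
            paths.append(path)
        for nxt in graph.get(node, ()):
            if nxt not in path:         # avoid cycles
                walk(nxt, path)

    for src in sources:
        walk(src, [])
    return paths

# Example: "handler" reads an HTTP param, "run" shells out, "escape" sanitizes.
g = {"handler": {"build_cmd"}, "build_cmd": {"escape", "run"}, "escape": {"run"}}
print(taint_paths(g, sources={"handler"}, sinks={"run"}, sanitizers={"escape"}))
# -> [['handler', 'build_cmd', 'run']]  (the escaped branch is cut)
```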

Stage 3 — LLM hypothesis generation. For each path that survives stage 2, Griffin AI — our LLM reasoning layer — receives a structured brief: source, sink, intermediate code, version context, and the existing CVE set for this package/version. The model generates a hypothesis: what class of vulnerability this path represents, what input would trigger it, whether an existing CVE already describes it, and a confidence score. If the hypothesis matches an existing CVE, the finding is routed as a known issue. If it does not, it is routed as a zero-day candidate.
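
Sketched in code, the routing decision at the end of this stage looks roughly like the following, where `ask_model` stands in for the Griffin AI call (whose real interface is not shown here) and the field names are assumptions:

```python
# Sketch of stage 3 routing under assumed field names.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    cwe: str                  # e.g. "CWE-22"
    trigger_input: str        # description of an input that would trigger the sink
    matches_cve: str | None   # existing CVE id, if the model maps the path to one
    confidence: float         # model-reported confidence, 0..1

def route(path_brief, known_cves, ask_model):
    hyp = ask_model(path_brief)                  # returns a Hypothesis
    if hyp.matches_cve and hyp.matches_cve in known_cves:
        return "known_issue", hyp                # an existing CVE already covers this path
    return "zero_day_candidate", hyp             # novel shape -> verification queue
```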

Stage 4 — Verification. Candidates do not ship as findings. They ship to a verification queue where a second LLM pass (with different system prompt, different seed, often a different model) is asked to disprove the hypothesis. Candidates that survive a disproof attempt get human review. This is the step that keeps the FP rate honest.
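
A minimal sketch of that disproof pass, with an illustrative prompt and placeholder names rather than the production configuration:

```python
# Stage 4 in miniature: a second pass, with a different system prompt (and
# ideally a different model), is asked to break the hypothesis. The prompt,
# threshold, and object shapes here are illustrative placeholders.
DISPROOF_PROMPT = (
    "You are reviewing a claimed vulnerability. Argue that the following "
    "hypothesis is wrong: identify any sanitizer, type constraint, or "
    "unreachable condition on the path that prevents exploitation."
)

def verify(candidate, ask_disprover, min_confidence=0.6):
    # `candidate` is assumed to carry the stage-2 path brief and the stage-3 hypothesis.
    rebuttal = ask_disprover(system=DISPROOF_PROMPT, brief=candidate.path_brief)
    if rebuttal.found_blocker:
        return "rejected", rebuttal              # disproof succeeded; drop or downrank
    if candidate.hypothesis.confidence < min_confidence:
        return "needs_more_evidence", rebuttal
    return "human_review", rebuttal              # survived disproof -> human triage
```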

The division of labor is cleaner than either approach alone. The engine never speculates. The model never invents call edges. The verification layer forces the model to argue against itself.

Do you have any real numbers from this pipeline?

On our internal benchmarks across the top 10,000 packages of npm, PyPI, and RubyGems, this pipeline raises candidate discovery rate significantly over pattern scanning alone while holding FP rate at a level where a security engineer can triage a day's output in under an hour. Specific numbers vary by ecosystem — Python and JavaScript are where the current tooling is strongest because taint analysis is well-understood there; Rust and Go have different noise profiles because their stricter type systems (Rust's ownership and lifetime model in particular) eliminate some classes of bug at compile time.

The more important number is the ratio of confirmed zero-days to triage hours. For any security program, that ratio is what tells you whether the tool is buying you time or spending it. This pipeline's ratio is high enough that it is worth running continuously on your internal codebase and on the dependencies you consume.

What are the known limits?

Three, worth stating plainly. First, taint analysis on dynamic languages with heavy reflection (Ruby, older Python codebases, JavaScript with eval-based dispatch) has real blind spots. The engine handles the common cases; the rare ones require human-in-the-loop enrichment. Second, the LLM hypothesis step is not free — it costs inference time and dollars, and the cost compounds with the number of paths surviving stage 2. Prioritization matters; we rank paths by exposure surface before sending to the model. Third, verified zero-day candidates still require a responsible disclosure process. The pipeline does not produce CVEs directly; it produces actionable leads that go to a coordinated disclosure workflow.
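
One plausible shape for that ranking step, with made-up weights and features rather than Safeguard's actual scoring model:

```python
# Illustrative exposure-surface ranking: spend inference budget on the paths
# most likely to matter. Weights, features, and field names are made up.
EXPOSURE_WEIGHT = {"http_parameter": 1.0, "file_contents": 0.6, "cli_argument": 0.3}
SINK_WEIGHT     = {"code_execution": 1.0, "deserialization": 0.9, "path_operation": 0.5}

def exposure_score(path) -> float:
    src = EXPOSURE_WEIGHT.get(path["source"]["kind"], 0.1)
    snk = SINK_WEIGHT.get(path["sink"]["kind"], 0.1)
    hops_penalty = 0.95 ** len(path["intermediate_transforms"])
    return src * snk * hops_penalty

def prioritize(paths, budget: int):
    """Send only the top-`budget` paths to the LLM hypothesis stage."""
    return sorted(paths, key=exposure_score, reverse=True)[:budget]
```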

How Safeguard Helps

Safeguard's discovery engine implements exactly the pipeline described above, running continuously over customer codebases and the transitive dependency graph each codebase pulls in. Griffin AI provides the hypothesis and verification layers. The platform packages each zero-day candidate with its full taint path, the generating hypothesis, the disproof attempt, and the ranked evidence, so a security engineer can make the triage call in minutes instead of days. For customers who opt in, we participate in coordinated disclosure with upstream maintainers on verified candidates — so the findings become fixes, not just alerts. The combination of a real program-analysis engine with an LLM reasoning layer is where zero-day discovery at scale is heading, and it is the shape of the pipeline we have built Safeguard around.
