A typical enterprise application has direct dependencies in the dozens, transitive dependencies in the hundreds, and the deepest levels of the dependency graph reach somewhere between fifty and seventy hops down. Pattern scanners that only check the first five or six levels miss most of the dependency graph and most of the vulnerabilities that actually live there. AI-for-security tools claim to go deeper, and the claims vary widely. The architectural reasons why some tools handle depth well and others don't are worth being precise about, especially when comparing Griffin AI's deterministic-engine-plus-LLM approach to Mythos-class general-purpose AI-for-security tools.
Why transitive depth is hard
Three structural reasons:
- Combinatorial explosion. A graph with 500 packages and 60 levels of depth has billions of possible paths, so naive path enumeration runs out of memory or time (a toy sketch follows this list).
- Reflection and dynamic dispatch. At deeper levels, frameworks rely more heavily on reflection, which obscures the static call graph. Inferring the runtime call structure requires framework-specific knowledge.
- Version interactions. Each transitive dependency has its own version range. The same package can appear at multiple versions in the same graph, with different behaviours at each version.
Tools that do not address all three either limit depth artificially or produce inaccurate output at depth.
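To make the first point concrete, here is a toy sketch in Python; the width and depth values are illustrative assumptions, not measurements of any real application. Even a modestly sized, regularly shaped graph has a path count that outgrows any enumeration strategy long before depth 60, while the package count grows only linearly.

```python
# Toy numbers, not measurements: in a layered dependency DAG, distinct
# root-to-leaf paths grow exponentially with depth while the package
# count grows only linearly.

def layered_dag_stats(width: int, depth: int) -> tuple[int, int]:
    """One root plus `depth` layers of `width` packages, where every package
    depends on every package in the next layer (the worst case for path
    enumeration)."""
    nodes = 1 + width * depth    # package count: linear in depth
    paths = width ** depth       # root-to-leaf paths: exponential in depth
    return nodes, paths

for d in (5, 10, 20, 60):
    nodes, paths = layered_dag_stats(width=5, depth=d)
    print(f"depth {d:>2}: {nodes:>4} packages, {paths:.3e} paths")
```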
How deterministic engines scale to depth
Three techniques, all of which Griffin AI's engine implements:
Memoised call graph construction. Each package's internal call graph is computed once and cached. Cross-package edges are added incrementally. The total work scales with the size of the graph, not the number of paths.
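A sketch of what memoisation looks like in practice is below; the function names and data shapes are assumptions for illustration (the parsing step is stubbed out), not Griffin AI's internals.

```python
# A minimal sketch of memoised call graph construction, not Griffin AI's
# actual engine; analyse_package_source is a stand-in for real static analysis.
from functools import lru_cache

CallGraph = dict[str, set[str]]   # caller -> set of callees

def analyse_package_source(package: str, version: str) -> CallGraph:
    """Placeholder for parsing one package's code into its internal call graph."""
    return {}

@lru_cache(maxsize=None)
def package_call_graph(package: str, version: str) -> CallGraph:
    """Computed once per (package, version), then served from cache."""
    return analyse_package_source(package, version)

def build_whole_graph(packages: list[tuple[str, str]],
                      cross_package_edges: CallGraph) -> CallGraph:
    """Merge cached per-package graphs and add cross-package edges incrementally;
    total work is proportional to nodes and edges, not paths."""
    combined: CallGraph = {}
    for name, version in packages:
        for caller, callees in package_call_graph(name, version).items():
            combined.setdefault(caller, set()).update(callees)
    for caller, callees in cross_package_edges.items():
        combined.setdefault(caller, set()).update(callees)
    return combined
```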
Targeted reachability queries. Rather than enumerating all paths, the engine answers "is sink X reachable from any source?" as a graph reachability query — which is linear in the size of the graph, not exponential.
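A minimal sketch of such a query, assuming the merged graph shape from the previous sketch:

```python
# Targeted reachability as a single breadth-first search; not Griffin AI's
# actual implementation.
from collections import deque

def sink_reachable(graph: dict[str, set[str]],
                   sources: set[str], sink: str) -> bool:
    """Search from all entry points at once. Every node and edge is visited
    at most once, so the cost is linear in graph size no matter how many
    distinct paths exist."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False
```

Keeping a predecessor map during the same search recovers the concrete witness path, which is the evidence a human reviewer needs.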
Framework-aware reflection handling. For common frameworks (Spring, Express, Django, Rails, etc.), the engine has specific knowledge of how the framework dispatches calls. A @RequestMapping annotation produces a known set of edges in the call graph; reflection-based plugin loading produces another. The engine handles each known framework explicitly and degrades gracefully (with explicit confidence reduction) for unknown frameworks.
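One way to picture this is a handler registry keyed by framework, sketched below with invented names and confidence values; the real engine's representation is presumably richer.

```python
# A minimal sketch of framework-aware edge synthesis; the registry, names,
# and confidence values are assumptions, not Griffin AI's schema.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SynthesisResult:
    edges: dict[str, set[str]]   # call graph edges implied by the framework
    confidence: float            # how much to trust findings that rely on them

def django_url_edges(urlpatterns: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Django maps each URL pattern to a view; model that as edges from the
    framework dispatcher to the view functions."""
    return {"django.core.handlers": {view for _pattern, view in urlpatterns}}

FRAMEWORK_HANDLERS: dict[str, Callable[[Any], SynthesisResult]] = {
    "django": lambda meta: SynthesisResult(django_url_edges(meta), confidence=1.0),
    # "spring", "express", "rails" handlers would follow the same shape.
}

def synthesise_edges(framework: str, metadata: Any) -> SynthesisResult:
    handler = FRAMEWORK_HANDLERS.get(framework)
    if handler is None:
        # Unknown framework: degrade gracefully with an explicit confidence
        # cut rather than silently dropping reflection-driven edges.
        return SynthesisResult(edges={}, confidence=0.5)
    return handler(metadata)
```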
These three together produce reliable depth analysis at scale.
Why pure-LLM approaches struggle at depth
Mythos-class tools that ask an LLM to reason directly about the dependency graph hit two limits:
- Context window. Even a 1M-token context cannot fit the full transitive call graph of a real enterprise application. The LLM sees a sample, not the graph.
- Reasoning at depth. Even when the relevant graph fits in context, LLMs are less reliable at multi-hop reasoning than at single-hop. A 6-hop taint path requires the model to maintain state across reasoning steps, and the chance of error compounds with each hop.
The output is plausible analysis at shallow depth and unreliable analysis at depth. Customers experience this as "the tool found the obvious things and missed the buried ones."
Where depth matters most
Three vulnerability classes that disproportionately live at depth:
- Deserialization gadgets. A gadget chain can span 20+ hops across multiple packages. Catching it requires depth + cross-package + reflection awareness.
- Authentication bypasses in middleware stacks. Modern web apps stack 10–20 pieces of middleware. The bypass often exists because of how two pieces of middleware deep in the stack interact.
- Cryptography misuse in transitive utility libraries. A common pattern: an application uses a library that uses another library that uses a third library for crypto, and the third library has a misuse that propagates up (a toy sketch of this pattern follows below).
Each of these is invisible at shallow depth. Each is the kind of vulnerability that ships into production and gets discovered after the breach.
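The third class is easy to picture as code. A toy sketch with invented package and function names is below; the point is that the application never calls the weak primitive directly, so a scanner that stops at depth one or two sees nothing wrong.

```python
# Toy illustration of transitive crypto misuse; names are invented.
import hashlib

# depth 3: a utility library nobody audits directly
def util_hash(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()   # unsalted MD5: the buried misuse

# depth 2: a convenience wrapper around the utility
def store_credentials(password: str) -> str:
    return util_hash(password.encode())

# depth 1: the friendly API the application actually calls
def register_user(password: str) -> str:
    return store_credentials(password)     # the misuse propagates up unchanged
```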
What a depth benchmark looks like
A meaningful depth benchmark answers three questions:
- At what depth does the platform's accuracy degrade meaningfully?
- What percentage of real CVEs in the test set are at depth ≥ 10?
- What is the false-negative rate at depth 20+?
Griffin AI publishes depth-stratified accuracy numbers on its eval benchmarks. Mythos-class platforms typically don't, which is part of why depth claims are hard to compare across vendors.
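As a rough illustration of what depth-stratified numbers mean operationally, the sketch below computes per-bucket recall over a labelled test set; the record shape and bucket edges are assumptions, not Griffin AI's published methodology.

```python
# A minimal sketch of depth-stratified benchmark metrics; the data shape
# and buckets are assumptions for illustration.
from collections import defaultdict

def depth_stratified_recall(findings: list[dict]) -> dict[str, float]:
    """Each ground-truth record carries the vulnerability's depth and whether
    the tool under test detected it. Returns recall per depth bucket;
    1 - recall in the deepest bucket is the false-negative rate at depth 20+."""
    buckets = defaultdict(lambda: [0, 0])           # bucket -> [detected, total]
    for f in findings:
        if f["depth"] >= 20:
            key = "depth 20+"
        elif f["depth"] >= 10:
            key = "depth 10-19"
        else:
            key = "depth 0-9"
        buckets[key][1] += 1
        buckets[key][0] += 1 if f["detected"] else 0
    return {k: detected / total for k, (detected, total) in buckets.items()}

# Example with made-up records:
# depth_stratified_recall([{"depth": 3, "detected": True},
#                          {"depth": 23, "detected": False}])
```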
A concrete example
A Spring Boot application with 47 direct dependencies and 312 transitives has a vulnerability at depth 23 — a deserialization gadget reachable through 6 cross-package hops in the Jackson + JavaBeans + reflection chain. The vulnerability was exploited in a public incident in late 2024.
Pattern scanners flagged the surface-level Jackson CVEs but did not detect the specific gadget chain.
Pure-LLM tools asked to reason about the application produced confident-sounding analysis that flagged a different (and incorrect) set of suspect paths. The actual gadget chain was not in the output.
A deterministic-engine-plus-LLM tool surfaced the gadget chain as a candidate finding with the full 23-hop taint path, the gadget construction, and a proof-of-concept payload sketch. Human review confirmed the finding in under an hour.
What to evaluate
Three concrete checks during procurement:
- Run the platform against a known-vulnerable application with a depth-20 finding. Does it surface the right path?
- Ask for depth-stratified accuracy numbers on a published benchmark.
- Walk through how the platform handles framework-specific dispatch (Spring annotations, Express middleware, Django URL routing).
Vendors who answer these concretely are vendors whose depth claims are defensible.
How Safeguard helps
Safeguard's deterministic engine handles transitive depth as a first-class concern: memoised call graph construction, targeted reachability queries, and framework-aware reflection handling produce reliable analysis at depths that pattern scanners and pure-LLM tools cannot reach. Griffin AI receives the depth-correct evidence and reasons over it, including for the specific vulnerability classes (deserialization, middleware bypass, transitive crypto misuse) that disproportionately live in the long tail. For organisations whose dependency graphs run to real-world depths, depth handling is the architectural property that determines whether the platform finds the buried vulnerabilities or only the surface ones.