AI Security

Why Engine-Plus-LLM Beats Pure-LLM: Griffin vs Mythos

The structural case for engine-plus-LLM security reasoning — and why pure-LLM products in the Mythos class hit a ceiling that no parameter count can raise.

Shadab Khan
Security Engineer
6 min read

There is a moment in every AI security evaluation where the demo ends and the auditor asks "prove it." The prose stops mattering. The confidence drops a notch. Either the system can hand over a trace — a call graph edge, a taint path, a signed SBOM record — or it cannot. We have run Griffin AI side-by-side with pure-LLM products in the Mythos class on eleven real monorepos over the past nine months, and the pattern is unambiguous: engine-plus-LLM architectures clear the "prove it" bar; pure-LLM architectures clear it only by accident. This is not a judgment about frontier models themselves, which are genuinely impressive reasoners. It is a statement about what happens when you ask a language model to answer a question whose answer lives in a structure no language can fully represent — the dependency graph, the control flow, the specific versioned binary that ships to production. Griffin AI answers those questions because Safeguard's engine answers them first.

What breaks first when you remove the engine?

Three things, in order. First, CVE matching becomes fuzzy. A pure-LLM system matches by string similarity across natural-language descriptions; it will happily claim CVE-2022-22965 (Spring4Shell) applies to a repo that uses Spring but not the vulnerable path. Second, reachability becomes impossible. The model cannot tell you whether the vulnerable function is reached from your HTTP handler because it has no call graph; it guesses from the presence of related imports. Third, patch verification becomes aspirational. The model writes a patch, claims it closes the issue, and has no way to prove it compiles or preserves behavior. Griffin's 73% auto-PR compile rate exists because the engine actually runs the build; the LLM's opinion on compilability is not load-bearing. When you stack those three failure modes together, the pure-LLM accuracy ceiling is not a tuning problem. It's the architecture.
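
To make the first two failures concrete, here is a minimal sketch of the checks the engine performs and a pure-LLM system cannot: exact package-and-version matching and call-graph reachability. The data shapes and names are illustrative assumptions, not Safeguard's actual interfaces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Advisory:
    cve_id: str
    purl: str                          # e.g. "pkg:maven/org.springframework/spring-beans"
    affected_versions: frozenset[str]
    vulnerable_symbol: str             # the symbol an exploit must reach

def version_affected(advisory: Advisory, sbom: dict[str, str]) -> bool:
    # Check 1: the exact package at an affected version is actually shipped.
    # "The repo uses Spring" does not satisfy this predicate.
    return sbom.get(advisory.purl) in advisory.affected_versions

def reachable(call_graph: dict[str, set[str]], entry: str, target: str) -> bool:
    # Check 2: worklist traversal of call-graph edges from an entry point
    # (say, an HTTP handler) to the vulnerable symbol. The presence of a
    # related import proves nothing; an edge path does.
    seen, work = {entry}, [entry]
    while work:
        node = work.pop()
        if node == target:
            return True
        for callee in call_graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                work.append(callee)
    return False

def confirmed_finding(advisory: Advisory, sbom, call_graph, entry) -> bool:
    # Only when both structural checks pass does the LLM have anything to explain.
    return version_affected(advisory, sbom) and reachable(
        call_graph, entry, advisory.vulnerable_symbol
    )
```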

How does the engine constrain the LLM's output space?

Constrained generation is the under-discussed half of the story. When Griffin AI emits a hypothesis — "CVE-2023-50164 is exploitable via MultiPartRequestWrapper in struts-webapp-2.5.30" — the engine has already verified that the purl exists, the CVE applies, and the reachable set includes the affected class. The LLM is not free to invent the hypothesis; it is free only to phrase the hypothesis the engine has supplied facts for. This is why the 81% hypothesis accuracy number holds across tenants. A pure-LLM system generates hypotheses from pattern-matching over the prompt, which is the same mechanism that produces the fabricated CVE IDs we've seen in red-team evaluations. Constraining the output space is not a loss of capability; it is a redirect of capability toward the part of the problem the LLM is actually good at — explanation, prioritization, phrasing, pedagogy.
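
The same constraint can be sketched as code. Assume an engine interface that returns a verified fact record or nothing; the names (`engine.verify`, `llm.explain`) are placeholders for illustration, not real Safeguard or model APIs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VerifiedHypothesis:
    cve_id: str
    purl: str
    reachable_via: str     # the call path the engine actually found

def emit_finding(engine, llm, cve_id: str, purl: str) -> Optional[str]:
    facts = engine.verify(cve_id=cve_id, purl=purl)  # VerifiedHypothesis or None
    if facts is None:
        # No verified facts means no hypothesis: the model never gets a chance
        # to invent one from prompt patterns.
        return None
    # The prompt is assembled only from engine-verified fields, so the model's
    # contribution is explanation and phrasing, not the claim itself.
    return llm.explain(
        f"Explain why {facts.cve_id} is exploitable in {facts.purl} "
        f"via {facts.reachable_via}, for a remediation ticket."
    )
```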

Can pure-LLM systems close the gap with better retrieval?

Retrieval augmentation helps. It does not close the gap. The reason is that retrieval operates on text similarity, and the questions that matter in software security are structural. "Does this CVE affect my code" is a reachability question. "Will this patch break anything" is a dataflow question. "Is this dependency transitively pulled into my runtime" is a graph-traversal question. You can index your AST into a vector store and still be unable to answer any of those questions reliably, because similarity does not respect edges. Griffin AI treats retrieval as a last resort rather than a first resort: the engine answers the structural question directly, and the LLM consults documentation only for the explanatory layer. Flipping that order is what produces the accuracy delta.
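
As a small illustration of why similarity does not respect edges, consider the third question, transitive inclusion. Under an assumed shape for resolved dependency data (not a real resolver), the answer is a graph walk:

```python
from collections import deque

def transitively_depends_on(deps: dict[str, set[str]], root: str, suspect: str) -> bool:
    # deps maps each package to its direct dependencies, e.g. built from
    # resolved lockfiles. The answer is a property of the edge set; ranking
    # similar-looking manifest chunks cannot substitute for the traversal.
    seen, queue = {root}, deque([root])
    while queue:
        pkg = queue.popleft()
        if pkg == suspect:
            return True
        for dep in deps.get(pkg, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False

# The suspect package never appears in the application's own manifest,
# but the walk still finds it two hops down.
deps = {
    "my-service": {"spring-boot-starter-web", "internal-logging"},
    "internal-logging": {"log4j-core"},
}
print(transitively_depends_on(deps, "my-service", "log4j-core"))  # True
```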

Where does this show up in real CVEs?

Take CVE-2021-44228 (Log4Shell) as the canonical test. The question "are we affected" decomposes into: do we ship log4j-core between 2.0-beta9 and 2.14.1, does our logging config allow JNDI lookups, and does user-controlled data reach a logging call? A pure-LLM system can answer the first question if the package appears in a retrieved manifest chunk. It cannot reliably answer the second or the third. Griffin AI answers all three because the SBOM has the version, the engine parses the config, and the taint analyzer traces the request body to the log call. Same question about CVE-2022-22963 (Spring Cloud Function RCE): the reachability check is what separates true positives from noise, and the reachability check is an engine function. In the MOVEit Transfer campaign (CVE-2023-34362, disclosed May 31, 2023), customers who had engine-grounded tooling could answer "is this in my environment and is it reachable" within hours; customers with pure-LLM tooling could answer "is this in my environment" and had to bring in humans for the reachability step.
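
The decomposition reads naturally as a conjunction of three engine predicates. The sketch below is illustrative: each function stands in for a real component (SBOM lookup, logging-config parser, taint analyzer), and the version handling is simplified.

```python
def ships_vulnerable_log4j(sbom: dict[str, str]) -> bool:
    # SBOM check: log4j-core between 2.0-beta9 and 2.14.1 (CVE-2021-44228).
    # Simplified tuple compare; real version-range logic also handles
    # qualifiers and the backported 2.3.x / 2.12.x fix lines.
    version = sbom.get("pkg:maven/org.apache.logging.log4j/log4j-core")
    if not version:
        return False
    parts = version.replace("-", ".").split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (2, 0) <= (major, minor) <= (2, 14)

def config_allows_jndi_lookups(jvm_properties: dict[str, bool]) -> bool:
    # Config check: message lookups not disabled via log4j2.formatMsgNoLookups.
    return not jvm_properties.get("log4j2.formatMsgNoLookups", False)

def user_input_reaches_logger(taint_paths: list[list[str]]) -> bool:
    # Taint check: at least one traced path from a request source to a log sink.
    return bool(taint_paths)

def affected_by_log4shell(sbom, jvm_properties, taint_paths) -> bool:
    # All three must hold. A pure-LLM system can usually answer only the first,
    # and only when the manifest happens to be in the retrieved context.
    return (
        ships_vulnerable_log4j(sbom)
        and config_allows_jndi_lookups(jvm_properties)
        and user_input_reaches_logger(taint_paths)
    )
```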

What does the 98% adversarial resistance mean structurally?

Adversarial resistance in a pure-LLM system is a trained behavior — the model has learned not to produce certain outputs when certain patterns appear in the prompt. Adversarial resistance in Griffin AI is a structural guarantee. If a user prompt tries to get the model to mark a true vulnerability as a false positive, the engine still returns the finding; the model cannot suppress it. If the prompt asks the model to generate a patch that weakens a check, the build-and-test sandbox rejects it. If the prompt asks the model to expose indexed code from another tenant, the tenant isolation layer blocks the request before the model sees it. 98% resistance is not a claim about the model; it is a claim about the system. The LLM contributes the remaining 2% of the defense, and we are comfortable with that ratio because we know exactly which 2% it is.
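
The structural part of that claim is easiest to see as control flow. In the sketch below (invented names, not the production pipeline), the finding set comes from the engine before the model runs, so nothing in the user prompt can shrink it:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve_id: str
    purl: str
    explanation: str = ""   # the only field the LLM is permitted to write

def annotate_findings(engine_findings: list[Finding], llm, user_prompt: str) -> list[Finding]:
    for finding in engine_findings:
        # The prompt can shape the wording of the explanation, but the loop
        # iterates over the engine's list: the model cannot drop entries,
        # relabel them as false positives, or see another tenant's findings,
        # because none of that data flows through it.
        finding.explanation = llm.explain(finding, user_prompt)
    return engine_findings
```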

Is there a cost to the engine-plus-LLM architecture?

Yes, and we should name it. Engine-plus-LLM is heavier to build. Safeguard maintains ingestion pipelines for twelve package ecosystems, resolvers for five manifest formats, taint analyzers for eight languages, and SBOM attestations in both SPDX and CycloneDX. Those are the capital investments that make the LLM's job tractable. A pure-LLM competitor can ship faster because it skips that investment. For a prototype, that's fine. For a platform you ship into a regulated enterprise, it isn't — because when the auditor asks "prove it," the prototype's architecture is what answers, and it only has prose.

How Safeguard Helps

Safeguard's bet is that the LLM is a reasoning engine, not a knowledge store. Griffin AI sits on top of a deterministic engine that resolves the facts before reasoning begins, which is what turns 81% hypothesis accuracy, 73% auto-PR compile rate, and 98% adversarial resistance from aspirational numbers into reproducible ones. If you are evaluating a pure-LLM competitor, ask it to show you a call graph edge. Ask it to rebuild your patched branch and show you the test output. The answers, or the lack of them, tell you which architecture you are actually buying.
