A security architecture diagram will lie to you if you let it. Every vendor box and every arrow looks similar on a slide — "AI model," "repo integration," "policy engine," "remediation." To understand what a platform can actually do, you have to ask what happens between the boxes. Last month I spent a few days tearing down the behavior of Griffin AI and comparing it to the behavioral shape of pure-LLM security products in the Mythos class. The interesting finding is not which one scores higher on a single benchmark. It's that the two products do different kinds of work, and the diagram hides it. Griffin AI is a reasoning layer welded to a deterministic engine that resolves SBOMs, walks call graphs, and runs taint analysis before any language model sees a token. Mythos-class products wrap a frontier LLM around a retrieval layer and call the retrieval result "context." The gap between those two sentences, read carefully, is the entire architecture discussion.
What does "engine-grounded" actually mean in Griffin's stack?
Peel the UI back and Griffin's request path looks like this. A user asks: "Is CVE-2023-44487 exploitable in our edge gateway?" Before the LLM receives the prompt, Safeguard's engine runs three deterministic passes. First, the SBOM resolver matches the CVE's affected purls (pkg:golang/golang.org/x/net@<0.17.0) against the resolved component list for the repo and returns a hit at golang.org/x/net v0.14.0. Second, the call graph loader identifies the reachable set from your HTTP/2 handler down through x/net/http2.Server.ServeConn. Third, the taint analyzer checks whether external request frames can reach the vulnerable code path. Only after those three artifacts are attached to the prompt does the LLM reason. Its output is not "I think CVE-2023-44487 might apply" but "CVE-2023-44487 is reachable via cmd/gateway/main.go → internal/http/server.go → x/net/http2.Server.ServeConn; recommended upgrade: golang.org/x/net v0.17.0." Citations attach to artifacts, not vibes. That path is what enables the 81% hypothesis accuracy we publish — hypotheses are filtered by the engine before the model speaks.
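To make the ordering concrete, here is a minimal Go sketch of an engine-first request path. Every type and the groundedPrompt helper are hypothetical stand-ins (Griffin's internal interfaces are not public), but the shape is the point: the SBOM hit, the call path, and the taint verdict exist as structured facts before any model call happens.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical artifact types; Griffin's internal interfaces are not public.
type SBOMHit struct {
	Purl     string // affected range from the CVE feed
	Resolved string // what the repo actually ships
}

type CallPath []string // entry point down to the vulnerable symbol

type TaintResult struct {
	Source  string
	Sink    string
	Tainted bool // deterministic verdict from the taint analyzer
}

// groundedPrompt assembles the three engine artifacts before any model call.
// The LLM reasons over pinned facts, not over raw repo text.
func groundedPrompt(question string, hit SBOMHit, path CallPath, taint TaintResult) string {
	var b strings.Builder
	fmt.Fprintf(&b, "QUESTION: %s\n", question)
	fmt.Fprintf(&b, "SBOM: %s resolved at %s\n", hit.Purl, hit.Resolved)
	fmt.Fprintf(&b, "CALL PATH: %s\n", strings.Join(path, " -> "))
	fmt.Fprintf(&b, "TAINT: %s -> %s, reachable=%v\n", taint.Source, taint.Sink, taint.Tainted)
	return b.String()
}

func main() {
	prompt := groundedPrompt(
		"Is CVE-2023-44487 exploitable in our edge gateway?",
		SBOMHit{Purl: "pkg:golang/golang.org/x/net@<0.17.0", Resolved: "golang.org/x/net v0.14.0"},
		CallPath{"cmd/gateway/main.go", "internal/http/server.go", "x/net/http2.Server.ServeConn"},
		TaintResult{Source: "external HTTP/2 frames", Sink: "http2.Server.ServeConn", Tainted: true},
	)
	fmt.Print(prompt) // only now would the LLM be invoked, with the facts attached
}
```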
How does the Mythos-class pattern compare structurally?
A pure-LLM security platform typically follows a retrieval-augmented generation pattern. The repo is indexed into a vector store — files, documentation, sometimes AST fragments. At query time, the system retrieves top-k semantically similar chunks and hands them, along with the prompt, to a frontier model. The model then reasons in natural language over those chunks. The structural limitation is that semantic similarity is not reachability. A chunk of code that looks syntactically like the vulnerable function may be completely unreachable; a chunk that looks unrelated may sit on the critical path. Without a call graph, the model cannot distinguish. Without a taint analyzer, it cannot verify that user input actually flows to the sink. It fills in the gap with probability mass from pretraining, which is where hallucinated CVEs, invented function signatures, and mismatched package versions come from. The architecture is elegant but underpowered for this task.
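For contrast, here is a minimal Go sketch of the retrieval step, with toy embedding vectors standing in for a real embedding model. Nothing in this code knows about reachability; the ranking is similarity and only similarity, which is exactly the limitation described above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Toy chunks with hand-picked vectors; a real system would embed code and docs
// with an embedding model. The file paths are illustrative.
type chunk struct {
	path string
	vec  []float64
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k chunks most similar to the query vector. Note what is
// missing: no call graph, no taint trace, no notion of whether code ever runs.
func topK(query []float64, chunks []chunk, k int) []chunk {
	sort.Slice(chunks, func(i, j int) bool {
		return cosine(query, chunks[i].vec) > cosine(query, chunks[j].vec)
	})
	if k > len(chunks) {
		k = len(chunks)
	}
	return chunks[:k]
}

func main() {
	query := []float64{0.9, 0.1, 0.2} // "http2 stream reset handling"
	chunks := []chunk{
		{"vendored/old_http2_copy.go", []float64{0.88, 0.12, 0.18}}, // looks alike, unreachable
		{"internal/http/server.go", []float64{0.35, 0.7, 0.4}},      // on the critical path
	}
	for _, c := range topK(query, chunks, 1) {
		fmt.Println("retrieved:", c.path) // the dead copy wins on similarity alone
	}
}
```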
Where does the 98% adversarial resistance number come from?
Adversarial prompts in security tooling usually take one of three shapes. The first is "suppress the finding": coax the model into marking a true vulnerability as a false positive. The second is "weaken the fix": get the model to propose a remediation that silences the alert without closing the hole (e.g., catching the exception, adding a broad allow-list). The third is "exfiltrate context": trick the model into revealing indexed code or credentials from other tenants. Griffin AI resists these at 98% on our internal red-team suite because the engine constrains the output space. The model cannot mark a finding as a false positive unless the taint analyzer agrees. It cannot propose a patch that fails to compile in the sandbox — that's the same constraint that produces the 73% auto-PR compile rate. And it cannot surface tenant data outside the user's policy scope because the engine enforces tenant isolation before the LLM is invoked. A pure-LLM system has fewer levers to resist with; its only defense is trained refusal, which is measurably leakier.
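Here is a sketch of what "the engine constrains the output space" can look like in code. The ModelOutput type and the gate function are illustrative names, not Griffin's API; the mechanism is simply comparing the model's verdict against deterministic facts and refusing to ship contradictions.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative names, not Griffin's API: what the model proposed for one finding.
type ModelOutput struct {
	FindingID     string
	FalsePositive bool   // model wants to dismiss the finding
	PatchDiff     string // empty if no patch was proposed
}

// gate rejects model output that contradicts deterministic engine facts, so a
// "suppress the finding" or "weaken the fix" injection has nothing to push against.
func gate(out ModelOutput, taintConfirmed bool, compiles func(string) bool) error {
	if out.FalsePositive && taintConfirmed {
		return errors.New("rejected: taint analysis confirms an exploit path for " + out.FindingID)
	}
	if out.PatchDiff != "" && !compiles(out.PatchDiff) {
		return errors.New("rejected: proposed patch fails the sandbox build for " + out.FindingID)
	}
	return nil
}

func main() {
	// Stub sandbox: a real gate would build the patched tree and run the tests.
	sandboxCompiles := func(diff string) bool { return len(diff) > 0 }

	adversarial := ModelOutput{FindingID: "CVE-2023-44487", FalsePositive: true}
	if err := gate(adversarial, true /* taint analyzer says the path is live */, sandboxCompiles); err != nil {
		fmt.Println(err) // the injected dismissal never ships
	}
}
```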
What about context window size?
Vendors love to cite million-token context windows, and they do matter — but only after the grounded context has been assembled. Griffin AI uses the context window to carry the SBOM slice, the call graph subgraph relevant to the query, the taint trace, and recent commits. A pure-LLM system uses the same window to carry retrieved chunks. The difference is signal density. Griffin's context is pre-filtered by the engine: if a file is not reachable from the query, it does not enter the window. Pure-LLM retrieval tends to saturate the window with lexically similar chunks, which dilutes the signal and raises token cost. This shows up in production as both slower responses and more expensive queries. For a large monorepo — think 4,000 repos, 12M lines of code — the engine-first approach is the only one that scales without context-window heroics.
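A small sketch of engine-first context assembly, assuming a reachable set already computed by the call graph; the names are hypothetical and the budget counter stands in for real token accounting. The point is that the reachability filter runs before any of the budget is spent.

```go
package main

import "fmt"

// assembleContext admits a file into the window only if the call graph says it
// is reachable from the query. All names here are hypothetical.
func assembleContext(candidates []string, reachable map[string]bool, budget int) []string {
	var window []string
	for _, f := range candidates {
		if !reachable[f] {
			continue // engine pre-filter: unreachable code never enters the window
		}
		if budget == 0 {
			break
		}
		window = append(window, f)
		budget-- // stand-in for subtracting a real per-file token count
	}
	return window
}

func main() {
	reachable := map[string]bool{
		"cmd/gateway/main.go":     true,
		"internal/http/server.go": true,
	}
	candidates := []string{
		"cmd/gateway/main.go",
		"vendored/old_http2_copy.go", // lexically similar to the query, never reachable
		"internal/http/server.go",
	}
	fmt.Println(assembleContext(candidates, reachable, 8))
	// Output: [cmd/gateway/main.go internal/http/server.go]
}
```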
How do the two stacks handle new CVEs after training cutoff?
This is where the divergence sharpens. When CVE-2024-3094 (XZ Utils backdoor) was disclosed on March 29, 2024, any model trained before that date had zero knowledge of it. A pure-LLM system cannot reason about a CVE it doesn't know about unless the CVE details are retrieved into its prompt. Even then, its reasoning is only as good as the retrieved description — it cannot walk the actual dependency chain to tell you whether liblzma.so.5 is linked into any of your production binaries. Griffin AI, by contrast, consumes the CVE feed through Safeguard's engine. When a new CVE appears, the engine matches purls and updates the reachable-set cache; Griffin reasons over the updated artifacts. The model's training cutoff is irrelevant because the engine supplies the facts. That's why post-disclosure response time matters more than parameter count.
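Here is roughly what the feed-side matching reduces to, as a Go sketch with a deliberately simplified version comparison. A production matcher must also handle pre-release tags and full purl vers ranges; this sketch skips both, and nothing in it depends on what any model was trained on.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits "v0.14.0" into integer fields. Pre-release tags and build
// metadata are deliberately ignored in this sketch.
func parse(v string) []int {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		out[i], _ = strconv.Atoi(p)
	}
	return out
}

func less(a, b string) bool {
	x, y := parse(a), parse(b)
	for i := 0; i < len(x) && i < len(y); i++ {
		if x[i] != y[i] {
			return x[i] < y[i]
		}
	}
	return len(x) < len(y)
}

// affected applies a "< fixedIn" range from a freshly disclosed CVE to the
// component version the SBOM already resolved. No model in the loop.
func affected(resolved, fixedIn string) bool {
	return less(resolved, fixedIn)
}

func main() {
	fmt.Println(affected("v0.14.0", "0.17.0")) // true: flag it, recompute the reachable set
	fmt.Println(affected("v0.17.0", "0.17.0")) // false: already on the fixed version
}
```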
How does this affect the developer experience?
From the outside, both platforms offer chat, IDE plugins, and PR review. The experiential difference is that Griffin AI's suggestions come with links — into the call graph viewer, the SBOM explorer, the CVE record, the proposed patch diff with a compile status badge. A developer reviewing a Griffin-generated PR can click through to the trace that justifies the change. Pure-LLM suggestions come with prose. The prose may be right. It may be beautifully worded. But it is not auditable, and for teams under SOC 2 or FedRAMP review, unauditable AI output is a liability. We built Griffin to be boring in exactly this respect: every claim ties to a record the engine maintains.
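As a sketch of what "every claim ties to a record" means structurally, here is a hypothetical audit record in Go. The field names and URI schemes are illustrative, not Griffin's actual schema; what matters is that the claim never serializes without the references that justify it.

```go
package main

import (
	"encoding/json"
	"os"
)

// Hypothetical audit record; illustrative fields, not Griffin's schema.
type Finding struct {
	CVE          string `json:"cve"`
	Claim        string `json:"claim"`
	CallPathRef  string `json:"call_path_ref"`  // link into the call graph viewer
	SBOMRef      string `json:"sbom_ref"`       // link into the SBOM explorer
	PatchDiffRef string `json:"patch_diff_ref"` // PR diff with compile status
	CompileState string `json:"compile_state"`
}

func main() {
	f := Finding{
		CVE:          "CVE-2023-44487",
		Claim:        "reachable via x/net/http2.Server.ServeConn",
		CallPathRef:  "callgraph://edge-gateway/trace/44487",
		SBOMRef:      "sbom://edge-gateway/golang.org/x/net@v0.14.0",
		PatchDiffRef: "pr://edge-gateway/upgrade-x-net",
		CompileState: "pass",
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(&f) // an auditor can follow every reference without trusting the prose
}
```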
How Safeguard Helps
Safeguard's architecture choice is that the language model should never be the system of record. The engine — SBOM, call graph, taint analysis, provenance — is the ground truth. Griffin AI reasons on top of it, which is why its outputs carry citations and why its benchmark numbers (81% hypothesis accuracy, 73% auto-PR compile rate, 98% adversarial resistance) are reproducible. If you are mid-evaluation against a pure-LLM competitor, trace the path of a single question through both stacks and note where the facts come from. The architecture story writes itself once you look at the wiring.