
Context Window Limits: Griffin AI vs Mythos

Context-window size matters less than context quality. A look at how Griffin AI's engine-grounded context beats pure-LLM retrieval at monorepo scale.

Nayan Dey
Senior Security Engineer
6 min read

A friend at a large fintech once showed me a Slack thread where their AI security vendor had quoted a million-token context window like it was a capability. Two weeks later the same vendor had failed to find a Spring4Shell-style path reachable from a public endpoint, despite the vulnerable code being well within that window. The window was big. The selection was wrong. That mismatch — between context size and context quality — is the single most misunderstood dimension of AI security tooling. Griffin AI lives on the quality side of that trade. Safeguard's engine assembles a context slice that contains only the files, functions, and dependency paths reachable from the question being asked. Pure-LLM products in the Mythos class lean on vector-similarity retrieval, which packs the window with chunks that look related but may not be. The accuracy delta is not a function of parameter count; it's a function of what sits in the first few kilobytes of prompt.

Why does context quality trump context size?

Language models weight their attention across the entire window, but the signal they can extract per token is bounded. If half your window is filled with semantically similar but structurally irrelevant code, you've halved the effective reasoning capacity. Think of it like filling a surgeon's operating field with three patients' X-rays; the resolution is fine, the selection is catastrophic. Griffin AI constructs the context in layers: the SBOM slice relevant to the query, the call graph subgraph from the entry point to the suspected sink, the taint trace, the CVE metadata, and finally the source code of just the functions on that path. For a typical vulnerability triage, that slice lands between 12k and 60k tokens depending on the depth of the graph. The window is never the bottleneck. The selection is.
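To make the layering concrete, here is a minimal Python sketch of that assembly step. The ContextSlice container, the estimate_tokens heuristic, and the engine.sbom_slice / engine.subgraph / engine.taint_trace / engine.cve_metadata / engine.functions_on_path calls are hypothetical stand-ins, not Safeguard's actual API; what matters is the order of the layers and the hard token budget.

```python
from dataclasses import dataclass, field

# Rough heuristic: roughly 4 characters per token for source text.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

@dataclass
class ContextSlice:
    layers: list = field(default_factory=list)  # (label, text) pairs, highest signal first
    token_count: int = 0

    def add(self, label: str, text: str, budget: int = 60_000) -> bool:
        """Append a layer only if the whole slice stays under the budget."""
        cost = estimate_tokens(text)
        if self.token_count + cost > budget:
            return False
        self.layers.append((label, text))
        self.token_count += cost
        return True

    def render(self) -> str:
        return "\n\n".join(f"## {label}\n{text}" for label, text in self.layers)

def build_slice(query, engine):
    """Layer the slice in the order described above: SBOM, call-graph
    subgraph, taint trace, CVE metadata, then source of on-path functions.
    All engine.* methods are hypothetical names used for illustration."""
    ctx = ContextSlice()
    ctx.add("sbom", engine.sbom_slice(query))
    ctx.add("call-graph", engine.subgraph(query.entry_point, query.sink))
    ctx.add("taint-trace", engine.taint_trace(query))
    ctx.add("cve-metadata", engine.cve_metadata(query))
    for fn in engine.functions_on_path(query):
        ctx.add(f"source:{fn.name}", fn.source)
    return ctx
```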

What goes wrong when retrieval replaces reachability?

Retrieval-augmented generation retrieves by embedding similarity. A function named parseUserInput and a function named parseQuery will be retrieved together even if they live in unrelated modules. A wrapper around exec() and a wrapper around subprocess.run() will be retrieved together because they share vocabulary. None of that tells you whether any of those functions is reachable from an untrusted input. When CVE-2023-38545 (SOCKS5 heap overflow in curl) was disclosed, the correct question for any consumer was: is the affected code path reached from any SOCKS5 proxy configuration that your users can trigger? A retrieval-first system would pull every curl-related file in the repo, flood the window with curl_easy_setopt calls, and miss that only three of them flow into a reachable handler. Griffin AI's engine resolves the dependency at the build-graph level and only hands the three relevant handlers to the LLM. The window is smaller. The answer is correct.
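The difference is easy to see in code. Below is a rough sketch of the two selection strategies side by side, using networkx for the call graph; the chunk dicts, entry_points, and sinks are illustrative inputs rather than anything Griffin actually exposes. Similarity ranks by cosine distance alone, while the reachability filter keeps only functions that sit on some path from an untrusted entry point to a suspected sink.

```python
import numpy as np
import networkx as nx

def top_k_by_similarity(query_vec, chunks, k=20):
    """Pure-retrieval selection: cosine similarity only, no notion of reachability.
    Each chunk is a dict with an 'embedding' vector (illustrative schema)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return scored[:k]

def reachable_functions(call_graph: nx.DiGraph, entry_points, sinks):
    """Engine-style selection: keep only functions that lie on some path
    from an untrusted entry point to a suspected sink."""
    forward = set()
    for ep in entry_points:
        if ep in call_graph:
            forward |= nx.descendants(call_graph, ep) | {ep}
    backward = set()
    for sink in sinks:
        if sink in call_graph:
            backward |= nx.ancestors(call_graph, sink) | {sink}
    return forward & backward
```

In the curl example, the reachability filter is what collapses "every curl_easy_setopt call in the repo" down to the three handlers that actually matter.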

How does signal density affect cost?

Token cost scales with context size. If Mythos-class products burn 400k tokens to answer a question Griffin answers with 30k, the economics matter — especially for large monorepos with thousands of queries per day. One of our platform customers runs roughly 18,000 Griffin queries per week across 4,200 repos. At the density ratios we measure against pure-LLM retrieval approaches, the engine-grounded path is between 5x and 12x cheaper per correct answer. That's not a small number when security tooling costs are under scrutiny. And the cost advantage compounds because a smaller prompt is also a faster prompt; latency drops from the 30-second range into the 4-8 second range, which changes whether developers actually use the thing.
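One way to see the economics is to compute cost per correct answer rather than cost per prompt, as in the sketch below. The token counts, accuracy figures, and the $3-per-million-token price are placeholder assumptions for illustration, not measured benchmarks.

```python
def cost_per_correct_answer(tokens_per_query: int, accuracy: float,
                            price_per_mtok: float = 3.00) -> float:
    """Dollars per *correct* answer: prompt cost divided by the chance it was right."""
    cost_per_query = tokens_per_query / 1_000_000 * price_per_mtok
    return cost_per_query / accuracy

# Placeholder inputs, not benchmark data.
engine_grounded = cost_per_correct_answer(tokens_per_query=30_000, accuracy=0.81)
retrieval_only = cost_per_correct_answer(tokens_per_query=180_000, accuracy=0.55)
print(f"engine-grounded: ${engine_grounded:.3f} per correct answer")
print(f"retrieval-only:  ${retrieval_only:.3f} per correct answer")
print(f"ratio: {retrieval_only / engine_grounded:.1f}x")
```

With these placeholder inputs the ratio lands around 9x, inside the 5x-12x band above; plug in your own measured token counts and accuracy to see where your workload falls.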

Is a million-token window useless?

No, but its usefulness differs from what the marketing implies. Griffin AI uses extended context to carry historical artifacts: the last 50 commits on the affected path, prior CVE attributions, compliance exceptions logged against the component, and the prior PR that introduced the vulnerability. Those are all high-signal when the engine selects them. A pure-LLM system with the same window tends to fill it with chunks it retrieved by similarity, which is lower-signal. So the window is a tool, and tool utility depends on the hand holding it. The architectural claim here is not "small context good." It is "engine-selected context good; similarity-selected context noisy."
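Continuing the ContextSlice sketch from earlier, those history layers might be appended after the core slice in a fixed priority order, so they are the first thing dropped when the budget tightens. The engine.recent_commits, engine.prior_cves, engine.compliance_exceptions, and engine.introducing_pr calls are hypothetical names, not a real API.

```python
def add_history(ctx, query, engine, budget: int = 900_000):
    """Append engine-selected history after the core slice, highest signal first.
    Stops as soon as a layer no longer fits the (generous) extended budget."""
    history_layers = [
        ("recent-commits", engine.recent_commits(query.path, limit=50)),
        ("prior-cves", engine.prior_cves(query.component)),
        ("compliance-exceptions", engine.compliance_exceptions(query.component)),
        ("introducing-pr", engine.introducing_pr(query.path)),
    ]
    for label, text in history_layers:
        if not ctx.add(label, text, budget=budget):
            break
    return ctx
```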

What happens on a 12-million-line monorepo?

This is where the differences become dramatic. Monorepos of that size have cross-service call graphs that no retrieval system can faithfully compress. Griffin AI persists the call graph in Safeguard's engine and slices it per query — pulling, say, the subgraph from payments-api entry points down to crypto-lib signing functions. The LLM sees a coherent trace. A pure-LLM system attempts the same task by retrieving the top-k chunks and praying that the critical edges survive. They often do not. In our benchmarks against CWE-918 (SSRF) across a large monorepo, the engine-plus-LLM approach correctly identified 87% of reachable SSRF sinks; the retrieval-only approach landed in the low 50s, with a substantial tail of false positives driven by similar-looking but unreachable code.
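As a sketch of what "a coherent trace" means in practice, the snippet below renders a few concrete entry-to-sink paths from a persisted networkx call graph. The payments-api and crypto-lib node names are illustrative, borrowed from the example above; the real engine's graph store is obviously not a toy DiGraph.

```python
import networkx as nx

def render_trace(call_graph: nx.DiGraph, entry: str, sink: str, max_paths: int = 3) -> str:
    """Render a handful of concrete entry-to-sink paths as a trace the LLM can
    follow, rather than a bag of disconnected chunks."""
    lines = []
    for i, path in enumerate(nx.all_simple_paths(call_graph, entry, sink, cutoff=12)):
        if i >= max_paths:
            break
        lines.append(" -> ".join(path))
    return "\n".join(lines) if lines else f"no path from {entry} to {sink}"

# Hypothetical slice: payments-api entry point down to a crypto-lib signing function.
g = nx.DiGraph()
g.add_edges_from([
    ("payments-api.handle_charge", "payments-core.build_request"),
    ("payments-core.build_request", "crypto-lib.sign_payload"),
    ("billing-batch.retry_job", "payments-core.build_request"),  # caller not on the queried path
])
print(render_trace(g, "payments-api.handle_charge", "crypto-lib.sign_payload"))
```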

Does this matter for frontier models specifically?

It does, because frontier models will keep getting bigger context windows. That's good news for everyone. But bigger windows do not fix selection. A frontier model given the right 30k tokens will outperform the same model given the wrong 900k tokens — every time, at every parameter count we've tested. Griffin AI is designed to benefit from frontier-model improvements because the engine's job is to hand the model a better prompt. When a new model drops, we swap it in behind the engine and the accuracy numbers tick up. Pure-LLM products also benefit from new models, but they inherit the selection problem unchanged. The floor rises. The ceiling stays put.
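The decoupling is simple to express: the selector stays fixed and the model is an injected dependency. A rough sketch, reusing the hypothetical build_slice from the earlier example; ChatModel.complete is a stand-in for whatever completion API the deployed model exposes.

```python
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

def triage(query, engine, model: ChatModel) -> str:
    """Selection logic is untouched when a new frontier model is swapped in."""
    ctx = build_slice(query, engine)  # engine-selected context from the earlier sketch
    prompt = f"{ctx.render()}\n\n## Question\n{query.text}"
    return model.complete(prompt)
```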

How does this connect to the published benchmarks?

The 81% hypothesis accuracy, 73% auto-PR compile rate, and 98% adversarial resistance numbers Griffin publishes all depend on context selection. Hypothesis accuracy is driven by whether the LLM is reasoning over the right files; compile rate is driven by whether the patch was generated against the correct function signatures; adversarial resistance is partly driven by denying the model the context it needs to produce unsafe output. Each of those numbers is a function of the engine's selection quality. Scale the window without improving the selector and none of the numbers move.

How Safeguard Helps

Safeguard's engine solves the selection problem that context windows alone can't fix. Griffin AI operates on precisely the slice of the codebase, dependency graph, and taint trace that the question requires — which is why our benchmarks hold across monorepos from small Go services to 12M-line enterprise platforms. If you're evaluating a pure-LLM alternative, measure tokens per correct answer, not tokens per prompt. The economics and the accuracy move together once the engine is in the picture.
