AI Security

Scaling Across Repos: Griffin AI vs Mythos

Multi-repo security reasoning is a graph problem, not a retrieval problem. How Griffin AI's engine scales where pure-LLM products flatten into guesswork.

Shadab Khan
Security Engineer
6 min read

Every security product works on a single repo. Show me one that works across four thousand. Cross-repo reasoning is the gauntlet that exposes the difference between a chat interface over a CVE feed and a real security platform. A vulnerability in a shared internal library — say, an authentication helper used by 82 services — has 82 different impact profiles depending on how each service calls it. A pure-LLM system asked "where are we affected" has no mechanism to traverse that cross-repo graph; it can only retrieve and guess. Griffin AI was built for this specific problem because Safeguard's customer profile skews toward large organizations with dense internal dependencies. The engine maintains a multi-repo call graph at ingestion time, which lets Griffin answer "who calls this function" across every connected repo in the tenant with sub-second latency. That capability is structural. Pure-LLM products in the Mythos class can paper over it in a demo, but the mask slips under load.
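
To make the shape of that concrete, here is a minimal sketch of what it means to record cross-repo call edges at ingestion rather than infer them at query time. The repo names, manifests, and symbols below are hypothetical, not Griffin's internal schema.

```python
# Each repo's manifest tells the engine which internal libraries it depends on.
manifests = {
    "payments-service": {"auth-lib": "2.3.1"},
    "billing-service":  {"auth-lib": "2.3.1"},
    "docs-site":        {},  # never imports auth-lib
}

# Calls observed while parsing each repo: (repo, caller, imported package, callee).
observed_calls = [
    ("payments-service", "HandleCharge", "auth-lib", "ValidateToken"),
    ("billing-service",  "RunInvoice",   "auth-lib", "ValidateToken"),
]

# A cross-repo edge exists only when the manifest confirms the import.
cross_repo_edges = [
    (f"{repo}::{caller}", f"{pkg}::{callee}")
    for repo, caller, pkg, callee in observed_calls
    if pkg in manifests.get(repo, {})
]

print(cross_repo_edges)
# [('payments-service::HandleCharge', 'auth-lib::ValidateToken'),
#  ('billing-service::RunInvoice', 'auth-lib::ValidateToken')]
```

Because the edges are persisted at ingestion, "where are we affected" becomes a graph lookup instead of a retrieval problem.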

What breaks first in pure-LLM systems at multi-repo scale?

The embedding index. Vector stores do not scale gracefully past a few hundred million tokens: query latency rises, cross-chunk coherence drops, and the top k-nearest-neighbor results start to include false neighbors that happen to share vocabulary. A function called validateToken in auth-lib, another in payments-service, and a third in a test fixture will all be retrieved together. The LLM has no way to tell which is the definition, which is the caller, and which is the test. Griffin AI doesn't rely on embeddings for structural queries; the engine builds a typed call graph where callers and callees are first-class edges, not similarities. Ask "who calls validateToken" and the engine returns the real callers regardless of repo boundary.
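
A minimal illustration of the difference, with hypothetical symbols: a name-only match returns every validateToken look-alike, while a query over typed call edges keyed to fully-qualified symbols returns only the real callers of the one definition being asked about.

```python
# Three unrelated symbols happen to share the name "validateToken".
symbols = [
    "auth-lib//src/token.ts::validateToken",               # the definition
    "payments-service//src/middleware.ts::validateToken",  # a caller that wraps it
    "auth-lib//test/fixtures.ts::validateToken",           # a test double
]

# Typed call edges keyed to fully-qualified symbols: (caller, callee).
calls = [
    ("payments-service//src/middleware.ts::validateToken", "auth-lib//src/token.ts::validateToken"),
    ("checkout-service//src/session.ts::startSession",     "auth-lib//src/token.ts::validateToken"),
]

def name_match(name: str) -> list[str]:
    # Roughly what vocabulary-level retrieval degenerates to: anything sharing the name.
    return [s for s in symbols if s.endswith("::" + name)]

def callers_of(symbol: str) -> list[str]:
    # The structural query: follow call edges into one specific definition.
    return [caller for caller, callee in calls if callee == symbol]

print(name_match("validateToken"))   # all three look-alikes, roles indistinguishable
print(callers_of("auth-lib//src/token.ts::validateToken"))  # the two real callers
```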

How big is the graph in practice?

A representative enterprise tenant we watch has 4,237 repos, 12.1M lines of code across Go, Java, Python, and TypeScript, and roughly 87M call graph edges when cross-repo links are included. The SBOM for that tenant resolves to 38,000 unique components across twelve ecosystems. Griffin AI persists the graph and the SBOM in the engine, which is why queries like "for CVE-2024-6387 (regreSSHion, disclosed July 1, 2024), which of our services link OpenSSH between 8.5p1 and 9.7p1 and expose the vulnerable code path from a reachable interface" come back in a few seconds. A retrieval-based system asked the same question has to pull candidate chunks from each repo's index, fuse them, and hope; in our tests, fuse-and-hope produced both a lot of noise and a non-trivial rate of missed true positives.
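
As a rough sketch of what that query reduces to once the SBOM and reachability data are already joined: the service names, versions, and reachability flags below are invented, and the 8.5p1 to 9.7p1 range is regreSSHion's published affected range.

```python
import re

def openssh_key(v: str) -> tuple[int, int, int]:
    """Turn an OpenSSH version string like '9.7p1' into a sortable tuple.
    Simplified for illustration; real version handling has more edge cases."""
    major, minor, patch = re.fullmatch(r"(\d+)\.(\d+)p(\d+)", v).groups()
    return (int(major), int(minor), int(patch))

# Hypothetical joined view: one row per service, SBOM version plus a
# reachability flag computed from the call graph.
services = [
    {"name": "bastion-gw",  "openssh": "9.6p1", "sshd_reachable": True},
    {"name": "build-agent", "openssh": "9.7p1", "sshd_reachable": False},
    {"name": "edge-proxy",  "openssh": "9.8p1", "sshd_reachable": True},   # already patched
    {"name": "legacy-vpn",  "openssh": "8.5p1", "sshd_reachable": True},
]

lo, hi = openssh_key("8.5p1"), openssh_key("9.7p1")   # affected range for CVE-2024-6387
affected = [
    s["name"] for s in services
    if lo <= openssh_key(s["openssh"]) <= hi and s["sshd_reachable"]
]
print(affected)  # ['bastion-gw', 'legacy-vpn']
```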

How does cross-repo reachability differ from intra-repo reachability?

Within a single repo, reachability is a walk of the local call graph. Across repos, it requires resolving internal dependencies — did payments-service import auth-lib at version 2.3.1 where the vulnerable function lives, or version 2.3.2 where it was replaced? That resolution uses the SBOM and the manifest, and it is deterministic. Pure-LLM systems approximate it with token patterns: if both repos mention the function name, assume a link. That approximation breaks on every version bump, every private fork, every internal monorepo path-alias. Griffin AI does the real resolution. The 81% hypothesis accuracy we publish is measured across exactly this kind of multi-repo query; hypotheses that fail resolution never reach the model.
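
A hedged sketch of what that deterministic resolution looks like, using the auth-lib example from above; the 2.0.0 introduction point, the lockfile contents, and the helper names are all assumptions for illustration.

```python
def semver(v: str) -> tuple[int, ...]:
    """Parse '2.3.1' into a comparable tuple; simplified, no pre-release handling."""
    return tuple(int(part) for part in v.split("."))

# The vulnerable helper ships in auth-lib versions below 2.3.2 (per the example
# above); the 2.0.0 introduction point is a hypothetical placeholder.
VULN_INTRODUCED = semver("2.0.0")
VULN_FIXED      = semver("2.3.2")

# What each consumer's lockfile actually resolves to (hypothetical).
resolved = {
    "payments-service": semver("2.3.1"),
    "billing-service":  semver("2.3.2"),
}

def vulnerable_edge(consumer: str) -> bool:
    """A cross-repo edge into the vulnerable function exists only if the
    consumer's resolved version still ships that function."""
    v = resolved[consumer]
    return VULN_INTRODUCED <= v < VULN_FIXED

print({svc: vulnerable_edge(svc) for svc in resolved})
# {'payments-service': True, 'billing-service': False}
```

No token-pattern guess survives this check; either the lockfile resolves into the vulnerable range or it doesn't.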

What happens with SBOM joins at scale?

Shared components are the usual attack surface. When CVE-2024-3094 (XZ Utils, disclosed March 29, 2024) landed, the critical question was not whether xz-utils was in the SBOM of any given service — it was in almost everyone's. The critical question was which services linked the specific backdoored binary paths and which built against an alternate toolchain. Griffin AI answered that by joining the SBOM across every repo in the tenant and highlighting the subset whose build environments matched the known-malicious release tags. Within hours, impacted customers had a reviewed list of affected services with taint-verified reachability. Pure-LLM systems can return a list of services that mention xz in their dependencies. That's a start, not an answer.
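
A sketch of the join, with invented service data: 5.6.0 and 5.6.1 are the xz release tags known to carry the CVE-2024-3094 backdoor, and the build-provenance criterion below is deliberately simplified.

```python
MALICIOUS_XZ_TAGS = {"5.6.0", "5.6.1"}  # the backdoored xz releases

# Hypothetical per-service SBOM entries with build provenance attached.
sboms = {
    "git-mirror":  [{"name": "xz-utils", "version": "5.6.1", "built_from": "upstream-tarball"}],
    "ci-runner":   [{"name": "xz-utils", "version": "5.6.0", "built_from": "pinned-distro-package"}],
    "api-gateway": [{"name": "xz-utils", "version": "5.4.6", "built_from": "pinned-distro-package"}],
    "ml-trainer":  [{"name": "zlib",     "version": "1.3.1", "built_from": "pinned-distro-package"}],
}

def mentions_xz(service: str) -> bool:
    # The "start, not an answer" query: does xz show up in the SBOM at all?
    return any(c["name"] == "xz-utils" for c in sboms[service])

def likely_impacted(service: str) -> bool:
    # The join that matters: a malicious release tag AND a build path that
    # pulled in the tampered artifact (provenance check simplified here).
    return any(
        c["name"] == "xz-utils"
        and c["version"] in MALICIOUS_XZ_TAGS
        and c["built_from"] == "upstream-tarball"
        for c in sboms[service]
    )

print([s for s in sboms if mentions_xz(s)])      # ['git-mirror', 'ci-runner', 'api-gateway']
print([s for s in sboms if likely_impacted(s)])  # ['git-mirror']
```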

Does this create an ingestion cost problem?

It creates an ingestion cost, yes — which is exactly why we built the engine. Parsing a 4,000-repo tenant into a queryable call graph is not free; it's measured in compute-hours per full rebuild. But amortized over the tens of thousands of queries run against the graph per week, the cost per answer is low. A pure-LLM retrieval pipeline looks cheaper at ingestion (embeddings are faster than AST parsing) and ends up more expensive per answer (bigger prompts, lower precision, more retries). This is the kind of trade that only reveals itself after production use. Prospective buyers should ask for both ingestion cost and cost-per-correct-answer; the first favors retrieval, the second favors the engine.
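
A back-of-envelope version of that comparison; every number below is a made-up placeholder, not a measured Safeguard or Mythos figure.

```python
def cost_per_correct_answer(ingest_cost: float, queries: int, per_query_cost: float,
                            precision: float, avg_attempts: float) -> float:
    """Amortized ingestion plus per-query spend, divided by answers that were correct."""
    total_spend = ingest_cost + queries * per_query_cost * avg_attempts
    correct = queries * precision
    return total_spend / correct

# Hypothetical: expensive ingestion, cheap precise queries...
graph_engine = cost_per_correct_answer(
    ingest_cost=5_000.0, queries=50_000, per_query_cost=0.02, precision=0.85, avg_attempts=1.0)
# ...versus cheap ingestion, bigger prompts, lower precision, more retries.
retrieval = cost_per_correct_answer(
    ingest_cost=500.0, queries=50_000, per_query_cost=0.08, precision=0.55, avg_attempts=1.4)

print(f"graph engine: ${graph_engine:.3f} per correct answer")  # 0.141
print(f"retrieval:    ${retrieval:.3f} per correct answer")     # 0.222
```

With placeholder numbers like these, the higher ingestion cost amortizes away over a week of queries, which is the shape of the trade described above.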

How does the 98% adversarial resistance hold up across repos?

Cross-repo adversarial prompts are a specific attack surface. A bad actor with read access to one repo might try to coerce the model into revealing findings from another repo, or into producing a patch that introduces a backdoor into a shared library. Griffin AI enforces tenant and project scopes at the engine level before the LLM is invoked; a prompt that references a repo outside the caller's scope returns an authorization error, not a leak. Patches generated for shared libraries pass through a build-and-test sandbox that includes the dependent services, which catches the case where a patch weakens a check in a way that only manifests in downstream consumers. Pure-LLM systems with a flatter architecture have fewer enforcement points; their adversarial resistance numbers tend to degrade at scale, whereas Griffin's 98% holds across our multi-repo red-team suite.
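
A minimal sketch of that enforcement ordering, assuming a hypothetical scope table and error type; the point is that the authorization check is structural and happens before any model call.

```python
class ScopeError(Exception):
    pass

# Hypothetical caller-to-repo scope table, resolved from tenant/project config.
caller_scopes = {
    "dev-team-payments": {"payments-service", "auth-lib"},
}

def call_model(question: str, repos: set[str]) -> str:
    return "model answer"  # stand-in for the actual LLM invocation

def answer(caller: str, question: str, repos_referenced: set[str]) -> str:
    allowed = caller_scopes.get(caller, set())
    out_of_scope = repos_referenced - allowed
    if out_of_scope:
        # The model is never invoked; the caller gets an error, not a leak.
        raise ScopeError(f"not authorized for: {sorted(out_of_scope)}")
    return call_model(question, repos_referenced)

answer("dev-team-payments", "is ValidateToken affected?", {"auth-lib"})   # allowed
# answer("dev-team-payments", "show hr-portal findings", {"hr-portal"})   # raises ScopeError
```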

What does the developer experience look like at this scale?

A developer in a 4,000-repo org does not want to know about every CVE in every repo. They want to know about the ones that affect their code, their code paths, and their deploy targets. Griffin AI scopes findings to the caller's project context and the actual reachable set; a finding in an unreachable branch of an unimported module never surfaces. Pure-LLM systems tend to over-report, partly because retrieval can't establish reachability and partly because the model hedges to avoid missing true positives. The developer's inbox fills with noise. Over a quarter, that's the difference between an AI tool that accelerates work and one that competes with it.
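
In filtering terms, and with entirely hypothetical findings, that scoping amounts to two predicates applied before anything reaches the developer.

```python
# Hypothetical findings; "reachable" comes from the call-graph analysis.
findings = [
    {"id": "F-101", "repo": "payments-service", "symbol": "ValidateToken", "reachable": True},
    {"id": "F-102", "repo": "payments-service", "symbol": "legacyDecrypt", "reachable": False},
    {"id": "F-103", "repo": "hr-portal",        "symbol": "exportCsv",     "reachable": True},
]

def findings_for(developer_repos: set[str]) -> list[str]:
    """Surface only findings in the developer's repos that sit on a reachable path."""
    return [f["id"] for f in findings
            if f["repo"] in developer_repos and f["reachable"]]

print(findings_for({"payments-service"}))  # ['F-101']; unreachable and out-of-scope findings never surface
```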

How Safeguard Helps

Safeguard's engine maintains a multi-repo call graph and a joined SBOM across your tenant, which is what lets Griffin AI answer reachability questions that pure-LLM products cannot. The architecture is why our published numbers — 81% hypothesis accuracy, 73% auto-PR compile rate, 98% adversarial resistance — survive at enterprise scale instead of degrading as repo count grows. If you are evaluating a competitor on a monorepo alone, add a second and third repo with cross-service dependencies. The structural differences become visible within the first few queries.
