AI Security

Throughput At Scale: Griffin AI vs Mythos

Engine work parallelises cleanly. Model calls do not. We explain why Griffin AI's throughput scales with CPU while Mythos-class tools bottleneck on rate limits.

Nayan Dey
Security Engineer
6 min read

A security tool that works on one repository is not the same product as a security tool that works on two thousand. The failure mode at scale is usually not correctness, it is throughput. A scanner that takes ninety seconds on a single service takes ninety seconds times two thousand on a fleet, unless the architecture can parallelise. Pure-LLM tools hit a wall at scale that engine-plus-LLM architectures do not, and the difference is visible in the runtime of the first overnight batch job that sweeps an entire estate.

Two kinds of work behave differently under load

A vulnerability scan has two kinds of work. There is engine work, which is deterministic computation over structured data: parsing lockfiles, building dependency graphs, matching versions against CVE ranges, evaluating policy expressions. Engine work parallelises cleanly. Throw more CPU at it and it finishes proportionally faster. The bottleneck is disk I/O or database throughput, and both are mature engineering problems with known solutions.
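
To make that concrete, here is a minimal sketch of engine-style work fanned out over a process pool. The lockfile format, the advisory table, and the function names are illustrative assumptions, not Griffin AI's actual implementation; the point is that this kind of work is pure computation and finishes faster as you add cores.

```python
# Sketch of CPU-bound "engine work": parse a lockfile, match pinned versions
# against advisory ranges, and fan the scans out over a process pool.
from concurrent.futures import ProcessPoolExecutor

# Hypothetical advisory table: package -> (first fixed version, advisory id)
ADVISORIES = {
    "libalpha": ((2, 4, 1), "CVE-2024-0001"),
    "libbeta": ((1, 0, 9), "CVE-2024-0002"),
}

def parse_lockfile(text):
    """Parse 'name==x.y.z' lines into (name, version-tuple) pairs."""
    for line in text.splitlines():
        name, _, version = line.partition("==")
        if version:
            yield name.strip(), tuple(int(p) for p in version.strip().split("."))

def scan_repository(lockfile_text):
    """Deterministic engine pass: pure computation, no network, no model."""
    findings = []
    for name, version in parse_lockfile(lockfile_text):
        advisory = ADVISORIES.get(name)
        if advisory and version < advisory[0]:  # affected if below first fixed version
            findings.append((name, advisory[1]))
    return findings

if __name__ == "__main__":
    fleet = ["libalpha==2.3.0\nlibbeta==1.1.0"] * 2000   # stand-in for 2,000 repositories
    with ProcessPoolExecutor() as pool:                  # throughput scales with CPU cores
        results = list(pool.map(scan_repository, fleet, chunksize=50))
    print(sum(len(r) for r in results), "findings across the fleet")
```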

There is also model work, which is calls to a language model provider. Model work does not parallelise cleanly, for two reasons. First, the provider enforces rate limits on tokens per minute, requests per minute, and often concurrent requests. Second, model calls have latency that is bounded below by the model itself. A fleet-wide scan that depends on the model for every repository hits the provider's rate limit before the first wave of scans has finished, and the remaining scans queue.
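
A toy model of the queueing makes the effect visible. The cap, the per-call latency, and the call count below are illustrative assumptions, not figures for any specific provider or tool; the shape of the result is what matters.

```python
# Toy simulation of a provider-side cap: calls arrive "in parallel", but the
# provider admits at most CAP_PER_MINUTE requests per minute, so later calls wait.
CAP_PER_MINUTE = 1000      # hypothetical aggregate requests-per-minute limit
CALL_LATENCY_S = 5.0       # lower bound set by the model itself
CALLS_IN_FLIGHT = 10_000   # e.g. 2,000 scans x 5 model calls each

def completion_time_s(call_index: int) -> float:
    """The i-th admitted call cannot start before the cap allows it."""
    admit_time = (call_index // CAP_PER_MINUTE) * 60.0   # which one-minute window admits it
    return admit_time + CALL_LATENCY_S

print("first call finishes at %5.1f s" % completion_time_s(0))
print("median call finishes at %5.1f s" % completion_time_s(CALLS_IN_FLIGHT // 2))
print("last call finishes at  %5.1f s" % completion_time_s(CALLS_IN_FLIGHT - 1))
# The first call is fast; the median and the tail are set by the cap, not by
# how many scans you launched at once.
```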

Griffin AI separates these two kinds of work. Most of a scan runs in the engine, which scales with CPU, and only the narrow, reasoning-heavy subset is delegated to the model. When a batch of two thousand repositories scans in parallel, the engine portion of each scan runs on whatever worker pool capacity is available, and only a small fraction of the work ever reaches the model queue. The model queue drains quickly because it is asked to do little per scan.
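
The shape of that separation looks roughly like the sketch below. The names, the dedup key, and the in-memory store are illustrative, not Griffin AI's internals: the engine produces findings, and only findings the estate has not seen before are queued for the model.

```python
# Engine-plus-LLM scan shape: deterministic work stays in the engine, and only
# novel findings are handed to a separate, rate-limit-aware model worker.
import hashlib
from queue import Queue

seen_findings = set()        # in production this would be a shared store, not a set
model_queue = Queue()        # drained independently by the model worker

def finding_key(finding):
    """Stable identity for a finding so duplicates across repositories collapse."""
    raw = f"{finding['package']}|{finding['advisory']}|{finding['reachability']}"
    return hashlib.sha256(raw.encode()).hexdigest()

def scan(repo_findings):
    for finding in repo_findings:          # engine output: already computed, CPU-bound
        key = finding_key(finding)
        if key in seen_findings:
            continue                       # handled entirely in the engine, no model call
        seen_findings.add(key)
        model_queue.put(finding)           # the narrow, reasoning-heavy subset

scan([{"package": "libalpha", "advisory": "CVE-2024-0001", "reachability": "direct"}])
scan([{"package": "libalpha", "advisory": "CVE-2024-0001", "reachability": "direct"}])
print(model_queue.qsize())   # 1: the second repository's duplicate never reaches the model
```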

A Mythos-class pure-LLM tool does the opposite. Every scan is a series of model calls. Two thousand scans in parallel mean thousands of model calls in flight simultaneously. The provider rate limit is hit within seconds, the tool starts queueing, and the advertised ninety-second scan time becomes twenty minutes because each scan is waiting in line behind the others.

Rate limits cannot be solved by paying more

It is tempting to think that a pure-LLM vendor can simply purchase more capacity from the model provider. In practice, rate limits at enterprise tiers are measured in tokens per minute, and they are high but not infinite. Anthropic, OpenAI, and Google all set aggregate caps that become binding for high-fanout workloads. Raising those caps costs more at every tier, and pure-LLM tools pass that cost through in their pricing. Paying for throughput headroom becomes a direct line item on the tool's invoice.

Griffin AI does not have this problem at the same magnitude because the model is not on the hot path for most of the work. The model is called when the engine produces something novel, and the novelty rate across a fleet is low. A two-thousand-repository fleet scan might produce fifty to a hundred novel findings across the whole estate. Those findings go to the model. The other forty thousand findings are handled entirely in the engine.
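
The back-of-envelope arithmetic is worth writing out. The fleet figures come from the paragraph above; the per-scan call count for a pure-LLM tool is an assumption for illustration, since "a series of model calls" varies by tool.

```python
# Model-call volume for the fleet scan described above, under each architecture.
repositories = 2_000
total_findings = 40_000
novel_findings = 100                       # upper end of the 50-100 range above

calls_engine_plus_llm = novel_findings     # one call per novel finding
calls_per_scan_pure_llm = 10               # assumed: every scan is a series of model calls
calls_pure_llm = repositories * calls_per_scan_pure_llm

print(calls_engine_plus_llm, calls_pure_llm, calls_pure_llm // calls_engine_plus_llm)
# -> 100 vs 20,000 calls: two orders of magnitude more load on the provider queue
```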

Where the bottlenecks actually sit

On Griffin AI, the throughput bottleneck at scale is usually the lockfile ingestion layer, which is a queueing problem solved by horizontal scaling of the ingestion workers. After ingestion, the engine throughput is a function of CPU and the underlying graph database. We have benchmarked the system at ten thousand concurrent scans on a moderately sized cluster, with median scan latency under four seconds and ninety-fifth percentile latency under fifteen seconds.

On Mythos-class tools, the bottleneck is the model provider. We have benchmarked two of them at comparable scales, and both saturated their rate limits within the first minute of the batch. The median scan latency rose from ninety seconds to several minutes as the provider queue filled. The ninety-fifth percentile latency was effectively unbounded, because the tool retries on rate-limit errors and the retries compete with the pending queue.

What this means for CI pipelines

A CI pipeline cannot tolerate unbounded scan latency. Developers expect a pull request check to return a verdict within a few minutes. If the scanner queues for twenty minutes during a busy hour, developers start merging without waiting for the check, or the organisation disables the gate to preserve throughput. The security guarantee evaporates.

Griffin AI's engine-heavy architecture is friendly to CI because the engine's per-scan cost is low and predictable, and the model layer is invoked only when there is something genuinely new to reason about. A typical CI scan completes in under ten seconds, including the model call if one is needed. Peak traffic at the end of a sprint does not materially change that number because the engine scales horizontally.

A Mythos-class tool's per-scan latency is dominated by the slowest model call in the chain. Peak traffic makes the tail heavier because the provider rate limit binds. Teams that deploy pure-LLM tools into aggressive CI pipelines often end up with a two-tier workflow: a shallow, cheap scan on pull requests, and a deep scan run out-of-band overnight. The shallow scan does not produce the depth of analysis the tool was originally sold for, and the overnight scan runs into the rate-limit problem anyway.

Throughput as a capability, not a metric

Throughput at scale is not a benchmark number you put on a slide. It is the capability that determines whether the tool can live on the hot path of every build, every pull request, every deployment. A scanner that works in a nightly batch but not in CI is a reporting tool. A scanner that works in CI is a control.

Griffin AI was built on the assumption that every production service in the estate would be scanned on every commit, every deploy, and every dependency update. That assumption forced the architecture to keep the model off the hot path. An engine-plus-LLM design was the only way to hit the latency targets without drowning in model-provider bills, and the throughput profile we see in production validates the decision.

Pure-LLM tools can demo beautifully on a single repository. They cannot sustain the same quality at enterprise fleet scale because the architectural constraint, tokens per minute at the provider, is outside their control. When evaluating tools for a production estate, the most informative experiment is not a single scan. It is a burst of two thousand scans in parallel, with wall-clock latency measured at the median and the ninety-fifth percentile. The distribution tells you whether the tool has the architecture to live where your code lives.
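
A harness for that experiment fits in a few lines. Here is a minimal sketch: trigger_scan is a placeholder for whichever tool's API or CLI you are testing, and the worker count and repository list are assumptions you would adjust to your own estate.

```python
# Fire a burst of scans, record per-scan wall-clock latency, report median and p95.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def trigger_scan(repo: str) -> None:
    """Placeholder: invoke the scanner under test and block until the verdict."""
    time.sleep(0.01)  # replace with the real scan invocation

def timed_scan(repo: str) -> float:
    start = time.monotonic()
    trigger_scan(repo)
    return time.monotonic() - start

if __name__ == "__main__":
    repos = [f"repo-{i}" for i in range(2000)]
    with ThreadPoolExecutor(max_workers=200) as pool:    # a burst, not a trickle
        latencies = sorted(pool.map(timed_scan, repos))
    print("median %.1fs  p95 %.1fs" % (statistics.median(latencies),
                                       latencies[int(0.95 * len(latencies))]))
```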
