A tiered model strategy looks simple on a slide. In practice, routing the right task to the right tier is the hardest design problem in production AI security tooling. Get it wrong in one direction and your Opus bill erases your operating margin. Get it wrong in the other direction and Haiku hallucinates a remediation that breaks production. Griffin AI has converged on a tiering strategy after running hundreds of thousands of supply chain scans, and this post is a walkthrough of how each tier earns its place and why Mythos-class pure-LLM tools cannot adopt the same pattern without rewriting themselves.
The three tiers and what they are actually good at
Opus is a reasoning tier. Its cost per token is high, its latency is the longest of the three, and its value shows up in tasks where the inputs are ambiguous, the outputs carry real consequences, and the chain of reasoning has to hold across many steps. In Griffin AI, Opus handles residual triage on findings where the engine's evidence is contradictory, novel CVE interpretation where advisories are partial or conflicting, and architectural review questions that require understanding intent rather than structure. Opus is also the tier that gets invoked when a customer asks a complex question in natural language about their posture. It is the tier we trust to say "I do not know" when the evidence does not support a conclusion.
Sonnet is a drafting tier. Its cost is a fraction of Opus's, its latency is workable for interactive use, and its quality on structured drafting tasks is high enough that the marginal improvement from Opus does not justify the price. In Griffin AI, Sonnet writes the human-readable remediation plans, the pull request comments, the executive summaries, and the policy evaluation explanations. Sonnet is also the tier we use for SBOM diffs and change summaries, because those tasks are primarily about producing clear prose over a well-structured input. When the engine has done the structural work, Sonnet is almost always the right choice for the last mile.
Haiku is a scale tier. Its cost is low enough that we can afford to invoke it on every item in a batch, its latency is low enough that it disappears inside CI, and its quality on narrow classification tasks is strong. In Griffin AI, Haiku classifies findings by exploitability tier, normalises advisory text across ecosystems, tags license obligations, and handles the high-volume formatting work. A scan that produces eight hundred raw matches might produce fifteen things worth a Sonnet draft, but eight hundred things worth a Haiku classification, and the architecture has to pay for the latter at Haiku prices.
The routing logic
The routing decision in Griffin AI is not a model call. It is a deterministic function of the engine's evidence. If the engine has a high-confidence reachability verdict and a clean CVE-to-version match, the finding goes to Haiku for classification and never touches a more expensive tier. If the engine has structural evidence but the advisory is ambiguous, Sonnet drafts an explanation and the finding is attached to a remediation plan. If the engine cannot resolve the finding, either because the advisory contradicts the observed behaviour or because the call graph has unresolved indirection, the finding is escalated to Opus with a clearly-scoped question and the full evidence bundle.
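That decision tree can be sketched as a plain function. The evidence fields, the confidence threshold, and the tier names below are illustrative assumptions for the sketch, not Griffin AI's actual schema; the point is that every branch tests a value the engine has already computed, so no model is consulted to pick a model.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HAIKU = "haiku"    # scale tier: narrow classification
    SONNET = "sonnet"  # drafting tier: explanations and plans
    OPUS = "opus"      # reasoning tier: escalations only

@dataclass
class Evidence:
    reachability_confidence: float  # engine's confidence in its reachability verdict
    clean_version_match: bool       # CVE-to-version match resolved without ambiguity
    advisory_ambiguous: bool        # advisory text is partial or conflicting
    unresolved_indirection: bool    # call graph contains edges the engine could not resolve

def route(evidence: Evidence) -> Tier:
    # Escalate first: evidence the engine cannot resolve goes to the reasoning tier
    # with a scoped question and the full evidence bundle.
    if evidence.unresolved_indirection or (
        evidence.advisory_ambiguous and not evidence.clean_version_match
    ):
        return Tier.OPUS
    # High-confidence structural evidence: cheap classification, nothing more.
    if evidence.reachability_confidence >= 0.9 and evidence.clean_version_match:
        return Tier.HAIKU
    # Structural evidence with an ambiguous advisory: draft an explanation.
    return Tier.SONNET
```

Because the function is deterministic and cheap, it can run on every finding in a scan; the expensive tiers are only ever reached by construction, never by default.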
The engine is what makes the routing possible. Without structural evidence, there is no signal to route on. The router would have to use a model to decide which model to use, which is absurd on its face and also cost-catastrophic in practice. This is the trap that pure-LLM tools fall into whenever they try to retrofit tiering.
Why Mythos-class tools cannot tier
A pure-LLM architecture uses the model to understand the input. The model is the parser, the matcher, and the reasoner. If the Mythos-class tool tries to downgrade the parsing step to Haiku, Haiku does not have the context window or the structured-output reliability to handle an enterprise lockfile, a full CVE advisory, and a multi-hop dependency tree in the same prompt. The tool either produces malformed output or misses findings. If the tool downgrades only the easy cases to Haiku, it still needs a frontier model to decide which cases are easy, which means the cost of routing eats the savings from the routing.
We have benchmarked this pattern in three pure-LLM tools over the past year. In every case, either the vendor uses the frontier model for everything and the cost is high, or the vendor downgrades aggressively and the accuracy collapses. There is no stable intermediate. The only architecture that supports stable tiering is engine-plus-LLM, because only the engine produces the structured signal that makes routing trustworthy.
What the tier distribution looks like in practice
On a representative week of production scans, the Griffin AI tier distribution lands roughly at: Haiku for 82 percent of model-layer invocations, Sonnet for 14 percent, and Opus for 4 percent. The token cost distribution is almost the inverse: Opus consumes about half of the spend despite carrying 4 percent of the invocations, Sonnet consumes about a third, and Haiku, despite being the volume tier, is the smallest line item. This is the cost curve we wanted. Spend is concentrated on the tasks that actually require reasoning, and the long tail of classification is cheap.
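The inversion falls out of the arithmetic. The per-invocation costs below are illustrative assumptions chosen only to roughly mirror the price gap between tiers, not published pricing, but any similarly shaped cost ratio produces the same picture: the volume tier dominates invocations while the reasoning tier dominates spend.

```python
# Hypothetical per-invocation costs in dollars (assumptions for illustration).
cost_per_call = {"haiku": 0.001, "sonnet": 0.02, "opus": 0.12}
# Invocation shares from the representative week described above.
share_of_calls = {"haiku": 0.82, "sonnet": 0.14, "opus": 0.04}

calls = 100_000  # hypothetical weekly volume of model-layer invocations
spend = {t: calls * share_of_calls[t] * cost_per_call[t] for t in cost_per_call}
total = sum(spend.values())
for tier in ("haiku", "sonnet", "opus"):
    print(f"{tier}: {share_of_calls[tier]:.0%} of calls, {spend[tier] / total:.0%} of spend")
```

Under these assumptions Opus carries 4 percent of invocations but over half the spend, Sonnet about a third, and Haiku, despite 82 percent of invocations, the smallest share, which is the shape of the production curve.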
A Mythos-class tool running the same workload has a single flat line. Every invocation costs the same, because every invocation goes through the same model. That flat line is much higher than Griffin AI's average, and it scales linearly with scan volume because there is no mechanism to make cheap work cheap.
Practical implications for security teams
Tiering gives us two things that matter operationally. The first is that we can afford to run Opus on the genuinely hard cases without flinching, because Opus is not burning cycles on version string comparisons. That means better triage on the findings that matter and fewer false negatives on novel CVEs. The second is that we can afford to invoke Haiku liberally on bulk classification, which gives us the volume of signal needed to drive the dashboards and the policy gates.
A pure-LLM tool has to choose between budgeting for reasoning and budgeting for scale. It cannot do both. Griffin AI does both because the architecture lets it. When you evaluate tools, ask how many distinct models are invoked during a scan, what the per-invocation cost is at each tier, and what the tier distribution looks like on a realistic workload. The answer tells you whether the vendor has the architecture to keep costs bounded as your estate grows.
Model tiering is not a marketing label. It is a design discipline that starts with having an engine in front of the models, and it is the reason Griffin AI's cost curve bends where Mythos-class curves do not.