
LLM Selection

Matching the right model family to the right security task for cost and quality.

What is LLM selection?

LLM selection is the engineering discipline of choosing which model runs which step of an AI workflow — based on the quality bar, latency budget, and cost envelope of that specific step. It is not a vendor decision made once at the top of the architecture diagram. It is a task-by-task routing question, answered with evals.

Modern model families — Anthropic's Opus, Sonnet, and Haiku classes; OpenAI's comparable tiers; strong open-source options — span two orders of magnitude in cost and a wide spread in reasoning depth. Using the top-tier model for every step wastes budget on tasks that don't need it. Using the smallest for every step misses the problems that actually require reasoning. The win is routing.

How it works

A working playbook for a security pipeline:

  1. Opus-class for deep reasoning. Exploit hypothesis generation, cross-file taint reasoning, novel CVE triage — tasks where the cost of a wrong answer dwarfs the per-token cost of the call. This is where you pay up.
  2. Sonnet-class for drafting and synthesis. Remediation PR descriptions, advisory summarisation, structured extraction from long documents. The sweet spot where capability and cost meet for most production workflows.
  3. Haiku-class for scale. Classification, routing, metadata extraction across millions of items, cache-friendly lookups. Cheap per call, fast per call, correct often enough when the task is well-bounded.
  4. Eval-gated fallbacks. A routing layer can try the cheap model first and escalate on low confidence or eval-rubric failure. The escalation is measured: how often, on what cases, with what downstream cost. You only pay the Opus tax when the evidence says the cheap answer is going to be wrong. A sketch of this pattern follows the list.
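
A minimal sketch of that routing layer, in Python. Everything here is illustrative: the tier names, the `call_model` stand-in, and the `passes_rubric` gate are placeholders for your provider SDK and your own eval rubric, not any vendor's API.

```python
"""Eval-gated fallback routing: try the cheap tier first, escalate on
low confidence or rubric failure. `call_model` is a toy stand-in for a
real provider SDK call; `passes_rubric` is where your eval gate lives."""
from dataclasses import dataclass

TIERS = ["haiku-class", "sonnet-class", "opus-class"]  # cheapest first

@dataclass
class Result:
    model: str
    answer: str
    confidence: float  # 0.0-1.0, from the model or a separate verifier

def call_model(model: str, task: str) -> Result:
    # Toy stand-in: confidence rises with tier so the sketch runs end to end.
    depth = TIERS.index(model)
    return Result(model, f"[{model}] answer to: {task}", 0.5 + 0.2 * depth)

def passes_rubric(r: Result, min_confidence: float = 0.8) -> bool:
    # A real gate adds task-specific checks: schema validity, grounding, etc.
    return r.confidence >= min_confidence

def route(task: str) -> Result:
    for model in TIERS[:-1]:
        result = call_model(model, task)
        if passes_rubric(result):
            return result  # the cheap answer cleared the bar: stop paying
        # Record every escalation so frequency and downstream cost stay measured.
        print(f"escalating past {model} (confidence {result.confidence:.2f})")
    return call_model(TIERS[-1], task)  # the Opus tax, paid on evidence

print(route("classify this finding").model)  # -> opus-class in this toy
```

The design choice that matters is the ordering: cheapest tier first, with every escalation logged, so the "how often, on what cases" question stays answerable from data.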

Why it matters

At security-team scale, "always use the best model" is a cost line that will eventually be questioned by finance, and "always use the cheapest" is a quality line that will eventually be questioned by engineering. Task-by-task selection avoids both conversations by making the tradeoff explicit and per-step.

The deeper reason it matters: model families drift, new models ship, prices move. A pipeline with per-task selection can absorb those changes incrementally — swap the drafting step, re-run evals, ship — instead of re-architecting end-to-end. Selection is a lever, not a commitment.
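
Concretely, the lever can be as small as a per-step registry, where a model swap is a one-line change gated by that step's eval suite. A sketch, with hypothetical step names, model IDs, and scores:

```python
# Per-step model registry: a swap is a one-line change, shipped only
# after that step's evals pass. All names and scores are illustrative.
PIPELINE = {
    "triage":      "haiku-class",
    "hypothesis":  "opus-class",
    "patch_draft": "sonnet-class",  # candidate for a cheaper swap
    "reviewer":    "sonnet-class",
}

# Per-step quality bars, set by whatever eval rubric each step uses.
REQUIRED_SCORE = {"triage": 0.90, "hypothesis": 0.95,
                  "patch_draft": 0.85, "reviewer": 0.90}

def run_step_evals(step: str, model: str) -> float:
    """Toy stand-in for the eval harness: score a (step, model) pair."""
    toy_scores = {("patch_draft", "haiku-class"): 0.81,
                  ("patch_draft", "new-sonnet-class"): 0.88}
    return toy_scores.get((step, model), 0.0)

def swap_step(step: str, candidate: str) -> bool:
    """Swap one step's model iff the candidate clears that step's bar."""
    if run_step_evals(step, candidate) >= REQUIRED_SCORE[step]:
        PIPELINE[step] = candidate   # swap, re-run evals, ship
        return True
    return False                     # failed the bar; nothing changes

swap_step("patch_draft", "haiku-class")       # False: 0.81 < 0.85
swap_step("patch_draft", "new-sonnet-class")  # True:  0.88 >= 0.85
```

The swap touches one registry entry and one eval run; the rest of the pipeline never notices.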

What value it adds

  • Cost drops 5–20x on the right steps

    Swapping drafting and classification steps from Opus-class to Sonnet- or Haiku-class routinely cuts end-to-end spend by an order of magnitude with no measurable quality drop; a worked example follows this list.

  • Quality improves on the hard steps

    Concentrating the big-model budget on the genuinely hard reasoning (exploit hypothesis generation, complex remediation) lifts the scores that matter instead of spreading that budget thin.

  • Latency budget becomes meetable

    Sonnet- and Haiku-class models return in a fraction of the time an Opus-class call takes. Interactive surfaces become possible on steps that used to be "come back in 30 seconds."

  • Resilience to provider changes

    When a provider retires or reprices a model, you replace it per-step against the eval suite instead of doing a full pipeline rewrite. Migration costs amortise.

  • Defensible cost narrative

    "Here's the model assigned to each step, the eval score we require, and the cost per case" is a conversation you can have with a CFO. "We're paying a lot for AI" is not.

How Safeguard uses it

Inside Griffin AI, each step of the pipeline — triage, hypothesis, patch draft, reviewer — is bound to a specific model family chosen against the eval rubric for that step. Every routing decision in AI remediation is gated by the eval harness so a cheaper model can't silently take over a step it shouldn't.
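
In miniature, that gate can look like a binding step that refuses any model scoring below the step's eval bar. The step names, bars, and harness call below are hypothetical sketches, not Griffin's actual internals.

```python
# A hypothetical deploy-time binding gate: a model may only take over
# a step if the eval harness scores it at or above that step's bar.
EVAL_BAR = {"triage": 0.90, "hypothesis": 0.95,
            "patch_draft": 0.85, "reviewer": 0.90}

def harness_score(step: str, model: str) -> float:
    """Toy stand-in for the eval harness; wire this to real evals."""
    return {("triage", "haiku-class"): 0.93,
            ("hypothesis", "haiku-class"): 0.71}.get((step, model), 0.0)

def bind(step: str, model: str) -> str:
    """Bind a model to a step, loudly refusing anything below the bar."""
    score = harness_score(step, model)
    if score < EVAL_BAR[step]:
        # Fail loudly: a cheaper model cannot silently take over a step.
        raise ValueError(f"{model} scores {score:.2f} on {step}; "
                         f"bar is {EVAL_BAR[step]:.2f}")
    return model

bind("triage", "haiku-class")        # ok: 0.93 >= 0.90
# bind("hypothesis", "haiku-class")  # raises: 0.71 < 0.95
```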

Right model, right step, measured.

See how Safeguard routes each Griffin step to the cheapest model that meets the eval bar — and escalates only when the evidence says to.