
LLM Selection: The Cost-Quality Trade-off for Security

LLM selection is ultimately a cost-quality optimisation under workflow constraints. The curve is not smooth, and the right point on it depends on where errors land in your pipeline.

Nayan Dey
AI Platform Engineer
7 min read

Every serious discussion of LLM selection eventually collapses into a cost-quality trade-off. How much quality are you willing to sacrifice to save compute, and how much compute are you willing to spend to gain quality? The easy answer is to pick the best model for the highest-stakes tasks and the cheapest model for the lowest-stakes tasks. The harder answer, and the one this post is about, involves understanding why the cost-quality curve is not smooth, why the right point on it depends on more than just task stakes, and how to build a selection process that gets you there consistently.

The curve is not smooth

People often talk about cost-quality as if it were a continuous line you can slide along. In reality, the frontier of available models has clumps and gaps. There is usually a cluster of tiny models around a certain capability level, a cluster of mid-sized models a step above, and a cluster of frontier models above that. Between clumps, the curve has steep slopes where spending a little more money buys a lot more capability, and shallow slopes where spending a lot more buys marginal gains.

This matters for selection because the optimal point is rarely in the middle of a flat region. It is almost always at a clump, where you are getting the best capability per dollar. Picking a model from a sparse region of the curve means paying for capacity you do not fully use. Picking from a clump gives you the best efficiency.
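To make the clump intuition concrete, here is a minimal sketch comparing capability per dollar across a hypothetical candidate set. The model names, quality scores, and prices are illustrative placeholders, not real vendor figures.

```python
# Hypothetical candidates: (name, quality score on your eval set, $ per Mtok).
candidates = [
    ("tiny-a", 0.62, 0.15),
    ("tiny-b", 0.64, 0.20),
    ("mid-a", 0.81, 1.10),
    ("mid-b", 0.82, 1.60),
    ("frontier-a", 0.90, 8.00),
]

for name, quality, price in candidates:
    print(f"{name:12s} quality={quality:.2f}  quality/$ = {quality / price:.2f}")

# Clumps show up as near-ties in quality at different prices: within a
# clump, take the cheapest member; between clumps, ask whether the
# quality step is worth the price step for your workflow.
```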

The curve also changes fast. A new model release can create a new clump at a lower price point or shift the relative position of existing models. A selection decision made twelve months ago might be on a poor point of today's curve. Re-evaluating model choices on a regular cadence, rather than committing to them permanently, is part of good hygiene.

Where errors land matters more than error rate

The naive way to compare models is by error rate on a benchmark. A model with five percent error is supposedly worse than one with three percent error. But for security workflows, where errors land is often more important than how often they happen.

Consider two models classifying findings as exploitable or not. Model A has a five percent error rate, split evenly between false positives and false negatives. Model B has a three percent error rate, concentrated almost entirely in false negatives. In many security pipelines, Model A is the better choice despite its higher overall error rate because the cost of a missed exploitable finding is much higher than the cost of a false alarm that gets filtered downstream.

This principle generalises. Evaluate models on cost-weighted error metrics that reflect your pipeline's sensitivities. Count the dollars you would spend if the error slipped through. A cheap model that errs in cheap ways can beat an expensive model that errs in expensive ways.
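Here is a minimal sketch of what that cost-weighted comparison looks like for the two models above, assuming a false alarm costs a few dollars of analyst triage time and a missed exploitable finding costs a few hundred. Both dollar figures are placeholders; substitute your own pipeline's numbers.

```python
# Assumed per-error costs; plug in your pipeline's real sensitivities.
COST_FALSE_POSITIVE = 5.0     # hypothetical $ to triage and dismiss a false alarm
COST_FALSE_NEGATIVE = 500.0   # hypothetical $ expected cost of a missed exploit

def cost_weighted_error(fp_rate: float, fn_rate: float) -> float:
    """Expected dollar cost per finding classified."""
    return fp_rate * COST_FALSE_POSITIVE + fn_rate * COST_FALSE_NEGATIVE

# Model A: 5% total error, split evenly between FP and FN.
model_a = cost_weighted_error(fp_rate=0.025, fn_rate=0.025)
# Model B: 3% total error, concentrated in false negatives.
model_b = cost_weighted_error(fp_rate=0.002, fn_rate=0.028)

print(f"Model A: ${model_a:.2f} per finding")  # 0.025*5 + 0.025*500 ≈ $12.63
print(f"Model B: ${model_b:.2f} per finding")  # 0.002*5 + 0.028*500 ≈ $14.01
```

Under these assumed costs, Model A wins despite its higher raw error rate, which is exactly the inversion the benchmark number hides.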

Token budgets have structure

Cost in LLM selection is usually expressed as price per million tokens. That is a simplification. Real token costs have structure. Input tokens and output tokens are priced differently. Long context windows often charge a premium over shorter ones. Reasoning models burn extra tokens on internal traces that you pay for even if they are hidden. Caching can dramatically reduce effective cost for repeated context.

When comparing models, build a cost model that reflects your actual usage pattern. If your workload has long shared context across many queries, a model with good prompt caching might be cheaper than a model with nominally lower token prices. If your workload is latency-bound, a model optimised for fast first-token output might be worth more than its nominal price suggests.
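As a sketch, a per-query cost model that reflects this structure might look like the following. The prices and cache discount are assumed placeholders, not any particular vendor's rates.

```python
# A per-query cost model with structured token pricing; all rates are
# hypothetical placeholders.
def query_cost(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,          # input tokens served from prompt cache
    in_price: float = 3.0,           # assumed $ per million input tokens
    out_price: float = 15.0,         # assumed $ per million output tokens
    cache_discount: float = 0.1,     # assumed: cached input billed at 10%
) -> float:
    fresh = input_tokens - cached_tokens
    return (
        fresh * in_price
        + cached_tokens * in_price * cache_discount
        + output_tokens * out_price
    ) / 1_000_000

# Long shared context, mostly cache hits: caching dominates the bill.
print(query_cost(input_tokens=50_000, output_tokens=500, cached_tokens=45_000))
# Short, uncached queries: the output price dominates instead.
print(query_cost(input_tokens=1_000, output_tokens=2_000))
```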

The same structural point applies to quality. Quality is not a single number. It is a bundle of capabilities. A model that is great at code understanding but poor at structured output might be a good choice for one workflow and a bad choice for another. Decomposing quality into the capabilities your workflow actually needs prevents paying for capabilities you do not use.

Routing is a selection strategy

The best response to a heterogeneous cost-quality curve is not to pick a single model. It is to route queries to different models based on their characteristics. Simple queries go to cheap models. Complex queries go to expensive models. Queries that need current data go through grounding. Queries that need style consistency go to fine-tuned models.

Routing turns model selection from a one-time decision into a continuous optimisation. A routing system measures the characteristics of each incoming query, predicts which model will produce the best cost-quality outcome, and sends the query there. Good routing can reduce overall cost by a factor of five or ten while maintaining quality, because it avoids sending easy queries to expensive models.

The engineering cost of routing is real. You need classifiers that can predict query difficulty, fallback logic for when the chosen model fails, monitoring to catch drift, and an evaluation harness that measures quality across the whole routed pipeline rather than any single model. Teams that commit to routing usually build this infrastructure incrementally, starting with a simple rule-based router and adding sophistication as they learn what matters.
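A first-cut rule-based router can be as simple as the following sketch. The tier names, thresholds, and stub model client are all hypothetical; a production router replaces the heuristics with a learned difficulty classifier and adds the monitoring and evaluation described above.

```python
# A minimal rule-based router with fallback. Tiers and heuristics are
# illustrative placeholders.
def needs_fresh_data(query: str) -> bool:
    # Placeholder heuristic; real systems use intent classification.
    return any(w in query.lower() for w in ("latest", "current", "today"))

def route(query: str) -> str:
    """Pick a model tier from cheap heuristics on the query itself."""
    if needs_fresh_data(query):
        return "grounded-tier"        # route through retrieval/grounding
    if len(query) > 4000 or "stack trace" in query.lower():
        return "frontier-tier"        # long or clearly complex input
    return "cheap-tier"               # default: most traffic is easy

def call_model(tier: str, query: str) -> str:
    # Stub standing in for your actual model client.
    return f"[{tier}] response to: {query[:40]}"

def answer(query: str) -> str:
    tier = route(query)
    try:
        return call_model(tier, query)
    except Exception:
        # Fall back to the most capable tier if the chosen model fails.
        return call_model("frontier-tier", query)

print(answer("what is the latest advisory for openssl?"))
```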

The headroom question

A specific trap we see often is picking a model that just barely meets current quality needs. This leaves no headroom for growth. When your product adds a new feature that uses the same model more aggressively, the model falls over. When input complexity drifts upward over time, the model's error rate climbs. Selecting at the edge of what you need means you are always one change away from being under-resourced.

The right amount of headroom depends on how quickly your workloads change. For stable workloads, selecting close to the minimum acceptable quality is fine because the environment is predictable. For workloads that are growing or changing in character, picking one step up on the capability ladder gives you room to absorb change without scrambling to reselect.

The latency floor

Latency is a hard constraint more often than cost is. Many security workflows have interactive or near-real-time requirements. A pull request check that takes thirty seconds is tolerable. One that takes three minutes is not. A triage tool that shows findings in under a second feels responsive. One that takes ten seconds per finding does not.

Latency places a floor under how fast your models need to be. Models that cannot clear that floor are excluded regardless of their cost or quality. Respecting the latency floor up front simplifies selection because it narrows the candidate set immediately. Trying to retrofit latency into a system built around a slow model is painful and usually ends with replacing the model.

The selection process that works

A selection process that produces good results consistently has a few components. Start with a clear description of the workflow, including its latency requirements, its volume, its cost budget, and where errors are most and least expensive. Identify candidate models at a few points on the cost-quality curve that might fit. Run them all against a representative evaluation set that uses cost-weighted metrics. Shortlist the candidates that meet the hard constraints and compare their remaining trade-offs.
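As a sketch, the shortlisting step might look like this: hard constraints enforced first, then survivors ranked on the cost-weighted error from your evaluation run. All candidate numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    p95_latency_s: float        # measured on your representative eval set
    cost_per_1k_queries: float
    weighted_error: float       # cost-weighted error from the eval run

MAX_P95_LATENCY_S = 2.0   # the latency floor: slower candidates are out
BUDGET_PER_1K = 5.0       # hard cost budget

# Illustrative numbers only.
candidates = [
    Candidate("tiny-a", 0.4, 0.50, 14.2),
    Candidate("mid-a", 1.1, 2.10, 9.8),
    Candidate("frontier-a", 6.5, 18.00, 7.1),  # best quality, fails both constraints
]

shortlist = sorted(
    (c for c in candidates
     if c.p95_latency_s <= MAX_P95_LATENCY_S
     and c.cost_per_1k_queries <= BUDGET_PER_1K),
    key=lambda c: c.weighted_error,
)
for c in shortlist:
    print(c.name, c.weighted_error)
```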

Do not skip shadow deployment. Running a candidate alongside your current model on real traffic, comparing outputs without letting the new model influence user-facing behaviour, reveals problems that benchmarks do not. Teams that go from benchmark to production without shadow deployment regularly ship regressions that would have been obvious with two weeks of shadow data.
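A minimal shadow harness, with stub model clients standing in for real ones, might look like the sketch below. The key property is that the candidate's output is logged for offline comparison but never shown to users.

```python
import json, time

def incumbent_model(q: str) -> str:   # stub for your current model client
    return f"incumbent answer to {q!r}"

def candidate_model(q: str) -> str:   # stub for the model under evaluation
    return f"candidate answer to {q!r}"

def log_comparison(query: str, incumbent: str, candidate: str) -> None:
    record = {"ts": time.time(), "query": query,
              "incumbent": incumbent, "candidate": candidate}
    print(json.dumps(record))  # stand-in for your real log sink

def serve(query: str) -> str:
    incumbent_out = incumbent_model(query)      # user-facing answer
    try:
        candidate_out = candidate_model(query)  # shadow call, never shown
        log_comparison(query, incumbent_out, candidate_out)
    except Exception as e:
        log_comparison(query, incumbent_out, f"<error: {e}>")
    return incumbent_out

print(serve("is this finding exploitable?"))
```

In production you would fire the shadow call asynchronously so it cannot add user-facing latency or take down the serving path; the synchronous version above is only to keep the sketch short.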

And accept that selection is a recurring task, not a one-time event. Budget for re-evaluation. The model you pick this quarter may not be the right choice next quarter, and that is fine if your process makes re-selection cheap.
