Reasoning models are the category that has changed most in the past year. Where earlier models produced answers by fluent text generation, reasoning models generate intermediate thinking steps that expose their chain of logic before committing to a final answer. For security work, that exposed reasoning is valuable, but it also introduces new evaluation challenges. A reasoning model can reach the right answer through faulty logic, or the wrong answer through logic that looks impeccable. Traditional evaluation methods do not catch either case.
What makes reasoning models different
A reasoning model does not just produce an answer. It produces a trace of intermediate reasoning that led to the answer. That trace is sometimes visible to users and sometimes hidden, but it always exists and always influences the final output. For security tasks, the reasoning trace often matters more than the final answer because security work is about justified decisions, not just correct ones. A model that says "this finding is critical" without justification is less useful than one that says "this finding is critical because the vulnerable sink is reachable from an unauthenticated endpoint."
Reasoning models also exhibit different failure modes from earlier models. They can over-reason, producing pages of analysis for simple questions. They can under-reason, committing to an answer early in the trace and rationalising it in the rest. They can hallucinate supporting facts inside their reasoning trace, building a plausible-sounding argument on a fabricated premise. Each of these failure modes needs specific evaluation attention.
Answer correctness is not enough
The simplest evaluation approach is to score only the final answer. For a classification task, this might be accuracy or F1. For a generation task, it might be a similarity metric against a reference answer. This approach misses everything interesting about reasoning models because it ignores the reasoning trace.
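For concreteness, here is a minimal sketch of answer-only scoring for a triage classification task; the field names are illustrative, not from any particular tool:
```python
# Answer-only scoring: compares final labels and ignores the reasoning trace.
# Field names ("predicted", "expected") are illustrative.

def answer_only_accuracy(results: list[dict]) -> float:
    """Fraction of items where the model's final label matches the reference."""
    if not results:
        return 0.0
    correct = sum(1 for r in results if r["predicted"] == r["expected"])
    return correct / len(results)

batch = [
    {"predicted": "false_positive", "expected": "false_positive"},
    {"predicted": "critical", "expected": "low"},
]
print(answer_only_accuracy(batch))  # 0.5 -- says nothing about why either call was made
```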
Two models can reach the same correct answer with very different reasoning quality. One model might correctly identify that a finding is a false positive because the upstream framework sanitises input. Another might reach the same conclusion because it happened to see similar-looking code marked as false positive in training. The first model will generalise to new frameworks with similar properties. The second will not. Answer correctness alone cannot distinguish them.
Teams that rely only on answer correctness also miss cases where the model got lucky. On small evaluation sets, a model that arrives at answers through shaky reasoning can score well by chance. Reasoning quality is the signal that tells you whether the model will continue performing well on new inputs. Ignore it and you lose the most useful predictor of real-world behaviour.
Reasoning trace evaluation
Evaluating the reasoning trace directly is harder. There is no single reference trace to compare against. Different valid reasoning paths can reach the same correct answer. The traces are long and varied enough that string similarity metrics are not useful. Something more structured is required.
One approach is rubric-based human evaluation. Experienced security analysts score traces along dimensions like factual accuracy of intermediate claims, logical validity of inferences, appropriateness of assumptions, and coverage of relevant considerations. This gives a rich signal but it is expensive and hard to scale. For high-stakes evaluations of new models or major updates, it is worth the cost.
A scalable approximation is to use a separate strong model as a judge. A frontier model prompted with a clear rubric can score reasoning traces at a rate and cost that human analysts cannot match. The judge model is not perfect. It has biases and can be fooled by confident-sounding nonsense. But used alongside spot-checked human evaluation, it gives a useful signal on large batches of traces.
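A minimal sketch of the judge pattern, assuming an OpenAI-compatible chat client; the rubric wording, the 1-5 scale, and the judge model name are illustrative choices, not a fixed standard:
```python
# LLM-as-judge sketch: scores a reasoning trace against a short rubric.
# Assumes an OpenAI-compatible client; rubric text and scale are illustrative.
import json
from openai import OpenAI

RUBRIC = """Score the reasoning trace from 1 (poor) to 5 (excellent) on each dimension:
- factual_accuracy: are intermediate claims (CVEs, APIs, code paths) real and correct?
- logical_validity: does each inference follow from the stated premises?
- assumptions: are assumptions reasonable and explicitly flagged?
- coverage: are the relevant security considerations addressed?
Return a JSON object with those four keys and integer values."""

client = OpenAI()

def judge_trace(task: str, trace: str, final_answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{task}\n\nReasoning trace:\n{trace}\n\nFinal answer:\n{final_answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```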
A third approach is structural analysis. Break the reasoning trace into atomic claims and inferences. Check each atomic claim for factual accuracy against a ground truth database. Check each inference for logical validity. This approach scales well and catches specific failure modes like fabricated CVEs or invalid deductions. It misses more holistic failures where the structure is valid but the overall argument is weak.
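A sketch of the claim-level check for one specific failure, fabricated CVE references; the ground-truth set here is a tiny stand-in for a real CVE database lookup:
```python
# Structural check: extract CVE identifiers cited in a reasoning trace and
# verify each against a ground-truth set. Stand-in for a real NVD-backed lookup.
import re

KNOWN_CVES = {"CVE-2021-44228", "CVE-2014-0160"}  # illustrative; load from your CVE database

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

def check_cited_cves(trace: str) -> dict[str, bool]:
    """Map each CVE cited in the trace to whether it exists in the ground truth."""
    return {cve: cve in KNOWN_CVES for cve in set(CVE_PATTERN.findall(trace))}

trace = "This mirrors CVE-2021-44228, and a related issue was fixed in CVE-2023-99999."
print(check_cited_cves(trace))
# CVE-2023-99999 maps to False: cited in the trace but absent from the ground-truth set.
```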
Temporal evaluation
Security evaluation benchmarks age fast. A benchmark built from 2024 CVEs is leaky by 2026 because models have seen the answers during training. The honest evaluation approach is to hold out a temporal slice of data that the model could not have seen. Evaluate the model on vulnerabilities disclosed after its training cutoff. Evaluate it on code committed after that date. This gives a realistic picture of how the model will perform on fresh work.
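A minimal sketch of building the temporal slice, assuming each evaluation item records its disclosure (or commit) date; the cutoff date is illustrative and should come from the model card:
```python
# Temporal hold-out: keep only items the model could not have seen in training.
# Assumes each item records its public disclosure or commit date.
from dataclasses import dataclass
from datetime import date

@dataclass
class EvalItem:
    identifier: str
    disclosed: date
    # ... task prompt, reference answer, etc.

TRAINING_CUTOFF = date(2024, 10, 1)  # illustrative; take this from the model card

def temporal_holdout(items: list[EvalItem], cutoff: date = TRAINING_CUTOFF) -> list[EvalItem]:
    """Items disclosed strictly after the model's training cutoff."""
    return [item for item in items if item.disclosed > cutoff]
```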
Temporal evaluation requires discipline to maintain. Every time a new benchmark is built, its freshness clock starts ticking. By the time it is widely used, it has usually leaked into training sets and stopped being fresh. Teams that take evaluation seriously rotate their evaluation datasets regularly and treat the dataset itself as a living artefact.
For reasoning models specifically, temporal evaluation exposes whether the model can reason about novel situations or just retrieve patterns from training. A model that performs dramatically worse on post-training-cutoff data is relying more on memorisation than reasoning. That is a red flag for any use case involving new vulnerabilities, which is most security use cases.
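One way to quantify that red flag is to score the same task on a pre-cutoff and a post-cutoff slice and compare; the numbers and the warning threshold below are illustrative, not a standard:
```python
# Memorisation check: compare accuracy on pre-cutoff vs post-cutoff slices.
# A large drop suggests the model is retrieving training patterns, not reasoning.

def memorisation_gap(pre_cutoff_accuracy: float, post_cutoff_accuracy: float) -> float:
    return pre_cutoff_accuracy - post_cutoff_accuracy

gap = memorisation_gap(pre_cutoff_accuracy=0.86, post_cutoff_accuracy=0.61)
if gap > 0.15:  # illustrative threshold
    print(f"Warning: {gap:.0%} accuracy drop on fresh data; likely memorisation.")
```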
Cost and latency as part of quality
Reasoning models burn many more tokens per query than non-reasoning models because the reasoning trace has to be generated. Evaluating them fairly requires accounting for this cost. A reasoning model that is ten percent more accurate while costing five times as many tokens may or may not be a good trade, depending on the workflow.
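A worked comparison sketch; the per-token prices, token counts, and accuracies are made-up numbers for illustration:
```python
# Cost-aware comparison: accuracy gained versus per-query cost, using made-up numbers.

def cost_per_query(tokens_per_query: int, price_per_million_tokens: float) -> float:
    return tokens_per_query / 1_000_000 * price_per_million_tokens

baseline = {"accuracy": 0.72, "cost": cost_per_query(2_000, 5.0)}      # non-reasoning model
reasoning = {"accuracy": 0.79, "cost": cost_per_query(10_000, 15.0)}   # reasoning model

extra_accuracy = reasoning["accuracy"] - baseline["accuracy"]
cost_ratio = reasoning["cost"] / baseline["cost"]
print(f"+{extra_accuracy:.0%} accuracy at {cost_ratio:.0f}x the per-query cost")
```
Whether that trade is worth it depends on what a missed or mis-triaged finding costs in your workflow, which is exactly why the cost figure belongs in the evaluation rather than being an afterthought.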
Latency evaluation also matters. Reasoning models take longer to produce a final answer because they generate the intermediate trace first. For interactive workflows where users wait for responses, this latency can make an otherwise strong model unusable. Measure time-to-first-useful-token and time-to-final-answer separately. Some workflows can start acting on partial reasoning before the final answer is committed. Others cannot.
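A sketch of measuring the two latencies separately over a streaming response; stream_response is a placeholder for whatever streaming client your stack uses, and the notion of "useful" here is deliberately crude:
```python
# Latency measurement sketch: time-to-first-useful-token vs time-to-final-answer.
# stream_response is a placeholder generator; swap in your actual streaming client.
import time
from collections.abc import Iterator

def stream_response(prompt: str) -> Iterator[str]:
    """Placeholder: yields chunks of model output as they arrive."""
    yield from ["Reasoning: the sink is ", "reachable from an ",
                "unauthenticated endpoint. ", "Verdict: critical."]

def measure_latency(prompt: str) -> tuple[float, float]:
    start = time.monotonic()
    first_useful = None
    for chunk in stream_response(prompt):
        # Crude notion of "useful": the first non-empty chunk. A real harness
        # would look for the first actionable claim, not just any token.
        if first_useful is None and chunk.strip():
            first_useful = time.monotonic() - start
    final = time.monotonic() - start
    return first_useful, final

ttfut, ttfa = measure_latency("Triage this finding ...")
print(f"first useful token: {ttfut:.3f}s, final answer: {ttfa:.3f}s")
```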
Failure mode taxonomy
Building an explicit taxonomy of failure modes makes evaluation more actionable. For security reasoning, the most common failure modes we have seen are fabricated CVEs (the model cites a CVE that does not exist), fabricated code paths (the model reasons about a call chain that is not actually in the code), confident misclassification (the model commits to wrong severity with strong language), context leakage (the model's reasoning references information it should not have had access to), and scope drift (the model's final answer addresses a different question from the one asked).
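One way to make the taxonomy operational is to encode it and tally labelled failures per category, so per-mode rates can be compared across model versions; the sketch below mirrors the labels above and the counts are illustrative:
```python
# Failure mode taxonomy as code: categorise evaluation errors so per-mode
# rates can be tracked across model versions. Labels mirror the taxonomy above.
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    FABRICATED_CVE = "fabricated_cve"
    FABRICATED_CODE_PATH = "fabricated_code_path"
    CONFIDENT_MISCLASSIFICATION = "confident_misclassification"
    CONTEXT_LEAKAGE = "context_leakage"
    SCOPE_DRIFT = "scope_drift"

def failure_rates(labels: list[FailureMode], total_items: int) -> dict[str, float]:
    counts = Counter(labels)
    return {mode.value: counts[mode] / total_items for mode in FailureMode}

# Example: three labelled failures out of 50 evaluated items.
labels = [FailureMode.FABRICATED_CVE, FailureMode.FABRICATED_CVE, FailureMode.SCOPE_DRIFT]
print(failure_rates(labels, total_items=50))
```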
Scoring a model against this taxonomy gives you specific weaknesses to address. A model with a high rate of fabricated CVEs probably needs stronger grounding. A model with high confident misclassification needs calibration training. A model with scope drift needs clearer prompts or task-specific fine-tuning.
The practical evaluation harness
A useful evaluation harness for security reasoning models has several components. A temporal hold-out dataset for realism. A set of well-defined tasks that match your actual workflows. A rubric for scoring reasoning traces, backed by both human and model-judge scoring. A cost and latency measurement layer. A failure mode taxonomy for categorising errors. And continuous rotation of the evaluation data to prevent leakage over time.
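As a sketch of how those components might hang together in one place; every field name here is an assumption about your own stack, not a prescribed layout:
```python
# Harness skeleton: one object that ties the components together.
# All names are illustrative; adapt to whatever your pipeline already has.
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Optional

@dataclass
class EvalHarness:
    training_cutoff: date                              # temporal hold-out boundary
    tasks: list[str]                                   # tasks matching real workflows
    trace_rubric: dict[str, str]                       # dimension -> scoring guidance
    judge: Optional[Callable[[str], dict]] = None      # model-judge scorer, spot-checked by humans
    failure_modes: list[str] = field(default_factory=list)
    dataset_rotated_on: Optional[date] = None          # when the eval data was last refreshed

    def is_stale(self, today: date, max_age_days: int = 180) -> bool:
        """Flag the dataset for rotation once it is old enough to have leaked."""
        if self.dataset_rotated_on is None:
            return True
        return (today - self.dataset_rotated_on).days > max_age_days
```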
Investing in this harness is unglamorous work, but it pays back every time a new model is released. A team with a well-maintained evaluation harness can test a new model in a day and make an informed decision. A team without one ends up making decisions based on vendor benchmarks, which is the wrong way to run security infrastructure.