Reasoning models are the category that has changed most in the past year. Where earlier models produced answers by fluent text generation, reasoning models generate intermediate thinking steps that expose their chain of logic before committing to a final answer. For security work, that exposed reasoning is valuable, but it also introduces new evaluation challenges. A reasoning model can reach the right answer through faulty logic, or the wrong answer through logic that looks impeccable. Traditional evaluation methods do not catch either case.
What makes reasoning models different
A reasoning model does not just produce an answer. It produces a trace of intermediate reasoning that led to the answer. That trace is sometimes visible to users and sometimes hidden, but it always exists and always influences the final output. For security tasks, the reasoning trace often matters more than the final answer because security work is about justified decisions, not just correct ones. A model that says "this finding is critical" without justification is less useful than one that says "this finding is critical because the vulnerable sink is reachable from an unauthenticated endpoint."
Reasoning models also exhibit different failure modes from earlier models. They can over-reason, producing pages of analysis for simple questions. They can under-reason, committing to an answer early in the trace and rationalising it in the rest. They can hallucinate supporting facts inside their reasoning trace, building a plausible-sounding argument on a fabricated premise. Each of these failure modes needs specific evaluation attention.
Answer correctness is not enough
The simplest evaluation approach is to score only the final answer. For a classification task, this might be accuracy or F1. For a generation task, it might be a similarity metric against a reference answer. This approach misses everything interesting about reasoning models because it ignores the reasoning trace.
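For concreteness, here is a minimal sketch of answer-only scoring for a triage classification task; the field names are illustrative, not from any particular tool:
```python
# Answer-only scoring: compares final labels and ignores the reasoning trace.
# Field names ("predicted", "expected") are illustrative.

def answer_only_accuracy(results: list[dict]) -> float:
    """Fraction of items where the model's final label matches the reference."""
    if not results:
        return 0.0
    correct = sum(1 for r in results if r["predicted"] == r["expected"])
    return correct / len(results)

batch = [
    {"predicted": "false_positive", "expected": "false_positive"},
    {"predicted": "critical", "expected": "low"},
]
print(answer_only_accuracy(batch))  # 0.5 -- says nothing about why either call was made
```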
Two models can reach the same correct answer with very different reasoning quality. One model might correctly identify that a finding is a false positive because the upstream framework sanitises input. Another might reach the same conclusion because it happened to see similar-looking code marked as false positive in training. The first model will generalise to new frameworks with similar properties. The second will not. Answer correctness alone cannot distinguish them.
Teams that rely only on answer correctness also miss cases where the model got lucky. On small evaluation sets, a model that arrives at answers through shaky reasoning can score well by chance. Reasoning quality is the signal that tells you whether the model will continue performing well on new inputs. Ignore it and you lose the most useful predictor of real-world behaviour.
Reasoning trace evaluation
Evaluating the reasoning trace directly is harder. There is no single reference trace to compare against. Different valid reasoning paths can reach the same correct answer. The traces are long and varied enough that string similarity metrics are not useful. Something more structured is required.
One approach is rubric-based human evaluation. Experienced security analysts score traces along dimensions like factual accuracy of intermediate claims, logical validity of inferences, appropriateness of assumptions, and coverage of relevant considerations. This gives a rich signal but it is expensive and hard to scale. For high-stakes evaluations of new models or major updates, it is worth the cost.
A scalable approximation is to use a separate strong model as a judge. A frontier model prompted with a clear rubric can score reasoning traces at a rate and cost that human analysts cannot match. The judge model is not perfect. It has biases and can be fooled by confident-sounding nonsense. But used alongside spot-checked human evaluation, it gives a useful signal on large batches of traces.
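A minimal sketch of the judge pattern, assuming an OpenAI-compatible chat client; the rubric wording, the 1-5 scale, and the judge model name are illustrative choices, not a fixed standard:
```python
# LLM-as-judge sketch: scores a reasoning trace against a short rubric.
# Assumes an OpenAI-compatible client; rubric text and scale are illustrative.
import json
from openai import OpenAI

RUBRIC = """Score the reasoning trace from 1 (poor) to 5 (excellent) on each dimension:
- factual_accuracy: are intermediate claims (CVEs, APIs, code paths) real and correct?
- logical_validity: does each inference follow from the stated premises?
- assumptions: are assumptions reasonable and explicitly flagged?
- coverage: are the relevant security considerations addressed?
Return a JSON object with those four keys and integer values."""

client = OpenAI()

def judge_trace(task: str, trace: str, final_answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{task}\n\nReasoning trace:\n{trace}\n\nFinal answer:\n{final_answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```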
A third approach is structural analysis. Break the reasoning trace into atomic claims and inferences. Check each atomic claim for factual accuracy against a ground truth database. Check each inference for logical validity. This approach scales well and catches specific failure modes like fabricated CVEs or invalid deductions. It misses more holistic failures where the structure is valid but the overall argument is weak.
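A sketch of the claim-level check for one specific failure, fabricated CVE references; the ground-truth set here is a tiny stand-in for a real CVE database lookup:
```python
# Structural check: extract CVE identifiers cited in a reasoning trace and
# verify each against a ground-truth set. Stand-in for a real NVD-backed lookup.
import re

KNOWN_CVES = {"CVE-2021-44228", "CVE-2014-0160"}  # illustrative; load from your CVE database

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

def check_cited_cves(trace: str) -> dict[str, bool]:
    """Map each CVE cited in the trace to whether it exists in the ground truth."""
    return {cve: cve in KNOWN_CVES for cve in set(CVE_PATTERN.findall(trace))}

trace = "This mirrors CVE-2021-44228, and a related issue was fixed in CVE-2023-99999."
print(check_cited_cves(trace))
# CVE-2023-99999 maps to False: cited in the trace but absent from the ground-truth set.
```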
Temporal evaluation
Security evaluation benchmarks age fast. A benchmark built from 2024 CVEs is leaky by 2026 because models have seen the answers during training. The honest evaluation approach is to hold out a temporal slice of data that the model could not have seen. Evaluate the model on vulnerabilities disclosed after its training cutoff. Evaluate it on code committed after that date. This gives a realistic picture of how the model will perform on fresh work.
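A minimal sketch of building the temporal slice, assuming each evaluation item records its disclosure (or commit) date; the cutoff date is illustrative and should come from the model card:
```python
# Temporal hold-out: keep only items the model could not have seen in training.
# Assumes each item records its public disclosure or commit date.
from dataclasses import dataclass
from datetime import date

@dataclass
class EvalItem:
    identifier: str
    disclosed: date
    # ... task prompt, reference answer, etc.

TRAINING_CUTOFF = date(2024, 10, 1)  # illustrative; take this from the model card

def temporal_holdout(items: list[EvalItem], cutoff: date = TRAINING_CUTOFF) -> list[EvalItem]:
    """Items disclosed strictly after the model's training cutoff."""
    return [item for item in items if item.disclosed > cutoff]
```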
Temporal evaluation requires discipline to maintain. Every time a new benchmark is built, its freshness clock starts ticking. By the time it is widely used, it has usually leaked into training sets and stopped being fresh. Teams that take evaluation seriously rotate their evaluation datasets regularly and treat the dataset itself as a living artefact.
For reasoning models specifically, temporal evaluation exposes whether the model can reason about novel situations or just retrieve patterns from training. A model that performs dramatically worse on post-training-cutoff data is relying more on memorisation than reasoning. That is a red flag for any use case involving new vulnerabilities, which is most security use cases.
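One way to quantify that red flag is to score the same task on a pre-cutoff and a post-cutoff slice and compare; the numbers and the warning threshold below are illustrative, not a standard:
```python
# Memorisation check: compare accuracy on pre-cutoff vs post-cutoff slices.
# A large drop suggests the model is retrieving training patterns, not reasoning.

def memorisation_gap(pre_cutoff_accuracy: float, post_cutoff_accuracy: float) -> float:
    return pre_cutoff_accuracy - post_cutoff_accuracy

gap = memorisation_gap(pre_cutoff_accuracy=0.86, post_cutoff_accuracy=0.61)
if gap > 0.15:  # illustrative threshold
    print(f"Warning: {gap:.0%} accuracy drop on fresh data; likely memorisation.")
```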
Cost and latency as part of quality
Reasoning models burn many more tokens per query than non-reasoning models because the reasoning trace has to be generated. Evaluating them fairly requires accounting for this cost. A reasoning model that is ten percent more accurate while costing five times as many tokens may or may not be a good trade, depending on the workflow.
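A worked comparison sketch; the per-token prices, token counts, and accuracies are made-up numbers for illustration:
```python
# Cost-aware comparison: accuracy gained versus per-query cost, using made-up numbers.

def cost_per_query(tokens_per_query: int, price_per_million_tokens: float) -> float:
    return tokens_per_query / 1_000_000 * price_per_million_tokens

baseline = {"accuracy": 0.72, "cost": cost_per_query(2_000, 5.0)}      # non-reasoning model
reasoning = {"accuracy": 0.79, "cost": cost_per_query(10_000, 15.0)}   # reasoning model

extra_accuracy = reasoning["accuracy"] - baseline["accuracy"]
cost_ratio = reasoning["cost"] / baseline["cost"]
print(f"+{extra_accuracy:.0%} accuracy at {cost_ratio:.0f}x the per-query cost")
```
Whether that trade is worth it depends on what a missed or mis-triaged finding costs in your workflow, which is exactly why the cost figure belongs in the evaluation rather than being an afterthought.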
Latency evaluation also matters. Reasoning models take longer to produce a final answer because they generate the intermediate trace first. For interactive workflows where users wait for responses, this latency can make an otherwise strong model unusable. Measure time-to-first-useful-token and time-to-final-answer separately. Some workflows can start acting on partial reasoning before the final answer is committed. Others cannot.
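A sketch of measuring the two latencies separately over a streaming response; stream_response is a placeholder for whatever streaming client your stack uses, and the notion of "useful" here is deliberately crude:
```python
# Latency measurement sketch: time-to-first-useful-token vs time-to-final-answer.
# stream_response is a placeholder generator; swap in your actual streaming client.
import time
from collections.abc import Iterator

def stream_response(prompt: str) -> Iterator[str]:
    """Placeholder: yields chunks of model output as they arrive."""
    yield from ["Reasoning: the sink is ", "reachable from an ",
                "unauthenticated endpoint. ", "Verdict: critical."]

def measure_latency(prompt: str) -> tuple[float, float]:
    start = time.monotonic()
    first_useful = None
    for chunk in stream_response(prompt):
        # Crude notion of "useful": the first non-empty chunk. A real harness
        # would look for the first actionable claim, not just any token.
        if first_useful is None and chunk.strip():
            first_useful = time.monotonic() - start
    final = time.monotonic() - start
    return first_useful, final

ttfut, ttfa = measure_latency("Triage this finding ...")
print(f"first useful token: {ttfut:.3f}s, final answer: {ttfa:.3f}s")
```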
Failure mode taxonomy
Building an explicit taxonomy of failure modes makes evaluation more actionable. For security reasoning, the most common failure modes we have seen are fabricated CVEs (the model cites a CVE that does not exist), fabricated code paths (the model reasons about a call chain that is not actually in the code), confident misclassification (the model commits to wrong severity with strong language), context leakage (the model's reasoning references information it should not have had access to), and scope drift (the model's final answer addresses a different question from the one asked).
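One way to make the taxonomy operational is to encode it and tally labelled failures per category, so per-mode rates can be compared across model versions; the sketch below mirrors the labels above and the counts are illustrative:
```python
# Failure mode taxonomy as code: categorise evaluation errors so per-mode
# rates can be tracked across model versions. Labels mirror the taxonomy above.
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    FABRICATED_CVE = "fabricated_cve"
    FABRICATED_CODE_PATH = "fabricated_code_path"
    CONFIDENT_MISCLASSIFICATION = "confident_misclassification"
    CONTEXT_LEAKAGE = "context_leakage"
    SCOPE_DRIFT = "scope_drift"

def failure_rates(labels: list[FailureMode], total_items: int) -> dict[str, float]:
    counts = Counter(labels)
    return {mode.value: counts[mode] / total_items for mode in FailureMode}

# Example: three labelled failures out of 50 evaluated items.
labels = [FailureMode.FABRICATED_CVE, FailureMode.FABRICATED_CVE, FailureMode.SCOPE_DRIFT]
print(failure_rates(labels, total_items=50))
```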
Scoring a model against this taxonomy gives you specific weaknesses to address. A model with a high rate of fabricated CVEs probably needs stronger grounding. A model with high confident misclassification needs calibration training. A model with scope drift needs clearer prompts or task-specific fine-tuning.
The practical evaluation harness
A useful evaluation harness for security reasoning models has several components. A temporal hold-out dataset for realism. A set of well-defined tasks that match your actual workflows. A rubric for scoring reasoning traces, backed by both human and model-judge scoring. A cost and latency measurement layer. A failure mode taxonomy for categorising errors. And continuous rotation of the evaluation data to prevent leakage over time.
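As a sketch of how those components might hang together in one place; every field name here is an assumption about your own stack, not a prescribed layout:
```python
# Harness skeleton: one object that ties the components together.
# All names are illustrative; adapt to whatever your pipeline already has.
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Optional

@dataclass
class EvalHarness:
    training_cutoff: date                              # temporal hold-out boundary
    tasks: list[str]                                   # tasks matching real workflows
    trace_rubric: dict[str, str]                       # dimension -> scoring guidance
    judge: Optional[Callable[[str], dict]] = None      # model-judge scorer, spot-checked by humans
    failure_modes: list[str] = field(default_factory=list)
    dataset_rotated_on: Optional[date] = None          # when the eval data was last refreshed

    def is_stale(self, today: date, max_age_days: int = 180) -> bool:
        """Flag the dataset for rotation once it is old enough to have leaked."""
        if self.dataset_rotated_on is None:
            return True
        return (today - self.dataset_rotated_on).days > max_age_days
```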
Investing in this harness is unglamorous work, but it pays back every time a new model is released. A team with a well-maintained evaluation harness can test a new model in a day and make an informed decision. A team without one ends up making decisions based on vendor benchmarks, which is the wrong way to run security infrastructure.