SWE-bench earned its place as the default yardstick for AI coding agents by doing one thing better than anything that came before it: it picked real issues from real GitHub repositories, with real tests that either pass or fail. That concreteness made it useful, and usefulness is rare in benchmarks.
The security-focused extensions that bolted themselves on afterwards, sometimes labeled SWE-bench-Security or SWE-bench-Verified-Sec, sometimes folded into broader agent eval suites, have tried to port that same design discipline to a narrower question: can an AI agent correctly fix a security bug without making things worse? This is a harder question than it sounds. Here is a field review of what the extensions measure, where they hold up, and where they quietly fall apart.
The shape of the extensions
The security extensions take three broad forms in the wild today.
The first is filtered subsets of the original SWE-bench corpus. Someone walks the issue list, keeps only the ones with CVE assignments or clear security labels, and treats that as a security benchmark. This is the cheapest approach and also the weakest. Security labels on GitHub are noisy, CVE assignments are inconsistent across projects, and the filter throws away context.
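To make that concrete, here is roughly what the filtering step amounts to, sketched in a few lines of Python. The field names and the CVE identifier are invented for illustration; real SWE-bench instances carry a different schema.

```python
import re

# Placeholder schema: "problem_statement" and "issue_labels" are stand-ins,
# not the actual SWE-bench fields, and the CVE below is fabricated.
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,7}")
SECURITY_LABELS = {"security", "vulnerability", "cve"}

def looks_security_related(instance: dict) -> bool:
    # Keyword and label matching is the entire "methodology" of this approach.
    text = instance.get("problem_statement", "")
    labels = {label.lower() for label in instance.get("issue_labels", [])}
    return bool(CVE_RE.search(text)) or bool(labels & SECURITY_LABELS)

corpus = [
    {"instance_id": "proj__1234", "problem_statement": "Fix CVE-2021-99999 in the parser", "issue_labels": []},
    {"instance_id": "proj__5678", "problem_statement": "Speed up cache lookups", "issue_labels": ["performance"]},
]
print([i["instance_id"] for i in corpus if looks_security_related(i)])  # ['proj__1234']
```

Everything the filter cannot see, including security fixes that were never labeled as such, simply drops out of the benchmark.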
The second is augmented test suites. The base issue stays the same, but extra tests are added to check for common insecure patterns introduced by the patch. These catch obvious regressions like reintroduced SQL injection or broken input validation, but they do not catch subtle logic flaws, and they add a new failure mode where a perfectly correct patch fails because the added test has a bug.
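A minimal sketch of what one of those bolt-on checks looks like, with the function under test invented here for illustration. Note how easily the check itself becomes the bug.

```python
import inspect
import sqlite3

# Hypothetical "patched" function; a real augmented suite would import it
# from the repository under test rather than defining it inline.
def get_user(conn: sqlite3.Connection, user_id: str):
    # Parameterized query: the shape the added test hopes to see.
    return conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()

def test_no_string_built_sql():
    # Naive source-pattern check of the kind these suites add. It catches the
    # obvious regression (queries rebuilt with f-strings or % formatting), but
    # a correct patch that merely logs a formatted string would fail it too,
    # which is exactly the extra failure mode described above.
    src = inspect.getsource(get_user)
    assert 'execute(f"' not in src
    assert "% user_id" not in src

test_no_string_built_sql()  # passes for the parameterized version
```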
The third is adversarial extensions where the agent's patch is then attacked by a fuzzer, an exploit harness, or a second agent trying to find a new vulnerability in the patched code. This is the most valuable variant and also the most expensive to run. It is the only one where the score genuinely reflects whether the patch is secure, not just whether it looks secure.
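A stripped-down version of that idea, assuming an invented path-handling patch and a hand-written payload list standing in for the fuzzer or attacking agent:

```python
import os

# Hypothetical patched function under attack; a real harness would import it
# from the patched repository.
def resolve_upload_path(base: str, filename: str) -> str:
    candidate = os.path.realpath(os.path.join(base, filename))
    if not candidate.startswith(os.path.realpath(base) + os.sep):
        raise ValueError("path escapes upload directory")
    return candidate

# A fuzzer or second agent would generate these; hard-coded here for brevity.
PAYLOADS = ["../../etc/passwd", "..%2f..%2fetc/passwd", "a/../../b", "normal.txt"]

def attack(base: str = "/srv/uploads") -> list:
    escapes = []
    for payload in PAYLOADS:
        try:
            resolved = resolve_upload_path(base, payload)
        except ValueError:
            continue  # rejected: the patch held for this payload
        # The harness re-checks the invariant itself instead of trusting the patch.
        if not resolved.startswith(os.path.realpath(base) + os.sep):
            escapes.append(payload)
    return escapes

print(attack())  # an empty list means nothing escaped, for this tiny payload set
```

The score that falls out of a loop like this is about behavior under attack rather than resemblance to the reference patch, which is why the variant is worth its cost.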
What the numbers mean in practice
The headline resolution rate on the security subsets is usually lower than on the base benchmark by a meaningful margin. This is honest. Security bugs are harder than typical issues. They involve more non-local reasoning, more attention to edge cases, and more awareness of what the attacker cares about rather than what the unit test cares about.
What the numbers do not tell you is how the resolved patches would fare in a code review with a human security engineer. I have walked through several agent-produced patches that the benchmark marked as correct and found them to be technically correct but strategically bad. They fixed the specific issue, sometimes by adding a narrow input validator right at the call site, while leaving the same class of bug alive three functions away. The benchmark has no concept of defensive depth, so it scores those the same as a proper fix.
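A contrived example of the difference, with both functions invented here: two patches that satisfy the same regression test and score identically, but say very different things about the author.

```python
import sqlite3

# Patch A: narrow validator at the reported call site. The issue's failing
# test now passes, but every other query built by string formatting elsewhere
# in the codebase stays exactly as exploitable as before.
def get_user_patch_a(conn, user_id: str):
    if not user_id.isdigit():
        raise ValueError("invalid id")
    return conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchone()

# Patch B: remove the bug class at this site by parameterizing the query,
# the shape of fix a reviewer would push to apply repo-wide.
def get_user_patch_b(conn, user_id: str):
    return conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
print(get_user_patch_a(conn, "1"), get_user_patch_b(conn, "1"))  # same result, same score
```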
The aggregate numbers also hide a sharp bimodality. Agents that score well on security tasks tend to score very well on a specific family of bugs, usually input validation and injection classes, and very poorly on another family, usually concurrency, authorization, and cryptographic misuse. Averaging those into a single percentage makes the comparison useless.
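The arithmetic is trivial, but it is worth seeing once. The per-class numbers below are invented purely to show how the averaging hides the shape.

```python
from statistics import mean

# Fabricated per-class resolution rates for two hypothetical agents.
agent_a = {"injection": 0.62, "input_validation": 0.55, "authz": 0.04,
           "crypto": 0.02, "concurrency": 0.03}
agent_b = {"injection": 0.28, "input_validation": 0.24, "authz": 0.27,
           "crypto": 0.22, "concurrency": 0.25}

for name, scores in (("A", agent_a), ("B", agent_b)):
    print(name, round(mean(scores.values()), 2), scores)
# Both agents average to about 0.25. A is sharply bimodal, B is flat:
# identical headline numbers, very different products.
```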
The contamination problem
SWE-bench has a known and well-discussed contamination issue, and the security extensions inherit all of it plus some of their own. The base repositories are public and well-indexed. The issues and their fix commits are public. Any frontier model trained on post-2023 web data has seen most of the answer keys.
The authors of the extensions and the Verified variants have done real work to mitigate this, including holdout splits, newer issue cutoffs, and paraphrased problem statements. It helps. It does not solve the problem. When a model appears to "solve" a CVE from 2021 with a patch that happens to be byte-identical to the real upstream fix, you cannot tell whether it reasoned its way there or recalled the string.
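One screening heuristic, sketched here rather than taken from any of the benchmarks: flag resolved instances whose patch sits suspiciously close to the known upstream fix, and review those by hand. A high similarity ratio does not prove recall, and a low one does not prove reasoning, but it tells you where to look.

```python
import difflib

def normalize(patch: str) -> str:
    # Drop hunk headers and file markers so cosmetic diff differences do not
    # mask an essentially identical patch.
    keep = [line.rstrip() for line in patch.splitlines()
            if not line.startswith(("@@", "index ", "--- ", "+++ "))]
    return "\n".join(keep)

def recall_suspicion(model_patch: str, upstream_fix: str) -> float:
    return difflib.SequenceMatcher(None, normalize(model_patch),
                                   normalize(upstream_fix)).ratio()

# Identical one-hunk patches score 1.0; anything near that deserves a manual look.
print(recall_suspicion("+ if x is None:\n+     raise ValueError",
                       "+ if x is None:\n+     raise ValueError"))
```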
For security specifically, contamination is more dangerous than for general coding. The model may recognize the CVE pattern and apply the known fix without understanding why that fix is correct. On a new, unseen bug of the same class, it will silently fail. The benchmark score will not warn you.
Where the extensions earn their keep
There are two places where I still reach for SWE-bench-Security variants despite everything above.
The first is as a regression gate for agent frameworks. When a team changes prompts, tools, or the underlying model, running the security subset is a cheap way to detect a drop in capability. You do not need the absolute score to be meaningful. You need the delta across versions of the same system to be meaningful, and it mostly is, as long as the harness is held constant.
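A minimal sketch of that gate, with an invented results format; the point is that the absolute score never enters the decision, only the previously solved instances that the new version loses.

```python
import json

def load_results(path: str) -> dict:
    # Assumed format: a JSON list of {"instance_id": ..., "resolved": bool}.
    with open(path) as f:
        return {r["instance_id"]: r["resolved"] for r in json.load(f)}

def regression_gate(old_path: str, new_path: str, budget: int = 0) -> bool:
    old, new = load_results(old_path), load_results(new_path)
    lost = [i for i in old if old[i] and not new.get(i, False)]
    gained = [i for i in new if new[i] and not old.get(i, False)]
    print(f"lost {len(lost)} previously solved instances, gained {len(gained)}: {lost}")
    return len(lost) <= budget

# e.g. fail the pipeline when a prompt or model change drops solved instances:
# assert regression_gate("results_v1.json", "results_v2.json")
```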
The second is as a public reporting floor. If an agent vendor cannot hit a credible score on the security subset, they probably should not be selling their product to anyone who cares about security. It is a low bar, but the low bar is useful because a surprising number of products fail it.
What is missing
The extensions miss most of what I care about in 2026. They do not test agents with realistic tool access. They do not test long-horizon tasks where the security implications emerge over many turns. They do not test multi-file refactors that introduce subtle trust boundary changes. They do not test the case where the right answer is "I do not know, escalate this."
They also do not test the collaborator dynamic. In real engineering, a security fix is a negotiation between the author, a reviewer, and sometimes a security engineer. The agent's role in that conversation is at least as important as the patch content. The benchmark has no instrument for this.
How to read a SWE-bench-Security number
A short checklist.
Ask which subset and which split. If the answer is vague, the number is vague.
Ask whether the test augmentation is public. If not, the score is not reproducible and should be discounted.
Ask for per-CWE or per-class breakdowns. A 30 percent score that is all input validation is a different product from a 30 percent score spread across ten weakness classes.
Ask about contamination controls. A credible answer names specific techniques. A bad answer waves at "held-out data."
Ask for a live run on a new, post-cutoff CVE that neither of you has seen. If the agent can do it once in front of you, the benchmark number is probably directionally right.
The verdict
SWE-bench with security extensions is a better benchmark than most alternatives for the narrow question of patch quality on known issues. It is not a good benchmark for agent security in the broader sense that most buyers care about. Treat it as a necessary condition, never sufficient. A team that passes it has cleared a bar. A team that fails it has told you something important. A team that leans on the score to sell you an agent has told you something even more important, though perhaps not what they intended.