
Agentic AI Security: Why Architecture Beats Model Size in Vulnerability Discovery

The CyberGym leaderboard shows the lead in AI vulnerability discovery moving to multi-agent orchestration, not raw model scale. Here is what that means for security teams betting on agentic AI.

Nayan Dey
Senior Security Engineer

There is a comfortable assumption in AI security right now: that the team with the biggest, smartest model wins. Find the frontier model with the best reasoning, point it at a codebase, and it will out-discover everyone else. It is a clean story. It is also, increasingly, wrong.

The most interesting result of 2026 so far is not which lab shipped the most capable single model. It is who is sitting at the top of the CyberGym leaderboard — and why. The answer says something uncomfortable for anyone who has been treating model selection as the whole game. The lead is moving to the layer above the model, and if your strategy for agentic AI security is "wait for a better model," you are optimizing the wrong variable.

What CyberGym Actually Measures

CyberGym is a benchmark out of UC Berkeley's RDI group, published in 2025, built to evaluate AI agents on real-world cybersecurity work rather than toy puzzles. It comprises 1,507 instances drawn from historical vulnerabilities across 188 real software projects. In its Level 1 task, an agent receives a vulnerability description and an unpatched codebase, and is scored on whether it can reproduce the vulnerability by generating a working proof-of-concept.

That framing matters. CyberGym does not reward an agent for sounding plausible. It rewards an agent for producing an artifact that actually triggers the bug. According to the Berkeley team, runs against the benchmark surfaced dozens of genuine zero-day vulnerabilities and a number of historically incomplete patches in the underlying projects — evidence that performance here correlates with real discovery capability, not benchmark gaming. For a field drowning in confident-but-wrong AI output, a benchmark that demands a reproducible PoC is exactly the right kind of strict.
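To make "an artifact that actually triggers the bug" concrete, here is a minimal sketch of an execution-based judge. It is not CyberGym's harness: the toy target `unpatched_read_field`, the helper names, and the choice of `IndexError` as the "crash" signal are all assumptions made for illustration. The point is only that the verdict comes from running the input, not from reading the agent's explanation.

```python
from typing import Callable


def unpatched_read_field(data: bytes) -> int:
    """Toy stand-in for a vulnerable target (hypothetical, not a real project)."""
    length = data[0]
    payload = data[1:]
    # Bug: the length byte is trusted without checking the payload size.
    return payload[length - 1]


def triggers_bug(target: Callable[[bytes], int], poc: bytes) -> bool:
    """Return True only if executing the PoC makes the target fault."""
    try:
        target(poc)
    except IndexError:
        return True
    return False


def score(pocs: dict[str, bytes], target: Callable[[bytes], int]) -> float:
    """Fraction of submitted PoCs that actually fire against the target."""
    hits = sum(triggers_bug(target, poc) for poc in pocs.values())
    return hits / len(pocs)


submissions = {
    "claims_overflow_but_fits": b"\x02AB",  # length 2, payload has 2 bytes
    "length_exceeds_payload": b"\x04AB",    # length 4, payload has 2 bytes
}
print(score(submissions, unpatched_read_field))
```

Both submissions could come with an equally confident write-up. Only the second one faults the target, so only the second one counts.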

The Leaderboard Tells On Itself

Look at the public snapshot and the pattern is hard to miss. The strong single frontier models cluster in a band: Claude Mythos Preview around 83 percent, GPT-5.5 around 81.8 percent, Claude Opus 4.7 around 73 percent. These are excellent models. The gaps between them are real but modest, and they are the gaps you would expect from raw capability differences.

Then there is Microsoft's entry. In May 2026, Microsoft published results for a system it codenamed MDASH — its multi-model agentic scanning harness — that hit 88.45 percent on the same 1,507 instances, roughly five points clear of the next entry. By early June, reporting put the figure higher still, around 96.55 percent as the system matured. Microsoft was explicit that MDASH topped Anthropic's Mythos on this benchmark.

Here is the part worth sitting with: MDASH did not win by having a secret model that out-reasons GPT-5.5 or Claude. It won by orchestrating, in Microsoft's own description, more than a hundred specialized agents across an ensemble of frontier and distilled models through a five-stage pipeline — prepare, scan, validate, dedupe, and prove. The Microsoft writeup puts it bluntly: "the harness does the work, and the model is one input," and "discovery requires composition that no single prompt can achieve." Their framing — that the harness is most of the engineering — is the whole argument in one sentence.
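As a rough picture of what a staged harness looks like in code, the sketch below chains five functions named after the stages Microsoft lists. Only the stage names come from the writeup. Everything else is a hypothetical stand-in: the `Candidate` type, the dict-shaped "repo", and the toy `reachable` flag that replaces real analysis. It says nothing about how MDASH is actually implemented.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Candidate:
    file: str
    sink: str
    reachable: bool       # toy ground truth standing in for real analysis
    proven: bool = False


def prepare(repo: dict[str, list[tuple[str, bool]]]) -> dict[str, list[tuple[str, bool]]]:
    # Assumption: a real harness would build and index the target here.
    return repo


def scan(index: dict[str, list[tuple[str, bool]]]) -> list[Candidate]:
    # Wide net: emit every sink, duplicates and dead ends included.
    return [Candidate(f, sink, ok) for f, sinks in index.items() for sink, ok in sinks]


def validate(candidates: list[Candidate]) -> list[Candidate]:
    # Stand-in for a skeptical agent that tries to disprove each candidate.
    return [c for c in candidates if c.reachable]


def dedupe(candidates: list[Candidate]) -> list[Candidate]:
    # Frozen dataclasses are hashable, so dict.fromkeys drops repeats in order.
    return list(dict.fromkeys(candidates))


def prove(candidates: list[Candidate]) -> list[Candidate]:
    # Stand-in for building and executing a PoC per surviving candidate.
    return [replace(c, proven=True) for c in candidates]


def run(repo: dict[str, list[tuple[str, bool]]]) -> list[Candidate]:
    result = prepare(repo)
    for stage in (scan, validate, dedupe, prove):
        result = stage(result)
    return result


toy_repo = {"parser.c": [("memcpy", True), ("memcpy", True), ("strcpy", False)]}
print(run(toy_repo))
```

Three raw hits go in and one proven finding comes out. Each stage is small and replaceable, which is the property the "model is one input" framing depends on.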

Why Orchestration Moves the Frontier and Scale Does Not

If you have built single-model vulnerability discovery, none of this is surprising. The hard problem in this domain was never "can a model notice a suspicious pattern." Frontier models have been good at that for a while. The hard problem is everything after the notice: confirming the tainted input actually reaches the sink, ruling out sanitization the model glossed over, constructing a PoC that fires, and deciding the finding is real before a human ever sees it.

There is a useful way to see this. Think of model size as the quality of a single security researcher, and orchestration as the process that researcher operates inside. A brilliant researcher with no peer review, no reproduction step, and no second opinion will still ship mistakes — confident ones — because the failure is procedural, not intellectual. A merely good researcher embedded in a rigorous process ships fewer. The industry spent two years recruiting smarter and smarter soloists. The teams pulling ahead now are the ones who built the process.

A single model asked to handle detection, confirmation, and proof in one pass faces a structural trap. Optimize it for recall and it flags everything, burying real bugs under hallucinated ones. Optimize it for precision and it goes quiet on the subtle, cross-file vulnerabilities that actually get exploited. No prompt escapes this tradeoff, because the tradeoff lives in the architecture, not the wording.

Orchestration breaks the trap by separating the jobs. One agent casts a wide net for candidates — recall. A different agent, ideally a different model, tries to disprove each candidate — precision. A third builds the PoC that either fires or does not, which is the only judge that cannot be sweet-talked. Microsoft's "validate, dedupe, prove" stages are exactly this decomposition. The reason a panel beats a soloist is the same reason code review beats writing alone: an independent verifier with a different failure mode catches what the author rationalized away. That is why a harness over a low-80s base model can clear 88 and keep climbing, while swapping in a marginally smarter base model nudges you a point or two. The precision-recall frontier moves with structure. It barely moves with scale.
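A toy calculation shows the shape of the argument. The seven candidates, their scores, and the `poc_fires` flags below are invented for illustration and are not benchmark data. A single pass has one knob, the threshold. The two-stage version keeps the threshold low for recall and lets an executed PoC do the filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float      # finder's confidence (made-up numbers)
    real: bool        # ground truth: is this a genuine vulnerability?
    poc_fires: bool   # would an executed PoC trigger it?


UNIVERSE = [
    Candidate("A", 0.90, True, True),
    Candidate("B", 0.80, False, False),
    Candidate("C", 0.40, True, True),    # subtle cross-file bug, low score
    Candidate("D", 0.30, False, False),
    Candidate("E", 0.85, True, True),
    Candidate("F", 0.35, False, False),
    Candidate("G", 0.50, True, False),   # real, but the prover cannot build a PoC
]


def precision_recall(reported: list[Candidate], universe: list[Candidate]) -> tuple[float, float]:
    true_positives = sum(c.real for c in reported)
    total_real = sum(c.real for c in universe)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / total_real if total_real else 0.0
    return precision, recall


def single_pass(universe: list[Candidate], threshold: float) -> list[Candidate]:
    # One model, one knob: report everything at or above the threshold.
    return [c for c in universe if c.score >= threshold]


def find_then_prove(universe: list[Candidate], threshold: float) -> list[Candidate]:
    # Wide net first, then keep only what an executed PoC confirms.
    return [c for c in single_pass(universe, threshold) if c.poc_fires]


for label, reported in [
    ("single pass, low threshold", single_pass(UNIVERSE, 0.30)),
    ("single pass, high threshold", single_pass(UNIVERSE, 0.85)),
    ("find then prove", find_then_prove(UNIVERSE, 0.30)),
]:
    print(label, precision_recall(reported, UNIVERSE))
```

On this toy data the low threshold gives precision 4/7 with recall 1.0, the high threshold gives precision 1.0 with recall 0.5, and find-then-prove gives precision 1.0 with recall 0.75. The prove step is not free: candidate G is real but gets dropped because no PoC fires, which is the price of demanding execution as the judge.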

The Honest Caveats

Two things keep this from being a victory lap. First, leaderboards are not production. CyberGym hands the agent a vulnerability description; real discovery starts from a cold repository with no hint that a bug exists, and the orchestration tax — running a hundred agents and several models per finding — is real cost that someone pays per scan. A number that climbs from 88 to 96 in three weeks is a number under heavy active tuning, and tuning to a benchmark is not the same as generalizing past it.

Second, orchestration is not magic that launders bad models into good findings. If every model in your ensemble shares the same blind spot (and models trained on overlapping data often do), your independent verifier is not actually independent, and correlated errors sail straight through. The win comes from genuine diversity of failure modes and from the prove step being grounded in execution, not from stacking agents for its own sake. More agents that all agree are not verification; they are an echo.
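The echo effect can be put in numbers with a deliberately simple model. Assume a false candidate lands in a blind spot shared by every verifier with probability `shared_blind_spot`, in which case all of them accept it. Otherwise each verifier independently accepts it with probability `p_accept`. Both parameters and the mixture model itself are assumptions for illustration, not measurements of any real ensemble.

```python
def false_positive_pass_rate(
    p_accept: float, verifiers: int, shared_blind_spot: float = 0.0
) -> float:
    """Probability that a false candidate is accepted by every verifier."""
    if not 0.0 <= p_accept <= 1.0 or not 0.0 <= shared_blind_spot <= 1.0:
        raise ValueError("probabilities must be in [0, 1]")
    if verifiers < 1:
        raise ValueError("need at least one verifier")
    # Independent errors multiply away as verifiers are added.
    independent_part = p_accept ** verifiers
    # Shared blind spots do not: every verifier fails together.
    return shared_blind_spot + (1.0 - shared_blind_spot) * independent_part


for k in (1, 3, 10):
    independent = false_positive_pass_rate(0.2, k)
    correlated = false_positive_pass_rate(0.2, k, shared_blind_spot=0.1)
    print(k, independent, correlated)
```

With fully independent verifiers, the pass rate falls from 0.2 to 0.008 by the third verifier and keeps shrinking. With a 10 percent shared blind spot, it starts at 0.28, reaches 0.1072 at three verifiers, and never drops below 0.1 no matter how many agents are stacked on top.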

How Safeguard Helps

This is the architecture we built Safeguard around, well before the leaderboard caught up to the thesis. Our Multi-Agent TAOR Deep Think AI Engine treats the model as a swappable component — OpenAI Daybreak, Anthropic Mythos, or whatever leads next month plugs in underneath — while the reliability lives in the verification and orchestration layer above it, where independent agents confirm, contextualize, and prove each finding before it reaches you. That layer is what cuts false positives, and it is why we lead on CyberGym-style evaluation through composition rather than betting on any single model. We measure ourselves in cost-per-verified-finding, not raw alert volume, and we wire results into your AIBOM, policy gates, and provenance so a finding becomes an action. If you want to see verified findings against your own dependency tree, reach out.
