Claude Opus is Anthropic's most capable reasoning model, and it's the engine Griffin reaches for when a triage decision actually matters. Since Opus is what Griffin uses under the hood for its hardest calls, the interesting comparison isn't "Griffin or Opus" — it's "Opus with a prompt" versus "Opus with Griffin's context, checks, and choreography wrapped around it."
This post walks through how each approach actually performs on real triage work, where the overlap is high, and where the wrapper earns its keep.
The Shape Of A Triage Task
Triage is the step between "a scanner found something" and "we know what to do about it." In a given week, a mid-sized engineering org will see hundreds of new findings land from SCA tools, container scans, code scanners, and secret detectors. The goal of triage is to compress that queue into a short list of things that actually need human attention, with a confident reason attached to each.
A good triage decision needs to answer four questions. Is the finding real or a false positive? Is the vulnerable code actually reachable in our deployment? Is there a known exploit and is it being used in the wild? And finally, what is the cheapest path to make the finding go away?
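Concretely, you can think of a triage decision as a small record with one slot per question. This is a minimal sketch, and every field name here is illustrative rather than anything from Griffin's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative only: these field names are hypothetical, not Griffin's schema.
@dataclass
class TriageDecision:
    finding_id: str
    is_true_positive: bool     # real finding, or a false positive?
    is_reachable: bool         # is the vulnerable code reachable as deployed?
    exploit_known: bool        # known exploit, or evidence of in-the-wild use?
    cheapest_fix: str          # e.g. "bump to the patched minor version"
    classification: Literal["fix-now", "defer", "accept"]
    rationale: str             # the evidence that drove the call
```

A decision that can't fill all of those slots with evidence isn't done yet; that framing is what the rest of this post keeps coming back to.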
Each of those questions has a reasoning component and a data component. The reasoning is where Opus shines. The data is where raw Opus starts to struggle.
Opus Alone
Drop a CVE description, a snippet of the affected code, and a dependency tree into a raw Opus prompt and you will get a genuinely thoughtful analysis. Opus will identify the vulnerable function, explain the exploitation pattern, reason about whether your usage looks reachable, and recommend a patched version. For a single finding and a motivated analyst, that's a remarkably good experience.
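For concreteness, here's roughly what that one-off session looks like through the API, using the Anthropic Python SDK. The placeholder inputs are yours to supply, and the model ID is just one Opus snapshot; swap in whichever you have access to:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholders: in practice you'd paste the real finding here.
cve_description = "<CVE description from the scanner>"
code_snippet = "<the affected code path>"
dependency_tree = "<output of your dependency resolver>"

prompt = f"""You are a security analyst. Triage this finding.

CVE description:
{cve_description}

Affected code:
{code_snippet}

Dependency tree:
{dependency_tree}

Is this exploitable in our usage, and what is the cheapest fix?"""

response = client.messages.create(
    model="claude-opus-4-20250514",  # swap in whichever Opus snapshot you use
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```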
The problems appear when you scale. Opus doesn't know what your other services are running. It doesn't know that the same library is pinned in eleven other repos with different configurations. It doesn't know that your team already triaged a similar CVE two months ago and concluded it was non-reachable for architectural reasons. Every Opus session starts from zero, so every session re-derives context that should be institutional knowledge.
The second problem is calibration. Opus is confident. When it says "this looks exploitable," it means "this looks exploitable given the limited information I was handed." It doesn't distinguish strongly between "I'm reasoning from first principles" and "I've verified this against ground truth." For high-stakes calls, that uncalibrated confidence is dangerous.
Opus Inside Griffin
Griffin doesn't replace Opus; it front-loads the context and back-loads the checks. Before Opus even sees a triage prompt, Griffin has already pulled the relevant SBOM slice, the reachability data from static analysis, the exploit-availability signal from EPSS and KEV, and the history of how similar findings were resolved for this tenant. All of that lands in the prompt as structured context.
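A rough sketch of that front-loading step follows. Every `fetch_*` helper here is a hypothetical stand-in for a real data source (an SBOM service, a static analyzer, the EPSS API, the CISA KEV feed, a triage database), not Griffin's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve_id: str
    package: str

# All fetch_* helpers below are stubs standing in for real data sources.
def fetch_sbom_slice(tenant: str, package: str) -> list[str]:
    return ["service-a", "service-b"]   # stub: repos pinning this package

def fetch_reachability(tenant: str, finding: Finding) -> bool:
    return False                        # stub: static-analysis verdict

def fetch_epss(cve_id: str) -> float:
    return 0.02                         # stub: exploit-probability score

def fetch_kev_status(cve_id: str) -> bool:
    return False                        # stub: on the CISA KEV list?

def fetch_similar_triages(tenant: str, finding: Finding) -> list[dict]:
    return []                           # stub: prior decisions for this tenant

def build_triage_context(finding: Finding, tenant: str) -> dict:
    """Front-load everything the model will need into one structured blob."""
    return {
        "sbom_slice": fetch_sbom_slice(tenant, finding.package),
        "reachability": fetch_reachability(tenant, finding),
        "epss_score": fetch_epss(finding.cve_id),
        "in_kev": fetch_kev_status(finding.cve_id),
        "prior_decisions": fetch_similar_triages(tenant, finding),
    }
```

The design choice that matters is that everything arrives as structured data, so the checks downstream can compare the model's claims against the same evidence it was shown.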
The prompt itself is constructed by a security-tuned template that steers Opus toward the specific decision shape we want: "classify as fix-now, defer, or accept, and cite the specific evidence that drove the call." The template matters because it eliminates an entire category of unhelpful Opus responses — the kind that meander through general advice without committing to a recommendation.
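One plausible shape for a decision-forcing template looks like this. Griffin's actual wording isn't public, so treat this as an illustration of the pattern, not the real prompt:

```python
import json

# Illustrative decision-forcing template, not Griffin's actual prompt.
TRIAGE_TEMPLATE = """You are triaging a security finding. Context:

{context_json}

Respond with exactly one classification: fix-now, defer, or accept.
Then cite the specific evidence from the context above that drove the call.
Do not give general security advice. Do not hedge between classifications."""

def render_prompt(context: dict) -> str:
    return TRIAGE_TEMPLATE.format(context_json=json.dumps(context, indent=2))
```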
After Opus generates a triage decision, Griffin's eval harness runs a set of cheap graders before showing it to the user. One grader checks that any CVE IDs referenced actually exist in the advisory database. Another checks that the recommended fix version is reachable from the currently pinned version. A third checks that the reachability claim matches the evidence Griffin pulled. If any grader fails, Griffin either retries with a corrected prompt or surfaces the disagreement to a human.
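Here's a minimal sketch of what graders like these might look like. The data shapes and ground-truth sources are assumptions for illustration, not Griffin's real interfaces:

```python
import re
from packaging.version import Version

def grade_cve_ids_exist(decision_text: str, advisory_db: set[str]) -> bool:
    """Every CVE ID the model cites must exist in the advisory database."""
    cited = re.findall(r"CVE-\d{4}-\d{4,}", decision_text)
    return all(cve in advisory_db for cve in cited)

def grade_fix_version_reachable(pinned: str, recommended: str,
                                available: list[str]) -> bool:
    """The recommended fix must be published and an upgrade from the pin."""
    return recommended in available and Version(recommended) > Version(pinned)

def grade_reachability_claim(claimed: bool, evidence: bool) -> bool:
    """The model's reachability claim must match the static-analysis evidence."""
    return claimed == evidence

def run_graders(decision: dict, evidence: dict) -> list[str]:
    failures = []
    if not grade_cve_ids_exist(decision["text"], evidence["advisory_db"]):
        failures.append("unknown CVE ID cited")
    if not grade_fix_version_reachable(evidence["pinned"],
                                       decision["fix_version"],
                                       evidence["available_versions"]):
        failures.append("fix version not available")
    if not grade_reachability_claim(decision["reachable"], evidence["reachable"]):
        failures.append("reachability claim contradicts evidence")
    return failures  # empty list = show to user; otherwise retry or escalate
```

On a failure, the retry prompt can quote the specific grader miss back to the model; a second disagreement is what gets surfaced to a human.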
The net effect is that the Opus reasoning you see in Griffin has been both better-informed going in and better-checked coming out. Same model, different failure mode.
Where The Gap Is Biggest
The gap between raw Opus and Griffin is smallest on well-defined, single-finding questions. If you paste a single CVE and ask "is this exploitable given the following code," Opus alone will give you almost the same answer Griffin would. Maybe 90 percent as good, often indistinguishable.
The gap widens dramatically on three patterns.
Cross-project triage is the first. When a CVE affects a transitive dependency that shows up in twelve services with different usage patterns, raw Opus can only reason about whatever you paste into one prompt. Griffin automatically fans out, triages each service's usage separately, and then aggregates. The token cost is higher, but the output is a matrix of decisions instead of a generic analysis; a minimal sketch of this fan-out appears after the third pattern below.
Longitudinal triage is the second. When a finding has a history — previously marked as non-reachable, then the code changed, then a new CVE variant was published — Griffin's context store surfaces that history automatically. Raw Opus has no memory, so it re-opens closed cases every time.
Calibrated deferral is the third. Griffin's graders are particularly good at catching confidently wrong deferrals. Raw Opus, given the option to mark a finding as "defer," will sometimes defer things it shouldn't because the prompt didn't push back hard enough. Griffin's check layer reliably catches the "this looks too risky to defer" case.
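To make the first of those patterns concrete, here is a minimal fan-out sketch. `triage_one` is a stub standing in for a full context-plus-Opus-plus-graders pass over one service:

```python
# Sketch of the cross-project fan-out/aggregate pattern described above.
def triage_one(service: str, usage: dict) -> str:
    # Stub: in practice this runs the whole pipeline (context, Opus, graders).
    return "fix-now" if usage.get("reachable") else "defer"

def triage_across_services(usages: dict[str, dict]) -> dict[str, str]:
    """One decision per service instead of one generic analysis."""
    return {service: triage_one(service, usage)
            for service, usage in usages.items()}

decisions = triage_across_services({
    "payments": {"reachable": True},    # calls the vulnerable function directly
    "reporting": {"reachable": False},  # pins the library but never invokes it
})
# {'payments': 'fix-now', 'reporting': 'defer'}
```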
When Opus Alone Still Wins
For deep, exploratory analysis of a single novel vulnerability, raw Opus with a skilled human analyst is often the superior tool. Griffin's scaffolding is tuned for throughput and consistency, which makes it slightly weaker on the weird, bespoke, research-heavy cases. If you're looking at a truly new class of supply chain attack, unlike anything in the training data or in Griffin's learned patterns, you want Opus in a bare-metal chat, not wrapped in workflow machinery that keeps trying to classify it.
Opus alone is also the right choice for narrative work. Writing a long-form vulnerability disclosure, drafting an advisory for customers, producing a retrospective — these are tasks where Griffin's triage-shaped prompts get in the way. Use Opus directly, bring your own context, and let the model breathe.
The Practical Recommendation
For the large majority of inbound findings that need to be classified, routed, and resolved, Griffin's wrapper around Opus is worth it. The wrapper is mostly earning its keep by eliminating false positives, reducing the cognitive cost per finding, and producing a consistent audit trail. That's the shape of operational triage, and Opus alone just doesn't handle that shape well at volume.
For the small number of findings each week that genuinely warrant deep human analysis, pull them out of Griffin's queue and work them in a raw Opus session. You'll get the same reasoning engine, unconstrained by the triage template, with full access to all the context you choose to hand it.
That split — Griffin for the triage funnel, Opus directly for the hard investigations — is how most teams end up using both. It reflects the reality that Griffin is Opus applied to a specific job with specific guardrails, not a different intelligence layer. When the job matches the guardrails, Griffin wins on consistency and throughput. When the job is the weird one you didn't plan for, raw Opus wins on flexibility.