Refusal Rate Analysis: Griffin AI vs Mythos
A security AI that refuses too often is useless. One that refuses too rarely is dangerous. Griffin AI publishes calibrated refusal benchmarks; Mythos does not.
SEvenLLM set out to measure how well LLMs handle security event analysis, the unglamorous day-to-day work of SOCs and IR teams. A design review of what the benchmark covers, how it was built, and where its coverage does, and does not, map to real operations.
An AI security tool that cites the wrong advisory is worse than one that says nothing. Griffin AI benchmarks citation accuracy at 0.89 similarity; Mythos does not.
SecBench positioned itself as a comprehensive cybersecurity knowledge and reasoning benchmark for LLMs. A methodology review of its construction, scoring, and the gaps that separate the advertised coverage from what the benchmark actually exercises.
Griffin AI reports a 98-100% hold rate against adversarial probes. Most Mythos-class tools have never published an adversarial number at all.
The datasets you use to evaluate model safety are themselves a supply chain, and almost nobody is treating them that way. A senior engineer's audit of how eval corpora are poisoned, contaminated, and left to silently drift.
SWE-bench became the default benchmark for measuring AI coding agents, but the security extensions that were bolted on afterwards deserve their own scrutiny. A field review of what they measure, where they break, and whether you should trust the numbers.
A benchmark number is only as good as the methodology that produced it. Here is how Griffin AI builds its harness and why most Mythos-class tools cannot be audited.
A working engineer's review of CyberSecEval, the Meta-originated benchmark that has quietly become the default sniff test for AI-for-security claims. What it actually measures, what it misses, and how to read its scores without fooling yourself.