CWE classification looks like a secondary concern. In practice, it is the single field that determines what happens to a finding in the downstream pipeline. The remediation library is keyed by CWE. The compliance mapping to PCI-DSS 6.5 and the OWASP Top 10 is keyed by CWE. The detection engineering team's runbook is keyed by CWE. Get the class wrong and the wrong playbook runs; the right mitigation never gets considered; the compliance report lies. "It's close enough" is not an acceptable answer for a field that is consumed by automation.
I have watched enough AI bug hunters confidently misclassify CWEs to have strong opinions about why some of them get it right and others get it wrong.
The taxonomy is not straightforward
The MITRE CWE corpus in its 2026 revision has over 900 weaknesses with heavy hierarchical structure and plenty of overlap. A finding that looks like CWE-89 SQL injection can easily also be CWE-943 improper neutralisation of special elements in data query logic, or CWE-564 SQL injection through Hibernate, depending on the exact sink. A server-side request forgery that reaches a cloud metadata endpoint is CWE-918 but may also implicate CWE-611 XML external entity processing if the SSRF is through an XML parser. The taxonomy is layered and contextual; the right answer is often a specific child class, not the parent.
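The hierarchy can be made concrete with a minimal sketch. The ChildOf relations below mirror MITRE's research view for the SQL injection subtree; everything beyond those four entries is illustrative, and a real classifier would load the full corpus rather than a hand-maintained dict.

```python
# A hand-maintained subset of CWE ChildOf relations, walked upward
# to test whether one class is a descendant of another.
CHILD_OF = {
    "CWE-89": "CWE-943",   # SQL injection specialises improper
                           # neutralisation in data query logic
    "CWE-564": "CWE-89",   # Hibernate SQL injection specialises CWE-89
    "CWE-943": "CWE-74",   # which in turn specialises injection
}

def ancestors(cwe: str) -> list[str]:
    """Return the ChildOf chain from a class up to the subtree root."""
    chain = []
    while cwe in CHILD_OF:
        cwe = CHILD_OF[cwe]
        chain.append(cwe)
    return chain

def is_descendant_of(child: str, parent: str) -> bool:
    """True if `parent` appears anywhere on `child`'s ChildOf chain."""
    return parent in ancestors(child)
```

Under this subset, `is_descendant_of("CWE-564", "CWE-943")` holds: a Hibernate-specific injection is still, at the parent level, an injection in data query logic. That is exactly why a parent-level label is "close" yet still wrong for a CWE-keyed pipeline.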
Pure-LLM scanners handle this badly. They are trained to recognise the surface grammar of vulnerabilities, and that grammar usually maps to the most-discussed parent class. Ask a Mythos-class tool to classify a finding and you will get CWE-89 or CWE-79 for almost any injection-shaped pattern, even when the more specific child class is objectively correct. The 2025 SEI CERT study of LLM-assisted classification reported parent-class bias in roughly 70 percent of samples that had a clearly preferable child class.
Why Griffin classifies better
The reason Griffin tends to land on correct CWEs is that the static engine has already narrowed the classification space before the model is asked to pick. If the sink is a JDBC prepared statement call misused with string concatenation, the candidate classes are drawn from the injection subtree, and the XSS and XXE subtrees are ruled out entirely. If the source is a deserialisation boundary and the sink is method dispatch on the resulting object graph, the model is picking between CWE-502, CWE-915, and CWE-913, not between those and CWE-89.
This is not the model being smarter. It is the model being constrained. The engine's reachability analysis restricts the population of CWEs the finding could plausibly be, and the model picks among them based on the semantic details the engine cannot judge: framework context, ORM configuration, serialisation library, and so on. The division of labour maps neatly to what each component is good at. The engine knows the shape of the flow. The model knows the culture of the library.
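The division of labour above can be sketched in a few lines. The candidate tables and the `pick_best` stub are illustrative, not Griffin's actual internals; the point is the shape of the constraint, where the model ranks only within the set the engine's flow evidence permits.

```python
# Sketch of constrained classification: sink evidence maps to a small
# candidate set, and the model (stubbed as a ranking callable) can
# only pick within it. Candidate sets here are invented examples.
CANDIDATES_BY_SINK = {
    "sql_concat": ["CWE-89", "CWE-564", "CWE-943"],
    "deserialization_dispatch": ["CWE-502", "CWE-915", "CWE-913"],
    "html_render": ["CWE-79", "CWE-80"],
}

def classify(sink_kind: str, pick_best) -> str:
    """Let the model rank only the classes the engine deems plausible."""
    candidates = CANDIDATES_BY_SINK.get(sink_kind)
    if not candidates:
        raise ValueError(f"no candidate set for sink {sink_kind!r}")
    choice = pick_best(candidates)
    if choice not in candidates:
        # The model cannot escape the engine's candidate set.
        raise ValueError(f"{choice} is outside the candidate set")
    return choice
```

With a deserialisation-dispatch sink, even a model whose prior screams "injection" can only return CWE-502, CWE-915, or CWE-913. The hallucination space is closed before ranking begins.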
The cascade of misclassification
When a Mythos-class tool labels a CWE-502 unsafe deserialisation as CWE-94 code injection, several things go wrong at once.
The remediation library, keyed by CWE, proposes input-sanitisation fixes that do not apply to deserialisation chains. The developer reads the proposed fix, notices it does not match the code, loses a bit of trust in the tool, and starts ignoring its remediation guidance. Meanwhile the real fix, which is to replace the unsafe deserialiser with a schema-validated alternative like Protobuf or JSON with a strict schema, never surfaces because the tool never classified the finding into the bucket that would have pointed at it.
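A CWE-keyed remediation lookup makes the failure mechanical to see. The template texts below are invented stand-ins for a real library:

```python
# Sketch of a CWE-keyed remediation library. A mislabel does not
# produce an error; it silently serves the wrong template.
REMEDIATION = {
    "CWE-94": "Sanitise and allow-list input reaching eval/exec sinks.",
    "CWE-502": "Replace the native deserialiser with a schema-validated "
               "format (e.g. Protobuf, or JSON with strict schema checks).",
}

def fix_for(cwe: str) -> str:
    """Return the remediation template for a CWE, or a manual-triage note."""
    return REMEDIATION.get(cwe, "No template; triage manually.")
```

A finding mislabelled CWE-94 retrieves the sanitisation template, which is useless against a gadget chain; the CWE-502 template, the one that would actually fix the bug, is never fetched. Nothing crashes, which is what makes the failure expensive.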
The compliance mapping fails silently. If the organisation's PCI-DSS 6.5.1 mapping is "flag any CWE-89, CWE-78, CWE-94 for the quarterly injection attestation," a misclassified CWE-502 is either double-counted or invisible. Neither outcome is defensible on an audit call.
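The double-counting is easy to demonstrate. The attestation set below is a hypothetical organisation-specific mapping, not the PCI-DSS standard itself:

```python
# Sketch of a silent compliance failure: an attestation query keyed
# on injection CWEs. The CWE set and finding records are invented.
INJECTION_ATTESTATION = {"CWE-89", "CWE-78", "CWE-94"}

def attestation_findings(findings):
    """Select the findings that land in the injection attestation."""
    return [f for f in findings if f["cwe"] in INJECTION_ATTESTATION]

mislabelled = [
    {"id": "F-101", "cwe": "CWE-94"},   # actually unsafe deserialisation
    {"id": "F-102", "cwe": "CWE-89"},
]
corrected = [
    {"id": "F-101", "cwe": "CWE-502"},
    {"id": "F-102", "cwe": "CWE-89"},
]
```

With the mislabel, F-101 inflates the injection attestation; with the correct label, it drops out of this report entirely and should surface under a different control instead. Either way the query runs clean, which is why nobody notices until the audit call.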
The detection engineering team, which relies on CWE classes to decide what runtime telemetry to add, does not know to instrument the deserialisation boundary. They add logging around the assumed code-injection sinks, the real bug stays operationally invisible, and the incident response team discovers its true class only when someone exploits it.
Specificity and when to stop
There is a counterpoint worth acknowledging. A CWE classification that is too specific is almost as bad as one that is too general. If the tool confidently picks CWE-1236 formula injection for every finding that involves spreadsheet export, including the ones where the code already escapes the formula sigil, it is overclaiming specificity it has not earned. Griffin's behaviour here is to pick the most specific class the engine evidence supports and to fall back to the parent when the evidence is ambiguous. The heuristic is not that different from how a careful human classifier behaves: prefer specificity when you have the ground truth; back off when you do not.
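The fall-back heuristic can be written down directly. The evidence keys here are hypothetical labels for what an engine might confirm, not Griffin's actual schema:

```python
# Sketch of "most specific class the evidence supports": return the
# child CWE only when all of its distinguishing evidence is present,
# otherwise back off to the parent.
def pick_class(parent: str, child: str,
               evidence: dict, required: set) -> str:
    """Prefer `child` only if every required evidence flag is True."""
    confirmed = {k for k, v in evidence.items() if v}
    return child if required <= confirmed else parent

# CWE-1236 formula injection needs an unescaped formula sigil in a
# spreadsheet export; if the code already escapes the sigil, the
# specific child class is not earned and the parent wins.
cwe = pick_class(
    "CWE-74", "CWE-1236",
    evidence={"spreadsheet_export": True, "sigil_unescaped": False},
    required={"spreadsheet_export", "sigil_unescaped"},
)
```

Here the export sink is confirmed but the sigil is escaped, so the function returns the parent class rather than gambling on CWE-1236. The asymmetry is deliberate: a parent-level label is recoverable downstream, an overclaimed child label is not.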
Mythos-class tools tend to fall into a different trap. They either stay aggressively at the parent class (because the training prior is strongest there) or they gamble on a specific child class without the grounding to support it. Either failure mode breaks downstream automation.
A small numerical note
On a mixed Java, Python, and Go benchmark I maintain internally, Griffin's top-1 CWE classification accuracy against a hand-labelled set sits at 91 percent, with 78 percent at the most specific child level. The two Mythos-class tools I benchmark against on the same set land at 62 percent and 71 percent top-1, with single-digit percentages at the specific child level. The gap at the child level is where the remediation and compliance wins live.
How Safeguard helps
Safeguard uses the CWE classification from Griffin AI to route findings directly into the correct remediation templates, compliance attestations, and runtime detection runbooks. Because the classification is grounded in the taint path, teams do not have to re-classify findings by hand before they can act on them. Compliance reports map cleanly to PCI-DSS, HIPAA, and ISO 27001 control families. Detection engineers get the right sinks to instrument. The tool behaves like a correctly labelled library, which is what a security programme needs it to be.