Citation Accuracy: Griffin AI vs Mythos

An AI security tool that cites the wrong advisory is worse than one that says nothing. Griffin AI benchmarks citation accuracy at 0.89 similarity; Mythos does not.

Nayan Dey
Senior Security Engineer
7 min read

The worst failure mode of an AI security assistant is not "did not know the answer." It is "confidently cited the wrong source." A security engineer who sees a citation trusts it. If the citation is fabricated or misattributed, the engineer has just been weaponized against their own judgment.

Griffin AI treats citation accuracy as a first-class benchmark. Our advisory-summarization family scores 0.89 against curated reference summaries, and we publish the rubric. Mythos-class competitors, in the public materials we have reviewed, do not report a citation-accuracy metric at all. This post is about why that gap is the most expensive gap in the AppSec AI category.

What a citation failure actually looks like

Let us be concrete. Here are three citation-failure modes we have observed in competitor demos (not named, and with details blurred for responsible-disclosure reasons).

  • Misattribution: a competitor's agent summarized a vulnerability and attributed it to CVE-2023-XXXXX. The CVE number existed but referred to a completely unrelated issue. The summary was internally consistent and totally wrong.
  • Fabricated advisory ID: a competitor's agent cited GHSA-xxxx-yyyy-zzzz as the source for a remediation recommendation. The ID did not exist in the GHSA registry at all.
  • Stale citation: a competitor's agent cited a superseded advisory as the canonical source, ignoring a later, authoritative update that had reclassified the severity and changed the remediation guidance.

Any one of those, in production, ends with a security engineer either fixing the wrong thing or explaining to their manager why the fix they shipped did not close the finding.

What Griffin AI measures

Griffin AI's advisory-summarization benchmark measures three things, not one:

  1. Similarity: semantic similarity between the model's summary and a curated reference summary, computed with a held-out embedding model. This is the 0.89 number in our public scorecard.
  2. Citation validity: the fraction of cited identifiers (CVE, GHSA, advisory URLs) that resolve to a real, current source. Our current number here is 99.4%.
  3. Citation relevance: among valid citations, the fraction that are materially on-topic to the summary they support. Current number: 96.1%.

The three numbers are intentionally separated. A model can score high on similarity and low on citation validity; that is a model that writes well and cites poorly. A model can score high on citation validity and low on relevance; that is a model that cites real sources that happen to be irrelevant. Both are real failure modes, and both require separate measurement.
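
To make the separation concrete, here is a minimal sketch of what a three-axis harness can look like. The helpers passed in (embed, resolves, is_on_topic) are placeholders for illustration, not Griffin AI's internal implementation.

    # Sketch of a three-axis citation benchmark. embed, resolves, and is_on_topic
    # are hypothetical stand-ins supplied by the caller.
    import re
    import numpy as np

    ID_PATTERN = re.compile(r"(CVE-\d{4}-\d{4,}|GHSA-\w{4}-\w{4}-\w{4})")

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def score(outputs, references, embed, resolves, is_on_topic):
        """outputs/references: paired lists of summary strings.
        embed: text -> vector; resolves: id -> bool; is_on_topic: (id, text) -> bool."""
        sims, cited, valid, relevant = [], 0, 0, 0
        for out, ref in zip(outputs, references):
            sims.append(cosine(embed(out), embed(ref)))    # axis 1: similarity
            for cid in ID_PATTERN.findall(out):
                cited += 1
                if resolves(cid):                          # axis 2: validity
                    valid += 1
                    if is_on_topic(cid, out):              # axis 3: relevance
                        relevant += 1
        return {
            "similarity": sum(sims) / len(sims),
            "citation_validity": valid / cited if cited else 1.0,
            "citation_relevance": relevant / valid if valid else 1.0,
        }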

Why similarity alone is not enough

Many published AI benchmarks for summarization score summaries with ROUGE, BLEU, or cosine similarity against a reference. Those metrics are necessary but not sufficient for security use cases, because they reward fluency rather than factuality.

A well-written but factually wrong summary will score higher on similarity than a poorly written but correct one, if the wrong summary happens to share more surface vocabulary with the reference. That is not a hypothetical; it is what we observed in early versions of our own harness before we added citation validity as a separate axis.
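
A toy illustration, using bare unigram overlap as a stand-in for a surface-similarity metric; the summaries are invented for the example.

    # Toy demo: a surface-overlap metric prefers the fluent-but-wrong summary.
    def unigram_overlap(candidate: str, reference: str) -> float:
        cand, ref = set(candidate.lower().split()), set(reference.lower().split())
        return len(cand & ref) / len(ref)

    reference = "Remote attackers can execute arbitrary code via crafted archive headers"
    wrong_but_fluent = ("Remote attackers can execute arbitrary code via crafted HTTP "
                        "headers in the login endpoint")   # wrong root cause, shared wording
    right_but_terse = "Malicious archive metadata leads to code execution"  # correct, different wording

    print(unigram_overlap(wrong_but_fluent, reference))  # 0.9
    print(unigram_overlap(right_but_terse, reference))   # 0.2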

The Mythos-class tools that claim high "summary accuracy" almost certainly mean similarity alone. Ask them what fraction of their citations resolve. If they do not know, they are not measuring what matters.

The fabrication problem

Large language models fabricate citations. This is not new, it is not controversial, and it is not going away without intervention. The intervention that works is grounded retrieval: do not ask the model to recall a citation; give the model the citation and ask it to use it.

Griffin AI's architecture enforces grounded citation. The model is never asked to recall a CVE number from memory; it is given the relevant advisories as retrieved context and is required to cite only within that context. Outputs that claim a citation not present in the retrieved set are rejected by the output filter before they leave the system.
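
A minimal sketch of what such a filter can look like; the identifier regex and the rejection path are illustrative, not a description of Griffin AI's production code.

    # Sketch of a grounded-citation filter: every identifier the model cites must
    # already be present in the retrieved advisory set, or the output is rejected.
    import re

    ID_PATTERN = re.compile(r"(CVE-\d{4}-\d{4,}|GHSA-\w{4}-\w{4}-\w{4})", re.IGNORECASE)

    def enforce_grounding(output_text: str, retrieved_ids: set) -> str:
        allowed = {i.upper() for i in retrieved_ids}
        cited = {m.upper() for m in ID_PATTERN.findall(output_text)}
        ungrounded = cited - allowed
        if ungrounded:
            # The model cited something it was never shown: do not let it out.
            raise ValueError(f"ungrounded citations: {sorted(ungrounded)}")
        return output_text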

This is an architectural choice, not a prompt-engineering choice. A tool that relies on "please only cite real sources" in its system prompt is a tool that will fabricate the first time a prompt gets long enough to push the instruction out of attention. We have seen this in competitor outputs.

What the 0.89 means and what it does not

0.89 is not 1.0. It means our summaries agree with curated references most of the time but not always. The 0.11 gap breaks down roughly as:

  • ~0.04 is stylistic variation (our summary uses different but accurate phrasing).
  • ~0.04 is omission (our summary is shorter than the reference and drops a secondary detail).
  • ~0.03 is genuine disagreement (our summary characterizes the issue differently from the reference, sometimes correctly and sometimes not).

The 0.03 is the slice that matters. We sample it quarterly, have three analysts adjudicate, and use the disagreements as seed data for the next training cycle. The 0.89 is not the interesting number; the trend on the 0.03 is.

The Mythos-class silence

When we review the public materials of competitors, the section on citation accuracy is almost always missing. A common pattern is a single sentence like "our AI cites authoritative sources" without any benchmark behind it.

We think this is because citation accuracy is the easiest benchmark for a buyer to verify themselves and the hardest for a vendor to hide. If a demo output cites GHSA-xxxx-yyyy-zzzz, a skeptical buyer can paste the ID into the GHSA search bar in ten seconds. If the ID does not resolve, the demo is over.

So competitors do not demo outputs with citations, or they demo with citations that are pre-verified by the sales engineer, or they demo with citations that are generically structured (a URL to a vendor's own documentation, for instance) so verification is impossible. None of that is a benchmark.
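
That ten-second check also scripts cleanly. A rough sketch against GitHub's public advisory endpoint (GET /advisories/{ghsa_id}); error handling and rate-limit handling are left out.

    # Sketch: does a cited GHSA identifier resolve in the GitHub Advisory Database?
    # Unauthenticated requests work for public advisories but are rate limited.
    import requests

    def ghsa_resolves(ghsa_id: str) -> bool:
        resp = requests.get(f"https://api.github.com/advisories/{ghsa_id}",
                            headers={"Accept": "application/vnd.github+json"},
                            timeout=10)
        return resp.status_code == 200

    print(ghsa_resolves("GHSA-xxxx-yyyy-zzzz"))  # a fabricated ID prints False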

What a buyer can ask for

Three questions, in order of decreasing politeness:

  1. "What is your citation-validity rate, and how is it measured?"
  2. "Can you walk me through a recent sample where a citation was wrong, and what you did about it?"
  3. "Can my team run a 50-finding sample through your system and verify every citation ourselves?"

Griffin AI can answer all three. Our expectation is that competitors will answer the first with a feature bullet, deflect on the second, and decline the third on contractual grounds. If that happens, you have learned something.

The cost of getting this wrong

A security engineer who trusts a wrong citation wastes hours. A security engineer who trusts many wrong citations loses confidence in the tool and stops using it. A security team that deploys a tool with bad citation accuracy ships more vulnerabilities than a team with no tool at all, because the tool creates the illusion of coverage.

This is the worst case for any AI security product: a tool that is used less than a spreadsheet because the spreadsheet does not lie. If citation accuracy is not benchmarked, this outcome is the default.

Design implications

For teams building on Griffin AI, the citation guarantee has specific implications:

  • Surface the citation: every finding in the Griffin AI UI links to the source advisory directly, with the ID and URL both shown. Engineers verify by eye without leaving the tool.
  • Show the retrieval set: for any model output, the retrieved advisory set is available under a "sources" disclosure. Engineers can see what the model saw.
  • Flag low-confidence citations: when the model's citation confidence is below a threshold, the UI shows a warning rather than suppressing the citation. The engineer decides.

Those are product decisions that fall out of a benchmark. Without the benchmark, none of them get prioritized.
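
One way to picture the payload shape those decisions imply; the field names below are hypothetical, not the Griffin AI API.

    # Hypothetical finding payload reflecting the three decisions above: the
    # citation is surfaced with ID and URL, the retrieval set travels with the
    # finding, and low-confidence citations are flagged rather than suppressed.
    from dataclasses import dataclass, field

    @dataclass
    class Citation:
        advisory_id: str       # CVE or GHSA identifier shown in the UI
        url: str               # direct link to the source advisory
        confidence: float      # model's citation confidence, 0.0 to 1.0
        needs_review: bool     # True -> UI shows a warning, never hides the citation

    @dataclass
    class Finding:
        summary: str
        citations: list        # list of Citation
        retrieved_advisories: list = field(default_factory=list)  # the "sources" disclosure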

The bottom line

A citation is a contract. The tool promises the engineer that the source is real, current, and relevant. Break the contract and the tool becomes anti-useful. Griffin AI benchmarks the contract. Mythos-class tools, in most cases, do not. Choose accordingly.
