AI Security

Eval Methodology: Griffin AI vs Mythos

A benchmark number is only as good as the methodology that produced it. Here is how Griffin AI builds its harness and why most Mythos-class tools cannot be audited.

Nayan Dey
Senior Security Engineer
7 min read

A score on a slide is not an eval. A methodology is an eval. The score is just the output. If a vendor hands you an accuracy number without the procedure that produced it, you do not have information; you have an aspiration with a percent sign on the end.

This post walks through the methodology behind Griffin AI's published benchmarks and contrasts it with what we typically see from Mythos-class competitors when we ask them the same questions under NDA.

Four things a methodology must answer

Before we get to Griffin AI specifically, let us anchor on what a legitimate eval methodology must disclose. There are four questions it must answer:

  1. How was the dataset built? Provenance, size, labeling procedure, inter-annotator agreement.
  2. How is the task defined? What is the input, what is the expected output, what counts as correct?
  3. How is the output graded? Human, automated, or hybrid, and if hybrid, where is the boundary?
  4. How is the harness run? What model, what prompt, what tools, what context, what seeds?

If a vendor cannot answer all four, their number is not reproducible. If a number is not reproducible, it is not a benchmark; it is a cherry-picked demo.

Griffin AI's dataset construction

The Griffin AI harness draws from five sources, weighted by task family.

  • Public advisory corpora: NVD, GHSA, OSV, PSIRT feeds from major vendors, and a curated subset of KEV-listed CVEs.
  • Internal red-team set: 1,200+ findings generated by our own offensive-security team against a rotating portfolio of open-source targets, with ground-truth exploitability labeled by the engineer who wrote the payload.
  • Customer-contributed corpus: a consented, anonymized pool of roughly 8,000 findings from design-partner tenants, stripped of identifiers and aged by at least 90 days before use.
  • Synthetic adversarial probes: approximately 3,000 prompt-injection, jailbreak, and data-exfil probes constructed by combining known attack patterns with randomized payloads.
  • Held-out eval set: a frozen, never-trained-on set of 2,000 items per task family, rotated annually to prevent overfitting.

Each item in each corpus has a label, a provenance record, and a license flag. If a dataset is not auditable at the per-item level, it is not a golden dataset; it is a bag of strings.
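To make per-item auditability concrete, here is a minimal sketch of what such a record could look like. The field names are illustrative assumptions for this post, not Griffin AI's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-item record; field names are illustrative, not the real schema.
@dataclass(frozen=True)
class EvalItem:
    item_id: str                    # stable identifier within the corpus
    task_family: str                # e.g. "exploit-hypothesis", "remediation-pr"
    source: str                     # "advisory", "red-team", "customer", "synthetic", "held-out"
    label: str                      # ground-truth answer for the task
    labeled_by: str                 # analyst or pipeline that produced the label
    provenance_url: Optional[str]   # advisory link, red-team ticket, or similar
    license_flag: str               # terms under which the item may be used in evals
    ingested_at: str                # ISO-8601 date, used to enforce aging rules
```

The point of the structure is not the exact fields; it is that every item can answer "where did you come from and who says your label is right" without a human digging through a spreadsheet.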

Task definitions that survive contact with reality

Our five task families each have a written definition that fits on a single page. The exploit-hypothesis task, for instance, is defined as:

Given a finding (CWE, file, line, stack trace, and project context), produce a JSON object containing (a) a reachability verdict — one of reachable, conditional, unreachable — (b) a one-paragraph exploit hypothesis, and (c) a confidence score in the zero-to-one range. The verdict is correct if it matches the majority vote of three senior analysts given the same inputs.

That definition is deliberately narrow. It does not say "finds vulnerabilities." It does not say "understands code." It says exactly what the model is asked to do and exactly how the answer is scored. Narrow definitions are how you get to an 81% number that means something rather than a 97% number that means nothing.
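A definition that narrow is also easy to enforce mechanically. The sketch below shows one way a harness might validate the model's output shape before grading; the key names follow the definition above, but the code itself is illustrative, not our production harness.

```python
import json

VALID_VERDICTS = {"reachable", "conditional", "unreachable"}

def parse_exploit_hypothesis(raw: str) -> dict:
    """Validate a model response against the exploit-hypothesis task definition.

    Expects a JSON object with a reachability verdict, a one-paragraph
    hypothesis, and a confidence score in [0, 1]. Anything else is rejected
    before grading, so malformed output scores as incorrect rather than
    being silently repaired.
    """
    obj = json.loads(raw)
    verdict = obj["reachability"]
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if not isinstance(obj["hypothesis"], str) or not obj["hypothesis"].strip():
        raise ValueError("hypothesis must be a non-empty string")
    confidence = float(obj["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {"reachability": verdict, "hypothesis": obj["hypothesis"], "confidence": confidence}
```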

Mythos-class competitors frequently describe their capability in terms like "our AI agent triages findings with 95% accuracy." When we press for the task definition, the most common answers we hear are:

  • "Our customers define accuracy."
  • "Our accuracy is measured on production data."
  • "We cannot share the methodology because it is proprietary."

None of those are methodologies. The first is a delegation. The second is a tautology. The third is a refusal.

Grading: where most vendors cut corners

Grading is where eval methodology lives or dies. For each task family, Griffin AI uses a hybrid rubric.

  • Exploit hypothesis: three-analyst majority vote on a 500-item sample, with 100% of disagreements adjudicated by a fourth analyst. Inter-annotator agreement runs around a Cohen's kappa of 0.81, which we publish.
  • Remediation-PR correctness: fully automated. We check out the target repo, apply the generated PR, run the project's test suite, and record compile + test pass/fail.
  • Advisory summarization: semantic similarity against a curated reference summary using a held-out embedding model, with a human-preference spot check on a 200-item subsample.
  • Cross-finding correlation: automated set-comparison against a hand-labeled ground-truth graph.
  • Adversarial resistance: automated classifier plus a 10% human-review sample to catch false negatives in the classifier itself.

The important detail here is that the human-grading rate is published per task family. A vendor that claims high accuracy with 100% automated grading is really measuring how well their model agrees with whatever grader model they are using. That is not an eval; that is a feedback loop.
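The remediation-PR check is the most mechanical of the five, which makes it easy to illustrate. Here is a rough sketch, assuming a git checkout and a project whose tests run under a single command; the real harness handles per-ecosystem build steps, timeouts, and sandboxing that are omitted here.

```python
import subprocess
import tempfile

def grade_remediation_pr(repo_url: str, base_commit: str, patch_path: str,
                         test_cmd: list[str]) -> dict:
    """Apply a generated patch to a pinned commit and record apply/test pass-fail.

    Illustrative only. patch_path should be an absolute path, since the
    working directory changes to the temporary checkout.
    """
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)

        applied = subprocess.run(["git", "apply", patch_path], cwd=workdir).returncode == 0
        if not applied:
            return {"applies": False, "tests_pass": False}

        tests = subprocess.run(test_cmd, cwd=workdir)
        return {"applies": True, "tests_pass": tests.returncode == 0}
```

No grader model, no human judgment: the project's own test suite is the rubric, which is exactly why this task family can be graded fully automatically without creating the feedback loop described above.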

Running the harness

The harness is deterministic wherever possible. Every run records:

  • Model identifier and version.
  • System prompt hash.
  • Tool-use configuration.
  • Retrieval corpus snapshot ID.
  • Random seed.

A run that cannot be reproduced is a run that cannot be trusted. We publish the harness configuration in the release notes for each benchmark update, and design partners can request the exact configuration used for any number on our public scoreboard.
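For illustration, a run manifest along these lines is enough to pin down a run; the field names are ours for this sketch, not the harness's actual format.

```python
import hashlib
import json

def run_manifest(model_id: str, system_prompt: str, tool_config: dict,
                 corpus_snapshot_id: str, seed: int) -> dict:
    """Record everything needed to reproduce a harness run.

    Hashing the system prompt lets the configuration be published without
    publishing the prompt itself.
    """
    return {
        "model_id": model_id,
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "tool_config": tool_config,
        "corpus_snapshot_id": corpus_snapshot_id,
        "seed": seed,
    }

# Example: persist the manifest alongside the run's item-level scores.
manifest = run_manifest("example-model-2025-01", "You are a security analyst...",
                        {"retrieval": True, "code_exec": False}, "snap-0042", seed=1337)
print(json.dumps(manifest, indent=2))
```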

The Mythos-class pattern here is telling: several competitors we have evaluated refuse to disclose even the model family behind their agent, let alone the prompt or the retrieval configuration. A security product that will not tell you what model is making decisions about your vulnerabilities is asking you to delegate judgment to a black box inside another black box.

Confidence intervals and honest error bars

A benchmark without an error bar is a benchmark that thinks it is more precise than it is. Every Griffin AI number has a 95% bootstrap confidence interval, computed on the item-level scores. The exploit-hypothesis 81% is really 81% with a 95% CI of roughly [78.4, 83.6] on the current held-out set.

We publish the CI because it is honest. It also changes the conversation: a buyer asking whether Griffin AI at 81% is meaningfully better than a competitor at 84% should first ask whether the competitor's number has an error bar at all. If it does not, the 3-point gap is noise pretending to be signal.
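For readers who want to sanity-check numbers like these, a percentile bootstrap over item-level scores is straightforward to implement. The sketch below is a generic version, not the exact code behind our published intervals.

```python
import random

def bootstrap_ci(item_scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-item scores (1.0 = correct, 0.0 = wrong)."""
    rng = random.Random(seed)
    n = len(item_scores)
    means = sorted(
        sum(rng.choices(item_scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example: 2,000 held-out items, roughly 81% correct.
scores = [1.0] * 1620 + [0.0] * 380
print(bootstrap_ci(scores))  # roughly (0.79, 0.83) on this toy sample
```

Run against the actual item-level scores rather than a toy sample, this is all it takes to attach an honest error bar to a headline accuracy number.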

The reproducibility test

Here is a simple test any buyer can run on any vendor, including us.

  1. Ask for the methodology document for a specific claimed number.
  2. Ask for the dataset size, sourcing, and labeling procedure.
  3. Ask whether a design partner can re-run the harness on their own infrastructure against a copy of the held-out set.
  4. Ask when the number was last computed and on what model version.

Griffin AI can answer all four. Our expectation is that most Mythos-class competitors will answer zero or one.

What this costs us

Publishing methodology is expensive. It exposes our weaknesses. It gives competitors a map of our eval set structure. It creates an expectation that the next release will not regress, which constrains engineering choices.

We do it anyway because the alternative is the status quo: a market where AI security vendors compete on the size of their accuracy claims rather than the honesty of their measurement. That market is bad for buyers, bad for defenders, and bad for the long-term reputation of AI in security.

The bottom line

Methodology is the real benchmark. The score is just what falls out when you do the methodology honestly. If you are comparing Griffin AI to a Mythos-class alternative, do not compare the scores first. Compare the methodology documents. If one vendor has one and the other does not, the comparison is already over.
