AI Security

Golden Dataset Design: Griffin AI vs Mythos

Benchmark scores are only as honest as the dataset behind them. Griffin AI publishes golden-dataset design notes; Mythos-class tools rarely explain theirs.

Shadab Khan
Security Engineer
7 min read

Every benchmark number has a dataset behind it. If the dataset is small, curated to be easy, or leaked into training, the benchmark number is a fiction. If the dataset is large, hard, and clean, the benchmark number is a signal.

Griffin AI publishes the design of its golden datasets: how they were built, how they are labeled, how they are refreshed, and how they are protected from contamination. Most Mythos-class competitors publish nothing at this layer, which means their numbers cannot be interpreted. This post explains what good golden-dataset design looks like and how to evaluate it.

What "golden" actually means

"Golden dataset" is a term of art that gets abused. In the rigorous usage, a golden dataset has four properties:

  1. Held-out: never used in training, fine-tuning, or prompt optimization.
  2. Labeled: every item has a ground-truth answer, produced by a documented procedure.
  3. Representative: items are drawn to match the real production distribution.
  4. Refreshed: the set is rotated or extended on a known cadence to prevent overfitting.

A set that lacks any of the four is not golden; it is a convenience dataset, which is what most Mythos-class tools quietly benchmark against.

The Griffin AI construction

We maintain five golden datasets, one per task family. Each has the same core structure.

  • Size: 2,000 items per family, rotated annually with ~20% item turnover.
  • Sourcing: a mix of public advisory data (NVD, GHSA, OSV), internal red-team findings, and consented design-partner contributions. Exact mix varies per family and is published.
  • Labeling: three-analyst majority vote for subjective tasks (exploit hypothesis, summarization), fully automated ground truth for objective tasks (PR compile, citation validity), hybrid for the rest.
  • Contamination protection: held-out items are never included in training corpora, never served via retrieval in the systems that evaluate against them, and are hashed so we can detect leakage.
  • Rotation: ~20% of items are retired each year and replaced with new items drawn from the current production distribution.

The rotation is the detail that matters most. Static golden datasets rot. The production distribution drifts, and a static set stops being representative, which means yesterday's 90% on the set is not the same thing as today's 90%. An annual rotation keeps the number honest.
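
To make the rotation concrete, here is a minimal sketch of what an annual ~20% turnover could look like in code. The item fields (added_on, severity_band), the single stratification dimension, and the function names are illustrative assumptions, not Griffin AI's actual pipeline.

```python
import random
from collections import Counter

def rotate_golden_set(golden_set, production_pool, turnover=0.20, seed=0):
    """Retire ~20% of items and backfill from the current production distribution.

    `golden_set` and `production_pool` are lists of dicts; 'severity_band' is
    used here as a single stand-in stratification dimension.
    """
    rng = random.Random(seed)
    n_retire = int(len(golden_set) * turnover)

    # Retire the oldest items first so the set keeps tracking current traffic.
    survivors = sorted(golden_set, key=lambda item: item["added_on"])[n_retire:]

    # Sample replacements stratified by the production severity mix.
    target_mix = Counter(item["severity_band"] for item in production_pool)
    total = sum(target_mix.values())
    replacements = []
    for band, count in target_mix.items():
        candidates = [i for i in production_pool if i["severity_band"] == band]
        k = round(n_retire * count / total)
        replacements.extend(rng.sample(candidates, min(k, len(candidates))))

    return survivors + replacements
```

In practice the stratification would run over all four distribution dimensions described later in this post, not one.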

The inter-annotator number

For the subjective tasks, the quality of the labels depends on how much the human annotators agree with each other. We publish the inter-annotator agreement for each subjective family.

  • Exploit hypothesis: Cohen's kappa 0.81.
  • Advisory summarization: pairwise BLEU 0.74 between analyst-written references, which we use as a ceiling on similarity-based metrics.
  • Cross-finding correlation (for the slices that require human judgment): kappa 0.77.

Those numbers are important because they put a ceiling on the benchmark. A model cannot meaningfully score higher than the inter-annotator agreement, because above that point the "errors" are really just disagreements among human experts. Our 81% exploit-hypothesis score is near-ceiling on a 0.81-kappa set; claiming much higher would be suspicious.
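
For readers who want the agreement numbers grounded, here is a minimal sketch of how Cohen's kappa is computed from two annotators' labels on the same items. The toy labels are illustrative, not drawn from our adjudication data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is
    the agreement expected by chance from each annotator's label marginals.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the two marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)

# Toy example: two analysts labeling whether a finding is plausibly exploitable.
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
b = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]
print(round(cohens_kappa(a, b), 2))  # 0.5 on this toy data
```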

Vendors who do not publish inter-annotator agreement are publishing scores without a ceiling. A 97% claim on a set where humans only agree 70% of the time is either a sign of test-set contamination or a sign that the model is trained to agree with the labeling procedure rather than with reality.

Contamination is the silent killer

The fastest way to game a benchmark is to train on it. This is rarely done deliberately; it is done through accident, sloppy data pipelines, or pretraining corpora that happen to include the benchmark's source material.

Griffin AI's contamination protections:

  • Hash every held-out item and check training corpora for exact and near-duplicate matches before fine-tuning runs (a sketch of this check follows the list).
  • Use ~6 months of time-shifted construction: held-out items are drawn from a time window that begins after the training-data cutoff of the foundation model in use, with an additional 30-day buffer.
  • For any benchmark that uses public advisory data, we verify that the advisory's text appears neither in the foundation model's pretraining corpus (to the extent we can determine) nor in any retrieval corpus served to the evaluated system.
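
As a rough illustration of the exact and near-duplicate check from the first bullet, the sketch below hashes normalized text for exact matches and word shingles for near matches. The normalization rules, shingle size, and overlap threshold are assumptions for the example, not our production tooling.

```python
import hashlib

def normalize(text):
    """Collapse whitespace and lowercase so trivial reformatting cannot hide a match."""
    return " ".join(text.lower().split())

def exact_hash(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingle_hashes(text, k=8):
    """Hashes of overlapping k-word shingles, a crude near-duplicate signal."""
    words = normalize(text).split()
    return {hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
            for i in range(max(1, len(words) - k + 1))}

def contamination_hits(held_out_items, training_corpus, overlap_threshold=0.5):
    """Flag held-out items that appear exactly or nearly in the training corpus."""
    train_exact = {exact_hash(doc) for doc in training_corpus}
    train_shingles = set().union(*(shingle_hashes(doc) for doc in training_corpus))

    hits = []
    for item in held_out_items:
        shingles = shingle_hashes(item)
        overlap = len(shingles & train_shingles) / max(1, len(shingles))
        if exact_hash(item) in train_exact or overlap >= overlap_threshold:
            hits.append(item)
    return hits
```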

Are we 100% contamination-free? No. That is not a claim we make. What we do claim is that we have procedures, we publish them, and we spot-check them quarterly. A vendor that does not even acknowledge the contamination problem is not running a clean benchmark; they are running a vanity benchmark.

The representativeness problem

A golden dataset that does not match the production distribution is measuring the wrong thing. A concrete example: if a benchmark for advisory summarization contains 80% CVSS 9+ advisories, and production traffic is 60% CVSS 4-6, the benchmark is biased toward the easier, more widely discussed high-severity cases.

We match distributions on four dimensions per family:

  • Severity (CVSS band).
  • Ecosystem (npm, PyPI, Maven, etc.).
  • CWE category.
  • Advisory age at time of evaluation.

Matching is not perfect (production drifts, golden sets rotate only annually), but the gap is monitored and published. When the gap exceeds a threshold on any dimension, we backfill the set ahead of the annual rotation.
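
Here is a minimal sketch of that gap monitoring, assuming categorical distributions per dimension and total variation distance as the gap metric; the actual metric, threshold, and field names are not specified in this post and are assumptions of the example.

```python
from collections import Counter

def distribution(items, key):
    """Normalized categorical distribution of `key` across a list of item dicts."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def gap(golden_items, production_items, key):
    """Total variation distance between golden-set and production distributions."""
    g, p = distribution(golden_items, key), distribution(production_items, key)
    return 0.5 * sum(abs(g.get(v, 0.0) - p.get(v, 0.0)) for v in set(g) | set(p))

DIMENSIONS = ["severity_band", "ecosystem", "cwe_category", "advisory_age_bucket"]

def dimensions_needing_backfill(golden_items, production_items, threshold=0.10):
    """Return the dimensions whose golden/production gap exceeds the threshold."""
    return [d for d in DIMENSIONS
            if gap(golden_items, production_items, d) > threshold]
```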

What Mythos-class tools reveal

When we ask competitors about their golden datasets under NDA, the most common answers are:

  • "We evaluate on our production data." This is not a golden set; it is a feedback loop. Production data is neither held-out nor rigorously labeled.
  • "We use a customer-provided test set." This is better, but a single customer's test set is not representative of the market.
  • "Our benchmarks are based on industry-standard datasets." We ask which ones. The answer, in most cases, is either silence or a reference to a public dataset that is almost certainly in the training data of the foundation model underneath.

None of those are designs. They are hand-waves in the shape of a design.

Publishing the design

Publishing the design is the price of admission for publishing the number. If we tell you we score 0.89 on advisory summarization and do not tell you what "advisory summarization" means, what the dataset looks like, how it was labeled, and how it is refreshed, the 0.89 is decorative.

Our golden-dataset documentation for each family covers:

  • Source manifest (per-family corpus breakdown with sampling weights).
  • Labeling SOP (step-by-step procedure for annotators).
  • Inter-annotator agreement (measured, published, trended).
  • Rotation schedule (next refresh date and target composition).
  • Contamination checks (last-run date and hit rate).

A buyer can read those documents and form an independent opinion about whether the 0.89 is trustworthy. That is the entire point.

Cost

Maintaining proper golden datasets is expensive. Three senior analysts adjudicating 500 items per quarter per family, across five families, is a large recurring line item. The rotation work is another meaningful chunk. The contamination infrastructure is a background cost that never quite goes away.

We spend it because the alternative is indistinguishable from the competitor set we are trying to differentiate from. A benchmark without a good dataset is a benchmark you cannot defend. A benchmark you cannot defend is a benchmark you should not publish.

What a buyer can ask

Three questions to ask any vendor:

  1. "Can you share the construction document for your golden dataset?"
  2. "What is your inter-annotator agreement on the subjective tasks?"
  3. "When was the last time you rotated the set, and how do you prevent contamination?"

Griffin AI answers all three with documents. Mythos-class competitors will typically answer all three with a variation of "that is proprietary." Proprietary is not a benchmark methodology; it is a decline to be audited.

The bottom line

The dataset is the benchmark. The score is a byproduct. If a vendor shows you a score without a dataset design, what you are looking at is a number without a referent; it points at nothing. Griffin AI publishes the referent. That is the difference between a benchmark and a brochure, and it is the whole reason the numbers mean anything at all.
