A benchmark run once, at launch, is a commercial artifact. A benchmark run on every build is an engineering discipline. The difference between the two is the difference between a vendor who made claims and a vendor who keeps proving them.
Griffin AI's harness is continuous: it runs on every prompt change, every model update, every retrieval-corpus refresh, and every dependency bump. Mythos-class competitors, in the materials we have reviewed, describe neither a continuous-eval pipeline nor a gating process tied to one. This post explains why that gap is the difference between a measurable product and a hope.
What "continuous" means operationally
Continuous eval is not "we run the benchmark before each quarterly release." That is periodic eval, and it misses most regressions. Continuous eval means the harness is wired into the build pipeline the same way unit tests are.
For any change that touches the model layer, prompt library, retrieval config, or tool surface, the Griffin AI build pipeline runs:
- A fast smoke subset (roughly 200 items, ~6 minutes) on every commit.
- The full harness (~10,000 items across five families, ~90 minutes) on every PR before merge.
- A nightly shadow run against the previous 24 hours of anonymized production traffic.
- A weekly full re-baseline against the held-out set, with rotation-aware reporting.
That is four separate cadences, each serving a different purpose. The smoke subset catches catastrophic breakage fast. The full harness gates merges. The nightly shadow detects drift in production distribution. The weekly baseline keeps the historical trend honest.
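To make the wiring concrete, here is a minimal sketch of how the four cadences might be dispatched. Everything in it is illustrative rather than Griffin AI's actual pipeline code; only the item counts, rough durations, and purposes come from the list above.

```python
# Tiered eval dispatch: a sketch, not Griffin AI's real pipeline.
# Trigger names, TIERS, and run_suite() are assumptions for illustration.
from enum import Enum

class Trigger(Enum):
    COMMIT = "commit"        # every push: smoke subset
    PRE_MERGE = "pre_merge"  # every PR: full harness, gates the merge
    NIGHTLY = "nightly"      # shadow run on 24h of anonymized prod traffic
    WEEKLY = "weekly"        # full re-baseline on the held-out set

# trigger -> (suite, approximate item count, approximate wall-clock minutes)
TIERS = {
    Trigger.COMMIT:    ("smoke",        200,    6),
    Trigger.PRE_MERGE: ("full_harness", 10_000, 90),
    Trigger.NIGHTLY:   ("shadow",       None,   None),  # sized by traffic, not items
    Trigger.WEEKLY:    ("rebaseline",   10_000, None),  # rotation-aware reporting
}

def run_suite(trigger: Trigger) -> None:
    suite, items, minutes = TIERS[trigger]
    print(f"running {suite} ({items} items, ~{minutes} min) for {trigger.value}")
```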
The four things a continuous-eval pipeline must produce
Any serious continuous-eval pipeline produces four artifacts per run:
- A score per task family with a confidence interval.
- A diff against the previous baseline at the item level, not just the aggregate level.
- A failure triage list: items that regressed, items that improved, items that flipped.
- A gate decision: merge, block, or escalate.
Griffin AI's harness emits all four on every run, and the artifacts are durable: we can pull the failure triage from any run in the last 18 months and reconstruct exactly what the model was doing on a given item on a given day.
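As a rough illustration, the four artifacts could be modeled as plain data structures along these lines. The field names, the Gate enum, and the bootstrap-CI framing are assumptions for illustration, not Griffin AI's actual schema.

```python
# The four per-run artifacts as dataclasses: a sketch under assumed names.
from dataclasses import dataclass, field
from enum import Enum

class Gate(Enum):           # artifact 4: the gate decision
    MERGE = "merge"
    BLOCK = "block"
    ESCALATE = "escalate"

@dataclass
class FamilyScore:          # artifact 1: score per family with a CI
    family: str             # e.g. "exploit_hypothesis" (illustrative name)
    score: float
    ci_low: float           # e.g. bootstrap 95% interval, lower bound
    ci_high: float

@dataclass
class ItemDiff:             # artifact 2: item-level diff vs. the baseline
    item_id: str
    baseline_pass: bool
    candidate_pass: bool    # (True, False) = regressed; (False, True) = improved

@dataclass
class RunReport:
    run_id: str
    scores: list[FamilyScore]
    item_diffs: list[ItemDiff]
    triage: list[ItemDiff] = field(default_factory=list)  # artifact 3: flipped items
    gate: Gate = Gate.ESCALATE  # default to human review, never to a silent merge
```

Durability then just means these reports are written to storage keyed by run, which is what makes the 18-month lookback possible.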
This is not exotic engineering. This is regression testing applied to a stochastic component. It is how you ship a model-based product without shipping a model-based product you cannot reason about.
Why Mythos-class tools struggle here
Continuous eval requires three things that are expensive to build:
- A durable golden dataset (covered in an earlier post).
- A harness deterministic enough that runs can be compared.
- A pipeline integration that is robust enough to run on every merge without breaking the developer experience.
Vendors who launched on a "just use a great model" strategy tend to have none of these. Their benchmarks, if they ran at all, were one-off runs at launch. Their pipelines do not know how to run an eval; they know how to run a lint.
That is not a criticism of the people; it is a criticism of the architecture. Retrofit is hard. A vendor that did not build continuous eval from the beginning will have a very hard time adding it later without rewriting the testing layer.
Item-level diffs are where the value is
Aggregate scores tell you that something changed. Item-level diffs tell you what.
A concrete example from our internal logs: in Q3 2025, a prompt-library change intended to improve remediation-PR compile rate dropped exploit-hypothesis accuracy by 1.8 points. The aggregate was within the gate threshold (2 points), so the change would have merged under a gate-only process. The item-level diff showed the drop was concentrated in cross-ecosystem findings (a Python project consuming a JavaScript dependency), a production-critical slice. We blocked the merge, revised the prompt, and the revised version raised the compile rate without regressing exploit-hypothesis accuracy.
Without item-level diffs, that bug would have shipped and gradually gotten worse as more cross-ecosystem findings hit production. Aggregate numbers hide slice-level regressions, and slice-level regressions are where customer-visible pain lives.
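A minimal sketch of the slice-aware check that catches this class of bug, assuming each item carries a slice tag (such as a hypothetical "cross_ecosystem") and that pass/fail maps exist for the baseline and candidate runs:

```python
# Slice-aware regression check: a sketch with assumed data shapes.
from collections import defaultdict

def slice_regressions(baseline: dict[str, bool],
                      candidate: dict[str, bool],
                      slices: dict[str, str],
                      threshold_pts: float = 2.0) -> dict[str, float]:
    """Return slices whose pass rate dropped by more than threshold_pts points."""
    tallies = defaultdict(lambda: [0, 0, 0])  # slice -> [n, base_passes, cand_passes]
    for item_id, base_ok in baseline.items():
        t = tallies[slices.get(item_id, "other")]
        t[0] += 1
        t[1] += base_ok
        t[2] += candidate.get(item_id, False)
    flagged = {}
    for name, (n, b, c) in tallies.items():
        drop_pts = 100.0 * (b - c) / n
        if drop_pts > threshold_pts:
            flagged[name] = drop_pts
    return flagged
```

An aggregate drop of 1.8 points passes a 2-point gate, but if most of that drop sits in one slice, the per-slice drop can be far larger, which is exactly what this check surfaces.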
Shadow production traffic
The nightly shadow run is the bridge between offline eval and production reality. The mechanism:
- 24 hours of anonymized production requests are replayed against both the current production model and the candidate model.
- Outputs are compared on a set of secondary metrics: tool-call counts, retrieval-hit rates, latency, and downstream user-signal proxies (did the user accept the suggestion, ask a clarifying question, or reject it).
- Divergences above a threshold are triaged before the candidate can be promoted.
This catches the two regressions that pure offline eval misses: tool-use regressions (the model reasons fine in isolation but mis-uses the toolset) and distribution-shift regressions (the model scores fine on the held-out set but worse on the current production mix).
Without a shadow layer, a continuous-eval pipeline catches only the regressions that show up offline. The shadow layer is not optional; it is the second half of the pipeline.
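In code, the divergence check might look like the sketch below. The metric names and thresholds are assumptions for illustration, not Griffin AI's published values; the real secondary-metric set is whatever the production surface emits.

```python
# Nightly shadow comparison: a sketch with assumed metrics and limits.
DIVERGENCE_LIMITS = {
    "tool_calls_per_request": 0.15,  # max relative divergence before triage
    "retrieval_hit_rate":     0.05,
    "p95_latency_ms":         0.20,
    "suggestion_accept_rate": 0.05,
}

def shadow_divergences(prod: dict[str, float],
                       candidate: dict[str, float]) -> dict[str, float]:
    """Compare replayed metrics; return any metric over its divergence limit."""
    flagged = {}
    for metric, limit in DIVERGENCE_LIMITS.items():
        base = prod[metric]
        if base == 0:
            continue  # skip dead metrics rather than divide by zero
        rel = abs(candidate[metric] - base) / abs(base)
        if rel > limit:
            flagged[metric] = rel
    return flagged  # any flagged metric blocks promotion until triaged
```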
Release gating on top of continuous eval
Continuous eval produces scores. Release gating is what does something with them. Griffin AI's gating policy:
- A merge is blocked if any family's score regresses more than the published threshold (2-3 points depending on family) against the 30-day rolling median.
- A release (daily or on-demand) is blocked if the merged build fails the full harness against the full held-out set.
- An emergency rollback path exists for post-release regressions detected in shadow production traffic.
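A minimal sketch of the first rule, assuming per-family score histories over the trailing 30 days; the family names are illustrative, and the thresholds mirror the published 2-3 point range:

```python
# Merge gate vs. the 30-day rolling median: a sketch under assumed names.
from statistics import median

FAMILY_THRESHOLDS_PTS = {"exploit_hypothesis": 2.0, "remediation_pr": 3.0}

def gate_merge(candidate: dict[str, float],
               history_30d: dict[str, list[float]]) -> tuple[bool, list[str]]:
    """Block the merge if any family regresses past its threshold."""
    blocked = [
        family
        for family, threshold in FAMILY_THRESHOLDS_PTS.items()
        if median(history_30d[family]) - candidate[family] > threshold
    ]
    return (not blocked, blocked)  # (merge_ok, families that tripped the gate)
```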
The gating policy is public. The thresholds are public. The log of blocked merges is public, with PR authors anonymized. That last piece is important: a public record of blocked merges demonstrates that the gate is actually enforced, not theater.
The "it's all fine" failure mode
The quiet failure mode of continuous eval is that the pipeline runs, the scores are reported, nobody looks, and a regression goes live. The mitigation is not technical; it is organizational.
Griffin AI's internal discipline is that every eval run ends with an assigned triage owner, and the triage list is a daily standup topic for the model team. If an item regressed, someone is accountable for investigating, even if the aggregate score is fine. That forced attention is what keeps the pipeline from becoming a decorative dashboard.
The dashboard-that-nobody-reads problem is not hypothetical. We have seen it in other teams. The fix is to make the pipeline produce work, not just produce numbers, and to make the work visible at a cadence that matches the shipping cadence.
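One way to make a run produce work rather than numbers is to have it emit owned triage tasks directly, so a regression leaves the pipeline as somebody's job instead of a dashboard cell. A sketch, with a hypothetical round-robin owner rotation:

```python
# Triage assignment: a sketch; the rotation scheme is an assumption.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class TriageTask:
    item_id: str
    run_id: str
    owner: str  # accountable even when the aggregate score is fine

def assign_triage(regressed_items: list[str], run_id: str,
                  rotation: list[str]) -> list[TriageTask]:
    owners = cycle(rotation)  # simple round-robin over the model team
    return [TriageTask(item, run_id, next(owners)) for item in regressed_items]
```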
What the cost looks like
Continuous eval is expensive on three axes:
- Compute: running the full harness on every PR costs real money. The smoke-subset-then-full strategy cuts the bill down, but the PR-level full runs are still a meaningful line item.
- Engineering: the pipeline itself is infrastructure that needs to be maintained like any other piece of infrastructure.
- Attention: triage is a constant tax on the model team's bandwidth.
We spend it because the alternative is shipping regressions we do not detect. A vendor that does not spend it is not saving money; they are externalizing the cost to their customers in the form of undetected regressions.
What buyers should ask
Four questions:
- "What fraction of your builds run the full benchmark harness before merge?"
- "How many merges or releases in the last quarter were blocked by an eval gate?"
- "Do you run a shadow pipeline against production traffic before promoting a candidate?"
- "What is the triage SLA on a flagged regression?"
Griffin AI answers all four. Competitors typically cannot, which tells you the pipeline is either absent or not load-bearing.
The bottom line
Continuous eval is what turns a benchmark from a marketing artifact into a product discipline. Without it, a number published at launch is a number published at launch; it does not tell you anything about the product you are actually using today. Griffin AI runs the harness on every change, gates releases on the result, and publishes the gate log. That is the minimum bar for trusting an AI security tool in 2026. Ask your current vendor where their pipeline runs. The answer, whatever it is, is diagnostic.