A regression test suite for AI workflow quality, gated into release.
An eval harness is a test suite for AI behaviour — a curated set of inputs, expected outputs, and scoring rubrics that runs on every model or prompt change and gates whether the change can ship. It is to AI workflows what a regression test suite is to application code: the mechanism by which you know a change didn't break something that used to work.
Without an eval harness, "upgrade the model" is a vibes-based decision. With one, it's a diff: these 14 cases regressed, these 3 improved, ship or don't.
Four moving parts:
LLMs are non-deterministic and the providers change the model behind the same API without notice. A prompt that worked last month on GPT-4-turbo may quietly regress when the backend swap happens. Without evals, you find out from a customer; with evals, you find out in CI.
Safeguard runs five families of evals against every change to its AI surfaces: resistance (prompt injection), citation (does it cite sources correctly), correctness (does the remediation actually fix the CVE), refusal (does it decline when it should), and scope (does it stay inside its tool bundle). Each family has its own dataset, its own rubric, its own gate.
Moving from Sonnet to Opus — or from Claude to GPT — is a diff on the eval dashboard, not a cross-your-fingers deploy.
The typical "silent provider update" regression is caught in CI within hours of the underlying change, not after three weeks of degraded product experience.
A prompt tweak goes through the same gate as a code change. You can ship often because you can prove you didn't break anything.
Every prompt-injection discovery, every hallucination report, every scope violation gets added to the dataset. The bug can't come back without CI noticing.
"Our remediation accuracy is 94% on a 1,200-case golden dataset, trending up across the last 12 weeks" is a defensible statement. "The AI is pretty good" is not.
Every change to Griffin AI — prompt, model, tool bundle — passes the five-family eval gate before it reaches AI remediation production. Drift runs nightly on the same datasets so silent regressions surface within hours.
See how Safeguard runs five families of evals on every Griffin change — and catches the regressions that "manual spot-check" misses.