AI Security

Eval Harness

A regression test suite for AI workflow quality, wired into the release gate.

What is an eval harness?

An eval harness is a test suite for AI behaviour — a curated set of inputs, expected outputs, and scoring rubrics that runs on every model or prompt change and gates whether the change can ship. It is to AI workflows what a regression test suite is to application code: the mechanism by which you know a change didn't break something that used to work.

Without an eval harness, "upgrade the model" is a vibes-based decision. With one, it's a diff: these 14 cases regressed, these 3 improved, ship or don't.
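
Concretely, a single case pairs an input with the behaviour you expect and a rubric to score it. A minimal sketch in Python; the field names and the example case are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One golden example: an input, the expected behaviour, and a rubric to score it."""
    case_id: str
    prompt: str                    # input sent to the AI workflow under test
    expected: str                  # behaviour the output is judged against
    score: Callable[[str], bool]   # rubric: did the output meet the bar?

# A case pinned after a past incident, so the fix can't silently regress.
case = EvalCase(
    case_id="remediation-log4shell",
    prompt="How do I remediate CVE-2021-44228 in a Java 11 service?",
    expected="recommends upgrading log4j-core to 2.17.1 or later",
    score=lambda output: "2.17.1" in output,
)
```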

How it works

Four moving parts (a minimal code sketch of the rubric and gate follows the list):

  1. Golden datasets. Hand-curated examples — representative queries, edge cases, known-adversarial inputs, past regressions — each with an expected behaviour. Datasets grow over time: every real production incident becomes a new case that future versions must pass.
  2. Scoring rubrics. For each case, a deterministic or model-graded check that answers "did this output meet the bar?" Deterministic where possible (exact string, JSON schema, tool-call pattern); model-graded with a separate judge model where quality is semantic.
  3. Regression gates. Eval scores are wired into CI. A drop past threshold on any dataset blocks the change — whether it's a prompt tweak, a model version bump, or a tool-bundle expansion. The gate is the difference between a harness you have and a harness you use.
  4. Drift detection. Evals run on a schedule even without code changes, because models and APIs drift under you. A silent regression in production becomes an alert within hours, not a quarterly surprise surfaced by a user complaint.
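
A minimal sketch of parts 2 and 3, assuming cases shaped like the one above, a judge callable for semantic checks, and per-dataset pass rates computed elsewhere; all names here are illustrative:

```python
import json
from typing import Callable

def deterministic_score(output: str, expected_tool: str) -> bool:
    """Rubric for structured outputs: the response is valid JSON and names the expected tool."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return parsed.get("tool") == expected_tool

def model_graded_score(output: str, expected: str, judge: Callable[[str], str]) -> bool:
    """Rubric for semantic quality: a separate judge model answers PASS or FAIL against the bar."""
    verdict = judge(
        f"Expected behaviour: {expected}\n"
        f"Actual output: {output}\n"
        "Does the output meet the bar? Answer PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def regression_gate(current: dict[str, float], baseline: dict[str, float],
                    max_drop: float = 0.02) -> bool:
    """The gate: block the change if any dataset's pass rate drops past the threshold."""
    blocked = False
    for dataset, score in current.items():
        drop = baseline.get(dataset, 0.0) - score
        if drop > max_drop:
            print(f"GATE FAILED: {dataset} regressed {baseline[dataset]:.1%} -> {score:.1%}")
            blocked = True
    return not blocked  # CI exits non-zero when this returns False
```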

Why it matters

LLMs are non-deterministic, and providers change the model behind the same API without notice. A prompt that worked last month on GPT-4-turbo can quietly regress when the backend is swapped underneath it. Without evals, you find out from a customer; with evals, you find out in CI.
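
Drift detection can be as simple as a scheduled job that reruns the same datasets and compares against a stored baseline. A sketch, assuming a run_eval helper that returns per-dataset pass rates:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

BASELINE = Path("eval_baseline.json")   # per-dataset pass rates recorded at the last approved release
DRIFT_THRESHOLD = 0.02                  # alert on a two-point drop with no code change

def check_drift(run_eval) -> list[str]:
    """Nightly job: rerun the same datasets with no code change and compare to the baseline."""
    baseline = json.loads(BASELINE.read_text())
    current = run_eval()                # e.g. {"citation": 0.96, "refusal": 0.99, ...}
    alerts = []
    for dataset, expected in baseline.items():
        score = current.get(dataset, 0.0)
        if expected - score > DRIFT_THRESHOLD:
            stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
            alerts.append(f"{stamp} drift on {dataset}: {expected:.1%} -> {score:.1%}")
    return alerts   # route to alerting rather than failing a build: nothing shipped, something moved
```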

Safeguard runs five families of evals against every change to its AI surfaces: resistance (does it withstand prompt injection), citation (does it cite sources correctly), correctness (does the remediation actually fix the CVE), refusal (does it decline when it should), and scope (does it stay inside its tool bundle). Each family has its own dataset, its own rubric, its own gate.
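
One way such a per-family structure might be declared is sketched below; the dataset paths, rubric choices, and thresholds are placeholders, not Safeguard's actual configuration:

```python
# Hypothetical layout: family names come from the text above; paths and thresholds are placeholders.
EVAL_FAMILIES = {
    "resistance":  {"dataset": "datasets/prompt_injection.jsonl", "rubric": "deterministic", "min_pass": 0.99},
    "citation":    {"dataset": "datasets/citation.jsonl",         "rubric": "model_graded",  "min_pass": 0.95},
    "correctness": {"dataset": "datasets/remediation.jsonl",      "rubric": "model_graded",  "min_pass": 0.95},
    "refusal":     {"dataset": "datasets/refusal.jsonl",          "rubric": "deterministic", "min_pass": 0.99},
    "scope":       {"dataset": "datasets/tool_scope.jsonl",       "rubric": "deterministic", "min_pass": 0.99},
}
```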

What value it adds

  • Model swaps become engineering, not gambling

    Moving from Sonnet to Opus — or from Claude to GPT — is a diff on the eval dashboard, not a cross-your-fingers deploy.

  • Regressions caught before users notice

    The typical "silent provider update" regression is caught by the scheduled eval run within hours of the underlying change, not after three weeks of degraded product experience.

  • Prompt changes ship with confidence

    A prompt tweak goes through the same gate as a code change. You can ship often because you can prove you didn't break anything.

  • Known-bad inputs stay fixed

    Every prompt-injection discovery, every hallucination report, every scope violation gets added to the dataset. The bug can't come back without CI noticing.

  • Trust becomes measurable

    "Our remediation accuracy is 94% on a 1,200-case golden dataset, trending up across the last 12 weeks" is a defensible statement. "The AI is pretty good" is not.

How Safeguard uses it

Every change to Griffin AI — prompt, model, tool bundle — passes the five-family eval gate before it reaches production. Drift checks run nightly on the same datasets, so silent regressions surface within hours.

Gate your AI like you gate your code.

See how Safeguard runs five families of evals on every Griffin change — and catches the regressions that "manual spot-check" misses.