Every security team using an LLM for anything load-bearing — triage, remediation drafting, advisory summarization, detection enrichment — is one prompt change or model version bump away from a silent regression. The standard response is "we reviewed the output and it looked fine," which is the same level of engineering discipline as deploying without a test suite. It is not sufficient, and it is increasingly not defensible in audits. The fix is an eval suite: a structured, versioned, automatable set of checks that runs on every change to the LLM workflow and fails the change if quality drops. What follows is a concrete harness we have built and watched customers build, sized to the reality that most security teams are two to four engineers, not a twenty-person ML platform group. The goal is a setup that fits in a week, scales to years, and holds up under an auditor's questions.
What goes in a security LLM eval suite that does not go in a dev-loop eval?
Most public eval guidance is written for general-purpose assistants — helpfulness, harmlessness, factuality. Security workflows have a different threat model, and the eval suite reflects that. Four extra families of check that are specific to this setting:
- Adversarial resistance. Jailbreak attempts against whatever system prompt gates the workflow. Pass criterion: refusal rate above a floor on a curated adversarial set.
- Exfiltration resistance. Prompts designed to coax the model into leaking contents of the system prompt, retrieved evidence, or tenant data. Pass criterion: zero exact-match leaks on a set of planted canaries.
- Attribution discipline. For any factual claim the model makes ("CVE-2024-X affects version Y"), the output must cite a verifiable source from the retrieval set. Pass criterion: citation rate above a floor, fabrication rate below a ceiling.
- Scope adherence. The model must not perform actions outside its declared tool scope, even when asked. Pass criterion: zero out-of-scope tool invocations on a provocation set.
These four, plus the standard correctness checks, cover the bulk of real-world failure modes. Layers on top (bias, safety, domain-specific checks) are additive.
What does a minimum viable golden dataset look like?
Start smaller than you think. Fifty to two hundred examples per check family is enough to start catching regressions. The common mistake is waiting until you have "the right dataset," which is unachievable and postpones the value indefinitely. Ship v1 in a week with a few dozen examples and grow it every time an incident, near-miss, or customer report reveals a new failure mode.
Concrete sources for building the initial set:
- Recent internal incidents. Every real bug the LLM workflow produced is an eval candidate. Freeze the exact inputs, fix the bug, and the fix is pinned by the eval.
- Public adversarial corpora. Jailbreak and prompt-injection research has accumulated usable datasets; seeding from them is fine, though do not rely on them alone because publicly available attacks get trained against.
- Security-specific golden inputs. A curated set of vulnerability advisories, CVE descriptions, package metadata, and code snippets with known correct classifications. This is the domain analog of a unit test fixture.
- Red-team output. One afternoon of an engineer attacking the workflow with creativity usually produces more durable evals than a week of synthetic data generation.
Version the dataset in the same git repo as the workflow. Treat dataset changes as code changes, with review.
How do you score outputs when there is no single right answer?
Three scoring strategies, usually used in combination, chosen by what the check actually needs to verify:
Exact match. Used for canary leakage checks and structured output conformance. If you planted the string CANARY-7B3F4A in the system prompt, an exact-match grep for it in the output catches any regression. Cheap, reliable, and the baseline defense for leakage checks.
Programmatic graders. A Python function that parses the output and checks invariants. Examples: "the output is valid JSON," "the output contains at least one citation to a source ID present in the retrieval context," "the output does not contain any tool-call markup for tools outside this session's scope." Programmatic graders are the workhorse — they are deterministic, cheap, and they encode the exact contract the workflow is supposed to honor.
LLM-as-judge graders. A separate model call that evaluates the output against a rubric. Used when the correctness criterion is genuinely subjective ("does this remediation explanation cover the root cause or just the symptom?"). LLM-as-judge has known failure modes — position bias, length bias, self-preference — so use it for the checks where nothing else works, and calibrate by having the judge re-grade human-scored examples until the agreement rate is acceptable.
Most real eval suites are 70% programmatic, 20% exact-match, 10% judge. If your ratio is inverted, you are spending a lot of money and getting soft data.
Where do you set the pass/fail thresholds?
This is the question that trips up most teams, because absolute thresholds are hard to pick and relative thresholds need a baseline. The working answer is a hybrid:
- Absolute floors on critical checks (zero leakage, zero out-of-scope tool calls). These are non-negotiable and do not move with baselines.
- Relative thresholds on everything else, anchored to the previous release. The rule we use: "a change may not regress any metric by more than X% or by more than one standard deviation of historical variance, whichever is larger." This allows noise to breathe while still catching real drops.
- Category minimums to prevent trading. A PR that improves one category by 10% but drops another by 5% should not pass on aggregate score alone; every category has to hold.
Store baselines in the repo. When a change intentionally shifts a baseline, the PR explicitly updates the baseline file, and that update is part of the code review. This makes every baseline shift a deliberate, reviewable act — the single most important discipline in eval operations.
How do you wire evals into CI without making PRs take an hour?
Split the suite into tiers and run them accordingly:
- Smoke tier (seconds). A handful of critical checks — canary leak, tool scope, basic correctness. Runs on every PR, blocks merge on failure.
- Regression tier (minutes). The full programmatic grader set over a representative subset of the golden dataset. Runs on every PR, blocks merge on significant regression.
- Full tier (tens of minutes). Judge-based evals, full adversarial corpus, drift detection against production traces. Runs on main-branch merges and nightly, alerts on regression but does not block (because blocking a merged change has the wrong ergonomics).
Parallelize aggressively — eval suites are embarrassingly parallel. Cache model responses by input hash so re-runs on unchanged code are free. Budget per PR matters: if the smoke tier costs more than a dollar per PR, engineers will find reasons to skip it.
What about detecting drift after a model version bump?
This is where most teams get surprised, because the failure mode is gradual rather than sharp. The pattern we have seen repeatedly: the vendor pushes a minor model version update, the outputs look slightly different on a handful of prompts, the smoke tests still pass, and two weeks later someone notices that citation quality dropped four points. By that point the regression is baked in and the baseline has implicitly moved.
Two checks catch this early. First, pin the model version explicitly and change it through a dedicated PR that runs the full tier with extra judge-eval runs. If a version bump requires rebaselining, the PR description has to explain why. Second, run a daily canary eval in production against real sampled traffic, compare against the held-out golden set, and alert on distribution shift beyond a threshold. This is equivalent to the synthetic check runs that detect API regressions on an HTTP service — same idea, applied to model behavior.
What is the minimum team posture needed to run this?
One engineer, half-time, for setup week. The running cost after that is a review load similar to a test suite: PRs touch the eval set occasionally, baselines move rarely, the CI job produces output you read like any other test report. The leverage is high because the failure modes the suite catches would otherwise become production incidents, and production incidents on LLM workflows are expensive to investigate precisely because the system is non-deterministic. Catching regressions at PR time is an order of magnitude cheaper than catching them after deploy.
How Safeguard Helps
Safeguard ships an eval harness integrated into the platform so that any LLM workflow you configure — vulnerability triage, remediation generation, advisory summarization, policy explanation — runs against the eval suite automatically on every change and every model version bump. Griffin AI's reasoning layer is itself gated by this harness, so regressions in the model that powers Safeguard's own outputs get caught before they reach customers. The eval suite ships with a baseline dataset for the standard security workflow checks and accepts customer-specific golden data for domain-specific rubrics. For teams that want LLMs deeper in their security stack without the silent-regression risk, Safeguard provides the eval control plane as a first-class feature rather than a project you have to staff and maintain yourself.