AI Security

Fine-Tune Drift Measured On Eval Sets

Fine-tuning to improve one task frequently regresses others. Without eval harnesses, the regressions ship. The measurable drift is larger than vendors admit.

Shadab Khan
Security Engineer

Fine-tuning a model for a specific task often improves that task and regresses others. Published research on catastrophic forgetting and over-refusal documents the pattern: a model fine-tuned on coding tasks can lose reasoning ability on non-coding tasks; a model fine-tuned on refusal can lose utility on legitimate requests. In production AI-for-security deployments, fine-tuning that improves vulnerability detection can regress adversarial resistance, advisory summarisation, or fix-PR quality. Without eval harnesses to catch them, the regressions ship silently.

What the drift looks like

Three typical patterns:

  • Task-target improvement, adjacent regression. Target task up 10 points; adjacent task down 5 points. Net positive on the single metric; potentially net negative across the workload once each task is weighted by how much of the traffic it represents.
  • Adversarial-resistance erosion. Fine-tuning for compliance or helpfulness often reduces jailbreak refusal.
  • Long-tail coverage loss. Fine-tuning on common cases degrades performance on rare cases.

Each is measurable, and each stays invisible without an eval harness.
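The first pattern can be made concrete with a little arithmetic. The sketch below is illustrative only: the task names, scores, and traffic shares are hypothetical numbers chosen to mirror the "up 10, down 5" example, not measurements from any real model. It shows how a headline gain on the target task can coexist with a negative workload-weighted change.

```python
from typing import Dict


def score_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-task change in eval score (candidate minus baseline)."""
    return {task: candidate[task] - baseline[task] for task in baseline}


def weighted_net_delta(deltas: Dict[str, float], traffic_share: Dict[str, float]) -> float:
    """Workload-level change: each task's delta weighted by its share of traffic."""
    return sum(deltas[task] * traffic_share[task] for task in deltas)


# Hypothetical scores before and after a fine-tune (0-100 scale).
baseline = {"vuln_detection": 70.0, "advisory_summarisation": 80.0, "fix_pr_quality": 75.0}
candidate = {"vuln_detection": 80.0, "advisory_summarisation": 75.0, "fix_pr_quality": 70.0}

# Hypothetical share of production requests per task (sums to 1.0).
traffic_share = {"vuln_detection": 0.2, "advisory_summarisation": 0.5, "fix_pr_quality": 0.3}

deltas = score_deltas(baseline, candidate)
net = weighted_net_delta(deltas, traffic_share)

for task, delta in deltas.items():
    print(f"{task}: {delta:+.1f}")
print(f"net workload delta: {net:+.1f}")  # -2.0 with the numbers above
```

With these assumed shares, the target task gains 10 points while the workload as a whole loses 2. A different traffic mix would give a different sign, which is why the per-task deltas need to be reported alongside the headline number.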

Why vendors under-report

Three incentives:

  • Headline metrics matter. Vendors publish the improved metric; the regressions are footnotes.
  • Eval investment is expensive. Running comprehensive evals on every fine-tune takes infrastructure.
  • Customer pressure is asymmetric. Customers ask about the task they care about; they don't ask about adjacent regressions.

The result: public numbers overstate net improvement.

How comprehensive evals catch it

Griffin AI runs five eval families on every release:

  • Exploit hypothesis
  • Remediation PR correctness
  • Advisory summarisation
  • Cross-finding correlation
  • Adversarial resistance

A release that improves one family and regresses another is caught: the regression gate blocks it until the regression is addressed.
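A regression gate of this kind can be sketched in a few lines. This is a minimal illustration, not Griffin AI's implementation: the function name `release_gate`, the `GateResult` type, the 0-100 score scale, and the `tolerance` threshold are all assumptions introduced here. Only the five family names come from the list above.

```python
from dataclasses import dataclass
from typing import Dict

# The five eval families named in the article, as hypothetical identifiers.
EVAL_FAMILIES = [
    "exploit_hypothesis",
    "remediation_pr_correctness",
    "advisory_summarisation",
    "cross_finding_correlation",
    "adversarial_resistance",
]


@dataclass
class GateResult:
    passed: bool
    regressions: Dict[str, float]  # family -> score delta (negative values)


def release_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.5,  # assumed noise allowance, in score points
) -> GateResult:
    """Block a release if any eval family drops by more than `tolerance`."""
    missing = [f for f in EVAL_FAMILIES if f not in baseline or f not in candidate]
    if missing:
        # A family that was not evaluated cannot be assumed to have passed.
        raise ValueError(f"missing eval results for: {missing}")

    regressions = {}
    for family in EVAL_FAMILIES:
        delta = candidate[family] - baseline[family]
        if delta < -tolerance:
            regressions[family] = delta
    return GateResult(passed=not regressions, regressions=regressions)


# Hypothetical scores: one family improves, another regresses.
baseline_scores = {family: 80.0 for family in EVAL_FAMILIES}
candidate_scores = dict(baseline_scores, exploit_hypothesis=85.0, adversarial_resistance=76.0)

result = release_gate(baseline_scores, candidate_scores)
print(result.passed)       # False
print(result.regressions)  # {'adversarial_resistance': -4.0}
```

Two design choices in the sketch are worth noting. Missing results raise an error rather than pass silently, so a skipped eval family cannot slip through the gate. The gate also checks every family independently, so a gain in one never offsets a loss in another.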

What customers should ask

Three questions of any AI-for-security vendor:

  1. What eval families do you run on each release?
  2. Can you show me the release notes from the last year, including regressions?
  3. What is the gate for releasing despite a regression?

Vendors who answer cleanly are running a mature program. Vendors who can't are shipping drift.

How Safeguard Helps

Safeguard's Griffin AI release gate requires passing all five eval families. Regressions block release. Release notes document eval deltas. For customers whose AI-for-security vendors have surprised them with unannounced behaviour changes, this comprehensive eval discipline is the architectural difference.
