LLM-As-Judge Pitfalls In Security Evals

Using an LLM to score another LLM's output is expedient and dangerous. The judge has its own biases, and several of them distort security evaluations in particular.

Nayan Dey
Senior Security Engineer

Using a large language model to score another model's output is an attractive approach to evaluation. It scales; it's cheaper than human review; it feels rigorous. For general tasks, it can work reasonably well. For security evaluations specifically, LLM-as-judge has pitfalls that produce systematically biased results. The pitfalls are well documented and the mitigations are specific, so vendors who skip the mitigations publish flawed numbers.

Known LLM-as-judge biases

Four biases are documented:

  • Position bias. Judges favour a response because of where it appears in a comparison, typically the first or last slot, rather than because of its content.
  • Length bias. Longer responses are scored higher even when they're worse.
  • Self-preference. Models rate their own outputs higher than other models'.
  • Confidence-intonation bias. Confident-sounding output scores higher than correct-but-hedged output.

Each one skews security eval results in a specific direction.
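One way to check whether a judge shows length bias on your own data is to correlate response length with judge score among responses already known to be wrong. The sketch below is illustrative only: the record fields (`response`, `judge_score`, `is_correct`) are hypothetical names, and it assumes a ground-truth correctness label from an independent source.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def length_bias_signal(records):
    """Correlate word count with judge score among incorrect responses only.

    `records` is a list of dicts with the hypothetical keys
    'response' (str), 'judge_score' (number), 'is_correct' (bool).
    Restricting to incorrect responses removes quality as an explanation:
    if longer wrong answers still score higher, length is being rewarded.
    """
    wrong = [r for r in records if not r["is_correct"]]
    lengths = [len(r["response"].split()) for r in wrong]
    scores = [r["judge_score"] for r in wrong]
    return pearson(lengths, scores)
```

A clearly positive value among incorrect responses suggests the judge is rewarding length rather than correctness. Small samples are noisy, so treat the number as a prompt for closer inspection rather than a verdict.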

Why security evals are particularly vulnerable

Three reasons:

  • Length is read as thoroughness. Security findings are expected to be detailed, so length bias compounds.
  • Confidence is read as authority. Authoritative-sounding analysis scores higher, so confident hallucinations benefit.
  • Self-preference matters. A vendor's model scoring the vendor's own outputs produces directly biased results.

Mitigations that work

Four practices:

  • Randomise position. Always randomise comparison order; average across positions.
  • Control for length. Normalise for output length in scoring.
  • Cross-judge. Use a different model family to judge than the model under test.
  • Spot-check with humans. Sample a fraction of judge outputs and verify against expert reviewers.

Each adds cost. Combined, they produce evaluation numbers that survive external review.
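The position and cross-judge mitigations above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `judge` is a hypothetical callable that takes a prompt and two responses and returns a preference in [0, 1] for the response shown first, and the "family" strings are whatever labels your pipeline uses to identify model families.

```python
def judge_pair_debiased(judge, prompt, resp_a, resp_b):
    """Return the judge's preference for resp_a, averaged over both orders.

    `judge(prompt, first, second)` is a hypothetical callable returning a
    float in [0, 1]: how strongly it prefers `first` over `second`.
    Scoring both presentation orders and averaging cancels any constant
    advantage the judge gives to a slot.
    """
    # Order 1: A is shown first, so the score is already "preference for A".
    pref_a_when_first = judge(prompt, resp_a, resp_b)
    # Order 2: B is shown first, so preference for A is the complement.
    pref_a_when_second = 1.0 - judge(prompt, resp_b, resp_a)
    return (pref_a_when_first + pref_a_when_second) / 2.0


def require_cross_family(judge_family, tested_family):
    """Refuse to run an eval where the judge and the tested model share a family."""
    if judge_family == tested_family:
        raise ValueError(
            f"judge family {judge_family!r} matches the model under test; "
            "self-preference would bias the scores"
        )
```

With a judge that always prefers whichever response comes first, the averaged preference lands at 0.5, which correctly reports that the judge supplied no content-based signal. Evaluating both orders doubles the number of judge calls; where that is too expensive, picking one order at random per comparison is a cheaper approximation that removes the bias only in aggregate.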

How Griffin AI's evals handle it

Three practices:

  • Position and length biases are controlled in the scoring pipeline.
  • Cross-judging uses different model families for scoring than for generation.
  • Human spot-checks are sampled on every eval run; disagreement with the judge triggers investigation.
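A human spot-check loop of this kind might look like the sketch below. It is not Griffin AI's implementation: the field names (`id`, `judge_verdict`), the sampling fraction, and the 10% disagreement threshold are all assumptions chosen for illustration.

```python
import random


def sample_for_review(judgments, fraction, seed=0):
    """Pick a reproducible random subset of judge outputs for expert review.

    `judgments` is a list of dicts with the hypothetical keys
    'id' and 'judge_verdict'. At least one item is always sampled.
    """
    if not judgments:
        raise ValueError("no judgments to sample from")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(judgments) * fraction))
    return random.Random(seed).sample(judgments, k)


def disagreement_rate(sampled, human_labels):
    """Share of reviewed judgments where the human verdict differs from the judge's.

    `human_labels` maps judgment id -> human verdict. Sampled items that
    have not been reviewed yet are skipped.
    """
    reviewed = [j for j in sampled if j["id"] in human_labels]
    if not reviewed:
        raise ValueError("no sampled judgments have human labels yet")
    disagreements = sum(
        1 for j in reviewed if j["judge_verdict"] != human_labels[j["id"]]
    )
    return disagreements / len(reviewed)


def needs_investigation(rate, threshold=0.10):
    """Flag the run when disagreement exceeds an assumed tolerance."""
    return rate > threshold
```

A fixed seed keeps the sample reproducible for audit. The right fraction and threshold depend on how costly expert review is and how much judge error the published numbers can tolerate.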

What customers should ask

Three questions:

  1. What LLM-as-judge biases are controlled in your eval methodology?
  2. Do you cross-judge across model families?
  3. What percentage of judgments are human-verified?

How Safeguard helps

Safeguard's Griffin AI eval harness applies bias controls, cross-judging, and human sampling. The published numbers reflect this discipline. For customers whose evaluation of AI-for-security vendors depends on benchmark integrity, this is the depth of methodology to look for.
