AI Security

LLM-as-Judge Pitfalls in Security Evals

Using an LLM to score another LLM's output is expedient and dangerous. The judge has its own biases — ones that affect security evaluations specifically.

Nayan Dey
Senior Security Engineer
2 min read

Using a large language model to score another model's output is an attractive approach to evaluation. It scales; it's cheaper than human review; it feels rigorous. For general tasks, it can work reasonably well. For security evaluations specifically, LLM-as-judge has pitfalls that produce systematically biased results. The pitfalls are well-documented; the mitigations are specific; vendors who skip the mitigations publish flawed numbers.

Known LLM-as-judge biases

Four are well documented:

  • Position bias. Models consistently prefer the first or last response in a comparison.
  • Length bias. Longer responses are scored higher even when they're worse.
  • Self-preference. Models rate their own outputs higher than other models'.
  • Confidence bias. Confident-sounding output scores higher than correct-but-hedged output.

Each of these skews security eval results in a specific direction; one way to measure the first is sketched below.
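A minimal sketch of measuring position bias, assuming a hypothetical `judge(prompt, first, second)` callable (standing in for whatever judge model you use) that returns "first" or "second": run every pair in both orders and count how often the winning candidate flips when only its position changes.

```python
def position_flip_rate(pairs, judge):
    """Estimate position bias for a pairwise judge.

    pairs: list of (prompt, response_a, response_b) tuples.
    judge: hypothetical callable judge(prompt, first, second) -> "first" | "second".

    Runs each comparison in both orders and returns the fraction of pairs
    where the winning candidate changes when only its position changes.
    An unbiased judge flips rarely; a position-biased judge flips often.
    """
    if not pairs:
        return 0.0
    flips = 0
    for prompt, a, b in pairs:
        forward = judge(prompt, a, b)    # a shown in the first position
        backward = judge(prompt, b, a)   # b shown in the first position
        winner_forward = "a" if forward == "first" else "b"
        winner_backward = "b" if backward == "first" else "a"
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


# Example: a judge that always prefers whatever it sees first flips on every pair.
always_first = lambda prompt, first, second: "first"
print(position_flip_rate([("p", "response A", "response B")], always_first))  # 1.0
```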

Why security evals are particularly vulnerable

Three reasons:

  • Verbosity reads as thoroughness. Security findings are expected to be detailed, so length bias compounds.
  • Confidence reads as authority. Authoritative-sounding analysis scores higher, so confident hallucinations benefit.
  • Self-preference cuts directly. When a vendor's model judges that same vendor's outputs, the results are biased by construction.

Mitigations that work

Four practices:

  • Randomise position. Always randomise comparison order; average across positions.
  • Control for length. Normalise for output length in scoring.
  • Cross-judge. Use a different model family to judge than the model under test.
  • Spot-check with humans. Sample a fraction of judge outputs and verify against expert reviewers.

Each adds cost. Combined, they produce evaluation numbers that survive external review; a minimal sketch of the first two controls follows.
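As a rough illustration of position randomisation and length control, here is a sketch of a pairwise scoring step. The `call_judge` stub and the per-character penalty weight are assumptions for the example, not a prescribed implementation; in practice the penalty should be tuned against human-labelled data.

```python
import random
import statistics


def call_judge(prompt, first, second):
    """Stub for the actual judge call (assumed helper, not a real API).

    Expected to return (score_first, score_second) on a 0-10 scale for the
    responses in the order they were shown.
    """
    return 7.0, 6.0


def score_pair(prompt, response_a, response_b, trials=4, length_penalty=0.002):
    """Score two candidate responses with position and length controls.

    Position: each trial randomises which response is shown first, and the
    per-candidate scores are averaged across trials.
    Length: a small per-character penalty offsets the judge's preference
    for longer answers (the weight here is illustrative).
    """
    scores_a, scores_b = [], []
    for _ in range(trials):
        if random.random() < 0.5:
            s_a, s_b = call_judge(prompt, response_a, response_b)
        else:
            s_b, s_a = call_judge(prompt, response_b, response_a)
        scores_a.append(s_a)
        scores_b.append(s_b)

    mean_a = statistics.mean(scores_a) - length_penalty * len(response_a)
    mean_b = statistics.mean(scores_b) - length_penalty * len(response_b)
    return mean_a, mean_b
```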

How Griffin AI's evals handle it

Three practices:

  • Position and length biases are controlled in the scoring pipeline.
  • Cross-judging uses different model families for scoring than for generation.
  • Human spot-checks are sampled on every eval run; disagreement with the judge triggers investigation (a sampling sketch follows).
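For illustration only, here is what per-run sampling and disagreement detection can look like. The sampling rate and disagreement threshold are invented for the example and are not Griffin AI's actual parameters.

```python
import random


def select_spot_checks(judgments, rate=0.05, seed=None):
    """Randomly sample a fraction of judge outputs for human review.

    judgments: list of dicts, each with at least a "judge_score" key.
    rate and seed are illustrative values, not production settings.
    """
    rng = random.Random(seed)
    return [j for j in judgments if rng.random() < rate]


def disagreements(reviewed, threshold=2.0):
    """Flag reviewed items where the human score diverges from the judge.

    reviewed: list of dicts with "judge_score" and "human_score" keys.
    Any gap at or above the threshold would trigger an investigation.
    """
    return [
        j for j in reviewed
        if abs(j["human_score"] - j["judge_score"]) >= threshold
    ]
```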

What customers should ask

Three questions:

  1. What LLM-as-judge biases are controlled in your eval methodology?
  2. Do you cross-judge across model families?
  3. What percentage of judgments are human-verified?

How Safeguard Helps

Safeguard's Griffin AI eval harness applies bias controls, cross-judging, and human sampling, and the published numbers reflect that discipline. For customers evaluating AI-for-security vendors on the strength of their benchmarks, this is the methodology depth to look for.
