Using a large language model to score another model's output is an attractive approach to evaluation. It scales; it's cheaper than human review; it feels rigorous. For general tasks, it can work reasonably well. For security evaluations specifically, LLM-as-judge has pitfalls that produce systematically biased results. The pitfalls are well-documented; the mitigations are specific; vendors who skip the mitigations publish flawed numbers.
Known LLM-as-judge biases
Four are well documented:
- Position bias. Models consistently prefer the first or last response in a comparison.
- Length bias. Longer responses are scored higher even when they're worse.
- Self-preference. Models rate their own outputs higher than other models'.
- Confidence bias. Confident-sounding output scores higher than output that is correct but hedged.
Each of these skews security eval results in a predictable direction.
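Position bias in particular is cheap to measure: present each pair in both orders and count how often the verdict follows the slot rather than the answer. Here is a minimal sketch in Python; the `toy_judge` function and its 70% slot preference are illustrative assumptions standing in for a real judge-model API call.

```python
import random

def toy_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy stand-in for a real judge-model API call. This one prefers
    whatever sits in slot A 70% of the time, regardless of content,
    to simulate position bias."""
    return "A" if random.random() < 0.7 else "B"

def position_flip_rate(judge, pairs: list[tuple[str, str, str]]) -> float:
    """Judge each (prompt, x, y) pair in both orders. A content-driven
    judge picks the same underlying answer both times ('A' then 'B',
    or 'B' then 'A'); the same letter twice means the slot, not the
    answer, decided the verdict."""
    flips = 0
    for prompt, x, y in pairs:
        first = judge(prompt, x, y)   # x in slot A
        second = judge(prompt, y, x)  # x in slot B
        if first == second:
            flips += 1
    return flips / len(pairs)

if __name__ == "__main__":
    pairs = [("Compare these findings.", "finding one", "finding two")] * 200
    # Expect roughly 0.58 for the toy judge; 0 for a perfectly
    # content-driven judge, 0.5 for a coin flip.
    print(f"flip rate: {position_flip_rate(toy_judge, pairs):.2f}")
```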
Why security evals are particularly vulnerable
Three reasons:
- Verbosity reads as thoroughness. Security findings are expected to be detailed, so length bias compounds: a padded report looks diligent.
- Confidence reads as authority. Authoritative-sounding analysis scores higher, which rewards confident hallucinations over careful hedging.
- Self-preference is structural. When a vendor's model judges that same vendor's outputs, the results are biased by construction.
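A quick diagnostic for the first point is to correlate judge scores with output length. A minimal sketch, assuming word count as the length measure:

```python
from statistics import mean, pstdev

def length_score_correlation(outputs: list[str], scores: list[float]) -> float:
    """Pearson correlation between word count and judge score.
    Assumes both lengths and scores actually vary (nonzero
    standard deviation)."""
    lengths = [float(len(o.split())) for o in outputs]
    mx, my = mean(lengths), mean(scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(lengths, scores))
    return cov / (pstdev(lengths) * pstdev(scores))
```

To isolate the bias, run this on a set that expert reviewers have already rated as roughly equal in substance, so genuine thoroughness doesn't masquerade as length bias.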
Mitigations that work
Four practices:
- Randomise position. Always randomise comparison order; average across positions.
- Control for length. Normalise for output length in scoring.
- Cross-judge. Use a different model family to judge than the model under test.
- Spot-check with humans. Sample a fraction of judge outputs and verify against expert reviewers.
Each adds cost. Combined, they produce evaluation numbers that survive external review.
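To make the first three mitigations concrete, here is a minimal sketch of a debiased pairwise scorer. Everything in it is an assumption for illustration: the `Judgment` shape, the `judge_fn` wrapper (which is where cross-judging lives, by passing in a judge from a different model family), and the crude word-count penalty, which is only one of several ways to control for length.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Judgment:
    score_a: float  # judge's score for the answer shown in slot A
    score_b: float  # judge's score for the answer shown in slot B

# judge_fn wraps the judge-model API call and response parsing.
# Cross-judging lives here: pass a judge from a different model
# family than the model that produced the answers.
JudgeFn = Callable[[str, str, str], Judgment]

def length_penalty(text: str, target_words: int = 300, strength: float = 1.0) -> float:
    """Crude length control: dock score in proportion to how far an
    output overshoots a target word count."""
    overshoot = max(0, len(text.split()) - target_words)
    return strength * overshoot / target_words

def debiased_pairwise_score(
    judge_fn: JudgeFn, prompt: str, a: str, b: str
) -> tuple[float, float]:
    """Judge both presentation orders and average, so neither answer
    benefits from sitting in the favoured slot, then subtract the
    length penalty from each side."""
    fwd = judge_fn(prompt, a, b)  # a in slot A, b in slot B
    rev = judge_fn(prompt, b, a)  # b in slot A, a in slot B
    score_a = (fwd.score_a + rev.score_b) / 2 - length_penalty(a)
    score_b = (fwd.score_b + rev.score_a) / 2 - length_penalty(b)
    return score_a, score_b
```

Judging both orders outright costs twice the judge calls, but it removes position bias per item rather than merely averaging it out across the dataset, which is why it is worth the spend on high-stakes runs.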
How Griffin AI's evals handle it
Three practices:
- Position and length biases are controlled in the scoring pipeline.
- Cross-judging uses different model families for scoring than for generation.
- Human spot-checks are sampled on every eval run; disagreement with the judge triggers investigation.
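Griffin AI's internals aren't shown here; as a generic illustration of the spot-check-and-flag pattern, the sketch below samples a fixed fraction of verdicts for human review and refuses to publish a run whose judge/human disagreement exceeds a threshold. The record shape, 5% sample, and 10% threshold are all assumptions.

```python
import random

# Assumptions: each judgment is a dict with an "id" and the judge's
# "verdict"; human_labels maps id -> the expert reviewer's verdict.
SAMPLE_FRACTION = 0.05   # fraction of verdicts sent for human review
MAX_DISAGREEMENT = 0.10  # tolerated judge/human disagreement rate

def sample_for_review(judgments: list[dict], seed: int = 0) -> list[dict]:
    """Draw a fixed-seed random sample of judge verdicts for review."""
    rng = random.Random(seed)
    k = max(1, int(len(judgments) * SAMPLE_FRACTION))
    return rng.sample(judgments, k)

def gate_eval_run(judgments: list[dict], human_labels: dict) -> float:
    """Block publication when humans and the judge disagree too often."""
    sampled = sample_for_review(judgments)
    disagreements = sum(
        1 for j in sampled if human_labels[j["id"]] != j["verdict"]
    )
    rate = disagreements / len(sampled)
    if rate > MAX_DISAGREEMENT:
        raise RuntimeError(
            f"judge/human disagreement {rate:.0%} exceeds "
            f"{MAX_DISAGREEMENT:.0%}; investigate before publishing"
        )
    return rate
```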
What customers should ask
Three questions:
- What LLM-as-judge biases are controlled in your eval methodology?
- Do you cross-judge across model families?
- What percentage of judgments are human-verified?
How Safeguard helps
Safeguard's Griffin AI eval harness applies bias controls, cross-judging, and human sampling, and the published numbers reflect that discipline. For customers whose evaluation of AI-for-security vendors depends on benchmark integrity, this is the methodological depth to look for.