Using a large language model to score another model's output is an attractive approach to evaluation. It scales; it's cheaper than human review; it feels rigorous. For general tasks, it can work reasonably well. For security evaluations specifically, LLM-as-judge has pitfalls that produce systematically biased results. The pitfalls are well-documented; the mitigations are specific; vendors who skip the mitigations publish flawed numbers.
Known LLM-as-judge biases
Four are well documented:
- Position bias. Models consistently prefer the first or last response in a comparison.
- Length bias. Longer responses are scored higher even when they're worse.
- Self-preference. Models rate their own outputs higher than other models'.
- Confidence bias. Confident-sounding output scores higher than output that is correct but hedged.
Each of these skews security eval results in a predictable direction.
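Position bias in particular is cheap to measure: present each pair in both orders and count how often the verdict follows the slot rather than the answer. Here is a minimal sketch in Python; the `toy_judge` function and its 70% slot preference are illustrative assumptions standing in for a real judge-model API call.

```python
import random

def toy_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy stand-in for a real judge-model API call. This one prefers
    whatever sits in slot A 70% of the time, regardless of content,
    to simulate position bias."""
    return "A" if random.random() < 0.7 else "B"

def position_flip_rate(judge, pairs: list[tuple[str, str, str]]) -> float:
    """Judge each (prompt, x, y) pair in both orders. A content-driven
    judge picks the same underlying answer both times ('A' then 'B',
    or 'B' then 'A'); the same letter twice means the slot, not the
    answer, decided the verdict."""
    flips = 0
    for prompt, x, y in pairs:
        first = judge(prompt, x, y)   # x in slot A
        second = judge(prompt, y, x)  # x in slot B
        if first == second:
            flips += 1
    return flips / len(pairs)

if __name__ == "__main__":
    pairs = [("Compare these findings.", "finding one", "finding two")] * 200
    # Expect roughly 0.58 for the toy judge; 0 for a perfectly
    # content-driven judge, 0.5 for a coin flip.
    print(f"flip rate: {position_flip_rate(toy_judge, pairs):.2f}")
```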
Why security evals are particularly vulnerable
Three reasons:
- Verbosity reads as thoroughness. Security findings are expected to be detailed, so length bias compounds: a padded report looks diligent.
- Confidence reads as authority. Authoritative-sounding analysis scores higher, which rewards confident hallucinations over careful hedging.
- Self-preference is structural. When a vendor's model judges that same vendor's outputs, the results are biased by construction.
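A quick diagnostic for the first point is to correlate judge scores with output length. A minimal sketch, assuming word count as the length measure:

```python
from statistics import mean, pstdev

def length_score_correlation(outputs: list[str], scores: list[float]) -> float:
    """Pearson correlation between word count and judge score.
    Assumes both lengths and scores actually vary (nonzero
    standard deviation)."""
    lengths = [float(len(o.split())) for o in outputs]
    mx, my = mean(lengths), mean(scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(lengths, scores))
    return cov / (pstdev(lengths) * pstdev(scores))
```

To isolate the bias, run this on a set that expert reviewers have already rated as roughly equal in substance, so genuine thoroughness doesn't masquerade as length bias.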
Mitigations that work
Four practices:
- Randomise position. Always randomise comparison order; average across positions.
- Control for length. Normalise for output length in scoring.
- Cross-judge. Use a different model family to judge than the model under test.
- Spot-check with humans. Sample a fraction of judge outputs and verify against expert reviewers.
Each adds cost. Combined, they produce evaluation numbers that survive external review.
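To make the first three mitigations concrete, here is a minimal sketch of a debiased pairwise scorer. Everything in it is an assumption for illustration: the `Judgment` shape, the `judge_fn` wrapper (which is where cross-judging lives, by passing in a judge from a different model family), and the crude word-count penalty, which is only one of several ways to control for length.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Judgment:
    score_a: float  # judge's score for the answer shown in slot A
    score_b: float  # judge's score for the answer shown in slot B

# judge_fn wraps the judge-model API call and response parsing.
# Cross-judging lives here: pass a judge from a different model
# family than the model that produced the answers.
JudgeFn = Callable[[str, str, str], Judgment]

def length_penalty(text: str, target_words: int = 300, strength: float = 1.0) -> float:
    """Crude length control: dock score in proportion to how far an
    output overshoots a target word count."""
    overshoot = max(0, len(text.split()) - target_words)
    return strength * overshoot / target_words

def debiased_pairwise_score(
    judge_fn: JudgeFn, prompt: str, a: str, b: str
) -> tuple[float, float]:
    """Judge both presentation orders and average, so neither answer
    benefits from sitting in the favoured slot, then subtract the
    length penalty from each side."""
    fwd = judge_fn(prompt, a, b)  # a in slot A, b in slot B
    rev = judge_fn(prompt, b, a)  # b in slot A, a in slot B
    score_a = (fwd.score_a + rev.score_b) / 2 - length_penalty(a)
    score_b = (fwd.score_b + rev.score_a) / 2 - length_penalty(b)
    return score_a, score_b
```

Judging both orders outright costs twice the judge calls, but it removes position bias per item rather than merely averaging it out across the dataset, which is why it is worth the spend on high-stakes runs.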
How Griffin AI's evals handle it
Three practices:
- Position and length biases are controlled in the scoring pipeline.
- Cross-judging uses different model families for scoring than for generation.
- Human spot-checks are sampled on every eval run; disagreement with the judge triggers investigation.
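Griffin AI's internals aren't shown here; as a generic illustration of the spot-check-and-flag pattern, the sketch below samples a fixed fraction of verdicts for human review and refuses to publish a run whose judge/human disagreement exceeds a threshold. The record shape, 5% sample, and 10% threshold are all assumptions.

```python
import random

# Assumptions: each judgment is a dict with an "id" and the judge's
# "verdict"; human_labels maps id -> the expert reviewer's verdict.
SAMPLE_FRACTION = 0.05   # fraction of verdicts sent for human review
MAX_DISAGREEMENT = 0.10  # tolerated judge/human disagreement rate

def sample_for_review(judgments: list[dict], seed: int = 0) -> list[dict]:
    """Draw a fixed-seed random sample of judge verdicts for review."""
    rng = random.Random(seed)
    k = max(1, int(len(judgments) * SAMPLE_FRACTION))
    return rng.sample(judgments, k)

def gate_eval_run(judgments: list[dict], human_labels: dict) -> float:
    """Block publication when humans and the judge disagree too often."""
    sampled = sample_for_review(judgments)
    disagreements = sum(
        1 for j in sampled if human_labels[j["id"]] != j["verdict"]
    )
    rate = disagreements / len(sampled)
    if rate > MAX_DISAGREEMENT:
        raise RuntimeError(
            f"judge/human disagreement {rate:.0%} exceeds "
            f"{MAX_DISAGREEMENT:.0%}; investigate before publishing"
        )
    return rate
```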
What customers should ask
Three questions:
- What LLM-as-judge biases are controlled in your eval methodology?
- Do you cross-judge across model families?
- What percentage of judgments are human-verified?
How Safeguard helps
Safeguard's Griffin AI eval harness applies bias controls, cross-judging, and human sampling, and the published numbers reflect that discipline. For customers whose evaluation of AI-for-security vendors depends on benchmark integrity, this is the methodological depth to look for.