LLM-as-Judge Pitfalls in Security Evals
Using an LLM to score another LLM's output is expedient and dangerous. The judge has its own biases — ones that affect security evaluations specifically.
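One bias worth testing for is position bias: LLM judges tend to prefer whichever answer appears first in the prompt. Below is a minimal sketch of the standard swap-and-compare control, assuming a hypothetical `call_judge` wrapper around whatever judge model you use that returns "A" or "B" for a pairwise comparison:

```python
def judge_pair(call_judge, prompt, answer_a, answer_b):
    """Ask the judge twice with the answers swapped; only accept a verdict
    that survives the swap. `call_judge` is an assumed wrapper, not a
    specific API."""
    first = call_judge(prompt, answer_a, answer_b)   # A shown first
    second = call_judge(prompt, answer_b, answer_a)  # B shown first
    # Map the second verdict back to the original labelling.
    second = {"A": "B", "B": "A"}.get(second, second)
    if first == second:
        return first  # verdict is stable under reordering
    return "tie"      # disagreement: treat as no preference
```

If the tie rate jumps once you add the swap, the judge was scoring order, not quality.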
Two years ago, AI vendors shipped without evals. In 2026, the posture has shifted. Customers expect benchmarks. Vendors without them lose deals.
Benchmark scores are only as honest as the dataset behind them. Griffin AI publishes golden-dataset design notes; Mythos-class tools rarely explain theirs.
A benchmark that the model has seen in training is a benchmark of memorisation. Specific leakage-testing methods separate generalisation from recall.
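One common leakage probe is long n-gram overlap: flag benchmark items whose token sequences reappear in a sample of the training corpus. This is a sketch of that idea, with an assumed benchmark item shape (a dict with a "question" field) and an illustrative threshold:

```python
def ngrams(text, n=13):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leaked_items(benchmark, corpus_sample, n=13, threshold=0.2):
    """Flag items whose n-gram overlap with the corpus sample exceeds the
    threshold. Both the item schema and the threshold are assumptions."""
    corpus_grams = set()
    for doc in corpus_sample:
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for item in benchmark:
        grams = ngrams(item["question"], n)
        if grams and len(grams & corpus_grams) / len(grams) >= threshold:
            flagged.append(item)
    return flagged
```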
Every release risks making the model worse. Griffin AI's regression gates block bad builds before they ship. Mythos-class tools rarely describe a gate process at all.
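A gate can be as simple as a CI step that compares the candidate build's eval metrics to the last released baseline and fails on a meaningful regression. The metric names, baseline values, and tolerances below are illustrative, not Griffin AI's actual gate:

```python
import sys

BASELINE = {"detection_f1": 0.91, "false_positive_rate": 0.04}
# Negative tolerance: metric may not drop by more than this.
# Positive tolerance: metric may not rise by more than this.
TOLERANCE = {"detection_f1": -0.01, "false_positive_rate": +0.01}

def gate(candidate: dict) -> bool:
    ok = True
    for metric, base in BASELINE.items():
        delta = candidate[metric] - base
        limit = TOLERANCE[metric]
        worse = delta < limit if limit < 0 else delta > limit
        if worse:
            print(f"FAIL {metric}: {base:.3f} -> {candidate[metric]:.3f}")
            ok = False
    return ok

if __name__ == "__main__":
    candidate = {"detection_f1": 0.89, "false_positive_rate": 0.05}
    sys.exit(0 if gate(candidate) else 1)  # nonzero exit blocks the release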
When the test set is in the training set, the benchmark is broken. Security eval contamination is widespread and the mitigations are specific.
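One specific mitigation is to embed canary strings in the eval set, then test whether the model can complete them: a successful completion means the benchmark leaked into training. The canary value and the `generate` callable here are assumptions for illustration:

```python
CANARY = "safeguard-eval-canary-7f3a9c1e"

def eval_set_contaminated(generate, prefix=CANARY[:20]) -> bool:
    """`generate` is an assumed text-completion callable for the model
    under test. If the model reproduces the canary's tail from its prefix,
    the eval set almost certainly appeared in training data."""
    completion = generate(prefix, max_tokens=16)
    return CANARY[20:] in completion
```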
A security AI that refuses too often is useless. One that refuses too rarely is dangerous. Griffin AI publishes calibrated refusal benchmarks; Mythos does not.
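Measuring both failure modes takes two labelled prompt sets: over-refusal on benign security questions and under-refusal on genuinely harmful ones. A sketch, where `is_refusal` and the prompt sets are assumptions rather than a published harness:

```python
def refusal_rates(model, benign_prompts, harmful_prompts, is_refusal):
    """Over-refusal: fraction of benign prompts refused.
    Under-refusal: fraction of harmful prompts answered."""
    over = sum(is_refusal(model(p)) for p in benign_prompts) / len(benign_prompts)
    under = sum(not is_refusal(model(p)) for p in harmful_prompts) / len(harmful_prompts)
    return {"over_refusal": over, "under_refusal": under}
```

A calibrated tool keeps both numbers low at the same time; optimising either one alone is easy and useless.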
SEvenLLM set out to measure how well LLMs handle Security Event analysis, the unglamorous day-to-day work of SOCs and IR teams. A design review of what the benchmark covers, how it was built, and where its coverage does and does not map to real operations.
An AI security tool that cites the wrong advisory is worse than one that says nothing. Griffin AI benchmarks citation accuracy at 0.89 similarity; Mythos does not.
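One plausible way to compute such a similarity score is a character-level ratio between cited and expected advisory identifiers, averaged over a labelled set. This sketch uses Python's difflib and is an assumption about the harness, not Griffin AI's published method:

```python
from difflib import SequenceMatcher

def citation_accuracy(cited: list[str], truth: list[str]) -> float:
    """Mean string-similarity ratio between each cited advisory ID and
    its ground-truth counterpart; 1.0 means every citation is exact."""
    scores = [SequenceMatcher(None, got, expected).ratio()
              for got, expected in zip(cited, truth)]
    return sum(scores) / len(scores) if scores else 0.0

# e.g. citation_accuracy(["CVE-2024-3094"], ["CVE-2024-3094"]) == 1.0
```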