LLM-As-Judge Pitfalls In Security Evals
Using an LLM to score another LLM's output is expedient and dangerous. The judge has its own biases — ones that affect security evaluations specifically.
SEvenLLM set out to measure how well LLMs handle security event analysis, the unglamorous day-to-day work of SOCs and IR teams. A design review of what the benchmark covers, how it was built, and where its coverage does and does not map to real operations.
SecBench positioned itself as a comprehensive cybersecurity knowledge and reasoning benchmark for LLMs. A methodology review of its construction, scoring, and the gaps that separate the advertised coverage from what the benchmark actually exercises.
SWE-bench became the default benchmark for measuring AI coding agents, but the security extensions that were bolted on afterwards deserve their own scrutiny. A field review of what they measure, where they break, and whether you should trust the numbers.
A working engineer's review of CyberSecEval, the Meta-originated benchmark that has quietly become the default sniff test for AI-for-security claims. What it actually measures, what it misses, and how to read its scores without fooling yourself.
A practical framework for scoring and ranking software vendor risk based on supply chain security posture, vulnerability history, and development practices.