Benchmark Contamination Concerns In Security Evals
When the test set is in the training set, the benchmark is broken. Security eval contamination is widespread and the mitigations are specific.
Deep dives, practical guides, and incident analyses from engineers who build Safeguard. No fluff, no vendor FUD — just what you need to ship secure software.
Anthropic's Claude Agent Skills let you package tools and context for Claude. Here's how that primitive compares to Griffin's security-specific workflow scaffolding.
A senior engineer's side-by-side look at Griffin AI and Mythos — why engine-grounded reasoning beats pure-LLM security intuition when the audit clock starts.
A million-token context window is a tool, not a solution. Context grounding for security requires architecture, not just capacity.
Reasoning models have arrived in security tooling. Evaluating them requires a different methodology from evaluating classification or generation models. Here is what good evaluation looks like.
RSA Conference 2026 centered on AI governance, software supply chain regulation, and vendor consolidation. Here is the analyst view of what mattered.
When an agent can call tools, the permission boundary is no longer between the user and the system. It is between the model's current beliefs and everything the model can reach. That is a much harder boundary to defend.
Gemini's function calling is strong and flexible. Griffin AI's tool layer is narrow and opinionated. For security workflows, the opinionated approach wins.
Model weights are binaries with the privilege of code and the review of documents. Here is what signing, attestation, and provenance should actually look like.