Model Family · Benchmarks

Numbers, not vibes.

The Safeguard model lineup — Lino, Eagle, and the five Griffin variants — evaluated against a frontier-class general-purpose LLM baseline across 19 security-domain tasks. Held-out sets, full methodology, and an explanation for every gap: why each delta exists, not just that it does.

A note on the baseline.

The baseline column is a frontier-class general-purpose multimodal LLM in the 30B-to-70B-parameter band — chosen because it represents the strongest off-the-shelf alternative an engineering team is likely to reach for. We are deliberately not naming it; the goal of this page is to characterise the gap between a general-purpose model and a security-tuned model lineup, not to litigate a single vendor. All numbers come from internal, held-out evaluation sets described below. If you want the underlying eval-set definitions, the per-row methodology, or a run of the same suite against a model of your choice, ask us — that's a 30-minute conversation, not a 6-month engagement.

19 capabilities. Side by side.

"—" means the model is not designed for that capability (e.g. Lino does not output CWE classes; it flags the sink for Griffin to classify). Cells are not arbitrary upgrades — they reflect actual capability boundaries.

| Capability | Baseline (general-purpose frontier LLM) | Lino | Eagle | Griffin Lite | Griffin S | Griffin M | Griffin L | Griffin Zero |
|---|---|---|---|---|---|---|---|---|
| **Detection** | | | | | | | | |
| Inline sink detection F1 (held-out) | 0.42 | 0.79 | 0.81 | 0.83 | 0.85 | 0.87 | 0.89 | 0.91 |
| Secret pattern recall (Gitleaks-aligned) | 0.61 | 0.94 | 0.96 | 0.95 | 0.96 | 0.97 | 0.98 | 0.98 |
| Sanitiser-quality scoring F1 | 0.38 | 0.71 | 0.77 | 0.79 | 0.82 | 0.85 | 0.88 | 0.91 |
| **Reasoning** | | | | | | | | |
| CWE classification accuracy ¹ | 0.49 | — | 0.69 | 0.74 | 0.79 | 0.83 | 0.87 | 0.91 |
| Exploit-hypothesis accuracy (held-out CVE) | 0.31 | — | — | 0.68 | 0.73 | 0.77 | 0.81 | 0.88 |
| Cross-package taint chain reasoning (≤7 hops) | 0.22 | — | 0.61 | 0.58 | 0.69 | 0.76 | 0.83 | 0.91 |
| Multi-finding correlation (single pass) | 0.18 | — | 0.55 | 0.47 | 0.59 | 0.72 | 0.81 | 0.89 |
| **Path quality** | | | | | | | | |
| Top-5 candidate-path recall vs ground truth ² | 0.58 | — | 0.94 | — | — | — | — | — |
| Cluster dedup precision (near-duplicate finding collapse) | 0.44 | — | 0.87 | — | — | — | — | — |
| **Remediation** | | | | | | | | |
| Auto-fix patch pass rate (project tests green after apply) | 0.27 | — | — | 0.54 | 0.62 | 0.71 | 0.79 | 0.84 |
| Patch minimal-diff score (lines changed vs reference) | 0.41 | — | — | 0.66 | 0.72 | 0.78 | 0.83 | 0.87 |
| Sanitiser-aware patch synthesis F1 | 0.23 | — | — | 0.55 | 0.64 | 0.72 | 0.80 | 0.87 |
| **Safety & robustness** | | | | | | | | |
| Adversarial prompt-injection resistance | 0.67 | 0.91 | 0.93 | 0.94 | 0.96 | 0.97 | 0.98 | 0.99 |
| Hallucination rate on security Q&A (lower is better) | 8.2% | 2.1% | 1.8% | 1.4% | 1.1% | 0.9% | 0.6% | 0.3% |
| Refusal rate on legitimate security research (lower is better) | 34% | 5% | 3% | 3% | 2% | 1.4% | 1.0% | 0.6% |
| Structured-trace audit pass rate ³ | 0.12 | — | — | 0.79 | 0.87 | 0.92 | 0.96 | 0.98 |
| **Latency & cost** | | | | | | | | |
| p95 latency (single-finding reasoning) | ~3.2s | <80ms | 420ms | ~1.2s | ~2.8s | ~5.5s | ~8s | ~12s |
| Tokens-per-finding (efficiency) ⁴ | ~6,400 | ~480 | ~720 | ~1,100 | ~1,650 | ~2,400 | ~3,200 | ~4,100 |
| **Context** | | | | | | | | |
| Usable context window ⁵ | 128k | 8k | 32k | 32k | 64k | 128k | 128k | 256k |

¹ Lino is intentionally not trained for full CWE class output — it flags the sink, Griffin assigns the class.
² Eagle is the ranking head; downstream Griffin reasons within the queue Eagle produces.
³ The baseline LLM does not emit a structured trace; "pass rate" measures whether an unstructured chain-of-thought maps to the structured contract.
⁴ The security-augmented tokeniser compresses CWE IDs, taint operators, and package coordinates into single tokens.
⁵ The baseline's 128k drops sharply on long-context QA; the Aegis attention pattern holds usability at 256k via retrieval gates.

Why the gap exists.

Six structural reasons the Safeguard lineup outperforms general-purpose models on security tasks. None of them is "we scaled bigger".

Security-only training corpus

We train on CVE write-ups, exploit research, patched diffs, advisory text, taint graphs, MITRE ATT&CK procedures, and labelled SAST findings. Not general web crawl, not StackOverflow without a security frame, not LLM-generated text. The corpus thesis is documented publicly.

Security-augmented tokeniser (~28k extra tokens)

CWE IDs, CVE IDs, taint operators, package coordinates (purl format), and attack-pattern shorthand each get a single token instead of 4–8 BPE pieces. That density is why our tokens-per-finding numbers are 3-6x lower than baseline.
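
A minimal sketch of the idea, not the production tokeniser: match security atoms first, emit each as a single reserved token, and let everything else fall through to the ordinary subword pass. The patterns below and the whitespace stand-in for BPE are illustrative.

```python
import re

# Illustrative sketch only, not the production tokeniser: security atoms
# (CVE IDs, CWE classes, purl package coordinates) are matched first and
# kept as single tokens; everything else falls through to the ordinary
# subword pass. Whitespace splitting stands in for BPE here.
ATOM = re.compile(
    r"(CVE-\d{4}-\d{4,7}"                     # CVE identifiers
    r"|CWE-\d{1,4}"                           # CWE classes
    r"|pkg:[a-z]+/[\w.\-]+(?:@[\w.\-]+)?)"    # package coordinates (purl)
)

def tokenise(text: str) -> list[str]:
    tokens: list[str] = []
    for piece in ATOM.split(text):
        if ATOM.fullmatch(piece):
            tokens.append(piece)              # one reserved token per atom
        else:
            tokens.extend(piece.split())      # stand-in for the BPE pass
    return tokens

print(tokenise("taint from CVE-2024-12345 reaches pkg:npm/left-pad@1.3.0"))
# ['taint', 'from', 'CVE-2024-12345', 'reaches', 'pkg:npm/left-pad@1.3.0']
```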

Aegis attention (long context that actually holds)

Sliding-window plus landmark attention with retrieval gates that pre-rank call-graph chunks before attention runs. Most baseline models advertise 128k context but degrade past 32k on real workloads. Griffin holds task performance at its advertised window.
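
A rough sketch of the gating step; random vectors stand in for call-graph chunk embeddings, and cosine similarity stands in for the learned ranker:

```python
import numpy as np

# Illustrative retrieval gate (not the Aegis internals): score each
# call-graph chunk against the finding's query embedding, and let only
# the top-k chunks into the attention window. Sliding-window and
# landmark attention then run over far fewer tokens.
def retrieval_gate(query: np.ndarray, chunks: np.ndarray, k: int) -> np.ndarray:
    """query: (d,) embedding; chunks: (n, d) chunk embeddings.
    Returns indices of the k highest-scoring chunks, best first."""
    q = query / np.linalg.norm(query)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    scores = c @ q                            # cosine similarity per chunk
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
query, chunks = rng.normal(size=128), rng.normal(size=(500, 128))
kept = retrieval_gate(query, chunks, k=32)    # 32 of 500 chunks reach attention
```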

Adversarial disproof pass

Every candidate finding runs through a second decoder head that tries to refute it. Only candidates that survive the disproof reach the human queue. This is what keeps the false-positive rate honest at production scale.
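
In control-flow terms the pass looks roughly like this; the `refute` callable is a stand-in for the second decoder head:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Control-flow sketch of the adversarial disproof pass: the real refuter
# is a second decoder head; `refute` here is a stand-in predicate that
# returns a counterexample string, or None if it cannot break the finding.
@dataclass
class Finding:
    sink: str
    hypothesis: str

def disproof_pass(candidates: list[Finding],
                  refute: Callable[[Finding], Optional[str]]) -> list[Finding]:
    """Only findings the refuter cannot break reach the human queue."""
    queue = []
    for finding in candidates:
        counterexample = refute(finding)      # e.g. "input is sanitised upstream"
        if counterexample is None:            # refuter failed: finding survives
            queue.append(finding)
    return queue
```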

Structured reasoning trace as a first-class output

Every Griffin response is emitted as HYPOTHESIS / CITED PATH / DISPROOF / PROPOSED PATCH. We optimise against the structured contract during training, not just final-answer accuracy. The trace is what reviewers actually read.
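
As a schema, the contract looks roughly like this; the field names come from the trace format above, while the concrete Python types and the audit check are assumptions for illustration:

```python
from dataclasses import dataclass

# Schema sketch of the four-field trace contract.
@dataclass
class ReasoningTrace:
    hypothesis: str         # HYPOTHESIS: what the model claims is wrong
    cited_path: list[str]   # CITED PATH: source-to-sink hops, each citable
    disproof: str           # DISPROOF: the refutation attempt and its outcome
    proposed_patch: str     # PROPOSED PATCH: the minimal diff closing the sink

def audit(trace: ReasoningTrace) -> bool:
    """A minimal stand-in for the structured-trace audit:
    every field present and non-empty."""
    return all([trace.hypothesis, trace.cited_path,
                trace.disproof, trace.proposed_patch])
```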

Trace distillation for the inline tier

Lino is trained against both (input, label) and (input, intermediate trace) pairs from the production Griffin teacher. That's why a 1B student covers most of the inline workload at sub-100ms instead of collapsing to keyword matching.
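
A toy sketch of the combined objective, assuming a two-headed student and an even default weighting; the real loss shape and weights are not published here:

```python
from dataclasses import dataclass

# Toy objective sketch: the two-head decomposition and the 50/50 default
# weighting are assumptions for illustration.
@dataclass
class StudentLosses:
    label: float   # loss on (input, label) pairs from the Griffin teacher
    trace: float   # loss on (input, intermediate trace) pairs

def distillation_loss(losses: StudentLosses, alpha: float = 0.5) -> float:
    """alpha = 1.0 recovers plain label supervision; alpha < 1.0 forces
    the student to also match the teacher's reasoning trace."""
    return alpha * losses.label + (1 - alpha) * losses.trace
```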

What the numbers are saying.

Detection (Lino · Eagle · Lite onward)

On inline sink detection, Lino at 1B already outperforms the baseline 30B-class model by 0.37 F1. That gap is not capability scaling — it's the security corpus + tokeniser. Lino doesn't know what a sonnet is; it has seen thousands of deserialisation sinks and the patches that closed them. By Griffin L the F1 sits at 0.89; by Zero, 0.91. The remaining ceiling is mostly ambiguous source/sink pairs where even senior security engineers disagree.

Reasoning (Griffin variants)

The most dramatic gap. The baseline gets 0.22 F1 on cross-package taint chains of 7 hops or fewer because a generic model walks the chain like a code-completion task and loses thread after 3-4 hops. Griffin M at 0.76 and Griffin L at 0.83 walk the chain because the structured trace contract is what the model was trained to produce — HYPOTHESIS → CITED PATH → DISPROOF → PATCH is the output format, not a chain-of-thought hack. Griffin Zero crosses 0.91 because its 256k context plus retrieval gates can hold the entire call graph in working memory for the chains that matter.
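
For intuition about hop depth, a toy walk over an invented three-package chain; the real reachability analysis is far richer, but this is the shape of the cited path:

```python
# Toy walk of a cross-package taint chain (packages, edges, and sink are
# invented): follow call edges from a user-controlled source toward a
# sink, recording the cited path, and give up past the hop budget
# (the thread a generic model loses after 3-4 hops).
EDGES = {  # (package, function) -> callee it passes tainted data to
    ("gateway", "parse_request"): ("querylib", "build_filter"),
    ("querylib", "build_filter"): ("ormcore", "raw_query"),
}
SINKS = {("ormcore", "raw_query")}

def walk_taint(source, max_hops=7):
    path, node = [source], source
    for _ in range(max_hops):
        node = EDGES.get(node)
        if node is None:
            return None                       # dead end: no reachable sink
        path.append(node)
        if node in SINKS:
            return path                       # the CITED PATH, source to sink
    return None                               # hop budget exhausted

print(walk_taint(("gateway", "parse_request")))
# [('gateway', 'parse_request'), ('querylib', 'build_filter'), ('ormcore', 'raw_query')]
```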

Path quality (Eagle)

Eagle is not trying to beat Griffin at reasoning. Eagle is trying to make sure Griffin's reasoning budget lands on the right candidates. 0.94 top-5 recall vs the baseline's 0.58 is what makes the rest of the lineup economically viable — without it, Griffin would be reasoning over 10x more noise. The cluster dedup precision (0.87 vs baseline 0.44) collapses near-duplicate findings into a single reviewable item; the median finding count per repo drops about 40% with no recall loss.
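
A sketch of the collapse step; greedy string-similarity clustering is an illustrative stand-in for Eagle's actual near-duplicate detection, and the 0.85 threshold is invented:

```python
from difflib import SequenceMatcher

# Near-duplicate finding collapse, sketched with stdlib string similarity.
def dedup(findings: list[str], threshold: float = 0.85) -> list[str]:
    """Greedy single-pass clustering: a finding joins the first cluster
    whose representative it resembles, otherwise it opens a new one."""
    representatives: list[str] = []
    for f in findings:
        if not any(SequenceMatcher(None, f, rep).ratio() >= threshold
                   for rep in representatives):
            representatives.append(f)         # one reviewable item per cluster
    return representatives

findings = [
    "SQLi via build_filter in querylib (param `q` unescaped)",
    "SQLi via build_filter in querylib (param `sort` unescaped)",
    "Hardcoded AWS key in deploy.sh",
]
print(dedup(findings))                        # the two SQLi findings collapse to one
```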

Remediation (Griffin)

The baseline writes patches that compile but fail the project's own tests 73% of the time. Griffin L crosses a 79% pass rate because the auto-fix pipeline runs the project's test suite in a sandbox before the PR is opened — if the sandboxed tests fail, Griffin retries with the next-best patch, up to N times. The minimal-diff score matters too: small, surgical fixes are easier for reviewers to approve and easier to roll back if something else breaks.
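
In control-flow terms, roughly (function names are stand-ins, and the attempt cap is illustrative; the page commits only to "up to N times"):

```python
from typing import Callable, Optional

# Sketch of the auto-fix loop: candidate patches are tried best-first,
# each applied in a sandbox and accepted only if the project's own test
# suite stays green.
def auto_fix(candidate_patches: list[str],
             apply_in_sandbox: Callable[[str], None],
             tests_pass: Callable[[], bool],
             max_attempts: int = 3) -> Optional[str]:
    """Accept the first patch that leaves the tests green;
    otherwise escalate to a human."""
    for patch in candidate_patches[:max_attempts]:
        apply_in_sandbox(patch)               # never touches the working tree
        if tests_pass():                      # "project tests green after apply"
            return patch                      # this patch becomes the PR
    return None                               # no survivor: human review queue
```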

Safety & robustness

Adversarial prompt-injection resistance climbs from 67% on the baseline to 99% on Griffin Zero. The improvement is not a single trick: it comes from adversarial training against a 1,800-item suite covering known jailbreak families plus our internal red team's variations, and from the structured trace contract, which makes injection attempts visible in the trace itself. Refusal rate on legitimate security research drops from 34% to under 1% by Griffin L because the security corpus has taught the model what legitimate security research looks like — the baseline refuses anything that smells like exploit code because it cannot distinguish research from attack.

Latency & cost

Tokens-per-finding is roughly 3-6x lower on the Griffin lineup because the security-augmented tokeniser compresses CWE IDs, taint operators, and package coordinates into single tokens. The baseline burns tokens spelling out "CVE-2024-12345" as 7 BPE pieces; Griffin sees it as 1. Multiply that across a structured trace with citations and the budget difference is real money per scan at production scale.

The evaluation sets.

Six held-out evaluation sets. Rotated each release. None of these are in the training corpus.

Held-out CVE set

4,200 disclosed CVEs not seen in training, balanced across CWE classes

Real-world taint graph set

~800k annotated source/sink pairs from public open-source repos

Adversarial prompt-injection suite

1,800 injection attempts spanning known jailbreak families plus internal red-team variations

Patch-pass-test suite

650 vulnerable repos where the maintainer-accepted patch is held out as ground truth

Multi-hop reasoning set

~12k cross-package taint chains with hop depth from 2 to 14, labelled with reachability verdict

Refusal & helpfulness suite

1,200 legitimate security research questions plus 400 control prompts on actively harmful actions

What we're not claiming.

We're not claiming general-purpose superiority. A frontier general-purpose LLM will beat any Griffin variant at poetry, summarisation, math word problems, or general code completion. We made a single-purpose model lineup, and the trade-off is real.

We're not claiming the gap is permanent. Frontier general models get more capable every quarter. The way we hold the gap is by rotating the corpus, refreshing the eval sets, and shipping new variants — not by hoping the baseline stays still.

We're not claiming 100% on anything. The highest cell on this page is 0.99 (Griffin Zero, adversarial resistance). The 1% that gets through is where the disproof pass, the staged rollout, and the human review queue exist. We design for the failure case as much as for the success case.

Run the suite on your model.

Send us a held-out set from your domain and we'll run the lineup against it side-by-side with whichever model you're comparing to. 30-minute call, no SOW.