Griffin. The hypothesis engine.
Griffin is the heavyweight reasoning family: five size variants spanning 8B to a 671B-MoE flagship, all trained purely on a cybersecurity corpus. It hypothesises exploit chains, cites the call-graph path, attempts a disproof against the project's sanitiser config, and writes the patch.
One brain, five reasoning budgets.
Every variant shares the corpus, tokeniser and reasoning trace format. They differ in parameter count, context window and where they run.
| Variant | Parameters | Context window | Latency p95 | Deployment shape | Typical use |
|---|---|---|---|---|---|
| Griffin Lite | 8B | 32k | ~1.2s | IDE-side cloud burst / CLI deep-scan | Fast single-finding reasoning. |
| Griffin S | 14B | 64k | ~2.8s | Cloud | Mid-depth call-graph reasoning, PR-level reviews. |
| Griffin M | 32B | 128k | ~5.5s | Cloud | Repo-wide reasoning, transitive taint chains. |
| Griffin L | 70B | 128k | ~8s | Dedicated GPU | Multi-hop cross-package exploit hypothesis. Default production tier. |
| Griffin Zero | 671B-MoE (~37B active) | 256k | ~12s | Multi-GPU cluster / sovereign | Deepest reasoning, supply-chain-scale audits. |
The internals that earn the verdict.
Architectural commitments
- Mixture-of-experts (Zero: 8 experts, top-2 routing, ~5.5% activated params per token).
- Security-augmented tokeniser with ~28k extra tokens covering CWE / CVE IDs, taint operators, package coordinates, and attack-pattern shorthand.
- Sliding-window plus landmark attention for long-context call-graph reasoning at 256k.
- Structured reasoning trace: hypothesise the exploit, cite the path, propose a disproof, propose a patch.
- Security-domain RLHF using preference data labelled by senior offensive-security engineers, not generic annotation vendors.
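To make the routing commitment concrete, here is a minimal sketch of top-2 routing over 8 experts in plain Python. It is an illustration, not Griffin's implementation: the function names, the toy scalar experts, and the example router logits are all assumptions, and renormalising the gate weights over the selected experts is just one common convention. The final line only checks the arithmetic behind the quoted figures (~37B active out of 671B).

```python
import math

NUM_EXPERTS = 8  # from the text
TOP_K = 2        # from the text


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalised over the selected experts so they sum to 1
    (an assumed convention for this sketch).
    """
    if len(router_logits) < k:
        raise ValueError("need at least k router logits")
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    gate = softmax([router_logits[i] for i in ranked])
    return list(zip(ranked, gate))


def moe_forward(token, router_logits, experts):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](token) for i, w in route_top_k(router_logits))


# Toy experts (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda x, s=s: x * s for s in range(1, NUM_EXPERTS + 1)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]  # made-up router scores

selected = route_top_k(logits)
output = moe_forward(1.0, logits, experts)

# Arithmetic check on the quoted figures: ~37B active of 671B total.
active_fraction = 37 / 671
```

Only the two selected experts are evaluated for a given token, which is why per-token compute tracks the active parameter count rather than the total.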
Measured against known ground truth.
The trace is the finding.
Every Griffin call emits a four-stage trace. Reviewers see the chain, not a single label, and can reject at any stage.
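As a rough sketch of how a consumer might model this, the four stages can be held in a small record that a reviewer can reject at any stage. The class, field, and method names here are hypothetical and do not describe Griffin's actual output schema; only the four stage names come from the trace format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Stage(Enum):
    HYPOTHESIS = 1
    CITED_PATH = 2
    DISPROOF_ATTEMPT = 3
    PROPOSED_PATCH = 4


@dataclass
class Trace:
    variant: str  # which Griffin size produced the trace
    stages: Dict[Stage, List[str]] = field(default_factory=dict)
    rejected_at: Optional[Stage] = None

    def add(self, stage: Stage, lines: List[str]) -> None:
        """Record the text emitted for one stage."""
        self.stages[stage] = list(lines)

    def is_complete(self) -> bool:
        """True once all four stages are present."""
        return all(stage in self.stages for stage in Stage)

    def reject(self, stage: Stage) -> None:
        """A reviewer rejects the finding at a specific stage."""
        if stage not in self.stages:
            raise ValueError(f"stage {stage.name} has not been emitted")
        self.rejected_at = stage

    @property
    def stands(self) -> bool:
        """The finding stands if the trace is complete and nobody rejected it."""
        return self.is_complete() and self.rejected_at is None


trace = Trace(variant="Griffin L")
trace.add(Stage.HYPOTHESIS, ["class: CWE-502 (unsafe deserialization)"])
trace.add(Stage.CITED_PATH, ["handler.parseRequest() -> service.importConfig()"])
trace.add(Stage.DISPROOF_ATTEMPT, ["refutation failed; finding stands."])
trace.add(Stage.PROPOSED_PATCH, ["bump jackson-databind to >= 2.15.2"])
stands_before_review = trace.stands

trace.reject(Stage.DISPROOF_ATTEMPT)  # reviewer disagrees with the disproof
stands_after_review = trace.stands
```

The sample trace that follows would populate all four stages of such a record.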
[01] HYPOTHESIS
class: CWE-502 (unsafe deserialization)
entry: HTTP POST /api/import-config
gadget: pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10
[02] CITED PATH
handler.parseRequest() -> service.importConfig()
-> codec.decode(bytes) -> ObjectMapper.readValue(InputStream, Object.class)
6 hops, 3 package boundaries, 1 sanitiser bypassed (allow-list mismatch).
[03] DISPROOF ATTEMPT
- polymorphic typing disabled? no (DefaultTyping.NON_FINAL active)
- allow-list enforced? partial; missing on nested key 'plugins'
- sandbox or seccomp profile? none on this code path
refutation failed; finding stands.
[04] PROPOSED PATCH
- replace ObjectMapper.readValue with constrained reader
using ALLOWED_TYPES allow-list
- bump jackson-databind to >= 2.15.2 (advisory-aligned)
- add SecurityManager-equivalent unit test covering nested 'plugins'.
Each finding goes to the cheapest variant that can handle it.
A triage score decides which Griffin size handles each candidate. You don't pay Zero-tier compute for an in-package call.
Eagle assigns a complexity score from the call graph: depth, sanitiser ambiguity, cross-package edges, sink severity.
Cheap, in-package candidates route to Lite. Mid-depth PR work routes to S or M. Multi-hop cross-package paths route to L. Sovereign or long-budget audits route to Zero.
The chosen variant runs the hypothesise / cite / disprove / patch trace. The trace ships with the finding so reviewers can audit which variant produced what.
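The routing steps above can be sketched as a score plus a threshold table. This is a minimal illustration under stated assumptions: the four inputs are the ones named in the text, but the weights, the thresholds, the `Candidate` fields, and the function names are invented for the example and are not Eagle's real scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    depth: int                  # call-graph hops from entry point to sink
    sanitiser_ambiguity: float  # 0.0 (clearly sanitised or not) .. 1.0 (unclear)
    cross_package_edges: int    # package boundaries crossed on the path
    sink_severity: float        # 0.0 .. 1.0
    sovereign: bool = False     # sovereign or long-budget audit


def complexity_score(c: Candidate) -> float:
    """Weighted sum of the four call-graph signals (weights are hypothetical)."""
    return (
        1.0 * c.depth
        + 4.0 * c.sanitiser_ambiguity
        + 2.0 * c.cross_package_edges
        + 3.0 * c.sink_severity
    )


def pick_variant(c: Candidate) -> str:
    """Route to the cheapest variant that fits (thresholds are hypothetical)."""
    if c.sovereign:
        return "Griffin Zero"
    score = complexity_score(c)
    if c.cross_package_edges == 0 and score < 6:
        return "Griffin Lite"  # cheap, in-package
    if score < 10:
        return "Griffin S"     # mid-depth PR work
    if score < 16:
        return "Griffin M"
    return "Griffin L"         # multi-hop, cross-package


in_package = Candidate(depth=2, sanitiser_ambiguity=0.1,
                       cross_package_edges=0, sink_severity=0.5)
cross_package = Candidate(depth=6, sanitiser_ambiguity=0.5,
                          cross_package_edges=3, sink_severity=1.0)
audit = Candidate(depth=3, sanitiser_ambiguity=0.2,
                  cross_package_edges=1, sink_severity=0.4, sovereign=True)

routes = [pick_variant(c) for c in (in_package, cross_package, audit)]
```

Checking the sovereign flag before the score keeps long-budget audits on Zero regardless of how simple an individual path looks.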
How Griffin got to where it is.
Three years, five variants, one corpus discipline. Each release earned its slot against the eval set, not a roadmap deadline.
Internal prototype, "Aegis-0".
Started as a research prototype to test whether a transformer trained narrowly on CVE descriptions, exploit write-ups, and patched diffs would outperform a general-purpose model on three tasks: CWE classification, taint-path hypothesis generation, and patch suggestion. Initial parameter count under 1B. Eval against an internal held-out set of 4,200 disclosed CVEs showed a 28-point F1 lift over an off-the-shelf code model — enough to justify scaling.
First production-grade model: Griffin S (14B).
Scaled the architecture, introduced the security-augmented tokeniser (~28k additional tokens for CWE/CVE IDs, taint operators, package coordinates). Trained on a curated corpus that grew from 1.2M to 11M security-domain documents. Attack success rate on the adversarial prompt-injection eval dropped from 42% to 6%.
Griffin M (32B) and the structured trace contract.
Introduced the structured reasoning trace as a first-class output (HYPOTHESIS / CITED PATH / DISPROOF / PROPOSED PATCH). Eval methodology shifted from "did the model find the bug" to "did the model find the bug and refute its own hypothesis under sanitiser-aware constraints." This is what later became the disproof pass.
Griffin L (70B) — default production tier.
The 70B variant became the default for production-grade reasoning workloads. Long-context attention added: sliding-window plus landmark, taking usable context from 32k to 128k. Distillation experiments started in parallel — early Lion prototypes derived from Griffin L's reasoning traces.
Griffin Zero (671B-MoE) — sovereign tier.
Mixture-of-experts variant introduced for sovereign and air-gapped deployments. Eight experts, top-2 routing, ~5.5% of parameters activated per token — Zero reaches 671B parameters at the inference cost of a ~37B dense model. Context extended to 256k usable via retrieval gates that pre-rank call-graph chunks before attention. First Sovereign-tier customers onboarded on internal pilots.
Griffin Zero general availability.
Zero made generally available for Sovereign and Air-Gapped tiers in May 2026. Multi-GPU sizing documented from 11x H100 (Growth) to 22x H100 multi-AZ (Mature). Cross-package taint-path precision improved by 12 points over Griffin L on the internal evaluation set. Adversarial disproof pass moved to parallel decoding, reducing end-to-end p95 latency to roughly 12s.
Current research direction.
Three concurrent research tracks: (1) on-device distillation of larger reasoning traces into Lion-class students, with the goal of pushing more reasoning depth into the IDE without breaking the sub-100ms latency budget; (2) adversarial training against prompt-injection attacks observed in real MCP-server traffic; (3) longer-horizon agentic workflows for coordinated disclosure, where Griffin Zero proposes upstream patches, runs them through the maintainer's project test suite, and drafts the disclosure thread.
How a Griffin release actually ships.
Every variant — Lite through Zero — moves through the same six gates. Each one can block the release on its own.
Curation pass
The corpus is filtered against the security-only criteria (no general web crawl, no LLM-generated text, no customer code) and deduplicated, and the held-out eval set is rotated.
Pretraining + security RLHF
Base pretraining on the curated corpus, followed by RLHF where the preference data is labelled by senior offensive-security engineers, not crowdworkers.
Adversarial red team
Internal red team runs prompt-injection, jailbreak, and refusal-rate suites against every checkpoint. A checkpoint that regresses the adversarial scores does not ship, regardless of capability gains.
Eval gate + cited-trace audit
Quantitative eval against the held-out CVE / taint-path / patch suites, plus a manual audit of 300 reasoning traces by the engineering team. Trace-quality regressions block release the same way capability regressions do.
Staged rollout
The variant ships to the shared-cloud tier first (lowest blast radius), then to the dedicated-cluster and VPC-isolated tiers, then to sovereign. Each tier has a 14-day soak period.
Post-release telemetry
Anonymised, aggregated usage and trace-quality metrics feed back into the next curation pass. Customer code never enters the loop.
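The six gates and the soak schedule can be sketched as a simple ordered check. This is an illustration only: the gate names are taken from the list above, but the function names, the pass/fail dictionary, and the reading of the rollout as three back-to-back 14-day stages (with dedicated cluster and VPC-isolated treated as one stage) are assumptions.

```python
from datetime import date, timedelta

GATES = [
    "curation pass",
    "pretraining + security RLHF",
    "adversarial red team",
    "eval gate + cited-trace audit",
    "staged rollout",
    "post-release telemetry",
]

# Assumed reading of the rollout order described in the text.
ROLLOUT_STAGES = ["shared cloud", "dedicated cluster / VPC-isolated", "sovereign"]
SOAK_DAYS = 14


def first_blocking_gate(results):
    """Return the first gate that failed or has no result yet, else None.

    `results` maps gate name -> bool. Any single gate can block the release.
    """
    for gate in GATES:
        if not results.get(gate, False):
            return gate
    return None


def rollout_schedule(start):
    """Earliest start date per stage, assuming each begins when the prior soak ends."""
    return [
        (stage, start + timedelta(days=SOAK_DAYS * i))
        for i, stage in enumerate(ROLLOUT_STAGES)
    ]


# A checkpoint with capability gains but an adversarial regression does not ship.
results = {gate: True for gate in GATES}
results["adversarial red team"] = False
blocked_by = first_blocking_gate(results)

schedule = rollout_schedule(date(2026, 1, 1))  # arbitrary example start date
```

Under these assumptions a variant reaches the sovereign stage no earlier than 28 days after it first ships to shared cloud.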
Put Griffin on your hardest path.
Pick the variant that fits your budget and watch it reason through your real call graph, not a benchmark.