Model Family · Griffin

Griffin. The hypothesis engine.

Griffin is the heavyweight reasoning family: five size variants spanning 8B to a 671B-MoE flagship, all trained purely on a cybersecurity corpus. It hypothesises exploit chains, cites the call-graph path, attempts a disproof against the project's sanitiser config, and writes the patch.

Size variants

One brain, five reasoning budgets.

Every variant shares the corpus, tokeniser and reasoning trace format. They differ in parameter count, context window and where they run.

Variant	Parameters	Context window	Latency p95	Deployment shape	Typical use
Griffin Lite	8B	32k	~1.2s	IDE-side cloud burst / CLI deep-scan	Fast single-finding reasoning.
Griffin S	14B	64k	~2.8s	Cloud	Mid-depth call-graph reasoning, PR-level reviews.
Griffin M	32B	128k	~5.5s	Cloud	Repo-wide reasoning, transitive taint chains.
Griffin L	70B	128k	~8s	Dedicated GPU	Multi-hop cross-package exploit hypothesis. Default production tier.
Griffin Zero	671B-MoE (~37B active)	256k	~12s	Multi-GPU cluster / sovereign	Deepest reasoning, supply-chain-scale audits.

Architecture

The internals that earn the verdict.

Architectural commitments

Mixture-of-experts in Zero: 8 experts, top-2 routing, ~5.5% of parameters active per token.
Security-augmented tokeniser with ~28k extra tokens covering CWE / CVE IDs, taint operators, package coordinates, and attack-pattern shorthand.
Sliding-window plus landmark attention for long-context call-graph reasoning at 256k.
Structured reasoning trace: hypothesise the exploit, cite the path, propose a disproof, propose a patch.
Security-domain RLHF using preference data labelled by senior offensive-security engineers, not generic annotation vendors.
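One simplified reading of "sliding-window plus landmark" attention is a causal mask in which each token sees a local window plus a sparse set of landmark positions. The sketch below illustrates that idea only. The function name, the window size and the landmark stride are hypothetical, and the page does not say how Griffin places or selects landmarks.

```python
def attention_mask(seq_len: int, window: int, landmark_stride: int) -> list:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumption: landmarks sit at every `landmark_stride`-th position.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                       # never look ahead
            local = i - j < window                # inside the sliding window
            landmark = j % landmark_stride == 0   # sparse long-range anchor
            row.append(causal and (local or landmark))
        mask.append(row)
    return mask


mask = attention_mask(seq_len=12, window=3, landmark_stride=4)

# Token 10 sees its local window (8, 9, 10) plus the landmarks at 0, 4 and 8.
visible = [j for j, allowed in enumerate(mask[10]) if allowed]
print(visible)  # [0, 4, 8, 9, 10]
```

Under this reading, each row costs roughly `window + seq_len / landmark_stride` keys instead of `seq_len`, which is what makes very long call-graph contexts tractable.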

Eval highlights

Measured against known ground truth.

Exploit-hypothesis accuracy: 81%
Adversarial prompt resistance: 98%
Hallucination rate on security Q&A: 0.6%
Top-5 candidate path retention vs CVE ground truth: 94%

Reasoning trace

The trace is the finding.

Every Griffin call emits a four-stage trace. Reviewers see the chain, not a single label, and can reject at any stage.

griffin-L · finding #4129

[01] HYPOTHESIS
     class: CWE-502 (unsafe deserialization)
     entry: HTTP POST /api/import-config
     gadget: pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10

[02] CITED PATH
     handler.parseRequest()  -> service.importConfig()
       -> codec.decode(bytes)  -> ObjectMapper.readValue(InputStream, Object.class)
     6 hops, 3 package boundaries, 1 sanitiser bypassed (allow-list mismatch).

[03] DISPROOF ATTEMPT
     - polymorphic typing disabled?  no (DefaultTyping.NON_FINAL active)
     - allow-list enforced?           partial; missing on nested key 'plugins'
     - sandbox or seccomp profile?    none on this code path
     refutation failed; finding stands.

[04] PROPOSED PATCH
     - replace ObjectMapper.readValue with constrained reader
       using ALLOWED_TYPES allow-list
     - bump jackson-databind to >= 2.15.2 (advisory-aligned)
     - add SecurityManager-equivalent unit test covering nested 'plugins'.
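
A trace like the one above could be carried as a small structured record so that a reviewer's rejection is tied to a specific stage. This is a minimal sketch, not Griffin's actual schema. The class and field names are hypothetical, as is the rule that stages before the rejection point still stand, and the rejection reason is invented for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    HYPOTHESIS = 1
    CITED_PATH = 2
    DISPROOF_ATTEMPT = 3
    PROPOSED_PATCH = 4


@dataclass
class Trace:
    variant: str
    finding_id: int
    stages: dict                      # Stage -> list of trace lines
    rejected_at: Optional[Stage] = None
    reject_reason: str = ""

    def reject(self, stage: Stage, reason: str) -> None:
        """Record a reviewer rejection at the given stage."""
        self.rejected_at = stage
        self.reject_reason = reason

    def standing_stages(self) -> list:
        """Stages that precede the rejection point (all four if none)."""
        if self.rejected_at is None:
            return list(Stage)
        return [s for s in Stage if s < self.rejected_at]


trace = Trace(
    variant="griffin-L",
    finding_id=4129,
    stages={
        Stage.HYPOTHESIS: ["class: CWE-502 (unsafe deserialization)"],
        Stage.CITED_PATH: ["handler.parseRequest() -> service.importConfig()"],
        Stage.DISPROOF_ATTEMPT: ["refutation failed; finding stands."],
        Stage.PROPOSED_PATCH: ["bump jackson-databind to >= 2.15.2"],
    },
)

# A reviewer accepts the finding but rejects the patch (reason is illustrative).
trace.reject(Stage.PROPOSED_PATCH, "patch breaks plugin loading")
print([s.name for s in trace.standing_stages()])
# ['HYPOTHESIS', 'CITED_PATH', 'DISPROOF_ATTEMPT']
```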

Auto-routing

Each finding goes to the cheapest variant that can handle it.

A triage score decides which Griffin size handles each candidate. You don't pay Zero-tier compute for an in-package call.

Triage score

Eagle assigns a complexity score from the call graph: depth, sanitiser ambiguity, cross-package edges, sink severity.
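
To make the four inputs concrete, here is a minimal sketch of a weighted complexity score. The weights, caps and field names are assumptions chosen for illustration, and Eagle's real scoring function is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    depth: int                  # call-graph hops from entry point to sink
    sanitiser_ambiguity: float  # 0.0 (clear-cut) .. 1.0 (unclear)
    cross_package_edges: int    # package boundaries crossed
    sink_severity: float        # 0.0 (benign) .. 1.0 (critical)


# Hypothetical caps and weights; the weights sum to 1 so the score stays in [0, 1].
MAX_DEPTH = 10
MAX_EDGES = 5
W_DEPTH, W_AMBIGUITY, W_EDGES, W_SEVERITY = 0.3, 0.2, 0.3, 0.2


def triage_score(c: Candidate) -> float:
    """Combine the four call-graph signals into one score in [0, 1]."""
    depth = min(c.depth, MAX_DEPTH) / MAX_DEPTH
    edges = min(c.cross_package_edges, MAX_EDGES) / MAX_EDGES
    return (
        W_DEPTH * depth
        + W_AMBIGUITY * c.sanitiser_ambiguity
        + W_EDGES * edges
        + W_SEVERITY * c.sink_severity
    )


in_package = Candidate(depth=1, sanitiser_ambiguity=0.0, cross_package_edges=0, sink_severity=0.2)
cross_package = Candidate(depth=6, sanitiser_ambiguity=0.5, cross_package_edges=3, sink_severity=0.9)

print(round(triage_score(in_package), 2))     # 0.07
print(round(triage_score(cross_package), 2))  # 0.64
```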

Variant selection

Cheap, in-package candidates route to Lite. Mid-depth PR work routes to S or M. Multi-hop cross-package paths route to L. Sovereign or long-budget audits route to Zero.
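
The routing rule can be sketched as a threshold ladder over a triage score in [0, 1]. The cut-off values and parameter names here are hypothetical. Only the ordering (Lite, then S or M, then L, with Zero reserved for sovereign or long-budget audits) comes from the description above.

```python
def select_variant(score: float, sovereign: bool = False, long_budget: bool = False) -> str:
    """Pick the cheapest variant expected to handle a candidate.

    Thresholds are illustrative assumptions, not published values.
    """
    if sovereign or long_budget:
        return "Griffin Zero"   # deployment or budget overrides the score
    if score < 0.25:
        return "Griffin Lite"   # cheap, in-package candidates
    if score < 0.45:
        return "Griffin S"      # mid-depth PR work
    if score < 0.65:
        return "Griffin M"      # deeper PR or repo-wide work
    return "Griffin L"          # multi-hop cross-package paths


print(select_variant(0.10))                  # Griffin Lite
print(select_variant(0.50))                  # Griffin M
print(select_variant(0.80))                  # Griffin L
print(select_variant(0.10, sovereign=True))  # Griffin Zero
```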

Reasoning pass

The chosen variant runs the hypothesise / cite / disprove / patch trace. The trace ships with the finding so reviewers can audit which variant produced what.

Development history

How Griffin got to where it is.

Three years, five variants, one corpus discipline. Each release earned its slot against the eval set, not a roadmap deadline.

2023

Internal prototype, "Aegis-0".

Started as a research prototype to test whether a transformer trained narrowly on CVE descriptions, exploit write-ups, and patched diffs would outperform a general-purpose model on three tasks: CWE classification, taint-path hypothesis generation, and patch suggestion. Initial parameter count under 1B. Eval against an internal held-out set of 4,200 disclosed CVEs showed a 28-point F1 lift over an off-the-shelf code model — enough to justify scaling.

Q2 2024

First production-grade model: Griffin S (14B).

Scaled the architecture and introduced the security-augmented tokeniser (~28k additional tokens for CWE/CVE IDs, taint operators, package coordinates). Trained on a curated corpus that grew from 1.2M to 11M security-domain documents. The prompt-injection success rate on the adversarial eval dropped from 42% to 6%.
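
One way to picture what a security-augmented tokeniser buys is a pre-tokenisation pass that keeps security identifiers whole instead of letting them shatter into generic sub-word pieces. The sketch below is an assumption about the general idea, not Griffin's tokeniser. The patterns and the function name are hypothetical, and a real vocabulary extension may well cover identifier prefixes rather than whole IDs.

```python
import re

# Hypothetical patterns for identifiers worth keeping atomic.
ATOMIC = re.compile(
    r"CVE-\d{4}-\d{4,}"                  # CVE IDs
    r"|CWE-\d+"                          # CWE IDs
    r"|pkg:[a-z]+/[^\s@]+@[\w.+-]+"      # package coordinates with a version
)


def pre_tokenise(text: str) -> list:
    """Split on whitespace, but emit each security identifier as one piece."""
    tokens = []
    pos = 0
    for match in ATOMIC.finditer(text):
        tokens.extend(text[pos:match.start()].split())  # ordinary words
        tokens.append(match.group())                    # identifier kept whole
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens


print(pre_tokenise("CWE-502 via pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10"))
# ['CWE-502', 'via', 'pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10']
```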

Q4 2024

Griffin M (32B) and the structured trace contract.

Introduced the structured reasoning trace as a first-class output (HYPOTHESIS / CITED PATH / DISPROOF / PROPOSED PATCH). Eval methodology shifted from "did the model find the bug" to "did the model find the bug and refute its own hypothesis under sanitiser-aware constraints." This is what later became the disproof pass.

Q2 2025

Griffin L (70B) — default production tier.

The 70B variant became the default for production-grade reasoning workloads. Long-context attention added: sliding-window plus landmark, taking usable context from 32k to 128k. Distillation experiments started in parallel — early Lion prototypes derived from Griffin L's reasoning traces.

Q4 2025

Griffin Zero (671B-MoE) — sovereign tier.

Mixture-of-experts variant introduced for sovereign and air-gapped deployments. Eight experts, top-2 routing, ~5.5% of parameters activated per token — Zero reaches 671B parameters at the inference cost of a ~37B dense model. Context extended to 256k usable via retrieval gates that pre-rank call-graph chunks before attention. First Sovereign-tier customers onboarded on internal pilots.
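
As a rough illustration of top-2 routing, the sketch below picks the two highest-scoring experts for a token and renormalises their gate weights with a softmax over just those two. The gate logits are made up and the function name is hypothetical; Griffin's actual router is not documented here. The last lines reproduce the ~5.5% figure from the numbers quoted above (~37B active out of 671B).

```python
import math


def top2_route(gate_logits: list) -> list:
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top2 = ranked[:2]
    peak = max(gate_logits[i] for i in top2)
    exps = [math.exp(gate_logits[i] - peak) for i in top2]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


# Illustrative gate logits for one token across eight experts.
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
routes = top2_route(logits)
print([(i, round(w, 3)) for i, w in routes])  # [(1, 0.622), (4, 0.378)]

# Active-parameter fraction implied by ~37B active out of 671B total.
active_fraction = 37 / 671
print(f"{active_fraction:.1%}")  # 5.5%
```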

Q2 2026

Griffin Zero general availability.

Zero made generally available for Sovereign and Air-Gapped tiers in May 2026. Multi-GPU sizing documented from 11x H100 (Growth) to 22x H100 multi-AZ (Mature). Cross-package taint-path precision improved by 12 points over Griffin L on the internal evaluation set. Adversarial disproof pass moved to parallel decoding, reducing end-to-end p95 latency to roughly 12s.

Now

Current research direction.

Three concurrent research tracks: (1) on-device distillation of larger reasoning traces into Lion-class students, with the goal of pushing more reasoning depth into the IDE without breaking the sub-100ms latency budget; (2) adversarial training against prompt-injection attacks observed in real MCP-server traffic; (3) longer-horizon agentic workflows for coordinated disclosure, where Griffin Zero proposes upstream patches, runs them through the maintainer's project test suite, and drafts the disclosure thread.

Release pipeline

How a Griffin release actually ships.

Every variant — Lite through Zero — moves through the same six gates. Each one can block the release on its own.

Curation pass

The corpus is filtered against the security-only criteria (no general web crawl, no LLM-generated text, no customer code) and deduplicated, and the held-out eval set is rotated.

Pretraining + security RLHF

Base pretraining on the curated corpus, followed by RLHF where the preference data is labelled by senior offensive-security engineers, not crowdworkers.

Adversarial red team

Internal red team runs prompt-injection, jailbreak, and refusal-rate suites against every checkpoint. A checkpoint that regresses the adversarial scores does not ship, regardless of capability gains.

Eval gate + cited-trace audit

Quantitative eval against the held-out CVE / taint-path / patch suites, plus a manual audit of 300 reasoning traces by the engineering team. Trace-quality regressions block release the same way capability regressions do.

Staged rollout

Each variant ships to the shared-cloud tier first (lowest blast radius), then to the dedicated-cluster and VPC-isolated tiers, then to sovereign. Each tier has a 14-day soak period.

Post-release telemetry

Anonymised, aggregated usage and trace-quality metrics feed back into the next curation pass. Customer code never enters the loop.
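
The "each gate can block on its own" rule can be sketched as a chain of independent checks, where the first failing gate stops the release no matter how well the checkpoint scores elsewhere. Only two of the six gates are modelled here. The metric names and numbers are illustrative assumptions, not measured results.

```python
from typing import Optional


def adversarial_gate(candidate: dict, baseline: dict) -> bool:
    """Block any checkpoint that regresses the adversarial score."""
    return candidate["adversarial"] >= baseline["adversarial"]


def eval_gate(candidate: dict, baseline: dict) -> bool:
    """Capability and trace-quality regressions block in the same way."""
    return (
        candidate["capability"] >= baseline["capability"]
        and candidate["trace_quality"] >= baseline["trace_quality"]
    )


GATES = [
    ("Adversarial red team", adversarial_gate),
    ("Eval gate + cited-trace audit", eval_gate),
]


def first_blocking_gate(candidate: dict, baseline: dict) -> Optional[str]:
    """Return the name of the first gate that blocks, or None if all pass."""
    for name, gate in GATES:
        if not gate(candidate, baseline):
            return name
    return None


# Illustrative numbers: capability improves, adversarial resistance slips.
baseline = {"adversarial": 0.97, "capability": 0.80, "trace_quality": 0.90}
candidate = {"adversarial": 0.95, "capability": 0.85, "trace_quality": 0.92}

print(first_blocking_gate(candidate, baseline))  # Adversarial red team
```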

Put Griffin on your hardest path.

Pick the variant that fits your budget and watch it reason through your real call graph, not a benchmark.