How we built it. The research arc behind Griffin, Eagle, and Lion.
This is the public-facing engineering changelog of the model lineup. Each family page covers one model in depth; this page is the unified narrative: what shipped, what changed, and what we learned, in chronological order, across all three families. No marketing arc, no roadmap promises. Just the research log.
Different layers, one corpus.
The three families live in different layers of the stack — inline on the developer machine, batched across the repo, and reasoning per finding in the cloud. They share a corpus, a tokeniser, and a trace contract. Each family page is the deep dive; this page tells you how they grew up together.
Major shipping events, oldest first.
One vertical narrative, three colour-coded families. Indigo for Griffin, violet for Eagle, purple for Lion. Each entry is the release event, not the marketing announcement.
Aegis-0 prototype (Griffin lineage)
A 700M-parameter prototype trained on CVE descriptions, exploit write-ups, and patched diffs. Eval against 4,200 disclosed CVEs showed a 28-point F1 lift over a general-purpose code baseline. Enough signal to justify scaling — the prototype told us the corpus mattered more than the parameter count, which became the founding assumption for everything that followed.
Eagle prototype as a ranking head
Eagle began as a ranking head bolted on top of an early Griffin checkpoint, not a standalone model. The motivation was economic: triage the tens of thousands of candidate taint paths a real repo generates before Griffin spent any reasoning budget on them. The ranking-head experiment beat keyword filters by a wide enough margin to justify pulling Eagle out as its own training track.
Griffin S (14B) — first production-grade variant
Griffin S was the first variant we were willing to put in front of customers. It introduced the security-augmented tokeniser — roughly 28k new tokens covering CWE and CVE identifiers, taint operators, and package coordinates. Adversarial prompt-injection rate dropped from 42% on the baseline to 6% on Griffin S. The tokeniser, not the parameter count, was the unlock.
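To illustrate why identifier-aware tokens help, here is a minimal sketch of a pre-tokenisation step that keeps CVE and CWE identifiers atomic instead of letting them shatter into digits and hyphens. The regular expression and the `pre_tokenise` function are hypothetical illustrations, not the production tokeniser, and they cover only two of the token classes described above.

```python
import re

# Hypothetical patterns: identifiers that should survive as single tokens.
SECURITY_TOKEN = re.compile(r"CVE-\d{4}-\d{4,}|CWE-\d+")
# Fallback splitter: runs of word characters, or single punctuation marks.
GENERIC_TOKEN = re.compile(r"\w+|[^\w\s]")


def pre_tokenise(text: str) -> list[str]:
    """Split text generically, but emit CVE/CWE identifiers as one token each."""
    tokens: list[str] = []
    pos = 0
    for match in SECURITY_TOKEN.finditer(text):
        # Tokenise the ordinary text before the identifier.
        tokens.extend(GENERIC_TOKEN.findall(text[pos:match.start()]))
        # Keep the identifier whole.
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(GENERIC_TOKEN.findall(text[pos:]))
    return tokens
```

A generic splitter would turn `CVE-2021-44228` into five pieces; here it stays one token, so the model can attach meaning to the identifier as a unit.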
Eagle as a standalone 13B dense model
Eagle split out into its own training track as a 13B dense transformer. Trained on roughly 800k labelled source/sink pairs annotated by senior security engineers, with attention biased toward dataflow tokens — sources, sinks, sanitiser operators. We measured a clear lift on cross-package taint-path recall versus generic code baselines at the same parameter count.
Griffin M (32B) + structured trace contract
Griffin M more than doubled the parameter count (14B to 32B) and introduced the structured reasoning trace as a first-class output: HYPOTHESIS, CITED PATH, DISPROOF, PROPOSED PATCH. The eval methodology pivoted at the same time: we stopped grading on "is there a bug?" and started grading on "find the bug, then try to refute your own hypothesis under sanitiser-aware constraints." That contract is still the shape of every Griffin output today.
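The four-section contract can be pictured as a small typed record. This is a sketch only: the section names come from the text above, while the `ReasoningTrace` class, its field types, and the rendering format are assumptions made for illustration.

```python
from dataclasses import dataclass

# Section names as listed in the trace contract.
SECTIONS = ("HYPOTHESIS", "CITED PATH", "DISPROOF", "PROPOSED PATCH")


@dataclass(frozen=True)
class ReasoningTrace:
    """Hypothetical container for one structured trace."""
    hypothesis: str
    cited_path: tuple[str, ...]  # assumed: ordered code locations
    disproof: str
    proposed_patch: str

    def _bodies(self) -> tuple[str, ...]:
        return (self.hypothesis, " -> ".join(self.cited_path),
                self.disproof, self.proposed_patch)

    def render(self) -> str:
        """Emit the trace as one 'SECTION: body' line per section."""
        return "\n".join(f"{n}: {b}" for n, b in zip(SECTIONS, self._bodies()))

    def missing_sections(self) -> list[str]:
        """Names of sections that are empty, in contract order."""
        return [n for n, b in zip(SECTIONS, self._bodies()) if not b.strip()]
```

A grader built on a record like this can reject any output whose `missing_sections()` is non-empty before it scores the content.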
Eagle clustering head + dedup
We added a clustering head on top of Eagle's ranking output. Median finding count per repo dropped roughly 40% with no measurable recall loss — duplicates and near-duplicates that used to clog reviewer queues collapsed into single representative entries. This is the moment Eagle became economically load-bearing: it lets Griffin reason about a curated queue, not a firehose.
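The effect of deduplication can be approximated with a much simpler technique than a learned clustering head. The sketch below uses greedy clustering by Jaccard similarity over the set of nodes in each finding's path; the similarity measure, the threshold, and the function names are assumptions chosen for illustration, not a description of Eagle's head.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two node sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_findings(findings: list[set[str]], threshold: float = 0.75) -> list[list[int]]:
    """Greedily group findings; each cluster's first member is its representative."""
    clusters: list[list[int]] = []
    for idx, finding in enumerate(findings):
        for cluster in clusters:
            representative = findings[cluster[0]]
            if jaccard(finding, representative) >= threshold:
                cluster.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([idx])
    return clusters
```

Reviewers then see one entry per cluster (the representative), with the remaining members attached as duplicates.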
Griffin L (70B) — default production tier
The 70B variant became the default production tier. Long-context attention was added in two pieces: sliding-window for local coherence and landmark tokens for cross-section retrieval. Usable context moved from 32k to 128k, which is the first point where a multi-hop cross-package taint chain fits in one window without aggressive chunking.
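As a rough picture of how the two attention pieces combine, the sketch below builds a boolean mask in which each query position sees a local window of recent tokens plus a set of landmark positions anywhere earlier in the sequence. It is a simplification: the causal constraint, the fixed landmark positions, and the `attention_mask` function are assumptions, and the production mechanism may select landmarks quite differently.

```python
def attention_mask(seq_len: int, window: int, landmarks: set[int]) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q              # assumed: no attention to future tokens
            local = q - k < window       # sliding window for local coherence
            landmark = k in landmarks    # landmark tokens for long-range retrieval
            row.append(causal and (local or landmark))
        mask.append(row)
    return mask
```

With `window` fixed, the number of attended positions per query grows with the number of landmarks rather than with the full sequence length, which is what makes long contexts affordable.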
First Lion prototype distilled from Griffin S
The first Lion prototype was a plain label distillation from Griffin S into a 1B student. It ran at roughly 140 ms p95 on Apple Silicon, close to the latency target but not under it. Accuracy was acceptable, but trace quality was clearly missing. The prototype made the problem plain: label-only distillation gives you a fast guesser, not a fast reasoner.
Trace distillation pipeline for Lion
We rebuilt the distillation pipeline to supervise the student on both (input, final label) and (input, intermediate reasoning trace). The student learned to mimic Griffin's chain, not just its verdict. This is what gave Lion its accuracy at sub-100 ms — the trace turned out to be a stronger learning signal than the label, especially on sanitiser-aware negative examples.
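The two supervision signals can be combined into a single objective. The sketch below adds a cross-entropy term on the final label to an averaged cross-entropy term over the trace tokens; the weighting scheme, the `trace_weight` parameter, and the function names are assumptions for illustration, not the pipeline's actual loss.

```python
import math


def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-likelihood of the target class under the student's distribution."""
    return -math.log(probs[target])


def distillation_loss(label_probs: list[float], label_target: int,
                      trace_probs: list[list[float]], trace_targets: list[int],
                      trace_weight: float = 1.0) -> float:
    """Label loss plus a weighted, length-normalised loss over the teacher's trace tokens."""
    label_loss = cross_entropy(label_probs, label_target)
    if not trace_targets:
        # Label-only distillation: no trace supervision available.
        return label_loss
    trace_loss = sum(cross_entropy(p, t)
                     for p, t in zip(trace_probs, trace_targets)) / len(trace_targets)
    return label_loss + trace_weight * trace_loss
```

Setting `trace_weight` to zero recovers label-only distillation, which makes the two regimes easy to compare under the same harness.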
Eagle INT8 quantisation
Eagle's weights were quantised to INT8 with a per-channel calibration pass. p95 sweep latency on a representative 5,000-package monorepo dropped to roughly 510 ms. Recall regression on the held-out eval was under one percentage point. The quantised path is what we now run in any tier that pays per scan.
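For readers unfamiliar with per-channel quantisation, here is a minimal sketch of the symmetric absmax variant: each output channel gets its own scale so that its largest weight maps to 127. The scheme and the function names are assumptions; the calibration pass described above may choose scales differently.

```python
def quantise_per_channel(weights: list[list[float]]) -> tuple[list[list[int]], list[float]]:
    """Quantise each row (channel) to INT8 with its own symmetric scale."""
    quantised, scales = [], []
    for channel in weights:
        max_abs = max(abs(w) for w in channel)
        # An all-zero channel gets a dummy scale to avoid dividing by zero.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        # Round to nearest integer and clamp to the symmetric INT8 range.
        quantised.append([max(-127, min(127, round(w / scale))) for w in channel])
        scales.append(scale)
    return quantised, scales


def dequantise(quantised: list[list[int]], scales: list[float]) -> list[list[float]]:
    """Recover approximate float weights from INT8 values and per-channel scales."""
    return [[q * s for q in row] for row, s in zip(quantised, scales)]
```

Because the scale is chosen per channel, a channel with small weights keeps its resolution even when another channel in the same matrix has large ones; the reconstruction error per weight is bounded by half that channel's scale.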
Lion 1.0 GA in VS Code
Lion 1.0 shipped distilled from Griffin L (70B), not Griffin S. Sub-100 ms p95 on a developer laptop, sink-detection F1 above 0.78 on the held-out evaluation set. The on-device, no-egress posture was contractual from day one — the IDE extension can run with the network disabled and Lion still works.
Griffin Zero (671B-MoE) introduced
Griffin Zero introduced a mixture-of-experts variant for sovereign and air-gapped deployments. Eight experts, top-2 routing, roughly 5.5% of parameters activated per token. Usable context extended to 256k through retrieval gates that page in the slice of the call graph relevant to the current hypothesis. Internal pilots only at first: Zero was not released broadly until the eval suite was stable enough to certify it.
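Top-2 routing itself is a small operation. The sketch below picks the two experts with the highest gate logits for a token and renormalises their weights with a softmax over just those two; the gate logits are made-up inputs and `top2_route` is a hypothetical name, so treat it as a generic illustration of the technique rather than Griffin Zero's router.

```python
import math


def top2_route(gate_logits: list[float]) -> list[tuple[int, float]]:
    """Return (expert_index, weight) for the two highest-scoring experts."""
    # Indices of the two largest logits, best first.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:2]
    # Softmax restricted to the selected experts, shifted for numerical stability.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]
```

The token's output is then the weighted sum of the two selected experts' outputs; the remaining experts do no work for that token.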
Eagle ranking head v2
The ranking head was retrained against Griffin's disproof outcomes — Eagle now learns from which of its candidates Griffin actually refuted. Top-5 candidate-path recall climbed to 94%. Each candidate ships with a confidence score so Griffin routes only above-threshold candidates by default, which cut wasted reasoning budget on low-confidence triage.
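Threshold routing of this kind can be sketched in a few lines. The `Candidate` record, the default threshold of 0.5, and the `route_to_griffin` function are hypothetical; the source states only that candidates carry a confidence score and that above-threshold candidates are routed by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate taint path with its ranking confidence."""
    path_id: str
    confidence: float


def route_to_griffin(candidates: list[Candidate], threshold: float = 0.5) -> list[Candidate]:
    """Keep candidates at or above the threshold, most confident first."""
    kept = [c for c in candidates if c.confidence >= threshold]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)
```

Everything below the threshold stays in the queue without consuming reasoning budget; lowering the threshold trades cost for coverage.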
Lion signed weights + JetBrains + Cursor
Lion weights ship as sigstore-signed bundles. The IDE extension verifies the signature on install and refuses to load unsigned or mismatched weights — a small but contractual guarantee that the model on the developer machine is the model that passed eval. JetBrains and Cursor reached parity with the VS Code extension in the same release.
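The refuse-to-load behaviour can be illustrated with a much weaker stand-in. The sketch below compares a SHA-256 digest of the weight bytes against an expected value and raises if they differ or if no expected value is supplied. This is not sigstore verification (which also checks a signature and signer identity); the digest check, the exception type, and `load_weights` are assumptions used only to show the fail-closed control flow.

```python
import hashlib
import hmac
from typing import Optional


class UntrustedWeightsError(Exception):
    """Raised when weights cannot be tied to a trusted digest."""


def load_weights(blob: bytes, expected_sha256: Optional[str]) -> bytes:
    """Return the weight bytes only if their digest matches the expected one."""
    if expected_sha256 is None:
        # Fail closed: weights without a trusted reference are never loaded.
        raise UntrustedWeightsError("no trusted digest supplied")
    actual = hashlib.sha256(blob).hexdigest()
    # Constant-time comparison of the two hex digests.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise UntrustedWeightsError("digest mismatch")
    return blob
```

The important property is that both failure modes, missing and mismatched, take the same path: the model does not load.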
Aegis architecture documented publicly
The reasoning architecture inside every Griffin variant — sliding-window plus landmark attention, the security-augmented tokeniser, the structured trace, mixture-of-experts in the largest tier — was published as a standalone architecture page. The intent was to remove ambiguity about what is actually inside the model when an enterprise asks for an architecture review.
Griffin Zero general availability
Griffin Zero became generally available for Sovereign and Air-Gapped tiers. Multi-GPU sizing is documented from 11x H100 (Growth) to 22x H100 multi-AZ (Mature). Cross-package taint-path precision improved 12 points over Griffin L on the internal eval suite. The adversarial disproof pass moved to parallel decoding, which is what made the 256k window economical at scale.
What we're working on right now.
One open track per family. These are descriptions of active research, not roadmap commitments — the items below ship when they pass eval, not when a quarter ends.
Longer-horizon agentic workflows
Zero proposes upstream patches, runs them through the maintainer's test suite, and drafts the coordinated disclosure thread. Parallel track: adversarial training against real prompt-injection traffic observed in MCP-server logs (anonymised, aggregated, never per-tenant).
Cross-language taint awareness
Polyglot repos lose recall when taint flows cross a language boundary — JS calling a Python service calling a Go binary. The current track teaches Eagle a unified dataflow grammar across languages, plus a feedback loop from Griffin's disproof pass so refuted candidates fold back into the next training run.
Language-specific student heads
Language-specific heads (JVM, Python, Go) with shared base weights and task-specific fine-tunes. The motivation is greater reasoning depth on language-particular sink patterns without breaking the sub-100 ms latency budget on a developer laptop.
Six steps from curation to general availability.
Every variant — Lion, Eagle, every Griffin tier — passes the same pipeline. No model ships because a date arrived; every model ships because every gate cleared.
Curation pass
The corpus is filtered against security-only criteria and deduplicated against the previous release, and the held-out eval set is rotated so the model has not seen the new evaluation prompts.
Pretraining + security RLHF
Preference data labelled by senior offensive-security engineers, not crowdworkers. The reward model penalises plausible-sounding hallucinations on CWE classification and treats unverified reachability claims as failures.
Adversarial red team
Prompt-injection, jailbreak, and refusal-rate suites are run against every checkpoint. Any regression on a previous-quarter test case blocks ship until the regression is explained or fixed.
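The blocking rule can be expressed as a comparison between two result sets. In this sketch a test case blocks the release if it passed previously, does not pass now, and has not been marked as explained; the dictionary layout and the function names are assumptions for illustration.

```python
from typing import AbstractSet


def blocking_regressions(previous: dict[str, bool], current: dict[str, bool],
                         explained: AbstractSet[str] = frozenset()) -> list[str]:
    """Test cases that passed before, fail (or are missing) now, and are unexplained."""
    return sorted(
        case for case, passed in previous.items()
        if passed and not current.get(case, False) and case not in explained
    )


def may_ship(previous: dict[str, bool], current: dict[str, bool],
             explained: AbstractSet[str] = frozenset()) -> bool:
    """A checkpoint may ship only when no unexplained regression remains."""
    return not blocking_regressions(previous, current, explained)
```

Treating a missing case as a failure is a deliberate choice in this sketch: silently dropping a test should not be a way to clear the gate.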
Eval gate + cited-trace audit
Quantitative eval is necessary but not sufficient. The engineering team manually audits 300 reasoning traces per release — the trace has to read like a defender wrote it, not like a model hallucinated one.
Staged rollout
Shared cloud first, then dedicated cluster, then VPC-isolated, then sovereign. Each tier gets a 14-day soak window with telemetry on refusal rate, latency, and finding precision before the next tier opens.
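The soak windows imply a minimum timeline from first exposure to the last tier. The sketch below computes the earliest date each tier can open, assuming every soak clears on its first attempt and the next tier opens the day the window ends; `rollout_schedule` and the constants are illustrative names, with the tier order and the 14-day figure taken from the text.

```python
from datetime import date, timedelta

# Tier order and soak length as described in the staged-rollout step.
TIERS = ("shared cloud", "dedicated cluster", "VPC-isolated", "sovereign")
SOAK_DAYS = 14


def rollout_schedule(start: date) -> list[tuple[str, date]]:
    """Earliest opening date per tier, assuming no soak window is repeated."""
    return [(tier, start + timedelta(days=SOAK_DAYS * i))
            for i, tier in enumerate(TIERS)]
```

Under those assumptions the sovereign tier opens no sooner than 42 days after the shared-cloud release; any failed soak pushes every later tier back.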
Post-release telemetry
Anonymised, aggregated metrics feed the next curation pass. Customer code never enters the loop — the telemetry is shape-level (counts, latencies, refusal categories), never content-level.
Open about the method, closed about the data.
Published
- Model cards for every shipped variant, with capability and limitation notes.
- Eval results on the internal suite, with the methodology reviewable under NDA.
- Engineering blog posts when an architecture change ships.
- The release changelog — what changed, when, and why.
- This research page itself, updated alongside the changelog.
Not published
- The training corpus itself, beyond the categorical description.
- Internal customer telemetry of any kind.
- Customer findings, whether individual or aggregated.
- The held-out eval set — publishing it would contaminate the next eval.
- Per-tenant configuration, prompts, or KV-cache state.
Where to go next.
Model family overview
Three models, one corpus. The parent page covering Griffin, Eagle, and Lion side by side.
Aegis architecture
The reasoning architecture inside every Griffin variant — tokeniser, attention, trace contract.
The security corpus
What is and is not in the training data, and why the corpus matters more than the parameter count.
Model distillation
How Lion is distilled from Griffin without losing the reasoning trace, and why label-only distillation is not enough.
See the lineup on your code.
Lion at the commit. Eagle across the repo. Griffin proving the survivors. The same models the research log describes — running on your real codebase.