This is the public-facing engineering changelog of the model lineup. The individual family pages cover one model in depth; this page is the unified narrative — what shipped, what changed, what we learned, in chronological order, across all three families. No marketing arc, no roadmap promises. Just the research log.
The three families live in different layers of the stack — inline on the developer machine, batched across the repo, and reasoning per finding in the cloud. They share a corpus, a tokeniser, and a trace contract. Each family page is the deep dive; this page tells you how they grew up together.
One vertical narrative, three colour-coded families. Indigo for Griffin, violet for Eagle, purple for Lino. Each entry is the release event, not the marketing announcement.
A 700M-parameter prototype trained on CVE descriptions, exploit write-ups, and patched diffs. Eval against 4,200 disclosed CVEs showed a 28-point F1 lift over a general-purpose code baseline. Enough signal to justify scaling — the prototype told us the corpus mattered more than the parameter count, which became the founding assumption for everything that followed.
Eagle began as a ranking head bolted on top of an early Griffin checkpoint, not a standalone model. The motivation was economic: triage the tens of thousands of candidate taint paths a real repo generates before Griffin spent any reasoning budget on them. The ranking-head experiment beat keyword filters by a wide enough margin to justify pulling Eagle out as its own training track.
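A minimal sketch of the ranking-head idea: a small scoring module over a frozen encoder's pooled output. Every dimension and name below is assumed for illustration, not Eagle's actual architecture.

```python
import torch
import torch.nn as nn

class TaintPathRankingHead(nn.Module):
    """Hypothetical ranking head: scores one candidate taint path
    from the pooled hidden state of a frozen code encoder."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),  # one relevance score per path
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (n_candidates, hidden_dim), one vector per candidate path
        return self.score(pooled).squeeze(-1)

# Score tens of thousands of candidates, keep a short list for reasoning.
head = TaintPathRankingHead()
pooled = torch.randn(10_000, 1024)  # stand-in for encoder output
shortlist = torch.topk(head(pooled), k=200).indices
```

The point of the design is the asymmetry: the head is cheap enough to score every candidate, so the expensive reasoning model only ever sees the shortlist.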
Griffin S was the first variant we were willing to put in front of customers. It introduced the security-augmented tokeniser — roughly 28k new tokens covering CWE and CVE identifiers, taint operators, and package coordinates. The adversarial prompt-injection success rate dropped from 42% on the baseline to 6% on Griffin S. The tokeniser, not the parameter count, was the unlock.
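Mechanically, the augmentation looks something like the following, sketched with the Hugging Face tokenizer API as a stand-in. The base checkpoint and token strings are illustrative; the real ~28k additions are generated from CWE/CVE registries and package indexes, not hand-listed.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in base tokeniser

security_tokens = [
    "CWE-79", "CWE-89", "CVE-2021-44228",            # weakness / vuln IDs
    "<taint:source>", "<taint:sink>", "<sanitize>",  # taint operators
    "pkg:npm/lodash", "pkg:pypi/requests",           # package coordinates
]
added = tok.add_tokens(security_tokens)

# The model's embedding matrix must grow to match the new vocabulary:
# model.resize_token_embeddings(len(tok))
```

Keeping an identifier like CWE-89 as one token instead of five byte-pair fragments lets the model treat it as a symbol rather than a spelling.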
Eagle split out into its own training track as a 13B dense transformer. Trained on roughly 800k source/sink pairs annotated by senior security engineers, with attention biased toward dataflow tokens — sources, sinks, and sanitiser operators. We measured a clear lift on cross-package taint-path recall versus generic code baselines at the same parameter count.
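One concrete reading of attention biased toward dataflow tokens is an additive bonus on pre-softmax attention scores at positions marked as sources, sinks, or sanitisers. The shapes and bias magnitude below are assumptions, not Eagle's actual mechanism.

```python
import torch

def bias_toward_dataflow(attn_scores: torch.Tensor,
                         dataflow_mask: torch.Tensor,
                         bias: float = 2.0) -> torch.Tensor:
    """attn_scores: (batch, heads, q_len, k_len) pre-softmax logits.
    dataflow_mask: (k_len,) bool, True at source/sink/sanitiser positions.
    Adds a fixed bonus so every query attends more to dataflow tokens."""
    return attn_scores + bias * dataflow_mask[None, None, None, :].float()

scores = torch.randn(2, 8, 128, 128)
mask = torch.zeros(128, dtype=torch.bool)
mask[[5, 42, 97]] = True  # a source, a sink, a sanitiser
biased = bias_toward_dataflow(scores, mask)
```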
Griffin M doubled parameters and introduced the structured reasoning trace as a first-class output: HYPOTHESIS, CITED PATH, DISPROOF, PROPOSED PATCH. The eval methodology pivoted at the same time — we stopped grading on "is there a bug?" and started grading on "find the bug and refute its own hypothesis under sanitiser-aware constraints." That contract is still the shape of every Griffin output today.
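The contract is easiest to see as a schema. A hypothetical rendering, with field names and types that are illustrative rather than the wire format:

```python
from dataclasses import dataclass

@dataclass
class GriffinTrace:
    hypothesis: str                    # suspected vulnerability, CWE-classified
    cited_path: list[str]              # source -> ... -> sink, file:line hops
    disproof: str                      # the model's attempt to refute itself
    proposed_patch: str | None = None  # present only when the disproof fails

trace = GriffinTrace(
    hypothesis="CWE-89: request parameter reaches SQL execution unsanitised",
    cited_path=["api/routes.py:42", "db/query.py:17"],
    disproof="Checked escape_sql() on the path: not applied on this branch.",
    proposed_patch="Bind the parameter via a placeholder instead of f-string.",
)
```

Grading against a structure like this is what makes refute-your-own-hypothesis testable: an empty or hand-wavy disproof field fails the eval regardless of the verdict.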
We added a clustering head on top of Eagle's ranking output. Median finding count per repo dropped roughly 40% with no measurable recall loss — duplicates and near-duplicates that used to clog reviewer queues collapsed into single representative entries. This is the moment Eagle became economically load-bearing: it lets Griffin reason about a curated queue, not a firehose.
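A sketch of the collapse step, using off-the-shelf agglomerative clustering as a stand-in for the actual head. The embeddings are random placeholders and the distance threshold is an assumed illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

findings = np.random.rand(500, 256)  # stand-in: one vector per raw finding

clusters = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.15,  # assumed; the real cutoff is tuned on labels
    metric="cosine",
    linkage="average",
).fit_predict(findings)

# One representative finding per cluster enters the reviewer queue.
representatives = {c: int(np.flatnonzero(clusters == c)[0])
                   for c in set(clusters)}
```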
The 70B variant became the default production tier. Long-context attention was added in two pieces: sliding-window for local coherence and landmark tokens for cross-section retrieval. Usable context moved from 32k to 128k, which is the first point where a multi-hop cross-package taint chain fits in one window without aggressive chunking.
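A toy version of the combined pattern, assuming a boolean mask convention (True = may attend). Window size and landmark spacing are illustrative.

```python
import torch

def sw_landmark_mask(seq_len: int, window: int,
                     landmarks: list[int]) -> torch.Tensor:
    """Each token sees a local sliding window plus a set of globally
    visible landmark positions that carry cross-section summaries."""
    i = torch.arange(seq_len)
    mask = (i[:, None] - i[None, :]).abs() <= window  # local band
    mask[:, landmarks] = True  # everyone can read the landmarks
    mask[landmarks, :] = True  # landmarks can read everything
    return mask

# e.g. one landmark per 512-token section of the input
mask = sw_landmark_mask(seq_len=4096, window=128,
                        landmarks=list(range(0, 4096, 512)))
```

The sliding window keeps cost near-linear in sequence length; the landmarks are the escape hatch that lets a sink on page 40 retrieve a source on page 2.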
The first Lino prototype was a plain label distillation from Griffin S into a 1B student. It ran at roughly 140 ms p95 on Apple Silicon — close to the latency target but not under it. Accuracy was acceptable; trace quality clearly was not. The prototype was honest about the problem: label-only distillation gives you a fast guesser, not a fast reasoner.
We rebuilt the distillation pipeline to supervise the student on both (input, final label) and (input, intermediate reasoning trace). The student learned to mimic Griffin's chain, not just its verdict. This is what gave Lino its accuracy at sub-100 ms — the trace turned out to be a stronger learning signal than the label, especially on sanitiser-aware negative examples.
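The objective, sketched with assumed tensor shapes: soft teacher labels for the verdict, the teacher's trace tokens as hard targets, an illustrative mixing weight.

```python
import torch.nn.functional as F

def distill_loss(label_logits, teacher_label_probs,
                 trace_logits, trace_targets, alpha: float = 0.5):
    """label_logits: (batch, n_classes) student verdict.
    trace_logits: (batch, seq, vocab) student trace tokens.
    trace_targets: (batch, seq) teacher trace token ids."""
    # Match the teacher's verdict distribution (soft labels).
    verdict_loss = F.kl_div(F.log_softmax(label_logits, dim=-1),
                            teacher_label_probs, reduction="batchmean")
    # Imitate the teacher's reasoning token by token.
    trace_loss = F.cross_entropy(trace_logits.flatten(0, 1),
                                 trace_targets.flatten())
    return alpha * verdict_loss + (1 - alpha) * trace_loss
```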
Eagle's weights were quantised to INT8 with a per-channel calibration pass. p95 sweep latency on a representative 5,000-package monorepo dropped to roughly 510 ms. Recall regression on the held-out eval was under one percentage point. The quantised path is what we now run in any tier that pays per scan.
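The weight-side arithmetic of per-channel INT8, as a sketch. The calibration pass over activation statistics is what the production pipeline adds on top; it is not shown here.

```python
import torch

def quantize_per_channel(w: torch.Tensor):
    """Symmetric INT8 quantisation with one scale per output channel,
    rather than one scale for the whole tensor."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_per_channel(w)
max_error = (q.float() * scale - w).abs().max()  # what calibration bounds
```

Per-channel scales matter because weight magnitudes vary widely across output channels; a single per-tensor scale wastes most of the 8-bit range on the quietest rows.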
Lino 1.0 shipped distilled from Griffin L (70B), not Griffin S. Sub-100 ms p95 on a developer laptop, sink-detection F1 above 0.78 on the held-out evaluation set. The on-device, no-egress posture was contractual from day one — the IDE extension can run with the network disabled and Lino still works.
Griffin Zero introduced a mixture-of-experts variant for sovereign and air-gapped deployments. Eight experts, top-2 routing, roughly 5.5% of parameters activated per token. Usable context extended to 256k through retrieval gates that page in the right slice of the call graph around the hypothesis. Internal pilots only at first — Zero was not released broadly until the eval suite was stable enough to certify it.
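A minimal top-2 routing loop, assuming eight feed-forward experts behind a linear router. Real implementations fuse this with capacity limits and load-balancing losses.

```python
import torch
import torch.nn.functional as F

def top2_route(x, router, experts):
    """x: (tokens, dim). Each token goes to its two highest-scoring
    experts; outputs are blended by the renormalised gate weights."""
    gates = F.softmax(router(x), dim=-1)                # (tokens, n_experts)
    weight, idx = gates.topk(2, dim=-1)                 # two experts per token
    weight = weight / weight.sum(dim=-1, keepdim=True)  # renormalise the pair
    out = torch.zeros_like(x)
    for k in range(2):
        for e, expert in enumerate(experts):
            sel = idx[:, k] == e
            if sel.any():
                out[sel] += weight[sel, k, None] * expert(x[sel])
    return out

experts = [torch.nn.Linear(512, 512) for _ in range(8)]
router = torch.nn.Linear(512, 8)
y = top2_route(torch.randn(16, 512), router, experts)
```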
The ranking head was retrained against Griffin's disproof outcomes — Eagle now learns from which of its candidates Griffin actually refuted. Top-5 candidate-path recall climbed to 94%. Each candidate ships with a confidence score so Griffin routes only above-threshold candidates by default, which cut wasted reasoning budget on low-confidence triage.
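A sketch of both halves of that loop, with hypothetical record shapes and an assumed threshold value:

```python
ROUTE_THRESHOLD = 0.7  # assumed value, tuned per deployment in practice

def relabel_from_disproof(candidates, disproof_outcomes):
    """Fold Griffin's verdicts back into Eagle's training labels:
    a confirmed candidate is a positive, a refuted one a hard negative."""
    return [(c["features"],
             1.0 if disproof_outcomes[c["id"]] == "confirmed" else 0.0)
            for c in candidates]

def route_to_griffin(candidates):
    """Default routing: only above-threshold candidates spend
    Griffin reasoning budget; the rest are logged, not reasoned over."""
    return [c for c in candidates if c["confidence"] >= ROUTE_THRESHOLD]
```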
Lino weights ship as sigstore-signed bundles. The IDE extension verifies the signature on install and refuses to load unsigned or mismatched weights — a small but contractual guarantee that the model on the developer machine is the model that passed eval. JetBrains and Cursor reached parity with the VS Code extension in the same release.
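The shape of the refuse-to-load check, sketched against sigstore's cosign CLI. The shipped extension embeds its own verifier rather than shelling out, and the file names here are placeholders.

```python
import subprocess

def verify_bundle(bundle: str, signature: str, pubkey: str) -> bool:
    """Return True only if the weights bundle matches its signature."""
    result = subprocess.run(
        ["cosign", "verify-blob", "--key", pubkey,
         "--signature", signature, bundle],
        capture_output=True,
    )
    return result.returncode == 0

if not verify_bundle("lino-weights.bin", "lino-weights.sig", "release.pub"):
    raise RuntimeError("unsigned or mismatched weights: refusing to load")
```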
The reasoning architecture inside every Griffin variant — sliding-window plus landmark attention, the security-augmented tokeniser, the structured trace, mixture-of-experts in the largest tier — was published as a standalone architecture page. The intent was to remove ambiguity about what is actually inside the model when an enterprise asks for an architecture review.
Griffin Zero became generally available for Sovereign and Air-Gapped tiers. Multi-GPU sizing is documented from 11x H100 (Growth) to 22x H100 multi-AZ (Mature). Cross-package taint-path precision improved 12 points over Griffin L on the internal eval suite. The adversarial disproof pass moved to parallel decoding, which is what made the 256k window economical at scale.
One open track per family. These are descriptions of active research, not roadmap commitments — the items below ship when they pass eval, not when a quarter ends.
Zero proposes upstream patches, runs them through the maintainer's test suite, drafts the coordinated disclosure thread. Parallel track: adversarial training against real prompt-injection traffic observed in MCP-server logs (anonymised, aggregated, never per-tenant).
Polyglot repos lose recall when taint flows cross a language boundary — JS calling a Python service calling a Go binary. The current track teaches Eagle a unified dataflow grammar across languages, plus a feedback loop from Griffin's disproof pass so refuted candidates fold back into the next training run.
Language-specific Lino heads (JVM, Python, Go) with shared base weights and task-specific fine-tunes. The motivation is deeper reasoning on language-specific sink patterns without breaking the sub-100 ms latency budget on a developer laptop.
Every variant — Lino, Eagle, every Griffin tier — passes the same pipeline. No model ships because a date arrived; every model ships because every gate cleared.
The corpus is filtered against security-only criteria and deduplicated against the previous release, and the held-out eval set is rotated so the model has not seen the new evaluation prompts.
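For the dedup step, an exact-match sketch: normalise, hash, drop anything the previous corpus already contained. Near-duplicate detection is the harder part and is not shown.

```python
import hashlib

def dedup_against_previous(new_docs, previous_hashes: set):
    """Keep only documents whose normalised content hash is unseen."""
    kept = []
    for doc in new_docs:
        digest = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if digest not in previous_hashes:
            kept.append(doc)
            previous_hashes.add(digest)
    return kept
```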
Preference data labelled by senior offensive-security engineers, not crowdworkers. The reward model penalises plausible-sounding hallucinations on CWE classification and treats unverified reachability claims as failures.
Prompt-injection, jailbreak, and refusal-rate suites are run against every checkpoint. Any regression on a previous-quarter test case blocks ship until the regression is explained or fixed.
Quantitative eval is necessary but not sufficient. The engineering team manually audits 300 reasoning traces per release — the trace has to read like a defender wrote it, not like a model hallucinated one.
Shared cloud first, then dedicated cluster, then VPC-isolated, then sovereign. Each tier gets a 14-day soak window with telemetry on refusal rate, latency, and finding precision before the next tier opens.
Anonymised, aggregated metrics feed the next curation pass. Customer code never enters the loop — the telemetry is shape-level (counts, latencies, refusal categories), never content-level.
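What shape-level means concretely, rendered as a hypothetical record with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class ScanTelemetry:
    """Aggregate counts and latencies only: no file paths, no snippets,
    no identifiers from customer code ever leave the deployment."""
    findings_count: int
    refusal_counts: dict  # category -> count, e.g. {"policy": 3}
    p95_latency_ms: float
    model_version: str
```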
Three models, one corpus. The parent page covering Griffin, Eagle, and Lino side by side.
The reasoning architecture inside every Griffin variant — tokeniser, attention, trace contract.
What is and is not in the training data, and why the corpus matters more than the parameter count.
How Lino is distilled from Griffin without losing the reasoning trace, and why label-only distillation is not enough.
Lino at the commit. Eagle across the repo. Griffin proving the survivors. The same models the research log describes — running on your real codebase.