Research · Model Family Log

How we built it. The research arc behind Griffin, Eagle, and Lino.

This is the public-facing engineering changelog of the model lineup. The individual family pages cover one model in depth; this page is the unified narrative — what shipped, what changed, what we learned, in chronological order, across all three families. No marketing arc, no roadmap promises. Just the research log.

  • Docs in corpus: 11M+
  • Security tokens: ~28k
  • Griffin variants: 5
  • Customer code in training: 0
The three families

Different layers, one corpus.

The three families live in different layers of the stack — inline on the developer machine, batched across the repo, and reasoning per finding in the cloud. They share a corpus, a tokeniser, and a trace contract. Each family page is the deep dive; this page tells you how they grew up together.

Unified timeline

Major shipping events, oldest first.

One vertical narrative, three colour-coded families. Indigo for Griffin, violet for Eagle, purple for Lino. Each entry is the release event, not the marketing announcement.

Griffin
2023

Aegis-0 prototype (Griffin lineage)

A 700M-parameter prototype trained on CVE descriptions, exploit write-ups, and patched diffs. Eval against 4,200 disclosed CVEs showed a 28-point F1 lift over a general-purpose code baseline. Enough signal to justify scaling — the prototype told us the corpus mattered more than the parameter count, which became the founding assumption for everything that followed.

Eagle
Q1 2024

Eagle prototype as a ranking head

Eagle began as a ranking head bolted on top of an early Griffin checkpoint, not a standalone model. The motivation was economic: triage the tens of thousands of candidate taint paths a real repo generates before Griffin spent any reasoning budget on them. The ranking-head experiment beat keyword filters by a wide enough margin to justify pulling Eagle out as its own training track.

Griffin
Q2 2024

Griffin S (14B) — first production-grade variant

Griffin S was the first variant we were willing to put in front of customers. It introduced the security-augmented tokeniser — roughly 28k new tokens covering CWE and CVE identifiers, taint operators, and package coordinates. Adversarial prompt-injection rate dropped from 42% on the baseline to 6% on Griffin S. The tokeniser, not the parameter count, was the unlock.
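As a rough illustration of what the augmentation buys, here is a toy greedy tokeniser: an identifier like CWE-79 becomes one atomic token instead of fragmenting into generic subwords. The vocabularies and the matching scheme are invented for the sketch and are not the production tokeniser.

```python
# Toy sketch of a security-augmented vocabulary. BASE_VOCAB stands in for a
# generic code tokeniser; SECURITY_TOKENS stands in for the ~28k additions
# (CWE/CVE identifiers, taint operators, package coordinates).
BASE_VOCAB = {"CWE", "-", "79", "sanitize", "(", ")"}
SECURITY_TOKENS = {"CWE-79", "CVE-2021-44228", "TAINT_SOURCE", "TAINT_SINK"}

def tokenize(text, vocab, max_len=16):
    """Greedy longest-match tokenisation over a flat vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:                       # unknown character: emit it as-is
            tokens.append(text[i])
            i += 1
    return tokens

baseline  = tokenize("CWE-79", BASE_VOCAB)                     # fragments
augmented = tokenize("CWE-79", BASE_VOCAB | SECURITY_TOKENS)   # one token
```

The point of the sketch: once the identifier is a single token, the model never sees it split apart, which is one reason injected text has a harder time masquerading as a security artefact.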

Eagle
Q3 2024

Eagle as a standalone 13B dense model

Eagle split out into its own training track as a 13B dense transformer. Trained on roughly 800k labelled source/sink pairs annotated by senior security engineers, with attention biased toward dataflow tokens — sources, sinks, sanitiser operators. We measured a clear lift on cross-package taint-path recall versus generic code baselines at the same parameter count.
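One common way to realise that bias is an additive bonus on the attention logits of tagged positions before the softmax. The sketch below assumes that mechanism; the tags and the bias value are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def biased_attention(logits, dataflow_positions, bias=2.0):
    """Add a logit bonus at positions tagged as sources/sinks/sanitisers."""
    boosted = [l + (bias if i in dataflow_positions else 0.0)
               for i, l in enumerate(logits)]
    return softmax(boosted)

# Four equally plausible positions; position 2 is tagged as a sink.
plain  = softmax([1.0, 1.0, 1.0, 1.0])
biased = biased_attention([1.0, 1.0, 1.0, 1.0], dataflow_positions={2})
```

The attention mass shifts toward the tagged position at the expense of the untagged ones, which is the intended effect when the tagged tokens are the sources and sinks a taint path hangs on.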

Griffin
Q4 2024

Griffin M (32B) + structured trace contract

Griffin M doubled parameters and introduced the structured reasoning trace as a first-class output: HYPOTHESIS, CITED PATH, DISPROOF, PROPOSED PATCH. The eval methodology pivoted at the same time — we stopped grading on "is there a bug?" and started grading on "find the bug and refute its own hypothesis under sanitiser-aware constraints." That contract is still the shape of every Griffin output today.
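A minimal sketch of what consuming that contract might look like. The field names mirror the four sections; the real schema is not published here, so everything below is hypothetical shape, not the actual API.

```python
from dataclasses import dataclass

@dataclass
class GriffinTrace:
    hypothesis: str       # HYPOTHESIS: the suspected vulnerability
    cited_path: list      # CITED PATH: source-to-sink hops, each a code location
    disproof: str         # DISPROOF: the model's attempt to refute itself
    proposed_patch: str   # PROPOSED PATCH: empty if the disproof succeeded

    def is_confirmed(self) -> bool:
        # A finding stands only if the disproof pass failed to refute it
        # and the cited path is non-empty.
        return bool(self.cited_path) and self.disproof.startswith("FAILED")

trace = GriffinTrace(
    hypothesis="SQL injection via user-controlled `q` parameter",
    cited_path=["api/search.py:41", "db/query.py:17"],
    disproof="FAILED: no sanitiser on any hop of the cited path",
    proposed_patch="parameterise the query in db/query.py:17",
)
```

The structural shift matters: a downstream consumer can grade the disproof section independently of the verdict, which is exactly what the eval methodology pivot required.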

Eagle
Q1 2025

Eagle clustering head + dedup

We added a clustering head on top of Eagle's ranking output. Median finding count per repo dropped roughly 40% with no measurable recall loss — duplicates and near-duplicates that used to clog reviewer queues collapsed into single representative entries. This is the moment Eagle became economically load-bearing: it lets Griffin reason about a curated queue, not a firehose.
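The collapse can be illustrated with a toy greedy clustering over finding signatures. The real head operates on learned embeddings, so Jaccard overlap on (sink, CWE, file) signatures is only a stand-in here.

```python
def signature(finding):
    return {finding["sink"], finding["cwe"], finding["file"]}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def dedup(findings, threshold=0.66):
    """Greedy clustering: a finding joins the first representative it matches,
    otherwise it becomes a new representative."""
    representatives = []
    for f in findings:
        if not any(jaccard(signature(f), signature(r)) >= threshold
                   for r in representatives):
            representatives.append(f)
    return representatives

queue = [
    {"sink": "os.system",      "cwe": "CWE-78", "file": "jobs/run.py"},
    {"sink": "os.system",      "cwe": "CWE-78", "file": "jobs/run.py"},  # dup
    {"sink": "cursor.execute", "cwe": "CWE-89", "file": "db/query.py"},
]
curated = dedup(queue)
```

Three raw findings collapse to two representatives; at repo scale, that collapse is the difference between a reviewable queue and a firehose.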

Griffin
Q2 2025

Griffin L (70B) — default production tier

The 70B variant became the default production tier. Long-context attention was added in two pieces: sliding-window for local coherence and landmark tokens for cross-section retrieval. Usable context moved from 32k to 128k, which is the first point where a multi-hop cross-package taint chain fits in one window without aggressive chunking.
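The two attention pieces compose into a single mask: a position may attend locally within its window, or globally to a small set of landmark positions. A toy construction, with illustrative window size and landmark placement:

```python
def attention_mask(seq_len, window, landmarks):
    """mask[q][k] is True where query position q may attend to key position k."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            local = abs(q - k) <= window    # sliding window: local coherence
            global_ = k in landmarks        # landmark: cross-section retrieval
            mask[q][k] = local or global_
    return mask

# 8 positions, window of 1, landmarks at positions 0 and 4.
m = attention_mask(seq_len=8, window=1, landmarks={0, 4})
```

Position 7 can reach landmark 0 despite the distance, but not ordinary position 3; that reachability pattern, scaled up, is what lets a cross-package taint chain at one end of a 128k window cite code at the other.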

Lino
Q2 2025

First Lino prototype distilled from Griffin S

The first Lino prototype was a plain label distillation from Griffin S into a 1B student. It ran at roughly 140 ms p95 on Apple Silicon, close to the latency target but not under it; accuracy was acceptable, but trace quality was clearly missing. The prototype was honest about the problem: label-only distillation gives you a fast guesser, not a fast reasoner.

Lino
Q3 2025

Trace distillation pipeline for Lino

We rebuilt the distillation pipeline to supervise the student on both (input, final label) and (input, intermediate reasoning trace). The student learned to mimic Griffin's chain, not just its verdict. This is what gave Lino its accuracy at sub-100 ms — the trace turned out to be a stronger learning signal than the label, especially on sanitiser-aware negative examples.
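A toy version of the two-signal objective makes the guesser-versus-reasoner point concrete. The loss weighting, the toy cross-entropy, and the probabilities below are invented for the sketch.

```python
import math

def cross_entropy(p_correct):
    """Toy per-token cross-entropy on the probability of the teacher's choice."""
    return -math.log(max(p_correct, 1e-9))

def distillation_loss(label_p, trace_ps, trace_weight=0.7):
    """label_p: student probability on the teacher's final verdict;
    trace_ps: student probabilities on each teacher trace token."""
    label_loss = cross_entropy(label_p)
    trace_loss = sum(cross_entropy(p) for p in trace_ps) / len(trace_ps)
    return (1 - trace_weight) * label_loss + trace_weight * trace_loss

# A student that nails the verdict but ignores the trace is still penalised
# more than one that tracks the reasoning and is merely good on the verdict.
guesser  = distillation_loss(label_p=0.99, trace_ps=[0.2, 0.1, 0.3])
reasoner = distillation_loss(label_p=0.90, trace_ps=[0.8, 0.9, 0.7])
```

Under this objective the gradient pushes the student toward mimicking the chain, which is the mechanism behind the claim that the trace is a stronger learning signal than the label.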

Eagle
Q3 2025

Eagle INT8 quantisation

Eagle's weights were quantised to INT8 with a per-channel calibration pass. p95 sweep latency on a representative 5,000-package monorepo dropped to roughly 510 ms. Recall regression on the held-out eval was under one percentage point. The quantised path is what we now run in any tier that pays per scan.
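Per-channel symmetric quantisation can be sketched in a few lines: each output channel gets its own scale from its calibration maximum, so channels with a small dynamic range keep their precision. Shapes, values, and the calibration rule here are simplified stand-ins for the real pass.

```python
def quantise_per_channel(weights):
    """weights: list of channels, each a list of floats.
    Returns (int8 values, per-channel scales)."""
    q, scales = [], []
    for channel in weights:
        scale = max(abs(w) for w in channel) / 127.0  # calibration: channel max
        scales.append(scale)
        q.append([round(w / scale) for w in channel])
    return q, scales

def dequantise(q, scales):
    return [[v * s for v in channel] for channel, s in zip(q, scales)]

# One wide-range channel and one narrow-range channel.
w = [[0.5, -1.0, 0.25], [0.01, -0.02, 0.015]]
q, scales = quantise_per_channel(w)
w_hat = dequantise(q, scales)
```

With a single tensor-wide scale, the second channel would quantise to almost nothing; the per-channel scale is what keeps the recall regression in the sub-point range.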

Lino
Q4 2025

Lino 1.0 GA in VS Code

Lino 1.0 shipped distilled from Griffin L (70B), not Griffin S. Sub-100 ms p95 on a developer laptop, sink-detection F1 above 0.78 on the held-out evaluation set. The on-device, no-egress posture was contractual from day one — the IDE extension can run with the network disabled and Lino still works.

Griffin
Q4 2025

Griffin Zero (671B-MoE) introduced

Griffin Zero introduced a mixture-of-experts variant for sovereign and air-gapped deployments. Eight experts, top-2 routing, roughly 5.5% of parameters activated per token. Usable context extended to 256k through retrieval gates that page in the right slice of the call graph around the hypothesis. Internal pilots only at first; Zero was not released broadly until the eval suite was stable enough to certify it.
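Top-2-of-8 routing itself is simple: pick the two highest-scoring experts and renormalise their gate weights. A stub router, with illustrative gate scores (the real router and Griffin Zero's shapes are not shown here):

```python
def top2_route(gate_scores):
    """Pick the two highest-scoring experts and renormalise their gates."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])
    chosen = ranked[:2]
    total = sum(gate_scores[i] for i in chosen)
    return [(i, gate_scores[i] / total) for i in chosen]

# Eight experts; only the two winners run for this token.
routed = top2_route([0.05, 0.40, 0.10, 0.02, 0.25, 0.08, 0.06, 0.04])
```

Because only the chosen pair of expert blocks executes per token (plus the shared non-expert layers), the active-parameter fraction stays in the single digits even though the total parameter count is 671B.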

Eagle
Q1 2026

Eagle ranking head v2

The ranking head was retrained against Griffin's disproof outcomes — Eagle now learns from which of its candidates Griffin actually refuted. Top-5 candidate-path recall climbed to 94%. Each candidate ships with a confidence score so Griffin routes only above-threshold candidates by default, which cut wasted reasoning budget on low-confidence triage.
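The above-threshold routing reduces to a filter on Eagle's per-candidate confidence scores; the threshold value and the candidate scores below are illustrative.

```python
def route_to_griffin(candidates, threshold=0.35):
    """candidates: (path_id, confidence) pairs, ranked by Eagle.
    Only above-threshold candidates spend Griffin reasoning budget."""
    routed  = [c for c in candidates if c[1] >= threshold]
    skipped = [c for c in candidates if c[1] < threshold]
    return routed, skipped

candidates = [("path-17", 0.91), ("path-03", 0.42), ("path-88", 0.12)]
routed, skipped = route_to_griffin(candidates)
```

The skipped tail is where the wasted reasoning budget used to go; retraining against Griffin's disproof outcomes is what makes the confidence scores trustworthy enough to gate on.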

Lino
Q1 2026

Lino signed weights + JetBrains + Cursor

Lino weights ship as sigstore-signed bundles. The IDE extension verifies the signature on install and refuses to load unsigned or mismatched weights — a small but contractual guarantee that the model on the developer machine is the model that passed eval. JetBrains and Cursor reached parity with the VS Code extension in the same release.
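The real check verifies a sigstore signature on the bundle; as a greatly simplified stand-in, the refuse-to-load behaviour can be sketched as a pinned-digest comparison. Everything below is illustrative, not the extension's actual code.

```python
import hashlib

def load_weights(bundle: bytes, expected_digest: str):
    """Refuse to load any bundle whose digest does not match the pinned value.
    (The shipped extension verifies a sigstore signature, not a bare hash.)"""
    digest = hashlib.sha256(bundle).hexdigest()
    if digest != expected_digest:
        raise ValueError("unsigned or mismatched weights: refusing to load")
    return bundle  # in the real extension, deserialisation happens here

bundle = b"\x00weights\x00"
pinned = hashlib.sha256(bundle).hexdigest()
ok = load_weights(bundle, pinned)
```

The contractual part is the failure mode: a mismatch is a hard refusal at load time, not a warning, so the model on the developer machine is provably the one that passed eval.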

Griffin
Q1 2026

Aegis architecture documented publicly

The reasoning architecture inside every Griffin variant — sliding-window plus landmark attention, the security-augmented tokeniser, the structured trace, mixture-of-experts in the largest tier — was published as a standalone architecture page. The intent was to remove ambiguity about what is actually inside the model when an enterprise asks for an architecture review.

Griffin
May 2026

Griffin Zero general availability

Griffin Zero became generally available for Sovereign and Air-Gapped tiers. Multi-GPU sizing is documented from 11x H100 (Growth) to 22x H100 multi-AZ (Mature). Cross-package taint-path precision improved 12 points over Griffin L on the internal eval suite. The adversarial disproof pass moved to parallel decoding, which is what made the 256k window economical at scale.

Open research tracks

What we're working on right now.

One open track per family. These are descriptions of active research, not roadmap commitments — the items below ship when they pass eval, not when a quarter ends.

Griffin

Longer-horizon agentic workflows

Zero proposes upstream patches, runs them through the maintainer's test suite, and drafts the coordinated disclosure thread. Parallel track: adversarial training against real prompt-injection traffic observed in MCP-server logs (anonymised, aggregated, never per-tenant).

Eagle

Cross-language taint awareness

Polyglot repos lose recall when taint flows cross a language boundary — JS calling a Python service calling a Go binary. The current track teaches Eagle a unified dataflow grammar across languages, plus a feedback loop from Griffin's disproof pass so refuted candidates fold back into the next training run.

Lino

Language-specific student heads

Language-specific heads (JVM, Python, Go) with shared base weights and task-specific fine-tunes. The motivation is deeper reasoning on language-particular sink patterns without breaking the sub-100 ms latency budget on a developer laptop.

How research becomes a release

Six steps from curation to general availability.

Every variant — Lino, Eagle, every Griffin tier — passes the same pipeline. No model ships because a date arrived; every model ships because every gate cleared.

01

Curation pass

The corpus is filtered against security-only criteria and deduplicated against the previous release, and the held-out eval set is rotated so the model has not seen the new evaluation prompts.

02

Pretraining + security RLHF

Preference data labelled by senior offensive-security engineers, not crowdworkers. The reward model penalises plausible-sounding hallucinations on CWE classification and treats unverified reachability claims as failures.

03

Adversarial red team

Prompt-injection, jailbreak, and refusal-rate suites are run against every checkpoint. Any regression on a previous-quarter test case blocks ship until the regression is explained or fixed.

04

Eval gate + cited-trace audit

Quantitative eval is necessary but not sufficient. The engineering team manually audits 300 reasoning traces per release — the trace has to read like a defender wrote it, not like a model hallucinated one.

05

Staged rollout

Shared cloud first, then dedicated cluster, then VPC-isolated, then sovereign. Each tier gets a 14-day soak window with telemetry on refusal rate, latency, and finding precision before the next tier opens.

06

Post-release telemetry

Anonymised, aggregated metrics feed the next curation pass. Customer code never enters the loop — the telemetry is shape-level (counts, latencies, refusal categories), never content-level.
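What "shape-level" means in practice can be sketched as a record that carries only counts, latencies, and categories, and a check that rejects anything else. The field names are hypothetical.

```python
# Illustrative shape-level telemetry record: aggregates only, no content.
telemetry = {
    "period": "2026-02",
    "scans": 18423,                   # a count, not which repos were scanned
    "p95_latency_ms": 512,
    "refusal_categories": {"policy": 41, "low_confidence": 203},
    # deliberately absent: repo names, file paths, snippets, finding text
}

def is_shape_level(record):
    """Reject any record carrying free-form content fields."""
    forbidden = {"code", "path", "snippet", "finding_text", "prompt"}
    return forbidden.isdisjoint(record)
```

The check is the loop's contract: a record that smuggles in content fields never reaches the next curation pass.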

What we publish

Open about the method, closed about the data.

Published

  • Model cards for every shipped variant, with capability and limitation notes.
  • Eval results on the internal suite, methodology reviewable under NDA.
  • Engineering blog posts when an architecture change ships.
  • The release changelog — what changed, when, and why.
  • This research page itself, updated alongside the changelog.

Not published

  • The training corpus itself, beyond the categorical description.
  • Internal customer telemetry of any kind.
  • Individual customer findings, even in aggregate form.
  • The held-out eval set — publishing it would contaminate the next eval.
  • Per-tenant configuration, prompts, or KV-cache state.

See the lineup on your code.

Lino at the commit. Eagle across the repo. Griffin proving the survivors. The same models the research log describes — running on your real codebase.