Concept · Security Corpus

What's actually in a security-only LLM corpus?

A general-internet-trained model drifts on security questions because most of its training data is not about security. It has seen orders of magnitude more cooking recipes and product tutorials than CVE patches. When you ask it about a deserialisation gadget, it answers from inference rather than memory — and that is where hallucinations come from.

A curated security corpus shifts that distribution materially. The model has actually read the disclosure, seen the patch, and been graded on whether it can describe the reachability path. Refusal rates fall, hallucination rates fall, and the tokeniser stops shattering CWE IDs into byte-pair noise. The Griffin, Eagle, and Lino lineup is trained on one of these — and the contents of that corpus matter more than the parameter count.

What's in the corpus

Eleven data classes, all auditable at ingest.

Each class earns its place because it carries supervision signal a general corpus cannot. Every document has provenance metadata, a security tag, and a label that lets it participate in training rather than just inflate token counts.

  • CVE descriptions paired with the patch that resolved them

    Each disclosure is linked to its remediating commit, so the model sees the vulnerable code, the fix, and the diff between them — not just an abstract description. A sketch of one such pairing appears just after this list.

  • Exploit write-ups and proof-of-concept code

    Curated, deduplicated, and stripped of weaponised payloads. The signal is the reasoning chain, not a ready-to-run weapon.

  • Vendor PSIRT advisories, CERT bulletins, MITRE entries

    Authoritative disclosure text from upstream maintainers and coordination bodies — the ground truth for how a class of bug is described publicly.

  • Static-analysis findings with ground-truth pass/fail labels

    Real analyser output graded by senior engineers as truly exploitable, latent, or false positive — the supervision signal that teaches the model what reachable actually means.

  • Taint graphs extracted from open-source repos

    Annotated source-to-sink paths across hundreds of thousands of public repositories, with sanitiser edges, sink severities, and CWE classes attached.

  • MITRE ATT&CK technique descriptions and procedure examples

    Tactic, technique, and procedure mappings link source-level patterns to the adversary behaviour they enable, so the model can talk about both.

  • Fuzzing corpora and crash triage notes

    Input grammars, crash classifications, and triage decisions from public fuzz campaigns — coverage of the bugs that pattern scanners cannot find.

  • Package-registry metadata and malware behavioural traces

    Typosquats, dependency-confusion incidents, and post-install script analyses — the literature of supply-chain attacks rather than benign README files.

  • IDA / Ghidra disassembly excerpts (security-tagged)

    Selected reverse-engineering snippets with vulnerability annotations, so the model gains intuition for binary-level patterns and not just source-level ones.

  • Vulnerability disclosure threads, with maintainer consent

    Public discussion of how a bug was found, triaged, and fixed — including the dead ends — captures the reasoning style of real triage, not just the verdict.

  • Security RFCs and standards (NIST, ENISA, OWASP, NCSC)

    Canonical guidance text grounds the model in the vocabulary defenders actually use, instead of approximations inferred from general web text.
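
To make the first class in this list concrete, here is a minimal sketch of how a disclosure might be paired with its remediating commit. The structure and field names are assumptions for illustration, not the production schema.

    from dataclasses import dataclass

    @dataclass
    class CvePatchPair:
        """A disclosure paired with the commit that fixed it (illustrative fields)."""
        cve_id: str            # e.g. "CVE-2024-XXXX"
        cwe_ids: list[str]     # weakness classes attached to the disclosure
        advisory_text: str     # the public description of the bug
        vulnerable_code: str   # the affected function or file before the fix
        patched_code: str      # the same code after the remediating commit
        fix_diff: str          # unified diff between the two
        commit_sha: str        # provenance link back to the upstream repository

    def to_training_text(pair: CvePatchPair) -> str:
        # One way to linearise the pair so the model sees description, code, and diff together.
        return "\n\n".join([
            f"{pair.cve_id} ({', '.join(pair.cwe_ids)})",
            pair.advisory_text,
            "--- vulnerable ---\n" + pair.vulnerable_code,
            "--- patched ---\n" + pair.patched_code,
            "--- diff ---\n" + pair.fix_diff,
        ])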

What's NOT in the corpus

The exclusion list is contractual, not aspirational.

Everything below is filtered out at ingest, not after the fact. The point isn't purity — it's that contamination from these classes degrades the model's behaviour on security tasks in ways that are hard to undo with later fine-tuning. A sketch of what an ingest-time gate could look like follows this list.

  • No customer code, no customer scan outputs

    Repos and findings that pass through Safeguard's scanners stay in the customer's tenant. They are never used as training data, at any tier.

  • No general web crawl

    Common-crawl-style dumps drag in marketing fluff, low-signal forum text, and untrustworthy code — exactly the noise we want absent from a security model.

  • No StackOverflow snippets without a security frame

    Q&A code with no security framing attached teaches a model patterns that look reasonable but ship CVEs in production. We exclude it by default.

  • No PII, no chat logs

    Personal data and private conversations have no place in a model that reasons about exploit primitives. Ingestion explicitly screens them out.

  • No closed-source proprietary disassembly

    Disassembly we don't have the right to redistribute stays out, regardless of how useful it might be — provenance has to be auditable.

  • No LLM-generated text (no self-training feedback)

    Synthetic security text loops a model's mistakes back into itself. We refuse the convenience and pay for human-labelled exemplars instead.

  • No marketing or product copywriting

    Vendor decks describe a sanitised world. Defenders don't live in that world, and a model trained on it learns to flinch at the wrong words.
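
As noted above, these exclusions are enforced at ingest rather than cleaned up later. A sketch of what such a gate might look like, with placeholder predicates standing in for real detectors and licence checks:

    # Hypothetical ingest gate: a document is dropped the moment any exclusion rule fires.
    # The predicate functions are placeholders for real classifiers and licence checks.
    EXCLUSION_RULES = {
        "customer_data": lambda doc: doc.get("tenant") is not None,
        "general_web_crawl": lambda doc: doc.get("source") == "web_crawl",
        "no_security_frame": lambda doc: not doc.get("security_tags"),
        "pii_or_chat_log": lambda doc: doc.get("contains_pii", False),
        "unlicensed_disassembly": lambda doc: doc.get("licence") is None,
        "llm_generated": lambda doc: doc.get("generated_by_model", False),
    }

    def admit(doc: dict) -> bool:
        # Reject at ingest: if any rule matches, the document never enters the corpus.
        return not any(rule(doc) for rule in EXCLUSION_RULES.values())

    admit({"source": "psirt_advisory", "security_tags": ["CWE-89"], "licence": "CC-BY-4.0"})  # True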

Why it matters

Three measurable shifts in model behaviour.

Refusal rates drop on security Q&A

General-internet-trained models flinch at words like 'exploit' and 'gadget' because their RLHF rewarded refusal. A security corpus weighted toward disclosures, write-ups, and patch diffs supplies canonical answers, so the model can help defenders instead of stonewalling them.

Tokeniser learns CWE/CVE IDs and taint operators

CWE-89, CVE-2024-XXXX, source→sink arrows, sanitiser markers — these become single tokens whose nearest neighbours in embedding space are other vulnerability patterns. Long-context behaviour improves because the model isn't burning attention on shattered byte-pair fragments.
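
A toy illustration of the difference, assuming a regex pre-tokeniser that keeps security identifiers whole. This is a simplified stand-in for what a corpus-trained vocabulary achieves, not the actual tokeniser.

    import re

    # Identifiers a generic vocabulary tends to split into fragments.
    SECURITY_ID = re.compile(r"CWE-\d+|CVE-\d{4}-\w+")

    def naive_split(text: str) -> list[str]:
        # Crude stand-in for a general-purpose BPE: break on every non-word character.
        return [t for t in re.split(r"(\W)", text) if t.strip()]

    def security_aware_split(text: str) -> list[str]:
        # Protect CWE/CVE identifiers as single units, fall back to the naive split elsewhere.
        pieces, cursor = [], 0
        for m in SECURITY_ID.finditer(text):
            pieces += naive_split(text[cursor:m.start()])
            pieces.append(m.group(0))      # the whole identifier survives as one token
            cursor = m.end()
        pieces += naive_split(text[cursor:])
        return pieces

    text = "CWE-89 reachable via CVE-2024-XXXX"
    print(naive_split(text))           # ['CWE', '-', '89', 'reachable', 'via', 'CVE', '-', '2024', '-', 'XXXX']
    print(security_aware_split(text))  # ['CWE-89', 'reachable', 'via', 'CVE-2024-XXXX']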

Hallucination rate falls on security claims

Every plausible-sounding statement about a CWE class or reachability pattern is anchored to labelled exemplars in training. The model has seen the canonical answer enough times that it stops inventing alternatives that sound right but aren't.

How the corpus is curated

Four stages, every document accounted for.

The pipeline runs end-to-end on every training cut. Provenance metadata flows from ingest into the labelled exemplars and into the RLHF preference data, so any claim about the corpus can be traced back to a document and an annotator.

01

Ingest

Disclosures, patch diffs, advisories, fuzzing corpora, and taint graphs are pulled from authoritative sources with provenance metadata attached at the document level.
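
A minimal sketch of the per-document record this stage could produce, under an assumed schema. The field names are illustrative; the point is that provenance, the security tag, and the training label travel together from ingest onwards.

    from dataclasses import dataclass, field

    @dataclass
    class CorpusDocument:
        """One auditable training document as it leaves ingest (illustrative schema)."""
        doc_id: str                   # stable identifier assigned at ingest
        source_url: str               # where the text was pulled from
        retrieved_at: str             # ISO-8601 timestamp, part of provenance
        licence: str                  # redistribution terms, checked at ingest
        security_tags: list[str] = field(default_factory=list)  # e.g. ["CWE-89", "CVE-2024-XXXX"]
        label: str = "unlabelled"     # graded later, e.g. "exploitable" / "latent" / "false_positive"
        annotator: str | None = None  # filled in at the labelling stage, keeps labels traceable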

02

Security-domain dedup

Near-duplicate disclosures, mirrored advisories, and reposted PoCs are collapsed. The dedup keys are CVE-aware so we don't accidentally erase a critical variant.
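
One way a CVE-aware dedup key could be built. This exact-match sketch is simpler than real near-duplicate detection, which would typically use shingling or MinHash, but it shows the part that matters here: documents describing different CVEs can never collapse into one record.

    import hashlib
    import re

    CVE_ID = re.compile(r"CVE-\d{4}-\w+")

    def dedup_key(advisory_text: str) -> tuple[frozenset[str], str]:
        # CVE IDs found in the document become part of the key, so two advisories
        # that describe different CVEs are never treated as duplicates of each other.
        cves = frozenset(CVE_ID.findall(advisory_text))
        # Normalise whitespace and case before hashing, so mirrors and reposts match.
        normalised = " ".join(advisory_text.lower().split())
        return cves, hashlib.sha256(normalised.encode()).hexdigest()

    seen: set[tuple[frozenset[str], str]] = set()

    def is_duplicate(advisory_text: str) -> bool:
        key = dedup_key(advisory_text)
        if key in seen:
            return True
        seen.add(key)
        return False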

03

Ground-truth labelling

Senior security engineers grade exemplars for reachability, exploitability, CWE class, and sanitiser coverage. The labels become supervision signal — not annotation gig-work output.
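
A sketch of how those grades might be carried as structured labels, assuming the four axes named above; the enumeration values and field names are illustrative.

    from dataclasses import dataclass
    from enum import Enum

    class Verdict(Enum):
        EXPLOITABLE = "exploitable"        # reachable and abusable as reported
        LATENT = "latent"                  # real flaw, not currently reachable
        FALSE_POSITIVE = "false_positive"  # analyser noise

    @dataclass
    class GradedExemplar:
        """One engineer-graded finding used as supervision signal (illustrative)."""
        finding_id: str
        cwe_id: str                      # e.g. "CWE-89"
        verdict: Verdict                 # exploitability judgement
        reachable: bool                  # is there a concrete path from an entry point?
        sanitisers_on_path: list[str]    # which sanitiser edges, if any, sit on the path
        annotator: str                   # provenance: who graded it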

04

Security RLHF

Preference data is collected against a rubric written by offensive-security engineers. Plausible-sounding hallucinations on CWE classification are penalised; unverified reachability claims are treated as failures.
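
A sketch of what one preference record and a toy scoring rule might look like, penalising hallucinated CWE classes and unverified reachability claims. The fields and weights are assumptions, not the actual rubric.

    from dataclasses import dataclass

    @dataclass
    class PreferencePair:
        prompt: str
        chosen: str     # answer preferred under the offensive-security rubric
        rejected: str   # answer penalised by it

    def rubric_score(answer: str, known_cwes: set[str], verified_reachable: bool) -> int:
        """Toy rubric: reward grounded CWE references, punish unverified reachability claims."""
        score = 0
        if any(cwe in answer for cwe in known_cwes):
            score += 2   # cites a CWE class the exemplar is actually labelled with
        if "CWE-" in answer and not any(cwe in answer for cwe in known_cwes):
            score -= 3   # plausible-sounding but wrong CWE classification, penalised hard
        if "reachable" in answer.lower() and not verified_reachable:
            score -= 5   # unverified reachability claim treated as a failure
        return score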

Related concepts

Keep reading the lineup.

The corpus is one half of the story. The other half is how the lineup turns it into three different models — each with a different latency and reasoning budget — without losing the security taste the corpus encodes.

See the corpus shape the answers you get.

Ask Griffin a question a general model would refuse. Compare the answer. The corpus is the difference.
