Fine-Tuning Security LLMs vs Grounding: Which Wins

Fine-tuning teaches a model to be a security expert. Grounding lets a general model act like one by reading the right sources. The right answer is usually both, but the proportions matter.

Nayan Dey
AI Platform Engineer
7 min read

Two camps have emerged in how teams give language models security expertise. The fine-tuning camp believes the knowledge belongs inside the model, baked into weights through careful training on curated security corpora. The grounding camp believes the knowledge belongs outside the model, retrieved fresh on each query from a live source of truth. Both camps are right about something and wrong about something else. This post works through the trade-offs and ends with a pattern that actually ships.

The case for fine-tuning

Fine-tuning produces a model that has internalised the shape of security work. It does not need to be told what a CVE looks like because it has seen thousands of them. It does not need a prompt explaining CVSS vectors because it can produce them by reflex. It can mimic the writing style of a seasoned security engineer because that style is encoded in the parameters. When the model is running inference, none of that knowledge needs to travel over a network or be retrieved from a database. It is just there.

This gives fine-tuned models three concrete advantages. The first is latency. No retrieval step means the time-to-first-token is shorter, which matters for interactive tools. The second is cost per call. A smaller fine-tuned model can match a larger general model on narrow tasks while using a fraction of the compute. The third is consistency. A fine-tuned model produces outputs in the expected format because that format is part of what was trained in.

The downsides are structural. A fine-tuned model is frozen at the moment of training. Every new CVE, advisory, or exploit after that point is invisible to it. You can mitigate this by retraining on a regular cadence, but the economics get ugly fast. A fine-tuning run on a modest model takes real GPU hours and requires careful validation so you do not degrade capability while adding knowledge. Doing this monthly for a fast-moving domain like security is possible but onerous, and doing it weekly is rarely feasible.

Fine-tuned models also tend to overfit to the style of their training data. If the training set emphasised a particular ecosystem or language, the model will perform worse outside it. They can also hallucinate confidently where a general model would express uncertainty, because fine-tuning often teaches the model to commit to answers rather than hedge.

The case for grounding

Grounding flips the architecture. You keep the model general and teach it to consult an external knowledge source before answering. For security, that source is typically a curated database of CVEs, advisories, package metadata, exploitability data, and your own organisation's findings. Safeguard's Griffin engine is an example of this pattern applied at production scale. The model itself is not a security specialist. The pipeline around it is.

The benefits are the mirror image of fine-tuning's downsides. The knowledge base can be updated in real time. A CVE disclosed this morning can influence answers this afternoon. The model's security knowledge does not go stale because the facts were never stored in its weights in the first place. When the source of truth changes, the answers change with it. This is particularly valuable for security because the rate of change is high and the cost of outdated information is also high.

Grounding also provides traceability. Every answer can cite the specific document, advisory, or finding that justified it. This matters for compliance, for post-incident review, and for building trust with users who need to know where a claim came from. A purely fine-tuned model cannot tell you why it believes something. A grounded model can point at the source.

The trade-offs show up in latency, complexity, and retrieval quality. Each call now includes a retrieval step that adds latency and introduces a failure mode. If the retrieval misses the relevant document, the model produces a worse answer than it would have with that context. Retrieval quality becomes the bottleneck, and optimising it is a separate engineering discipline involving embeddings, reranking, chunking, and query rewriting.
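To make the retrieval failure mode concrete, here is a minimal sketch of a grounded lookup that returns its sources and reports a miss explicitly instead of letting the model answer without context. It uses naive term overlap purely for illustration; a real pipeline would use embeddings and reranking as described above. The names `Document`, `GroundedContext`, `retrieve`, and the `min_overlap` threshold are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str  # hypothetical identifier, e.g. an advisory or ticket ID
    text: str


@dataclass
class GroundedContext:
    sources: list = field(default_factory=list)   # doc_ids the answer can cite
    passages: list = field(default_factory=list)  # text handed to the model

    @property
    def is_miss(self) -> bool:
        # No sources means retrieval found nothing relevant enough.
        return not self.sources


def tokenize(text: str) -> set:
    """Lowercase, whitespace-split terms with surrounding punctuation removed."""
    return {w.strip(".,:;()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list, min_overlap: int = 2, top_k: int = 3) -> GroundedContext:
    """Rank documents by shared-term count and drop weak matches."""
    query_terms = tokenize(query)
    scored = []
    for doc in corpus:
        overlap = len(query_terms & tokenize(doc.text))
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    context = GroundedContext()
    for _, doc in scored[:top_k]:
        context.sources.append(doc.doc_id)
        context.passages.append(doc.text)
    return context
```

A caller can check `is_miss` before generating and either widen the search or tell the user that nothing relevant was found. The `sources` list is what makes the answer traceable.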

Where fine-tuning wins decisively

Fine-tuning wins on tasks that are stable, high-volume, and narrow in scope. Classifying the severity of a vulnerability description is a good example. The task does not require current CVE data because it operates on the text in front of it. It is high-volume enough that latency and cost matter. It is narrow enough that a small model can be trained to near-human accuracy. At that volume, the one-off cost of fine-tuning is typically recovered through cheaper inference.
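Whether fine-tuning pays for itself comes down to simple break-even arithmetic. The sketch below works it through under stated assumptions: the function names are made up for this example, and every cost figure in the usage note is hypothetical rather than measured.

```python
def break_even_calls(finetune_cost: float,
                     general_cost_per_call: float,
                     tuned_cost_per_call: float) -> float:
    """Calls needed before the fine-tuned model is cheaper overall.

    Returns infinity when the fine-tuned model is not cheaper per call,
    because the upfront cost is then never recovered.
    """
    saving_per_call = general_cost_per_call - tuned_cost_per_call
    if saving_per_call <= 0:
        return float("inf")
    return finetune_cost / saving_per_call


def days_to_break_even(finetune_cost: float,
                       general_cost_per_call: float,
                       tuned_cost_per_call: float,
                       calls_per_day: float) -> float:
    """Break-even point expressed in days at a steady call volume."""
    if calls_per_day <= 0:
        raise ValueError("calls_per_day must be positive")
    calls = break_even_calls(finetune_cost, general_cost_per_call, tuned_cost_per_call)
    return calls / calls_per_day
```

With a hypothetical 500 units of training cost, 0.01 per call on a general model, and 0.002 per call on the tuned model, the break-even point is roughly 62,500 calls. At 5,000 calls per day that is about twelve and a half days. The same arithmetic shows why low-volume tasks rarely justify a training run.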

Style transfer tasks are another category where fine-tuning wins. If your organisation has a specific way of writing advisories, remediation notes, or risk summaries, fine-tuning on historical examples produces a model that naturally conforms to that style. Prompting a general model to mimic the style works less consistently and uses more tokens per call.

Tasks that require internalised reasoning about code patterns also favour fine-tuning. A model that has seen ten thousand examples of vulnerable SQL query construction will spot the eleventh one faster than a general model reading the code cold. The pattern recognition is baked in.

Where grounding wins decisively

Grounding wins on any task where the answer depends on information that changes. Vulnerability lookup is the clearest case. The question "is this version affected by any known CVE" cannot be answered reliably from a fine-tuned model's weights, because the answer depends on the current state of the CVE database, which keeps changing after the training cut-off. For this task, grounding is the only correct architecture.
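A minimal sketch of why this lookup belongs outside the model: the answer is a function of the advisory store, so updating the store changes the answer without touching any weights. The `Advisory` record, the `EXAMPLE-` identifiers, and the package name are placeholders invented for this example, and the version parser assumes plain dotted numeric versions with the same number of components.

```python
from dataclasses import dataclass


def parse_version(version: str) -> tuple:
    """Parse a dotted numeric version such as '2.4.1' into (2, 4, 1)."""
    return tuple(int(part) for part in version.split("."))


@dataclass(frozen=True)
class Advisory:
    advisory_id: str  # placeholder ID, not a real CVE number
    package: str
    introduced: str   # first affected version (inclusive)
    fixed: str        # first fixed version (exclusive)


def affected_by(package: str, version: str, advisories: list) -> list:
    """Return IDs of advisories whose affected range contains this version."""
    v = parse_version(version)
    return [
        a.advisory_id
        for a in advisories
        if a.package == package
        and parse_version(a.introduced) <= v < parse_version(a.fixed)
    ]
```

If a new advisory is appended to the list, the next call to `affected_by` reflects it immediately. That is the property a frozen model cannot offer.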

Questions about your organisation's specific state also favour grounding. "What is the status of CVE-2026-xxxx in our environment" cannot be answered by a generic model. It requires retrieval from your inventory, your scan results, and your ticketing system. Fine-tuning the model on your environment is impractical because your environment changes every hour.

Multi-step reasoning tasks that need to pull in different evidence at different steps also favour grounding. An investigation that starts with a finding, pivots to the affected component, then to the dependency graph, then to the remediation options involves retrieving different documents at each step. Baking that graph into a model's weights is infeasible.
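The investigation described above can be sketched as a chain of lookups, each against a different store. Everything here is hypothetical: the in-memory dictionaries stand in for a findings database, a dependency graph service, and a remediation catalogue, and the IDs and package names are made up.

```python
from collections import deque

# Hypothetical data: each key depends on the packages in its list.
DEPENDS_ON = {
    "web-app": ["api-client", "example-parser"],
    "api-client": ["example-parser"],
    "batch-job": ["api-client"],
    "example-parser": [],
}
FINDINGS = {"F-101": {"component": "example-parser", "advisory": "EXAMPLE-0001"}}
REMEDIATIONS = {"EXAMPLE-0001": "upgrade example-parser to 1.4.2 or later"}


def dependents_of(component: str, depends_on: dict) -> list:
    """Breadth-first walk of the reversed graph: everything that pulls in `component`."""
    reverse = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(pkg)

    seen = []
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for parent in reverse.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


def investigate(finding_id: str) -> dict:
    finding = FINDINGS[finding_id]                    # step 1: the finding
    component = finding["component"]                  # step 2: affected component
    impacted = dependents_of(component, DEPENDS_ON)   # step 3: dependency graph
    fix = REMEDIATIONS.get(finding["advisory"], "no remediation on record")  # step 4
    return {"component": component, "impacted": impacted, "remediation": fix}
```

Each step reads from a store that changes independently of the others, which is why the evidence is retrieved at query time rather than trained in.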

The pattern that actually ships

In production systems that work well, the split is roughly this. Fine-tuning is used for the parts of the stack that are stylistic, high-volume, or involve stable pattern recognition. Grounding is used for the parts that need current facts, organisational state, or traceability. A single query often uses both.

Consider a typical triage flow. A finding comes in. A small fine-tuned classifier assigns an initial severity. A grounded retrieval step pulls the current CVE data, the affected component's position in the dependency graph, and relevant exploitability signals. A frontier model (possibly lightly fine-tuned for your output style) synthesises the retrieved context into a remediation recommendation. Each stage uses the technique that fits its role.
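The flow above can be sketched as three stages wired together. This is an illustration of the shape only, not a real implementation: keyword rules stand in for the fine-tuned classifier, a dictionary stands in for the grounded store, and a string template stands in for the frontier model. All function names, IDs, and data are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    finding_id: str
    component: str
    description: str


def classify_severity(description: str) -> str:
    """Stand-in for the small fine-tuned classifier (keyword rules here)."""
    text = description.lower()
    if "remote code execution" in text:
        return "critical"
    if "denial of service" in text:
        return "medium"
    return "low"


# Stand-in for the grounded knowledge base; contents are hypothetical.
KNOWLEDGE = {
    "example-parser": {
        "advisories": ["EXAMPLE-0001"],
        "dependents": ["web-app", "api-client"],
        "exploit_observed": True,
    }
}


def retrieve_context(component: str) -> dict:
    """Stand-in for the retrieval step: current advisories, graph position, exploit signals."""
    empty = {"advisories": [], "dependents": [], "exploit_observed": False}
    return KNOWLEDGE.get(component, empty)


def synthesise(finding: Finding, severity: str, context: dict) -> str:
    """Stand-in for the frontier model: a plain template over the retrieved context."""
    advisories = ", ".join(context["advisories"]) or "no known advisories"
    dependents = ", ".join(context["dependents"]) or "no known dependents"
    urgency = "patch immediately" if context["exploit_observed"] else "schedule a patch"
    return (f"[{severity}] {finding.finding_id} in {finding.component}: "
            f"see {advisories}; affects {dependents}; {urgency}.")


def triage(finding: Finding) -> dict:
    severity = classify_severity(finding.description)   # fine-tuned stage
    context = retrieve_context(finding.component)       # grounded stage
    recommendation = synthesise(finding, severity, context)
    return {"severity": severity, "sources": context["advisories"],
            "recommendation": recommendation}
```

Keeping the stages as separate functions means each one can be swapped independently, for example replacing the keyword rules with a trained classifier without touching retrieval.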

The practical decision

If you have infinite resources, do both. In the real world, start with grounding because it gives you a working system faster and keeps pace with the domain without retraining. Once you are in production, measure where the same query patterns repeat thousands of times per day. Those are your fine-tuning candidates. Measure where answers need to be current or traceable. Those stay grounded.
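One way to find repeating query patterns is to normalise away the parts that vary, such as CVE identifiers and version numbers, and count what remains. The sketch below assumes queries are available as plain strings from your logs. The regular expressions, placeholder tokens, and the `min_count` threshold are illustrative choices, not a recommended standard.

```python
import re
from collections import Counter

# Patterns for the variable parts of a query (illustrative, not exhaustive).
CVE_ID = re.compile(r"cve-\d{4}-\d+", re.IGNORECASE)
VERSION = re.compile(r"\b\d+(?:\.\d+)+\b")


def normalise(query: str) -> str:
    """Lowercase, replace CVE IDs and dotted versions with placeholders, collapse spaces."""
    q = CVE_ID.sub("<cve>", query.lower())
    q = VERSION.sub("<version>", q)
    return " ".join(q.split())


def finetune_candidates(queries: list, min_count: int) -> list:
    """Return (pattern, count) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(normalise(q) for q in queries)
    return [(pattern, n) for pattern, n in counts.most_common() if n >= min_count]
```

Patterns that clear the threshold are worth checking against the criteria above: if the answer does not need current facts or citations, the pattern is a fine-tuning candidate; otherwise it stays grounded however often it repeats.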

Neither approach wins on its own. The teams that ship reliable security AI treat fine-tuning and grounding as complementary tools and let workload characteristics pick between them on a case-by-case basis.
