Two camps have emerged in how teams give language models security expertise. The fine-tuning camp believes the knowledge belongs inside the model, baked into weights through careful training on curated security corpora. The grounding camp believes the knowledge belongs outside the model, retrieved fresh on each query from a live source of truth. Both camps are right about something and wrong about something else. This post works through the trade-offs and ends with a pattern that actually ships.
The case for fine-tuning
Fine-tuning produces a model that has internalised the shape of security work. It does not need to be told what a CVE looks like because it has seen thousands of them. It does not need a prompt explaining CVSS vectors because it can produce them by reflex. It can mimic the writing style of a seasoned security engineer because that style is encoded in the parameters. When the model is running inference, none of that knowledge needs to travel over a network or be retrieved from a database. It is just there.
This gives fine-tuned models three concrete advantages. The first is latency. No retrieval step means the time-to-first-token is shorter, which matters for interactive tools. The second is cost per call. A smaller fine-tuned model can match a larger general model on narrow tasks while using a fraction of the compute. The third is consistency. A fine-tuned model produces outputs in the expected format because that format is part of what was trained in.
The downsides are structural. A fine-tuned model is frozen at the moment of training. Every new CVE, advisory, or exploit after that point is invisible to it. You can mitigate this by retraining on a regular cadence, but the economics get ugly fast. A fine-tuning run on a modest model takes real GPU hours and requires careful validation so you do not degrade capability while adding knowledge. Doing this monthly for a fast-moving domain like security is possible but onerous, and doing it weekly is rarely feasible.
Fine-tuned models also tend to overfit to the style of their training data. If the training set emphasised a particular ecosystem or language, the model will perform less well outside it. They can also hallucinate confidently in situations where a general model would express uncertainty, because fine-tuning often teaches the model to commit to answers rather than hedge.
The case for grounding
Grounding flips the architecture. You keep the model general and teach it to consult an external knowledge source before answering. For security, that source is typically a curated database of CVEs, advisories, package metadata, exploitability data, and your own organisation's findings. Safeguard's Griffin engine is an example of this pattern applied at production scale. The model itself is not a security specialist. The pipeline around it is.
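The shape of the pattern is simple enough to sketch. The snippet below is a minimal illustration, not Griffin or any particular product; the retriever and llm callables, and the Document type, are stand-ins for whatever vector store and model client you actually run.

```python
from dataclasses import dataclass

@dataclass
class Document:
    source_id: str   # e.g. an advisory, CVE, or internal finding identifier
    text: str        # the retrieved passage

def answer_security_question(question: str, retriever, llm) -> str:
    """Retrieve-then-answer: the model stays general, the facts stay external."""
    # Pull the most relevant advisories or findings for this query.
    documents: list[Document] = retriever(question, top_k=5)

    # Put the retrieved context in the prompt so the model reasons over
    # current facts rather than whatever it happened to memorise.
    context = "\n\n".join(f"[{d.source_id}] {d.text}" for d in documents)
    prompt = (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```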
The benefits are the mirror image of fine-tuning's downsides. The knowledge base can be updated in real time. A CVE disclosed this morning can influence answers this afternoon. The model never ages because it never stored the facts in the first place. When the source of truth changes, the answers change automatically. This is particularly valuable for security because the rate of change is high and the cost of outdated information is also high.
Grounding also provides traceability. Every answer can cite the specific document, advisory, or finding that justified it. This matters for compliance, for post-incident review, and for building trust with users who need to know where a claim came from. A pure fine-tuned model cannot tell you why it believes something. A grounded model can point at the source.
The trade-offs show up in latency, complexity, and retrieval quality. Each call now includes a retrieval step that adds latency and introduces a failure mode. If the retrieval misses the relevant document, the model produces a worse answer than it would have with that context. Retrieval quality becomes the bottleneck, and optimising it is a separate engineering discipline involving embeddings, reranking, chunking, and query rewriting.
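A common way that discipline shows up in practice is two-stage retrieval: a cheap embedding search for recall, then a reranking pass for precision. The sketch below assumes injected embed, vector_index, and rerank components; the names are illustrative, not any specific library's API.

```python
def retrieve(query: str, embed, vector_index, rerank, top_k: int = 5):
    """Two-stage retrieval: broad recall first, precision second."""
    # Stage 1: approximate search over embeddings casts a wide net.
    candidates = vector_index.search(embed(query), limit=50)

    # Stage 2: a reranker scores each candidate against the full query text.
    # This is where most of the quality, and most of the tuning effort, lives.
    scored = sorted(candidates, key=lambda doc: rerank(query, doc.text), reverse=True)
    return scored[:top_k]
```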
Where fine-tuning wins decisively
Fine-tuning wins on tasks that are stable, high-volume, and narrow in scope. Classifying the severity of a vulnerability description is a good example. The task does not require current CVE data because it operates on the text in front of it. It is high-volume enough that latency and cost matter. It is narrow enough that a small model can be trained to near-human accuracy. Every dollar you spend fine-tuning this task pays back in reduced inference cost.
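At inference time the whole task collapses to a single local call. The example below uses the Hugging Face transformers pipeline as one plausible way to serve such a model; the checkpoint name and the label set are hypothetical and would be whatever your fine-tune produced.

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; any small sequence-classification
# model trained on labelled vulnerability descriptions would slot in here.
classifier = pipeline("text-classification", model="your-org/severity-classifier")

finding = (
    "SQL injection in the /reports endpoint allows an unauthenticated "
    "attacker to read arbitrary rows from the billing database."
)

# Returns something like [{"label": "critical", "score": 0.97}] --
# the labels are whatever the fine-tune was trained to emit.
print(classifier(finding))
```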
Style transfer tasks are another category where fine-tuning wins. If your organisation has a specific way of writing advisories, remediation notes, or risk summaries, fine-tuning on historical examples produces a model that naturally conforms to that style. Prompting a general model to mimic the style works less consistently and uses more tokens per call.
Tasks that require internalised reasoning about code patterns also favour fine-tuning. A model that has seen ten thousand examples of vulnerable SQL query construction will spot the ten-thousand-and-first faster than a general model reading the code cold. The pattern recognition is baked in.
Where grounding wins decisively
Grounding wins on any task where the answer depends on information that changes. Vulnerability lookup is the clearest case. The question "is this version affected by any known CVE" cannot be answered by a fine-tuned model at all, because the answer depends on the current state of the CVE database. Grounding is the only correct architecture.
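Concretely, the lookup is a query against a live database, not a model call at all. As one sketch, the function below queries OSV.dev's public /v1/query endpoint; an internal advisory service with a similar shape would work just as well.

```python
import requests

def known_vulns(package: str, version: str, ecosystem: str = "PyPI") -> list[str]:
    """Ask a live vulnerability database whether this exact version is affected."""
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"version": version, "package": {"name": package, "ecosystem": ecosystem}},
        timeout=10,
    )
    resp.raise_for_status()
    # The answer changes whenever the database does -- no retraining involved.
    return [v["id"] for v in resp.json().get("vulns", [])]

print(known_vulns("requests", "2.19.1"))
```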
Questions about your organisation's specific state also favour grounding. "What is the status of CVE-2026-xxxx in our environment" cannot be answered by a generic model. It requires retrieval from your inventory, your scan results, and your ticketing system. Fine-tuning the model on your environment is impractical because your environment changes every hour.
Multi-step reasoning tasks that need to pull in different evidence at different steps also favour grounding. An investigation that starts with a finding, pivots to the affected component, then to the dependency graph, then to the remediation options involves retrieving different documents at each step. Baking that graph into a model's weights is infeasible.
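The retrieval plan ends up looking like a series of lookups against your knowledge base, each one conditioned on what the previous step returned. The sketch below assumes a hypothetical kb client exposing those lookups; the method names are illustrative.

```python
def investigate(finding_id: str, kb) -> dict:
    """Each step retrieves different evidence; none of it lives in model weights."""
    finding = kb.get_finding(finding_id)                 # step 1: the finding itself
    component = kb.get_component(finding.component_id)   # step 2: affected component
    dependents = kb.dependency_graph(component.name)     # step 3: blast radius
    fixes = kb.remediation_options(finding.cve_id)       # step 4: available fixes
    return {
        "finding": finding,
        "component": component,
        "dependents": dependents,
        "remediation": fixes,
    }
```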
The pattern that actually ships
In production systems that work well, the split is roughly this. Fine-tuning is used for the parts of the stack that are stylistic, high-volume, or involve stable pattern recognition. Grounding is used for the parts that need current facts, organisational state, or traceability. A single query often uses both.
Consider a typical triage flow. A finding comes in. A small fine-tuned classifier assigns an initial severity. A grounded retrieval step pulls the current CVE data, the affected component's position in the dependency graph, and relevant exploitability signals. A frontier model (possibly lightly fine-tuned for your output style) synthesises the retrieved context into a remediation recommendation. Each stage uses the technique that fits its role.
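Stitched together, the flow is a short pipeline. The sketch below is one way to express it, with the fine-tuned classifier, the grounded retriever, and the frontier model all injected as callables; none of the names refer to a specific product.

```python
def triage(finding_text: str, classifier, retriever, frontier_llm) -> str:
    """Fine-tuned where the task is stable, grounded where the facts move."""
    # Stage 1: a small fine-tuned model assigns an initial severity (cheap, fast).
    severity = classifier(finding_text)

    # Stage 2: grounded retrieval pulls current CVE data, dependency-graph
    # position, and exploitability signals for the affected component.
    evidence = retriever(finding_text)

    # Stage 3: a frontier model synthesises the evidence into a recommendation.
    prompt = (
        f"Finding: {finding_text}\n"
        f"Initial severity: {severity}\n"
        f"Evidence:\n{evidence}\n\n"
        "Write a remediation recommendation and cite the evidence used."
    )
    return frontier_llm(prompt)
```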
The practical decision
If you have infinite resources, do both. In the real world, start with grounding because it gives you a working system faster and keeps pace with the domain without retraining. Once you are in production, measure where the same query patterns repeat thousands of times per day. Those are your fine-tuning candidates. Measure where answers need to be current or traceable. Those stay grounded.
Neither approach wins on its own. The teams that ship reliable security AI treat fine-tuning and grounding as complementary tools and let workload characteristics pick between them on a case-by-case basis.