AI Security

Griffin AI vs Fine-Tuned Open Weights for SecOps

Fine-tuning an open-weight model sounds like a shortcut to a custom SecOps copilot. In practice, it is one step of a much longer journey.

Nayan Dey
Security Automation Lead
7 min read

Every few weeks, a security leader asks us some version of the same question: "What if we just fine-tune Llama or Mistral on our internal security data? Would that get us most of the way to a Griffin-style copilot?"

It is a reasonable instinct. Fine-tuning is cheaper than it used to be, the tooling is more accessible, and the datasets required are within reach for a mid-sized security organisation. The question is whether fine-tuned open weights can replace Griffin AI for SecOps workflows, or whether fine-tuning solves a different problem than the one that actually matters.

What fine-tuning is good for

Fine-tuning an open-weight model is genuinely useful for specific things:

  • Making the model respond in a particular tone or format consistent with internal conventions
  • Reducing the prompt overhead required to explain context that is stable across calls
  • Improving performance on domain terminology that the base model has not seen often
  • Encoding organisation-specific policy language so the model can apply it without retrieval

None of those are trivial. A well-fine-tuned model that produces consistent, on-brand, policy-aware SecOps output is a real improvement over a raw base model.

What fine-tuning is not good for

Fine-tuning is bad at several things that matter for SecOps workflows.

It does not give the model access to current data. A model fine-tuned in March cannot answer questions about a CVE published in April. Keeping the model current means either retraining on a cadence (expensive, and risky for regressions) or adding retrieval augmentation (which is the scaffolding Griffin provides anyway).
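
The retrieval approach is easy to picture: look the current record up at query time and put it in context, rather than baking it into weights. A minimal sketch, assuming a hypothetical in-memory advisory store (this is an illustration, not Griffin's actual API):

```python
# Minimal sketch of retrieval augmentation for currency: the CVE record
# is fetched at query time, so the model reasons over fresh data rather
# than stale training-time knowledge. The store shape is hypothetical.

ADVISORIES = {
    # CVE-2024-3094 is the real xz-utils backdoor advisory, used as an example.
    "CVE-2024-3094": "Backdoor in xz/liblzma 5.6.0-5.6.1 affecting sshd.",
}

def build_prompt(question: str, cve_id: str) -> str:
    """Inject the current advisory text into the prompt context."""
    context = ADVISORIES.get(cve_id, "No advisory on record.")
    return f"Context: {context}\n\nQuestion: {question}"

prompt = build_prompt("Is sshd affected?", "CVE-2024-3094")
```

A fine-tuned model frozen in March would need retraining to answer this; the retrieval wrapper only needs the store updated.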

It does not make the model better at tool use. The tool-calling capabilities of open-weight models come from their base training. Fine-tuning on SecOps data might teach the model to phrase tool calls in a particular way, but it does not add new tools or fix the underlying failure modes of generic tool use.

It does not add memory. Each call is still stateless. The context that the fine-tune encodes is frozen at training time.

It does not validate outputs. A fine-tuned model will confidently produce wrong output, just with more convincing phrasing.

It does not fix calibration. A fine-tuned model has its own confidence distribution, which rarely matches the distribution needed for production decisions.

In other words, fine-tuning improves the fluency and prior of the model. It does not replace the engine.

The training data problem

Fine-tuning on internal SecOps data sounds clean until you start assembling the dataset. The data you want is:

  • Historical incident reports with their final dispositions
  • Tickets that went through the full lifecycle from alert to resolution
  • Review threads where experienced engineers explained their reasoning
  • Policy decisions with their justifications
  • Suppression records with their rationales

That data is usually scattered across Jira, Slack, Confluence, GitHub, and whatever ticketing system the SecOps team uses. Most of it is not labelled. Much of it contains sensitive information. A significant fraction of it is outdated or reflects practices the team no longer follows.

Assembling a clean training set is typically six to twelve months of work for a small team, and the result has a shelf life of maybe a year before the tooling, the policies, and the threat landscape have drifted enough that another round is needed.
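
Even the triage pass before any training is real engineering: every record has to be checked for labels, staleness, and sensitive content. A minimal sketch of that filter, with hypothetical field names rather than any real export schema:

```python
# Sketch of the triage pass a training-set build needs before fine-tuning:
# drop unlabelled, stale, or unredacted records. Field names are
# illustrative placeholders, not a real ticketing-system schema.
from datetime import date

def usable(record: dict, cutoff: date) -> bool:
    return (
        record.get("disposition") is not None       # must be labelled
        and record["closed"] >= cutoff              # not outdated practice
        and not record.get("contains_pii", True)    # redaction verified
    )

raw = [
    {"disposition": "true_positive", "closed": date(2024, 5, 1), "contains_pii": False},
    {"disposition": None, "closed": date(2024, 5, 2), "contains_pii": False},
    {"disposition": "benign", "closed": date(2021, 1, 1), "contains_pii": False},
]
clean = [r for r in raw if usable(r, cutoff=date(2023, 1, 1))]
```

Note the default in the PII check: a record that has not been explicitly cleared is excluded. In practice each of these predicates hides weeks of labelling and redaction work.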

Griffin AI does not ask customers to do this. The engine's knowledge of SecOps patterns comes from curated datasets maintained by our research team, plus retrieval over customer-specific context that does not require training. Customer policies are encoded in rules, not weights.

The catastrophic forgetting problem

When you fine-tune an open-weight model on SecOps data, you risk degrading its general capabilities. A model that was good at general reasoning can become narrowly good at SecOps talk and broadly worse at the tasks it used to handle.

The mitigation is careful fine-tuning: low learning rates, representative holdout sets, continued pretraining with a mix of general data. This is a specialist skill. The teams that do it well are research teams with dedicated ML engineers. For a security organisation whose core competency is security, not ML training, the maintenance burden is substantial.
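
The holdout-set part of that mitigation can be made concrete: gate every fine-tune on whether general-capability scores regressed beyond a tolerance. A sketch with illustrative benchmark names and scores:

```python
# Sketch of a forgetting gate: block a fine-tune from shipping if any
# general-capability holdout score drops more than `tolerance` absolute
# accuracy versus the base model. Task names and scores are placeholders.

def regressions(base: dict, tuned: dict, tolerance: float = 0.02) -> list:
    """Return holdout tasks where the tuned model regressed past tolerance."""
    return [task for task, score in base.items()
            if score - tuned.get(task, 0.0) > tolerance]

base_scores  = {"general_reasoning": 0.81, "code": 0.74, "secops_qa": 0.62}
tuned_scores = {"general_reasoning": 0.76, "code": 0.73, "secops_qa": 0.85}

failed = regressions(base_scores, tuned_scores)
```

Here the fine-tune improved the SecOps task but paid for it in general reasoning, which is exactly the trade this gate exists to catch.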

Griffin AI handles this internally. Base models are updated on a cadence that we own, evaluation is continuous, and regressions are caught before they reach production.

The eval harness, again

If there is one theme that recurs across every comparison between Griffin AI and open-weight options, it is the evaluation harness. A fine-tune without an eval harness is a fine-tune that is flying blind.

A SecOps-grade eval harness covers at minimum:

  • Historical alerts with known good dispositions
  • Adversarial inputs designed to trick the model
  • Multi-turn conversations that stress context handling
  • Structured output validation against downstream consumer schemas
  • Latency budgets for each workflow step
  • Cost budgets for each workflow step
  • Regression tests against previously-fixed failure modes

Building this harness is comparable in effort to building the model itself. Griffin AI comes with the harness. A fine-tuned open-weight deployment comes with the model weights and a hope.
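
The first item on that list is the easiest to sketch: replay historical alerts with known dispositions against the model and score the result. A minimal illustration, with a toy classifier standing in for the model under test:

```python
# Minimal sketch of one harness layer: labelled historical alerts
# replayed against the model and scored. Cases and the classifier
# are toy stand-ins; a real harness would cover thousands of cases.

CASES = [
    {"alert": "ssh brute force from known scanner", "expected": "suppress"},
    {"alert": "oauth token used from two countries", "expected": "escalate"},
]

def run_harness(classify) -> float:
    """Replay labelled cases; return the fraction the model disposes correctly."""
    hits = sum(1 for c in CASES if classify(c["alert"]) == c["expected"])
    return hits / len(CASES)

accuracy = run_harness(lambda alert: "suppress" if "scanner" in alert else "escalate")
```

Every other layer on the list, from adversarial inputs to latency budgets, is another `run_harness`-shaped loop with its own cases and its own pass criteria, which is why the harness ends up comparable in effort to the model.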

Tool use, policies, and the engine layer

SecOps workflows lean heavily on tool use. Fetching current alert data from the SIEM, cross-referencing against asset inventory, pulling relevant threat intel, opening or updating tickets, sending notifications. A fine-tuned open-weight model that cannot reliably invoke tools is a chat interface, not a copilot.

Griffin AI's tool layer is a specific, curated set of integrations with contract-level guarantees. Each tool has known inputs, known outputs, retry behaviour, rate limiting, and error handling. The model's job is to decide which tool to call and how to interpret the result. The engine's job is to make that call behave correctly.

Fine-tuning an open-weight model on examples of tool use helps the model produce correct-looking JSON. It does not build the actual tools. That is integration work, and it is work that does not go away just because the model got better at pretending to do it.
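
The difference between correct-looking JSON and a contract can be sketched directly: the engine validates the proposed call against a schema and owns retry behaviour, so a plausible but malformed call fails fast instead of reaching the SIEM. Names and the schema shape here are illustrative, not Griffin's integration API:

```python
# Sketch of a tool contract: the model proposes arguments, the engine
# validates them against a declared schema and enforces bounded retries
# with backoff. Tool name and fields are hypothetical illustrations.
import time

SCHEMA = {"tool": "open_ticket", "required": {"title", "severity"}}

def invoke(args: dict, do_call, retries: int = 3):
    """Validate args against the contract, then call with bounded retries."""
    missing = SCHEMA["required"] - args.keys()
    if missing:
        raise ValueError(f"contract violation: missing {sorted(missing)}")
    for attempt in range(retries):
        try:
            return do_call(args)
        except TimeoutError:
            time.sleep(2 ** attempt)  # exponential backoff on transient errors
    raise RuntimeError("tool unavailable after retries")

result = invoke({"title": "Suspicious login", "severity": "high"},
                do_call=lambda a: {"ticket_id": "TICK-1", **a})
```

The fine-tune can only influence what goes into `args`; everything after the validation line is integration work the weights cannot do.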

Provenance and audit

SecOps workflows have compliance obligations. Every decision the copilot makes will eventually be reviewed by an auditor, a regulator, or a customer's security team. "The model said so" is not acceptable provenance.

Griffin AI produces structured provenance for every decision: which inputs were considered, which tools were called, which validators ran, which models contributed to which step, which policies applied. The provenance is auditable and reproducible.
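
A provenance record along those lines can be sketched as a structured object serialised per decision. The field names below mirror the list above but are an illustration, not Griffin's actual schema:

```python
# Sketch of a structured, serialisable provenance record per decision.
# Field names mirror the prose above; identifiers are made-up examples.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Provenance:
    decision: str
    inputs: list = field(default_factory=list)       # which inputs were considered
    tools_called: list = field(default_factory=list) # which tools were called
    validators: list = field(default_factory=list)   # which validators ran
    models: dict = field(default_factory=dict)       # step -> model version
    policies: list = field(default_factory=list)     # which policies applied

rec = Provenance(
    decision="suppress",
    inputs=["alert:8841", "asset:web-03"],
    tools_called=["siem.fetch", "inventory.lookup"],
    validators=["schema.v2", "policy.gate"],
    models={"triage": "engine-2025.1"},
    policies=["SUP-114"],
)
audit_line = json.dumps(asdict(rec), sort_keys=True)  # deterministic audit record
```

An auditor can diff two such records; extracting the same facts from free text after the fact is the retrofit the next paragraph describes.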

A fine-tuned open-weight model, used directly, produces text. Extracting provenance from that text is a retrofit that often does not survive contact with a real auditor.

When fine-tuning an open-weight model makes sense

There are specific scenarios where fine-tuning is the right move:

  • Customising the tone and output format of Griffin's own responses for specific internal audiences, as a surface-level personalisation layer
  • Research teams that want to understand how models respond to SecOps-specific data
  • Deeply air-gapped environments where the full Griffin engine cannot run and a fine-tuned model is the best available option
  • Specialised narrow tasks, such as classifying internal alert taxonomies, where a fine-tune can dominate

For general-purpose SecOps copilot workflows, fine-tuning an open-weight model is a detour. The engine is the work. The model is the commodity.

The honest framing

"Let us fine-tune an open-weight model and build our own SecOps copilot" is a project that teams often start and rarely finish to the standard they hoped for. The first demo looks great. The first production incident reveals the gaps. The next year is spent building the scaffolding that Griffin AI already has.

The more productive framing is: use Griffin AI for the workflow, use fine-tuning for the narrow personalisation that genuinely benefits from it, and do not confuse the two. Fine-tuning is a real technique with real value. It is not a substitute for an engine.
