Distillation is the unfashionable cousin of fine-tuning. It does not get the marketing attention that large models attract, but for production security pipelines it is quietly doing most of the work. The pattern is simple. A large teacher model produces high-quality outputs on a security task. A small student model is trained to imitate those outputs. The student ends up much cheaper and faster than the teacher while retaining most of the capability on the specific task. For workflows that run millions of times a day, distillation turns an unaffordable pipeline into an affordable one.
Why distillation fits security
Security workflows have a specific shape that makes them ideal distillation targets. They are high-volume, narrow in scope, and tolerant of some quality loss in exchange for lower cost and latency. Classifying findings, suggesting upgrades, summarising advisories, generating CVSS vectors from descriptions, and routing tickets to teams are all examples of tasks where a small, distilled model can do nearly as well as a large one at a fraction of the cost.
The alternative to distillation is running the large model on every instance, which rapidly becomes economically infeasible. A team processing a million dependency findings per day cannot afford to call a top-tier frontier model a million times. Even at a cent per call, that is ten thousand dollars a day on a single workflow. A distilled seven-billion-parameter model running on modest hardware can handle the same volume for a small fraction of that cost.
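The arithmetic is worth scripting as a back-of-the-envelope check. The per-call prices below are illustrative assumptions, not quotes from any provider; substitute your own numbers.

```python
# Rough daily-cost comparison for a high-volume workflow.
# Both per-call prices are illustrative assumptions.

calls_per_day = 1_000_000

teacher_cost_per_call = 0.01     # assumed ~1 cent per frontier-model call
student_cost_per_call = 0.0002   # assumed amortised cost per distilled-student call

teacher_daily = calls_per_day * teacher_cost_per_call
student_daily = calls_per_day * student_cost_per_call

print(f"teacher: ${teacher_daily:,.0f} per day")   # $10,000 per day
print(f"student: ${student_daily:,.0f} per day")   # $200 per day
print(f"annual saving: ${(teacher_daily - student_daily) * 365:,.0f}")
```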
Distillation also produces models with consistent output formats, which matters for downstream processing. A fine-tuned or distilled model learns the exact format your pipeline expects and produces it reliably. A general model often drifts in formatting across calls, which forces extra parsing logic and retry handling in the pipeline.
The distillation recipe
The practical recipe has a few steps. First, define the task precisely. "Classify this finding" is too vague. "Given this finding in the format X, produce a JSON object with fields Y" is the level of specificity distillation needs. The task definition doubles as the output contract for the distilled model.
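A minimal sketch of what that contract can look like, with hypothetical field names; the point is that the prompt and the validator describe the same format, so the distilled model is trained and checked against one definition.

```python
# Task definition doubling as an output contract. Field names are hypothetical.

import json

TASK_PROMPT = (
    "Given a dependency finding in the JSON format below, return a JSON object "
    "with exactly these fields: severity (one of low/medium/high/critical), "
    "exploitable (true or false), and route_to (the owning team's name)."
)

def validate_output(raw: str) -> dict:
    """Reject anything that does not match the contract the student was trained on."""
    obj = json.loads(raw)
    if set(obj) != {"severity", "exploitable", "route_to"}:
        raise ValueError(f"unexpected fields: {sorted(obj)}")
    if obj["severity"] not in {"low", "medium", "high", "critical"}:
        raise ValueError(f"bad severity: {obj['severity']}")
    if not isinstance(obj["exploitable"], bool):
        raise ValueError("exploitable must be a boolean")
    return obj
```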
Second, generate training data by running the teacher model on a representative sample of inputs. The sample needs to cover the distribution of real inputs, not just easy cases. Bias toward ambiguous cases if you can identify them because those are where the student will struggle most and where the teacher's reasoning is most valuable to copy.
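A sketch of that step, reusing the TASK_PROMPT contract from above. The call_teacher function and the ambiguous flag are assumptions standing in for whatever frontier-model API and hard-case heuristic you use, such as disagreement between scanners.

```python
# Teacher-label generation with deliberate oversampling of ambiguous cases.
# `call_teacher` and the `ambiguous` flag are assumptions, not a specific API.

import json
import random

def build_training_set(findings, call_teacher, target_size=5000, hard_fraction=0.4):
    hard = [f for f in findings if f.get("ambiguous")]
    easy = [f for f in findings if not f.get("ambiguous")]

    n_hard = min(int(target_size * hard_fraction), len(hard))
    n_easy = min(target_size - n_hard, len(easy))
    sample = random.sample(hard, n_hard) + random.sample(easy, n_easy)

    pairs = []
    for finding in sample:
        prompt = TASK_PROMPT + "\n\n" + json.dumps(finding)
        answer = call_teacher(prompt)   # teacher output becomes the training target
        pairs.append({"input": prompt, "output": answer})
    return pairs
```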
Third, train the student on the input-output pairs. Modern training recipes use distillation-specific losses that go beyond just matching the teacher's final answer. They try to match the teacher's output distribution across possible answers, which gives the student a richer signal about how confident to be in different situations. This matters because a naive student trained only on hard labels tends to be over-confident in the wrong places.
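For a classification-style task, the standard temperature-scaled formulation looks roughly like the sketch below: a minimal PyTorch version, assuming you stored the teacher's logits over the label set when generating the training data.

```python
# Minimal soft-label distillation loss in the standard temperature-scaled form.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the teacher's full distribution over answers,
    # not just its argmax. The temperature exposes how the teacher ranks
    # the runner-up answers, which is the richer confidence signal.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy against the teacher's final answer.
    hard = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft + (1 - alpha) * hard
```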
Fourth, evaluate rigorously. The student will not match the teacher perfectly. The question is whether the gap is tolerable for the workflow. For some workflows, a five percent drop in accuracy is a dealbreaker. For others, especially filtering tasks where missed positives get caught by a downstream stage, a larger drop is acceptable. Pick your evaluation metric to match the downstream consequences.
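A small example of matching the metric to the consequence: for a filtering stage where missed positives are the real cost, track recall on the positive class rather than raw agreement. The keep/drop labels here are hypothetical.

```python
# Evaluation sketch for a keep/drop filtering task. Label values are hypothetical.

def evaluate(pairs):
    """pairs: list of (student_label, reference_label) on a held-out set."""
    tp = sum(1 for s, r in pairs if s == "keep" and r == "keep")
    fn = sum(1 for s, r in pairs if s == "drop" and r == "keep")
    agree = sum(1 for s, r in pairs if s == r)

    recall = tp / (tp + fn) if (tp + fn) else 1.0   # missed positives hurt most
    accuracy = agree / len(pairs)
    return {"recall": recall, "accuracy": accuracy}
```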
Size selection
Choosing the student model size is an engineering decision that balances cost against capability. Smaller students cost less per call but hit a capability floor below which the task cannot be learned well. In our experience with security workflows, the sweet spot for most narrow tasks is around three to seven billion parameters. Below that, even well-distilled models struggle with tasks that require any multi-step reasoning. Above that, you are paying for capacity you do not need.
Quantisation is a separate knob. A seven-billion-parameter model quantised to four bits per parameter can run inference on a modest GPU or even on CPU for certain workloads. The quality loss from quantisation is usually small compared to the quality loss from picking a smaller parameter count, so a quantised larger student often beats an unquantised smaller one.
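One way to load a student in four-bit for inference is the Hugging Face transformers and bitsandbytes integration; the checkpoint name below is a placeholder for your own distilled student.

```python
# Loading a distilled student in 4-bit via transformers + bitsandbytes.
# The checkpoint name is a placeholder, not a real published model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/distilled-student-7b",   # placeholder checkpoint name
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-org/distilled-student-7b")
```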
Architecture also matters. Modern small models optimised for inference throughput can be dramatically faster than older architectures of the same parameter count. When selecting a student base model, pay attention to inference benchmarks on hardware you actually have, not just to capability benchmarks.
Pitfalls we have seen
The most common distillation failure is training on too narrow a sample. The student performs well in testing, then fails in production on inputs that never appeared in training. Security inputs have long tails. Unusual package names, obscure CVE fields, non-English advisory text, and code in less common languages all hit distilled models hard if they were not represented. The fix is to invest in sample coverage, including deliberately seeking out edge cases.
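A rough coverage check is cheap to run before training: compare the categorical values seen in production traffic against those present in the training sample. The field name below is a hypothetical example.

```python
# Surface long-tail values present in production but absent from training data.

from collections import Counter

def coverage_gaps(training_inputs, production_inputs, field="ecosystem"):
    trained = Counter(x.get(field, "unknown") for x in training_inputs)
    live = Counter(x.get(field, "unknown") for x in production_inputs)
    return {value: count for value, count in live.items() if value not in trained}
```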
The second common failure is trusting the teacher too much. Distillation is teacher-bounded. If the teacher makes a systematic mistake, the student will learn that mistake. We have seen cases where a frontier model incorrectly assigns severity in a specific edge case and the distilled model reliably reproduces the error because it was never corrected during distillation. Audit the teacher's outputs on edge cases before using them as training signal.
The third failure is neglecting drift. A model distilled from the 2024 teacher on 2024 data behaves like a 2024 expert. As the domain shifts and the teacher improves, the student stays frozen. Plan for periodic redistillation from a newer teacher, especially for workflows where domain knowledge is part of the task.
Where distillation does not fit
Not every security task is a good distillation target. Tasks that require current knowledge are better served by grounding, because distillation cannot give a small model access to facts it was never shown. Tasks that involve open-ended reasoning across diverse inputs are better served by a frontier model, because distillation works best when the input distribution is bounded.
Tasks that need to produce traceable evidence are also poor distillation targets. A grounded system can cite the specific CVE or advisory it used to answer. A distilled model cannot cite anything because its answers come from internalised patterns rather than retrieved documents. For compliance-sensitive workflows, grounding usually beats distillation.
How to introduce distillation
The cheapest way to get started is to find a single high-volume workflow in your pipeline that currently calls a large model and measure what it would cost to replace that call with a distilled student. Pick a workflow where the inputs and outputs are well-defined and the evaluation is clear. Generate a few thousand teacher outputs, distill a small student, and run it in shadow mode alongside the teacher to compare quality on real traffic.
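Shadow mode can be as simple as logging both answers on live traffic while the teacher's answer remains the one the pipeline acts on. The client objects here are stand-ins for however you call each model.

```python
# Shadow-mode sketch: serve the teacher, log the student, compare offline.
# `teacher` and `student` are stand-in callables, not a specific API.

import json
import logging

log = logging.getLogger("distillation.shadow")

def handle_finding(finding, teacher, student):
    prompt = json.dumps(finding)
    teacher_answer = teacher(prompt)        # still the answer the pipeline uses
    try:
        student_answer = student(prompt)    # shadow call, result is only logged
    except Exception as exc:                # student failures must not break prod
        student_answer = f"<error: {exc}>"

    log.info(json.dumps({
        "finding_id": finding.get("id"),
        "teacher": teacher_answer,
        "student": student_answer,
        "agree": teacher_answer == student_answer,
    }))
    return teacher_answer
```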
If the quality is acceptable, switch the workflow over. Measure the cost and latency improvements. Use the savings to fund the next distillation. Over time, the pipeline accumulates a portfolio of distilled students handling the high-volume narrow work while a small number of large-model calls handle the open-ended tasks at the edges. That architecture is boring but it scales, and scale is ultimately what security AI needs.