AI Security

Domain-Adapted LLMs for Vulnerability Detection in 2026

Domain adaptation has quietly become the default for LLM-assisted vulnerability detection. A look at what works in 2026, what does not, and what teams should plan for next.

Shadab Khan
AI Security Researcher
6 min read

Vulnerability detection with language models looks very different in 2026 than it did eighteen months ago. Domain-adapted models are now the default for production pipelines, not a curiosity. The results have pushed static analysis vendors, runtime security vendors, and dependency scanners to incorporate specialised LLMs into their detection stacks. This post surveys what is working, what is not, and how teams should think about the category going forward.

What domain adaptation actually means

Domain adaptation covers a spectrum of techniques. At one end, you have continued pre-training where a general model is further trained on a large corpus of security-relevant text, including source code, CVE descriptions, exploit proof-of-concept code, advisory writeups, and remediation patches. At the other end, you have targeted fine-tuning for specific detection tasks like classifying whether a given code diff introduces a vulnerability.

In between sits instruction tuning on security tasks, which teaches a model to follow detection-shaped prompts reliably, and preference tuning where human analysts rate the quality of detection outputs and the model learns to produce more of the good ones. Most production detection pipelines in 2026 use some combination of these stages, often stacked on top of each other.
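To make the targeted end of that spectrum concrete, here is a minimal sketch of what a single supervised fine-tuning example for the diff-classification task might look like. The field names, label taxonomy, and JSON Lines storage are illustrative assumptions, not any vendor's actual schema.

    # Illustrative shape of one fine-tuning example for the
    # "does this diff introduce a vulnerability?" task.
    # Field names and labels are assumptions, not a real vendor schema.
    import json

    example = {
        "instruction": "Decide whether this code diff introduces a vulnerability. "
                       "Answer with a label and a one-sentence justification.",
        "input": (
            "--- a/app/db.py\n"
            "+++ b/app/db.py\n"
            "@@ -12,7 +12,7 @@\n"
            "-    cur.execute(\"SELECT * FROM users WHERE id = %s\", (user_id,))\n"
            "+    cur.execute(f\"SELECT * FROM users WHERE id = {user_id}\")\n"
        ),
        "output": "vulnerable: CWE-89 SQL injection via unparameterised f-string query.",
    }

    # Instruction-tuning corpora are typically stored as JSON Lines, one example per line.
    print(json.dumps(example))

Continued pre-training and preference tuning operate on much larger and messier corpora, but the task-specific stages tend to bottom out in examples of roughly this shape.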

The point of all this adaptation is to give the model an internalised sense of what vulnerabilities look like. A general model can reason about security if prompted correctly, but a domain-adapted model has the reflexes built in. It notices that a specific string concatenation pattern looks like SQL injection before it notices anything else about the code.

What works well

Pattern detection on well-represented vulnerability classes is the strongest success story. Classic injection vulnerabilities, path traversal, command injection, insecure deserialisation, and memory safety issues in C and C++ code are now being caught at rates that compete with specialised static analysis tools. The LLM-based detectors benefit from being more robust to code style variations and non-standard framework usage, where pattern-matching tools often fall short.

Cross-file reasoning has also improved significantly. Vulnerabilities where a sink in one file is reachable only through a call path that traverses several files used to be out of reach for LLMs with limited context windows. With modern context lengths measured in millions of tokens, combined with retrieval strategies that pull relevant code into context, detectors can now reason across call graphs that would have been infeasible in 2024.
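One way to picture the retrieval side of this: starting from a flagged sink, walk the call graph outwards and pack caller source into the prompt until a token budget is exhausted. The sketch below assumes hypothetical call_graph and read_source helpers standing in for whatever code indexing a real pipeline uses.

    # Minimal sketch of cross-file context assembly for a flagged sink.
    # `call_graph` maps a function to its callers; `read_source` returns its code.
    # Both are hypothetical stand-ins for a real code index.
    from collections import deque

    def assemble_context(sink_func, call_graph, read_source, token_budget=200_000):
        """Pack the sink plus its transitive callers into one prompt string."""
        context_parts = [read_source(sink_func)]
        used = len(context_parts[0]) // 4        # rough chars-per-token estimate
        seen = {sink_func}
        queue = deque(call_graph.get(sink_func, []))
        while queue and used < token_budget:
            caller = queue.popleft()
            if caller in seen:
                continue
            seen.add(caller)
            src = read_source(caller)
            used += len(src) // 4
            context_parts.append(src)
            queue.extend(call_graph.get(caller, []))
        return "\n\n".join(context_parts)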

Natural language understanding of security advisories has also matured. Domain-adapted models reliably extract affected versions, vulnerable code paths, and remediation guidance from the inconsistent prose of real-world advisories. This is less glamorous than code-level detection but has significant downstream value because it feeds structured data to every other tool in the pipeline.
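The extraction step itself is usually just a schema-constrained prompt. The sketch below shows one plausible shape; the schema fields and the complete() callable are illustrative assumptions rather than a specific product's API.

    # Sketch of structured extraction from advisory prose.
    # Schema fields and the `complete` callable are illustrative assumptions.
    import json

    ADVISORY_SCHEMA = {
        "affected_package": "string",
        "affected_versions": "version range, e.g. '>=2.0.0 <2.3.1'",
        "vulnerable_paths": "list of file or function names, if stated",
        "remediation": "upgrade / patch / workaround guidance",
    }

    def extract_advisory_fields(advisory_text, complete):
        prompt = (
            "Extract the following fields from the security advisory below. "
            "Return JSON only, with null for anything not stated.\n"
            f"Schema: {json.dumps(ADVISORY_SCHEMA)}\n\n"
            f"Advisory:\n{advisory_text}"
        )
        return json.loads(complete(prompt))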

What does not work well

Novel vulnerability classes remain a weakness. Any class of vulnerability that is under-represented in training data tends to be missed. Logic vulnerabilities, authorisation flaws, business logic bugs, and protocol-level weaknesses fall into this category. These require reasoning about intended behaviour, which LLMs struggle with because the intent is usually not written down. Teams that expect a domain-adapted model to catch these will be disappointed.

False positive rates remain high on certain categories. Code that looks vulnerable but is protected by upstream validation is a persistent source of false alarms. Frameworks that automatically escape or sanitise input confuse detection models that have been trained to pattern-match on the raw sink. The fix for this is usually context injection through grounding, where the model is also shown the framework's security properties, but that adds latency and complexity.
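In practice, that grounding step often amounts to prepending what is known about the framework's built-in protections before asking for a verdict. A minimal sketch, assuming a hand-maintained map of framework notes and a finding dict with a snippet field (both hypothetical):

    # Sketch of context injection for the false-positive problem.
    # FRAMEWORK_NOTES is an illustrative, hand-maintained map; `finding` is
    # assumed to carry the flagged code under a "snippet" key.
    FRAMEWORK_NOTES = {
        "django.db.models": "QuerySet methods parameterise SQL; raw() and extra() do not.",
        "jinja2": "Autoescaping is on for .html templates; |safe and Markup() bypass it.",
    }

    def build_review_prompt(finding, frameworks):
        notes = [FRAMEWORK_NOTES[f] for f in frameworks if f in FRAMEWORK_NOTES]
        grounding = "\n".join(f"- {n}" for n in notes) or "- none known"
        return (
            "Known framework protections in this codebase:\n"
            f"{grounding}\n\n"
            "Given those protections, is the following finding a true positive?\n"
            f"{finding['snippet']}"
        )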

Determinism is another persistent problem. The same detector given the same code on two different days may produce different findings. For pipelines that need audit trails and reproducible reviews, this is a real operational issue. Teams address it with temperature-zero inference, careful seed management, and ensemble voting, but residual non-determinism remains.
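The ensemble-voting part of that mitigation is simple to express: run the detector several times at temperature zero and keep only findings that clear a quorum. The run_detector wrapper and the (file, line, rule) dedup key below are assumptions about how a pipeline might identify a finding.

    # Sketch of ensemble voting over repeated detector runs.
    # `run_detector` is a hypothetical wrapper around the model call; findings
    # are deduplicated by a (file, line, rule) key, which is an assumption.
    from collections import Counter

    def stable_findings(code, run_detector, n_runs=5, quorum=3):
        votes = Counter()
        for _ in range(n_runs):
            for f in run_detector(code, temperature=0.0):
                votes[(f["file"], f["line"], f["rule"])] += 1
        return [key for key, count in votes.items() if count >= quorum]

This trades inference cost for repeatability, which is usually the right trade for audit-facing pipelines.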

The emerging architecture

The detection architecture that has become common in 2026 is a three-layer pipeline. The first layer is fast and recall-heavy. A small domain-adapted model scans every file and flags anything that looks plausibly concerning. This layer produces many false positives on purpose because its job is not to miss anything.

The second layer is a precision filter. A larger domain-adapted model, often with retrieval augmentation, reviews each flagged finding with more context. It pulls in related files, framework documentation, and historical findings on the same code. It either confirms the finding or discards it. This layer reduces the false positive rate dramatically without missing much that the first layer caught.

The third layer is prioritisation and remediation. A frontier model, typically grounded in the current CVE database and the codebase's broader context, takes the confirmed findings and produces prioritised remediation recommendations. This layer benefits from general reasoning capability because the task involves judgment about business impact and fix complexity.

This three-layer pattern works because it matches model capability to task difficulty. Cheap, fast models do the bulk filtering. Expensive, capable models do the judgment calls. The overall cost per finding is manageable even at scale.
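Stripped to its control flow, the pipeline is short. The three callables below stand in for the small adapted scanner, the larger retrieval-augmented reviewer, and the grounded frontier model; their names and return shapes are assumptions for illustration.

    # Minimal sketch of the three-layer pipeline described above.
    # `fast_scan`, `precision_review`, and `prioritise` are hypothetical
    # wrappers for the three model tiers.
    def detect(files, fast_scan, precision_review, prioritise):
        # Layer 1: recall-heavy scan over every file; over-flagging is expected.
        candidates = [f for path in files for f in fast_scan(path)]

        # Layer 2: precision filter with extra context; drop unconfirmed findings.
        confirmed = [f for f in candidates if precision_review(f)["confirmed"]]

        # Layer 3: prioritised remediation plan over the confirmed set.
        return prioritise(confirmed)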

Evaluation is harder than it looks

The field has not yet settled on good evaluation practices for domain-adapted detection. Vendor benchmarks typically show detection rates on curated vulnerability datasets, but these datasets leak into training data often enough that the reported numbers overstate real-world performance. When teams run the same detectors on fresh code from their own repositories, the recall and precision numbers are usually worse.

The most rigorous evaluation we have seen involves withholding a time-bounded slice of recent CVE-affected code and measuring detection performance on it. This approach respects the temporal structure of the problem because real detection always happens on code that the model has not seen before. Any team selecting a detection model should ask the vendor about their temporal evaluation methodology and push back if they only have retrospective numbers.
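A temporal holdout is easy to express once the CVE publication dates are attached to the evaluation records. The sketch below assumes records carrying cve_published, cwe, and code fields and a detect() callable; all of those names are illustrative.

    # Sketch of a temporal holdout: only CVE-affected code published after the
    # model's training cutoff counts, so the benchmark cannot have leaked into
    # training. Record fields and `detect` are illustrative assumptions.
    from datetime import date

    def temporal_recall(records, detect, cutoff):
        held_out = [r for r in records if r["cve_published"] > cutoff]
        if not held_out:
            return float("nan")
        hits = sum(
            1 for r in held_out
            if any(f["cwe"] == r["cwe"] for f in detect(r["code"]))
        )
        return hits / len(held_out)

    # Example: temporal_recall(records, detect, cutoff=date(2025, 6, 1))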

Measuring false positive rate in realistic conditions is also difficult. Lab benchmarks usually exclude the mitigating context that would suppress findings in production. If your codebase uses strong upstream validation, the false positive rate reported by a vendor on their benchmark corpus will underestimate what you see in your own code.

What to plan for

Teams planning a vulnerability detection stack in 2026 should assume that domain adaptation is table stakes, not a differentiator. Every serious vendor now offers some form of adapted model. The questions to ask are about adaptation quality and the broader pipeline. How recent is the training data? How is the model evaluated against fresh vulnerabilities? How are false positives managed? How does grounding work for current CVE data?

Plan for the reality that no single model will dominate all vulnerability classes. The pipeline will need to route findings through different detectors based on file type, language, and vulnerability category. The vendor that gives you the best flexibility to mix detectors is often a better choice than the vendor with the highest score on a single benchmark.
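In configuration terms, that routing is little more than a lookup keyed on language and vulnerability category, with a fallback. The registry contents below are placeholders, not real model names.

    # Sketch of detector routing by (language, category) with wildcard fallback.
    # Registry entries are illustrative placeholders, not real model names.
    DETECTOR_REGISTRY = {
        ("python", "injection"): "small-adapted-py",
        ("c", "memory-safety"): "adapted-c-memsafe",
        ("*", "*"): "general-fallback",
    }

    def route(language, category):
        for key in [(language, category), (language, "*"), ("*", category), ("*", "*")]:
            if key in DETECTOR_REGISTRY:
                return DETECTOR_REGISTRY[key]
        raise KeyError("no detector registered")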

Finally, plan for the model landscape to keep moving. The detection quality of 2027 will exceed what we have today. Architectures that are easy to upgrade will age better than architectures that are tightly coupled to a specific model vintage.
