AI Security

Why LLMs Are Structurally Insecure (and What That Means for Your Pipeline)

Language models are not insecure because of a bug you can patch. They are insecure by construction — non-deterministic, context-poisonable, and unreproducible. Here is how to reason about them without pretending otherwise.

Nayan Dey
Senior Security Engineer
7 min read

Every few months a new LLM vulnerability gets a logo and a press cycle — indirect prompt injection, training data extraction, membership inference, tool-call hijacking. The response is always the same: vendor ships a mitigation, researchers point out that the mitigation can be bypassed, vendor ships another mitigation, cycle repeats. The pattern is exhausting because it frames the problem the wrong way. The problem is not that LLMs have security bugs. The problem is that LLMs are structurally insecure in a way no patch cycle will ever fully resolve. They behave like an untrusted compiler you cannot reproduce, fed by an input channel you cannot fully validate, producing output you cannot fully constrain. That is not a CVE. That is a class of system. Once you accept that and stop trying to treat LLM security like web app security, a workable threat model becomes possible.

In what sense are LLMs "structurally" insecure?

Four properties, none of which are bugs:

Non-determinism. Identical inputs do not reliably produce identical outputs. Even with temperature zero and fixed seeds, floating-point non-associativity on different GPU hardware means bit-identical reproduction is not guaranteed across deployments. Security controls built on exact output matching do not work.

Instruction/data conflation. The model cannot robustly distinguish instructions from data. Any content in the context window — retrieved documents, tool outputs, prior turn history, user uploads — competes for the model's attention with the system prompt. This is the root cause of prompt injection, and it is a property of the architecture, not a flaw of any specific model.

Unbounded output space. Unlike a compiler that emits one of a finite set of well-formed token sequences, an LLM can emit anything representable in the output vocabulary. Sandboxing the output ("only return JSON with these fields") is a convention, not an enforceable constraint. Models violate output contracts under adversarial pressure.
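To make that concrete, here is a minimal fail-closed parser sketch in Python. The field names, severity allow-list, and the OutputContractViolation exception are illustrative placeholders rather than any particular product's API; the point is that the contract lives in code at the trust boundary, not in the prompt.

```python
import json

# Illustrative output contract -- field names and allowed values are examples only.
ALLOWED_FIELDS = {"title": str, "severity": str, "cve_ids": list}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}

class OutputContractViolation(Exception):
    """Raised whenever model output does not match the expected contract."""

def parse_model_output(raw: str) -> dict:
    """Fail closed: anything that is not exactly the expected shape is rejected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputContractViolation(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or set(data) != set(ALLOWED_FIELDS):
        raise OutputContractViolation("unexpected field set")

    for field, expected_type in ALLOWED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            raise OutputContractViolation(f"field {field!r} has the wrong type")

    if data["severity"] not in ALLOWED_SEVERITIES:
        raise OutputContractViolation(f"severity {data['severity']!r} not in the allow-list")
    if not all(isinstance(c, str) and c.startswith("CVE-") for c in data["cve_ids"]):
        raise OutputContractViolation("cve_ids contains a non-CVE entry")

    return data
```

The prompt can still ask for this shape, and it usually helps, but only the parser can refuse everything else and hand the caller the decision about what a safe default looks like.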

Training data opacity. Even open-weight models rarely have fully auditable training corpora. You cannot prove the model does not contain memorized secrets, copyrighted code, or behavior-shaping data injected by an adversary during a supply chain attack on the training pipeline. No standardized, verifiable equivalent of an SBOM exists for model weights.

These four properties combine. A non-deterministic, instruction-confused system with unbounded output fed by opaque training data is, to a security reviewer, a maximally permissive blob of behavior. That is the honest baseline.

Isn't prompt injection just an input validation problem?

No, and this is the single most expensive misconception in the space. Input validation works when you can enumerate the set of valid inputs and reject anything outside it. LLMs are valuable precisely because they accept unbounded natural language — restricting that input in any meaningful way defeats the use case. Every proposed "prompt firewall" that claims to solve prompt injection has so far been bypassed within days of public release. The research record here is brutal and consistent.

The working mental model is that any text reaching the model's context window is a candidate instruction, and the model will sometimes follow it. Your threat model should treat retrieval context, tool outputs, and prior conversation turns as equivalent in privilege to the user's prompt. Anything less is wishful thinking. This means designs that put an LLM behind a system prompt saying "do not reveal X" and expect it to hold under adversarial pressure are not a security control — they are a usability hint.
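One way to express that posture in code is to label every context segment by its source, treat everything except the operator-authored system prompt as untrusted, and move the "do not reveal X" requirement out of the prompt and into an egress check on the output. The ContextSegment class and the SECRETS set below are hypothetical placeholders for illustration, not a real framework API.

```python
from dataclasses import dataclass

@dataclass
class ContextSegment:
    source: str  # e.g. "system", "user", "retrieval", "tool_output"
    text: str

    @property
    def trusted(self) -> bool:
        # Only the system prompt you authored is trusted. Retrieved documents,
        # tool outputs, uploads, and prior turns all get the same answer as
        # arbitrary user input: no.
        return self.source == "system"

# Hypothetical stand-in for whatever "do not reveal X" was meant to protect.
# The control is an egress filter on output, not an instruction in the prompt.
SECRETS = {"example-production-api-key", "example-internal-db-password"}

def egress_filter(model_output: str) -> str:
    """Redact known secrets from model output regardless of what the context asked for."""
    for secret in SECRETS:
        model_output = model_output.replace(secret, "[REDACTED]")
    return model_output
```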

How does the training pipeline become a supply chain risk?

Modern foundation models are trained on datasets measured in trillions of tokens drawn from a mix of web crawl, licensed corpora, and curated sources. The fine-tuning stage adds smaller, higher-signal datasets — instruction pairs, human preference data, RLHF reward signals. Each of these is a supply chain attack surface.

The 2024 Hugging Face typosquatting incident made the point publicly: attacker-uploaded models mimicking legitimate names accumulated thousands of downloads before takedown. More subtly, research on backdoor attacks during fine-tuning has demonstrated that a very small number of carefully crafted training examples (on the order of 0.1% of a dataset) can implant trigger phrases that cause specific misbehavior at inference time, with no measurable drop in benchmark performance. The backdoor is invisible unless you know the trigger to test for.

From a supply chain security standpoint, this means "we use a well-known frontier model" is not a full answer. The model weights are an artifact with provenance, and you want — at minimum — a signed attestation of which weights are loaded, a hash match against a published reference, and a monitoring layer that flags behavior drift across deployments.
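A minimal version of the hash check is a few lines of standard-library Python. The expected digest and the file path below are placeholders; in practice the reference value would come from a signed attestation or a checksum published through a channel you trust independently of the artifact itself.

```python
import hashlib
import pathlib

# Placeholder reference digest -- replace with the value from a signed
# attestation or published checksum you obtained out of band.
EXPECTED_SHA256 = "0" * 64

def sha256_of(path: pathlib.Path, chunk_size: int = 1 << 20) -> str:
    """Stream the weights file through SHA-256 so multi-GB artifacts never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path: str) -> None:
    actual = sha256_of(pathlib.Path(path))
    if actual != EXPECTED_SHA256:
        # Fail closed: refuse to load a model whose provenance does not check out.
        raise RuntimeError(f"model weights hash mismatch: {actual}")

# verify_weights("models/example-weights.safetensors")  # hypothetical path, illustration only
```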

Why do tool-using agents make all of this worse?

Because they convert a text-generation system into an action-taking system, and actions have blast radius. A chatbot that emits a bad sentence is an annoyance. An agent that emits a bad execute_sql call is an incident. The standard agent loop — model proposes a tool call, executor runs it, result returns to the model — means prompt injection in any upstream input can escalate directly to arbitrary execution of whatever tools the agent has access to.
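Here is that loop stripped to a skeleton, with the model call and tool executor stubbed out; the shape is generic rather than any specific framework's API. The line that matters is the one that appends tool output back into the prompt: that is where an instruction embedded in a fetched page or query result re-enters the context with the same standing as everything else.

```python
def call_model(prompt: str) -> dict:
    """Placeholder for the actual LLM call; returns either a tool call or a final answer."""
    raise NotImplementedError

def run_tool(name: str, args: dict) -> str:
    """Placeholder executor -- imagine execute_sql, send_email, fetch_url."""
    raise NotImplementedError

def agent_loop(user_request: str, max_steps: int = 10) -> str:
    prompt = f"User request:\n{user_request}\n"
    for _ in range(max_steps):
        decision = call_model(prompt)
        if decision["type"] == "final":
            return decision["text"]
        # The tool result -- fetched web pages, query rows, file contents -- is
        # appended to the prompt verbatim. Any instruction embedded in it now
        # competes with the system prompt on the next iteration.
        result = run_tool(decision["tool"], decision["args"])
        prompt += f"\nTool {decision['tool']} returned:\n{result}\n"
    return "step budget exhausted"
```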

The common mitigations are partial. Human-in-the-loop approval works until users succumb to approval fatigue. Tool-scope narrowing works until the narrowed scope is still enough for an attacker (most real-world scopes are). Confirmation prompts work until the model is manipulated into auto-confirming. There is no complete defense here; there is a stack of partial defenses, and the job is to pick enough of them that the aggregate residual risk is tolerable for the use case.

Three defenses that consistently pay back their complexity cost: capability-based tool gating (the tool checks the caller's authority, not the model's claim), out-of-band confirmation for irreversible actions (SMS, not a chat turn), and trace-level tool-call auditing with alerting on distribution shift.
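A sketch of what the first two look like in practice, under assumptions made up for illustration: a per-session capability set granted out of band, a static policy table mapping tools to required capabilities, and a confirmation stub that never passes through the chat. Names such as TOOL_POLICY and gated_run_tool are hypothetical, and the logging calls stand in for whatever trace pipeline you already run.

```python
import json
import logging
from dataclasses import dataclass, field

audit_log = logging.getLogger("tool_calls")

@dataclass
class Session:
    user_id: str
    # Capabilities are granted out of band (role, explicit grant) and checked by
    # the tool layer -- never inferred from anything the model claims about itself.
    capabilities: set[str] = field(default_factory=set)

# Hypothetical policy table: each tool names the capability it requires and
# whether its effects are irreversible.
TOOL_POLICY = {
    "read_ticket": {"requires": "tickets:read", "irreversible": False},
    "execute_sql": {"requires": "db:write", "irreversible": True},
}

def run_tool(name: str, args: dict) -> str:
    """Placeholder executor, as in the agent loop sketch above."""
    raise NotImplementedError

def confirm_out_of_band(session: Session, tool: str, args: dict) -> bool:
    """Placeholder for a push/SMS confirmation that never passes through the chat."""
    raise NotImplementedError

def gated_run_tool(session: Session, tool: str, args: dict) -> str:
    policy = TOOL_POLICY.get(tool)
    if policy is None or policy["requires"] not in session.capabilities:
        audit_log.warning("denied tool=%s user=%s args=%s", tool, session.user_id, json.dumps(args))
        raise PermissionError(f"{session.user_id} lacks the capability required for {tool}")
    if policy["irreversible"] and not confirm_out_of_band(session, tool, args):
        raise PermissionError(f"irreversible call to {tool} was not confirmed out of band")
    audit_log.info("allowed tool=%s user=%s args=%s", tool, session.user_id, json.dumps(args))
    return run_tool(tool, args)
```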

What does a realistic threat model look like?

Stop trying to prove the model is safe. Prove the system around the model is safe given that the model is not.

  1. Assume the model will produce any output the adversary wants on at least some inputs. Design downstream parsers and executors to fail closed, not fail open.
  2. Assume retrieval context can be attacker-controlled if any indexed source is user-writable or external. Treat those sources as untrusted.
  3. Assume tool calls can be hijacked via any input channel. Make irreversibility expensive and observable.
  4. Assume the model weights can be substituted by a compromised proxy. Pin model versions, verify response headers where the provider supports it, and monitor output distribution.
  5. Assume evals and traces will be the post-incident artifact. Build them now, not after the incident.

This threat model is not novel. It is the same posture you would adopt toward a third-party SaaS that ships code into your runtime — trust the interface, not the internals. The novelty is that people still argue against applying it to LLMs because the LLM feels like part of your application. It is not. It is a vendor system you embed.

Does any of this mean teams should not use LLMs?

No. The productivity and product-experience gains are real. The argument is not to abstain; the argument is to embed with the right level of skepticism and the right controls. Every prior wave of transformative technology went through this arc — databases, web apps, cloud, containers — and each one was at its most dangerous during the window when builders had adopted it but security reasoning had not caught up. LLMs are in that window right now. Teams that treat them as infrastructure with adversarial properties will be fine. Teams that treat them as magic will learn why they are not.

How Safeguard Helps

Safeguard builds on the "LLM is an untrusted component" threat model end-to-end. Our SBOM module supports AI-BOM ingestion so model artifacts, training data sources, and fine-tuning recipes are tracked as supply chain components with their own provenance, not hidden inside application code. Reachability analysis extends to tool-call graphs, so when a model issues a privileged call the policy engine can evaluate it against the same policy set that governs the rest of the application. Griffin AI, our reasoning layer, is built with explicit separation between instruction and data channels and runs a continuous eval suite with drift alerting baked in. For organizations adopting LLM-based agents at scale, Safeguard provides the SBOM, policy, trace, and eval control plane you need to ship them without the usual wishful thinking.
