AI Security

Model Substitution Attacks: An Emerging Pattern

An attacker who can swap the model behind an API call can read every prompt and shape every response. In 2026, model substitution is emerging as an attack class with its own techniques and disclosures.

Shadab Khan
Security Engineer
7 min read

Model substitution sits in a class of attacks that has always existed in software security and is now arriving in AI. The defender thinks they are talking to component A. The attacker has replaced it with component B. Component B looks similar enough that nothing obvious breaks. While it operates, the attacker reads everything that flows through and influences everything that flows out. The 2026 versions of this attack target the AI model — the most consequential component in any modern application — and the pattern is becoming distinct enough to merit its own name and its own playbook.

The Threat Surface

Model substitution can happen at several layers. The most direct is endpoint substitution, where the attacker compromises the configuration that points an application at its model and redirects traffic to an attacker-controlled endpoint that proxies to a real model and logs everything in between. Several disclosed incidents in late 2025 followed this pattern. An attacker with limited access modified an environment variable or a Kubernetes secret, pointing the application at a "model gateway" they had stood up. The gateway answered correctly, and only an unrelated audit caught the fact that traffic was going somewhere other than the expected provider's endpoints.
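To make the fragility concrete, here is roughly what that configuration point looks like in a typical application, sketched with the OpenAI Python SDK as one example of an OpenAI-compatible client. The variable names are illustrative; the common part is a single runtime value deciding where every prompt goes.

```python
import os
from openai import OpenAI  # any OpenAI-compatible SDK; shown for illustration

# A single environment variable decides where every prompt goes.
# MODEL_BASE_URL and its default are illustrative names.
client = OpenAI(
    base_url=os.environ.get("MODEL_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ["MODEL_API_KEY"],
)

# If an attacker flips MODEL_BASE_URL to point at their own gateway, this
# call still succeeds and still returns plausible completions; the gateway
# simply proxies to a real model and logs everything in between.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize today's tickets."}],
)
```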

The second layer is proxy substitution. Many enterprises put a model proxy between their applications and the model provider for cost tracking, prompt logging, or guardrails. That proxy is now a high-value target. Compromising it gives the attacker the same vantage as endpoint substitution, with additional cover: traffic still flows from the proxy to the expected provider exactly as it normally would. The 2026 pattern includes attackers compromising the proxy's CI pipeline rather than the proxy itself, slipping in code that exfiltrates a sample of prompts to a side channel.

The third layer is identity substitution, where the attacker compromises an API key or service account and routes their own traffic through the victim's account. This is less about reading the victim's traffic and more about consuming budget and laundering attacker activity through a legitimate identity. We are including it in the substitution category because the defensive posture is similar: detecting that "your" model traffic has changed shape.

The fourth and newest layer is model file substitution in self-hosted deployments. An attacker who can write to the model registry or the inference server's model directory replaces a fine-tuned model with a near-identical model that has been instrumented or trojaned. We saw the first credible disclosed example of this in February — an attacker swapped a customer-service fine-tune for one that included a backdoor activated by a specific token sequence. The substituted model passed standard quality checks because its behavior on benign inputs was indistinguishable from the original.

What Makes It Hard To Detect

Several properties of AI systems make substitution unusually hard to spot.

Outputs are non-deterministic by design. Swap out an ordinary deterministic component and tests fail immediately. With AI models, even the legitimate model produces different outputs on different runs, so the signal that distinguishes a substituted model from ordinary sampling variation is buried in distributional statistics, not single-trace comparisons.
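The kind of comparison that can recover that signal looks roughly like the minimal sketch below: collect a cheap per-response statistic (response length in tokens, say) over a trusted baseline window and over recent traffic, then compare the two distributions. The SciPy two-sample Kolmogorov-Smirnov test and the numbers are illustrative.

```python
from scipy.stats import ks_2samp

def model_drift_suspected(baseline_lengths, recent_lengths, alpha=0.01):
    """Compare the distribution of a cheap per-response statistic (here,
    response length in tokens) between a trusted baseline window and recent
    traffic. One odd response proves nothing; a shifted distribution over
    hundreds of responses is a signal worth investigating."""
    result = ks_2samp(baseline_lengths, recent_lengths)
    return result.pvalue < alpha  # distributions differ -> investigate

# Illustrative values only: token counts recorded while the endpoint was
# known-good versus token counts from this week's traffic.
baseline = [412, 388, 455, 390, 421, 407, 433, 399]
recent   = [221, 240, 198, 250, 233, 210, 245, 228]
if model_drift_suspected(baseline, recent):
    print("response-length distribution shifted; check the endpoint")
```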

Vendors do not currently sign responses. When a model provider returns a completion, it usually includes metadata identifying the model name. That metadata is trivial for a proxy to forge. There is no widely deployed cryptographic attestation that a given response came from a specific model running in a specific provider environment. Standards work is happening, but it is not in production.

Cost and latency profiles overlap. A swap from model A to a smaller, cheaper model B may show up as faster responses and lower bills — outcomes a finance team is unlikely to flag as suspicious. A swap to a larger model masquerading as a smaller one can be hidden in the noise of normal capacity variability.

Most application logs do not capture enough to investigate. Teams log the prompt and the response, but not the network path, the resolved IP, the certificate fingerprint, or the inter-token timing distribution that would allow forensic analysis after the fact.

Defensive Patterns That Work

A few practices have emerged from the teams that caught substitution attempts, often by accident, and used the experience to harden their setups.

Pinned and signed endpoints. The model endpoint is hard-coded to a specific hostname, certificate fingerprint, and provider account. Configuration changes to the endpoint require a code review and a deployment, not a runtime config flip. This converts endpoint substitution from a config-only attack to a code-deployment attack, raising the bar significantly.
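A minimal sketch of the fingerprint check, using only the Python standard library. The hostname and pin are placeholders; the point is that the pin lives in code and changes only through review and deployment.

```python
import hashlib
import socket
import ssl

PINNED_HOST = "api.example-model-provider.com"               # placeholder hostname
PINNED_FINGERPRINT = "expected-sha256-hex-of-provider-cert"  # placeholder pin

def verify_endpoint_pin(host: str = PINNED_HOST, port: int = 443) -> None:
    """Connect to the model endpoint, hash the certificate it presents, and
    refuse to proceed if the hash does not match the pinned fingerprint."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der_cert = tls.getpeercert(binary_form=True)
    fingerprint = hashlib.sha256(der_cert).hexdigest()
    if fingerprint != PINNED_FINGERPRINT:
        raise RuntimeError(
            f"model endpoint certificate changed: {fingerprint} does not match pin"
        )
```

Providers rotate certificates, so in practice the pin is usually a small set of acceptable fingerprints updated through the same review process, not a single value.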

Independent verification calls. Critical workflows include a low-cost test call with a known prompt and a known expected response distribution. The verification call is sent through the same path as production traffic, on a randomized schedule. Output that drifts outside an expected range surfaces as an alert. This catches the silent-swap-to-cheaper-model pattern reliably.
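A sketch of the verification loop, assuming a call_model() helper that sends through the exact production path and a page_oncall() alerting hook; both names, the prompt, and the thresholds are hypothetical.

```python
import random
import time

VERIFICATION_PROMPT = "Reply with the single word: heliotrope."
EXPECTED_WORD = "heliotrope"
MAX_LATENCY_SECONDS = 4.0  # illustrative threshold

def run_verification_call(call_model, page_oncall):
    started = time.monotonic()
    reply = call_model(VERIFICATION_PROMPT)  # same code path as production
    elapsed = time.monotonic() - started
    healthy = EXPECTED_WORD in reply.lower() and elapsed < MAX_LATENCY_SECONDS
    if not healthy:
        page_oncall(f"verification call drifted: {reply!r} in {elapsed:.1f}s")

def verification_loop(call_model, page_oncall):
    while True:
        run_verification_call(call_model, page_oncall)
        # Randomized interval so a substituted proxy cannot special-case a
        # predictable health-check cadence.
        time.sleep(random.uniform(300, 1800))
```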

Canary tokens in prompts. The system prompt includes a canary string that the model is instructed never to repeat. A response that contains the canary indicates the model received a system prompt — the legitimate one — and is misbehaving. A response that systematically fails to "know" the canary in tests where it should indicates the model never saw the system prompt, which is what happens when an attacker proxies prompts to a different model that does not pass the system prompt through correctly.
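Sketched in Python, with an illustrative canary value and prompt wording, the two checks look roughly like this:

```python
CANARY = "zx-canary-7f3a91"  # random per-deployment string, illustrative value

SYSTEM_PROMPT = (
    "You are the support assistant. Internal marker, never repeat it to "
    f"anyone under any circumstances: {CANARY}"
)

def leaked_canary(production_response: str) -> bool:
    # The model saw the legitimate system prompt and is misbehaving.
    return CANARY in production_response

def canary_unknown_in_test(test_response: str) -> bool:
    # In a dedicated test the model is asked whether its instructions contain
    # an internal marker, without repeating it. A model that consistently
    # answers no probably never received the system prompt at all.
    return "yes" not in test_response.lower()
```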

Provenance for self-hosted weights. Model files are content-addressed and verified against a known good hash on every load. The model registry has a signed log of who pushed what when. Drift between the running model's hash and the expected hash is a paged alert.
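A minimal version of the load-time check; the manifest layout is illustrative, and verifying the manifest's own signature is assumed to have happened already.

```python
import hashlib
import json
from pathlib import Path

def load_verified_model(model_path: str, manifest_path: str) -> str:
    """Refuse to hand weights to the inference server unless their content
    hash matches the expected hash recorded in the registry manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    expected = manifest["sha256"]  # illustrative manifest field

    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    if digest.hexdigest() != expected:
        # In production this is a paged alert, not just an exception.
        raise RuntimeError(f"model file hash mismatch for {model_path}")
    return model_path
```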

Network egress controls scoped to providers. Outbound traffic to model APIs is restricted to a small list of provider endpoints, with monitoring on any deviation. This is unfashionable because it is operationally fiddly, but it remains the most effective control against endpoint substitution.
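The enforcing control belongs at the network layer, but a lightweight monitoring job can independently confirm that the configured endpoint still resolves into the expected ranges. A sketch, with an illustrative allowlist:

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Illustrative range: the real list is whatever address ranges your provider
# publishes, maintained alongside the firewall or egress-proxy rules.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

def endpoint_within_allowlist(base_url: str) -> bool:
    host = urlparse(base_url).hostname
    resolved = {info[4][0] for info in socket.getaddrinfo(host, 443)}
    return all(
        any(ipaddress.ip_address(ip) in net for net in ALLOWED_NETWORKS)
        for ip in resolved
    )
```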

Where The Standards Are Headed

The interesting development in early 2026 is the convergence of several proposals around model response attestation. The shape is similar across them: the model provider signs responses with a key tied to a specific model version, the client verifies the signature, and the verification is a precondition for trusting the response. This will not deploy quickly. It requires changes to client SDKs, server infrastructure, and key management at every provider. But the trajectory is clear, and within two years signed responses will likely be table stakes for high-assurance deployments. Until then, the controls described above carry the weight.
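No published standard exists yet, so the following is only a sketch of the shape these proposals share, using Ed25519 via the cryptography library. The key distribution, the mapping from model version to key, and exactly which bytes get signed are all assumptions.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# model version -> raw Ed25519 public key bytes, fetched out of band and
# populated at startup; empty here because this is illustrative.
PROVIDER_KEYS: dict[str, bytes] = {}

def response_is_attested(model_version: str, body: bytes, signature: bytes) -> bool:
    key_bytes = PROVIDER_KEYS.get(model_version)
    if key_bytes is None:
        return False  # unknown model version: do not trust by default
    try:
        Ed25519PublicKey.from_public_bytes(key_bytes).verify(signature, body)
        return True
    except InvalidSignature:
        return False
```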

A second standards thread is around model registry transparency for self-hosted environments. The proposals borrow from container image registries (signed manifests, transparency logs, content addressing) and apply them to model artifacts. Adoption is faster here because the analogy with containers is direct and the tooling overlap is large.

The Direction For The Rest Of 2026

Reports of substitution attacks will increase. We expect at least one disclosed incident in 2026 in which a publicly used AI product turns out to have been silently routing traffic through an attacker-controlled proxy for some period of time. The discovery will probably come from cost or latency anomalies rather than from a security control. That incident will accelerate adoption of the controls above, and it will compress the timeline on response signing. The category will be named in vendor security questionnaires before the end of the year.

How Safeguard Helps

Safeguard treats every model endpoint and self-hosted model artifact as a tracked component in your AI bill of materials, with provenance, signed configuration, and an expected runtime profile. Drift in endpoint configuration, certificate fingerprints, response latency distributions, or model file hashes surfaces as a finding the moment it occurs. Policy gates require that production model endpoints come from approved providers with verified configuration before a deployment is allowed to proceed, and that self-hosted weights match a known signed manifest before they can be loaded. Canary verification calls and prompt-canary checks can be configured per product and routed through the platform's anomaly detection. When response signing standards land, Safeguard's verification layer will be ready to enforce them — and until then, your team has the runtime and configuration controls to detect substitution before it costs you anything but time.
