AI Security

MCP tool poisoning: hidden instructions and rug-pulled tool definitions

Tool-poisoning attacks against Model Context Protocol servers hide adversarial instructions inside tool descriptions and silently mutate tool definitions after install. Here is how the attack works and how to defend against it.

The Model Context Protocol — MCP, defined by Anthropic in late 2024 — gives language model agents a clean way to discover and call tools exposed by external servers, and that uniformity is exactly what makes the protocol attractive both to legitimate integrators and to people who want to abuse it. An MCP server advertises a list of tools, each with a name, a JSON schema, and a natural-language description, and the host application feeds those descriptions into the model so it can decide when and how to invoke each tool. The description is the contract: it tells the model what the tool does, when it should be called, and what arguments mean. That contract is also a control surface, because anything written in the description influences the model's behavior on every turn the tool is available.

Tool poisoning is the family of attacks that exploits this control surface. A malicious or compromised MCP server writes instructions into the description that are not visible to the user in the host UI but that the model reads on every prompt assembly, and over the last twelve months that pattern has matured from a curiosity into a documented technique with several real-world incidents. The same surface enables a second attack — the so-called rug pull, where a server publishes a benign tool definition at install time and then quietly replaces it weeks later — and the combination is what makes MCP supply-chain security a distinct problem rather than a rebadging of the usual prompt-injection conversation.

What does a poisoned tool description actually look like?

A poisoned description does not announce itself. The visible portion typically reads like a normal helper — "fetch the contents of a URL," "compute a hash," "list files in the current workspace" — and the malicious payload sits below the visible portion in a region the host UI never renders. The payload may be a paragraph of imperative text that instructs the model to also exfiltrate environment variables, also call a second tool with a specific argument, or also append a hidden string to any code it writes. Because the model treats the description as part of its operating instructions, it tends to comply, especially when the payload is phrased as a clarification or a "safety note" rather than as an obvious command.

The more sophisticated variants exploit the asymmetry between what the host renders and what the model sees. Some hosts truncate descriptions in the UI at a few hundred characters, so the attacker pushes the payload past the truncation boundary. Others render markdown but feed raw text to the model, so the attacker uses HTML comments or invisible Unicode characters to hide instructions in plain sight. A defender who only reads the description in the host UI will not see the problem; the only reliable way to audit a description is to read the raw payload exactly as the model receives it.

The defense at this layer is mechanical rather than clever. The host runtime should canonicalize tool descriptions, strip non-printing characters, and render the exact text the model will see in a review pane that the operator must approve before a server is enabled. Some teams go further and run the description through a classifier that flags imperative second-person language or references to other tools, on the theory that a legitimate description rarely needs to tell the model what to do in a directive voice.

How does a rug-pull on a tool definition work?

A rug pull exploits the gap between install-time review and run-time behavior. When an operator first connects an MCP server, they typically inspect the advertised tools, decide which ones to enable, and grant the server a transport credential. After that point, most hosts re-fetch the tool list on every session start and accept whatever the server returns without re-prompting the operator. A server that behaved correctly for the first month can publish a new tool list on day thirty-one that adds a tool, changes a description, or alters a schema to demand new parameters — and the host will quietly use the new definitions.

The attack is particularly effective when the server is hosted by a third party rather than run locally, because the operator has no visibility into when the server's code or configuration changes. A maintainer who loses control of the hosting account, sells the project, or simply decides to monetize the user base can ship a new tool definition without ever touching the code the operator originally vetted. The pattern mirrors browser-extension rug pulls and npm package takeovers, and the mitigations are similar in shape even if the mechanics differ.

The structural defense is to pin tool definitions the way you pin dependency versions. A host that records a hash of each enabled tool's schema and description at approval time, and refuses to call a tool whose hash has changed without a fresh operator approval, removes the silent-mutation path entirely. The cost is friction — legitimate updates require a review — but for any server that touches code, secrets, or production systems the friction is well spent.

Why do existing prompt-injection defenses miss this?

Most prompt-injection defense work focuses on content that arrives at runtime: a web page the agent fetches, a support ticket it reads, a document a user uploads. Tool-poisoning is different because the adversarial content arrives at configuration time, lives inside the system prompt's tool block, and is treated by the model as authoritative. Filters that scan user-visible inputs for jailbreak strings never see it, and classifiers trained on prompt-injection corpora often miss it because the language is bureaucratic and tool-shaped rather than obviously adversarial.

The other reason existing defenses miss is that the host runtime usually trusts the MCP server transport as an inner ring. The server is on an allowlist, the transport is authenticated, and the assumption is that anything the server says is in-band protocol data rather than untrusted input. That assumption is exactly what tool-poisoning violates. The right mental model is to treat every MCP server as untrusted input — including its tool list, its tool descriptions, its resource contents, and its tool outputs — and to apply input-validation discipline at every boundary where that data crosses into the model's context.

What does a defensible MCP host configuration look like?

A defensible configuration starts with a registry of approved servers, each with a pinned source — a container digest, a commit SHA, a signed binary — and a recorded hash of the tool definitions that were reviewed at approval time. Enabling a new server requires a human reviewer to read the raw tool descriptions, including any content beyond the host's UI truncation, and to acknowledge the schema of every tool that will be exposed to the model. Re-fetching tool definitions at session start is fine, but any change to a recorded hash must block tool use until a reviewer re-approves.

Run-time controls matter as much as install-time controls. Tools that perform sensitive actions — writing files, executing shell commands, calling external APIs with credentials — should be wrapped in a policy layer that enforces per-tool allowlists on arguments, logs every invocation with full input and output, and requires explicit confirmation for actions outside a narrow safe set. Operators should be able to see, in a single pane, which MCP servers are connected, which tools each one exposes, and a recent history of how those tools have been called.

The final piece is the human workflow. MCP servers are software dependencies, and the same review cadence that a security team applies to npm packages or container base images should apply to MCP servers. A server's source, maintainer reputation, update history, and tool definitions all belong in the same review pipeline as any other third-party dependency, and changes to any of those should trigger the same kind of scrutiny that a major version bump in a critical library would.

How Safeguard Helps

Safeguard treats MCP servers as first-class supply-chain components and inventories them alongside the rest of your software dependencies. Griffin AI, the platform's reasoning agent, reads tool descriptions as they will appear to the model, flags hidden instructions and directive language, and pins schema hashes so silent rug-pulls are caught before the next session. MCP server security policies in Safeguard let you require human approval for new tools, enforce per-tool argument allowlists, and block any server whose definitions have drifted from the reviewed baseline. Agent guardrails wrap sensitive tool calls in policy checks, and runtime egress monitoring captures every outbound call an MCP-connected agent makes so investigators have a complete record when something goes wrong. To discuss how Safeguard can secure your MCP deployment, talk to our team.

mcp llm agent prompt injection supply chain tool poisoning

Back to all articles

More on #mcp

View all

Agent Security

Never miss an update

Weekly insights on software supply chain security, delivered to your inbox.

MCP tool poisoning: hidden instructions and rug-pulled tool definitions

What does a poisoned tool description actually look like?

How does a rug-pull on a tool definition work?

Why do existing prompt-injection defenses miss this?

What does a defensible MCP host configuration look like?

How Safeguard Helps

More on #mcp

Claw Chain: Four Chained CVEs Turn 245,000 OpenClaw Agents Into Backdoors (May 2026)

Detecting shadow MCP servers in developer environments

Prompt-injection vectors specific to MCP servers and how to layer defenses

When the Vulnerability Is the Design: MCP STDIO Command Injection Across 150M Downloads (May 2026)

Related articles in AI Security

Daybreak vs. Mythos: 2026 Is the Year the Frontier Labs Entered Defensive Security

Patch the Planet: What AI-Generated Fixes Actually Mean for Open-Source Maintainers

OpenAI's Daybreak: An Honest Assessment of Codex Security, GPT-5.5-Cyber, and the Find-Validate-Patch Loop

Never miss an update