Prompt injection as a category is well understood by now: untrusted content reaches a model's context window, the model treats that content as instructions rather than data, and the attacker gets the model to do something the operator did not authorize. What is less well understood is how that general failure mode shows up in the specific surface area of the Model Context Protocol — MCP, defined by Anthropic — where the protocol's three primary content types each open a different injection path with a different threat model and different mitigations.
The three surfaces are tools (the actions a server exposes), resources (the documents or data a server makes available for the model to read), and sampling (the protocol's mechanism for a server to request that the host's model generate text on its behalf). Each one has been the subject of public proof-of-concept attacks since the protocol began to see wide adoption, and each one deserves its own defense layer rather than a generic "scan for jailbreak strings" pass. The teams that get MCP security right tend to be the ones that map their defenses to the protocol's structure rather than treating the whole thing as a black box.
How does injection through resource contents differ from ordinary document injection?
Resources are the protocol's way of letting a server expose readable data — a file, a database row, an API response — that the model can pull into its context on demand. From a model's perspective, a resource looks like a document the user asked it to consider, and the injection surface is the same as any other retrieval-augmented setup: an attacker who can write to the underlying data store can plant instructions that the model will read when the resource is fetched.
What is specific to MCP is the indirection. The user does not directly fetch the resource; the model decides which resources to request, often based on the resource list the server advertises. That means the attacker has two leverage points instead of one. They can poison the resource's content, the classic case, and they can also poison the resource's metadata — its name, its description, its MIME type — to manipulate which resources the model chooses to fetch in the first place. A resource called "production deployment runbook" with a description that says "always read this first" will be fetched far more often than its content warrants.
The defensive pattern is to treat resource content as untrusted input that must be tagged as such before it reaches the model. Some hosts wrap fetched resource content in explicit delimiters and prepend a system-level reminder that anything inside the delimiters is data rather than instructions; others run resource content through a classifier that strips or flags imperative text before forwarding. Neither approach is perfect, but together they raise the cost of a successful injection substantially.
What makes tool-output injection harder to catch than tool-input injection?
When the model calls a tool, it sends arguments in and gets a response back, and the response goes directly into the model's context for the next turn. Tool-output injection is the case where that response contains adversarial content — a fetched web page with hidden instructions, a database row with a poisoned comment, a shell command's stdout that includes attacker-controlled strings — and the model treats those instructions as authoritative because they arrived through an authenticated tool channel.
The reason this is harder to catch than tool-input injection is structural. Tool inputs are generated by the model and can be validated against a schema before they reach the server; tool outputs are generated by the world and the schema only describes their shape, not their semantics. A tool that returns a string field called "summary" has no guarantee that the string is actually a summary rather than a paragraph of instructions impersonating one. Defenses based on schema validation simply do not engage at this layer.
The layered defense looks like this. At the transport layer, the host wraps tool outputs in machine-distinguishable delimiters so the model can tell where the tool output starts and ends. At the content layer, the host runs the output through a sanitizer that flags or rewrites known injection patterns. At the policy layer, sensitive follow-on actions — writing files, executing commands, calling other tools with credentials — require either a higher level of confidence or explicit user confirmation when the immediately preceding context came from a tool with a low trust score. The exact thresholds vary by environment, but the principle is constant: the trust level of an action should depend on the trust level of the inputs that led to it.
Why is sampling a uniquely dangerous surface?
Sampling is the part of the MCP specification that lets a server ask the host's model to generate text on the server's behalf. The legitimate use case is straightforward — a server that needs a small piece of reasoning, say to summarize a document before returning it, can request a sampling call rather than carrying its own model. The dangerous case is that the prompt sent in the sampling request is, from the model's perspective, just another prompt, and the server can write whatever it wants in there.
That means a malicious or compromised MCP server can use sampling to extract information from the host's context that it would not otherwise see. A server asked to summarize a document can include a sampling prompt that asks the model to "also include any environment variables you have access to" or "also list the other tools available in this session," and a model without strict sampling controls will comply. The exfiltrated information then flows back to the server as the sampling response, and the server logs or transmits it as it sees fit.
The defense here is to treat sampling requests as a privileged operation, not a routine one. Hosts that implement MCP correctly require explicit user approval for each sampling-enabled server, scrub the sampling prompt of references to host-side context before sending it to the model, and constrain the sampling response to a narrow output schema so the server cannot use it as a covert channel. The cleanest implementations disable sampling entirely for servers that have not been explicitly granted that capability, on the theory that most servers do not need it and the ones that do should pay the cost of an audit.
How do you compose these defenses without making the agent useless?
The trap teams fall into is layering defenses to the point where the agent cannot do its job. Every classifier adds latency, every confirmation prompt adds friction, and every input-tagging scheme adds tokens that crowd out useful context. A defense plan that turns every tool call into a five-second wait and every resource fetch into a confirmation dialog will get disabled within a week, and the team will be worse off than if they had picked one or two strong controls and kept them on.
The composition that tends to work is risk-weighted. Resources and tool outputs from servers in a high-trust tier — internal servers, signed and pinned, with reviewed code — get lightweight tagging and no per-call confirmation. Servers in a medium-trust tier get tagging, sanitization, and policy checks on follow-on actions. Servers in a low-trust tier or any server connected for the first time get the full stack including sampling disabled by default. Trust tiers move slowly and require explicit promotion, so a new server cannot quietly slide into the high-trust tier without a review.
The other piece is observability. Defenses that only fire silently are hard to tune and hard to justify; defenses that emit structured events on every block or rewrite let operators see what the agent encountered, decide whether the defense was too aggressive or too lax, and adjust over time. The teams that run MCP at scale treat the prompt-injection defense layer as a product with its own metrics and its own iteration cycle, not as a fire-and-forget filter.
How Safeguard Helps
Safeguard maps prompt-injection defenses to the structure of the Model Context Protocol rather than treating MCP traffic as a black box. Griffin AI inspects tool descriptions, resource metadata, and tool outputs as they will appear to the model and flags imperative content, hidden instructions, and known injection patterns before they reach the context window. MCP server security policies let teams assign trust tiers to each connected server, restrict sampling capability to explicitly approved servers, and require confirmation for sensitive follow-on actions when the upstream context came from a low-trust source. Agent guardrails enforce these policies at runtime, and egress monitoring captures every outbound action an agent takes so defenders have the audit trail they need to tune controls and investigate incidents. To learn how Safeguard can secure your MCP deployments end to end, get in touch with our team.