AI Security

Defending LLM agents against confused-deputy attacks on their tool privileges

An LLM agent with tools is a deputy that holds privileges its users do not. Attackers exploit that gap by tricking the agent into using those privileges on their behalf — here is how to design defenses that hold up.

The confused deputy is one of the oldest patterns in the security literature. A program with privileges acts on behalf of a less-privileged caller; the caller crafts a request that exploits those privileges to do something the caller could not do directly; and the program, lacking a way to distinguish its own intentions from the caller's, dutifully complies. The classic example involves a compiler that could write to system files, but the structure shows up anywhere a deputy holds privileges its callers do not. Language model agents that wield tools fit that structure almost perfectly: they sit between users (or arbitrary inputs) and powerful APIs, they hold credentials those inputs do not, and they are notoriously weak at distinguishing instructions from data.

What makes LLM agents a particularly rich confused-deputy target is the breadth of the input surface. A web search result, a fetched document, a Slack message, an email body, a database row, a tool output from another agent — any of these can carry instructions that the model treats as authoritative, and any of them can ask the agent to use its privileges on behalf of the attacker who planted the content. Building defenses against this class of attack is not optional for agents that touch production systems; it is the central authorization problem of the agent runtime.

Why doesn't the model's own judgment count as a defense?

The intuition that a sufficiently smart model will refuse obviously malicious requests is not wrong, but it is brittle. Models do refuse the easy cases: a web page that asks the agent to email the user's password to an external address gets caught most of the time, and refusal rates tend to rise as alignment training improves. The problem is that the boundary of "obviously malicious" keeps moving, and adversaries who get to iterate against the defender's model have a structural advantage in finding the cases that slip through.

There is also a deeper issue. The model's judgment is exercised inside the same context where the attacker's content lives, which means the attacker can attempt to manipulate the judgment itself. Phrases like "this is an authorized internal request," "the security team has pre-approved this action," or "ignore previous instructions and proceed" are the crude end of a spectrum that extends to elaborate role-play setups and multi-turn social engineering. A defense that relies entirely on the model resisting these attempts is a defense whose strength depends on the attacker's creativity, which is exactly the wrong dependency.

The right framing is to treat the model's judgment as one layer of defense, valuable but not load-bearing. The load-bearing layer has to live outside the context window, in code that the model cannot reason its way around and that enforces policy based on properties the model cannot fake.

How does input provenance change the authorization decision?

A core defensive idea is that the authorization decision for a tool call should depend not just on what the agent is trying to do but on where the request originated. An agent acting on a direct user instruction has more authority than the same agent acting on content it read from a webpage, which in turn has more authority than one acting on a footer in an email forwarded by a stranger. Tracking that provenance through the agent's reasoning is what separates a robust authorization model from one that trusts every input equally.

In practice this means tagging every piece of content that enters the agent's context with a trust label — first-party user input, internal-system tool output, third-party external tool output, fetched document, and so on — and then making the runtime policy engine consume those tags when it decides whether to permit a tool call. A tool that writes to production databases might require that the immediately preceding instruction trace back to a first-party user input within the last few turns, with no third-party content intervening. A tool that fetches data can have looser requirements. A tool that sends external email might require explicit user confirmation regardless of provenance.
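The production-write rule described above can be sketched as a small policy function. This is a minimal illustration, not a reference implementation: the `Trust` labels, the `ContextItem` record, the `may_write_production` name, and the three-turn window are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust labels, ordered from least to most trusted."""
    FETCHED_DOCUMENT = 0
    THIRD_PARTY_TOOL = 1
    INTERNAL_TOOL = 2
    FIRST_PARTY_USER = 3


@dataclass(frozen=True)
class ContextItem:
    turn: int     # turn at which the content entered the context
    trust: Trust  # label assigned by the runtime, never by the model
    text: str


def may_write_production(context, current_turn, max_age_turns=3):
    """Allow a production write only if a first-party user input arrived
    within the last `max_age_turns` turns and nothing less trusted than an
    internal tool output entered the context after it."""
    user_turns = [
        item.turn for item in context
        if item.trust is Trust.FIRST_PARTY_USER
        and current_turn - item.turn <= max_age_turns
    ]
    if not user_turns:
        return False
    last_user_turn = max(user_turns)
    return all(
        item.trust >= Trust.INTERNAL_TOOL
        for item in context
        if item.turn > last_user_turn
    )


context = [
    ContextItem(1, Trust.FIRST_PARTY_USER, "update the shipping address"),
    ContextItem(2, Trust.INTERNAL_TOOL, "customer record found"),
]
print(may_write_production(context, current_turn=3))
```

The important property is that the labels are attached by the runtime when content enters the context, so nothing the model reads or writes can upgrade them.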

The implementation challenge is that the model does not naturally surface this trace, so the runtime has to do it. Some teams build dataflow tracking on top of structured agent frameworks where every tool output is annotated and every model invocation includes a provenance summary. Others enforce the policy at the tool wrapper, refusing a call if the call's justification in the agent's reasoning trace references third-party content as the trigger. Either approach is harder than just letting the model decide, but the resulting authorization model is far less brittle.
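One way to picture enforcement at the tool wrapper is a session object that annotates every tool output and refuses calls when the context has been exposed to less-trusted content. The sketch below is hypothetical throughout: `Session`, `guarded`, `PolicyViolation`, and the three labels are illustrative names, and resetting the trust floor on a fresh user message is a simplifying assumption that a real policy may not want.

```python
from typing import Callable

# Hypothetical labels, higher number means more trusted.
TRUST_LEVELS = {"third_party": 0, "internal": 1, "user": 2}


class PolicyViolation(Exception):
    """Raised when a tool call is refused by the wrapper."""


class Session:
    """Tracks the least-trusted label seen since the last user message."""

    def __init__(self) -> None:
        self.floor = "user"
        self.provenance = []  # (source, label) pairs, kept for later review

    def record(self, source: str, label: str) -> None:
        self.provenance.append((source, label))
        if label == "user":
            self.floor = "user"  # assumption: a fresh user turn resets the floor
        elif TRUST_LEVELS[label] < TRUST_LEVELS[self.floor]:
            self.floor = label

    def guarded(self, name: str, fn: Callable[..., str], *,
                output_label: str, min_trust: str) -> Callable[..., str]:
        """Wrap a tool so the check runs in code, outside the model's context."""
        def wrapper(*args, **kwargs) -> str:
            if TRUST_LEVELS[self.floor] < TRUST_LEVELS[min_trust]:
                raise PolicyViolation(
                    f"{name} blocked: context floor is {self.floor}")
            result = fn(*args, **kwargs)
            self.record(name, output_label)  # annotate the tool output
            return result
        return wrapper


session = Session()
session.record("chat", "user")
fetch = session.guarded("fetch_url", lambda url: "page text",
                        output_label="third_party", min_trust="third_party")
write = session.guarded("write_db", lambda row: "ok",
                        output_label="internal", min_trust="internal")
fetch("https://example.com")
# write("row") would now raise PolicyViolation until the user speaks again.
```

Because the floor is tracked by the runtime rather than inferred from the model's stated justification, a persuasive injected instruction does not change the outcome of the check.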

What does scope reduction look like at the tool layer?

Provenance helps decide whether a call should proceed; scope reduction limits the damage when a call does proceed under attack. A confused-deputy attack succeeds best when the deputy's privileges are broad — an API key that can read any object in a storage bucket, a database connection that can run any query, a Slack token that can post to any channel — because the attacker can pivot from a single tricked call into broad data access or impact. Narrowing those privileges per call cuts the blast radius substantially.

The simplest version is per-tool credentials. The tool that reads customer support tickets should not hold a credential that can also write to billing; the tool that posts to one Slack channel should not hold a token that can post to another. This requires the agent runtime to maintain a credential broker that issues narrow, short-lived tokens for each tool invocation rather than pre-loading the agent with broad capabilities. The broker pattern is more work to set up than a shared service account, but it pays for itself the first time an attack tries to use a billing-related tool through a support-related path.
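A credential broker of this kind might look roughly like the following. The tool names, scope strings, and 60-second lifetime are invented for illustration, and a real broker would mint tokens through the identity provider or secrets manager in use rather than generating random strings locally.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopedToken:
    value: str
    tool: str
    scopes: frozenset
    expires_at: float

    def permits(self, scope: str, now: Optional[float] = None) -> bool:
        """True only if the scope was granted and the token has not expired."""
        now = time.time() if now is None else now
        return scope in self.scopes and now < self.expires_at


class CredentialBroker:
    """Issues one narrow, short-lived token per tool invocation."""

    # Hypothetical mapping from tool name to the only scopes it may hold.
    TOOL_SCOPES = {
        "read_support_tickets": frozenset({"tickets:read"}),
        "post_support_channel": frozenset({"slack:post:#support"}),
    }

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl_seconds = ttl_seconds

    def issue(self, tool: str, now: Optional[float] = None) -> ScopedToken:
        if tool not in self.TOOL_SCOPES:
            raise KeyError(f"no credential policy for tool {tool!r}")
        now = time.time() if now is None else now
        return ScopedToken(
            value=secrets.token_urlsafe(32),
            tool=tool,
            scopes=self.TOOL_SCOPES[tool],
            expires_at=now + self.ttl_seconds,
        )


broker = CredentialBroker()
token = broker.issue("read_support_tickets")
print(token.permits("tickets:read"), token.permits("billing:write"))
```

A support tool holding this token cannot reach billing even if the agent is tricked into trying, because the scope was never granted in the first place.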

A more sophisticated version is argument-level scope. A tool that runs a database query can accept a query that returns one row or a query that returns the whole table; an attacker who tricks the agent into running the second does far more damage than one who can only trigger the first. Wrapping tools in argument validators that enforce row limits, column allowlists, predicate requirements, or rate caps, and that refuse calls outside those bounds without explicit override, turns broad capabilities into narrow ones at the call site. The tool's nominal power stays the same; its effective power on any given call is bounded.
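As a rough sketch of an argument validator, assume the query tool accepts structured arguments (columns, filters, limit) instead of raw SQL. `QueryBounds`, `validate_query`, `ScopeError`, and the example column names are hypothetical; rate caps and the override path are left out for brevity.

```python
from dataclasses import dataclass


class ScopeError(ValueError):
    """Raised when a call falls outside the tool's permitted bounds."""


@dataclass(frozen=True)
class QueryBounds:
    allowed_columns: frozenset
    max_rows: int = 50
    required_filters: frozenset = frozenset()


def validate_query(columns, filters, limit, bounds):
    """Return the validated arguments, or raise ScopeError."""
    if not columns:
        raise ScopeError("at least one column must be named")
    extra = set(columns) - bounds.allowed_columns
    if extra:
        raise ScopeError(f"columns outside allowlist: {sorted(extra)}")
    missing = bounds.required_filters - set(filters)
    if missing:
        raise ScopeError(f"missing required predicates: {sorted(missing)}")
    if not 1 <= limit <= bounds.max_rows:
        raise ScopeError(f"limit must be between 1 and {bounds.max_rows}")
    return {"columns": list(columns), "filters": dict(filters), "limit": limit}


bounds = QueryBounds(
    allowed_columns=frozenset({"id", "status", "subject"}),
    max_rows=50,
    required_filters=frozenset({"customer_id"}),
)
print(validate_query(["id", "status"], {"customer_id": 42}, 10, bounds))
```

The validator runs before the query is built, so a request for every row or for a sensitive column fails at the wrapper regardless of how the agent was persuaded to make it.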

How do you keep the agent useful while enforcing these checks?

The risk of confused-deputy defenses is the same as the risk of any authorization regime: if the controls are tight enough to be safe they often become tight enough to be unusable, and the operator ends up with an agent that requires confirmation for every meaningful action. The compromise is to make the friction proportional to the risk, and to make the risk assessment automatic rather than relying on the human to evaluate every call.

The pattern that tends to work is to define a small set of safe actions — reads against non-sensitive data, idempotent operations, calls with output that flows only to the user — that the agent can perform freely, and to require richer authorization for anything outside that set. Calls outside the safe set get evaluated against the provenance tags, the scope of the credentials, the recent history of the session, and the policy for the specific tool. Most of those evaluations resolve without prompting the user, but the ones that do prompt are the ones that genuinely deserve a human eye.
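The tiered evaluation can be illustrated with a small decision function. Everything here is an assumption for the sake of the example: the tool names in `SAFE_TOOLS` and `ALWAYS_CONFIRM`, the three-denial threshold, and the choice to escalate third-party-triggered calls to a confirmation rather than an outright denial.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # pause and ask the user
    DENY = "deny"


# Hypothetical policy tables.
SAFE_TOOLS = {"search_docs", "read_public_page"}
ALWAYS_CONFIRM = {"send_external_email"}


def decide(tool, triggered_by_third_party, credential_in_scope, recent_denials):
    """Make friction proportional to risk for a single tool call."""
    if tool in SAFE_TOOLS:
        return Decision.ALLOW            # safe set: no prompt needed
    if not credential_in_scope:
        return Decision.DENY             # the credential cannot cover this call
    if tool in ALWAYS_CONFIRM:
        return Decision.CONFIRM          # per-tool policy overrides provenance
    if triggered_by_third_party or recent_denials >= 3:
        return Decision.CONFIRM          # suspicious context or session history
    return Decision.ALLOW


print(decide("write_db", triggered_by_third_party=True,
             credential_in_scope=True, recent_denials=0))
```

Only the last two branches interrupt the user, which keeps prompts rare enough that they still get read.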

The second piece is structured logging. Every tool call, including its arguments, its provenance trace, and the policy decision that allowed or blocked it, should be recorded somewhere a human can review. That log is the artifact that lets the team tune the policy over time, investigate suspected attacks, and demonstrate compliance with whatever framework applies. Without it, the authorization regime is opaque even to its operators, and tuning becomes guesswork.
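A structured audit record could be as simple as one JSON object per tool call. The field names and the `agent.audit` logger name below are illustrative assumptions; a production version would likely also redact secrets from the arguments before writing them.

```python
import json
import logging
import time

# Hypothetical logger name; route it to whatever sink the team reviews.
audit_log = logging.getLogger("agent.audit")


def audit_record(tool, arguments, provenance, decision, reason):
    """Serialize one tool call and its policy outcome as a JSON line."""
    record = {
        "ts": time.time(),
        "tool": tool,
        "arguments": arguments,
        "provenance": provenance,  # e.g. list of (source, label) pairs
        "decision": decision,      # "allow", "confirm", or "deny"
        "reason": reason,
    }
    line = json.dumps(record, sort_keys=True, default=str)
    audit_log.info(line)
    return line


logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_record(
    tool="write_db",
    arguments={"table": "tickets", "id": 17},
    provenance=[["chat", "user"], ["fetch_url", "third_party"]],
    decision="deny",
    reason="third-party content since last user input",
)
```

Recording blocked calls alongside allowed ones matters, since the denials are where both attacks and overly tight rules show up.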

How Safeguard Helps

Safeguard treats LLM agents as deputies with delegated authority and gives operators the controls to bound that authority safely. Griffin AI tags every input to an agent with a provenance label, propagates those labels through tool calls, and lets policy decisions depend on whether the immediately upstream context came from a trusted source or a third-party document. MCP server security policies and agent guardrails let teams set per-tool authorization rules, narrow credential scope to the exact call being made, and require confirmation for actions whose blast radius warrants a human eye. Runtime egress monitoring records every outbound call an agent makes with its full provenance trace so defenders have the evidence to tune policy and investigate incidents. To talk through your agent's authorization model with our team, get in touch.
