AI Security

Agent-to-Agent Security in Multi-Agent Systems

Multi-agent systems inherit every trust problem of single-agent systems and add a few more. Here is how the threat model actually shifts.

Shadab Khan
Security Engineer
7 min read

Multi-agent architectures have moved from research to production fast enough that most teams running them have not thought through the security implications. The single-agent threat model was already a handful; the multi-agent model adds delegation, transitive authority, and a fresh class of injection attacks between agents. This post lays out the threat model, the failure modes that have actually been seen, and the patterns that hold up.

What is new about the multi-agent threat model?

The primary new surface is agent-to-agent communication. When one agent delegates to another, the delegator is effectively passing a task description (prose) and possibly some credentials or context. The delegate receives that as input and acts on it. This is a trust boundary, but it is frequently drawn where there is no enforcement: the delegate trusts that the delegator validated its inputs, the delegator trusts that the delegate will not exceed its remit, and in practice neither is true.

The attack shape is: content that reaches the first agent (through a retrieved document, a user message, or a tool output) contains instructions crafted to propagate through the delegation. The first agent summarizes the content and hands the summary to a subagent. The summary carries the injection payload forward. The subagent acts on it. Because the authority gets compressed as it moves (the subagent might have a tighter tool scope but also less context), the injection often lands as a plausible-looking request the subagent cannot distinguish from legitimate delegation.

How does transitive authority creep in?

By default, in most frameworks. A planner agent that can call an executor agent that can call a tool-using agent that holds production credentials has a chain of delegation that is, in authority terms, as privileged as the leaf. Unless each link in the chain enforces its own authorization check, the planner effectively has production credentials. Most teams deploying multi-agent systems in 2025 discovered this the hard way when an incident traced back several layers of delegation that no one had realized constituted a privilege path.

The fix is that every agent-to-agent call should be treated as a cross-service call in a microservices system would be. It carries its own principal, its own scope, and its own audit record. "The planner said it was okay" is not authorization; it is a statement the executor can evaluate but cannot trust blindly. This is the same security lesson the microservices world learned around 2018, translated into agent vocabulary.

Where does agent-to-agent prompt injection actually land?

At the summarization and hand-off boundary. An agent that reads a document and prepares a task description for a sibling agent is, in effect, compressing a document into a prompt. The compression is lossy and is model-driven, which means the compression can be steered by content in the original document. A malicious document can cause the summarization to produce a task description that looks innocuous but includes an instruction the sibling agent will act on.

This has been seen in customer support multi-agent systems where a user submits a message crafted to influence the downstream triage agent's behavior, in research-assistant systems where a paper's abstract is crafted to influence the literature-review agent's subsequent queries, and in code-review systems where a comment in source steers a dependency-update agent. Each of these is a variant of the same pattern: content in, compressed prompt out, sibling agent acts.

The mitigation is structural. The hand-off between agents should not be free-form prose. It should be a constrained schema with typed fields, validated before transmission and again on receipt. A well-designed schema denies the attacker the channel bandwidth to carry an instruction through. A schema like { "task_type": "search", "query": "...", "max_results": 10 } is much harder to inject through than a freeform task_description: str. Frameworks default to the latter; teams should default to the former.

How should capabilities be scoped per agent?

Minimum authority per role, enforced at the boundary, not requested. Every agent in the system should have a capability set defined externally, not derived from its role description. The capability set should be enforced by the transport or the broker, not by the agent's good behavior. A research agent that supposedly only does web search should not be able to call the billing API, not because it would not think to, but because its credentials materially do not work against that endpoint.

Capability-based design is older than LLMs and the literature is clear about what makes it work. The temptation in multi-agent systems is to skip it because the system feels fluid and the agents feel smart. Both are true and neither is a security argument. The moment the system moves to production, fluidity becomes a liability and intelligence becomes an attack surface. Enforce capabilities at the gateway.

What about shared memory and shared state?

Shared state between agents is a data integrity problem as much as a security problem. If agent A writes to a scratch space that agent B reads from, anyone who can influence A's output can influence B's input. In multi-agent systems with a "shared working memory" abstraction, this channel is rich: a poisoning attack against the shared memory propagates to every agent that reads from it.

The pattern that holds up: shared state is versioned, signed (or at least integrity-checked) by the writer, and validated by the reader against a schema. No agent reads raw prose from another agent's output and acts on it. Shared state is a queue of structured messages, not a whiteboard. This rules out some of the more fluid architectures that have been popular in research, but it rules them out because they are structurally unsafe for production.

How do we audit multi-agent systems?

Every agent action logged with a trace ID that ties back to the original user request, every cross-agent call logged with input and output, and the full trace retained for at least as long as your incident response horizon. The practical challenge is that multi-agent traces are large and noisy. Teams that have done this well sample aggressively for normal operation and capture everything for flagged cases. A classifier that flags sessions with unusual tool-call patterns is often enough to drive the capture logic.

The audit story is the one that lets you answer "what did the system actually do" after an incident. Without it, post-incident analysis is guesswork, and multi-agent systems generate too many plausible-looking actions for guesswork to be reliable. Teams that deployed multi-agent systems without serious audit logging in 2025 reported incidents they could not reconstruct, which is a worse outcome than the incident itself.

What about open-source frameworks?

LangGraph, AutoGen, CrewAI, and similar frameworks move fast and ship regularly. Their security posture varies, and the community norm of pinning to a specific version is less established than in traditional backend frameworks. A transitively poisoned update to one of these frameworks, similar in shape to the Ultralytics PyPI compromise, would propagate into every multi-agent system that runs pip install --upgrade. Pinning, scanning, and reviewing updates is exactly as important here as it is for any other production dependency, and in 2026 it is still not the default practice.

How Safeguard.sh Helps

Safeguard.sh applies reachability analysis across the full multi-agent framework dependency stack, cutting 60 to 80 percent of the SCA noise that would otherwise obscure real issues in tools like LangGraph and AutoGen. Griffin AI monitors agent-to-agent communication patterns for injection signatures and flags delegations that cross authority boundaries, while keeping the findings correlated against the SBOM. TPRM workflows track each framework's maintainer posture, and the 100-level dependency depth surfaces compromises in the long tail of agent-related Python and Node packages. Container self-healing rebuilds agent service images automatically as fixes land, so multi-agent deployments do not drift behind the current patch level.

Never miss an update

Weekly insights on software supply chain security, delivered to your inbox.