AI Security

Tool-Call Hijacking: Griffin AI vs Mythos

A hijacked tool call is more consequential than a hijacked response. The defence requires the tool layer to police the model, not the other way around.

Nayan Dey
Senior Security Engineer
3 min read

Tool-call hijacking is the attack variant where prompt injection or model manipulation induces the model to invoke tools the operator did not intend. Unlike a prompt injection that only produces a bad response, tool-call hijacking produces bad actions in the world: files deleted, emails sent, money moved. The defence requires the tool layer to enforce scope rather than trusting the model to request only authorised actions. That makes architectural choices at the tool layer matter more than anything done at the model layer.

What hijacking attacks look like

Three patterns:

  • Scope-expansion attempts. The model is induced to call a tool that exists but is not authorised for the current session.
  • Argument manipulation. The model calls an authorised tool but with manipulated arguments that expand the effective access.
  • Cascading tool invocation. The model is induced to make a chain of tool calls that together produce unauthorised behaviour even if each individual call is authorised.

Each needs a different defence.

Where model-level defences fail

"Ask the model nicely to only call authorised tools" is the weakest defence. It depends on the model's judgment, which is exactly what the attack exploits. Mythos-class tools that rely primarily on model-level enforcement remain measurably vulnerable to hijacking.

How Griffin AI handles it

Three architectural layers:

Authorisation at the tool layer. Each tool has scope, and the scope is enforced by the tool handler, not by the model. When the model emits a call to a tool outside the session's scope, the tool layer refuses regardless of how the model justified the call.
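A minimal Python sketch of tool-layer authorisation follows. It assumes a session object that carries an operator-defined set of allowed tool names; `Session`, `ToolRegistry`, `ScopeError` and the `dispatch` signature are hypothetical names for illustration, not Griffin AI's or Safeguard's API. What it shows is that the allow/deny decision reads only the session's scope, never the model's output, and that every attempt is logged.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, FrozenSet, List


class ScopeError(PermissionError):
    """Raised when a tool call falls outside the session's authorised scope."""


@dataclass(frozen=True)
class Session:
    # Hypothetical session record: the operator fixes allowed_tools when the
    # session is created; the model has no way to modify it.
    session_id: str
    allowed_tools: FrozenSet[str]


@dataclass
class ToolRegistry:
    handlers: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self.handlers[name] = handler

    def dispatch(self, session: Session, tool: str, args: Dict[str, Any]) -> Any:
        # The decision looks only at the session's scope. Whatever text the
        # model produced to justify the call is never an input here.
        allowed = tool in session.allowed_tools and tool in self.handlers
        self.audit_log.append(
            {"session": session.session_id, "tool": tool, "allowed": allowed}
        )
        if not allowed:
            raise ScopeError(f"{tool!r} is not authorised for session {session.session_id}")
        return self.handlers[tool](**args)
```

Because the check lives in `dispatch`, a call to a registered `send_email` tool from a session scoped only to `read_ticket` raises `ScopeError` however persuasive the injected prompt was, and the audit log records the refused attempt.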

Argument validation against the capability manifest. Each tool's capability manifest declares which argument values are acceptable. Values outside the declared range are rejected at the tool layer.
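The sketch below shows one way a capability manifest could be enforced. The manifest format (an expected type plus either a full-match regex or inclusive numeric bounds per argument), the tool names and `validate_args` are all assumptions for illustration. It works as an allow-list: anything not declared, including extra arguments, is rejected.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Optional


class ArgumentError(ValueError):
    """Raised when a tool argument falls outside its declared range."""


@dataclass(frozen=True)
class ArgSpec:
    kind: type                      # expected Python type
    pattern: Optional[str] = None   # full-match regex, for strings
    minimum: Optional[int] = None   # inclusive bounds, for ints
    maximum: Optional[int] = None


# Hypothetical manifest: tool name -> declared arguments and their ranges.
MANIFEST: Dict[str, Dict[str, ArgSpec]] = {
    "get_invoice": {"invoice_id": ArgSpec(str, pattern=r"INV-[0-9]{6}")},
    "refund": {
        "invoice_id": ArgSpec(str, pattern=r"INV-[0-9]{6}"),
        "amount_cents": ArgSpec(int, minimum=1, maximum=50_000),
    },
}


def validate_args(tool: str, args: Dict[str, Any]) -> None:
    spec = MANIFEST.get(tool)
    if spec is None:
        raise ArgumentError(f"no manifest entry for {tool!r}")
    # Undeclared or missing arguments are rejected outright, so the model
    # cannot smuggle in an extra parameter such as an override flag.
    if set(args) != set(spec):
        raise ArgumentError(f"{tool}: got {sorted(args)}, manifest declares {sorted(spec)}")
    for name, value in args.items():
        s = spec[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, s.kind) or (s.kind is int and isinstance(value, bool)):
            raise ArgumentError(f"{tool}.{name}: expected {s.kind.__name__}")
        if s.pattern is not None and re.fullmatch(s.pattern, value) is None:
            raise ArgumentError(f"{tool}.{name}: {value!r} does not match {s.pattern}")
        if s.minimum is not None and value < s.minimum:
            raise ArgumentError(f"{tool}.{name}: {value} below {s.minimum}")
        if s.maximum is not None and value > s.maximum:
            raise ArgumentError(f"{tool}.{name}: {value} above {s.maximum}")
```

Describing what is acceptable, rather than trying to enumerate attack strings, is what lets a SQL-injection payload in `invoice_id` fail the same check as a merely malformed ID.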

Cascading-invocation rate limits. Anomalous patterns of cascading tool calls (e.g., read followed immediately by exfiltrative write) trigger rate limits and optional out-of-band confirmation.
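A simple version of the read-then-write heuristic can be built from a sliding window over each session's recent calls. In this sketch the tool classifications (`SENSITIVE_READS`, `EXFIL_WRITES`), the 30-second window and the 20-call cap are illustrative assumptions, and the `"confirm"` result stands in for whatever out-of-band approval hook a deployment uses. The clock is passed in as `now` so the logic is easy to test.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple

# Hypothetical tool classification; a real deployment might derive this
# from each tool's capability manifest.
SENSITIVE_READS = {"read_file", "query_db"}
EXFIL_WRITES = {"send_email", "http_post"}


@dataclass
class CascadeGuard:
    window_s: float = 30.0   # how far back to look for a preceding read
    max_calls: int = 20      # plain rate limit per window
    history: Deque[Tuple[float, str]] = field(default_factory=deque)

    def check(self, tool: str, now: float) -> str:
        """Return 'allow', 'confirm' (needs out-of-band approval) or 'throttle'."""
        # Drop calls that have fallen out of the sliding window.
        while self.history and now - self.history[0][0] > self.window_s:
            self.history.popleft()
        # Hard cap: too many calls in the window, refuse without recording.
        if len(self.history) >= self.max_calls:
            return "throttle"
        recent_read = any(name in SENSITIVE_READS for _, name in self.history)
        self.history.append((now, tool))
        # A write that could move data out, shortly after a sensitive read,
        # is held for confirmation even if both calls are individually authorised.
        if tool in EXFIL_WRITES and recent_read:
            return "confirm"
        return "allow"
```

Each call on its own may be in scope; the guard reacts only to the sequence, which is the property a per-call authorisation check cannot see.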

A concrete example

An MCP server exposes a list_files(directory) tool scoped to a specific directory. An attacker's prompt injection induces the model to call list_files(directory="../../etc/").

With model-level defences: the model may or may not refuse, depending on its training.

With Griffin AI's tool-layer enforcement: the tool handler rejects the call because the directory argument escapes the declared scope. The attacker's attempt is logged; the audit trail shows the model attempted an out-of-scope call.
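The rejection in this example comes down to normalising the path before checking it. Below is a minimal sketch of a `list_files` handler that does this with `pathlib`; the handler signature, the `AUDIT_LOG` list and `OutOfScopeError` are hypothetical, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path
from typing import Any, Dict, List


class OutOfScopeError(PermissionError):
    """Raised when a path argument escapes the tool's declared directory."""


# Hypothetical in-memory audit trail.
AUDIT_LOG: List[Dict[str, Any]] = []


def resolve_in_scope(directory: str, root: Path) -> Path:
    root = root.resolve()
    # Join onto the scope root, then normalise. resolve() collapses ".."
    # segments and follows symlinks; an absolute argument such as "/etc"
    # replaces the root entirely. Either way the result is checked below.
    candidate = (root / directory).resolve()
    if not candidate.is_relative_to(root):
        raise OutOfScopeError(f"{directory!r} escapes {root}")
    return candidate


def list_files(directory: str, root: Path) -> List[str]:
    """Hypothetical handler for the list_files(directory) tool."""
    try:
        target = resolve_in_scope(directory, root)
    except OutOfScopeError:
        # Record the attempt before refusing, so the audit trail shows it.
        AUDIT_LOG.append({"tool": "list_files", "directory": directory, "outcome": "rejected"})
        raise
    AUDIT_LOG.append({"tool": "list_files", "directory": directory, "outcome": "allowed"})
    return sorted(p.name for p in target.iterdir())
```

Checking the resolved path, rather than searching the raw string for `..`, also catches absolute paths and symlinks that point outside the scope root.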

What to evaluate

Three concrete checks:

  1. Configure an MCP server with narrow scope. Induce the model to attempt a scope-escape. Verify the scope holds.
  2. Attempt argument manipulation (path traversal, SQL injection in tool arguments). Verify the arguments are validated at the tool layer.
  3. Attempt cascading tool invocation. Verify rate limits trigger.
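A complementary step for checks 1 and 2 is to replay the same hostile calls directly against the tool layer, with the model out of the loop, since the defence is not supposed to depend on the model. The harness below is a sketch under stated assumptions: the tool layer is reachable as a `handler(tool, args)` callable, it signals rejection by raising `PermissionError` or `ValueError`, and the probe tool names and payloads are illustrative. Check 3 needs ordered sequences of calls and is not covered here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Probe:
    # One adversarial tool call the model might be induced to emit.
    check: str              # which evaluation check this exercises
    tool: str
    args: Dict[str, Any]


# Hypothetical probes; adapt tool names and payloads to the server under test.
PROBES = [
    Probe("scope-escape", "delete_file", {"path": "notes.txt"}),
    Probe("argument", "list_files", {"directory": "../../etc/"}),
    Probe("argument", "query", {"sql": "1; DROP TABLE users"}),
]


def run_probes(
    handler: Callable[[str, Dict[str, Any]], Any], probes: List[Probe]
) -> List[Probe]:
    """Send each probe straight to the tool layer; return those NOT rejected."""
    leaked = []
    for probe in probes:
        try:
            handler(probe.tool, probe.args)
        except (PermissionError, ValueError):
            continue  # rejected at the tool layer, as intended
        leaked.append(probe)
    return leaked
```

An empty result means every probe was refused; any returned probe identifies a call the tool layer accepted and should not have.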

How Safeguard Helps

Safeguard's tool-call security is built on tool-layer enforcement of scope, argument validation, and cascading-call rate limits. The defence does not depend on the model cooperating. For organisations whose AI agents have access to real production tools, this architectural choice is the difference between a safe deployment and a latent incident.
