In November 2025, a misconfigured internal agent at a fintech we work with invoked transfer_funds from a support conversation because the LLM decided the user's request to "fix my balance" was best served by moving money. The tool was listed in the agent's manifest; no one had ever written a check that said "support agents may not call treasury tools." The prompt template said so in English. Claude 4 Sonnet, faithful and helpful, ignored the English. The same failure mode shows up in every framework we audit — LangChain's create_tool_calling_agent, OpenAI's Responses API with tools, Google's Vertex AI function calling, Anthropic's computer-use loop. The LLM is given a tool list and told in prose what not to do. Prose is not a boundary. This post walks through the enforcement patterns that actually hold when the model is wrong.
Why does prose scoping fail?
Because the tool call happens before your authorization layer ever sees a decision. Frameworks like LangChain 0.3 and LlamaIndex 0.12 wire tools directly into the model context; the model emits a structured call, the framework dispatches it, the side effect runs. The "guardrail" is whatever the model chose to do with the system prompt. OpenAI's own July 2025 function-calling eval showed GPT-5 honored explicit "do not call X under condition Y" instructions roughly 88% of the time under benign traffic and closer to 41% under adversarial prompts pulled from the InjecAgent benchmark. Claude 4 Opus did better — about 63% — but 63% is not an access control. If a human engineer shipped a checkAuth() that returned the right answer 63% of the time, they would be fired. We accept it from LLMs because we call it a "soft constraint." It is a vulnerability.
What does capability-token scoping look like in practice?
The pattern we recommend: the agent never sees a tool, only a capability token that expands to a tool server-side. When a user session starts, the policy layer mints tokens like read:customer:1234, write:ticket:self, search:kb:public. Those tokens are bound to the session, signed, and expire. The agent calls invoke(token, args). The gateway unpacks the token, checks the policy, and either dispatches or rejects. The model cannot forge a token because it never sees the signing key, and it cannot call tools outside its mint because those tokens were never issued. We have seen teams implement this on top of SPIFFE workload identities and on top of plain HMAC; both work. The key property is that scope is decided by the policy engine at session creation, not by the model at inference time.
How do tool firewalls differ from guardrails?
Tool firewalls sit between the agent runtime and the tool backend, not between the user and the model. NVIDIA's NeMo Guardrails and Protect AI's Rebuff try to filter inputs and outputs; that is useful for content policy but does nothing if the agent correctly decides to call a dangerous tool with well-formed arguments. A tool firewall intercepts every invoke call and evaluates it against a policy written in Rego, Cedar, or a similar language. At one client running Pinecone-backed RAG with a LangGraph supervisor, we moved from system-prompt guardrails to an Envoy-based tool firewall with OPA policies in Q3 2025; denied-but-attempted tool calls jumped from a reported 0.3% to an observed 7.1% in the first week. The calls were always happening. Guardrails were hiding them.
Where do reachability and tool scope meet?
They meet at the dependency graph. An agent that imports LangChain 0.3 and registers a SQLDatabaseToolkit has, by default, tools like sql_db_query with full database credentials baked into the chain. Reachability analysis against the agent's code and config tells you which tools are actually instantiable from a given entrypoint. In a recent audit of a 14-service agent fleet, we found 62% of registered tools were dead code — bound in initialization, never reachable from any live user intent. They were still valid targets if an attacker got prompt control. Pruning unreachable tool registrations cut the effective tool surface by more than half without touching business logic. This is the same shift-left reachability move we use for CVEs, applied to agent capability.
What about multi-hop agents and subagent delegation?
Multi-hop makes scope worse, not better. When a supervisor agent spawns a researcher subagent via the AutoGen 0.4 or CrewAI patterns, the subagent often inherits the full tool list of its parent. If the supervisor has access to send_email because it needs to notify users, and it delegates a "summarize this ticket" subtask, the subagent inherits email access it will never legitimately use. The enforcement pattern is delegation-with-attenuation: the parent mints a narrower capability token for the child, the child can only exercise that subset, and the policy layer refuses to widen. Anthropic's computer-use SDK added sub_scope support for this in February 2026; most open-source frameworks still require you to implement it by hand.
How do you test this without shipping breakage?
Shadow mode plus an eval harness. Run the tool firewall in observation-only against production traffic for two weeks; log every decision. Build an eval set of 200–500 adversarial prompts drawn from InjecAgent, AgentDojo, and your own incident postmortems. Measure the delta between what the model wants to do and what the policy would allow. At a healthcare client in December 2025, this surfaced a tool registered in a patient-lookup agent that could export entire cohorts via CSV — the model had never called it under normal traffic, but 12 of 300 adversarial prompts elicited the call. The eval caught it before the firewall went enforcing.
How Safeguard Helps
Safeguard's Griffin AI builds an AI-BOM for every agent service, mapping registered tools, their transitive dependencies, and the identities they run as. Reachability analysis over the agent code and config flags tool registrations that are dead, over-privileged, or duplicated across services. The eval harness runs adversarial prompt batteries against your agents and compares attempted tool calls to a declarative scope policy, producing a diff you can review before any firewall change goes enforcing. Policy gates in CI block merges that add new tool bindings without an accompanying scope declaration, so capability drift is caught at PR time rather than in production.