AI Security

Out-Of-Band Confirmation For Irreversible Tool Calls

Some tool calls cannot be undone. Out-of-band confirmation is the cheapest defense for that small set, and the most expensive thing to skip.

Shadab Khan
Security Engineer
7 min read

The class of calls that need a different default

Most tool calls an AI agent makes are recoverable. If the agent reads the wrong record, no harm is done. If the agent writes a draft that turns out to be wrong, you delete the draft. If the agent triggers a build that fails, you fix the issue and rebuild. The cost of a mistake is the cost of redoing the work, and the user experience benefits from those calls running automatically without a human in the loop.

A small but important class of calls does not have this property. Sending a wire transfer is irreversible. Deleting a production database is irreversible. Posting a public statement under the company's account is irreversible. Pushing to a regulated production system is effectively irreversible, because even the rollback path requires a deploy, an explanation, and possibly a disclosure. For these calls, the right default is not to run automatically. The right default is to require out-of-band confirmation from a human who can see what the agent intends to do.

What out-of-band actually means

Out-of-band means the confirmation arrives through a channel separate from the one the agent is acting in. If the agent is operating inside a chat session, the confirmation does not happen in the chat session. It arrives through email, a mobile push, a separate approval app, or a phone call. The reason is straightforward. If the same surface that an attacker can compromise is also the surface where confirmation happens, the confirmation provides almost no protection. An attacker who has hijacked the agent's reasoning can also hijack the in-band confirmation flow.

A separate channel forces the attacker to compromise two systems instead of one. That is a meaningful escalation. It is also what gives the user a chance to notice that something is wrong, because the confirmation arrives in a context where the user is paying attention rather than scrolling through agent output.
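A minimal sketch of the two-channel rule, assuming a hypothetical `ConfirmationBroker` and illustrative channel names such as `"chat"` and `"mobile_push"`. It refuses to open a request whose approval channel is the agent's own channel, and it ignores approvals that arrive in-band. It also assumes the transport authenticates which channel a message came from; without that, the channel check is only a label.

```python
import secrets
from dataclasses import dataclass


@dataclass
class PendingConfirmation:
    request_id: str
    agent_channel: str      # channel the agent acts in, e.g. "chat"
    approval_channel: str   # separate channel the user approves on, e.g. "mobile_push"
    state: str = "pending"


class ConfirmationBroker:
    """Hypothetical broker: approvals count only if they arrive out-of-band."""

    def __init__(self) -> None:
        self._pending: dict[str, PendingConfirmation] = {}

    def request(self, agent_channel: str, approval_channel: str) -> PendingConfirmation:
        if approval_channel == agent_channel:
            raise ValueError("approval channel must differ from the agent's channel")
        req = PendingConfirmation(secrets.token_urlsafe(16), agent_channel, approval_channel)
        self._pending[req.request_id] = req
        return req

    def approve(self, request_id: str, via_channel: str) -> bool:
        req = self._pending.get(request_id)
        if req is None or req.state != "pending":
            return False
        # An "approval" typed into the channel the agent controls is ignored.
        if via_channel != req.approval_channel:
            return False
        req.state = "approved"
        return True
```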

What to confirm

The confirmation has to show enough information that the user can reason about it. The minimum is the action, the target, and a short description of what will happen. If the action is a wire transfer, the recipient and the amount have to be visible. If the action is a database deletion, the database identifier and the scope have to be visible. If the action is a deployment, the cluster, the namespace, and the version have to be visible.
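One way to enforce this is to make the prompt builder refuse to produce a prompt that lacks the action-specific fields. This sketch assumes hypothetical action and field names taken from the examples above; a real system would define its own templates.

```python
# Hypothetical per-action templates: the fields an approver must see.
REQUIRED_FIELDS = {
    "wire_transfer": ("recipient", "amount"),
    "delete_database": ("database_id", "scope"),
    "deploy": ("cluster", "namespace", "version"),
}


def build_confirmation_prompt(action: str, target: str, description: str,
                              details: dict[str, str]) -> str:
    required = REQUIRED_FIELDS.get(action)
    if required is None:
        raise ValueError(f"no confirmation template for action {action!r}")
    missing = [name for name in required if not details.get(name)]
    if missing:
        # Refuse to send a vague prompt rather than fall back to "please confirm".
        raise ValueError(f"prompt for {action!r} is missing: {', '.join(missing)}")
    lines = [f"Action: {action}", f"Target: {target}", f"What will happen: {description}"]
    lines += [f"{name}: {details[name]}" for name in required]
    return "\n".join(lines)
```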

Showing only "The agent wants to perform an action, please confirm" is worse than not asking at all, because users learn to click yes by reflex. The confirmation prompt is doing actual security work only if the user can reasonably understand what they are confirming. The cost of designing the prompt correctly is small. The cost of a confirmation that users have learned to ignore is large.

When to require it

The decision of which calls need confirmation is a policy decision, not an engineering decision. The policy should be written in terms the business can review, not in terms only the agent platform team understands. A reasonable starting set is any tool call that moves money, any tool call that affects production systems with regulated data, any tool call that changes external customer-facing state, and any tool call that deletes data above a certain volume.
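Writing the policy as named rules keeps it reviewable by people outside the platform team. This sketch encodes the starting set above; the `ToolCall` fields and the 1,000-row deletion threshold are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical description of a tool call; field names are illustrative.
    name: str
    moves_money: bool = False
    touches_regulated_prod: bool = False
    changes_customer_facing_state: bool = False
    rows_deleted: int = 0


DELETE_ROW_THRESHOLD = 1_000  # assumed value; the business sets this, not engineering

# Each rule is a readable reason plus a predicate, so the list can be reviewed as policy.
CONFIRMATION_RULES = {
    "moves money": lambda c: c.moves_money,
    "affects regulated production data": lambda c: c.touches_regulated_prod,
    "changes customer-facing state": lambda c: c.changes_customer_facing_state,
    f"deletes more than {DELETE_ROW_THRESHOLD} rows": lambda c: c.rows_deleted > DELETE_ROW_THRESHOLD,
}


def confirmation_reasons(call: ToolCall) -> list[str]:
    """Return the rules that apply; an empty list means the call may run automatically."""
    return [reason for reason, applies in CONFIRMATION_RULES.items() if applies(call)]
```

For example, `confirmation_reasons(ToolCall("send_wire", moves_money=True))` returns `["moves money"]`, and the returned reasons can be shown in the prompt itself.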

The set should be small. If everything requires confirmation, users develop habits that defeat the purpose. The right way to think about it is that confirmation is a budget. Every confirmation prompt costs user attention, and you have a fixed budget per user per week before they start clicking through reflexively. Spend the budget on the calls that actually need it.
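The budget can be measured. This sketch, assuming a hypothetical limit of five prompts per user per ISO week, counts prompts so you can see when the policy is spending more attention than users have. Exceeding the budget should trigger a review of the policy, not suppress confirmations.

```python
from datetime import date

# Assumed weekly limit for illustration; the real number comes from observing users.
WEEKLY_PROMPT_BUDGET = 5


class AttentionBudget:
    """Counts confirmation prompts per user per ISO week."""

    def __init__(self, limit: int = WEEKLY_PROMPT_BUDGET) -> None:
        self.limit = limit
        self._spent: dict[tuple[str, int, int], int] = {}

    @staticmethod
    def _key(user: str, day: date) -> tuple[str, int, int]:
        iso_year, iso_week, _ = day.isocalendar()
        return (user, iso_year, iso_week)

    def record_prompt(self, user: str, day: date) -> None:
        key = self._key(user, day)
        self._spent[key] = self._spent.get(key, 0) + 1

    def over_budget(self, user: str, day: date) -> bool:
        # True is a signal to narrow the policy, never a reason to skip a confirmation.
        return self._spent.get(self._key(user, day), 0) > self.limit
```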

The latency problem

Out-of-band confirmation introduces latency. An agent that has to wait for a human approval cannot complete its work in seconds. That latency is the cost of the control, and it has to be acknowledged honestly. Two patterns help. The first is to design agent workflows so that confirmation-required calls happen at natural pause points, where the user is already expecting to be involved. The second is to batch confirmations where it makes sense, so the user approves a coherent set of actions rather than being interrupted repeatedly.
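Batching can be as simple as grouping pending actions by the workflow that produced them. In this sketch `PendingAction` and `workflow_id` are hypothetical; the point is that each approval covers one coherent run while still listing every action in it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingAction:
    workflow_id: str   # hypothetical grouping key: one agent task or run
    summary: str       # the human-readable line shown in the prompt


def batch_for_approval(actions: list[PendingAction]) -> dict[str, list[str]]:
    """Group pending actions by workflow so each run produces one approval request."""
    batches: dict[str, list[str]] = {}
    for action in actions:
        # Every action stays listed individually: batching reduces interruptions,
        # not the detail the approver sees.
        batches.setdefault(action.workflow_id, []).append(action.summary)
    return batches
```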

The pattern to avoid is making the agent block silently while waiting for confirmation. If the user does not know an approval is pending, the work just stops. The agent should explicitly surface that it is waiting, in the in-band channel where the user is interacting with it, and the confirmation should arrive promptly through the out-of-band channel.
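A sketch of the non-silent wait, assuming the agent runtime exposes two hypothetical callbacks: `post_in_band`, which writes to the chat the user is watching, and `send_out_of_band`, which delivers the request to the approval channel and resolves with the user's decision.

```python
import asyncio
from typing import Awaitable, Callable


async def request_confirmation(
    summary: str,
    post_in_band: Callable[[str], None],
    send_out_of_band: Callable[[str], Awaitable[bool]],
) -> bool:
    # Tell the user in the chat that work is paused, and where to look.
    post_in_band(f"Paused: waiting for your approval to {summary}. "
                 "A request has been sent to your approval app.")
    approved = await send_out_of_band(summary)
    # Close the loop in-band so the user knows whether the work resumed.
    post_in_band("Approved, continuing." if approved else "Not approved, action skipped.")
    return approved
```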

Confirmation, attestation, and signing

Confirmation can be a simple yes or no, or it can be richer. For high-value actions, the confirmation should be a signed attestation that captures who approved, when, and what they were shown. The signed attestation gets stored in the audit log alongside the tool call, and provides evidence months later that the action was approved by a real human looking at real data.
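A minimal attestation sketch using only the Python standard library. It records who approved, when, and a SHA-256 hash of the exact prompt text they saw, then signs the canonical JSON with HMAC-SHA256. The field names are assumptions. HMAC is a simplification: anyone holding the key can forge a record, so a production system would more likely use a per-approver asymmetric signature (for example, Ed25519) that auditors can verify without the signing key, and would store the prompt text next to its hash.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def _canonical(body: dict) -> bytes:
    # Stable serialization so the same record always produces the same signature.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_attestation(key: bytes, approver: str, tool_call_id: str,
                     shown_prompt: str, approved_at: datetime) -> dict:
    body = {
        "approver": approver,
        "tool_call_id": tool_call_id,
        "approved_at": approved_at.astimezone(timezone.utc).isoformat(),
        # Hash of exactly what the approver saw.
        "shown_prompt_sha256": hashlib.sha256(shown_prompt.encode("utf-8")).hexdigest(),
    }
    body["signature"] = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return body


def verify_attestation(key: bytes, attestation: dict) -> bool:
    body = {k: v for k, v in attestation.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, attestation.get("signature", ""))
```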

This is especially important for actions that touch regulated systems. Auditors will want to see not just that the action happened but that it was approved by an authorized person. A signed attestation answers that question cleanly. A check-the-box confirmation does not.

Failure modes to avoid

A few failure modes show up repeatedly in confirmation flows. The first is approval fatigue, where users approve everything because everything is asked. The fix is to keep the set of confirmation-required calls small and meaningful. The second is approval delegation, where users hand their approval credentials to an assistant or a script. The fix is to make confirmations require a real-time interaction with a device the user controls, rather than something that can be automated away.
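The server side of a real-time device check can be sketched as a single-use, short-lived challenge bound to one request. The `DeviceChallenge` class and the 120-second lifetime are assumptions. This only makes replay and pre-approval harder; proving that a person actually interacted requires a device-side check, such as the user-presence and user-verification flags in WebAuthn.

```python
import secrets
import time
from typing import Callable


class DeviceChallenge:
    """Hypothetical single-use challenges, each bound to one confirmation request."""

    def __init__(self, ttl_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self._clock = clock
        self._issued: dict[str, tuple[str, float]] = {}

    def issue(self, request_id: str) -> str:
        nonce = secrets.token_urlsafe(16)
        self._issued[nonce] = (request_id, self._clock())
        return nonce

    def redeem(self, nonce: str, request_id: str) -> bool:
        # Single use: the challenge is removed on the first attempt, valid or not.
        entry = self._issued.pop(nonce, None)
        if entry is None:
            return False
        bound_request, issued_at = entry
        return bound_request == request_id and self._clock() - issued_at <= self.ttl_s
```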

The third is approval racing, where an attacker tries to trigger a legitimate-looking confirmation while the user is distracted. The fix is to give the user enough context in the prompt that they can recognize a confirmation that does not match their current activity, and to require an explicit cancel-or-approve action rather than a default approval after a timeout.
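The no-default-approval rule is easy to get wrong in code, because a timeout handler is a natural place to slip in a default. This asyncio sketch, with a hypothetical `Decision` enum, treats only an explicit `"approve"` as approval; a timeout yields `EXPIRED` and any other answer yields `CANCELLED`.

```python
import asyncio
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    CANCELLED = "cancelled"
    EXPIRED = "expired"


async def wait_for_decision(decision: "asyncio.Future[str]", timeout_s: float) -> Decision:
    try:
        answer = await asyncio.wait_for(decision, timeout=timeout_s)
    except asyncio.TimeoutError:
        return Decision.EXPIRED  # silence is never consent
    if answer == "approve":
        return Decision.APPROVED
    return Decision.CANCELLED  # "cancel" or anything unexpected
```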

The fourth is confirmation through a channel the user has stopped checking. If your out-of-band channel is email, and the user has not read email in three days, your confirmations are not happening. The channel has to match where the user actually pays attention.
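Channel choice can follow evidence of where the user is active. This sketch assumes you track a hypothetical last-seen timestamp per channel and picks the first preferred channel seen within a staleness window (12 hours here, an arbitrary choice). The agent's own chat channel should never appear in the preference list.

```python
from datetime import datetime, timedelta
from typing import Optional


def pick_channel(last_seen: dict[str, datetime], now: datetime, preference: list[str],
                 max_staleness: timedelta = timedelta(hours=12)) -> Optional[str]:
    for channel in preference:
        seen = last_seen.get(channel)
        if seen is not None and now - seen <= max_staleness:
            return channel
    # No channel the user is demonstrably watching: escalate rather than send into a void.
    return None
```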

Designing for the agent's perspective

The agent also has to be designed for this. A well-designed agent surfaces its intent before triggering the confirmation. It explains what it plans to do and why, in a way the user can review. It does not treat the confirmation as a magic gate that produces an answer. If the agent's reasoning is bad, the user should be able to see that and stop the action before approving, rather than being forced to evaluate only the final tool call in isolation.
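One way to give the user the agent's reasoning rather than a bare tool call is to have the agent emit a structured intent that is rendered into the confirmation. The `AgentIntent` fields below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIntent:
    goal: str              # what the user asked for
    plan: tuple[str, ...]  # steps the agent intends to take
    rationale: str         # why this irreversible step is needed
    tool_call: str         # the exact call awaiting confirmation

    def render(self) -> str:
        steps = "\n".join(f"  {i}. {step}" for i, step in enumerate(self.plan, start=1))
        return (
            f"Goal: {self.goal}\n"
            f"Plan:\n{steps}\n"
            f"Why this step: {self.rationale}\n"
            f"Pending call: {self.tool_call}"
        )
```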

This pushes some of the work back onto the prompt and the agent design, which is the right place for it. Confirmation is the last line of defense, not the only one.

How Safeguard Helps

Safeguard ships out-of-band confirmation as a built-in policy primitive. You declare which tool calls require confirmation, which channel they should arrive on, and what context to display, and the platform handles the routing, the timeout behavior, and the signed attestation. Confirmations are linked to the tool-call audit record automatically, so every approved action has a traceable approver and approval context. The result is a confirmation flow that protects irreversible actions without forcing your team to build it themselves.
