AI Security

Tool-Call Audit: The Missing AI Observability Layer

Most AI observability stacks log prompts and completions. The actual security signal is in the tool calls. Here is how to capture it.

Nayan Dey
Senior Security Engineer
7 min read

What current AI observability misses

The AI observability tools that became popular over the last two years are mostly built around the prompt and the completion. They log the user message, the assistant message, sometimes the system prompt, and a handful of token-level metrics. That is useful for debugging quality issues. It is not useful for answering security questions, and it is barely useful for answering operational questions when an agent does something unexpected.

The reason is that when an agent has tools, the interesting events do not happen in the prompt or the completion. They happen in the tool calls. A model can produce an entirely benign-looking completion while having called three tools that wrote to a database, exfiltrated a record, and triggered a webhook. If your observability stack only captures the text of the prompt and the text of the response, you have no evidence of any of that.

The tool-call audit record

The fix is a separate audit layer that captures every tool call as a structured record. The record has to include enough context to be reconstructable months later, when nobody remembers what the agent was doing. The fields that matter are the timestamp, the calling user, the calling agent, the model and version, the MCP server identity, the tool name, the arguments, the return value, the latency, and the policy decision that allowed or denied the call. If any of those fields is missing, the record is partially blind, and you will discover the gap during an incident rather than before one.
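
As a concrete sketch, a record along these lines covers those fields. This is illustrative Python, not a standard schema; the field names and types are assumptions you would adapt to your own stack.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ToolCallAuditRecord:
    """One structured record per tool call. Field names are illustrative."""
    timestamp: datetime            # when the call was made (UTC)
    user_id: str                   # the human on whose behalf the agent acted
    agent_id: str                  # the agent that issued the call
    model: str                     # model name and version
    mcp_server: str                # identity of the MCP server that served the tool
    tool_name: str                 # the tool that was invoked
    arguments: dict[str, Any]      # full arguments as passed to the tool
    return_value: Any              # full return value from the tool
    latency_ms: float              # end-to-end latency of the call
    policy_decision: str           # "allow" or "deny"
    policy_version: str            # version of the policy that made the decision
    rules_fired: list[str] = field(default_factory=list)  # rules that matched

def new_record(**fields) -> ToolCallAuditRecord:
    # Stamp the record at creation time so clock handling stays consistent.
    return ToolCallAuditRecord(timestamp=datetime.now(timezone.utc), **fields)
```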

The arguments and the return value are the fields that take the most thought. They can contain sensitive data. They can be very large. They have to be redacted in some cases and stored in full in others. The pattern that has worked is to store the full record in a tightly controlled audit store, with hashes and structural metadata mirrored to the broader observability stack. Engineers debugging a normal issue can see the structural metadata. Investigators with the right access can pull the full record from the audit store.
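
A minimal sketch of that split, building on the record type above: the full record goes to the audit store, and only a hashed, structural projection is mirrored to the general observability stack. The helper name and the exact projection fields are illustrative.

```python
import hashlib
import json

def structural_view(record: ToolCallAuditRecord) -> dict:
    """Metadata-only projection that is safe to mirror into the broader stack.

    The full arguments and return value stay in the audit store; this keeps
    only hashes, sizes, and top-level shape, so engineers can correlate and
    filter without seeing the sensitive payloads.
    """
    args_json = json.dumps(record.arguments, sort_keys=True, default=str)
    ret_json = json.dumps(record.return_value, sort_keys=True, default=str)
    return {
        "timestamp": record.timestamp.isoformat(),
        "user_id": record.user_id,
        "agent_id": record.agent_id,
        "tool_name": record.tool_name,
        "mcp_server": record.mcp_server,
        "policy_decision": record.policy_decision,
        "latency_ms": record.latency_ms,
        "arguments_sha256": hashlib.sha256(args_json.encode()).hexdigest(),
        "arguments_keys": sorted(record.arguments.keys()),
        "arguments_bytes": len(args_json),
        "return_sha256": hashlib.sha256(ret_json.encode()).hexdigest(),
        "return_bytes": len(ret_json),
    }
```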

Why structural logging beats text logging

The instinct of teams new to this is to log the tool call as a string, the way you would log an HTTP request. That works until you need to query it. Two days into an incident, when you are trying to find every tool call that touched a specific customer ID, a string log is useless. You end up writing brittle regular expressions and missing half the matches.

A structured record, with the arguments and return value as typed fields, is queryable from day one. You can ask which tool calls passed customer_id 12345 as an argument and get an answer in seconds. You can ask which agents called the deployment tool with the production cluster as an argument over the last week and get a clean list. The cost of building a structured logging path is real, but it is paid back the first time you need to actually use the logs.
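
As a sketch of what that buys you, assuming the records land in a Postgres table with the arguments stored as JSONB (the table, column, and tool names here are hypothetical), both of those questions become short queries:

```python
# Hypothetical schema:
#   tool_call_audit(called_at, agent_id, tool_name, arguments JSONB, policy_decision, ...)

FIND_BY_CUSTOMER = """
SELECT called_at, agent_id, tool_name, policy_decision
FROM tool_call_audit
WHERE arguments ->> 'customer_id' = %(customer_id)s
ORDER BY called_at DESC;
"""

FIND_PROD_DEPLOYS = """
SELECT called_at, agent_id, arguments
FROM tool_call_audit
WHERE tool_name = 'deploy_service'
  AND arguments ->> 'cluster' = 'production'
  AND called_at > now() - interval '7 days';
"""
```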

Cardinality and storage

Tool-call audit logs are high cardinality. A busy agent platform can produce millions of tool calls per day, each with rich arguments and return values. Storing all of that in a generic logging system gets expensive fast. The pattern that works is a tiered storage model. Recent records, say the last thirty days, live in a hot store optimized for query. Older records get rolled into a cold store that supports long-range retrieval but at lower cost. Records that are part of an active investigation are pinned to the hot store regardless of age.
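
A sketch of the lifecycle job that implements this; the hot window, the store interfaces, and the pinning mechanism are all stand-ins for whatever backends you actually run.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)

def roll_to_cold(hot_store, cold_store, pinned_investigation_ids: set[str]) -> int:
    """Daily lifecycle job: move aged records to cold storage, keep pinned ones hot.

    `hot_store` and `cold_store` are stand-ins for your storage backends;
    the point is the policy, not the API.
    """
    cutoff = datetime.now(timezone.utc) - HOT_WINDOW
    moved = 0
    for record in hot_store.older_than(cutoff):
        if record.investigation_id in pinned_investigation_ids:
            continue  # records tied to an active investigation stay hot regardless of age
        cold_store.write(record)
        hot_store.delete(record.id)
        moved += 1
    return moved
```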

The retention policy needs to be deliberate. Too short, and you cannot answer questions about historical incidents. Too long, and you accumulate sensitive data that becomes its own liability. A common landing point is ninety days hot, two years cold, with regulated workloads on a longer schedule that matches the relevant compliance regime.
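
Expressed as configuration, that landing point might look something like this; the workload names and day counts are illustrative, not recommendations for your environment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionPolicy:
    hot_days: int    # queryable hot store
    cold_days: int   # long-range cold store; older records are deleted unless pinned

RETENTION = {
    "default":           RetentionPolicy(hot_days=90, cold_days=730),   # 90 days hot, 2 years cold
    "regulated-finance": RetentionPolicy(hot_days=90, cold_days=2555),  # longer cold tier to match the compliance regime
    "internal-tools":    RetentionPolicy(hot_days=30, cold_days=365),
}
```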

Linking audit to policy

The audit layer is most valuable when it is linked to the policy layer that decides whether tool calls are allowed in the first place. Each record should include the policy version, the rules that fired, and the decision. When a policy changes, the audit log should be queryable for the records that the new policy would have decided differently, so you can reason about the impact of the change before rolling it out.

This is the pattern that makes policy iteration safe. Without audit-linked policy reasoning, every policy change is a leap of faith. With it, you can ask what would have happened if this policy had been in place for the last week and get a real answer based on real traffic.
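
A sketch of that replay, assuming a policy engine whose evaluate call returns a decision plus the rules that fired; both the engine interface and the record fields mirror the earlier sketch and are assumptions, not a fixed API.

```python
def shadow_evaluate(audit_records, proposed_policy) -> list:
    """Replay a proposed policy over historical tool calls before rolling it out.

    `proposed_policy.evaluate(...)` is a stand-in for your policy engine; it is
    assumed to return an "allow"/"deny" decision and the rules that fired.
    """
    changed = []
    for record in audit_records:
        new_decision, rules_fired = proposed_policy.evaluate(
            user_id=record.user_id,
            agent_id=record.agent_id,
            tool_name=record.tool_name,
            arguments=record.arguments,
        )
        if new_decision != record.policy_decision:
            changed.append((record, record.policy_decision, new_decision, rules_fired))
    return changed

# Usage: run over the last week of traffic and review every flipped decision
# before the policy change ships.
```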

Detecting anomalies

Once the audit log exists and is structured, anomaly detection becomes straightforward. The patterns to watch for are familiar from any access-log analysis. A user whose agent suddenly calls a tool it has never called before. A tool whose argument distribution shifts overnight. An MCP server whose call volume jumps by an order of magnitude. A model that starts calling tools in patterns that no longer match its system prompt.
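
The first of those checks is a good illustration of how little machinery is needed once the records are structured. A sketch, assuming you can pull a baseline window and a recent window of records from the audit store:

```python
from collections import defaultdict

def find_novel_tool_calls(baseline_records, recent_records):
    """Flag (agent, tool) pairs seen in the recent window but never in the baseline.

    This is the simplest of the checks listed above; argument-distribution
    shifts and volume spikes follow the same pattern with different statistics.
    """
    seen = defaultdict(set)
    for r in baseline_records:
        seen[r.agent_id].add(r.tool_name)

    anomalies = []
    for r in recent_records:
        if r.tool_name not in seen[r.agent_id]:
            anomalies.append(r)  # surface to a human; do not block automatically
    return anomalies
```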

None of these anomalies are necessarily malicious. They could be a new feature shipping, a benign behavior change, or a developer testing something. The audit log does not need to decide. It needs to surface the anomaly to a human who can investigate. The role of the audit layer is to make that investigation possible in minutes rather than days.

Privacy and consent

A tool-call audit log can capture a lot of personal data. If a tool reads a customer record, the customer record ends up in the audit log. That has to be handled with the same care as any other system that processes personal data. The pattern is to treat the audit store as a sensitive system, restrict access through the same identity controls as production data, redact fields that should not be retained, and apply data subject deletion requests to the audit store as well as the production store.
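
One way to handle the deletion side, sketched under the assumption that the audit store can locate records by an argument value: scrub the payload fields while keeping the structural trail, so the record still shows that a call happened without retaining the personal data. The store API here is hypothetical.

```python
REDACTED = "[redacted: data subject deletion]"

def apply_deletion_request(audit_store, subject_customer_id: str) -> int:
    """Scrub the payloads of records that reference the data subject.

    Structural fields (who called what, when, and the policy decision) are
    kept so the audit trail stays intact; the personal data itself is removed.
    `find_by_argument` and `overwrite` are stand-ins for your store's API.
    """
    scrubbed = 0
    for record in audit_store.find_by_argument("customer_id", subject_customer_id):
        record.arguments = {key: REDACTED for key in record.arguments}
        record.return_value = REDACTED
        audit_store.overwrite(record, reason="data_subject_deletion")
        scrubbed += 1
    return scrubbed
```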

The tradeoff is real. An audit log that is heavily redacted is less useful for investigation. An audit log that retains everything is a privacy risk. The right balance depends on the workload. Internal-facing agents that touch only employee data can retain more. Customer-facing agents that touch end-user data should redact aggressively.

What to do with the audit

The audit log is not just a passive record. It is the input to several active workflows. It feeds the anomaly detector. It feeds compliance reports for SOC 2 and similar frameworks. It feeds the post-incident analysis when something goes wrong. And it feeds the policy iteration loop, where you use historical traffic to decide whether a proposed policy change is safe to roll out.

A tool-call audit log that nobody looks at is a tool-call audit log that nobody will trust when an incident happens. The investment pays off only if the workflows around the log are real and run by real people.

How Safeguard Helps

Safeguard captures every tool call as a structured audit record automatically, with full attribution to the user, the agent, the model, and the MCP server, and with the policy decision attached. The audit store is queryable from the same console that engineers use for everything else, anomaly detection runs on the stream by default, and retention is configurable per workload to match your compliance needs. When an incident happens, the audit is already there, structured, queryable, and ready to support the investigation.
