A customer-facing agent at a SaaS vendor we reviewed in January 2026 leaked an AWS access key in a help response. The key belonged to an engineer who had pasted it into an internal ticket months earlier; the ticket had been crawled into the RAG corpus; the key surfaced when a user asked a vaguely related question. The team had an output filter in place, the open-source detect-secrets scanner, and it had matched the key. The filter logged the hit and let the response through because the integration was in "monitor mode" and had never been promoted to "block mode" after a false-positive scare three months earlier. The control existed, was working correctly, and was turned off at the blocking point. This is the usual shape of output-filter incidents. The controls aren't bad. The operating posture is.
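A minimal sketch of that posture gap, with hypothetical function names rather than any vendor's actual integration: the detector fires either way, and only the mode setting decides whether the key reaches the user.

```python
import logging
import re

# Classic AWS access key ID pattern; detect-secrets and similar scanners ship equivalents.
AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")
logger = logging.getLogger("output_filter")

def apply_output_filter(response_text: str, mode: str = "monitor") -> str:
    """Hypothetical gateway hook. `mode` is the setting that never got flipped."""
    if AWS_KEY_RE.search(response_text):
        logger.warning("secret pattern matched in outbound response")
        if mode == "block":
            return AWS_KEY_RE.sub("[REDACTED]", response_text)  # enforcement path
    return response_text  # monitor mode: the hit is logged and the key ships anyway
```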
When does output filtering actually prevent damage?
When it sits on a narrow output surface, runs deterministically, and is tuned against the specific categories you care about. The filter classes that demonstrate real efficacy in production: secrets detection (TruffleHog, Gitleaks, GitGuardian API), PII detection (Microsoft Presidio, AWS Comprehend PII), structured-output validation (the response must be valid JSON matching a schema), and function-call-argument validation (tool calls must match declared types). Each of these has measurable precision and recall and can be enforced at the gateway. The OpenAI Moderation API and broad "safety classifier" approaches work for consumer policy violations but are poor substitutes for domain-specific data-leakage controls in enterprise deployments.
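A sketch of a gateway-side PII check using Presidio's analyzer; the entity list, threshold, and blocking behavior are illustrative and should be tuned against your own labeled traffic.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # default NLP engine plus Presidio's built-in recognizers

def enforce_pii_policy(response_text: str, threshold: float = 0.6) -> str:
    findings = analyzer.analyze(
        text=response_text,
        language="en",
        entities=["CREDIT_CARD", "US_SSN", "EMAIL_ADDRESS"],  # scope to enforced categories
    )
    hits = [f for f in findings if f.score >= threshold]
    if hits:
        # Block (or redact) before the response leaves the gateway.
        raise ValueError(f"PII in model output: {sorted({f.entity_type for f in hits})}")
    return response_text
```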
What categories of attack do output filters miss?
Anything the filter isn't specifically looking for. We audited a Claude Opus 4 deployment in late 2025 where the output filter caught credit card numbers and SSNs but didn't touch medical record numbers, because the regex library shipped with its U.S. healthcare patterns disabled by default. A successful prompt-injection test exfiltrated MRNs via the chatbot's summary feature. The filter logged a clean pass. A meta-analysis of 2025 AI Incident Database (AIID) reports shows roughly 60% of confirmed LLM data-leakage incidents hit categories that the deployed filter was not configured to catch, not categories the filter can't handle. Configuration is where these fail.
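The fix in that engagement was configuration, not a new tool. A sketch of registering a custom recognizer with Presidio for a category the defaults skip; the MRN regex is illustrative only, since record-number formats vary by EHR vendor and institution.

```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Illustrative pattern only: real MRN formats differ across institutions.
mrn_pattern = Pattern(name="mrn_basic", regex=r"\bMRN[-: ]?\d{6,10}\b", score=0.65)
mrn_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_RECORD_NUMBER",
    patterns=[mrn_pattern],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(mrn_recognizer)  # MRNs are now a category the filter looks for

findings = analyzer.analyze(text="Patient MRN: 84731092 discharged today.", language="en")
```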
How should output filters compose with input filters?
Asymmetrically. Input filters protect the model; output filters protect the user and downstream systems. A defense-in-depth stack we've deployed at three enterprises: input-side moderation on the user turn (to catch prompt-injection attempts), retrieval-side injection scoring on RAG chunks (to block indirect injection), output-side DLP and tool-argument validation on the model response, and post-tool validation on any side-effect output. Each layer has a different error budget. The input filter can be tight because a false positive just triggers a retry; the output filter before a tool call must be precise because blocking an action has higher business cost than blocking a text response. Treating "guardrails" as one pipeline stage that does everything is the common mistake.
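A sketch of that stack as separate hooks with separate enforcement postures; the stage names, checks, and thresholds are illustrative, not a specific gateway's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FilterStage:
    name: str
    check: Callable[[str], bool]   # True means "flag this payload"
    blocking: bool                 # False = monitor mode, log only
    note: str                      # why this stage's error budget differs

def allow(stage: FilterStage, payload: str) -> bool:
    """Returns True if the payload may proceed past this stage."""
    if stage.check(payload):
        print(f"{stage.name}: flagged ({stage.note})")
        return not stage.blocking
    return True

# One stage per surface; thresholds and postures are deliberately not shared.
user_turn = FilterStage("input_moderation", lambda s: "ignore previous" in s.lower(),
                        blocking=True, note="tight: a false positive just retries")
rag_chunk = FilterStage("injection_score", lambda s: False,
                        blocking=True, note="per-chunk, before it reaches the context")
model_response = FilterStage("output_dlp", lambda s: "AKIA" in s,
                             blocking=True, note="protects the user and downstream systems")
tool_call = FilterStage("arg_validation", lambda s: False,
                        blocking=True, note="precise: blocking an action is costly")
```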
What about structured output and function calling?
Structured output is the under-appreciated security primitive. When you constrain the model to emit JSON matching a declared schema — via OpenAI's response_format, Anthropic's tool-use schema, or llama.cpp's grammar constraints — you eliminate entire classes of injection and exfiltration attacks that depend on free-form text. The model literally cannot emit a paragraph leaking a key; it can only emit typed fields, and you validate the fields server-side. vLLM 0.6's guided decoding, Outlines, and Instructor have made this cheap. In a health-tech rollout we advised in November 2025, moving from free-form responses to strict JSON schemas reduced output-filter positives by 78% because the schema made most exfiltration paths structurally impossible.
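A sketch of the server-side half of that pattern using Pydantic; the schema fields are invented for illustration, and the point is that only typed, validated fields ever reach downstream code.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError

class TicketSummary(BaseModel):
    model_config = ConfigDict(extra="forbid")        # unknown fields are rejected, not ignored
    ticket_id: str = Field(pattern=r"^TICK-\d+$")
    summary: str = Field(max_length=500)
    priority: int = Field(ge=1, le=4)

def parse_model_output(raw_json: str) -> TicketSummary:
    try:
        # Free-form prose fails JSON parsing; wrong types or extra fields fail validation.
        return TicketSummary.model_validate_json(raw_json)
    except ValidationError as exc:
        # Treat a schema failure like a filter hit: never forward the raw text downstream.
        raise ValueError(f"model output failed schema validation: {exc}") from exc
```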
How do you measure output filter efficacy honestly?
Three numbers, computed weekly against labeled eval data: true-positive rate on a curated attack battery, false-positive rate on a labeled benign traffic sample, and the percentage of production traffic that ran through the filter in blocking mode rather than monitoring mode. The third is the one nobody wants to report. At one fintech customer, the first two numbers looked excellent on the status deck — 94% TPR, 1.2% FPR — and the third was 11%. Eighty-nine percent of traffic ran through filters configured to log rather than block, because every team that had ever seen a false positive had their integration downgraded. We now treat "percent in enforcement" as the primary KPI for output filtering, and the other two as inputs to the tuning that gets you there.
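A sketch of the weekly computation. `run_filter`, the labeled datasets, and the traffic-log fields are placeholders for whatever your gateway and eval harness actually emit.

```python
def weekly_filter_kpis(run_filter, attack_battery, benign_sample, traffic_log):
    """run_filter(text) returns True if flagged; traffic_log is a list of per-request records."""
    tpr = sum(run_filter(t) for t in attack_battery) / len(attack_battery)
    fpr = sum(run_filter(t) for t in benign_sample) / len(benign_sample)
    # The number nobody wants on the status deck: share of traffic where a hit would block.
    enforcing = sum(1 for req in traffic_log if req["mode"] == "block")
    return {
        "true_positive_rate": tpr,
        "false_positive_rate": fpr,
        "pct_in_enforcement": enforcing / len(traffic_log),
    }
```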
What about the latency and cost of filtering?
Output filters add real latency. A full DLP scan on a 2K-token response with Presidio costs 30–80ms on a modern CPU; invoking Prompt Guard or a secrets scanner on the same response adds similar amounts. At p99 under load, these stack. Streaming responses complicate matters because the filter has to buffer or operate incrementally. The patterns that work: run cheap deterministic filters (regex, schema validation) in-line on the stream, run expensive model-based filters on the completed response before side effects commit, and for user-facing text accept the streaming-then-retroactive-redaction pattern where flagged responses are replaced after full-text analysis. This is not elegant but it matches the reality of user-facing latency budgets.
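A sketch of the streaming pattern, assuming a chunk iterator and a hypothetical `expensive_scan` callback: the cheap regex runs in-line over a small rolling window so a key split across chunk boundaries is still caught, and the slower scan runs once on the completed text before side effects commit.

```python
import re

SECRET_RE = re.compile(r"AKIA[0-9A-Z]{16}")
OVERLAP = 32  # longer than any pattern enforced in-line

def stream_with_filters(chunks, expensive_scan):
    pending, full_text = "", []
    for chunk in chunks:
        full_text.append(chunk)
        pending += chunk
        if SECRET_RE.search(pending):            # cheap deterministic check on the stream
            yield "\n[response withheld]"
            return
        if len(pending) > OVERLAP:
            yield pending[:-OVERLAP]             # hold back a tail so a secret split across
            pending = pending[-OVERLAP:]         # chunk boundaries is caught next iteration
    yield pending
    expensive_scan("".join(full_text))           # model-based filter runs once on the full
                                                 # response, before any side effect commits
```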
How Safeguard Helps
Safeguard's Griffin AI inventories every output-filter deployment as a control in the AI-BOM, mapping which categories it enforces, which applications it protects, and whether it runs in blocking or monitoring mode. Reachability analysis identifies agents and endpoints whose responses bypass the filter stack entirely, so gaps surface before incidents do. The eval harness measures TPR and FPR against a customer-specific attack battery on every deploy, and policy gates block releases that reduce filter coverage or downgrade controls from enforcement to monitoring without an approved exception.