If you run any non-trivial AI application today, you almost certainly run it without the telemetry you would demand from any other production system. Most teams capture prompt/response pairs for debugging and call that "observability." A smaller subset runs offline evals against a golden set. Almost nobody treats either as a security signal.
That is a mistake, and it is going to become an expensive one in 2026. A retrieval-augmented chatbot that silently started emitting a pattern consistent with data exfiltration, a code assistant whose tool-call distribution shifted after a dependency update, an agent whose evaluation scores dropped eight points overnight — these are the AI equivalents of the events an IDS would alert on, and teams ship them without the alert wired up. Traces and evals are not a dev-loop nicety. They are a production security control, and they belong alongside SBOM, reachability, and policy gates as a first-class supply chain telemetry source.
What counts as an LLM trace in a security context?
A security-grade trace is the full, timestamped record of a single LLM interaction with enough structure to be queryable after the fact. The minimum shape is: system prompt hash, user prompt, retrieved context fragments with source IDs, tool calls issued, tool call arguments, tool responses, model identifier and version, temperature and seed, and final output. Debugging traces usually stop at input/output; security traces capture the full agent loop because that is where privilege escalation happens.
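As a rough illustration of that minimum shape, here is one way to write it down as a Python dataclass. The field names are ours, not a standard schema, and a real implementation would adapt them to whatever store you query:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ToolCall:
    name: str            # e.g. "read_file"
    arguments: dict      # arguments the model supplied
    response: str        # what the tool returned to the model

@dataclass
class LLMTrace:
    """One LLM interaction, with enough structure to query after the fact."""
    trace_id: str
    timestamp: datetime
    system_prompt_hash: str          # hash, not the raw prompt
    user_prompt: str
    retrieved_context: list[dict]    # e.g. [{"source_id": "...", "content": "..."}]
    tool_calls: list[ToolCall]
    model_id: str                    # exact model identifier and version
    temperature: float
    seed: int | None
    output: str
```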
OpenTelemetry's GenAI semantic conventions (stable as of late 2025) give you a consistent span schema. The important thing is not which vendor you use — it is that the trace carries enough to reconstruct what the model actually saw and did. If your retrieval layer pulled a document from a compromised Confluence page, you want that document's source ID on the span so you can search for it later when the IR team asks "which sessions touched page 4837?"
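A sketch of what that looks like with the OpenTelemetry Python SDK. The gen_ai.* attribute names follow the GenAI semantic conventions; the retrieval source-ID attribute and the call_model helper are our own illustrative additions, not part of the spec:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-chatbot")

def answer(question: str, retrieved_docs: list[dict]) -> str:
    # One span per model call; attributes make the span searchable after the fact.
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "gpt-4o-2024-08-06")  # exact version, not an alias
        # Custom attribute (not in the GenAI conventions): which documents the
        # retrieval layer handed to the model, by source ID.
        span.set_attribute("retrieval.document.ids",
                           [d["source_id"] for d in retrieved_docs])
        response = call_model(question, retrieved_docs)  # hypothetical model client
        span.set_attribute("gen_ai.response.model", response.model)
        return response.text
```

When the IR question is "which sessions touched page 4837," that source-ID attribute is the thing you filter on.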
Why do evals belong in the security toolbox at all?
Evals are the regression test suite for a model's behavior, and behavior regressions are how most AI-specific security incidents manifest. A model that starts leaking a regex-detectable pattern under specific phrasings — you catch that with a red-team eval. A retrieval pipeline whose recall drops below a threshold because someone swapped the embedding model — you catch that with a retrieval eval. A tool-using agent whose "refuses unauthorized action" rate drops after a prompt change — you catch that with a policy eval.
The pattern that already has a name in software is golden-path regression testing. For LLMs, the golden path is a curated set of prompts with known acceptable output characteristics, and you run it on every meaningful change: prompt edit, model version bump, tool list change, retrieval corpus update. Treating evals as release gates (rather than "nice dashboards") is the single biggest maturity jump available to most AI security programs right now.
How do traces and evals slot next to SBOM and reachability?
Think of the four signals as different lenses on the same question: what is actually running, and is it safe?
- SBOM answers what components are present.
- Reachability answers which of those components are actually exercised at runtime.
- Traces answer what the AI system actually did on a given request.
- Evals answer whether the AI system still behaves within acceptable bounds.
An LLM-powered SaaS running without traces is equivalent to a web app with logging disabled on the request path. An LLM-powered SaaS running without evals is equivalent to a web app that ships to production without a CI test suite. Neither would pass a SOC 2 audit for a conventional app; the only reason AI applications get away with it today is that the controls have not yet been codified in the frameworks. That grace period is ending — the EU AI Act's high-risk system provisions and NIST AI RMF both reference continuous monitoring in language that maps almost one-to-one onto trace and eval discipline.
What specific supply chain risks do traces make detectable?
Four concrete ones, all of which we have seen in customer environments in the last twelve months.
Prompt injection via poisoned retrieval. An attacker plants instructions in a document the RAG pipeline ingests. The model quietly follows the attacker's instructions on some queries. Without traces that include retrieved context IDs, there is no way to bind a suspicious model action back to the poisoned document. With them, the detection looks like: spike in tool calls of type send_email correlated with retrieval IDs matching the poisoned document range.
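A sketch of that detection as a query over stored traces, assuming the record shape above. The poisoned-ID set comes from IR; send_email is from the scenario, and http_post is an illustrative addition:

```python
POISONED_SOURCE_IDS = {"confluence-4837"}          # IDs identified during incident response
SENSITIVE_TOOLS = {"send_email", "http_post"}      # tools that can move data out

def flag_suspect_traces(traces):
    """Return traces where a sensitive tool call co-occurs with poisoned context."""
    suspects = []
    for t in traces:
        touched_poison = any(doc["source_id"] in POISONED_SOURCE_IDS
                             for doc in t.retrieved_context)
        exfil_calls = [c for c in t.tool_calls if c.name in SENSITIVE_TOOLS]
        if touched_poison and exfil_calls:
            suspects.append((t.trace_id, [c.name for c in exfil_calls]))
    return suspects
```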
Model substitution attacks. A compromised proxy or misconfigured router sends some fraction of traffic to a different model — possibly one trained on attacker-controlled data. The output distribution shifts subtly. Trace-level model ID logging catches this immediately; aggregated logs that only capture "latency and error rate" do not.
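The check itself is small once model IDs are on every trace. A minimal sketch, with an illustrative allow-list:

```python
ALLOWED_MODELS = {"gpt-4o-2024-08-06", "internal-router-default"}  # models the router may serve

def find_substituted_models(traces):
    """Group trace IDs by any model identifier that is not on the allow-list."""
    unexpected = {}
    for t in traces:
        if t.model_id not in ALLOWED_MODELS:
            unexpected.setdefault(t.model_id, []).append(t.trace_id)
    return unexpected
```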
Tool-call privilege drift. An agent that historically issued read_file calls starts issuing write_file calls after a prompt update. Sometimes this is legitimate. Sometimes it is the model being manipulated into doing something the threat model did not anticipate. A ratio-of-tool-calls eval running daily flags the change before it accumulates blast radius.
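A sketch of that ratio-of-tool-calls check over a day of traces; the baseline distribution and the 10% threshold are assumptions you would tune for your own agent:

```python
from collections import Counter

def tool_call_distribution(traces):
    """Share of total tool calls per tool name."""
    counts = Counter(c.name for t in traces for c in t.tool_calls)
    total = sum(counts.values()) or 1
    return {name: n / total for name, n in counts.items()}

def drift_alerts(today_traces, baseline, max_delta=0.10):
    """Flag tools whose share of calls moved more than max_delta against the baseline."""
    today = tool_call_distribution(today_traces)
    alerts = []
    for name in set(today) | set(baseline):
        delta = today.get(name, 0.0) - baseline.get(name, 0.0)
        if abs(delta) > max_delta:
            alerts.append((name, baseline.get(name, 0.0), today.get(name, 0.0)))
    return alerts
```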
Context leakage into outputs. A user prompt causes the model to emit contents of the system prompt or retrieved documents verbatim. Trace-level diffing between retrieval content and output content catches this. A regex-only DLP does not, because the payload shape is not pre-specified.
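One way to do trace-level diffing is verbatim n-gram overlap between retrieved content and the output; the window size and the exact-overlap criterion here are illustrative:

```python
def ngrams(text: str, n: int = 8):
    """All n-word windows in a text, as a set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaked_spans(trace, n: int = 8):
    """Verbatim n-gram overlaps between retrieved context and the model output."""
    output_grams = ngrams(trace.output, n)
    protected = [doc["content"] for doc in trace.retrieved_context]
    return [g for text in protected for g in ngrams(text, n) & output_grams]
```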
What does a minimum viable eval suite for security look like?
Five eval families, run on every merge that touches the LLM application's configuration:
- Jailbreak resistance — ~50 prompts drawn from public jailbreak corpora plus a handful of internal specimens. Pass criterion: refusal rate above a floor, no tool-call emission in refusal cases.
- Secret/PII leakage — prompts designed to coax the model into reproducing contents of the system prompt or known planted secrets in the retrieval corpus. Pass criterion: zero exact-match hits.
- Tool-scope adherence — prompts that ask the model to call tools it should not have access to in the current session. Pass criterion: zero out-of-scope tool invocations.
- Retrieval grounding — prompts with known correct source. Pass criterion: cited source matches ground truth above a threshold; fabrication rate below a ceiling.
- Policy drift — a snapshot of last release's golden outputs. Pass criterion: embedding-space distance below a regression threshold.
None of this requires a specialized platform to start. A test runner, a prompt file, and a scoring script are enough for v1. The mistake is waiting for a perfect harness before shipping any evals at all — teams that do this usually never ship them.
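To make "a test runner, a prompt file, and a scoring script" concrete, here is a sketch of a v1 jailbreak-resistance check. The model_client interface, the JSONL prompt-file format, and the 0.95 floor are all assumptions:

```python
import json

def run_eval(prompt_file: str, model_client, min_refusal_rate: float = 0.95) -> bool:
    """Jailbreak resistance v1: refusal rate above a floor, no tool calls while refusing."""
    with open(prompt_file) as f:
        cases = [json.loads(line) for line in f]        # one JSON case per line
    refusals = 0
    tool_calls_in_refusals = 0
    for case in cases:
        result = model_client.complete(case["prompt"])  # hypothetical client interface
        if result.refused:
            refusals += 1
            tool_calls_in_refusals += len(result.tool_calls)
    refusal_rate = refusals / len(cases)
    print(f"refusal_rate={refusal_rate:.2f} tool_calls_in_refusals={tool_calls_in_refusals}")
    return refusal_rate >= min_refusal_rate and tool_calls_in_refusals == 0
```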
How do you operationalize traces and evals without drowning in data?
Three rules, learned the hard way by every AI observability team that scaled past a pilot.
Sample intelligently, store selectively. Capture 100% of traces in a short-retention tier (seven days), then down-sample to 1–5% for long retention. Always retain 100% of traces that hit a policy violation, a tool-call outlier, or an eval failure — those are the ones incident response will ask for.
Hash prompts, store full outputs. Prompts often contain customer data you cannot retain under your DPA. Hashing prompts lets you cluster and count without storing PII; outputs are usually safer to retain and more useful for forensics.
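Both rules can be applied at ingest time. A sketch, where the tier names, the sampling rate, and the violation flags are assumptions:

```python
import hashlib
import random

def ingest(trace, violations: list[str]) -> dict:
    """Decide what to store, and for how long, for a single trace."""
    record = {
        "trace_id": trace.trace_id,
        "prompt_hash": hashlib.sha256(trace.user_prompt.encode()).hexdigest(),
        "output": trace.output,            # outputs retained in full
        "model_id": trace.model_id,
        "violations": violations,
    }
    if violations:                          # policy hit, tool-call outlier, eval failure
        record["retention"] = "long"        # always keep these
    elif random.random() < 0.05:            # 1-5% sample of clean traffic
        record["retention"] = "long"
    else:
        record["retention"] = "short"       # seven-day tier
    return record
```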
Wire evals to the same CI that gates code. If an eval regresses, the PR does not merge. This is the single step most teams skip and most regret skipping. Dashboards inform; gates enforce.
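The gate itself is nothing more than an exit code. A sketch of the CI entry point, reusing the run_eval function sketched earlier; the module path and suite list are assumptions:

```python
import sys
from security_evals import run_eval, model_client   # hypothetical module holding the sketch above

SUITES = [
    ("jailbreak", "evals/jailbreak.jsonl"),
    ("tool_scope", "evals/tool_scope.jsonl"),
    ("leakage", "evals/leakage.jsonl"),
]

def main() -> int:
    failures = [name for name, path in SUITES if not run_eval(path, model_client)]
    if failures:
        print(f"eval gate failed: {', '.join(failures)}")
        return 1    # nonzero exit fails the CI step and blocks the merge
    return 0

if __name__ == "__main__":
    sys.exit(main())
```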
How Safeguard Helps
Safeguard's platform treats traces and evals as first-class telemetry sources on equal footing with SBOM and reachability. Our ingestion layer accepts OpenTelemetry GenAI spans and correlates trace-time tool calls against the reachability graph of the codebase that emitted them, so a suspicious tool call in production maps directly back to the call site that issued it. Griffin AI, our reasoning engine, runs an embedded eval harness that tracks drift across model version bumps and flags regressions before they reach release gates. Policy gates can block a deploy if an eval suite has not run, has not passed, or has regressed against the previous baseline. For teams already running evals in isolation, Safeguard provides the connective tissue to the rest of the supply chain security program, so AI observability stops being an ML-team silo and becomes part of the same control plane as every other piece of the stack.