Container Security

Runtime Container Drift: Supply Chain Implications

Runtime drift is the last honest witness in container supply chain defence. This post covers what drift signals tell you, how to instrument for them, and how to investigate without overwhelming on-call.

Nayan Dey
Senior Security Engineer
7 min read

Build-time supply chain controls are necessary and incomplete. They tell you what an artifact claimed to be when it left the build system. They do not tell you what the artifact actually does once it is running. The gap between those two is where modern supply chain attacks succeed. Malicious code activates only at runtime, only under specific conditions, or only after a delay long enough to make pre-deployment scanning irrelevant. Runtime drift detection is the last honest witness, and it is undervalued in most security programs.

This post walks through the categories of drift that matter, the signals that surface them, and the investigation playbook that turns a drift alert into an actionable finding rather than another item in a queue.

What Drift Means In Container Context

Containers are immutable in theory and not in practice. The image is immutable. The running container is not. A container can write to its own root filesystem unless explicitly prevented. It can spawn arbitrary child processes. It can open arbitrary network connections. Each of these can be a legitimate part of the workload's behaviour, or it can be a sign that something has changed since the image was scanned.

Drift, in this context, is any deviation between the running container's behaviour and the expected behaviour as defined by the image and its declared purpose. The expected behaviour is established either through static analysis of the image at build time or through a learned baseline during a calibration period in production.

Filesystem Drift

Filesystem drift is the most reliable indicator. Most legitimate workloads do not write to their own root filesystem at runtime. Logs go to mounted volumes or stdout. Caches go to scratch directories that are explicitly mounted. Configuration is read from mounted secrets and config maps, not written to the filesystem.

When a container does write to its own root filesystem, the write is almost always one of three things. A misconfigured workload that should have been using a volume. An init script that is doing something the engineer did not realise was a write. Or an attacker who has dropped a tool, modified a binary, or written a backdoor.

Detecting filesystem drift well requires watching the container's write layer rather than the underlying disk. Kernel-level filesystem audit tools work well here, as does eBPF instrumentation that traces vfs_write calls scoped to the container's mount namespace. The signal is the path that was modified and the process that did it.

The first time you turn this on, expect noise. Many workloads write to /tmp at startup, write to /var/log internally, or have language runtimes that compile bytecode into the filesystem. Triage that noise against the workload's known-legitimate writes, build a per-image allowlist, and the signal-to-noise ratio improves quickly.
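To make that concrete, here is a minimal sketch of allowlist triage, assuming the detector emits write events carrying the image reference, the modified path, and the writing process. The field names and allowlist entries are illustrative, not tied to any particular tool; a real pipeline would feed these from its eBPF or audit source.

```python
import fnmatch
from dataclasses import dataclass

# Hypothetical per-image allowlist of paths the workload is known to write legitimately.
# Glob-style patterns; real tools differ in matching semantics.
ALLOWLISTS = {
    "registry.example.com/payments-api": [
        "/tmp/*",                 # scratch files at startup
        "/var/log/app/*",         # internal log writes not yet redirected to stdout
        "/app/__pycache__/*",     # language runtime compiling bytecode into the filesystem
    ],
}

@dataclass
class WriteEvent:
    image: str      # image reference of the container that wrote
    path: str       # path modified inside the container's write layer
    process: str    # process that performed the write

def is_drift(event: WriteEvent) -> bool:
    """Return True if the write falls outside the image's known-legitimate writes."""
    allowed = ALLOWLISTS.get(event.image, [])
    return not any(fnmatch.fnmatch(event.path, pattern) for pattern in allowed)

# Example: a write to /usr/local/bin matches nothing on the allowlist and surfaces as drift.
event = WriteEvent("registry.example.com/payments-api", "/usr/local/bin/dropper", "sh")
print(is_drift(event))  # True
```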

Process Drift

Process drift catches what filesystem drift misses. A workload that loads malicious code through eval, that injects into an existing process, or that uses a fileless technique can avoid filesystem drift entirely. It cannot avoid spawning a process, starting a thread, or executing a syscall.

The simplest useful process signal is the set of executables that have been exec'd inside the container. For a typical Go service, that set is one entry: the service binary itself. For a typical Python service, it is the Python interpreter and possibly a few subprocess calls to known utilities. Anything outside that set is a candidate finding.

The set has to be learned per workload, not assumed. Every workload has its own legitimate quirks, and a generic "no shell exec" rule produces too many false positives in environments where engineers run debug containers regularly. The right calibration is a few days of observation, a curated allowlist, and a structured review process for additions.
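A minimal sketch of that calibration-then-allowlist flow, assuming events arrive as a workload name plus the executable path that was exec'd; the baseline structure, dates, and workload names are illustrative.

```python
from datetime import datetime

# Hypothetical learned baseline: the set of executables observed per workload
# during a calibration window. Field names are illustrative.
BASELINE = {
    "payments-api": {
        "executables": {"/app/server"},                              # a Go service: one binary
        "calibrated_until": datetime(2024, 6, 1),
    },
    "billing-worker": {
        "executables": {"/usr/local/bin/python3", "/usr/bin/pg_dump"},
        "calibrated_until": datetime(2024, 6, 1),
    },
}

def classify_exec(workload: str, executable: str, now: datetime) -> str:
    entry = BASELINE.get(workload)
    if entry is None:
        return "no-baseline"            # workload has never been calibrated
    if now <= entry["calibrated_until"]:
        entry["executables"].add(executable)
        return "learning"               # still calibrating: record instead of alerting
    if executable in entry["executables"]:
        return "baseline"
    return "candidate-finding"          # new executable after calibration: review via the structured process

print(classify_exec("payments-api", "/bin/sh", datetime(2024, 7, 3)))  # candidate-finding
```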

Process lineage matters. A new process that spawns from the workload's main entrypoint is suspect at a different level than a process spawned from a kubectl exec session. The first is workload behaviour. The second is human behaviour. Treat them differently in the alert pipeline.
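A sketch of that distinction, under the assumption that processes descending from the workload's entrypoint can be traced back to PID 1 in the container's PID namespace, while a kubectl exec session is parented by the runtime shim outside it. The process model here is simplified for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Process:
    pid: int
    exe: str
    parent: Optional["Process"]

def lineage_class(proc: Process) -> str:
    """Classify a new process by whether its ancestry reaches PID 1 in the
    container's PID namespace (the workload entrypoint).

    Heuristic only: a process started via kubectl exec is parented by the
    runtime shim outside the namespace, so its visible chain never reaches
    the entrypoint.
    """
    node = proc
    while node.parent is not None:
        node = node.parent
    return "workload-behaviour" if node.pid == 1 else "human-session"

entrypoint = Process(pid=1, exe="/app/server", parent=None)
child = Process(pid=42, exe="/bin/sh", parent=entrypoint)
exec_session = Process(pid=57, exe="/bin/bash", parent=None)  # parented outside the namespace

print(lineage_class(child))         # workload-behaviour: route through the normal pipeline
print(lineage_class(exec_session))  # human-session: alert differently, correlate with audit logs
```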

Network Drift

Network drift is the noisiest of the three categories and the most informative when tuned correctly. Containers open network connections constantly. Most are legitimate. The interesting ones are connections to destinations the workload has never connected to before, or connections initiated outside of the workload's expected lifecycle phases.

The signal is the destination triple (IP, port, protocol) combined with the process that initiated the connection. A baseline collected over a week of normal operation captures the legitimate destinations. New destinations after baseline are candidates for review.

Two refinements are essential. DNS resolution context, so that a destination identified by name rather than IP can be tracked across IP rotation. And outbound versus inbound classification, because almost all supply chain anomalies are outbound, while the long tail of inbound connections from health checks and probes is not interesting.
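Putting those refinements together, a minimal sketch of the baseline check might look like the following. The baseline entries, field names, and the outbound-only filter are illustrative assumptions, not a prescription for how any specific detector stores its data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical baseline of outbound destinations collected during a week of
# normal operation. Entries prefer the resolved DNS name over the raw IP so
# the baseline survives IP rotation.
BASELINE = {
    ("api.stripe.com", 443, "tcp", "/app/server"),
    ("10.32.0.11", 5432, "tcp", "/app/server"),
}

@dataclass
class Connection:
    direction: str               # "outbound" or "inbound"
    dest_ip: str
    dest_name: Optional[str]     # DNS name from resolution context, if observed
    port: int
    protocol: str
    process: str                 # process that initiated the connection

def is_network_drift(conn: Connection) -> bool:
    if conn.direction != "outbound":
        return False             # inbound health checks and probes are not interesting
    dest = conn.dest_name or conn.dest_ip
    return (dest, conn.port, conn.protocol, conn.process) not in BASELINE

conn = Connection("outbound", "203.0.113.7", None, 8443, "tcp", "/app/server")
print(is_network_drift(conn))    # True: destination never seen during the baseline week
```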

Network drift also exposes a class of attack that filesystem and process drift cannot. A workload that was compromised through the supply chain may be configured to phone home to a command and control server only after a trigger. The phone-home is a network event that no other signal captures.

Investigation Playbook

A drift alert without a playbook is just noise. The playbook we run has four stages.

Stage one. Triage the alert. Look at the workload, the image digest, the time of the alert, and the specific signal. Decide whether the signal is consistent with a benign explanation (a misconfigured workload, a rare but legitimate operation) or whether it is consistent with a malicious explanation.

Stage two. Capture evidence before the workload is touched. The running container's filesystem state, the process list, the network connection table, and any logs. If the workload is going to be killed, this evidence has to be preserved first.
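A rough sketch of stage two, assuming kubectl access and a container image that still ships ps and a readable /proc (distroless images need runtime-level capture instead). Pod and namespace names are placeholders, and a mature forensic pipeline would also snapshot the container's write layer through the runtime.

```python
import subprocess
from pathlib import Path

def capture_evidence(pod: str, namespace: str, out_dir: str) -> None:
    """Snapshot pod state with kubectl before the workload is touched."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = {
        "pod.yaml": ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "yaml"],
        "logs.txt": ["kubectl", "logs", pod, "-n", namespace, "--all-containers"],
        "processes.txt": ["kubectl", "exec", pod, "-n", namespace, "--", "ps", "aux"],
        "connections.txt": ["kubectl", "exec", pod, "-n", namespace, "--", "cat", "/proc/net/tcp"],
    }
    for filename, cmd in commands.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out / filename).write_text(result.stdout or result.stderr)

# capture_evidence("payments-api-7d9f", "prod", "/tmp/evidence/payments-api-7d9f")
```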

Stage three. Compare against the supply chain record for the image digest. If the image was built from a known commit by a known builder with verifiable provenance, the failure mode is more likely to be runtime compromise than supply chain compromise. If the image's provenance is weak, treat it as both.

Stage four. Decide on containment. The default is to isolate the workload by network policy rather than to kill it, so that further investigation is possible. Killing the workload is a last resort and is reserved for cases where containment is otherwise impossible.
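For the default containment action, here is a sketch using the official Kubernetes Python client to quarantine the pod behind a deny-all NetworkPolicy. It assumes the cluster's CNI enforces NetworkPolicy and that the suspect pod is selectable by a label; both are assumptions about your environment, and the names are placeholders.

```python
from kubernetes import client, config

def isolate_workload(namespace: str, pod_label: str, pod_value: str) -> None:
    """Apply a deny-all NetworkPolicy to the suspect pod instead of killing it."""
    config.load_kube_config()
    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name=f"quarantine-{pod_value}"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(match_labels={pod_label: pod_value}),
            policy_types=["Ingress", "Egress"],
            # No ingress or egress rules: all traffic to and from the pod is denied,
            # but the processes keep running and remain available for forensics.
        ),
    )
    client.NetworkingV1Api().create_namespaced_network_policy(namespace, policy)

# isolate_workload("prod", "app", "payments-api")
```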

The playbook is owned by the security team and reviewed quarterly. The most common revisions are to the triage criteria, as new categories of legitimate drift surface and need to be incorporated.

Tuning For Signal-To-Noise

The biggest barrier to runtime drift detection is alert fatigue. A poorly tuned drift detector produces hundreds of alerts a day and gets disabled within a quarter. A well-tuned one produces a handful and survives.

The tuning levers we use. Per-workload baselines rather than fleet-wide thresholds. Confidence scoring on each alert based on the strength of the deviation from baseline. Suppression windows for known-noisy operations like deployment rollouts. And aggregation, so that a hundred alerts from the same workload at the same time become one finding with a count rather than a hundred items in a queue.
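As an illustration of the last two levers, here is a toy aggregation pass with a suppression window. The alert shape, window definitions, and workload names are invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Known-noisy operations: suppress drift alerts for ten minutes around a rollout.
SUPPRESSION_WINDOWS = {
    "payments-api": [(datetime(2024, 7, 3, 14, 0), timedelta(minutes=10))],
}

def suppressed(workload: str, ts: datetime) -> bool:
    return any(start <= ts <= start + length
               for start, length in SUPPRESSION_WINDOWS.get(workload, []))

def aggregate(alerts: list[tuple[str, str, datetime]]) -> list[dict]:
    """Collapse many alerts from the same workload and signal into one finding with a count."""
    buckets: dict[tuple[str, str], int] = defaultdict(int)
    for workload, signal, ts in alerts:
        if not suppressed(workload, ts):
            buckets[(workload, signal)] += 1
    return [{"workload": w, "signal": s, "count": c} for (w, s), c in buckets.items()]

alerts = [("payments-api", "filesystem-drift", datetime(2024, 7, 3, 14, 2))] * 100 \
       + [("billing-worker", "process-drift", datetime(2024, 7, 3, 15, 0))]
print(aggregate(alerts))  # rollout noise suppressed; one finding for billing-worker
```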

The metric to track is the false-positive rate, expressed as false-positive findings per workload per week. Below one is healthy. Above five is unsustainable. The detector should be tuned continuously toward the lower end, with security and platform engineering treating tuning as a recurring task rather than a one-time setup.
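The arithmetic is simple enough to keep in a dashboard or a notebook; the numbers below are invented purely to show the calculation.

```python
# Toy calculation of the tuning metric: false-positive findings per workload per week.
false_positive_findings = 42     # findings closed as benign over the period
workloads = 60                   # workloads covered by the detector
weeks = 4                        # length of the period

rate = false_positive_findings / workloads / weeks
print(f"{rate:.2f} false positives per workload per week")  # 0.18: below one, healthy

if rate > 5:
    print("unsustainable: tune baselines and suppression before on-call disables the detector")
elif rate > 1:
    print("needs tuning")
```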

How Safeguard Helps

Safeguard ships a runtime drift detector that integrates with the rest of the supply chain stack. The detector watches filesystem writes, process executions, and network connections through eBPF instrumentation, and builds per-workload baselines automatically during a calibration period. Findings are correlated against the supply chain record for the image digest, so that an alert arrives with the image's provenance, SBOM, and admission history attached. The investigation workflow captures forensic evidence before containment actions, and containment defaults to network isolation rather than termination. Tuning controls, including suppression windows, baseline refresh, and false-positive scoring, are surfaced in the same UI as the alert queue. The result is runtime drift detection that earns its place in the on-call rotation rather than being silenced after the first noisy week.
