
eBPF Security Controls: A Production Experience Report

Field notes on running Tetragon, Falco, and Cilium eBPF controls in production Kubernetes clusters, with observed overhead, policy traps, and kernel constraints.

Shadab Khan
Security Engineer
5 min read

Running eBPF-based security controls across four clusters (two on AWS EKS 1.29, one on GKE 1.30, one on-prem Talos 1.7) for 14 months surfaced a distinct set of engineering tradeoffs that the marketing material around eBPF does not discuss honestly. eBPF is genuinely powerful: we replaced seccomp-bpf profiles, a sidecar-based syscall tracer, and a userspace audit pipeline with a single Tetragon 1.2 deployment and measured real reductions in mean-time-to-detect for supply-chain-adjacent behavior like unexpected process execution and outbound connections from build runners. But the failure modes are unusual, the kernel constraints are real, and the policy semantics are subtle enough that a misauthored TracingPolicy can cost more than it saves. This post covers what we deployed, what we saw, and where we pushed back on the "eBPF everywhere" default. Numbers are real, cluster names anonymized.

What eBPF security stacks did you actually run?

Three: Falco 0.39 for general runtime detection, Tetragon 1.2 for policy-enforced kernel hook points, and Cilium 1.16 for network policy, plus its Hubble observability layer. Falco's modern_ebpf driver (the default since 0.37) uses CO-RE (BPF Compile Once, Run Everywhere) via libbpf and BTF, replacing the older kmod driver. Tetragon's TracingPolicy CRDs express selectors on tp_btf and kprobe attach points, letting us write policies like "block execve of /tmp/* from any pod in namespace build-runners" with an enforcement action, not just an alert. Cilium 1.16 enforces NetworkPolicies and CiliumNetworkPolicies using per-endpoint BPF maps for L3/L4 and Envoy for L7. Kernel versions mattered: Ubuntu 22.04's 5.15 LTS kernel lacked some BPF features used by Tetragon 1.2's generic tracepoint policies, so we moved those nodes to 6.1.
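For readers who have not written one, the policy shape looks like this. A minimal sketch modeled on the upstream TracingPolicy examples, not our production policy; the name is hypothetical, the selector set is simplified, and you should verify field support against your Tetragon version:

```yaml
# Sketch of a namespaced Tetragon enforcement policy: kill any process in
# build-runners that execs a binary under /tmp/. Modeled on the Tetragon
# 1.2 upstream examples; illustrative, not our production policy.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: block-tmp-exec          # hypothetical name
  namespace: build-runners
spec:
  kprobes:
  - call: "sys_execve"
    syscall: true
    args:
    - index: 0                  # execve's first argument: the binary path
      type: "string"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Prefix"
        values:
        - "/tmp/"
      matchActions:
      - action: Sigkill         # enforcement, not just an alert
```

The matchActions stanza is what separates this from a pure observability hook; the same policy without it only emits events.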

What was the measured overhead?

Under 3 percent CPU for Tetragon and Cilium combined on control-plane nodes, with outliers during policy compilation. On our busiest EKS node group (c6i.4xlarge, 16 vCPU, Linux 6.1), the Tetragon agent averaged 180 millicores with 24 active TracingPolicies watching execve, openat, and tcp_connect. The Falco daemon averaged 120 millicores with the default ruleset plus 18 custom rules. The real cost was in memory: Tetragon held 280 MiB RSS per node, Falco 140 MiB, and Cilium's cilium-agent 450 MiB because of its BPF map sizing for a cluster with 3,800 endpoints. Latency impact on tracked syscalls was under 2 microseconds at p50 and 14 microseconds at p99, measured via perf sched and Tetragon's own metrics. Our Envoy-backed L7 policies added 0.8 milliseconds p99 to service-to-service traffic; acceptable, but worth measuring per cluster.
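Those numbers fed directly into how we size the agents. A hedged sizing sketch for the Tetragon DaemonSet via its Helm chart values, rounded up from the averages above with headroom for policy-compilation spikes; key names follow the upstream chart, so verify against your chart version:

```yaml
# values.yaml fragment: requests sized from observed per-node averages
# (180m CPU, 280 MiB RSS); limits leave room for compilation outliers.
tetragon:
  resources:
    requests:
      cpu: 250m
      memory: 384Mi
    limits:
      memory: 512Mi   # no CPU limit: throttling during policy loads hurts more
```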

Where did the eBPF verifier actually get in the way?

During policy evolution, not initial deploy. Early on we wrote a Tetragon TracingPolicy that matched process-argv strings longer than 512 bytes; the generated BPF program exceeded the verifier's complexity limit (1 million verified instructions since kernel 5.2, though branch complexity was the real bottleneck) and refused to load, with BPF_PROG_LOAD returning -E2BIG. We split the check into two policies with simpler matchers. A separate issue: on kernel 5.15, Falco's modern_ebpf driver failed CO-RE field relocations for certain task_struct offsets on some AMI variants; upgrading to Amazon Linux 2023 (kernel 6.1) resolved it. The practical discipline is that every TracingPolicy needs a CI test against each kernel version in the fleet, ideally with libbpf-tools/testenv or a staged canary node, as in the sketch below.
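One shape that CI gate can take, here as a GitHub Actions matrix over fleet kernels. The runner labels and the load-policies.sh helper are hypothetical placeholders, not tooling we are claiming exists upstream:

```yaml
# Illustrative only: apply every TracingPolicy on a canary node per kernel
# version and fail the build on verifier rejections (e.g. -E2BIG).
name: tracingpolicy-kernel-matrix
on: [pull_request]
jobs:
  verify-load:
    strategy:
      matrix:
        kernel: ["5.15", "6.1"]                  # every kernel in the fleet
    runs-on: [self-hosted, "kernel-${{ matrix.kernel }}"]
    steps:
    - uses: actions/checkout@v4
    - name: Load policies, fail on verifier errors
      run: ./hack/load-policies.sh policies/     # hypothetical helper script
```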

What supply-chain scenarios did eBPF actually catch?

Three real incidents over 14 months. One: a CI runner pod executed curl | bash against an unexpected host after a dependency pulled in a malicious npm postinstall script, caught by a Tetragon policy blocking egress from the build-runners namespace to any IP not on a CIDR allowlist. Two: a container started by a scheduled job spawned busybox nc listening on a high port; Falco's existing rule "Unexpected listen on non-standard port" fired and the configured kill action terminated the process. Three: a kubectl exec from a developer laptop into a production namespace outside business hours, caught via Cilium Hubble flow logs feeding a SIEM rule. None of these required novel tooling; the value of eBPF was the combination of low overhead and enforce-not-observe. Without enforce mode we would have had detection without prevention, which in the first case would have meant a second-hop lateral-movement attempt from the CI runner.
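The control behind the first incident had roughly this shape. A simplified sketch modeled on Tetragon's tcp_connect examples, with placeholder CIDRs in place of our real allowlist; confirm the operator names against your Tetragon version before enforcing:

```yaml
# Kill any process in build-runners that opens a TCP connection to a
# destination outside the allowlist. Illustrative; CIDRs are placeholders.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: build-runner-egress-allowlist
  namespace: build-runners
spec:
  kprobes:
  - call: "tcp_connect"
    syscall: false
    args:
    - index: 0
      type: "sock"             # kernel socket struct; addresses extracted from it
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotDAddr"   # match destinations NOT in the allowlist
        values:
        - "10.0.0.0/8"         # placeholder CIDRs
        - "100.64.0.0/10"
      matchActions:
      - action: Sigkill
```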

What would you do differently?

Stage rollouts more aggressively, and write policies for what you allow, not what you block. Our initial Tetragon policies were "block execve of these bad binaries," which is a losing game because the attacker picks the binary. Switching to "allow execve only of binaries under /usr/bin and /app/bin, log and block everything else" in CI runners cut the false-negative surface dramatically (see the sketch below). We also regret not starting with Cilium's identity-aware policies earlier; pod-identity-based network policies eliminated the fragile IP-based allowlists. Finally, treat kernel upgrades as a first-class part of the eBPF policy lifecycle: when node pools upgrade from 5.15 to 6.1, re-run the full policy test matrix against the new BTF.
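The allowlist-shaped version of the execve policy looks roughly like this. A sketch only: the prefix operators on matchBinaries are documented for recent Tetragon releases, so confirm support on your version:

```yaml
# Invert the logic: kill any execve whose binary is outside the two
# approved prefixes, instead of enumerating bad binaries.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: exec-allowlist          # hypothetical name
  namespace: build-runners
spec:
  kprobes:
  - call: "sys_execve"
    syscall: true
    selectors:
    - matchBinaries:
      - operator: "NotPrefix"   # everything outside the allowlist
        values:
        - "/usr/bin/"
        - "/app/bin/"
      matchActions:
      - action: Sigkill
```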

How Safeguard Helps

Safeguard complements runtime eBPF controls by providing the build-time signal they need to enforce. Signed SBOMs attested with Sigstore cosign 2.4 and in-toto 1.0 statements let Tetragon and Kyverno 1.12 policies make enforcement decisions based on artifact provenance, not just behavior. Safeguard's policy gates can block images without matching SLSA v1.0 provenance from reaching clusters where eBPF policies expect signed workloads, closing the loop between static and runtime controls. Findings from runtime eBPF alerts can be correlated with component-level vulnerability data to prioritize response by known exploitability.
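As one illustration of such a gate, a minimal Kyverno sketch that refuses Pods whose images lack a SLSA v1.0 provenance attestation. The registry pattern and key are placeholders, and Safeguard's actual integration may differ; validate the schema against your Kyverno version:

```yaml
# Admission-time gate: every image from the placeholder registry must carry
# a verifiable SLSA v1.0 provenance attestation.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-slsa-provenance
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-provenance
    match:
      any:
      - resources:
          kinds:
          - Pod
    verifyImages:
    - imageReferences:
      - "registry.example.com/*"              # placeholder registry
      attestations:
      - type: https://slsa.dev/provenance/v1  # SLSA v1.0 predicate type
        attestors:
        - entries:
          - keys:
              publicKeys: |-
                -----BEGIN PUBLIC KEY-----
                ...
                -----END PUBLIC KEY-----
```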
