Container Runtime Security Comparison: runc, gVisor, Kata, and Firecracker

Your container runtime determines the strength of your isolation boundary. Here is an honest comparison of runc, gVisor, Kata Containers, and Firecracker from a security perspective.

Bob
Infrastructure Security Lead
7 min read

The container runtime sits between your workload and the host kernel. It defines the isolation boundary. A weak runtime means a compromised container is one kernel exploit away from owning the host. A strong runtime means you can run untrusted workloads with confidence.

Most Kubernetes clusters run runc. Most Kubernetes clusters are one CVE away from cluster-wide compromise. Here is why, and what the alternatives actually offer.

runc: The Default Standard

runc is the OCI reference implementation. Docker, containerd, and CRI-O all use runc by default. It creates containers using Linux kernel primitives: namespaces for isolation, cgroups for resource limits, seccomp for syscall filtering, and AppArmor or SELinux for mandatory access control.

Security Model

runc containers share the host kernel. Every container on a node executes syscalls against the same kernel. This is efficient—no duplication, minimal overhead—but it creates a shared attack surface.

The kernel exposes over 300 syscalls. A default Docker container with seccomp filtering blocks roughly 50 of them. That leaves 250+ syscalls available, each one a potential entry point for kernel exploits.

Known Escape Vectors

  • CVE-2019-5736: Overwrite host runc binary through /proc/self/exe
  • CVE-2020-15257: containerd-shim abstract Unix socket reachable from containers sharing the host network namespace
  • CVE-2022-0185: Heap overflow in the kernel's filesystem context (fsconfig) handling
  • CVE-2024-21626: runc working directory breakout via leaked file descriptors

Each of these gave container-to-host escape with root privileges. They are patched, but they illustrate the fundamental problem: shared kernel means shared risk.

When runc Is Acceptable

runc is fine for trusted first-party workloads in environments with:

  • Pod Security Standards at Restricted level
  • Seccomp profiles tuned per workload
  • AppArmor or SELinux enforcement
  • Regular kernel patching (days, not weeks)
  • Runtime security monitoring (Falco, Tetragon)
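Concretely, a Pod that satisfies the Restricted level of the Pod Security Standards looks something like this (a sketch; the name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-internal-service   # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault          # or Localhost with a per-workload tuned profile
  containers:
  - name: app
    image: internal-service:1.2.3   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]               # re-add individual capabilities only as needed
```

None of this changes the isolation boundary — the workload still shares the host kernel — but it shrinks the syscall and capability surface an attacker can reach.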

gVisor: Application Kernel Sandboxing

gVisor implements a user-space kernel called Sentry that intercepts container syscalls before they reach the host kernel. It acts as a compatibility layer, reimplementing Linux syscall behavior in a memory-safe language (Go).

Security Model

When a container process makes a syscall, it lands in the Sentry process, not the host kernel. Sentry handles the syscall in user space. Where real kernel interaction is unavoidable, the channels are deliberately narrow: file access is delegated to a separate restricted host process called the Gofer, and networking runs through Sentry's own user-space network stack (netstack) over a small set of allowed host syscalls.

The result: even if an attacker achieves code execution inside the container and triggers a kernel exploit payload, the payload hits Sentry, not the real kernel. Sentry is a much smaller, auditable codebase compared to the full Linux kernel.

What gVisor Blocks

  • Direct kernel syscall exploitation
  • /proc and /sys based information leaks
  • Kernel module loading
  • Raw socket access
  • Most file-based kernel interactions

Performance Impact

gVisor adds measurable overhead:

  • Syscall-heavy workloads: 2-10x slower (file I/O, process creation)
  • Network throughput: 20-40% reduction
  • Memory: Additional overhead per sandbox (50-100MB)
  • Compute-bound workloads: Minimal impact (<5%)

Compatibility Limitations

gVisor does not implement every Linux syscall. Workloads that depend on:

  • io_uring (not supported)
  • Certain ioctl operations
  • GPU direct access
  • Kernel modules

will either fail or require fallback to runc.

Deployment in Kubernetes

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-pod
spec:
  runtimeClassName: gvisor
  containers:
  - name: untrusted-workload
    image: user-submitted-code:latest
```

GKE offers gVisor as a first-class option via GKE Sandbox.

Kata Containers: MicroVM Isolation

Kata Containers runs each container (or pod) inside a lightweight virtual machine. Instead of sharing the host kernel, each workload gets its own kernel running inside a VM managed by QEMU/KVM or Cloud Hypervisor.

Security Model

Full hardware-assisted virtualization. The isolation boundary is the hypervisor (KVM), not the kernel's namespace mechanism. This is the same isolation model that separates AWS EC2 instances from each other.

A kernel exploit inside a Kata container compromises only that VM's kernel. The host kernel is untouched. Breaking out requires a hypervisor escape, which is a fundamentally harder class of vulnerability.

Performance Profile

  • Boot time: 100-200ms (vs. ~50ms for runc)
  • Memory: 20-50MB overhead per VM (kernel + agent)
  • I/O: 5-15% overhead for storage and networking
  • CPU: Near-native performance (hardware virtualization is efficient)

Kata Containers are significantly faster than gVisor for I/O-intensive workloads because they run a real kernel, just an isolated one.

Nested Virtualization

Kata requires hardware virtualization support. Running Kata inside a VM (nested virtualization) is possible but adds overhead and may not be available in all cloud environments. AWS Bare Metal instances, GCP nested virtualization, and Azure DCsv3 all support it.
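The RuntimeClass API can encode both constraints from this section: `overhead` tells the scheduler to account for each VM's fixed memory and CPU cost, and `scheduling.nodeSelector` keeps Kata pods on nodes with virtualization support. A sketch, assuming nodes carry the `katacontainers.io/kata-runtime: "true"` label (kata-deploy applies a label like this, but verify for your install):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-scheduled
handler: kata-qemu
overhead:
  podFixed:
    memory: "50Mi"   # per-VM kernel + agent overhead, per the numbers above
    cpu: "250m"      # illustrative value; measure for your VMM
scheduling:
  nodeSelector:
    katacontainers.io/kata-runtime: "true"  # assumed node label; verify for your install
```

Declaring `overhead` matters at scale: without it, the scheduler bin-packs nodes as if each Kata pod cost nothing beyond its container requests.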

Deployment in Kubernetes

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata-qemu
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-pod
spec:
  runtimeClassName: kata
  containers:
  - name: sensitive-workload
    image: payment-processor:latest
```

Firecracker: Serverless MicroVMs

Firecracker is Amazon's microVM manager, built for Lambda and Fargate. It strips down the VM to the absolute minimum: no BIOS, no PCI bus, no USB, no legacy devices. Just a virtio network, virtio block device, serial console, and a minimal keyboard controller for the reboot command.

Security Model

Same hypervisor-level isolation as Kata, but with a drastically reduced attack surface: Firecracker emulates only a handful of devices (virtio network, virtio block, serial console, minimal keyboard controller) compared to the hundreds of device models QEMU carries.

Each Firecracker microVM:

  • Has its own kernel
  • Runs under a jailer process with seccomp, cgroup, and namespace restrictions
  • Cannot access the host filesystem
  • Has no hardware passthrough

Performance Profile

  • Boot time: <125ms (often under 50ms)
  • Memory overhead: 5MB minimum per microVM
  • Density: 4000+ microVMs on a single host
  • I/O: Near-native with virtio

Limitations

Firecracker is purpose-built for short-lived, stateless workloads. It does not support:

  • GPU passthrough
  • Persistent storage (by design)
  • Live migration
  • Arbitrary device attachment

For Kubernetes integration, projects like Kata Containers with the Firecracker VMM backend bridge the gap.
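With kata-deploy, the Firecracker backend is typically exposed as its own handler (commonly named `kata-fc`; check your installation). The RuntimeClass pattern is the same as for the other runtimes:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc   # Kata with the Firecracker VMM; handler name may vary by install
---
apiVersion: v1
kind: Pod
metadata:
  name: faas-invocation           # hypothetical name
spec:
  runtimeClassName: kata-fc
  containers:
  - name: function
    image: user-function:latest   # placeholder image
```

Note that Firecracker's device restrictions carry over: workloads on this handler cannot use GPU passthrough or arbitrary host devices.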

Comparison Matrix

| Feature | runc | gVisor | Kata (QEMU) | Firecracker |
|---|---|---|---|---|
| Isolation | Kernel namespaces | User-space kernel | Hardware VM | Hardware microVM |
| Escape difficulty | Kernel exploit | Sentry + kernel | Hypervisor escape | Hypervisor escape |
| Boot time | ~50ms | ~100ms | 100-200ms | <125ms |
| Memory overhead | ~0 | 50-100MB | 20-50MB | 5MB |
| Syscall compat | Full | ~70% | Full | Full |
| GPU support | Yes | No | Limited | No |
| Best for | Trusted code | Untrusted code | Multi-tenant | Serverless |

Choosing the Right Runtime

Use runc for internal, first-party workloads where you control the code, patch regularly, and have runtime monitoring.

Use gVisor for workloads that execute untrusted code: CI/CD build jobs, user-uploaded functions, code sandboxes. Accept the syscall compatibility and performance tradeoffs.

Use Kata Containers for multi-tenant platforms where different organizations share a cluster and strong isolation is a business requirement.

Use Firecracker for serverless and function-as-a-service platforms where thousands of short-lived workloads need maximum isolation with minimal overhead.

Most production clusters should run multiple runtimes simultaneously, using RuntimeClass to assign the appropriate runtime per workload based on trust level.
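In practice that means registering one RuntimeClass per trust tier and letting workloads opt in; any pod without a `runtimeClassName` falls back to the node's default runtime (usually runc). A minimal sketch, using the handler names from the examples above:

```yaml
# One RuntimeClass per trust tier. Tier names are illustrative.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sandboxed        # untrusted code (CI jobs, user uploads) -> gVisor
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: isolated         # multi-tenant / sensitive workloads -> Kata
handler: kata-qemu
```

An admission policy can then require the appropriate `runtimeClassName` on pods in untrusted namespaces, so the trust-to-runtime mapping is enforced rather than advisory.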

How Safeguard.sh Helps

Safeguard.sh inventories your container runtimes across every cluster and maps workloads to their isolation boundaries. The platform identifies high-risk workloads running on runc that should be sandboxed with gVisor or Kata, flags clusters with a single runtime for all trust levels, and provides runtime-specific vulnerability tracking. When new container escape CVEs drop, Safeguard.sh immediately identifies which workloads are affected based on their runtime configuration and prioritizes remediation accordingly.
