gVisor does something weird. Instead of relying on the Linux kernel to enforce container isolation, it intercepts every syscall from the workload and services most of them in a user-space kernel written in Go. The workload thinks it is talking to Linux. It is actually talking to a program called Sentry.
Google has been running gVisor in production since at least 2018, initially inside App Engine and Cloud Run and eventually as the basis for GKE Sandbox. The open-source project has been quietly shipping monthly releases for years, and by early 2024 it is stable enough to evaluate on its merits rather than its novelty.
This is a review of the security properties, written for operators who are deciding whether gVisor belongs in their stack.
The Architecture Is the Security Model
A gVisor pod has two key processes. Sentry is the user-space kernel that implements syscalls. Gofer is a separate process that proxies filesystem I/O for Sentry (originally over the 9P protocol; 2023 releases replaced it with the faster LISAFS), so Sentry itself has no direct access to the host filesystem.
When the workload calls read(2), the call is trapped by whichever gVisor platform is in use. Systrap, the default since 2023, installs a seccomp filter that turns each workload syscall into a SIGSYS signal handled in-process; the older ptrace platform stops the workload at every syscall with PTRACE_SYSEMU; and the KVM platform runs the workload in a lightweight virtual machine with Sentry acting as the guest kernel. In every case, Sentry services the read (reaching through Gofer when it needs the host filesystem), returns the result, and the workload continues as if it had actually talked to Linux.
The host kernel sees only a very narrow band of syscalls from Sentry itself, and Sentry is sandboxed with seccomp-bpf to an allowlist of roughly 55 syscalls. Even if a vulnerability in Sentry were exploited, the attacker would find themselves in a heavily restricted userspace program without the ability to make most interesting syscalls.
This is a different security boundary from Kata's. Kata gives you a hypervisor between workload and host. gVisor gives you a second userspace implementation of the kernel API. Both reduce the exposure of the host kernel to workload code, but they do it through different mechanisms with different performance and compatibility profiles.
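You can watch this illusion from the inside. Assuming Docker is already wired up with the runsc runtime (the containerd wiring appears later in this piece), asking the "kernel" to identify itself makes the boundary visible:

```sh
# dmesg inside the sandbox prints Sentry's synthetic boot log,
# not the host kernel's ring buffer.
docker run --rm --runtime=runsc alpine dmesg

# uname reports the fixed kernel version Sentry emulates,
# not the host's actual kernel.
docker run --rm --runtime=runsc alpine uname -r
```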
What Sentry Implements and Does Not
Sentry does not implement every Linux syscall. It supports around 280 of the roughly 380 syscalls that Linux provides, covering the ones that normal applications use. A workload that depends on unusual syscalls — userfaultfd, modify_ldt, most things related to loadable kernel modules, eBPF — will fail.
This is usually fine. Most applications stay well within the supported set. But some specific categories bite: CRIU-based checkpoint-restore tools do not work, certain database engines that use unusual I/O mechanisms may have issues, and anything that depends on ptrace inside the container will behave strangely, because Sentry's own ptrace emulation is only partial (and on the legacy ptrace platform, gVisor itself is already the tracer).
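Distinguishing a missing syscall from an application bug does not have to be guesswork. A minimal sketch, assuming a debug variant of the runtime is registered with the standard runsc flags --debug, --debug-log, and --strace, with logs routed to /var/log/runsc/ (the runtime name runsc-debug and image myapp:latest are placeholders, and the exact log phrasing can differ by version):

```sh
# Reproduce the failure under the debug runtime, then search Sentry's
# logs for syscalls it refused to implement.
docker run --rm --runtime=runsc-debug myapp:latest || true
grep -riE "unsupported|unimplemented" /var/log/runsc/
```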
Compatibility has improved substantially since 2022. The gVisor team added io_uring support incrementally through 2023, seccomp inside containers became usable in late 2023, and filesystem extended attributes were tightened in the January 2024 release. As of early 2024, the list of workloads that do not run cleanly is short and shrinking.
CVEs and the User-Space Kernel Reality
gVisor has had its own CVEs, because implementing a kernel in any language is hard and you will get some of it wrong.
CVE-2023-40175, disclosed in August 2023, was a privilege escalation through a race condition in Sentry's handling of setgroups(2). CVE-2022-41722 was a path traversal in the platform layer. CVE-2021-46939 was a use-after-free in socket handling that required a carefully crafted sequence of syscalls.
These are not unusual bugs for a kernel implementation. What is unusual is their scope: because Sentry runs in userspace, the blast radius of a Sentry compromise is Sentry itself. The attacker has to chain a Sentry bug with a host kernel escape from the seccomp-restricted syscall surface, which is dramatically narrower than what an attacker inside a runc container has available.
In the entire history of gVisor, no CVE has demonstrated a full sandbox escape that bypasses both Sentry and the host seccomp filter. That is a remarkable track record for a production security boundary, and it is the main reason GKE Sandbox is used to isolate untrusted workloads at significant scale.
Performance Through 2023 and Early 2024
gVisor's performance has been its most persistent criticism and its most improved dimension. The Systrap platform, announced in 2023 and default in current releases, significantly reduced the ptrace overhead that plagued earlier versions.
Network throughput with the netstack networking backend is now typically 70-90% of host networking for TCP workloads. The --network=host mode, where gVisor bypasses netstack and uses the host's network stack with seccomp filtering, approaches line rate but gives up some isolation. For most workloads, netstack is the right choice and the cost is acceptable.
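Switching modes is a runtime flag rather than a workload change. A sketch of the containerd wiring, following the ConfigPath mechanism in gVisor's containerd docs (paths are illustrative, and --network=host is a standard runsc flag):

```sh
# Give up netstack on nodes where raw throughput wins; the runsc shim
# reads extra flags from a config file.
cat > /etc/containerd/runsc.toml <<'EOF'
[runsc_config]
  network = "host"
EOF
# /etc/containerd/config.toml must point the runsc shim at that file:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
#     TypeUrl = "io.containerd.runsc.v1.options"
#     ConfigPath = "/etc/containerd/runsc.toml"
systemctl restart containerd
```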
Syscall-heavy workloads still feel the overhead. A benchmark that does nothing but tight loops of getpid() will run 5-10x slower inside gVisor. Realistic workloads that spend most of their time in application code and make syscalls intermittently typically see 5-15% overhead.
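A crude way to feel this on your own hardware, assuming Docker with the runsc runtime (timings include container startup and vary widely by host):

```sh
# bs=1 forces one read()+write() pair per byte, so the loop is nearly
# pure syscall overhead. Compare the same command under both runtimes.
time docker run --rm alpine \
  dd if=/dev/zero of=/dev/null bs=1 count=500000
time docker run --rm --runtime=runsc alpine \
  dd if=/dev/zero of=/dev/null bs=1 count=500000
```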
Startup time is closer to runc than Kata — gVisor pods start in 200-500 ms, only modestly slower than native containers. For serverless workloads this matters.
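The startup gap is easy to measure yourself. A quick comparison using hyperfine, again assuming Docker has the runsc runtime registered:

```sh
# Compare cold-start latency of the same trivial container under
# runc and runsc; hyperfine averages over repeated runs.
hyperfine --warmup 3 \
  'docker run --rm alpine true' \
  'docker run --rm --runtime=runsc alpine true'
```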
Where gVisor Shines
gVisor is particularly strong for workloads that execute untrusted code. Cloud Run's first-generation execution environment runs on gVisor, and several build-as-a-service platforms use gVisor or similar user-space sandboxing for exactly this reason. When a workload's job is to run whatever the customer uploaded, the compatibility compromises are worth the defense-in-depth.
It is also strong for dense multi-tenant environments where you want isolation but cannot afford Kata's per-pod VM overhead. The memory footprint of a gVisor pod is 30-60 MB lower than an equivalent Kata pod, and you can pack more of them onto a node.
Where gVisor struggles is in storage-heavy, syscall-intensive workloads: databases, file servers, anything that does a lot of IPC. For these, Kata's virtualization boundary with virtio-fs typically performs better, and plain runc performs better still if you can accept the weaker isolation.
Operational Realities
gVisor is enabled at the container runtime level. The runsc binary is the gVisor OCI runtime, and you configure containerd or CRI-O to route specific workloads to it via RuntimeClass, which has been GA since Kubernetes 1.20. GKE Sandbox wraps the same machinery: sandboxing is enabled per node pool, and pods opt in with the gvisor RuntimeClass.
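The wiring is small enough to show in full. A sketch for containerd with the shim v2 runtime, mirroring gVisor's published setup steps (paths and the demo pod are illustrative):

```sh
# 1. Register runsc with containerd as a shim v2 runtime.
cat >> /etc/containerd/config.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF
systemctl restart containerd

# 2. Expose it to Kubernetes via a RuntimeClass.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF

# 3. Any pod that should land in the sandbox just names the class.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-demo
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: nginx
EOF
```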
Debugging is harder than with runc. When an application misbehaves, distinguishing "this is a gVisor compatibility issue" from "this is an application bug" requires tooling the project has been adding, notably the runsc debug commands and better tracing integration. Plan for a learning curve.
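The most useful of those commands when a workload wedges is the stack dump. The --root path below matches containerd's default state directory for the k8s.io namespace; adjust it for your cluster, and substitute the real container ID:

```sh
# Ask the Sentry of a running sandbox to dump its goroutine stacks;
# <container-id> is a placeholder for the target container.
runsc --root /run/containerd/runsc/k8s.io debug --stacks <container-id>
```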
The upgrade cadence is fast. gVisor releases monthly and those releases include both features and security fixes. Pinning a version from eighteen months ago is not a supported operating mode in any meaningful sense.
How Safeguard Helps
Safeguard identifies gVisor-sandboxed workloads in cluster scans and adjusts risk scoring to account for the reduced host kernel exposure they provide. We track runsc versions against the monthly release cadence so you can see which nodes have drifted past the current security patch level. For teams running mixed sandboxed and unsandboxed workloads, our policy engine can require the gvisor RuntimeClass for workloads matching specific labels (untrusted-code processors, multi-tenant customer data handlers) and alert when a workload that should have landed in gVisor ended up in runc instead. We also surface CVEs that are exploitable in runc containers but blocked by gVisor's syscall filter, so you can triage patching priorities based on actual exposure rather than raw CVSS scores.