Any LLM agent with a code execution tool is an arbitrary code execution vulnerability waiting for the right prompt. That is not a pejorative; it is the design. The question is not whether the agent will run something you didn't intend. It is where the blast radius stops. This post covers the sandboxing patterns that actually hold up, drawn from what worked and what broke in 2025.
What is the real threat model for agent code execution?
The agent will execute code written by a model steered by content the attacker controls. That content can come through direct user input, indirect injection via retrieved documents, or poisoned tool outputs. The code that gets executed is not necessarily malicious-looking. It might be a curl to an attacker-controlled URL followed by eval, or a subtle data exfiltration through a plausible-looking analytics call. Your sandbox has to assume all of it is hostile by default.
The execution boundary should enforce three properties: code cannot affect anything outside the sandbox filesystem, code cannot reach anything on the network except explicitly allowed destinations, and code cannot persist between sessions unless you opted it in. If your current setup fails any one of these, you do not have a sandbox. You have a polite request.
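A quick way to keep yourself honest is a smoke test run from inside the sandbox that probes all three properties. A minimal sketch, assuming nothing about your stack; the canary paths and the stand-in host are illustrative, not a real convention:

```python
import os
import socket
import tempfile

def check_sandbox_properties() -> dict:
    """Probe the three boundary properties from inside the sandbox.
    Every probe here is expected to FAIL in a correctly configured sandbox."""
    results = {}

    # 1. Writes outside the scratch directory should be rejected.
    try:
        with open("/etc/sandbox_canary", "w") as f:
            f.write("escaped")
        results["filesystem"] = "LEAK: wrote outside scratch dir"
    except OSError:
        results["filesystem"] = "ok: write denied"

    # 2. Arbitrary egress should be blocked. example.com stands in for
    #    any host NOT on your allowlist.
    try:
        socket.create_connection(("example.com", 443), timeout=3)
        results["network"] = "LEAK: reached non-allowlisted host"
    except OSError:
        results["network"] = "ok: egress denied"

    # 3. Nothing from a prior session should be visible. The marker file
    #    is a hypothetical convention for this test only.
    marker = os.path.join(tempfile.gettempdir(), "session_marker")
    results["persistence"] = (
        "LEAK: prior session state present" if os.path.exists(marker)
        else "ok: fresh environment"
    )
    open(marker, "w").close()  # should vanish with the sandbox
    return results

if __name__ == "__main__":
    for prop, status in check_sandbox_properties().items():
        print(f"{prop}: {status}")
```

Run it as part of sandbox provisioning, not just once: a regression in any of the three properties should fail the deploy, not surface in an incident review.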
Which isolation technology is strong enough?
MicroVMs are the default answer for anything that touches the public internet or runs untrusted code per-session. Firecracker, Cloud Hypervisor, and the Kata Containers wrapper around them give you real hardware-level isolation with sub-second startup. Per-session microVM means the agent's execution environment is a fresh, throwaway kernel each time, and there is no persistent state for an attacker to pivot through.
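A minimal sketch of the per-session launch, assuming a prebuilt guest kernel and a per-session copy of the rootfs at the paths shown (both placeholders for whatever your image pipeline produces); Firecracker accepts a full VM definition as JSON via `--config-file`:

```python
import json
import subprocess
import tempfile
import uuid

# Kernel and rootfs paths are assumptions; substitute your own artifacts.
VM_CONFIG = {
    "boot-source": {
        "kernel_image_path": "/var/lib/sandbox/vmlinux",
        "boot_args": "console=ttyS0 reboot=k panic=1",
    },
    "drives": [{
        "drive_id": "rootfs",
        # A per-session throwaway copy of the base image: writes die with the VM.
        "path_on_host": "/var/lib/sandbox/rootfs.ext4",
        "is_root_device": True,
        "is_read_only": False,
    }],
    "machine-config": {"vcpu_count": 1, "mem_size_mib": 512},
}

def run_session_vm() -> subprocess.Popen:
    """Boot a fresh, disposable microVM for one agent session."""
    session_id = uuid.uuid4().hex
    with tempfile.NamedTemporaryFile(
        "w", suffix=f"-{session_id}.json", delete=False
    ) as f:
        json.dump(VM_CONFIG, f)
        config_path = f.name
    # One API socket per session; the VM dies with this process.
    return subprocess.Popen([
        "firecracker",
        "--api-sock", f"/run/fc-{session_id}.sock",
        "--config-file", config_path,
    ])
```

The important property is in the lifecycle, not the flags: a new VM, a new socket, and a new rootfs copy per session, all deleted when the session ends.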
For internal deployments where the code is somewhat more trusted (say, a code assistant running in a CI context with some provenance), rootless containers with a strong seccomp profile and user namespace remapping are usually sufficient. gVisor is a reasonable middle ground: it intercepts syscalls in userspace and gives you many of the microVM guarantees at lighter overhead. Plain Docker with default settings is not a sandbox, and any guide that tells you otherwise is out of date.
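For the gVisor path, a sketch of what a single run looks like, assuming runsc is already registered as a Docker runtime and using a placeholder image name:

```python
import subprocess

def run_in_gvisor(code_path: str) -> subprocess.CompletedProcess:
    """Run one script under gVisor's userspace kernel.
    sandbox-python:latest is a placeholder for your hardened base image."""
    return subprocess.run([
        "docker", "run", "--rm",
        "--runtime=runsc",                      # gVisor syscall interception
        "--security-opt", "no-new-privileges",  # no setuid escalation
        "--network=none",                       # attach a proxy explicitly if needed
        "-v", f"{code_path}:/code/main.py:ro",  # only the one file, read-only
        "sandbox-python:latest",
        "python", "/code/main.py",
    ], capture_output=True, text=True, timeout=300)
```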
What does not work, regardless of how carefully you configure it, is running agent-executed code as a subprocess of the agent itself with os.exec. This is how a number of early frameworks shipped and how a number of embarrassing compromises happened. The isolation has to be at least a container; ideally a VM.
How should the filesystem be scoped?
Read-only everywhere except an explicit scratch directory, with the workspace bind-mounted only when the agent genuinely needs write access to the user's code. Most agent code execution is ephemeral: run a test, check the output, discard. For that, a fresh tmpfs mounted at a known path (say /workspace) is the right scope. The agent cannot write anywhere else, so even if it generates code that tries to touch /etc or the user's home directory, the write fails.
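Expressed as container flags, the ephemeral scope is two options; a sketch, again with a placeholder image name:

```python
import subprocess

# Root filesystem is read-only; /workspace is a fresh tmpfs that
# vanishes with the container.
FS_FLAGS = [
    "--read-only",
    "--tmpfs", "/workspace:rw,size=512m",
]

def run_ephemeral(command: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["docker", "run", "--rm", *FS_FLAGS,
         "--workdir", "/workspace",
         "sandbox-python:latest", *command],
        capture_output=True, text=True, timeout=300,
    )

# A write outside /workspace fails with a read-only filesystem error:
result = run_ephemeral(["sh", "-c", "echo x > /etc/canary"])
assert result.returncode != 0
```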
When the agent does need to modify the user's code (because it is, for instance, a coding assistant applying a patch), the mount should be limited to the specific repo directory and should exclude any subdirectories that hold secrets or build outputs. Mounting the entire home directory so the agent can "pick the right file" is the pattern that got several teams compromised by errant shell commands in 2025.
The other rule: no mounting of cloud credential files. Ever. ~/.aws, ~/.kube, ~/.docker/config.json, and their equivalents should not appear inside the sandbox. If the agent needs cloud access, it gets it through a scoped, short-lived token minted by a gateway, not through inherited credentials on disk.
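Both rules can be enforced with a gatekeeping check before any bind mount is handed to the sandbox. A sketch; the denylist entries are illustrative and should be extended for your environment:

```python
from pathlib import Path

# Credential files that must never appear inside the sandbox.
FORBIDDEN = [".aws", ".kube", ".docker/config.json", ".ssh",
             ".gnupg", ".netrc", ".npmrc", ".pypirc"]

def validate_mount(repo_root: Path, requested: Path) -> Path:
    """Allow a bind mount only if it sits inside the approved repo
    directory and contains no credential files."""
    resolved = requested.resolve()  # collapse symlinks and ../ tricks
    if not resolved.is_relative_to(repo_root.resolve()):
        raise PermissionError(f"{requested} is outside the approved repo")
    for pattern in FORBIDDEN:
        if any(resolved.rglob(pattern)):
            raise PermissionError(
                f"{requested} contains {pattern}; refusing to mount")
    return resolved
```

Resolving the path before checking it matters: a symlink inside the repo pointing at the home directory would otherwise defeat the containment check.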
What about network egress?
Default-deny, with an explicit allowlist of destinations the agent needs. This is the control that most teams skip because it is inconvenient, and it is the control that matters most when something goes wrong. An agent sandbox that can reach the open internet is one crafted prompt away from being a data exfiltration channel. That crafted prompt can come from a document the model retrieved; it does not need to come from your user.
The allowlist that works in practice is: your own internal APIs through an authenticated proxy, a package registry mirror you control, and a short list of known-good public endpoints (PyPI, npm, the Python docs). Everything else gets dropped at the network boundary. When the agent needs to fetch something new, that becomes a ticket, not a silent allowance.
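At the proxy, that allowlist is an exact-match set, not a suffix match. A sketch of the check; the internal hostnames are placeholders for your own services:

```python
from urllib.parse import urlparse

# Destinations the agent may reach; everything else is dropped.
EGRESS_ALLOWLIST = {
    "pypi.org", "files.pythonhosted.org",    # PyPI
    "registry.npmjs.org",                    # npm
    "docs.python.org",                       # Python docs
    "packages.internal.example.com",         # your registry mirror (placeholder)
    "api.internal.example.com",              # internal APIs via authed proxy (placeholder)
}

def egress_allowed(url: str) -> bool:
    """Exact-match host check. Suffix matching would let
    pypi.org.evil.example through."""
    host = urlparse(url).hostname or ""
    return host.lower() in EGRESS_ALLOWLIST

assert egress_allowed("https://pypi.org/simple/requests/")
assert not egress_allowed("https://exfil.attacker.example/upload")
```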
DNS is the channel most teams overlook. If your egress filter works at L3/L4 but DNS is unrestricted, an attacker can exfiltrate data through DNS queries to an attacker-controlled authoritative server. Block DNS except to a resolver you control, and log the queries.
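One way to express that rule is nftables applied inside the sandbox's network namespace. A sketch, assuming an internal resolver at an example address; the ruleset logs allowed queries so exfiltration attempts leave a trail:

```python
import subprocess

RESOLVER = "10.0.0.53"  # example address for your internal resolver

# Port 853 covers DNS-over-TLS; DNS-over-HTTPS rides on 443 and is
# handled by the general egress allowlist, not this ruleset.
NFT_RULESET = f"""
table inet agent_dns {{
    chain output {{
        type filter hook output priority 0; policy accept;
        udp dport 53 ip daddr {RESOLVER} log prefix "dns-query " accept
        tcp dport 53 ip daddr {RESOLVER} log prefix "dns-query " accept
        udp dport {{ 53, 853 }} drop
        tcp dport {{ 53, 853 }} drop
    }}
}}
"""

subprocess.run(["nft", "-f", "-"], input=NFT_RULESET, text=True, check=True)
```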
How do we handle persistent state?
Opt-in, scoped, and immutable by default. Persistent state is where a compromise turns into a campaign. If an attacker can drop a file in a location the agent will read next time, they have a foothold that survives session boundaries. The default should be that sessions leave nothing behind. When you do need persistence (caches, vector stores, learned preferences), the stored data is schema-validated on read, not trusted blindly.
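"Schema-validated on read" means malformed state gets discarded, not repaired. A minimal sketch with a hypothetical shape for persisted preferences:

```python
import json
from pathlib import Path

# Hypothetical shape for persisted agent preferences.
EXPECTED = {"version": int, "preferences": dict}

def load_persisted_state(path: Path) -> dict | None:
    """Read persisted state; distrust it until it validates."""
    try:
        raw = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(raw, dict):
        return None
    for key, expected_type in EXPECTED.items():
        if not isinstance(raw.get(key), expected_type):
            return None  # unknown or malformed state: start fresh
    if raw["version"] != 1:
        return None      # only versions this code understands
    return raw
```

Returning None and starting fresh costs a cache rebuild; trusting a file an attacker may have planted costs a persistent foothold.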
The same applies to package caches. A poisoned PyPI package installed during one session should not silently persist into the next. Either rebuild the sandbox base image per session, or maintain the package cache outside the sandbox and mount it read-only. Ultralytics-style compromises showed that a single bad install can backdoor everything a process does, and caching those installs across sessions is an amplifier.
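Hash-pinned installs close the other half of this gap: even if a registry entry is replaced, the artifact fails verification. A sketch using pip's own enforcement; the mirror URL is a placeholder:

```python
import subprocess

def install_pinned(requirements: str) -> None:
    """Install only hash-pinned packages; pip rejects any artifact
    whose hash does not match the lockfile."""
    subprocess.run(
        ["pip", "install",
         "--require-hashes",   # every requirement must carry a hash
         "--no-deps",          # no silent transitive pulls
         "--index-url", "https://packages.internal.example.com/simple",
         "-r", requirements],
        check=True,
    )
```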
What does a reasonable default look like?
Firecracker or Kata per session, rootless mount of the specific working directory, tmpfs for scratch, network egress through an authenticated proxy with an allowlist, no inherited credentials, and a 15-minute wall-clock timeout that kills the VM regardless of state. Log everything the agent did to an append-only sink outside the sandbox. Rebuild the base image weekly, with the current vulnerability scan feeding into the build, so you are not running a two-month-old glibc.
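That posture fits in one reviewable object, plus the one enforcement routine that must never be optional. A sketch; the field names are a hypothetical convention, and the values mirror the prose above:

```python
import subprocess
from dataclasses import dataclass

@dataclass(frozen=True)
class SandboxPolicy:
    """Default posture for agent code execution."""
    isolation: str = "firecracker"          # or "kata"
    workspace_mount: str = "repo-dir-only"  # never $HOME
    scratch: str = "tmpfs:/workspace"
    egress: str = "authenticated-proxy-allowlist"
    inherited_credentials: bool = False
    wall_clock_timeout_s: int = 900         # 15 minutes, no exceptions
    log_sink: str = "append-only-external"

def enforce_timeout(vm: subprocess.Popen, policy: SandboxPolicy) -> None:
    """Kill the VM at the wall-clock limit regardless of its state."""
    try:
        vm.wait(timeout=policy.wall_clock_timeout_s)
    except subprocess.TimeoutExpired:
        vm.kill()   # no graceful shutdown: the VM is disposable
        vm.wait()
```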
Most of this is the same hardening you would apply to any untrusted code execution service. The novelty in 2026 is that the untrusted code is written by a model rather than submitted by a user, which people sometimes use to argue for softer controls. The blast radius does not care where the code came from.
What about developer-facing sandboxes versus production sandboxes?
The controls should be the same. A common failure pattern in 2025 was two-tier sandboxing: a loose sandbox for developer-facing agents (because "they're just helping me code") and a tight sandbox for production (because "that one handles customer data"). The problem is that the developer-facing sandbox has access to source code, credentials, and internal systems that are strictly more sensitive than what the production sandbox touches. Treating the developer sandbox as lower risk inverts the actual threat model.
If anything, developer sandboxes warrant tighter controls because the content surface is richer (repos, docs, tickets, logs, local files) and the credential surface is larger. The "it's just my laptop" argument is the same argument that gave us years of dev-laptop-origin supply chain compromises, including several high-profile npm incidents in 2023 and 2024 that started with a developer running untrusted scripts locally. The Ultralytics PyPI compromise spread in part because developers treated their Python environment as low risk. Do not repeat that.
How Safeguard.sh Helps
Safeguard.sh applies reachability analysis to the base images that back agent sandboxes, removing 60 to 80 percent of the CVE noise that would otherwise block weekly rebuilds. Griffin AI flags risky patterns in the package installations an agent performs at runtime, so a PyTorch-nightly-style compromise or a repeat of the Ultralytics PyPI incident surfaces in hours rather than weeks. SBOM generation extends to each rebuilt sandbox image, with dependency resolution 100 levels deep so that transitive compromises in the Python or Node trees stay visible. Container self-healing automates the weekly rebuild pipeline so that your sandbox base images ship at the current patch level without developer intervention.