AI Security

Open-Weight Model Sandboxing Patterns

Running an open-weight model inside an enterprise perimeter seems safer than calling a hosted API. It is, and it isn't. Here are the sandboxing patterns that actually produce the safety properties.

Nayan Dey
Senior Security Engineer
6 min read

The migration from hosted frontier APIs to self-hosted open-weight models accelerated through 2025 for a predictable mix of reasons: cost, data residency, latency, and the belief that "it's inside our perimeter so it's safer." The first three are often defensible on the specifics; the fourth is only true if you actually build the safety properties you are crediting yourself with. Hosting open weights inside your perimeter is a necessary but not sufficient condition for safety. The sandboxing patterns that produce the properties people assume they are buying are specific, and they do not come automatically with the deployment choice. This post lays out the patterns we see work in customer deployments and the pieces people commonly skip.

Why isn't "it's inside the perimeter" the answer by itself?

Because the threat model doesn't match that framing. Four problems the perimeter alone doesn't solve:

  • Prompt injection is a content-plane attack, not a network-plane attack. It crosses your perimeter in the normal flow of requests.
  • Tool-call misuse happens inside the perimeter, where the model has the most privilege.
  • Model weight integrity isn't established by the fact that weights live on your hardware; it's established by verified signatures and reproducible builds of the inference stack.
  • Output handling bugs manifest regardless of where the model is hosted.

The perimeter helps with a narrow set of concerns (data egress to a vendor, vendor availability dependencies) but leaves the AI-specific risks intact.

What is the target sandboxing posture?

A reasonable posture for a self-hosted open-weight deployment has three layers:

Weight layer. The model weights are from a verified source (checksum matched against a published reference, ideally signed by the publisher) and are stored in a tamper-evident location.

Inference layer. The inference server (vLLM, TGI, Ollama, or custom) runs in a container with a minimal privilege profile — no persistent disk write outside expected paths, no outbound network except to specific allowed endpoints, no access to host-level secrets.

Tool layer. Any tools the model can invoke run in a separate process/container with their own least-privilege profile. The tool layer sees the model's tool-call request and enforces authorization before executing.

The key property: compromise of any single layer should not compromise the others. Weight integrity failure should not grant arbitrary execution; tool misuse should not grant access to weights; inference compromise should not grant tool invocation without the tool layer's checks.

How do you verify weight integrity?

Three concrete steps:

  • Hash verification on download. Use the publisher's checksums (if available) or your organization's mirror with verified checksums. Never trust a model downloaded by a convenience script without checksum verification; a minimal verification sketch follows this list.
  • Signature verification where available. Hugging Face and other model hubs are adding signing; use it where present. Otherwise, treat the download as the trust boundary and re-sign internally after verification.
  • Reproducible fine-tuning metadata. If you fine-tune, keep the base model checksum, the training data manifest, and the fine-tuning hyperparameters as attestable artifacts. You should be able to answer "what's in this weight file?" without guessing.
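
As a concrete illustration of the first and third points, here is a minimal Python sketch. The file paths, the expected checksum, and the manifest fields are hypothetical stand-ins; substitute the publisher's published checksum and whatever your internal registry uses.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-gigabyte weight shards never need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical values: the publisher's checksum and your local download path.
weight_path = Path("/models/example-model/model-00001-of-00002.safetensors")
published_sha256 = "aa1f0000"  # placeholder; copy from the publisher's release page or your mirror

actual = sha256_of(weight_path)
if actual != published_sha256:
    raise RuntimeError(f"checksum mismatch for {weight_path}: got {actual}")

# Fine-tuning provenance: keep the answer to "what's in this weight file?"
# as a small attestable artifact stored next to the weights.
manifest = {
    "base_model_sha256": published_sha256,
    "training_data_manifest": "s3://internal-bucket/finetune-v3/manifest.json",  # hypothetical
    "hyperparameters": {"lr": 2e-5, "epochs": 3, "seed": 1337},
}
Path("/models/example-model/provenance.json").write_text(json.dumps(manifest, indent=2))
```

In practice the provenance record would be signed with your internal key after verification, which is the "re-sign internally" step from the second bullet.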

What does inference server hardening look like?

Five-item checklist:

  1. Container image. Minimal base image, no unnecessary tools, reproducible build.
  2. Network policy. Outbound allowed only to a narrow set of endpoints (e.g., internal telemetry). No internet egress by default.
  3. Filesystem posture. Model weights read-only mounted; no writable persistent paths except approved log destinations.
  4. GPU isolation. If GPUs are shared, isolation between tenants via MIG or equivalent. If not shared, dedicated GPUs with exclusive access.
  5. Runtime observation. Process-level monitoring for unexpected syscalls, outbound connections, and filesystem writes; a minimal monitoring sketch follows this checklist.
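
A sketch of the runtime observation item, assuming psutil is available in a monitoring sidecar. The allowlist and the inference process name are hypothetical; real deployments would lean on eBPF or the container runtime's audit hooks, so this only shows the shape of the check.

```python
import time

import psutil  # assumed available in the monitoring sidecar, not in the inference image

# Hypothetical allowlist mirroring the network policy: internal telemetry only.
ALLOWED_REMOTES = {("10.0.12.5", 4317)}  # e.g., an internal OTLP collector

def unexpected_connections(proc: psutil.Process):
    """Return established outbound connections that fall outside the allowlist."""
    flagged = []
    for conn in proc.connections(kind="inet"):
        if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
            remote = (conn.raddr.ip, conn.raddr.port)
            if remote not in ALLOWED_REMOTES:
                flagged.append(remote)
    return flagged

# Poll the inference server process (name is hypothetical) and alert on any
# drift from the expected network posture.
inference = next(p for p in psutil.process_iter(["name"]) if p.info["name"] == "vllm")
while True:
    for ip, port in unexpected_connections(inference):
        print(f"ALERT: unexpected outbound connection to {ip}:{port}")
    time.sleep(30)
```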

How do you sandbox tool invocation?

This is where most teams underinvest. The model should not have direct access to real APIs, shell commands, or databases. A well-sandboxed tool layer does the following (a minimal sketch of the flow appears below):

  • Receives the model's tool-call request as structured JSON with declared name and arguments.
  • Looks up the tool definition and validates arguments against the schema.
  • Evaluates authorization — is this session/user allowed to invoke this tool with these arguments?
  • Executes the tool with a tightly-scoped identity (not the user's full credentials).
  • Logs the invocation and returns the result to the model.

Tools that have irreversible side effects (sending email, modifying production data, financial transactions) should require out-of-band confirmation — ideally on a different channel than the chat interface.
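
A minimal sketch of that flow in Python. The tool registry, the role check, and the confirmation hook are hypothetical stand-ins for whatever your stack provides; the point is the shape: validate, authorize, confirm if irreversible, execute with a scoped identity, log.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("tool-audit")

@dataclass
class ToolDef:
    name: str
    required_args: set[str]          # stand-in for a real JSON Schema check
    irreversible: bool               # e.g., sends email, moves money
    allowed_roles: set[str]
    run: Callable[[dict], str]       # executes under a scoped service identity, not the user's

# Hypothetical registry with a single read-only tool.
REGISTRY = {
    "lookup_order": ToolDef(
        name="lookup_order",
        required_args={"order_id"},
        irreversible=False,
        allowed_roles={"support_agent"},
        run=lambda args: json.dumps({"order_id": args["order_id"], "status": "shipped"}),
    ),
}

def handle_tool_call(raw: str, session_role: str, confirm: Callable[[str], bool]) -> str:
    call = json.loads(raw)                      # model output: {"name": ..., "arguments": {...}}
    tool = REGISTRY.get(call["name"])
    if tool is None:
        raise PermissionError(f"unknown tool {call['name']!r}")
    args = call.get("arguments", {})
    if not tool.required_args <= args.keys():   # schema validation (simplified)
        raise ValueError(f"missing arguments for {tool.name}")
    if session_role not in tool.allowed_roles:  # authorization
        raise PermissionError(f"role {session_role!r} may not invoke {tool.name}")
    if tool.irreversible and not confirm(tool.name):  # out-of-band confirmation hook
        raise PermissionError(f"{tool.name} was not confirmed")
    result = tool.run(args)                     # runs with the tool's scoped identity
    log.info("tool=%s role=%s args=%s ok=1", tool.name, session_role, args)
    return result
```

In production the run callables would hit real services through per-tool service accounts, and confirm would route to a channel the model cannot write to, matching the out-of-band requirement above.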

What about retrieval and context contamination?

If the deployment includes retrieval (RAG), the index is a supply chain component and needs its own governance:

  • Ingest policy. What sources can be indexed? Is the source list verified? Can users add sources ad-hoc, and if so, what review happens?
  • Content sanitization. Stripping or flagging content that looks like prompt injection. Imperfect but not zero-value; a heuristic sketch follows this list.
  • Source attribution. Every retrieval result carries a source ID; every model output that cites retrieval carries the citation. When something goes wrong, you can trace it.
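
A heuristic sketch of the sanitization and attribution points. The regex patterns and the source ID scheme are illustrative only; real sanitizers combine heuristics with classifiers and still miss things, which is exactly why attribution matters when they do. Here flagged chunks are simply dropped; quarantining them for review is an equally valid choice.

```python
import re
from dataclasses import dataclass

# Crude, illustrative injection heuristics; not a complete defense.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]

@dataclass
class RetrievedChunk:
    source_id: str    # e.g., "confluence:SPACE/page-123" -- hypothetical ID scheme
    text: str
    flagged: bool = False

def sanitize(chunks: list[RetrievedChunk]) -> list[RetrievedChunk]:
    """Flag chunks whose content matches known injection phrasings."""
    for chunk in chunks:
        chunk.flagged = any(p.search(chunk.text) for p in INJECTION_PATTERNS)
    return chunks

def build_context(chunks: list[RetrievedChunk]) -> str:
    """Every chunk carries its source ID into the prompt so citations can be traced back."""
    return "\n\n".join(
        f"[source: {c.source_id}]\n{c.text}" for c in chunks if not c.flagged
    )
```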

Retrieval is where most self-hosted deployments encounter their first prompt injection incident. Governance here pays off faster than any other single investment.

What observability is essential?

Three signal types:

Request traces. Full context of every LLM request — prompt, retrieved context with source IDs, model version, tool calls issued, tool results, final output. Hash sensitive fields if retention policy requires.
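
As a sketch of what one such record might carry, with sensitive fields hashed rather than stored verbatim. The field names and hashing choice are illustrative, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

def fingerprint(text: str) -> str:
    """Store a hash instead of raw text where retention policy forbids the latter."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class LLMTrace:
    request_id: str
    model_version: str
    prompt_hash: str
    retrieval_source_ids: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)  # name, argument hash, result status
    output_hash: str = ""

trace = LLMTrace(
    request_id="req-0001",                    # hypothetical
    model_version="example-model-v3",         # hypothetical
    prompt_hash=fingerprint("user prompt goes here"),
    retrieval_source_ids=["confluence:SPACE/page-123"],
    output_hash=fingerprint("model output goes here"),
)
print(json.dumps(asdict(trace)))
```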

Tool invocation audit. Every tool call with full arguments and result status. This is equivalent to command audit on a server; treat it with the same rigor.

Drift signals. Output distribution over time. A shift in the response length distribution after a weights update or a dependency update is worth investigating.
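
One cheap drift signal, as a sketch: compare the response length distribution from a baseline traffic window against the current window. The two-sample Kolmogorov-Smirnov test is just one reasonable choice, and the sample values and threshold below are illustrative; scipy is assumed to be available in the monitoring environment.

```python
from scipy.stats import ks_2samp  # assumed available where monitoring runs

def length_drift(baseline_lengths: list[int], current_lengths: list[int],
                 p_threshold: float = 0.01) -> bool:
    """Flag a shift in response length distribution between two traffic windows."""
    statistic, p_value = ks_2samp(baseline_lengths, current_lengths)
    return p_value < p_threshold

# Illustrative usage: token counts from last week's traces vs. today's,
# pulled from wherever the request traces land.
baseline = [412, 388, 455, 391, 430, 402, 377, 441]
current = [610, 655, 598, 702, 640, 588, 677, 621]
if length_drift(baseline, current):
    print("response length distribution shifted; check the last weights or dependency change")
```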

What does the end-to-end posture actually look like in production?

The posture that works has these properties:

  • Model weights are verified, signed internally, and mounted read-only.
  • Inference runs in a minimal container with locked-down network.
  • Tools are invoked through a separate authorization layer with scoped identities.
  • Retrieval index has governed ingestion and source attribution.
  • Traces and tool invocation audits flow to a separate security-owned destination.
  • Drift and regression monitoring runs continuously against a golden eval set.

That is not a small project. It is more like "running a production database" than "deploying a container," and the teams that succeed with open-weight deployments resource it that way.

How Safeguard Helps

Safeguard extends its supply chain security controls to open-weight model deployments: the SBOM module tracks model weights, fine-tuning data, and the inference stack as supply chain components with their own provenance. Policy gates enforce the sandboxing posture described above — verified weight integrity, locked-down inference containers, scoped tool invocation identities. Griffin AI runs drift monitoring and eval regression against the deployed models, flagging behavior changes before they reach production traffic. For organizations running open-weight models inside their perimeter, Safeguard provides the governance layer that makes "it's inside the perimeter" actually mean what the team thinks it means.
