AI Security

vLLM CVE-2025-62164: Tensor Deserialization RCE

vLLM 0.10.2-0.11.0 deserialized user-supplied PyTorch tensors via torch.load() in the Completions API. Memory corruption, potential RCE.

Shadab Khan
Security Engineer
5 min read

In November 2025, the vLLM maintainers disclosed CVE-2025-62164, a critical memory-corruption vulnerability (CVSS 8.8) affecting vLLM versions 0.10.2 through 0.11.0. The root cause is one of the oldest hazards in the Python ML ecosystem: unsafe deserialization via torch.load(). vLLM's Completions API accepted user-supplied PyTorch tensors as the value of a prompt_embeds request parameter and deserialized them with torch.load() without validation. An attacker who could reach the inference server could trigger memory corruption and, in researchers' demonstrations, achieve remote code execution under the vLLM process. The vulnerability is one of at least six 2025 CVEs against vLLM that together signal the maturity gap between inference-server security and conventional API-server security.

What vLLM accepts and why it's exposed

vLLM (pip install vllm) is one of the highest-throughput open-source inference servers for transformer LLMs, built on PagedAttention. Its OpenAI-compatible HTTP server exposes the Completions, Chat Completions, and Embeddings endpoints. In the 0.10 series (the affected range begins at 0.10.2), the Completions endpoint gained an optional prompt_embeds parameter so callers could supply precomputed embeddings instead of token IDs, a feature useful for soft-prompt research and certain agent topologies. The server accepts the parameter as a base64-encoded serialized PyTorch tensor, decodes it, and passes the result directly to torch.load(). Inside torch.load(), the payload is run through Python's pickle machinery, and a pickle stream can carry arbitrary __reduce__ payloads. The result is the same general pattern that plagued ML model loading for years, transplanted onto a network-facing API surface.
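
To make the failure mode concrete, here is a minimal sketch of the vulnerable shape, not vLLM's actual implementation: a base64 payload taken from the request body is decoded and handed straight to torch.load(). The function and variable names are illustrative only.

import base64
import io

import torch

def load_prompt_embeds(encoded: str) -> torch.Tensor:
    # Illustrative sketch of the vulnerable pattern, not vLLM's code: every
    # byte that reaches torch.load() is attacker-controlled, so the
    # deserializer runs against a fully untrusted stream.
    raw = base64.b64decode(encoded)
    return torch.load(io.BytesIO(raw))

# A safer posture is to reject pickle-backed payloads outright and accept only
# a constrained format (dtype, shape, and a flat buffer the server rebuilds
# itself), with explicit bounds checks on shape and dtype.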

Reachability and blast radius

In a typical vLLM deployment, the inference server is reachable from the application layer over a private network. If your model gateway is properly segmented, the attacker has to first compromise the application before reaching vLLM. But vLLM is also commonly deployed in self-hosted research environments without robust ingress controls — academic clusters, on-prem POCs, internet-exposed development boxes. ZeroPath researchers ran a quick scan after the disclosure and found a non-trivial number of vLLM servers exposed on the public internet. For those deployments, CVE-2025-62164 is a one-shot RCE: send a malicious tensor, get code execution under whatever account vLLM runs as.

The full 2025 vLLM CVE list

CVE-2025-62164 is part of a cluster of high-severity findings against vLLM in 2025:

CVE-ID            Description                                Severity
-------------------------------------------------------------------
CVE-2025-62164    torch.load on prompt_embeds (memory corr)  Critical
CVE-2025-66448    auto_map config-driven RCE                 Critical
CVE-2025-59425    API key timing-attack auth bypass          High
CVE-2025-48956    HTTP header memory exhaustion DoS          High
CVE-2025-62426    chat_template_kwargs DoS                   High
CVE-2025-32444    Earlier pre-0.10 pickle issue              High

Six 2025 CVEs is more than most equivalent OSS components see in a year. The structural reason is that vLLM's surface area combines (a) high-performance Python with C++ extensions, (b) a deserialization-heavy model-loading path, (c) an HTTP server that has accreted features quickly, and (d) a configuration system that pulls arbitrary code from model repos via auto_map and trust_remote_code=True. Each of those four layers contributed to one or more of the 2025 findings.

Patches and what 0.11.1 changes

vLLM 0.11.0 introduced an early-stage validator on the prompt_embeds payload, but the validation was incomplete and CVE-2025-62164 still applied. The full fix landed in 0.11.1, which (1) deprecated direct user-supplied tensor deserialization on the Completions endpoint, (2) added a server-side flag (--allow-prompt-embeds) that defaults to off, gating the feature behind explicit opt-in, and (3) where the feature is enabled, switched the deserialization path to a constrained loader that rejects pickle payloads. CVE-2025-66448, the auto_map RCE, was addressed in the same release by tightening which model configurations are accepted when trust_remote_code is not explicitly set. CVE-2025-59425 was fixed in 0.11.0rc2 by switching API key comparison to a constant-time function.
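
For a quick programmatic check that a given host has moved past the vulnerable range, something like the sketch below works; the 0.11.1 floor comes from the fix release described above, and the helper name and use of the packaging library are our own choices.

from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

MIN_PATCHED = Version("0.11.1")   # first release with the complete fix

def vllm_is_patched() -> bool:
    # True only if vLLM is installed and at or above the patched release.
    try:
        installed = Version(version("vllm"))
    except PackageNotFoundError:
        return False
    return installed >= MIN_PATCHED

if __name__ == "__main__":
    print("vLLM patched for CVE-2025-62164:", vllm_is_patched())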

Hardening a vLLM deployment

If you operate vLLM at any scale, the post-2025 baseline configuration looks like this:

# Minimum vLLM deployment hardening
vllm:
  version: ">=0.11.1"
  serve:
    flags:
      - "--api-key-from-file /run/secrets/vllm-key"
      - "--allowed-origins https://api.internal.corp"
      - "--max-headers-size 8192"
      - "--max-num-batched-tokens 32768"
  features:
    allow_prompt_embeds: false      # CVE-2025-62164
    trust_remote_code: false        # CVE-2025-66448 class
    custom_chat_template: signed-only
  network:
    bind: 127.0.0.1
    egress_via_proxy: required
    no_public_ingress: true
  observability:
    audit_log: /var/log/vllm/audit.jsonl
    structured_logs: true

The most important single change for most operators is rebinding vLLM to localhost and putting an authenticated reverse proxy in front. The second is disabling trust_remote_code for any production deployment; if you must run a model that requires it (some Nemotron, some Phi variants), pin the exact model revision and review the custom code as you would any third-party Python dependency.
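
A complementary check from the network side is to confirm that an instance does not answer unauthenticated requests at all. The sketch below probes the OpenAI-compatible /v1/models route with no API key and treats any 200 response as an exposure; the host list and helper name are placeholders, not part of vLLM.

import urllib.error
import urllib.request

def answers_without_auth(base_url: str, timeout: float = 5.0) -> bool:
    # Probe the OpenAI-compatible model listing route with no credentials.
    # A 200 response means the server is serving unauthenticated callers.
    req = urllib.request.Request(f"{base_url}/v1/models")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    # Placeholder endpoint: substitute your own inference hosts.
    for host in ("http://10.0.0.12:8000",):
        print(host, "answers without auth:", answers_without_auth(host))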

Why this is structurally a supply-chain problem

vLLM is downstream of PyTorch, which is downstream of CPython, which is downstream of the operating system libraries. A vulnerability that exposes torch.load() to user input is a vulnerability whose root cause lives in pickle semantics but whose blast radius lives in the inference server. From a vendor-management standpoint, the implication is that "this is the model server we run" carries transitive risk that the standard CVE feeds for an OS distribution will not surface. You need an SBOM for the inference plane that explicitly tracks vLLM, the PyTorch version it bundles, and the model artifacts it loads. When CVE-2025-62164 hits, that SBOM is what tells you which environments are exposed and which have already moved past 0.11.0.
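
As a sketch of what that inference-plane SBOM check can look like in practice, the snippet below walks a CycloneDX-style JSON SBOM and flags any vLLM component below the patched floor. The field names follow the CycloneDX component schema; the policy floor, file path, and function name are our own assumptions.

import json

from packaging.version import InvalidVersion, Version

MIN_PATCHED = Version("0.11.1")   # policy floor, per the 0.11.1 fix release

def flag_vllm_components(sbom_path: str) -> list[dict]:
    # Return every vLLM component in a CycloneDX JSON SBOM that sits below
    # the policy floor, plus anything with an unparseable version string.
    with open(sbom_path) as fh:
        sbom = json.load(fh)
    flagged = []
    for component in sbom.get("components", []):
        if component.get("name", "").lower() != "vllm":
            continue
        try:
            if Version(component.get("version", "0")) < MIN_PATCHED:
                flagged.append(component)
        except InvalidVersion:
            flagged.append(component)
    return flagged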

How Safeguard Helps

Safeguard ingests inference-server SBOMs (vLLM, TensorRT-LLM, TGI, Ollama) and cross-references them against the CISA KEV catalog, GitHub Security Advisories, and the OpenSSF AI/ML working group's emerging advisory feed. When CVE-2025-62164 or CVE-2025-66448 lands, every product that ships a vLLM-backed inference plane is surfaced in the AIBOM along with its specific version. Policy gates block deployments that include vLLM below the minimum patched release, that enable trust_remote_code, or that accept prompt_embeds without an explicit policy waiver. Griffin AI performs reachability analysis on the inference plane, telling you which vLLM instances are localhost-bound versus exposed to broader networks. VEX statements from the vLLM maintainers are ingested automatically to suppress non-exploitable findings, giving incident responders a clean, prioritized view during fast-moving disclosure cycles.
