In November 2025, Oligo Security disclosed ShadowMQ — a class of vulnerabilities affecting over 30 critical surfaces across the major AI inference engines. The bugs share a common origin: insecure ZeroMQ (ZMQ) sockets combined with Python's pickle deserialization, copied between projects, sometimes line-for-line. Affected systems include vLLM, NVIDIA TensorRT-LLM, Meta Llama LLM (the meta-llama/llama-cookbook serving code), Modular Max Server, Microsoft Sarathi-Serve, and SGLang. The root vulnerability was originally patched in Meta's llama repository as CVE-2024-50050, but the same dangerous pattern had already been propagated by code reuse into a half-dozen other projects. This post analyzes the pattern, the affected versions, and the defender response.
What is the vulnerable pattern?
A common architectural choice in modern inference engines is to use ZeroMQ for inter-process communication: a request-router process distributes work to GPU worker processes via ZMQ sockets, and worker processes return outputs the same way. ZeroMQ is a generic transport — it does not specify a serialization format — so the projects layered Python's pickle on top because pickle is convenient and supports arbitrary Python objects. The vulnerable pattern is: a ZMQ socket bound to a network interface (sometimes 0.0.0.0 by default) accepts incoming messages and deserializes them with pickle.loads() without authentication. Anyone who can reach the socket can send a crafted pickle payload and get arbitrary code execution in the inference process.
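The deserialization step alone is enough to demonstrate the problem. Pickle's __reduce__ hook lets a payload name any callable for the unpickler to invoke, so whoever produces the bytes chooses code that runs inside the consumer. A minimal, benign sketch of that mechanism (no ZMQ transport, and eval standing in for what a real attacker would replace with os.system or similar):

```python
import pickle

class Payload:
    """Attacker-controlled object. __reduce__ tells the unpickler which
    callable to invoke during deserialization; a benign eval stands in
    here for what would be os.system or similar in a real exploit."""
    def __reduce__(self):
        return (eval, ("21 * 2",))

# What travels over the wire: ordinary-looking pickle bytes.
wire_bytes = pickle.dumps(Payload())

# The vulnerable server-side step: pickle.loads() on bytes received from
# an unauthenticated socket. The sender's chosen callable runs here.
result = pickle.loads(wire_bytes)
print(result)  # 42: code chosen by the sender executed inside this process
```

In the vulnerable frameworks, the bytes arriving at pickle.loads() come straight off the ZMQ socket, so the party constructing wire_bytes is whoever can reach the port.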
Why did this pattern propagate?
The honest explanation is that AI inference engines move fast, share architecture, and copy code. Meta published a serving implementation, vLLM borrowed pieces, NVIDIA TensorRT-LLM borrowed pieces, SGLang borrowed pieces, and so on. None of the projects had a security review process strict enough to catch a pickle-over-network pattern. Meta initially patched the issue in their llama repository as CVE-2024-50050 by replacing pickle with a safe JSON-based serialization, but the propagated copies did not receive the fix until Oligo's coordinated disclosure forced each project to patch independently. ShadowMQ is the kind of vulnerability you only find when one researcher reads the source code of multiple competing projects and notices the same dangerous block appearing in each.
Which versions are affected?
Per Oligo's disclosure and the ensuing advisories: Meta Llama LLM serving code in all versions prior to 0.0.41; vLLM in versions 0.5.2 through 0.8.5.post1 and, more broadly, all versions prior to 0.10.0; NVIDIA TensorRT-LLM in all versions prior to 0.18.2; Modular Max Server in all versions prior to 25.6 when the --experimental-enable-kvcache-agent flag is used; and Microsoft Sarathi-Serve and SGLang in all released versions as of December 2025, both unpatched at disclosure. The unpatched status of Sarathi-Serve and SGLang at disclosure time is the most concerning operational fact: defenders running those frameworks needed mitigations before vendor patches arrived.
What is the exploitation pattern?
The straightforward exploit is to scan for exposed ZMQ ports on inference clusters, send a crafted pickle payload, and gain RCE in the inference process. Privileges are typically high: the process has GPU access, model weights in memory, potentially API keys for downstream services, and network reach to other infrastructure. The Oligo write-up describes a proof-of-concept that achieves not just RCE but the ability to install a GPU-based cryptominer, exfiltrate model weights, or modify outputs on the fly. The last capability is particularly dangerous in agentic contexts: an attacker who can modify inference outputs can redirect tool calls, exfiltrate context-window contents, or inject jailbreaks into otherwise trusted model responses.
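Exposed endpoints are also easy to fingerprint from the defender side: ZMQ's wire protocol, ZMTP, opens every connection with a fixed 10-byte signature (0xFF, eight padding bytes, 0x7F), sent before any application data. A hypothetical banner classifier for sweeping your own clusters; the function name and sample banners are illustrative, the signature comes from the ZMTP specification:

```python
def looks_like_zmtp(banner: bytes) -> bool:
    """Heuristic: ZMTP greetings begin with a 10-byte signature of
    0xFF, eight ignored padding bytes, then 0x7F. Reading the first
    bytes a service sends after TCP connect is enough to flag it."""
    return len(banner) >= 10 and banner[0] == 0xFF and banner[9] == 0x7F

# A ZMTP 3.x peer opens with this shape; an HTTP server does not.
zmq_banner = b"\xff\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x03\x00"
http_banner = b"HTTP/1.1 400 Bad Request\r\n"

print(looks_like_zmtp(zmq_banner), looks_like_zmtp(http_banner))
```

Pointing a check like this at every listening port in an inference namespace turns "do we have exposed ZMQ sockets?" from an architecture question into a measurement.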
How do we mitigate before all patches are available?
For frameworks with patches available (Meta Llama, vLLM, TensorRT-LLM, Modular Max), upgrade. For frameworks without patches (Sarathi-Serve, SGLang as of disclosure), the mitigations are network-level:
```yaml
# Network policy for AI inference cluster — block external reach to ZMQ ports
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-zmq-restrict
  namespace: ai-inference
spec:
  podSelector:
    matchLabels:
      tier: inference-worker
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              tier: inference-router
              app: trusted-router
      ports:
        - protocol: TCP
          port: 5555  # ZMQ socket — only from trusted router
        - protocol: TCP
          port: 5556
  egress:
    - to:
        - podSelector:
            matchLabels:
              tier: model-storage
      ports: [{ protocol: TCP, port: 443 }]
# NOTE: no general internet egress — exploitation requires C2 channel
```
The network-policy approach gives you defense-in-depth even when a patch is unavailable: ShadowMQ requires reaching the ZMQ port, and a Kubernetes NetworkPolicy that constrains ingress to the router pod removes the attack surface entirely. Pair this with seccomp profiles that deny socket() calls from inference workers (they should only ever inherit pre-bound sockets from the router) and you reduce blast radius even if a patched framework develops a new ZMQ-class bug in the future.
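The seccomp half of that pairing can be expressed as an OCI-style profile. A sketch, assuming the workers genuinely never create sockets themselves (the inherit-from-router design described above); verify that assumption against your framework in staging before enforcing, since this also blocks DNS lookups and any other outbound connection attempt from the worker:

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

SCMP_ACT_ERRNO makes the denial surface as an EPERM error rather than killing the process, which keeps diagnosis simple. In Kubernetes, reference the profile from the worker pod's securityContext.seccompProfile with type: Localhost.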
What is the broader supply-chain lesson?
ShadowMQ is the AI-inference analog of Log4Shell: a single insecure pattern propagated through copy-and-paste reuse across an ecosystem, then disclosed by a single researcher with the right cross-project context. The defender response cannot be "patch this CVE and move on" — the response has to be "audit the propagation network." If your ML platform team uses any project derived from Meta's llama reference code, you should assume there is at least one more ShadowMQ-class bug waiting to be found, and your hardening should treat the inference layer as untrusted regardless of which framework you run. An AIBOM (AI bill of materials) at this level — recording the inference framework, its version, and its ancestry — is the inventory you need.
How should we audit our inference stack for similar bugs?
The right post-disclosure audit is a code-review pass against three patterns in any inference framework you operate or have forked. First, any pickle.loads(), cloudpickle.loads(), or dill.loads() call on data received over a network socket — that is the literal ShadowMQ pattern. Second, any ZeroMQ socket bound to a non-loopback interface without authentication, regardless of what is being deserialized on the other side. Third, any framework-internal RPC that uses pickle or a pickle-compatible serializer as its wire format. Most defender teams will not have the bandwidth to audit every dependency, but you can require the audit for any inference framework you build on directly, and you can use Safeguard or similar tooling to surface third-party dependencies that match these patterns. The deeper architectural change underway across inference frameworks in 2026 is a migration from pickle-based IPC to MessagePack, JSON, or Apache Arrow — formats that do not have arbitrary-code-execution deserialization gadgets. Track that migration in your dependency upgrade plans.
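The first of those three patterns is mechanically searchable. A minimal sketch using only the standard-library ast module; the function name and the sample snippet are illustrative, not part of any vendor tooling, and a human still has to confirm the flagged bytes actually arrive off a socket:

```python
import ast

# Deserializers whose loads() executes arbitrary code on untrusted input.
UNSAFE_LOADS = {"pickle", "cloudpickle", "dill"}

def find_unsafe_loads(source: str) -> list[int]:
    """Return line numbers of pickle-family loads() calls in Python source.
    Flags calls of the form <module>.loads(...) where <module> is one of
    the UNSAFE_LOADS names."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "loads"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in UNSAFE_LOADS):
            hits.append(node.lineno)
    return hits

sample = """import pickle, json
msg = sock.recv()
obj = pickle.loads(msg)       # ShadowMQ pattern: flag this
cfg = json.loads('{"a": 1}')  # safe serializer: not flagged
"""
print(find_unsafe_loads(sample))  # [3]
```

Running a pass like this over a vendored inference framework shrinks the manual review to the handful of call sites that matter.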
What about the Sarathi-Serve and SGLang gap?
For frameworks that remained unpatched at disclosure (Sarathi-Serve, SGLang), the practical defender response was twofold: aggressive network isolation as described above, and replacement candidacy. SGLang is a popular research-grade serving framework but its lack of immediate patch response is a procurement signal. If you depend on SGLang in production, build a migration plan to vLLM 0.11.1+ or TensorRT-LLM 0.18.2+. Sarathi-Serve is primarily a Microsoft research artifact; production usage was always rare, and the right answer is to switch.
How Safeguard Helps
Safeguard tracks all AI inference frameworks (vLLM, TensorRT-LLM, SGLang, Sarathi-Serve, Modular Max, llama.cpp, Meta Llama serving) in your AIBOM with version pinning and continuous CVE matching. The ShadowMQ disclosure was flagged in customer tenants within hours of the Oligo write-up. Griffin AI maps your inference framework usage to the affected version ranges and produces a prioritized upgrade plan, and where patches are not yet available, generates Kubernetes NetworkPolicy and seccomp templates to mitigate via network and syscall restriction. Policy gates block any inference deployment that exposes ZMQ ports to broader cluster ingress than necessary, and the deployment-level egress allowlist catches exploitation attempts that try to reach external command-and-control. The result: a cross-project CVE family becomes a coordinated remediation effort rather than a list of disconnected patches.