Critical remote code execution (RCE) flaws have surfaced in popular AI inference servers, exposing systems from giants like Meta, NVIDIA, and Microsoft to severe attacks.
These vulnerabilities stem from unsafe use of ZeroMQ (ZMQ) for communication and Python’s pickle module for deserialization, allowing attackers to run malicious code remotely.
Discovered by Oligo Security’s team, the issues highlight how code copying in the fast-paced AI world spreads risks across projects.
As AI adoption surges, securing these back-end engines is crucial to prevent data breaches or cryptomining on GPU clusters.
The problems affect frameworks that perform AI model inference, the back-end work of running user queries against models on powerful servers.
Over the past year, Oligo has uncovered identical bugs in Meta’s Llama Stack, NVIDIA’s TensorRT-LLM, Microsoft’s Sarathi-Serve, and open-source tools such as vLLM and SGLang.
At the core is ZMQ’s recv_pyobj() function, which receives network data and deserializes it using pickle.loads().

Pickle is convenient for Python objects but dangerous: it can execute arbitrary code embedded in the data, like spawning processes or stealing secrets.
When used over unauthenticated TCP sockets, it turns a simple message into an RCE gateway.
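A harmless stand-in makes the mechanism concrete. An object's `__reduce__` method tells pickle which callable to invoke at load time, and `pickle.loads()` invokes it without question; the class name below is invented for illustration, and a real attacker would substitute `os.system` or similar for `operator.add`:

```python
import operator
import pickle

class NotWhatItSeems:
    # pickle serializes this object as "call operator.add(40, 2)".
    # Any callable reachable by import can be smuggled the same way.
    def __reduce__(self):
        return (operator.add, (40, 2))

payload = pickle.dumps(NotWhatItSeems())
result = pickle.loads(payload)  # the embedded call runs on load
print(result)  # → 42
```

The receiver never asked for code to run; deserializing the bytes was enough.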
## The Discovery: Unsafe Deserialization in ZMQ
Oligo’s investigation began in 2024 with Meta’s Llama Stack. Researchers spotted unauthenticated ZMQ sockets pulling in untrusted data and feeding it straight to pickle.
The code snippet looked like this:
```python
def recv_pyobj(self, flags: int = 0) -> Any:
    msg = self.recv(flags)
    # pickle.loads runs directly on raw network bytes — untrusted input
    return self._deserialize(msg, pickle.loads)
```
This setup invites attacks: an attacker sends a pickled payload over the network, and the server blindly executes it.
Oligo reported it as CVE-2024-50050, prompting Meta to switch to secure JSON serialization by October 2024.
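The shape of that fix, receiving inert JSON data instead of live Python objects, can be sketched as follows; the helper name is illustrative, not Meta's actual code:

```python
import json

def safe_deserialize(msg: bytes):
    # json.loads reconstructs only plain data (dicts, lists, strings,
    # numbers, booleans), so a crafted message cannot execute code on
    # load the way a pickle payload can.
    return json.loads(msg.decode("utf-8"))

# Round trip: the receiver gets inert data, never live objects.
wire = json.dumps({"prompt": "hello", "max_tokens": 16}).encode("utf-8")
request = safe_deserialize(wire)
```

A malformed or malicious message fails with a parse error instead of running attacker code.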
Digging further revealed the flaw wasn’t unique. Scanning other tools showed near-identical code in NVIDIA’s TensorRT-LLM, vLLM, SGLang, and Modular’s Max Server.
## The Spread, Impact, and Responses
Code reuse amplified the danger. SGLang, used by xAI, AMD, Google Cloud, AWS, and universities like Stanford, copied vLLM’s flaws directly.
Unpatched, a single vulnerable node in a GPU cluster could let attackers escalate privileges, exfiltrate model data, or install miners like those in the ShadowRay campaign.
Disclosures varied, but most vendors patched quickly. Microsoft's Sarathi-Serve and SGLang, despite alerts, still lack complete fixes and remain vulnerable.
| CVE ID | Affected Framework | Disclosure Date | Severity | CVSS Score | Description | Fixed Version | Status |
|---|---|---|---|---|---|---|---|
| CVE-2024-50050 | Meta Llama Stack | October 2024 | Critical | 9.8 | Unsafe pickle deserialization over unauthenticated ZMQ sockets enabling RCE. | ≥ v0.0.41 | Patched |
| CVE-2025-30165 | vLLM | May 2025 | High | 8.0 | Pickle deserialization in multi-node V0 engine allowing RCE via ZMQ. | ≥ v0.8.0 (V1 default) | Mitigated |
| CVE-2025-23254 | NVIDIA TensorRT-LLM | May 2025 | Critical | 9.3 | Data validation issue in Python executor with ZMQ pickle use for RCE. | ≥ v0.18.2 | Patched |
| CVE-2025-60455 | Modular Max Server | June 2025 | Critical | 9.8 | Inherited unsafe ZMQ pickle deserialization from vLLM/SGLang enabling RCE. | ≥ v25.6 | Patched |
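For codebases that cannot drop pickle outright, the Python documentation describes a "restricting globals" pattern that works as defense in depth. The sketch below forbids every global lookup, which blocks the usual import-a-callable RCE vector while still loading plain data:

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse every global lookup, per the 'restricting globals'
    pattern in the Python pickle docs. Payloads that try to import
    a callable (the usual RCE vector) fail before anything runs."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data still loads; anything referencing a global does not.
safe = restricted_loads(pickle.dumps({"tokens": [1, 2, 3]}))
```

This is a mitigation, not a substitute for authenticated transports or a data-only format like JSON, but it cheaply neutralizes the exact payload class these CVEs describe.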