Thursday, April 16, 2026

Major RCE Vulnerabilities In AI Inference Engines Put Meta, Nvidia, and Microsoft Frameworks At Risk

Critical remote code execution (RCE) flaws have surfaced in popular AI inference servers, exposing systems from giants like Meta, NVIDIA, and Microsoft to severe attacks.

These vulnerabilities stem from unsafe use of ZeroMQ (ZMQ) for communication and Python’s pickle module for deserialization, allowing attackers to run malicious code remotely.

Discovered by Oligo Security’s team, the issues highlight how code copying in the fast-paced AI world spreads risks across projects.

As AI adoption surges, securing these back-end engines is crucial to prevent data breaches or cryptomining on GPU clusters.

The problems affect frameworks that handle AI model inference, processing user queries on powerful GPU servers.

Over the past year, Oligo has uncovered identical bugs in Meta’s Llama Stack, NVIDIA’s TensorRT-LLM, Microsoft’s Sarathi-Serve, and open-source tools such as vLLM and SGLang.

At the core is ZMQ’s recv_pyobj() function, which receives network data and deserializes it using pickle.loads().

RCE Vulnerabilities Expose AI Frameworks

Pickle is convenient for Python objects but dangerous: it can execute arbitrary code embedded in the data, like spawning processes or stealing secrets.

When used over unauthenticated TCP sockets, it turns a simple message into an RCE gateway.
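The danger described above comes from pickle's `__reduce__` hook, which lets any object instruct the unpickler to call an arbitrary callable during deserialization. A minimal, harmless sketch of the mechanism (using `eval` as a stand-in for a real attack payload):

```python
import pickle

# Minimal sketch of the pickle RCE primitive. Any class can define
# __reduce__, which tells the unpickler to invoke an arbitrary
# callable with attacker-chosen arguments during deserialization.
class Payload:
    def __reduce__(self):
        # Harmless stand-in; a real exploit would use something like
        # os.system with a shell command instead of eval.
        return (eval, ("1 + 1",))

wire_bytes = pickle.dumps(Payload())

# The "server" side: merely deserializing the message runs eval()
# before any application code ever inspects the object.
result = pickle.loads(wire_bytes)
print(result)  # 2 -- proof that code executed inside loads()
```

This is why an unauthenticated socket feeding bytes to `pickle.loads()` is a direct RCE gateway: execution happens inside deserialization itself, before any validation can run.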

The Discovery: Unsafe Deserialization In ZMQ

Oligo’s investigation began in 2024 with Meta’s Llama Stack. Researchers spotted unauthenticated ZMQ sockets pulling in untrusted data and feeding it straight to pickle.

The code snippet looked like this:

def recv_pyobj(self, flags: int = 0) -> Any:
    msg = self.recv(flags)                       # raw bytes from the socket
    return self._deserialize(msg, pickle.loads)  # unpickles untrusted data

This setup invites attacks: an attacker sends a pickled payload over the network, and the server blindly executes it.

Oligo reported it as CVE-2024-50050, prompting Meta to switch to secure JSON serialization by October 2024.
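A sketch of the safer pattern Meta moved toward (illustrative, not Meta's actual patch code): swapping `pickle.loads` for `json.loads` confines messages to plain data types, with no hook comparable to pickle's `__reduce__`.

```python
import json
from typing import Any

# Illustrative sketch of JSON-based deserialization (the function
# name recv_json is hypothetical, not from any framework's API).
def recv_json(raw: bytes) -> Any:
    # json.loads yields only dicts, lists, strings, numbers, bools,
    # and None -- deserialization cannot trigger code execution.
    return json.loads(raw.decode("utf-8"))

# A well-formed request deserializes as inert data:
print(recv_json(b'{"prompt": "hello", "max_tokens": 16}'))
```

The trade-off is that JSON cannot round-trip arbitrary Python objects, so message schemas must be defined explicitly, which is exactly what makes the channel auditable.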

Digging further revealed the flaw wasn’t unique. Scanning other tools showed near-identical code in NVIDIA’s TensorRT-LLM, vLLM, SGLang, and Modular’s Max Server.

The Spread, Impact, and Responses

Code reuse amplified the danger. SGLang, used by xAI, AMD, Google Cloud, AWS, and universities like Stanford, copied vLLM’s flaws directly.

Unpatched, a single vulnerable node in a GPU cluster could let attackers escalate privileges, exfiltrate model data, or install miners like those in the ShadowRay campaign.

Disclosures varied, with most vendors patching quickly. Microsoft's Sarathi-Serve and SGLang remain vulnerable despite alerts and still lack complete fixes.

| CVE ID | Affected Framework | Disclosure Date | Severity | CVSS Score | Description | Fixed Version | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CVE-2024-50050 | Meta Llama Stack | October 2024 | Critical | 9.8 | Unsafe pickle deserialization over unauthenticated ZMQ sockets enabling RCE. | ≥ v0.0.41 | Patched |
| CVE-2025-30165 | vLLM | May 2025 | Critical | 8.0 | Pickle deserialization in multi-node V0 engine allowing RCE via ZMQ. | ≥ v0.8.0 (V1 default) | Mitigated |
| CVE-2025-23254 | NVIDIA TensorRT-LLM | May 2025 | Critical | 9.3 | Data validation issue in Python executor with ZMQ pickle use for RCE. | ≥ v0.18.2 | Patched |
| CVE-2025-60455 | Modular Max Server | June 2025 | Critical | 9.8 | Inherited unsafe ZMQ pickle deserialization from vLLM/SGLang enabling RCE. | ≥ v25.6 | Patched |
Varshini
Varshini is a Cyber Security expert in Threat Analysis, Vulnerability Assessment, and Research. Passionate about staying ahead of emerging threats and technologies.
