Friday, April 24, 2026

Hackers Exploit Indirect Prompts In Claude AI APIs To Exfiltrate User Data

Anthropic’s Claude AI recently gained the ability to perform network requests through its Code Interpreter feature, aimed at enhancing functionality like package installations.

However, this innovation opens doors to serious security threats. Security researcher Johann Rehberger, in a detailed blog post titled “Claude Pirate,” demonstrates how attackers can exploit this capability for unauthorized data theft.

The core issue lies in the default “Package managers only” network egress setting, which allow-lists a short domain list including api.anthropic.com.

While intended for safe operations like accessing npm or PyPI, it inadvertently permits interactions with Anthropic’s APIs.

The attack begins with indirect prompt injection, where malicious content embedded in documents or user inputs tricks Claude into executing harmful instructions.

Unlike common hyperlink-based leaks, this method uses Claude’s built-in Files API.

An adversary crafts a payload that instructs the AI to access sensitive user data, such as past chat histories via the new “memories” feature.

This data gets saved to a file within the Code Interpreter’s sandbox, typically at a path like /mnt/user-data/outputs/hello.md.

Executing The Exfiltration Chain

From there, the exploit escalates by running Python code that imports the Anthropic library and sets the attacker’s API key as an environment variable.

The script then uploads the file directly to the attacker’s Anthropic Console using client.beta.files.upload().

This bypasses the victim’s account entirely, as the upload authenticates with the intruder’s credentials. Rehberger notes that files up to 30MB can be exfiltrated per upload, with multiples possible for larger hauls.

Initial tests succeeded immediately, but Claude’s safeguards later flagged suspicious elements like plaintext API keys.

The researcher evaded detection by padding the code with benign snippets, such as simple print statements, making the payload appear innocuous.

A demo video and screenshots illustrate the process: the attacker views an empty console, the victim processes a tainted document, Claude hijacks the session to extract and upload data, and the file magically appears in the attacker’s account for easy access.

This “AI kill chain” underscores how AI agents can be weaponized for remote command-and-control.

Disclosure, Risks, And Safeguards

Rehberger responsibly disclosed the flaw to Anthropic on October 25, 2025, via HackerOne.

Initially deemed out-of-scope as a “model safety issue,” Anthropic later acknowledged it as a valid vulnerability on October 30, citing process improvements.

The company already warns in its documentation about exfiltration risks from connected sources, urging users to monitor sessions and halt unexpected actions.

For mitigation, Anthropic could enforce sandbox rules limiting API calls to the authenticated user’s account.

Users should disable network access, whitelist only essential domains, or vigilantly oversee Code Interpreter runs especially since the default mode proves insecure.

Broader implications tie into the “lethal trifecta” of AI risks: capable models, external connectivity, and untrusted inputs.

As AI tools evolve, such exploits remind developers and enterprises to prioritize adversarial testing. Staying vigilant protects against turning helpful assistants into unwitting spies.

Varshini
Varshini
Varshini is a Cyber Security expert in Threat Analysis, Vulnerability Assessment, and Research. Passionate about staying ahead of emerging Threats and Technologies..

Recent News

Recent News