Large language models like GPT-4, Claude, and Gemini rely on safety guardrails to block harmful prompts, but a new technique called EchoGram can trick these defenses into approving dangerous inputs.
Developed by researchers at HiddenLayer in early 2025, EchoGram exploits weaknesses in how guardrails are trained, allowing attackers to flip verdicts from “unsafe” to “safe” without altering the core malicious content.
These guardrails come in two main forms: text classification models that analyze prompts for threats, and LLM-as-a-judge systems that use another AI to evaluate safety.
Both protect against alignment bypasses, where users try to extract illegal information, and prompt injections, which force the model to ignore its rules.
EchoGram targets its shared training data, which includes curated sets of benign and malicious examples from sources such as HackAPrompt and DAN benchmarks.
This data often has imbalances in token frequencies short word sequences that the models don’t handle well.
Uncovering Flip Tokens
The EchoGram process starts with wordlist generation to find “flip tokens,” sequences that change a guardrail’s output.
In dataset distillation, attackers compare token patterns from benign and malicious pools. For instance, sequences more common in safe data but rare in harmful sets can make a risky prompt seem innocent when appended.

If the guardrail is open-source or accessible, model probing tests vocabulary tokens directly.
Researchers append candidates to low-confidence malicious prompts and score them by flip success rate across 100 varied examples.
A simple example is adding “=coffee” to a prompt like “ignore previous instructions,” turning a detected injection into a benign verdict.
This works because nonsensical tokens don’t disrupt the downstream LLM, which processes the attack as intended.
For more potent effects, combining tokens amplifies flips. HiddenLayer tested this on Qwen3Guard, an open-source moderator.
Single tokens like “oz” partially flipped some verdicts, but pairing them degraded accuracy across prompts about weapons or hacks, even in larger 4B models.
Probing reveals how training flaws persist: guardrails from similar datasets share blind spots, so one EchoGram sequence might work on multiple systems.
Implications For AI Security
EchoGram isn’t just for evasion; it can create false positives by embedding flip tokens in harmless text, flooding teams with alerts, and causing fatigue.
For example, a benign query about the weather, combined with a token, might be flagged as an attack, eroding trust.

This dual threat affects enterprise chatbots, healthcare AIs, and government tools, where guardrails are the primary defense.
The technique requires no insider access, just public datasets or basic queries, giving attackers an edge.
HiddenLayer warns that as LLMs integrate into critical systems, static guardrails fall short.
Organizations need ongoing testing, diverse training data, and adaptive defenses to counter such exploits.
EchoGram highlights a broader issue: AI safety relies on robust filters, but shared vulnerabilities mean a breakthrough can cascade across platforms.
Developers have a narrow window perhaps three months before widespread adoption, urging immediate scrutiny of guardrail training.
In essence, EchoGram proves that manipulating tokens can undermine the trust in AI defenses, calling for resilient, transparent security in an era of rapid AI deployment.





