In-Context Representation Hijacking

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies Doublespeak, a novel context-based representation hijacking vulnerability in large language models (LLMs). The attack exploits in-context learning: by injecting benign paraphrased examples—e.g., substituting “carrot” for “bomb”—into the prompt prefix, adversaries induce layerwise semantic overwriting in the model’s internal representation space, causing harmless tokens to be remapped onto harmful semantics and thereby evading safety alignment mechanisms. Crucially, Doublespeak requires no gradient computation, fine-tuning, or model access—only prompt engineering via few-shot demonstration and prefix injection—and exhibits strong cross-model transferability. This study is the first to systematically uncover and empirically validate the interpretable, hierarchical propagation of semantic hijacking within the representational space. On Llama-3.3-70B-Instruct, a single-sentence context achieves a 74% attack success rate; efficacy extends across both open- and closed-weight models, fundamentally challenging the assumption of semantic stability underpinning current safety alignment approaches.

Technology Category

Application Category

📝 Abstract
We introduce extbf{Doublespeak}, a simple emph{in-context representation hijacking} attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., extit{bomb}) with a benign token (e.g., extit{carrot}) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
Problem

Research questions and friction points this paper is trying to address.

Doublespeak hijacks LLM representations via keyword substitution
It bypasses safety alignment by embedding harmful semantics in benign tokens
The attack reveals insufficient alignment strategies at the representation level
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replace harmful keywords with benign tokens in context
Hijack internal representations to embed harmful semantics
Bypass safety alignment through semantic overwrite in layers
🔎 Similar Papers
No similar papers found.