Obfuscated Activations Bypass LLM Latent-Space Defenses

📅 2024-12-12
🏛️ arXiv.org
🤖 AI Summary
Current mainstream latent-space monitoring defenses, such as sparse autoencoders, representation probing, and out-of-distribution (OOD) detection, can be systematically bypassed by large language model (LLM) jailbreaking attacks. Method: The authors design and empirically validate a class of hidden-state obfuscation attacks: using adversarial optimization, these attacks perturb latent representations away from the patterns detectors look for while preserving functional behavior (e.g., maintaining a ~90% jailbreak success rate). Contribution/Results: Experiments demonstrate that the attack reduces the recall of diverse detectors from 100% to 0%; on a more complex task (SQL generation), obfuscation remains possible but degrades model performance. Crucially, this work shows that harmful behavior and its characteristic activation signatures can be decoupled in latent space, exposing an inherent robustness limitation of activation-pattern-based defense paradigms.

📝 Abstract
Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
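The attack pattern the abstract describes can be sketched as a toy optimization. The sketch below assumes a simple sigmoid linear probe over a single activation vector; the probe, the L2 closeness penalty (a crude stand-in for preserving behavior), and all constants are illustrative, not the paper's actual objective:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # activation dimension (illustrative)

# Hypothetical linear harmfulness probe: score = sigmoid(w . h + b)
w = rng.normal(size=d)
w /= np.linalg.norm(w)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def probe(h):
    return sigmoid(w @ h + b)

# A "harmful" activation that the probe flags with high confidence.
h = 3.0 * w + 0.1 * rng.normal(size=d)

# Obfuscation as gradient descent on the probe score, with an L2
# penalty that keeps the perturbed activation close to the original
# (a rough proxy for keeping the model's behavior intact).
delta = np.zeros(d)
lr, lam = 0.5, 1e-3
for _ in range(200):
    s = probe(h + delta)
    grad = s * (1.0 - s) * w + 2.0 * lam * delta  # d(loss)/d(delta)
    delta -= lr * grad

print(f"probe score before: {probe(h):.3f}")          # flagged
print(f"probe score after:  {probe(h + delta):.3f}")  # evades the probe
```

The point of the sketch is that a small, targeted perturbation suffices to drive the probe's score down, because the detector reads only a low-dimensional projection of the activation.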
Problem

Research questions and friction points this paper is trying to address.

Detecting harmful LLM activations via latent-space defenses
Effectiveness of obfuscated activations in bypassing defenses
Impact of obfuscation on model performance and security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Obfuscated activations bypass LLM defenses
Sparse autoencoders vulnerable to obfuscation
Neural activations malleable, preserving network behavior
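The recall collapse (100% to 0%) while behavior is preserved can be illustrated with a toy geometric picture: if the probe reads one direction of the activation and the model's downstream behavior reads an orthogonal one, projecting out the probe direction blinds the detector without touching the behavior signal. All directions and thresholds below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 32, 200

# Hypothetical probe direction and an orthogonal "behavior" direction
# the model actually reads out from.
probe_dir = rng.normal(size=d)
probe_dir /= np.linalg.norm(probe_dir)
behavior_dir = rng.normal(size=d)
behavior_dir -= (behavior_dir @ probe_dir) * probe_dir  # orthogonalize
behavior_dir /= np.linalg.norm(behavior_dir)

# Harmful activations carry signal along BOTH directions.
harmful = 2.0 * probe_dir + 2.0 * behavior_dir + 0.3 * rng.normal(size=(n, d))

def recall(acts, threshold=0.5):
    """Fraction of activations the probe flags as harmful."""
    return float(np.mean(acts @ probe_dir > threshold))

# Obfuscation: remove only the probe-aligned component, leaving the
# behavior-aligned component (and thus the jailbreak) intact.
obfuscated = harmful - np.outer(harmful @ probe_dir, probe_dir)

print("recall before:", recall(harmful))
print("recall after: ", recall(obfuscated))
print("behavior signal kept:",
      np.allclose(harmful @ behavior_dir, obfuscated @ behavior_dir))
```

In the real attack the "behavior" constraint is enforced through the model's outputs rather than a fixed direction, but the geometry is the same: the detector's view of the latent space is much smaller than the space itself.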