Latent-space Attacks for Refusal Evasion in Language Models

Date: 2026-05-20
Citations: 0
Influential: 0

AI Summary
Refusal behavior in safety-aligned language models can be suppressed by steering internal representations, but existing refusal-ablation methods lack a principled account of the latent-space transformation they induce. This work addresses that gap by formulating refusal suppression as a latent-space evasion attack against linear probes. Under this view, ablating the refusal direction projects representations onto the probe's decision boundary, which is a minimum-confidence evasion attack. The proposed Controlled Latent-space Evasion attack instead projects representations past the boundary, into the compliant region, with an optimized confidence. It achieves state-of-the-art attack success rates across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
๐Ÿ“ Abstract
Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
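The geometry described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a linear probe of the form f(h) = w·h + b, where w is the unit difference-in-means direction and a positive score means "refused". The function names, the bias b (set to 0 here), the margin kappa, and the synthetic activations are all hypothetical choices made for illustration. With b = 0 and kappa = 0 the update coincides with standard directional ablation; with kappa > 0 the representation lands past the boundary on the answered side. The abstract says the confidence is optimized but does not say how, so kappa is simply fixed by hand below.

```python
import numpy as np

def diff_in_means_direction(refused, answered):
    """Unit vector pointing from the answered-class mean to the refused-class mean."""
    w = refused.mean(axis=0) - answered.mean(axis=0)
    return w / np.linalg.norm(w)

def evade(h, w, b=0.0, kappa=0.0):
    """Shift h along -w so the probe score w.h + b becomes exactly -kappa.

    kappa = 0 lands on the decision boundary (minimum-confidence evasion);
    kappa > 0 lands past it, on the side the probe labels as answered.
    Assumes w has unit norm.
    """
    score = h @ w + b
    return h - np.asarray(score + kappa)[..., None] * w

# Synthetic stand-ins for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 3.0
refused = rng.normal(size=(64, d)) + offset
answered = rng.normal(size=(64, d)) - offset

w = diff_in_means_direction(refused, answered)
on_boundary = evade(refused, w)               # probe score becomes 0
past_boundary = evade(refused, w, kappa=2.0)  # probe score becomes -2

# Directional ablation written the usual way, for comparison with kappa = 0.
ablated = refused - np.outer(refused @ w, w)
```

In this toy setting, `on_boundary` and `ablated` are the same array, which is the sense in which ablation acts as a projection onto the boundary of a zero-bias probe. `past_boundary` differs from it only by a fixed step of size kappa along -w.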
Problem

Research questions and friction points this paper is trying to address.

latent-space attacks
refusal evasion
language models
safety alignment
evasion attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent-space evasion
refusal suppression
linear probe
decision boundary
controlled projection