Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Aligned large language models remain vulnerable to jailbreak attacks, yet the structural mechanisms behind this fragility are not well understood. This work introduces the Refusal-Escape Direction (RED): a local perturbation direction around a harmful input that shifts the model's response from refusal to answering while the model still interprets the input as harmful. Combining theoretical analysis with experiments across multiple models and attack methods, the study proves that RED decomposes exactly into contributions from operator-level sources and identifies three of them as analytically constrained: normalization, residual wiring, and terminal sources. Eliminating RED requires the self-attention and MLP modules to cancel these contributions while preserving the mechanisms behind benign responses, which gives rise to a conditional safety-utility trade-off. Empirically, added token dimensions can expose RED, and the refusal-to-answer shifts of successful jailbreaks are largely aligned with terminal-source contributions.

📝 Abstract

Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a harmful input along RED. We then prove that RED can be exactly decomposed into contributions from operator-level sources across the model's operator structure, and identify normalization, residual-wiring, and terminal sources as analytically constrained operator-level sources. To eliminate RED, the shared expressive modules -- self-attention and MLP -- must eliminate the contributions from these analytically constrained sources while preserving the mechanisms that support benign responses. These competing requirements give rise to a conditional safety-utility trade-off. Experiments across multiple models and attack methods empirically analyze RED from two complementary perspectives and show that added token dimensions can expose RED, while successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions.
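The idea of splitting a local direction's effect into contributions from separate operators can be illustrated on a toy model. The sketch below is not the paper's construction. It assumes a single residual MLP block followed by RMS normalization and a linear readout, with random weights, and it splits the gradient of a scalar "refusal score" only into an identity (skip) path and an MLP path. All names (`W1`, `W2`, `u`, `q`, `refusal_score`, `grad_paths`) are hypothetical, and the linear probe `q` standing in for the model's harmful-semantics interpretation is an assumption made purely for illustration.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_t_vec(M, v):
    # Computes M^T v without building the transpose.
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward(x, W1, W2):
    a = [math.tanh(z) for z in matvec(W1, x)]
    h = [xi + mi for xi, mi in zip(x, matvec(W2, a))]  # residual wiring
    rms = math.sqrt(sum(t * t for t in h) / len(h))
    n = [t / rms for t in h]                           # RMS normalization
    return a, rms, n

def refusal_score(x, W1, W2, u):
    # Toy scalar score: linear (terminal) readout of the normalized state.
    _, _, n = forward(x, W1, W2)
    return dot(u, n)

def grad_paths(x, W1, W2, u):
    """Return (skip_path, mlp_path); their sum is the exact input gradient."""
    a, rms, n = forward(x, W1, W2)
    d = len(n)
    nu = dot(n, u)
    # Backprop through readout and RMS normalization.
    g = [(ui - ni * nu / d) / rms for ui, ni in zip(u, n)]
    skip = g  # identity path of the residual connection
    back = [b * (1 - ai * ai) for b, ai in zip(mat_t_vec(W2, g), a)]
    mlp = mat_t_vec(W1, back)  # path through the MLP
    return skip, mlp

rng = random.Random(0)
d, k = 4, 6
W1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
W2 = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
u = [rng.uniform(-1, 1) for _ in range(d)]  # hypothetical refusal readout
q = [rng.uniform(-1, 1) for _ in range(d)]  # hypothetical harm-semantics probe
x = [rng.uniform(-1, 1) for _ in range(d)]

skip, mlp = grad_paths(x, W1, W2, u)
grad = [s + m for s, m in zip(skip, mlp)]

# Candidate direction: steepest descent of the refusal score, with the
# component along q removed so that q.x is unchanged to first order.
coef = dot(grad, q) / dot(q, q)
perp = [gi - coef * qi for gi, qi in zip(grad, q)]
norm = math.sqrt(dot(perp, perp))
v = [-p / norm for p in perp]

# Per-path contributions to the directional derivative along v.
c_skip, c_mlp = dot(v, skip), dot(v, mlp)

# Central finite difference as an independent check of the decomposition.
eps = 1e-5
xp = [xi + eps * vi for xi, vi in zip(x, v)]
xm = [xi - eps * vi for xi, vi in zip(x, v)]
fd = (refusal_score(xp, W1, W2, u) - refusal_score(xm, W1, W2, u)) / (2 * eps)

print("skip-path contribution:", c_skip)
print("mlp-path contribution: ", c_mlp)
print("sum vs finite difference:", c_skip + c_mlp, fd)
```

In this toy, the two path contributions add up exactly to the directional derivative of the score, which is the sense in which a first-order behavior shift can be attributed to individual operators. The normalization and readout steps both enter through the shared backpropagated vector `g`, so separating them into their own sources, as the abstract describes, would need a finer decomposition than this sketch attempts.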

Problem

Research questions and friction points this paper is trying to address.

jailbreakability

aligned LLMs

structural vulnerabilities

Refusal-Escape Directions

safety-utility trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

Refusal-Escape Directions

operator-level sources

safety-utility trade-off