🤖 AI Summary
This work challenges the prevailing view of safety alignment in large language models as a monolithic property: the success of jailbreak attacks reveals a fundamental decoupling between harm recognition and refusal execution. The authors propose the Disentangled Safety Hypothesis (DSH), formally decomposing safety mechanisms into two distinct axes: “Knowing” (recognition) and “Acting” (refusal). Through geometric analysis, they uncover an evolutionary pattern in which these axes transition from shallow entanglement to deep separation. Leveraging double-difference extraction and adaptive causal steering, they construct AmbiguityBench—a benchmark enabling causal disentanglement of “knowing without refusing”—and introduce the Refusal Erasure Attack (REA). REA achieves state-of-the-art attack success rates on Llama3.1 and Qwen2.5, exposing a critical architectural divergence between the explicit semantic control of Llama3.1 and the latent distributed control of Qwen2.5.
📝 Abstract
Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the \textbf{\underline{D}}isentangled \textbf{\underline{S}}afety \textbf{\underline{H}}ypothesis \textbf{(DSH)}, positing that safety computation operates on two distinct subspaces: a \textit{Recognition Axis} ($\mathbf{v}_H$, ``Knowing'') and an \textit{Execution Axis} ($\mathbf{v}_R$, ``Acting''). Our geometric analysis reveals a universal ``Reflex-to-Dissociation'' evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce \textit{Double-Difference Extraction} and \textit{Adaptive Causal Steering}. Using our curated \textsc{AmbiguityBench}, we demonstrate a causal double dissociation, effectively creating a state of ``Knowing without Acting.'' Crucially, we leverage this disentanglement to propose the \textbf{Refusal Erasure Attack (REA)}, which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the \textit{Explicit Semantic Control} of Llama3.1 with the \textit{Latent Distributed Control} of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
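The core geometric operation behind direction extraction and refusal erasure can be illustrated with a minimal difference-of-means sketch on synthetic activations. This is an illustrative assumption, not the paper's released code: the actual Double-Difference Extraction and Adaptive Causal Steering are more involved, and the vectors `v_H`, `v_R`, the noise model, and the `erase` helper below are hypothetical placeholders for real residual-stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Hypothetical ground-truth axes (stand-ins for the paper's v_H and v_R).
v_H = rng.normal(size=d); v_H /= np.linalg.norm(v_H)  # Recognition ("Knowing")
v_R = rng.normal(size=d); v_R /= np.linalg.norm(v_R)  # Execution ("Acting")

# Synthetic activations: harmful prompts carry both signals, harmless carry neither.
n = 200
acts_harmful  = v_H + v_R + rng.normal(scale=0.1, size=(n, d))
acts_harmless = rng.normal(scale=0.1, size=(n, d))

# Single difference-of-means: estimates the harmful-vs-harmless direction,
# which here mixes recognition and refusal (motivating a double difference).
diff = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
v_hat = diff / np.linalg.norm(diff)

def erase(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Rank-1 ablation: project activations onto the orthogonal complement of v."""
    return h - np.outer(h @ v, v)

# After ablation, the component along v_hat is numerically zero while the
# rest of the activation (e.g., task-relevant content) is untouched.
ablated = erase(acts_harmful, v_hat)
print(float(np.abs(ablated @ v_hat).max()))
```

In this toy setup `v_hat` recovers the mixed direction `v_H + v_R`; separating the two axes so that only the refusal component is erased is precisely what the paper's double-difference construction is for.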