🤖 AI Summary
This work challenges the prevailing view of safety alignment in large language models as a monolithic property: the success of jailbreak attacks reveals a fundamental decoupling between harm recognition and refusal execution. The authors propose the Disentangled Safety Hypothesis (DSH), formally decomposing safety mechanisms into two distinct axes: “Knowing” (recognition) and “Acting” (refusal). Through geometric analysis, they uncover an evolutionary pattern in which these axes transition from shallow entanglement to deep separation. Leveraging double-difference extraction and adaptive causal steering, they construct AmbiguityBench—a benchmark enabling causal disentanglement of “knowing without refusing”—and introduce the Refusal Erasure Attack (REA). REA achieves state-of-the-art attack success rates on Llama3.1 and Qwen2.5, exposing a critical architectural divergence between the explicit semantic control of Llama3.1 and the latent distributed control of Qwen2.5.
📝 Abstract
Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the \textbf{\underline{D}}isentangled \textbf{\underline{S}}afety \textbf{\underline{H}}ypothesis \textbf{(DSH)}, positing that safety computation operates on two distinct subspaces: a \textit{Recognition Axis} ($\mathbf{v}_H$, ``Knowing'') and an \textit{Execution Axis} ($\mathbf{v}_R$, ``Acting''). Our geometric analysis reveals a universal ``Reflex-to-Dissociation'' evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce \textit{Double-Difference Extraction} and \textit{Adaptive Causal Steering}. Using our curated \textsc{AmbiguityBench}, we demonstrate a causal double dissociation, effectively creating a state of ``Knowing without Acting.'' Crucially, we leverage this disentanglement to propose the \textbf{Refusal Erasure Attack (REA)}, which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the \textit{Explicit Semantic Control} of Llama3.1 with the \textit{Latent Distributed Control} of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
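The core geometric operation behind direction extraction and refusal erasure can be illustrated with a minimal difference-of-means sketch on synthetic activations. This is an illustrative assumption, not the paper's released code: the actual Double-Difference Extraction and Adaptive Causal Steering are more involved, and the vectors `v_H`, `v_R`, the noise model, and the `erase` helper below are hypothetical placeholders for real residual-stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Hypothetical ground-truth axes (stand-ins for the paper's v_H and v_R).
v_H = rng.normal(size=d); v_H /= np.linalg.norm(v_H)  # Recognition ("Knowing")
v_R = rng.normal(size=d); v_R /= np.linalg.norm(v_R)  # Execution ("Acting")

# Synthetic activations: harmful prompts carry both signals, harmless carry neither.
n = 200
acts_harmful  = v_H + v_R + rng.normal(scale=0.1, size=(n, d))
acts_harmless = rng.normal(scale=0.1, size=(n, d))

# Single difference-of-means: estimates the harmful-vs-harmless direction,
# which here mixes recognition and refusal (motivating a double difference).
diff = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
v_hat = diff / np.linalg.norm(diff)

def erase(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Rank-1 ablation: project activations onto the orthogonal complement of v."""
    return h - np.outer(h @ v, v)

# After ablation, the component along v_hat is numerically zero while the
# rest of the activation (e.g., task-relevant content) is untouched.
ablated = erase(acts_harmful, v_hat)
print(float(np.abs(ablated @ v_hat).max()))
```

In this toy setup `v_hat` recovers the mixed direction `v_H + v_R`; separating the two axes so that only the refusal component is erased is precisely what the paper's double-difference construction is for.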