🤖 AI Summary
This study tests whether refusal in large language models is governed by a single activation-space direction. Using geometric analysis of activation space and linear interventions, the authors examine the representations underlying eleven categories of refusal and non-compliance. They find that these categories correspond to geometrically distinct directions, so refusal is not captured by one direction alone. Nevertheless, steering along any one of these directions produces nearly the same trade-off between appropriate refusal and over-refusal, which suggests that the directions act as a shared one-dimensional control. The directions differ mainly in how the model refuses rather than whether it refuses.
📝 Abstract
Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal-to-over-refusal trade-offs, so that each direction acts as the same one-dimensional control knob. The choice of direction primarily affects not whether the model refuses, but how it refuses.
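
The operations the abstract names (comparing directions geometrically, linear steering, and ablation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the directions are extracted, so the difference-of-means estimate below is an assumption, as are the function names, the 8-dimensional toy activations, and the synthetic data standing in for real model activations.

```python
import numpy as np


def difference_of_means_direction(refused_acts, complied_acts):
    """Unit vector from the mean 'complied' activation to the mean 'refused' one.

    Assumption: difference of means is used here only as a simple,
    illustrative way to estimate a refusal-related direction.
    """
    direction = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def cosine_similarity(u, v):
    """Cosine of the angle between two directions (1.0 means identical)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(activations, direction, alpha):
    """Linear steering: shift every activation by alpha along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit


def ablate(activations, direction):
    """Directional ablation: remove each activation's component along the direction."""
    unit = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ unit, unit)


# Synthetic stand-ins for activations from two hypothetical refusal categories.
rng = np.random.default_rng(0)
dim = 8
complied = rng.normal(size=(50, dim))
refused_a = rng.normal(size=(50, dim)) + 2.0 * np.eye(dim)[0]  # shifted along axis 0
refused_b = rng.normal(size=(50, dim)) + 2.0 * np.eye(dim)[1]  # shifted along axis 1

dir_a = difference_of_means_direction(refused_a, complied)
dir_b = difference_of_means_direction(refused_b, complied)

# A cosine similarity well below 1 indicates geometrically distinct directions.
print("cosine(dir_a, dir_b):", cosine_similarity(dir_a, dir_b))

# Steering raises the mean projection onto dir_a by alpha; ablation zeroes it.
steered = steer(complied, dir_a, alpha=4.0)
ablated = ablate(complied, dir_a)
print("mean projection after steering:", float((steered @ dir_a).mean()))
print("mean projection after ablation:", float((ablated @ dir_a).mean()))
```

In a real experiment the activations would be read from a chosen layer of the model, and the refusal and over-refusal rates would be measured as `alpha` varies. The sketch shows only the geometry of the interventions, not those behavioral results.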