🤖 AI Summary
This work investigates the intrinsic mechanisms by which adversarial attacks circumvent safety alignment in large language models (LLMs). Focusing on the core safety behavior of refusing harmful requests, we introduce the concept of *representational independence*, which distinguishes linear orthogonality from causal independence under intervention. Using gradient-based representation engineering, directional intervention analysis, and concept cone modeling, we find that the activation space contains multiple orthogonal, functionally independent, and mechanistically separable refusal subspaces, which together form a high-dimensional concept cone, contrary to the conventional assumption of a single refusal axis. Empirical results show that LLM refusal arises from the joint action of multiple distinct mechanisms, each corresponding to a refusal direction that is both interpretable and intervenable. This work thereby provides the first geometrically grounded, decomposable, and intervenable mechanistic account of safety alignment in LLMs.
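
To make "directional intervention" concrete, here is a minimal, self-contained PyTorch sketch of directional ablation: removing the component of residual-stream activations that lies along a candidate refusal direction. The tensor shapes, the random activations, and the `ablate_direction` helper are illustrative assumptions, not the paper's implementation.

```python
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `activations` lying along `direction`.

    activations: (..., d_model) residual-stream activations (toy example here).
    direction:   (d_model,) candidate refusal direction (need not be unit norm).
    """
    unit = direction / direction.norm()
    coeffs = activations @ unit                      # scalar projection per activation
    return activations - coeffs.unsqueeze(-1) * unit

# Toy usage: ablate a random candidate direction from a batch of activations.
acts = torch.randn(4, 16, 512)                       # (batch, seq, d_model)
refusal_dir = torch.randn(512)
acts_ablated = ablate_direction(acts, refusal_dir)

# Sanity check: the remaining component along the direction is ~0.
print((acts_ablated @ (refusal_dir / refusal_dir.norm())).abs().max())
```

In a real model, this kind of projection would typically be applied to activations across layers and token positions to test whether a given direction causally mediates refusal.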
📝 Abstract
The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions, and even multi-dimensional concept cones, that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence, which accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions, showing that refusal behavior in LLMs is governed by complex spatial structures and driven by multiple functionally distinct mechanisms rather than a single axis. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
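
The gap between orthogonality and independence under intervention can be illustrated with a small sketch: two directions can have zero cosine similarity, yet ablating one can still change what steering with the other does once the downstream computation is non-linear. The toy `scorer` network, the random activations, and the steering coefficient below are hypothetical stand-ins for a model's actual refusal behavior, not the paper's evaluation protocol.

```python
import torch

def ablate(acts, d):
    """Project out direction d from every activation vector."""
    u = d / d.norm()
    return acts - (acts @ u).unsqueeze(-1) * u

def steering_effect(scorer, acts, d, alpha=4.0):
    """Change in mean 'refusal score' when alpha * d is added to the activations."""
    with torch.no_grad():
        return (scorer(acts + alpha * d) - scorer(acts)).mean().item()

d_model = 128
torch.manual_seed(0)

# Toy non-linear "refusal scorer" standing in for the model's downstream computation.
scorer = torch.nn.Sequential(
    torch.nn.Linear(d_model, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 1),
)

acts = torch.randn(256, d_model)            # toy residual-stream activations

# Two candidate refusal directions, made exactly orthogonal via Gram-Schmidt.
d1 = torch.randn(d_model)
d2 = torch.randn(d_model)
d2 = d2 - (d2 @ d1) / (d1 @ d1) * d1
d1, d2 = d1 / d1.norm(), d2 / d2.norm()
print("cosine(d1, d2):", (d1 @ d2).item())  # ~0: linearly orthogonal

# Independence test under intervention:
# does removing d1 change what steering with d2 does?
effect_plain = steering_effect(scorer, acts, d2)
effect_after_ablation = steering_effect(scorer, ablate(acts, d1), d2)
print("d2 steering effect (original activations):  ", effect_plain)
print("d2 steering effect (d1 ablated beforehand): ", effect_after_ablation)
# If the two effects differ substantially, d1 and d2 are orthogonal
# but not independent under intervention.
```

A test of this kind, run against a model's real refusal metric rather than a toy scorer, is one way to operationalize representational independence beyond a simple orthogonality check.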