🤖 AI Summary
This work investigates the multidimensional representations underlying safety-aligned behaviors, such as refusing harmful queries, in large language models (LLMs). Where earlier work modeled safety behavior with a single direction, the authors propose a multidimensional vector-space model of safety alignment, built from the representation shifts induced by safety fine-tuning of Llama 3 8B and analyzed through orthogonal direction decomposition, activation projection interventions, and attribution-based measurement. The analysis identifies a dominant refusal direction alongside several smaller, semantically interpretable orthogonal directions, including hypothetical narrative and role-playing. These secondary directions promote or suppress the dominant refusal direction, which exposes a vulnerability: removing certain trigger tokens from harmful queries weakens the multidimensional safety representation and can bypass the learned refusal behavior. The findings offer a multidimensional framework, with empirical grounding, for understanding vulnerabilities in LLM safety alignment.
📝 Abstract
The safety-aligned behaviors of large language models, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multiple directions. Specifically, we study the vector space of representation shifts that arise when Llama 3 8B is safety fine-tuned to refuse jailbreaks. By studying orthogonal directions in this space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features such as hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens from harmful queries can weaken these directions and bypass the learned safety capability, providing new insights into safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
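
The idea of extracting orthogonal directions from representation shifts and then intervening on one of them can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that the directions are obtained by a singular value decomposition of the difference between fine-tuned and base activations, and it runs on synthetic data rather than Llama 3 8B activations. The function names `residual_directions` and `ablate` are hypothetical.

```python
import numpy as np

def residual_directions(base_acts, tuned_acts, k):
    """Top-k orthonormal directions of the activation shift (tuned - base).

    Assumption: SVD is used as the orthogonal decomposition; the paper's
    exact procedure may differ.
    """
    shifts = tuned_acts - base_acts  # shape (n_samples, d_model)
    # Rows of vt are orthonormal; singular values rank their strength.
    _, sing_vals, vt = np.linalg.svd(shifts, full_matrices=False)
    return vt[:k], sing_vals[:k]

def ablate(acts, direction):
    """Projection intervention: remove each activation's component along a direction."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

# Synthetic stand-in for model activations (hypothetical data).
rng = np.random.default_rng(0)
n_samples, d_model = 200, 16
base = rng.normal(size=(n_samples, d_model))

dominant = np.zeros(d_model)
dominant[0] = 1.0   # plays the role of the dominant refusal direction
secondary = np.zeros(d_model)
secondary[1] = 1.0  # plays the role of a smaller secondary direction

# "Fine-tuning" shifts activations strongly along one axis, weakly along another.
tuned = (
    base
    + 5.0 * rng.normal(size=(n_samples, 1)) * dominant
    + 1.0 * rng.normal(size=(n_samples, 1)) * secondary
)

directions, strengths = residual_directions(base, tuned, k=2)
ablated = ablate(tuned, directions[0])
```

On this toy data the first recovered direction aligns closely with the planted dominant axis (up to sign), and after ablation the activations have no remaining component along it. In a real setting the activations would come from matched layers of the base and safety fine-tuned models on the same prompts.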