🤖 AI Summary
This work investigates whether safety alignment in large language models (LLMs) is geometrically localized in separable subspaces of weight or activation space—i.e., whether safety can be isolated and controlled via low-dimensional, orthogonal directions.
Method: Leveraging five open-source LLMs, the authors conduct systematic analyses including parameter-space projection, activation similarity measurement, controllable fine-tuning perturbations, and representation disentanglement evaluation.
Contribution/Results: The study finds that subspaces which amplify safe behaviors also amplify unsafe ones, and that prompts with different safety implications activate overlapping representations. The authors find no evidence of an independent, controllable “safety direction” that selectively governs safety. These multi-model results challenge the “separable safety subspace” hypothesis. They suggest that safety alignment emerges from entangled, high-impact components of the model's broader learning dynamics rather than from localized geometric structure, which points to a fundamental limitation of subspace-based safety interventions.
📝 Abstract
Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
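To make the parameter-space question concrete, the following is a minimal sketch of one way such a test could be set up; it is not taken from the authors' repository. It assumes a candidate "alignment subspace" can be approximated by the top-k left singular vectors of a weight difference between an aligned and a base checkpoint, and then measures how much of a later fine-tuning update falls inside that subspace. The function names (`top_k_projector`, `energy_retained`), the matrix shapes, and the random stand-in matrices are all hypothetical.

```python
import numpy as np


def top_k_projector(delta, k):
    """Orthogonal projector onto the span of the top-k left singular vectors of `delta`."""
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    u_k = u[:, :k]
    return u_k @ u_k.T


def energy_retained(update, projector):
    """Fraction of the squared Frobenius norm of `update` that survives the projection."""
    projected = projector @ update
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(update) ** 2)


rng = np.random.default_rng(0)
d_out, d_in, k = 64, 32, 8

# Hypothetical stand-ins: in a real experiment these would be differences
# between the weights of actual checkpoints for a single layer.
delta_align = rng.standard_normal((d_out, d_in))    # aligned minus base
delta_useful = rng.standard_normal((d_out, d_in))   # benign fine-tune minus aligned
delta_harmful = rng.standard_normal((d_out, d_in))  # harmful fine-tune minus aligned

P = top_k_projector(delta_align, k)
useful_frac = energy_retained(delta_useful, P)
harmful_frac = energy_retained(delta_harmful, P)

print(f"benign update energy in subspace:  {useful_frac:.3f}")
print(f"harmful update energy in subspace: {harmful_frac:.3f}")
```

With random stand-ins, both fractions land near `k / d_out` on average, since neither update has any special relationship to the subspace. If a subspace selectively governed safety, one would expect harmful and benign updates to retain markedly different fractions; the abstract reports no evidence of such selectivity.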