Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

📅 2025-10-10
🤖 AI Summary
Existing safety alignment methods for large language models require iterative tuning of the ratio between safety-specific and general-domain data, incurring substantial computational overhead and often degrading general-purpose capabilities. Method: We propose an efficient LoRA-based safety alignment framework. For the first time, we identify and exploit the property that LoRA confines safety-related updates to low-rank subspaces orthogonal to the model’s primary functional subspace—enabling effective refusal training using safety data alone, without mixing general-domain data. Contribution/Results: Our approach delivers a plug-and-play safety patch that significantly enhances model safety with zero degradation in general-domain performance. It eliminates the need for costly hyperparameter search over data mixing ratios, reducing training cost by up to an order of magnitude. The method exhibits strong scalability and deployment flexibility, making it suitable for resource-constrained and production environments.

📝 Abstract
Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM safety without degrading general performance capabilities
Reducing computational costs of safety alignment through efficient methods
Decoupling safety into orthogonal subspace to prevent capability interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-based refusal training for safety alignment
Decouples safety into orthogonal low-rank subspace
Plug-and-play safety patches preserve model performance
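The "plug-and-play safety patch" framing above amounts to merging a low-rank delta into the base weights and being able to unmerge it exactly. A hedged sketch with hypothetical names and random stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4  # illustrative hidden size and LoRA rank

W = rng.standard_normal((d, d))        # base weight (stand-in)
A = rng.standard_normal((r, d)) * 0.1  # trained safety LoRA factors
B = rng.standard_normal((d, r)) * 0.1

def apply_patch(W, B, A, scale=1.0):
    """Merge the low-rank safety delta into the base weights."""
    return W + scale * (B @ A)

def remove_patch(W_patched, B, A, scale=1.0):
    """Unmerge the delta, recovering the original weights exactly."""
    return W_patched - scale * (B @ A)

W_safe = apply_patch(W, B, A)
W_back = remove_patch(W_safe, B, A)
print(np.allclose(W, W_back))  # the patch is fully reversible
```

Because the patch is a separable additive term, it can be shipped, applied, and removed independently of the base checkpoint, which is what makes the approach deployment-friendly.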
Yutao Mou — Peking University (AI Safety, LLM Alignment)
Xiaoling Zhou — National Engineering Research Center for Software Engineering, Peking University, China
Yuxiao Luo — National Engineering Research Center for Software Engineering, Peking University, China
Shikun Zhang — Peking University
Wei Ye — National Engineering Research Center for Software Engineering, Peking University, China