A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning large language models (LLMs), e.g., via LoRA, often degrades pre-trained safety alignment, leading to harmful outputs. This paper proposes GuardSpace, a framework that preserves safety alignment during fine-tuning without compromising task performance. Methodologically, GuardSpace couples a safety-sensitive subspace with a harmful-resistant null space, providing dual protection through weight freezing and output-space constraints. It employs covariance-preconditioned SVD to decompose model weights, initializing low-rank adapters exclusively from safety-irrelevant components, and it introduces a null-space projector that suppresses response shifts on harmful prompts. On Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace reduces the average harmful score from 14.4% (for the state-of-the-art baseline AsFT) to 3.6%, while improving accuracy from 26.0% to 28.0%. These results demonstrate a superior safety-utility trade-off over existing methods.
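To make the subspace step concrete, here is a minimal sketch of how a weight matrix could be split via SVD of a covariance-preconditioned weight. The helper name `covariance_preconditioned_split`, the use of an input covariance `C` estimated on safety-related prompts, and the rank cutoff `r_safe` are all illustrative assumptions; the paper's exact preconditioner and split criterion may differ.

```python
import torch

def covariance_preconditioned_split(W: torch.Tensor, C: torch.Tensor, r_safe: int):
    """Split a pre-trained weight into safety-relevant and safety-irrelevant
    parts via SVD of the covariance-preconditioned weight W @ C^{1/2}.

    W      : (d_out, d_in) pre-trained weight
    C      : (d_in, d_in) input covariance, assumed estimated from activations
             on safety-related prompts (an illustrative assumption)
    r_safe : number of leading singular directions treated as safety-relevant
    """
    # Symmetric square root of the covariance via eigendecomposition.
    eigvals, eigvecs = torch.linalg.eigh(C)
    C_half = eigvecs @ torch.diag(eigvals.clamp_min(0.0).sqrt()) @ eigvecs.T
    C_half_pinv = torch.linalg.pinv(C_half)

    # SVD in the preconditioned space emphasizes directions that matter
    # most for safety-prompt inputs.
    U, S, Vh = torch.linalg.svd(W @ C_half, full_matrices=False)

    # Leading components form the frozen safety-relevant part; the residual
    # is safety-irrelevant and can seed the low-rank adapters.
    W_safe = U[:, :r_safe] @ torch.diag(S[:r_safe]) @ Vh[:r_safe] @ C_half_pinv
    W_rest = W - W_safe  # exact residual: W = W_safe + W_rest
    return W_safe, W_rest
```

Freezing `W_safe` and adapting only within `W_rest` mirrors the summary's "weight freezing" half of the dual protection; the null-space half is sketched after the abstract below.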

📝 Abstract
Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
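The second component, the null space projector, can be approximated with a standard null-space projection. The sketch below is a minimal construction, assuming harmful-prompt activations are collected column-wise into a matrix `H`; the function name and tolerance are illustrative, not the paper's API.

```python
import torch

def harmful_nullspace_projector(H: torch.Tensor, tol: float = 1e-6) -> torch.Tensor:
    """Projector onto the orthogonal complement of harmful-prompt activations.

    H : (d_in, n) matrix whose columns are hidden activations collected on
        harmful prompts (an assumed data layout).
    Returns P : (d_in, d_in) with P @ h ~ 0 for each column h of H, so an
    adapter update applied as dW @ P leaves the layer output
    W x + (dW @ P) x unchanged whenever x lies in span(H).
    """
    U, S, _ = torch.linalg.svd(H, full_matrices=False)
    rank = int((S > tol * S.max()).sum())  # numerical rank of span(H)
    U_r = U[:, :rank]                      # orthonormal basis of span(H)
    return torch.eye(H.shape[0], dtype=H.dtype) - U_r @ U_r.T
```

Because the projected update annihilates exactly the directions spanned by harmful-prompt activations, the fine-tuned layer reproduces the original (refusing) outputs on those prompts while remaining free to move in the remaining directions for the downstream task.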
Problem

Research questions and friction points this paper is trying to address.

Preserving LLM safety alignment during fine-tuning
Preventing degradation of safety behaviors even when fine-tuning on benign data
Maintaining refusal capabilities while adapting to new tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes pre-trained weights into safety-relevant and safety-irrelevant components via covariance-preconditioned SVD
Freezes safety-relevant components to preserve the safety mechanism
Uses a null-space projector to maintain refusal behavior on harmful prompts (see the sketch after this list)
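Putting the two ideas together, a training step might look like the following sketch. It assumes the hypothetical helpers above: the safety-relevant weight component is frozen simply by never registering it with the optimizer, and the gradient of the LoRA `A` factor is projected through `P` before each update. With plain SGD this keeps `A` exactly in the null space (given a null-space initialization); adaptive optimizers make it approximate.

```python
import torch

def guarded_adapter_step(lora_A: torch.Tensor, P: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> None:
    """One fine-tuning step under both guardrails (illustrative sketch).

    lora_A : (r, d_in) adapter factor, assumed initialized in the null space
    P      : (d_in, d_in) projector from `harmful_nullspace_projector`
             (a hypothetical helper defined earlier)
    """
    with torch.no_grad():
        if lora_A.grad is not None:
            # Since delta_W = B @ A, forcing A's rows to stay in the null
            # space keeps delta_W @ x = 0 for harmful-prompt activations x,
            # so refusal outputs on those prompts are preserved.
            lora_A.grad.copy_(lora_A.grad @ P)
    optimizer.step()
    optimizer.zero_grad()
```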
Bingjie Zhang
School of Artificial Intelligence, Jilin University
Yibo Yang
University of Oxford
Renzhe
School of Artificial Intelligence, Jilin University
Dandan Guo
School of Artificial Intelligence, Jilin University
Jindong Gu
Google Research & DeepMind, University of Oxford
Trustworthy AI, AI Safety, Multimodal AI
Philip Torr
Professor, University of Oxford
Department of Engineering
Bernard Ghanem
Professor, King Abdullah University of Science and Technology
computer vision, machine learning