🤖 AI Summary
Fine-tuning large language models (LLMs), e.g., via LoRA, often degrades their pre-trained safety alignment, leading to harmful outputs. This paper proposes GuardSpace, a framework that preserves safety alignment during fine-tuning without compromising task performance. Methodologically, GuardSpace couples a safety-sensitive subspace with a harmful-resistant null space, enabling dual protection through weight freezing and output-space constraints. It applies covariance-preconditioned SVD to decompose pre-trained weights, initializing low-rank adapters exclusively from the safety-irrelevant components, and introduces a null-space projector that suppresses response shifts on harmful prompts. Experiments on Llama-2-7B-Chat fine-tuned on GSM8K show that GuardSpace reduces the average harmful score from 14.4% (AsFT baseline) to 3.6% while improving task accuracy from 26.0% to 28.0%. These results demonstrate a superior safety-utility trade-off over state-of-the-art methods.
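The weight decomposition described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the weight matrix, the covariance estimator, and the rank cutoff `k` are all invented here, and the paper's exact preconditioning may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained weight W and an activation covariance C,
# assumed here to be estimated from safety-related prompts (both are
# random stand-ins; the paper's estimator is not reproduced).
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)
X = rng.standard_normal((d, 4 * d))
C = X @ X.T / (4 * d)                    # symmetric PSD covariance proxy

# Precondition W with C^{1/2} so the SVD ranks directions by how
# strongly safety-related activations excite them.
evals, evecs = np.linalg.eigh(C)
C_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
C_half_inv = evecs @ np.diag(1 / np.sqrt(np.clip(evals, 1e-8, None))) @ evecs.T
U, S, Vt = np.linalg.svd(W @ C_half, full_matrices=False)

# Split: top-k preconditioned directions are treated as safety-relevant
# and frozen; the residual is the safety-irrelevant part.
k = 8
W_safety = U[:, :k] @ np.diag(S[:k]) @ Vt[:k] @ C_half_inv  # frozen
W_rest = W - W_safety                                        # trainable part

# Initialize a rank-r low-rank adapter (LoRA-style B @ A) from the
# safety-irrelevant residual rather than from zeros/random noise.
r = 4
Ur, Sr, Vrt = np.linalg.svd(W_rest, full_matrices=False)
B = Ur[:, :r] @ np.diag(np.sqrt(Sr[:r]))      # (d, r) down-projection
A_lora = np.diag(np.sqrt(Sr[:r])) @ Vrt[:r]   # (r, d) up-projection

# Sanity check: the frozen part plus the residual recovers W exactly.
assert np.allclose(W_safety + W_rest, W)
```

The point of the preconditioning is that a plain SVD of `W` would rank directions by weight magnitude alone, whereas `W @ C_half` ranks them by their effect on safety-related activations.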
📝 Abstract
Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
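The second component, the null space projector, admits a compact sketch as well. Under the assumption (invented for illustration) that harmful-prompt activations are collected into a matrix `H`, projecting the adapter update onto the orthogonal complement of `span(H)` guarantees the update cannot change the layer's output on those activations; the paper's actual construction may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 64, 16
# Hypothetical activations gathered from harmful prompts (random
# stand-in data; the paper's collection procedure is not shown here).
H = rng.standard_normal((d, n))

# Projector onto the orthogonal complement of span(H). Any input
# direction excited by harmful prompts is annihilated, so the adapter
# update cannot shift the model's refusal outputs along it.
P = np.eye(d) - H @ np.linalg.pinv(H)

# Apply the projector to a raw low-rank update before it is added
# to the frozen weights.
dW = rng.standard_normal((d, d)) * 0.01
dW_safe = dW @ P

# Harmful-prompt activations now pass through the update unchanged:
# dW_safe @ H is (numerically) zero, preserving refusal behavior.
assert np.allclose(dW_safe @ H, 0, atol=1e-8)
```

Note that the projection only constrains behavior on the collected harmful activations; directions useful for the downstream task remain free, which is why safety and task accuracy need not trade off directly.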