LoRA is All You Need for Safety Alignment of Reasoning LLMs

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Safety alignment fine-tuning of reasoning LLMs often incurs a substantial "safety tax": a significant degradation of core reasoning capabilities. This work proposes fine-tuning on a refusal dataset with LoRA, which preserves safety while mitigating the loss of reasoning performance. The key idea is to constrain safety-related parameter updates to a low-rank subspace, minimizing interference with the original high-capacity reasoning pathways by reducing overlap between the safety updates and the base weights. The authors further explore reducing this overlap through targeted regularization and adjusted weight merging. Experiments across four benchmarks spanning mathematics, science, and programming show that the approach achieves safety alignment comparable to full-parameter fine-tuning while retaining the model's original reasoning performance.

📝 Abstract
Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs -- with safety levels comparable to full-model fine-tuning -- without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap -- via regularization or during weight merging -- and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.
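The abstract's core mechanism can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the layer size, rank, and scaling below are assumed values, and the random matrices stand in for trained LoRA factors.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 64, 8, 16           # hypothetical layer size and LoRA rank (r << d)

W0 = rng.standard_normal((d, d))  # frozen base ("reasoning") weight
A = rng.standard_normal((r, d))   # trainable down-projection (stand-in for trained values)
B = rng.standard_normal((d, r))   # trainable up-projection (zero-initialized in practice)

# After SFT on a refusal dataset, the low-rank safety update is merged back:
delta_W = (alpha / r) * (B @ A)
W_merged = W0 + delta_W

# The safety update spans at most r directions of the d-dimensional weight space,
# which is the mechanism the paper credits for limited interference with reasoning.
print(np.linalg.matrix_rank(delta_W))  # at most r
```

Because `delta_W` has rank at most `r`, the safety update can only perturb a small subspace of each weight matrix, leaving the remaining directions, where reasoning capability presumably lives, untouched.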
Problem

Research questions and friction points this paper is trying to address.

Safety alignment fine-tuning degrades reasoning abilities in LLMs (the "Safety Tax")
Whether low-rank updates can minimize interference between safety and reasoning weights
Achieving high safety without compromising reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA for safety alignment fine-tuning
Low-rank updates minimize reasoning interference
Regularization reduces weight overlap
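The overlap-reduction idea in these bullets can be illustrated with a toy metric. The `subspace_overlap` function and the choice to measure overlap against the base weight's top singular directions are assumptions made here for illustration, not the paper's exact definition or regularizer.

```python
import numpy as np

def subspace_overlap(W0, delta_W, k=8):
    """Fraction of ||delta_W||_F^2 that lies in the span of W0's top-k
    left singular vectors -- one plausible proxy (an assumption, not the
    paper's metric) for how much a safety update interferes with the
    base model's dominant weight directions."""
    U, _, _ = np.linalg.svd(W0, full_matrices=False)
    Uk = U[:, :k]                    # top-k directions of the base weight
    proj = Uk @ (Uk.T @ delta_W)     # projection of the update onto that subspace
    return np.linalg.norm(proj) ** 2 / np.linalg.norm(delta_W) ** 2

rng = np.random.default_rng(1)
W0 = rng.standard_normal((64, 64))
delta_W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))

overlap = subspace_overlap(W0, delta_W, k=8)
# A training-time regularizer in the spirit of the paper could penalize this
# quantity, e.g. loss = task_loss + lam * overlap (sketch only).
print(round(overlap, 3))
```

A differentiable version of this quantity could be added to the SFT loss, or the update could be projected away from the top directions at merge time; the paper reports that such overlap reduction yields improvements on some tasks.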
Yihao Xue
Department of Computer Science, University of California, Los Angeles
Baharan Mirzasoleiman
UCLA
Machine Learning · Optimization · Submodularity · ML Sustainability · Data-quality