LoRA is All You Need for Safety Alignment of Reasoning LLMs

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Safety alignment fine-tuning of reasoning LLMs often incurs a substantial "safety tax": a significant degradation of core reasoning capabilities. This work proposes fine-tuning on a refusal dataset with LoRA, which preserves safety while mitigating the loss of reasoning performance. The key idea is to constrain safety-related parameter updates to a low-rank subspace, minimizing interference with the original high-capacity reasoning pathways by reducing overlap between the safety updates and the base weights. The authors further explore reducing this overlap through targeted regularization and adjusted weight merging. Experiments across four benchmarks spanning mathematics, science, and programming show that the approach achieves safety alignment comparable to full-parameter fine-tuning while retaining the model's original reasoning performance.

📝 Abstract
Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs -- with safety levels comparable to full-model fine-tuning -- without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap -- via regularization or during weight merging -- and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.
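The abstract's core mechanism can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the layer size, rank, and scaling below are assumed values, and the random matrices stand in for trained LoRA factors.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 64, 8, 16           # hypothetical layer size and LoRA rank (r << d)

W0 = rng.standard_normal((d, d))  # frozen base ("reasoning") weight
A = rng.standard_normal((r, d))   # trainable down-projection (stand-in for trained values)
B = rng.standard_normal((d, r))   # trainable up-projection (zero-initialized in practice)

# After SFT on a refusal dataset, the low-rank safety update is merged back:
delta_W = (alpha / r) * (B @ A)
W_merged = W0 + delta_W

# The safety update spans at most r directions of the d-dimensional weight space,
# which is the mechanism the paper credits for limited interference with reasoning.
print(np.linalg.matrix_rank(delta_W))  # at most r
```

Because `delta_W` has rank at most `r`, the safety update can only perturb a small subspace of each weight matrix, leaving the remaining directions, where reasoning capability presumably lives, untouched.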
Problem

Research questions and friction points this paper is trying to address.

Safety alignment fine-tuning degrades reasoning abilities in LLMs (the "Safety Tax")
Whether low-rank updates can minimize interference between safety and reasoning weights
Achieving high safety without compromising reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA for safety alignment fine-tuning
Low-rank updates minimize reasoning interference
Regularization reduces weight overlap
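The overlap-reduction idea in these bullets can be illustrated with a toy metric. The `subspace_overlap` function and the choice to measure overlap against the base weight's top singular directions are assumptions made here for illustration, not the paper's exact definition or regularizer.

```python
import numpy as np

def subspace_overlap(W0, delta_W, k=8):
    """Fraction of ||delta_W||_F^2 that lies in the span of W0's top-k
    left singular vectors -- one plausible proxy (an assumption, not the
    paper's metric) for how much a safety update interferes with the
    base model's dominant weight directions."""
    U, _, _ = np.linalg.svd(W0, full_matrices=False)
    Uk = U[:, :k]                    # top-k directions of the base weight
    proj = Uk @ (Uk.T @ delta_W)     # projection of the update onto that subspace
    return np.linalg.norm(proj) ** 2 / np.linalg.norm(delta_W) ** 2

rng = np.random.default_rng(1)
W0 = rng.standard_normal((64, 64))
delta_W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))

overlap = subspace_overlap(W0, delta_W, k=8)
# A training-time regularizer in the spirit of the paper could penalize this
# quantity, e.g. loss = task_loss + lam * overlap (sketch only).
print(round(overlap, 3))
```

A differentiable version of this quantity could be added to the SFT loss, or the update could be projected away from the top directions at merge time; the paper reports that such overlap reduction yields improvements on some tasks.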
Yihao Xue
Department of Computer Science, University of California, Los Angeles
Baharan Mirzasoleiman
UCLA
Machine Learning · Optimization · Submodularity · ML Sustainability · Data-quality