How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
While large reasoning models (LRMs) exhibit enhanced reasoning capabilities, their safety performance often lags behind or even degrades. This paper systematically investigates supervised fine-tuning (SFT)-based safety alignment for LRMs, making three key contributions: (1) identifying and rectifying three fundamental failure patterns in the process of distilling safe responses; (2) replacing complex chain-of-thought reasoning with concise, templated reasoning, which preserves safety while substantially improving training efficiency and generalization stability; and (3) introducing a math-data-mixing training paradigm that effectively mitigates over-refusal. Experiments on SafeBench demonstrate a 23.6% improvement in safety performance and an 18.4% reduction in over-refusal rate. All code and data are publicly released.
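The templated-reasoning idea described above can be illustrated with a minimal sketch. The template text, field names, and helper function below are assumptions for illustration, not the paper's actual template:

```python
# Hypothetical template for a concise, structured safety reasoning trace,
# standing in for a long free-form chain of thought. The exact wording
# is an assumption; the paper's released data defines the real template.
SAFETY_TEMPLATE = (
    "<think>\n"
    "The request involves {risk}. Complying could cause harm, "
    "so I should refuse and, where possible, redirect to safe help.\n"
    "</think>\n"
    "{refusal}"
)

def build_templated_response(risk: str, refusal: str) -> str:
    """Fill the fixed reasoning template with a risk description and a refusal."""
    return SAFETY_TEMPLATE.format(risk=risk, refusal=refusal)

example = build_templated_response(
    risk="instructions for synthesizing a dangerous substance",
    refusal="I can't help with that request.",
)
```

Because every training example shares the same short reasoning skeleton, the model only needs to learn to fill in the risk assessment and refusal, which is consistent with the paper's finding that such data is easier to learn than intricate reasoning chains.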

📝 Abstract
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using a short or template-based reasoning process can attain comparable safety performance, and is significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data into safety fine-tuning helps balance safety and over-refusal. Overall, we hope our empirical study provides a more holistic picture of enhancing the safety of LRMs. The code and data used in our experiments are released at https://github.com/thu-coai/LRM-Safety-Study.
Problem

Research questions and friction points this paper is trying to address.

Enhancing safety of Large Reasoning Models (LRMs) despite their advanced reasoning capabilities
Investigating failure patterns in supervised fine-tuning for safety improvement
Balancing safety and over-refusal by mixing math reasoning data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Fine-Tuning for safety enhancement
Short template-based reasoning for safety
Mixing math data to balance safety
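The data-mixing innovation can be sketched as follows. This is a minimal, hypothetical recipe for interleaving math reasoning examples into a safety SFT set; the mixing ratio and data format are assumptions, not the paper's exact configuration:

```python
import random

def mix_safety_and_math(safety_data, math_data, math_ratio=0.5, seed=0):
    """Combine safety SFT examples with sampled math reasoning examples.

    math_ratio controls how many math examples are added relative to the
    number of safety examples (0.5 -> one math example per two safety
    examples). The ratio here is illustrative only.
    """
    rng = random.Random(seed)
    n_math = int(len(safety_data) * math_ratio)
    sampled_math = rng.sample(math_data, min(n_math, len(math_data)))
    mixed = safety_data + sampled_math
    rng.shuffle(mixed)  # interleave so batches see both data types
    return mixed

# Toy datasets: each example is a (prompt, response) pair.
safety = [(f"harmful prompt {i}", "templated safe refusal") for i in range(4)]
math = [(f"math problem {i}", "step-by-step solution") for i in range(10)]

mixed = mix_safety_and_math(safety, math, math_ratio=0.5)
```

The intuition, per the paper's finding, is that keeping benign reasoning data in the fine-tuning mix prevents the model from collapsing toward refusing everything, trading a little safety-data concentration for a lower over-refusal rate.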
Authors

Zhexin Zhang (CoAI Group, Tsinghua University)
Xian Qi Loye (CoAI Group, DCST, Tsinghua University)
Victor Shea-Jay Huang (CoAI Group, DCST, Tsinghua University)
Junxiao Yang (Tsinghua University)
Qi Zhu
Shiyao Cui (Tsinghua University)
Fei Mi (Huawei Noah's Ark Lab)
Lifeng Shang (Huawei Noah's Ark Lab)
Yingkang Wang (CoAI Group, DCST, Tsinghua University)
Hongning Wang (Department of Computer Science and Technology, Tsinghua University)
Minlie Huang (CoAI Group, DCST, Tsinghua University)