Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) often harbor harmful content within their chain-of-thought (CoT) reasoning—even when final outputs are safe—undermining trustworthiness and enabling misuse. This work pioneers *safe reasoning alignment*, proposing a corrective intervention paradigm: identifying critical safety-triggering reasoning steps and compliance-indicative cues, then replacing unsafe reasoning paths to achieve process-level safety control. We further introduce Intervened Preference Optimization (IPO), an end-to-end preference learning method that constructs high-signal preference pairs incorporating both safety and reasoning quality objectives. IPO integrates process supervision, trigger identification, corrective intervention, and dual-objective preference learning. Experiments on jailbreaking and adversarial safety benchmarks show IPO reduces harmfulness by over 30% compared to supervised fine-tuning (SFT) and reinforcement learning (RL) baselines, while preserving multi-task reasoning performance.

📝 Abstract
Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue persists under existing methods, which overlook the unique significance of safe reasoning, undermining trustworthiness and posing risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We therefore shift our focus to aligning the safety of the reasoning itself and explore process supervision as a solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first examine the characteristics of safe reasoning and uncover several critical insights: 1) safe reasoning is often consolidated by a few critical safety-trigger steps; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories toward safer traces. Motivated by these findings, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing preference pairs with strong learning signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO markedly improves the overall safety of both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness while preserving strong performance across diverse reasoning tasks. These results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.
Problem

Research questions and friction points this paper is trying to address.

LRMs generate harmful content in reasoning chains
Existing methods fail to ensure safe reasoning processes
Corrective intervention needed to align reasoning safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intervened Preference Optimization for safe reasoning
Substituting compliance steps with safety triggers
Constructing preference pairs for strong learning signals
Yichi Zhang (THU)
Yue Ding (CASIA)
Jingwen Yang (THU)
Tianwei Luo (THU)
Dongbai Li (THU)
Ranjie Duan (Alibaba Group)
Qiang Liu (CASIA)
Hang Su (THU)
Yinpeng Dong (Tsinghua University)
Jun Zhu (THU)