🤖 AI Summary
Large reasoning models (LRMs) often harbor harmful content within their chain-of-thought (CoT) reasoning—even when final outputs are safe—undermining trustworthiness and enabling misuse. This work pioneers *safe reasoning alignment*, proposing a corrective intervention paradigm: identifying critical safety-triggering reasoning steps and compliance-indicative cues, then replacing unsafe reasoning paths to achieve process-level safety control. Building on this paradigm, the authors introduce Intervened Preference Optimization (IPO), an end-to-end preference learning method that constructs high-signal preference pairs incorporating both safety and reasoning-quality objectives. IPO integrates process supervision, trigger identification, corrective intervention, and dual-objective preference learning. Experiments on jailbreak and adversarial safety benchmarks show that IPO reduces harmfulness by over 30% relative to supervised fine-tuning (SFT) and reinforcement learning (RL) baselines, while preserving multi-task reasoning performance.
📝 Abstract
Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue remains in existing methods, which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. In this paper, we therefore shift our focus to aligning the safety of the reasoning itself and explore process supervision as the solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first delve into the characteristics of safe reasoning and uncover several critical insights: 1) safe reasoning is often consolidated by a few critical steps of safety triggers; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories towards safer traces. Motivated by these insights, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing preference pairs with strong learning signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO substantially improves overall safety in both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving excellent performance across diverse reasoning tasks. The results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.