AI Summary
Large reasoning models (LRMs) employ explicit chain-of-thought (CoT) prompting to enhance mathematical and logical reasoning, yet this introduces latent safety risks: unsafe behaviors often manifest within intermediate reasoning steps, while final answers may appear benign. Existing supervised fine-tuning (SFT) approaches leveraging safety-annotated long-CoT datasets suffer from instability, degraded reasoning performance, and poor generalization.
Method: We propose the first reinforcement learning framework for CoT safety alignment, featuring token-level reward modeling and multi-model collaborative training to directly suppress unsafe token generation during reasoning while preserving deep reflective capabilities.
Contribution/Results: Our method achieves significant safety improvements (+12.7% average safety rate) across multiple model families and benchmarks, with zero degradation in reasoning accuracy. It demonstrates strong cross-model generalization and consistent safety enforcement throughout the CoT trajectory.
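The token-level reward idea in the method summary can be illustrated with a minimal sketch. The helper names, the threshold, and the toy judge below are all hypothetical (the paper's actual reward design is not specified in this summary); the point is only that rewards are assigned per reasoning token rather than per final answer, so unsafe intermediate tokens can be penalized directly while safe reflective tokens keep a small shaping bonus.

```python
def token_level_rewards(tokens, unsafe_score, tau=0.5, penalty=-1.0, bonus=0.05):
    """Assign a reward to every CoT token: penalize tokens whose unsafety
    score (here, from an external judge function) exceeds a threshold tau,
    and give safe tokens a small positive bonus so reasoning depth is not
    suppressed along with the unsafe content."""
    rewards = []
    for tok in tokens:
        s = unsafe_score(tok)
        rewards.append(penalty * s if s > tau else bonus)
    return rewards

# Toy stand-in for a judge model: flags tokens from a tiny unsafe vocabulary.
UNSAFE = {"bypass", "exploit"}
judge = lambda tok: 1.0 if tok.lower() in UNSAFE else 0.0

cot = ["First,", "we", "could", "exploit", "the", "parser", "..."]
print(token_level_rewards(cot, judge))
# → [0.05, 0.05, 0.05, -1.0, 0.05, 0.05, 0.05]
```

In an actual RL loop these per-token rewards would feed a policy-gradient update (e.g. PPO-style), rather than being printed; the dense signal is what lets training target the unsafe span inside the trajectory instead of only the final answer.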
Abstract
Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
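The token-level entropy analysis mentioned above measures how "exploratory" the policy is at each step of the CoT: high entropy over the next-token distribution marks a decision point where the model is still exploring branches, while low entropy indicates the branch has been suppressed. A minimal sketch of the quantity being measured (the specific aggregation used in the paper is not given here):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.
    High entropy = the policy is still exploring alternatives at this step;
    near-zero entropy = the step is effectively determined."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Uniform over 4 candidate tokens: maximal entropy log(4) ≈ 1.386 nats.
print(token_entropy([0.25, 0.25, 0.25, 0.25]))
# Sharply peaked distribution: entropy close to zero.
print(token_entropy([0.97, 0.01, 0.01, 0.01]))
```

Tracking this value across the CoT trajectory is one way to check the claimed effect: entropy should drop at positions that could branch into unsafe reasoning, while staying comparable elsewhere so reflective depth is preserved.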