AI Summary
Large reasoning models (LRMs) employ explicit chain-of-thought (CoT) prompting to enhance mathematical and logical reasoning, yet this introduces latent safety risks: unsafe behaviors often manifest within intermediate reasoning steps, while final answers may appear benign. Existing supervised fine-tuning (SFT) approaches leveraging safety-annotated long-CoT datasets suffer from instability, degraded reasoning performance, and poor generalization.
Method: We propose the first reinforcement learning framework for CoT safety alignment, featuring token-level reward modeling and multi-model collaborative training to directly suppress unsafe token generation during reasoning while preserving deep reflective capabilities.
Contribution/Results: Our method achieves significant safety improvements (+12.7% average safety rate) across multiple model families and benchmarks, with zero degradation in reasoning accuracy. It demonstrates strong cross-model generalization and consistent safety enforcement throughout the CoT trajectory.
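The token-level reward idea in the method summary can be illustrated with a minimal sketch. The helper names, the threshold, and the toy judge below are all hypothetical (the paper's actual reward design is not specified in this summary); the point is only that rewards are assigned per reasoning token rather than per final answer, so unsafe intermediate tokens can be penalized directly while safe reflective tokens keep a small shaping bonus.

```python
def token_level_rewards(tokens, unsafe_score, tau=0.5, penalty=-1.0, bonus=0.05):
    """Assign a reward to every CoT token: penalize tokens whose unsafety
    score (here, from an external judge function) exceeds a threshold tau,
    and give safe tokens a small positive bonus so reasoning depth is not
    suppressed along with the unsafe content."""
    rewards = []
    for tok in tokens:
        s = unsafe_score(tok)
        rewards.append(penalty * s if s > tau else bonus)
    return rewards

# Toy stand-in for a judge model: flags tokens from a tiny unsafe vocabulary.
UNSAFE = {"bypass", "exploit"}
judge = lambda tok: 1.0 if tok.lower() in UNSAFE else 0.0

cot = ["First,", "we", "could", "exploit", "the", "parser", "..."]
print(token_level_rewards(cot, judge))
# → [0.05, 0.05, 0.05, -1.0, 0.05, 0.05, 0.05]
```

In an actual RL loop these per-token rewards would feed a policy-gradient update (e.g. PPO-style), rather than being printed; the dense signal is what lets training target the unsafe span inside the trajectory instead of only the final answer.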
Abstract
Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
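The token-level entropy analysis mentioned above measures how "exploratory" the policy is at each step of the CoT: high entropy over the next-token distribution marks a decision point where the model is still exploring branches, while low entropy indicates the branch has been suppressed. A minimal sketch of the quantity being measured (the specific aggregation used in the paper is not given here):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.
    High entropy = the policy is still exploring alternatives at this step;
    near-zero entropy = the step is effectively determined."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Uniform over 4 candidate tokens: maximal entropy log(4) ≈ 1.386 nats.
print(token_entropy([0.25, 0.25, 0.25, 0.25]))
# Sharply peaked distribution: entropy close to zero.
print(token_entropy([0.97, 0.01, 0.01, 0.01]))
```

Tracking this value across the CoT trajectory is one way to check the claimed effect: entropy should drop at positions that could branch into unsafe reasoning, while staying comparable elsewhere so reflective depth is preserved.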