R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) excel at complex reasoning tasks but remain vulnerable to harmful instructions, posing critical safety risks. This paper identifies that LRMs intrinsically possess safety knowledge—yet it is frequently inactive during inference. To address this, we propose R1-Act, a post-training alignment method that explicitly activates safety-aware response mechanisms via structured reasoning prompts. R1-Act requires only ~1,000 training samples and 90 minutes of single-GPU training, without modifying model architecture or relying on external supervision. It is thus lightweight, scalable, and deployment-friendly. Evaluated across multiple state-of-the-art LRMs, R1-Act achieves substantial safety improvements—averaging a +28.6% increase in harmful response rejection rate—while preserving core reasoning capabilities (performance degradation <0.8%). Moreover, it demonstrates superior robustness and generalization compared to existing alignment approaches.

📝 Abstract
Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.
Problem

Research questions and friction points this paper is trying to address.

LRMs often fulfill harmful instructions despite already possessing sufficient safety knowledge
This safety knowledge frequently remains inactive during the model's reasoning process
Alignment must be efficient and must not compromise reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activates safety knowledge via structured reasoning
Requires minimal training data and resources
Preserves reasoning performance while enhancing safety
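The core idea above — prepending an explicit, structured safety-reasoning step so the model activates its existing safety knowledge before answering — can be sketched as a small data-construction routine. This is an illustrative assumption, not the paper's actual template: the field names, the `<think>` markers, and the `build_training_example` helper are all hypothetical stand-ins for whatever format R1-Act really uses.

```python
# Hypothetical sketch of building a structured safety-reasoning training
# example in the spirit of R1-Act. The template text and field names are
# illustrative assumptions, not the paper's exact format.

SAFETY_REASONING_TEMPLATE = (
    "<think>\n"
    "Step 1: Assess whether the request could cause harm.\n"
    "Assessment: {assessment}\n"
    "Step 2: Decide whether to comply or refuse.\n"
    "Decision: {decision}\n"
    "</think>\n"
    "{response}"
)

def build_training_example(instruction: str, harmful: bool) -> dict:
    """Wrap an instruction with a reasoning trace that explicitly
    walks through a safety assessment before the final response."""
    if harmful:
        target = SAFETY_REASONING_TEMPLATE.format(
            assessment="The request seeks content that could cause harm.",
            decision="Refuse and briefly explain why.",
            response="I can't help with that, because it could cause harm.",
        )
    else:
        target = SAFETY_REASONING_TEMPLATE.format(
            assessment="The request is benign.",
            decision="Comply and answer helpfully.",
            response="(normal helpful answer goes here)",
        )
    # A ~1,000-example dataset of such pairs would then be used for
    # lightweight supervised post-training.
    return {"prompt": instruction, "completion": target}

example = build_training_example("How do I disable a smoke detector?", harmful=True)
```

The point of the sketch is that safety behavior is elicited by the reasoning structure itself rather than by new knowledge: the same template routes benign requests to normal compliance, which is consistent with the paper's report of preserved reasoning performance.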