R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) excel at complex reasoning tasks but remain vulnerable to harmful instructions, posing critical safety risks. This paper identifies that LRMs intrinsically possess safety knowledge—yet it is frequently inactive during inference. To address this, we propose R1-Act, a post-training alignment method that explicitly activates safety-aware response mechanisms via structured reasoning prompts. R1-Act requires only ~1,000 training samples and 90 minutes of single-GPU training, without modifying model architecture or relying on external supervision. It is thus lightweight, scalable, and deployment-friendly. Evaluated across multiple state-of-the-art LRMs, R1-Act achieves substantial safety improvements—averaging a +28.6% increase in harmful response rejection rate—while preserving core reasoning capabilities (performance degradation <0.8%). Moreover, it demonstrates superior robustness and generalization compared to existing alignment approaches.

📝 Abstract
Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.
Problem

Research questions and friction points this paper is trying to address.

LRMs often fulfill harmful instructions despite already possessing sufficient safety knowledge
This safety knowledge frequently remains inactive during the model's reasoning process
Alignment must be efficient and must not compromise reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activates safety knowledge via structured reasoning
Requires minimal training data and resources
Preserves reasoning performance while enhancing safety
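The core idea above — prepending an explicit, structured safety-reasoning step so the model activates its existing safety knowledge before answering — can be sketched as a small data-construction routine. This is an illustrative assumption, not the paper's actual template: the field names, the `<think>` markers, and the `build_training_example` helper are all hypothetical stand-ins for whatever format R1-Act really uses.

```python
# Hypothetical sketch of building a structured safety-reasoning training
# example in the spirit of R1-Act. The template text and field names are
# illustrative assumptions, not the paper's exact format.

SAFETY_REASONING_TEMPLATE = (
    "<think>\n"
    "Step 1: Assess whether the request could cause harm.\n"
    "Assessment: {assessment}\n"
    "Step 2: Decide whether to comply or refuse.\n"
    "Decision: {decision}\n"
    "</think>\n"
    "{response}"
)

def build_training_example(instruction: str, harmful: bool) -> dict:
    """Wrap an instruction with a reasoning trace that explicitly
    walks through a safety assessment before the final response."""
    if harmful:
        target = SAFETY_REASONING_TEMPLATE.format(
            assessment="The request seeks content that could cause harm.",
            decision="Refuse and briefly explain why.",
            response="I can't help with that, because it could cause harm.",
        )
    else:
        target = SAFETY_REASONING_TEMPLATE.format(
            assessment="The request is benign.",
            decision="Comply and answer helpfully.",
            response="(normal helpful answer goes here)",
        )
    # A ~1,000-example dataset of such pairs would then be used for
    # lightweight supervised post-training.
    return {"prompt": instruction, "completion": target}

example = build_training_example("How do I disable a smoke detector?", harmful=True)
```

The point of the sketch is that safety behavior is elicited by the reasoning structure itself rather than by new knowledge: the same template routes benign requests to normal compliance, which is consistent with the paper's report of preserved reasoning performance.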