SaRO: Enhancing LLM Safety through Reasoning-based Alignment

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two core challenges in large language model (LLM) safety alignment, namely insufficient generalization against novel jailbreak attacks and over-alignment that leads to erroneous rejection of benign instructions, this paper proposes a two-stage reasoning-enhanced alignment framework. Methodologically, it couples long-chain reasoning internalization with safety-aware reflection optimization across two stages, *Reasoning-style Warmup* and *Safety-oriented Reasoning Process Optimization*, thereby mitigating safety blind spots arising from semantic ambiguity. The approach integrates supervised fine-tuning (SFT), direct preference optimization (DPO), and safety-policy-guided reasoning modeling, supported by embedding-space semantic analysis. Experimental results show a 12.7% improvement in harmful-input rejection rate and a 23.4% increase in benign-instruction acceptance rate on mainstream jailbreak benchmarks, significantly outperforming RLHF and DPO baselines.

📝 Abstract
Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
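For reference, the SRPO stage builds on direct preference optimization. The standard DPO objective (the common formulation, not an equation reproduced from this paper) for a policy $\pi_\theta$ trained against a frozen reference model $\pi_{\mathrm{ref}}$ on preferred/dispreferred response pairs $(y_w, y_l)$ is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

In SaRO, $y_w$ would presumably be a safety-reflective reasoning trace and $y_l$ an unsafe or non-reflective one, with $\pi_{\mathrm{ref}}$ taken as the model after Reasoning-style Warmup; the exact pair construction is not detailed in this summary.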
Problem

Research questions and friction points this paper is trying to address.

Address under-generalization in LLM safety against novel attacks
Reduce over-alignment causing excessive refusal of benign inputs
Enhance semantic understanding for effective safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates safety-policy-driven reasoning into alignment
Uses Reasoning-style Warmup for internalizing long-chain reasoning
Applies Safety-oriented Reasoning Process Optimization via DPO (a minimal sketch follows this list)
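To make the SRPO step concrete, here is a minimal PyTorch sketch of the DPO objective shown above, applied to preference pairs of reasoning traces. The function names, the pair construction (safety-reflective trace as "chosen", unsafe trace as "rejected"), and the omission of prompt/padding masking are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Summed log-probability of `labels` under `logits`.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len).
    Prompt and padding masking are omitted for brevity.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, -1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(dim=-1)

def safety_dpo_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO loss over (chosen, rejected) reasoning-trace pairs.

    Each argument is the summed log-probability of a full trace
    (reasoning plus final answer), shape (batch,). 'Chosen' is assumed
    to be the safety-reflective trace and 'rejected' the unsafe one;
    the reference model is the frozen checkpoint from the warmup stage.
    """
    # Implicit reward margin between the two traces.
    margin = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps)
    # Standard DPO objective: -log sigmoid(beta * margin).
    return -F.logsigmoid(beta * margin).mean()

if __name__ == "__main__":
    # Toy usage with random numbers standing in for model outputs.
    batch = 4
    loss = safety_dpo_loss(torch.randn(batch), torch.randn(batch),
                           torch.randn(batch), torch.randn(batch))
    print(f"toy DPO loss: {loss.item():.4f}")
```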
Authors
Yutao Mou, Peking University (AI Safety, LLM Alignment)
Yuxiao Luo, National Engineering Research Center for Software Engineering, Peking University, China
Shikun Zhang, Peking University
Wei Ye, National Engineering Research Center for Software Engineering, Peking University, China