SaRO: Enhancing LLM Safety through Reasoning-based Alignment

📅 2025-04-13

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

To address two core challenges in large language model (LLM) safety alignment—insufficient generalization against novel jailbreaking attacks and over-alignment leading to erroneous rejection of benign instructions—this paper proposes a two-stage reasoning-enhanced alignment framework. Methodologically, it couples long-chain reasoning internalization with safety-aware reflection optimization, establishing two paradigms: *Reasoning-style Warmup* and *Safety-oriented Reasoning Process Optimization*, thereby mitigating safety blind spots arising from semantic ambiguity. The approach integrates supervised fine-tuning (SFT), direct preference optimization (DPO), and safety-policy-guided reasoning modeling, augmented by embedding-space semantic analysis. Experimental results demonstrate a 12.7% improvement in harmful-input rejection rate and a 23.4% increase in benign-instruction acceptance rate on mainstream jailbreaking benchmarks, significantly outperforming RLHF and DPO baselines.

Technology Category

Application Category

📝 Abstract

Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.

Problem

Research questions and friction points this paper is trying to address.

Address under-generalization in LLM safety against novel attacks

Reduce over-alignment causing excessive refusal of benign inputs

Enhance semantic understanding for effective safety alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates safety-policy-driven reasoning into alignment

Uses Reasoning-style Warmup for internalizing long-chain reasoning

Applies Safety-oriented Reasoning Process Optimization via DPO

🔎 Similar Papers

No similar papers found.