🤖 AI Summary
To address two core challenges in large language model (LLM) safety alignment—insufficient generalization against novel jailbreaking attacks and over-alignment leading to erroneous rejection of benign instructions—this paper proposes a two-stage reasoning-enhanced alignment framework. Methodologically, it couples long-chain reasoning internalization with safety-aware reflection optimization, establishing two paradigms: *Reasoning-style Warmup* and *Safety-oriented Reasoning Process Optimization*, thereby mitigating safety blind spots arising from semantic ambiguity. The approach integrates supervised fine-tuning (SFT), direct preference optimization (DPO), and safety-policy-guided reasoning modeling, augmented by embedding-space semantic analysis. Experimental results demonstrate a 12.7% improvement in harmful-input rejection rate and a 23.4% increase in benign-instruction acceptance rate on mainstream jailbreaking benchmarks, significantly outperforming RLHF and DPO baselines.
📝 Abstract
Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.