🤖 AI Summary
To address insufficient scenario coverage and vulnerability to adversarial attacks in large language model (LLM) safety alignment, this paper proposes Ex-Ante Reasoning Preference Optimization (ERPO). ERPO introduces a novel ex-ante reasoning paradigm that explicitly embeds predefined safety rules into chain-of-thought (CoT) reasoning paths, enabling the model to assess safety before generating a response. It further designs a length-controlled iterative preference optimization strategy, combining rule-guided supervised fine-tuning (SFT) with direct preference optimization (DPO) to jointly improve safety and inference efficiency. Experiments across multiple open-source LLMs demonstrate that ERPO improves adversarial robustness by 18.7% on average, retains over 92% of the original response efficiency, and incurs less than 5% additional inference latency, significantly outperforming existing alignment methods.
📝 Abstract
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
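To make the optimization stages concrete, below is a minimal sketch of a standard DPO objective for a single preference pair, plus a hypothetical length-controlled variant of the kind stage three suggests. The function names, the penalty term, and the coefficient `alpha` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen/rejected
    responses under the policy and the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written stably via softplus(-margin)
    return math.log1p(math.exp(-margin))

def length_controlled_dpo_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               len_chosen, len_rejected,
                               beta=0.1, alpha=0.01):
    """Hypothetical length-controlled variant (assumption): subtract a
    penalty proportional to how much longer the chosen reasoning is,
    discouraging needlessly verbose chains of thought."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    margin -= alpha * (len_chosen - len_rejected)
    return math.log1p(math.exp(-margin))
```

Under this sketch, a preferred response that is much longer than its rejected counterpart yields a higher loss than the plain DPO term, nudging the iterative optimization toward shorter ex-ante reasoning traces.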