🤖 AI Summary
To address insufficient scenario coverage and vulnerability to adversarial attacks in large language model (LLM) safety alignment, this paper proposes Ex-Ante Reasoning Preference Optimization (ERPO). ERPO introduces a novel ex-ante reasoning paradigm that explicitly embeds predefined safety rules into chain-of-thought (CoT) reasoning paths, enabling the model to assess safety before generating a response. It further designs a length-controlled iterative preference optimization strategy, combining rule-guided supervised fine-tuning (SFT) with direct preference optimization (DPO) to jointly improve safety and inference efficiency. Experiments across multiple open-source LLMs demonstrate that ERPO improves adversarial robustness by 18.7% on average, retains over 92% of the original response efficiency, and incurs less than 5% additional inference latency, significantly outperforming existing alignment methods.
📝 Abstract
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
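To make the optimization stages concrete, below is a minimal sketch of a standard DPO objective for a single preference pair, plus a hypothetical length-controlled variant of the kind stage three suggests. The function names, the penalty term, and the coefficient `alpha` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen/rejected
    responses under the policy and the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written stably via softplus(-margin)
    return math.log1p(math.exp(-margin))

def length_controlled_dpo_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               len_chosen, len_rejected,
                               beta=0.1, alpha=0.01):
    """Hypothetical length-controlled variant (assumption): subtract a
    penalty proportional to how much longer the chosen reasoning is,
    discouraging needlessly verbose chains of thought."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    margin -= alpha * (len_chosen - len_rejected)
    return math.log1p(math.exp(-margin))
```

Under this sketch, a preferred response that is much longer than its rejected counterpart yields a higher loss than the plain DPO term, nudging the iterative optimization toward shorter ex-ante reasoning traces.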