Robust Preference Optimization via Dynamic Target Margins

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient robustness of large language models (LLMs) to noise in preference data during alignment, this paper proposes γ-PO, a dynamic target margin preference optimization method. Its core innovation is a pairwise dynamic reward margin mechanism that adaptively calibrates instance-specific margins, amplifying the optimization weight of high-confidence preference pairs while suppressing noise from ambiguous ones. γ-PO is plug-and-play and compatible with reward-margin-based DPO variants. Evaluated on benchmarks including AlpacaEval 2 and Arena-Hard, it achieves an average improvement of 4.4% over existing methods, requires only minimal code modifications, and incurs negligible training overhead. The authors position γ-PO as the first approach to enable noise-aware, fine-grained preference optimization, thereby enhancing the safety and reliability of LLM alignment.

📝 Abstract
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on data quality, which is frequently compromised by noise. In this work, we propose γ-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, γ-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, γ-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval 2 and Arena-Hard, γ-PO achieves an average 4.4% improvement over other baselines, setting a new state of the art. Additionally, γ-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
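The abstract's core idea, a target margin set per preference pair rather than globally, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact formulation: the softmax-based calibration of the per-pair margin γ_i, the `tau` temperature, and all function names are assumptions made for this sketch. It reuses DPO's implicit reward margin m_i = β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))] and gives larger target margins to pairs whose reward margin is already large.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gamma_po_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                  beta=0.1, tau=1.0):
    """Hedged sketch of a dynamic-target-margin DPO loss.

    `logp_*` are policy log-probs and `ref_*` are reference-model
    log-probs for the chosen/rejected responses of each pair.
    The per-pair margin calibration below (softmax over the batch's
    implicit reward margins) is a plausible stand-in, not the
    paper's actual rule.
    """
    # DPO implicit reward margin per pair
    m = [beta * ((lc - rc) - (lr - rr))
         for lc, lr, rc, rr in zip(logp_chosen, logp_rejected,
                                   ref_chosen, ref_rejected)]
    n = len(m)
    # Hypothetical calibration: softmax weights over the batch, scaled
    # so high-confidence pairs receive larger target margins gamma_i.
    exp_m = [math.exp(mi / tau) for mi in m]
    z = sum(exp_m)
    mean_abs = sum(abs(mi) for mi in m) / n
    gamma = [tau * n * (e / z) * mean_abs for e in exp_m]
    # Margin-DPO objective: mean of -log sigma(m_i - gamma_i).
    # Subtracting a larger gamma_i increases that pair's loss and
    # gradient, amplifying its weight in the update.
    return sum(-math.log(sigmoid(mi - gi)) for mi, gi in zip(m, gamma)) / n

loss = gamma_po_loss([-1.0, -2.0], [-3.0, -2.5],
                     [-1.2, -2.1], [-2.8, -2.4])
```

Setting every γ_i to the same constant recovers margin-based DPO variants such as SimPO's fixed target margin, which is what makes a per-pair γ a drop-in change to those losses.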
Problem

Research questions and friction points this paper is trying to address.

Improves LLM alignment via dynamic target margins
Reduces noise impact in preference optimization
Enhances DPO performance with minimal efficiency loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic target margin preference optimization algorithm
Instance-specific margin calibration for noise suppression
Plug-and-play method compatible with DPO variants