On Symmetric Losses for Robust Policy Optimization with Noisy Preferences

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Label noise—arising from annotation errors and subjective biases—in human preference data degrades reward modeling accuracy and impairs policy optimization in preference-based reinforcement learning. Method: We propose Symmetric Preference Optimization (SymPO), which reformulates preference learning as a binary classification task and introduces, for the first time, a symmetric loss function. Contribution/Results: We theoretically prove that SymPO preserves strict reward ranking consistency and policy improvability under arbitrary label noise—establishing the first general framework for RLHF and DPO with provable noise robustness. Empirical evaluation on both synthetic and real-world preference datasets demonstrates that SymPO significantly outperforms standard DPO and RLHF baselines, achieving up to 40% improvement in noise robustness while maintaining high policy alignment accuracy.

Technology Category

Application Category

📝 Abstract
Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases. We propose a principled framework for robust policy optimization under noisy preferences, viewing reward modeling as a classification problem. This allows us to leverage symmetric losses, known for their robustness to label noise in classification, leading to our Symmetric Preference Optimization (SymPO) method. We prove that symmetric losses enable successful policy optimization even under noisy labels, as the resulting reward remains rank-preserving -- a property sufficient for policy improvement. Experiments on synthetic and real-world tasks demonstrate the effectiveness of SymPO.
Problem

Research questions and friction points this paper is trying to address.

Robust policy optimization with noisy human preferences
Addressing label noise in reward modeling for RLHF
Ensuring rank-preserving rewards under noisy annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symmetric losses for robust reward modeling
Classification approach to handle noisy preferences
Rank-preserving rewards under label noise