🤖 AI Summary
Label noise in human preference data, arising from annotation errors and subjective biases, degrades reward modeling accuracy and impairs policy optimization in preference-based reinforcement learning. Method: We propose Symmetric Preference Optimization (SymPO), which reformulates preference learning as a binary classification task and, for the first time in this setting, adopts symmetric loss functions. Contribution/Results: We theoretically prove that SymPO preserves strict reward ranking consistency and policy improvability under arbitrary label noise, establishing the first general framework for RLHF and DPO with provable noise robustness. Empirical evaluation on synthetic and real-world preference datasets shows that SymPO significantly outperforms standard DPO and RLHF baselines, achieving up to a 40% improvement in noise robustness while maintaining high policy alignment accuracy.
📝 Abstract
Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component of reinforcement learning from human feedback (RLHF), and on offline preference optimization methods such as direct preference optimization (DPO). Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases. We propose a principled framework for robust policy optimization under noisy preferences, viewing reward modeling as a classification problem. This allows us to leverage symmetric losses, known for their robustness to label noise in classification, leading to our Symmetric Preference Optimization (SymPO) method. We prove that symmetric losses enable successful policy optimization even under noisy labels, as the resulting reward remains rank-preserving -- a property sufficient for policy improvement. Experiments on synthetic and real-world tasks demonstrate the effectiveness of SymPO.
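The key property behind the abstract's noise-robustness claim is symmetry of the classification loss: a loss ℓ is symmetric when ℓ(z) + ℓ(−z) is a constant for every margin z. The sketch below (an illustration, not the paper's implementation; the specific losses shown are common examples and may differ from those SymPO uses) contrasts the non-symmetric logistic loss underlying standard DPO with two symmetric alternatives:

```python
import numpy as np

def logistic_loss(z):
    # Non-symmetric loss used in standard DPO: -log(sigmoid(z)).
    # log1p(exp(-z)) is a numerically stable form.
    return np.log1p(np.exp(-z))

def sigmoid_loss(z):
    # Symmetric loss: sigmoid_loss(z) + sigmoid_loss(-z) = 1 for all z.
    return 1.0 / (1.0 + np.exp(z))

def ramp_loss(z):
    # Another symmetric loss: the clipped terms for z and -z always sum to 1.
    return np.clip(0.5 * (1.0 - z), 0.0, 1.0)

z = np.linspace(-5.0, 5.0, 101)

# Symmetric losses satisfy ell(z) + ell(-z) = constant (here, 1) ...
assert np.allclose(sigmoid_loss(z) + sigmoid_loss(-z), 1.0)
assert np.allclose(ramp_loss(z) + ramp_loss(-z), 1.0)
# ... while the logistic loss does not.
assert not np.allclose(logistic_loss(z) + logistic_loss(-z), 1.0)
```

Intuitively, under label flips the noisy risk of a symmetric loss differs from the clean risk only by a constant scaling and offset, so the minimizer's reward ranking is unchanged, which is the rank-preservation property the abstract invokes.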