🤖 AI Summary
This work addresses reward modeling for large language models when the preference data carry complex noise, a setting in which conventional objectives overfit the corrupted labels and existing denoising approaches rely on unrealistic assumptions of homogeneous noise. The authors propose SelectiveRM, a framework that combines optimal transport with a mass relaxation mechanism. A Joint Consistency Discrepancy aligns the distribution of model predictions with the preference data, while a partial transport strategy lets the model automatically exclude noisy samples that contradict semantic consistency. Theoretically, the method optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments show that SelectiveRM significantly outperforms state-of-the-art baselines across multiple benchmarks, supporting its robustness under complex noise.
📝 Abstract
Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preferences. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To address these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with the preference data. Strict mass conservation, however, compels the model to fit outliers, so we further incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples whose noisy preferences contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
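To make the contrast between strict mass conservation and partial transport concrete, the following is a minimal toy sketch, not the authors' formulation. It assumes uniform sample weights, in which case transporting a mass budget of k/n reduces to finding the cheapest matching of k pairs, so brute force is enough for four samples. The function name `partial_matching`, the toy data, and the joint cost (squared feature distance plus squared gap between prediction and label) are all hypothetical choices made for illustration.

```python
from itertools import combinations, permutations


def partial_matching(cost, k):
    """Cheapest matching of k source/target pairs (brute force).

    With uniform weights 1/n this is an exact solution of partial
    transport with mass budget k/n; k == n recovers full transport.
    """
    n_src, n_tgt = len(cost), len(cost[0])
    best_cost, best_pairs = float("inf"), []
    for rows in combinations(range(n_src), k):      # sources that keep their mass
        for cols in permutations(range(n_tgt), k):  # targets they are sent to
            total = sum(cost[i][j] for i, j in zip(rows, cols))
            if total < best_cost:
                best_cost, best_pairs = total, list(zip(rows, cols))
    return best_cost, best_pairs


# Hypothetical toy data: one scalar feature, one model prediction,
# and one observed preference label per sample.
features = [0.0, 0.1, 0.9, 1.0]
preds = [0.9, 0.9, 0.1, 0.1]
labels = [1.0, 1.0, 0.0, 1.0]  # last label is assumed to be flipped

# Assumed joint cost between (feature, prediction) and (feature, label) points.
n = len(features)
cost = [
    [(features[i] - features[j]) ** 2 + (preds[i] - labels[j]) ** 2
     for j in range(n)]
    for i in range(n)
]

full_cost, full_pairs = partial_matching(cost, n)      # strict mass conservation
part_cost, part_pairs = partial_matching(cost, n - 1)  # relax one unit of mass

print("full transport:   ", full_pairs, round(full_cost, 4))
print("partial transport:", part_pairs, round(part_cost, 4))
```

In this toy setting, full transport has to pair the last sample with its inconsistent label and pays most of its cost there, whereas relaxing one unit of mass leaves that sample unmatched on both sides. A practical implementation would use a proper partial optimal transport solver rather than enumeration, which scales factorially.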