🤖 AI Summary
Noisy human preference feedback induces generalization errors in reward models (RMs), undermining the alignment of large language models (LLMs). To address this, we propose Collaborative Reward Modeling (CRM), a framework in which two RMs are trained in parallel and evaluate each other's data selections online, dynamically identifying and filtering noisy preference data. CRM further incorporates a curriculum learning strategy that orders preference data from easy to hard, keeping the two models' training synchronized. CRM requires no modifications to downstream policy training and is compatible with implicit-reward alignment methods. Under 40% label noise, CRM improves RewardBench accuracy by up to 9.94 points over single-RM baselines and static filtering approaches, indicating improved RM robustness and generalization.
📝 Abstract
Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and find that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized against RMs trained on the full noisy dataset perform worse than LLMs optimized against RMs trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40% label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
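The peer-review and curriculum ideas described above can be illustrated with a minimal sketch. The abstract does not specify the selection criterion, the loss, or the schedule, so everything here is an assumption made for illustration: a Bradley-Terry pairwise loss, a small-loss selection rule (in the spirit of co-teaching), and the hypothetical helper names `bt_loss`, `peer_select`, `curriculum_order`, and `crm_step`. It is not the authors' implementation.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair (assumed loss)."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed in a stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def peer_select(losses, keep_ratio):
    """Return indices of the keep_ratio fraction of pairs with the smallest loss.

    Assumption: low-loss pairs are treated as likely clean, high-loss as likely noisy.
    """
    k = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:k])

def curriculum_order(difficulty):
    """Return pair indices ordered from easy (low difficulty) to hard."""
    return sorted(range(len(difficulty)), key=lambda i: difficulty[i])

def crm_step(scores_a, scores_b, keep_ratio):
    """One peer-review step: each RM selects the pairs its peer will train on.

    scores_a / scores_b: lists of (r_chosen, r_rejected) rewards that RM A and
    RM B assign to the same batch of preference pairs.
    """
    loss_a = [bt_loss(c, r) for c, r in scores_a]
    loss_b = [bt_loss(c, r) for c, r in scores_b]
    train_b_on = peer_select(loss_a, keep_ratio)  # A's selection is used to update B
    train_a_on = peer_select(loss_b, keep_ratio)  # B's selection is used to update A
    return train_a_on, train_b_on

# Toy batch of four pairs; pair 1 looks mislabeled to both models.
scores_a = [(2.0, 0.0), (0.0, 2.0), (1.0, 0.0), (0.5, 0.4)]
scores_b = [(1.0, 0.0), (0.0, 3.0), (0.2, 0.0), (3.0, 0.0)]
train_a_on, train_b_on = crm_step(scores_a, scores_b, keep_ratio=0.5)
```

In this toy batch both models drop pair 1, while their remaining selections differ, which is the kind of disagreement a peer-review scheme can exploit. In a real setup the scores would come from two neural RMs, and `keep_ratio` and the difficulty measure would be hyperparameters to tune.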