Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

📅 2025-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Noisy human preference feedback causes reward models (RMs) to generalize poorly, which undermines the alignment of large language models (LLMs). To address this, the paper proposes Collaborative Reward Modeling (CRM), a framework in which two RMs are trained in parallel and evaluate each other's data selections online to identify and filter noisy preference data. CRM also uses a curriculum learning strategy that orders preference data from easy to hard across training stages. CRM requires no modifications to downstream policy training and is compatible with implicit-reward alignment methods. Under 40% label noise, CRM improves RewardBench accuracy by up to 9.94 points over single-RM baselines and static filtering approaches, indicating stronger RM robustness and generalization.

📝 Abstract

Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
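The peer-review step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how each reward model selects data, so the sketch assumes a small-loss criterion (each model keeps the pairs it fits best, and its peer trains on that selection) and a Bradley-Terry pairwise loss. The names `bt_loss`, `select_clean`, `peer_review_step`, and `keep_ratio` are hypothetical and are not taken from the paper.

```python
import math


def bt_loss(margin):
    """Bradley-Terry pairwise loss -log(sigmoid(margin)), computed stably.

    `margin` is r(chosen) - r(rejected) under one reward model.
    Assumption: the paper's exact loss is not given in the abstract.
    """
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def select_clean(losses, keep_ratio):
    """Return sorted indices of the `keep_ratio` fraction with the smallest loss."""
    k = max(1, int(round(keep_ratio * len(losses))))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:k])


def peer_review_step(margins_a, margins_b, keep_ratio):
    """One filtering round: each model trains on the pairs its peer selected.

    Assumption: small-loss selection stands in for the paper's criterion.
    """
    picked_by_a = select_clean([bt_loss(m) for m in margins_a], keep_ratio)
    picked_by_b = select_clean([bt_loss(m) for m in margins_b], keep_ratio)
    return {"train_a_on": picked_by_b, "train_b_on": picked_by_a}


# Reward margins of two models on the same batch of four preference pairs.
batch = peer_review_step(
    margins_a=[2.0, -1.5, 0.5, 3.0],
    margins_b=[1.0, 2.5, -2.0, 0.2],
    keep_ratio=0.5,
)
```

Exchanging selections rather than letting each model train on its own picks is the design choice that makes this a peer review: a pair that one model has overfit is less likely to be trusted by the other.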

Problem

Research questions and friction points this paper is trying to address.

Addressing reward misgeneralization in LLM alignment due to noisy human feedback

Improving robustness of reward models by filtering noisy preference data

Enhancing generalization and stability in reward modeling via collaborative training
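A noisy-preference setting such as the 40 percent label noise reported in the abstract is often simulated by swapping the chosen and rejected responses for a fixed fraction of pairs. The sketch below assumes that protocol; the source does not describe how noise was injected, and `inject_label_noise` is a hypothetical helper.

```python
import random


def inject_label_noise(pairs, noise_rate, seed=0):
    """Swap (chosen, rejected) for a `noise_rate` fraction of preference pairs.

    Assumption: uniform random flipping; the paper's protocol may differ.
    Returns the noisy pairs and the sorted indices that were flipped.
    """
    rng = random.Random(seed)
    n_flip = int(round(noise_rate * len(pairs)))
    flipped = set(rng.sample(range(len(pairs)), n_flip))
    noisy = [
        (rejected, chosen) if i in flipped else (chosen, rejected)
        for i, (chosen, rejected) in enumerate(pairs)
    ]
    return noisy, sorted(flipped)


# Ten toy preference pairs, 40 percent of which get their labels swapped.
clean_pairs = [(f"good{i}", f"bad{i}") for i in range(10)]
noisy_pairs, flipped_idx = inject_label_noise(clean_pairs, noise_rate=0.4)
```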

Innovation

Methods, ideas, or system contributions that make the work stand out.

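Collaborative Reward Modeling (CRM) trains two reward models in parallel to improve robustness to noisy preferences

Peer review: each reward model assesses the other's data selections to filter out potentially noisy pairs

Curriculum learning orders preference data from easy to hard to keep training synchronized and feedback stable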
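The easy-to-hard ordering in the items above can be sketched as a simple staging function. The source does not say how difficulty is measured, so the sketch assumes a per-pair difficulty score is already available (for example, a pairwise loss); `curriculum_stages` and `num_stages` are hypothetical names.

```python
def curriculum_stages(difficulty, num_stages):
    """Split example indices into `num_stages` stages, ordered easy to hard.

    Assumption: a lower difficulty score means an easier preference pair.
    Stages are as even as possible; later stages hold the harder pairs.
    """
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    n = len(order)
    return [
        order[s * n // num_stages:(s + 1) * n // num_stages]
        for s in range(num_stages)
    ]


# Five preference pairs with assumed difficulty scores, split into two stages.
stages = curriculum_stages([0.9, 0.1, 0.5, 0.3, 0.7], num_stages=2)
```

Training would then iterate over `stages` in order, so both reward models see the easiest pairs first.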
