Energy-Based Reward Models for Robust Language Model Alignment

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reward models (RMs) struggle to accurately capture complex human preferences, exhibiting poor generalization and susceptibility to annotation noise and reward hacking. To address these limitations, we propose the Energy-Based Reward Model (EBRM), the first RM framework that incorporates an energy function to explicitly model uncertainty in the reward distribution. EBRM is a fine-tuning-free, post-hoc enhancement framework that enables plug-and-play deployment across diverse large language models (LLMs). It integrates hybrid initialization, label-noise-aware contrastive learning, and a data-conflict detection mechanism. Extensive evaluation across multiple benchmarks shows that EBRM significantly improves robustness and generalization, achieving up to a 5.97% gain on safety-critical alignment tasks. In RLHF settings, EBRM also effectively mitigates reward hacking and substantially enhances alignment stability.

📝 Abstract
Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle to capture complex human preferences and to generalize to unseen data. To address these challenges, we introduce the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement on safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality and effectively delay reward hacking. These results establish our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.
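The post-hoc refinement described in the abstract can be sketched in a few lines: starting from the base RM's scalar reward, descend an energy landscape over (features, reward) pairs to a refined value. The quadratic energy and linear feature map below are toy stand-ins chosen for illustration, not the paper's actual parameterization.

```python
import numpy as np

def energy(features, r, w):
    """Toy quadratic energy E(x, r): low where r matches the reward
    'center' implied by the features. Real EBRM would use a learned
    energy network; `w` here is an illustrative linear map."""
    center = features @ w
    return (r - center) ** 2

def refine_reward(features, r_init, w, lr=0.1, steps=50):
    """Post-hoc refinement sketch: initialize from the base RM's
    reward (hybrid initialization) and take gradient steps toward
    r* = argmin_r E(x, r)."""
    r = r_init
    eps = 1e-4
    for _ in range(steps):
        # Finite-difference gradient of E w.r.t. r keeps this sketch
        # dependency-free; an actual implementation would use autograd.
        grad = (energy(features, r + eps, w) - energy(features, r - eps, w)) / (2 * eps)
        r -= lr * grad
    return r
```

Because refinement only adjusts the scalar reward at inference time, the base RM's weights are untouched, which is what makes the scheme retraining-free and plug-and-play.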
Problem

Research questions and friction points this paper is trying to address.

Enhance reward model robustness and generalization
Capture uncertainty in human preferences effectively
Mitigate noisy or misaligned annotations impact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy-Based Reward Model enhances robustness
Conflict-aware filtering improves generalization
Hybrid initialization avoids retraining costs
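The label-noise-aware contrastive idea listed above can be illustrated with a weighted hinge objective: push the chosen response's energy below the rejected one's, while down-weighting pairs whose base-RM margin is small and therefore more likely to be mislabeled. The loss form, the sigmoid weighting, and all parameter names here are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def contrastive_energy_loss(e_chosen, e_rejected, base_margin, margin=1.0, tau=0.5):
    """Hypothetical label-noise-aware contrastive loss.

    e_chosen / e_rejected: energies of the (prompt, chosen) and
    (prompt, rejected) pairs; lower energy should mean better.
    base_margin: the base RM's reward gap, used as a noise proxy.
    """
    # Confidence weight in (0, 1): pairs with a small base-RM margin
    # (possible annotation noise) contribute less to the loss.
    w = 1.0 / (1.0 + np.exp(-np.asarray(base_margin) / tau))
    # Standard hinge: zero once e_chosen + margin <= e_rejected.
    hinge = np.maximum(0.0, np.asarray(e_chosen) - np.asarray(e_rejected) + margin)
    return float(np.mean(w * hinge))
```

Down-weighting rather than hard-dropping ambiguous pairs is one plausible reading of "label-noise-aware"; the conflict-detection step could then hard-filter only the clearly contradictory annotations.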