HAF-RM: A Hybrid Alignment Framework for Reward Model Training

📅 2024-07-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address weak alignment and poor robustness in reward modeling for large language models (LLMs), this paper proposes HAF-RM, a framework that jointly optimizes a token-level constraint on policy probabilities together with sequence-level reward regression, establishing a dual-granularity hybrid supervision mechanism that decouples preference modeling from reward mapping. Methodologically, HAF-RM combines contrastive learning, policy-gradient constraints, implicit probability regularization, and fine-tuning on preference data; theoretically, it guarantees consistency between the two granularities of supervision. Empirically, HAF-RM achieves significant improvements across five benchmark datasets in reward accuracy, generalization, noise robustness, and downstream alignment performance. The implementation is publicly available.

📝 Abstract
The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework that directly optimizes the predicted rewards. In this paper, we propose a hybrid alignment framework, HaF-RM, for reward model training by introducing an additional constraint on token-level policy probabilities in addition to the reward score. It can simultaneously supervise the internal preference model at the token level and optimize the mapping layer of the reward model at the sequence level. Experiment results on five datasets sufficiently show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our HaF-RM framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at https://haf-rm.github.io.
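The abstract's core idea, combining a sequence-level reward objective with a token-level constraint on policy probabilities, can be illustrated with a minimal sketch. This is not the authors' implementation; the loss weighting `beta` and the pairwise (Bradley-Terry-style) form of both terms are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(chosen_reward, rejected_reward,
                chosen_logprob, rejected_logprob, beta=0.1):
    """Sketch of a dual-granularity reward-model loss.

    chosen_reward / rejected_reward: sequence-level scalar rewards
        from the reward head for preferred / dispreferred responses.
    chosen_logprob / rejected_logprob: summed token-level log-probs
        of each response under the model's policy head.
    beta: illustrative weight on the token-level term (assumption).
    """
    # Sequence-level supervision: standard pairwise ranking loss
    # on the predicted reward scores.
    ranking_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # Token-level supervision: constrain the policy to assign higher
    # likelihood to the preferred response than to the rejected one.
    token_loss = -F.logsigmoid(chosen_logprob - rejected_logprob).mean()
    return ranking_loss + beta * token_loss
```

Under this sketch, the reward head is trained at the sequence level while the shared backbone also receives a token-level preference signal, which is the decoupling the abstract describes.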
Problem

Research questions and friction points this paper is trying to address.

Reward Modeling
Language Model Enhancement
Data Set Construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Alignment Framework
Token-level Optimization
Mixed Supervision and Stepwise Training
👥 Authors
Shujun Liu
Fudan University
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model · multi-modal learning · reasoning
Yuhang Lai
City University of Hong Kong
Natural Language Processing · Large Language Models
Siyuan Wang
University of Southern California
Shengbin Yue
Fudan University
Zengfeng Huang
Fudan University
Algorithms · Graphs · Streaming · Learning · Theory
Xuanjing Huang
Fudan University
Zhongyu Wei
Fudan University