Self-Evolved Reward Learning for LLMs

📅 2024-11-01
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the heavy reliance of reward models (RMs) in RLHF on costly, potentially biased human annotations, this paper proposes a self-evolving RM framework. It leverages large language models (e.g., Mistral, Llama 3) to autonomously generate preference data and employs iterative reward modeling with self-feedback optimization, establishing a closed-loop, self-supervised training paradigm. The authors present this as the first approach to achieve end-to-end autonomous evolution of RMs, removing the need for ongoing human feedback. Evaluated on the HH-RLHF and UltraFeedback benchmarks, the method attains RM accuracy that surpasses mainstream baselines while using only a small fraction of the human annotations, and it also improves downstream LLM alignment performance. The core contribution is a new RM training paradigm of "self-generation, self-evaluation, self-optimization", providing a low-cost, robust pathway for alignment learning.
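The closed loop described above can be made concrete with a short sketch. The helper names (`train_rm`, the reward model's `score` method), the margin threshold, and the round count below are illustrative assumptions, not the paper's implementation; the sketch only captures the self-generation, self-evaluation, self-optimization cycle under those assumptions.

```python
# Minimal sketch of the self-evolving reward-model loop (hypothetical helpers).

def self_evolved_reward_learning(seed_prefs, unlabeled_pairs, train_rm,
                                 n_rounds=3, margin_threshold=1.0):
    """Grow the RM's training set from its own high-confidence judgments.

    seed_prefs:      list of (prompt, chosen, rejected) human-labeled pairs
    unlabeled_pairs: list of (prompt, response_a, response_b) without labels
    train_rm:        callable that fits a reward model on preference pairs
    """
    labeled = list(seed_prefs)
    rm = train_rm(labeled)                      # initial RM from the human seed set

    for _ in range(n_rounds):
        pseudo_labeled = []
        for prompt, resp_a, resp_b in unlabeled_pairs:
            # Self-evaluation: the current RM scores both candidate responses.
            score_a = rm.score(prompt, resp_a)
            score_b = rm.score(prompt, resp_b)
            # Self-generation: keep only confidently separated pairs as new labels.
            if abs(score_a - score_b) >= margin_threshold:
                chosen, rejected = ((resp_a, resp_b) if score_a > score_b
                                    else (resp_b, resp_a))
                pseudo_labeled.append((prompt, chosen, rejected))
        if not pseudo_labeled:
            break
        # Self-optimization: retrain on the human seed data plus self-generated labels.
        labeled = list(seed_prefs) + pseudo_labeled
        rm = train_rm(labeled)

    return rm  # used downstream as the reward signal for RLHF (e.g., PPO)
```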

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These labels can be costly to obtain and may introduce biases that affect the language model's responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach in which the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs).
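For background, the reward model referenced in the abstract is typically trained with a pairwise (Bradley-Terry) preference objective over chosen/rejected response pairs. The PyTorch-style snippet below is a generic illustration of that standard objective, not code from the paper; the tensor values in the example call are made up.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: push r(chosen) above r(rejected).

    Both inputs are scalar rewards the RM assigns to the preferred and
    dispreferred responses for the same prompt (shape: [batch]).
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: a batch of three preference pairs with illustrative reward scores.
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3, 0.8]),
                        torch.tensor([0.4, 0.1, 0.9]))
```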
Problem

Research questions and friction points this paper is trying to address.

Challenges in training reliable reward models for RLHF
High costs and biases in human-provided reward labels
Enhancing reward model performance with limited human data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Evolved Reward Learning (SER) for LLMs
Generates additional training data iteratively
Enhances reward model with limited human data
Chenghua Huang
Fudan University
Large Language Model · Reinforcement Learning
Zhizhen Fan
School of Computer Science, Peking University
Lu Wang
Microsoft
Fangkai Yang
Microsoft
Pu Zhao
Microsoft
Zeqi Lin
Microsoft
Code Generation · Machine Reasoning
Qingwei Lin
Microsoft
Dongmei Zhang
Microsoft Research
Software Engineering · Machine Learning · Information Visualization
S. Rajmohan
Microsoft
Qi Zhang
Microsoft