Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

📅 2025-03-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses reward hacking and diminished response diversity that emerge when scaling training data in RLHF. Methodologically, it introduces a hybrid RTV+GenRM reward mechanism to improve feedback quality; proposes a Pre-PPO prompt-filtering framework to optimize prompt-dataset construction; and incorporates curriculum-based task scheduling, empirically validating the critical role of early training on mathematical and coding tasks for convergence. Experiments demonstrate that the approach substantially mitigates reward hacking, enhances response diversity, achieves significant RLHF performance gains on small- and medium-scale models, and enables rapid modeling of fine-grained task preferences. The core contributions are: (1) a hybrid reward-modeling framework integrating RTV and GenRM; (2) a prompt-level data-selection paradigm grounded in Pre-PPO filtering; and (3) the discovery of task-sequencing sensitivity, namely the empirical finding that prioritizing reasoning-intensive tasks early in training markedly improves optimization dynamics and final policy quality.
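To make the hybrid reward idea concrete, here is a minimal sketch of how a reasoning-task verifier and a generative reward model might be routed into a single reward signal. The function names (`rtv_check`, `hybrid_reward`, `genrm_score`) and the routing rule (use the verifier when a ground-truth reference exists, otherwise fall back to GenRM) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a hybrid RTV + GenRM reward router (all names are illustrative).

from typing import Callable, Optional


def rtv_check(response: str, reference: str) -> float:
    """Reasoning-task verifier: rule-based check of the final answer against a
    ground-truth reference (e.g., exact match for math, unit tests for code)."""
    return 1.0 if response.strip().endswith(reference.strip()) else 0.0


def hybrid_reward(prompt: str,
                  response: str,
                  reference: Optional[str],
                  is_verifiable: bool,
                  genrm_score: Callable[[str, str, Optional[str]], float]) -> float:
    """Route verifiable reasoning tasks (math/code) to the rule-based verifier;
    fall back to a generative reward model for everything else."""
    if is_verifiable and reference is not None:
        return rtv_check(response, reference)
    # The GenRM may itself be conditioned on a ground-truth answer or an
    # SFT Best-of-N response when one is available.
    return genrm_score(prompt, response, reference)
```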

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
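The abstract does not spell out the Pre-PPO selection criterion. The sketch below assumes one plausible reading: score each candidate prompt by the reward of the current policy's response and keep the lowest-scoring (hardest, most learnable) fraction. The function names, the callables, and the selection rule are assumptions made purely for illustration.

```python
# Illustrative sketch of a Pre-PPO-style prompt filter (criterion and names assumed).

from typing import Callable, List, Tuple


def prepo_filter(prompts: List[str],
                 sample_response: Callable[[str], str],
                 reward_fn: Callable[[str, str], float],
                 keep_fraction: float = 0.3) -> List[str]:
    """Score each prompt by the reward of one policy response and keep the
    lowest-scoring fraction, on the assumption that prompts the policy already
    handles well contribute little learning signal during PPO."""
    scored: List[Tuple[float, str]] = []
    for prompt in prompts:
        response = sample_response(prompt)          # one draw from the current policy
        scored.append((reward_fn(prompt, response), prompt))
    scored.sort(key=lambda pair: pair[0])           # hardest (lowest-reward) prompts first
    k = max(1, int(len(scored) * keep_fraction))
    return [prompt for _, prompt in scored[:k]]
```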
Problem

Research questions and friction points this paper is trying to address.

Addresses data-driven bottlenecks in RLHF performance scaling
Mitigates reward hacking and maintains response diversity
Improves RLHF training with mathematical and coding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid reward system with RTV and GenRM
Novel prompt-selection method Pre-PPO
Prioritize math and coding tasks early
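As a rough illustration of the third point, the snippet below sketches a two-stage prompt schedule that front-loads mathematical and coding prompts before mixing in the remaining task types. The staging rule and function name are assumptions; the paper only reports that prioritizing these reasoning-heavy tasks early in RLHF training improves final performance.

```python
# Hypothetical two-stage task schedule for RLHF prompts (staging rule assumed).

from typing import Dict, List, Tuple


def build_curriculum(prompts_by_task: Dict[str, List[str]],
                     early_tasks: Tuple[str, ...] = ("math", "coding")) -> List[List[str]]:
    """Return per-stage prompt pools: stage 0 contains only reasoning-heavy
    tasks (math and coding); stage 1 mixes in the remaining task types."""
    early_pool = [p for task in early_tasks for p in prompts_by_task.get(task, [])]
    late_pool = [p for task, ps in prompts_by_task.items()
                 if task not in early_tasks for p in ps]
    return [early_pool, early_pool + late_pool]
```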
👥 Authors
Wei Shen (ByteDance Seed)
Guanlin Liu (ByteDance)
Zheng Wu (ByteDance Seed)
Ruofei Zhu (ByteDance Seed)
Qingping Yang (ByteDance Seed)
Chao Xin (ByteDance Seed)
Yu Yue (ByteDance Seed)
Lin Yan (ByteDance Seed)