Curiosity-Driven Reinforcement Learning from Human Feedback

📅 2025-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF) often leads to monotonous outputs in large language models, struggling to simultaneously satisfy diversity and human preference alignment. To address this, we propose CD-RLHF—a novel framework that introduces intrinsic curiosity rewards into RLHF for the first time, establishing a dual-reward co-optimization paradigm: sparse human feedback ensures alignment fidelity, while dense curiosity-driven rewards—based on state novelty estimation—enhance exploratory behavior and output diversity. Our method integrates reinforcement learning–based intrinsic motivation modeling, policy gradient optimization, and human feedback modeling. Evaluated on summarization and instruction-following tasks, CD-RLHF achieves significant improvements in diversity metrics (e.g., +12.3%–18.7% in n-gram coverage and BERTScore diversity), while maintaining human preference win rates statistically equivalent to standard RLHF (p > 0.05). This demonstrates effective decoupling and synergistic optimization of alignment and diversity.

Technology Category

Application Category

📝 Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Human Feedback (RLHF)
Language Model Output Diversity
Human Preference Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curiosity-Driven
Reinforcement Learning with Human Feedback
Richness and Consistency
🔎 Similar Papers
No similar papers found.
H
Haoran Sun
Baidu Inc.
Yekun Chai
Yekun Chai
Baidu
natural language processingmachine learning
Shuohuan Wang
Shuohuan Wang
Baidu
Natural Language ProcessingDeep Learning
Y
Yu Sun
Baidu Inc.
H
Hua Wu
Baidu Inc.
H
Haifeng Wang
Baidu Inc.