🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF) often leads to monotonous outputs in large language models, as it struggles to satisfy output diversity and human-preference alignment simultaneously. To address this, we propose CD-RLHF, a novel framework that introduces intrinsic curiosity rewards into RLHF for the first time, establishing a dual-reward co-optimization paradigm: sparse human-feedback rewards ensure alignment fidelity, while dense curiosity-driven rewards, based on state-novelty estimation, encourage exploratory behavior and output diversity. The method integrates intrinsic-motivation modeling from reinforcement learning, policy-gradient optimization, and human-feedback modeling. Evaluated on summarization and instruction-following tasks, CD-RLHF achieves significant improvements on diversity metrics (e.g., +12.3%–18.7% in n-gram coverage and BERTScore diversity) while maintaining human-preference win rates statistically equivalent to standard RLHF (p > 0.05), demonstrating effective decoupling and synergistic optimization of alignment and diversity.
📝 Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
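To make the dual-reward idea concrete, here is a minimal, self-contained sketch of combining a sparse extrinsic (human-preference) reward with a dense curiosity bonus from state-novelty estimation. This is an illustration only, not the paper's implementation: it uses a Random Network Distillation-style novelty estimator over toy state vectors, and the names (`intrinsic_reward`, `combined_reward`) and the mixing coefficient `beta` are hypothetical. The real framework operates on LLM states during PPO-style training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; in CD-RLHF the "state" would be an LLM
# representation encountered during generation.
STATE_DIM, FEAT_DIM = 16, 8

# RND-style novelty: a frozen random "target" network and a trainable
# "predictor". Prediction error is large for unfamiliar states, giving a
# dense intrinsic reward that decays as states are revisited.
W_target = rng.normal(size=(STATE_DIM, FEAT_DIM))        # frozen
W_pred = rng.normal(size=(STATE_DIM, FEAT_DIM)) * 0.1    # trained online

def intrinsic_reward(state, lr=0.01):
    """Novelty = mean squared error between predictor and frozen target
    features; an online SGD step makes revisited states less novel."""
    global W_pred
    target = state @ W_target
    pred = state @ W_pred
    err = pred - target
    W_pred -= lr * np.outer(state, err)   # gradient step on the predictor
    return float(np.mean(err ** 2))

def combined_reward(state, r_extrinsic, beta=0.1):
    """Dual-reward signal: sparse human-preference reward plus a scaled
    curiosity bonus (beta is a hypothetical mixing coefficient)."""
    return r_extrinsic + beta * intrinsic_reward(state)

# A repeatedly visited state should become less novel over time.
s = rng.normal(size=STATE_DIM)
first = intrinsic_reward(s)
for _ in range(200):
    intrinsic_reward(s)
last = intrinsic_reward(s)
print(first > last)   # novelty decays with familiarity
```

The design point the sketch illustrates is the one the abstract describes: the extrinsic reward is sparse and alignment-preserving, while the intrinsic term supplies a dense training signal that rewards exploring novel states, so diversity can improve without overriding the human-preference objective.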