🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs), inefficient rollout exploration, premature convergence, and entropy collapse hinder robust policy learning.
Method: We propose a dual-signal intrinsic curiosity mechanism that jointly leverages response perplexity (on the actor side) and the variance of multi-head value estimates (on the critic side) to construct an exploration bonus. The critic-side signal connects formally to classical count-based exploration, and the analysis reveals, for the first time, an intrinsic calibration-collapse mechanism in RLVR. Integrated into GRPO- and PPO-based training, the method stabilizes policy entropy without altering the external reward.
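A minimal sketch of the two curiosity signals, assuming a PyTorch setting; the function names, tensor shapes, bonus coefficients, and the additive combination are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def actor_curiosity_bonus(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Perplexity of the sampled response under the actor.

    token_logprobs: (batch, seq_len) log-probs of the generated tokens
    mask:           (batch, seq_len) 1 for response tokens, 0 for padding
    """
    mean_nll = -(token_logprobs * mask).sum(dim=-1) / mask.sum(dim=-1)
    return torch.exp(mean_nll)  # higher perplexity -> larger exploration bonus

def critic_curiosity_bonus(value_heads: torch.Tensor) -> torch.Tensor:
    """Disagreement among K value heads as a novelty proxy.

    value_heads: (batch, num_heads) per-head value estimates for the same state
    """
    return value_heads.var(dim=-1, unbiased=False)

def shaped_reward(verifiable_reward, ppl_bonus, var_bonus, beta1=0.01, beta2=0.01):
    # The external (verifiable) reward is left untouched; curiosity enters
    # only as an additive bonus. Coefficients beta1/beta2 are assumptions.
    return verifiable_reward + beta1 * ppl_bonus + beta2 * var_bonus
```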
Results: On the AIME benchmarks, the approach achieves an approximately +3-point improvement over standard RLVR, while notably improving reasoning diversity and long-horizon exploration. It establishes an interpretable, scalable paradigm for balancing exploration and exploitation in LLM reinforcement learning.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use the perplexity of its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as exploration bonuses within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximately +3-point improvement over standard RLVR using GRPO/PPO on the AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
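As one plausible integration point, assuming the bonus is folded into the reward before GRPO's group-relative normalization (an illustrative choice on our part, not confirmed by the abstract), a curiosity-shaped advantage might look like:

```python
import torch

def grpo_advantage_with_curiosity(rewards: torch.Tensor,
                                  bonuses: torch.Tensor,
                                  beta: float = 0.01) -> torch.Tensor:
    """Group-relative advantage over G rollouts of one prompt.

    rewards: (G,) verifiable rewards (e.g. 0/1 from an answer checker)
    bonuses: (G,) intrinsic curiosity bonuses (perplexity- or variance-based)
    beta:    exploration coefficient (an assumed hyperparameter)
    """
    shaped = rewards + beta * bonuses  # external reward kept intact; bonus is additive
    # Standard GRPO normalization: center and scale within the rollout group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)
```

Adding the bonus before normalization keeps the verifiable reward as the dominant learning signal while letting high-perplexity or high-disagreement rollouts receive slightly larger advantages.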