First Return, Entropy-Eliciting Explore

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning from verifiable rewards (RLVR) suffers from unstable exploration and incoherent reasoning paths. Method: the paper proposes FR3E, a framework that (i) combines first-return estimation with entropy-driven uncertainty quantification to identify high-uncertainty nodes in reasoning chains, and (ii) uses a structured rollout strategy to generate semantically grounded intermediate feedback at those critical decision points, enabling directed and stable exploration. FR3E requires no dense human annotation while preserving reward verifiability and structural guidance. Results: on mathematical reasoning benchmarks including AIME24, FR3E improves training stability, produces longer and more coherent reasoning traces, and increases the proportion of fully correct solutions, yielding an exploration paradigm that jointly leverages uncertainty-aware reasoning and verifiable feedback.
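The entropy-driven uncertainty quantification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `token_entropy` and `select_anchor_points` are hypothetical names, and the exact scoring FR3E uses over reasoning chains is defined in the paper itself.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one step's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_anchor_points(step_distributions, top_k=2):
    """Pick the top_k highest-entropy steps in a trajectory as
    candidate high-uncertainty decision points for targeted rollouts.

    `step_distributions` is a list of probability distributions, one
    per reasoning step (a stand-in for the model's logits).
    """
    scored = [(token_entropy(p), i) for i, p in enumerate(step_distributions)]
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:top_k])
```

A near-deterministic step (e.g. probabilities `[0.9, 0.1]`) scores low entropy and is skipped, while a uniform distribution scores highest and becomes a rollout anchor.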

📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
Problem

Research questions and friction points this paper is trying to address.

Improves LLM reasoning with structured exploration
Addresses unstable exploration in RLVR methods
Enhances training stability and response coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured exploration framework for LLMs
Targeted rollouts at high-uncertainty points
Semantically grounded intermediate feedback
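The targeted-rollout idea in the bullets above can be sketched as a "return to the anchor, re-sample, and verify" loop. This is a hedged sketch, not FR3E's actual code: `generate` and `verify` are hypothetical callables standing in for the policy's sampler and the verifiable-reward checker, and the verified success rate is used here as the semantically grounded intermediate feedback signal.

```python
def rollout_feedback(prefix_states, generate, verify, n_rollouts=4):
    """For each high-uncertainty anchor (a trajectory prefix), run
    several fresh continuations and use the verified success rate as
    an intermediate value estimate for that decision point."""
    feedback = {}
    for anchor, prefix in prefix_states.items():
        wins = sum(verify(generate(prefix)) for _ in range(n_rollouts))
        feedback[anchor] = wins / n_rollouts
    return feedback
```

Anchors whose continuations rarely verify receive low feedback, steering exploration toward prefixes that still admit correct completions, without any dense human annotation.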