🤖 AI Summary
Reinforcement learning from verifiable rewards (RLVR) suffers from unstable exploration and incoherent reasoning paths. Method: This paper proposes FR3E, a framework that (i) integrates first-return estimation with entropy-driven uncertainty quantification to precisely identify high-uncertainty nodes in reasoning chains; and (ii) employs a structured rollout strategy to generate semantically grounded intermediate feedback at critical decision points, enabling directed and stable exploration. Crucially, FR3E requires no dense human annotation while ensuring both reward verifiability and structural guidance. Results: On mathematical reasoning benchmarks including AIME24, FR3E significantly improves training stability, produces longer and more coherent reasoning traces, and substantially increases the proportion of fully correct solutions. It establishes, for the first time, a synergistic exploration paradigm that jointly leverages uncertainty-aware reasoning and verifiable feedback.
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs), but it struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts from those points to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
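To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of entropy-based branch-point selection: given per-step next-token probability distributions from a rollout, compute the Shannon entropy at each step and flag the highest-entropy positions as candidate points for targeted partial rollouts. The function names, the top-k selection rule, and the toy distributions below are illustrative assumptions, not details from the paper.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a single next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_branch_points(step_probs, k=2):
    # step_probs: one probability distribution per generation step.
    # Returns the indices of the k highest-entropy steps (sorted by
    # position) -- the candidate decision points where a structured
    # exploration scheme would launch additional rollouts.
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k]), entropies

# Toy trajectory: steps 0 and 2 are confident (peaked distributions),
# steps 1 and 3 are uncertain (flatter distributions).
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
points, entropies = select_branch_points(step_probs, k=2)
print(points)  # the two most uncertain steps: [1, 3]
```

In the actual framework these distributions would come from the policy model's logits during generation, and each selected point would seed multiple continuation rollouts whose verifiable outcomes provide intermediate feedback for that prefix.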