🤖 AI Summary
Reinforcement learning from verifiable rewards (RLVR) suffers from unstable exploration and incoherent reasoning paths. Method: This paper proposes FR3E, a framework that (i) integrates first-return estimation with entropy-driven uncertainty quantification to precisely identify high-uncertainty nodes in reasoning chains; and (ii) employs a structured rollout strategy to generate semantically grounded intermediate feedback at critical decision points, enabling directed and stable exploration. Crucially, FR3E requires no dense human annotation while ensuring both reward verifiability and structural guidance. Results: On mathematical reasoning benchmarks including AIME24, FR3E significantly improves training stability, produces longer and more coherent reasoning traces, and substantially increases the proportion of fully correct solutions. It establishes, for the first time, a synergistic exploration paradigm that jointly leverages uncertainty-aware reasoning and verifiable feedback.
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs), but it struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts from those points to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
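To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of entropy-based branch-point selection: given per-step next-token probability distributions from a rollout, compute the Shannon entropy at each step and flag the highest-entropy positions as candidate points for targeted partial rollouts. The function names, the top-k selection rule, and the toy distributions below are illustrative assumptions, not details from the paper.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a single next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_branch_points(step_probs, k=2):
    # step_probs: one probability distribution per generation step.
    # Returns the indices of the k highest-entropy steps (sorted by
    # position) -- the candidate decision points where a structured
    # exploration scheme would launch additional rollouts.
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k]), entropies

# Toy trajectory: steps 0 and 2 are confident (peaked distributions),
# steps 1 and 3 are uncertain (flatter distributions).
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
points, entropies = select_branch_points(step_probs, k=2)
print(points)  # the two most uncertain steps: [1, 3]
```

In the actual framework these distributions would come from the policy model's logits during generation, and each selected point would seed multiple continuation rollouts whose verifiable outcomes provide intermediate feedback for that prefix.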