🤖 AI Summary
To address systematic errors that large language models (LLMs) make in mathematical reasoning, this paper proposes a reinforcement learning framework built on process rewards. Methodologically, it introduces entropy regularization into process reward modeling and derives a reward construction scheme within the theoretical framework of KL-regularized Markov decision processes, balancing policy optimization against keeping the policy close to its initial distribution. Scoring each intermediate step, rather than only the final answer, guides the policy toward correct reasoning trajectories. Empirically, the method improves best-of-N evaluation by about 1% on GSM8K and 2-3% on MATH, and yields more than 1% improvement under RLHF. These results suggest that entropy regularization improves LLM performance on mathematical reasoning.
📝 Abstract
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. In this work, we propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov decision processes (MDPs) to balance policy optimization with the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method based on the theoretical results. Our theoretical analysis shows that the optimal reward model can be derived from samples drawn from the initial policy. Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving a 1% improvement on GSM8K and a 2-3% improvement on MATH under best-of-N evaluation, and more than 1% improvement under RLHF. These results highlight the efficacy of entropy regularization in enhancing LLMs' reasoning capabilities.
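
The abstract does not spell out the reward construction, so the following is only an illustrative sketch. A standard result for a KL-regularized objective of the form max E[r] - (1/eta) * KL(pi || pi_0) is that the optimal value at a state is the log-mean-exp (1/eta) * log E_{pi_0}[exp(eta * r)], which can be estimated from completions sampled from the initial policy pi_0. Whether ER-PRM labels steps in exactly this way is an assumption here; the function name `soft_step_reward`, the parameter `eta`, and the example rollout outcomes are hypothetical.

```python
import math
from typing import Sequence


def soft_step_reward(rollout_rewards: Sequence[float], eta: float) -> float:
    """Estimate (1/eta) * log(mean(exp(eta * r_i))) for one reasoning step.

    rollout_rewards: outcome rewards (e.g. 1.0 = correct final answer, 0.0 = wrong)
        of completions sampled from the initial policy after this step.
    eta: regularization strength (hypothetical name); eta == 0 is treated as
        the plain mean, which is the limit of the expression as eta -> 0.
    """
    if not rollout_rewards:
        raise ValueError("need at least one sampled rollout")
    if eta == 0:
        return sum(rollout_rewards) / len(rollout_rewards)
    scaled = [eta * r for r in rollout_rewards]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / eta


# Hypothetical outcomes of four completions sampled after a given step.
outcomes = [1.0, 0.0, 0.0, 1.0]
print(soft_step_reward(outcomes, eta=0.0))  # plain average of the outcomes
print(soft_step_reward(outcomes, eta=1.0))  # soft aggregation, at least the average
```

In this sketch, a positive `eta` moves the step score from the average outcome toward the best sampled outcome, and a negative `eta` moves it toward the worst one, so `eta` controls how optimistic or conservative the step label is.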