🤖 AI Summary
Existing reinforcement learning (RL) methods such as GRPO often fail on code reasoning because the sampled rewards show too little variance, while process reward models (PRMs) suffer from costly annotation requirements and limited verification effectiveness. To address these issues, the paper proposes ReST-RL, a unified RL paradigm for code generation. Its core contributions are: (1) ReST-GRPO, which uses an optimized ReST procedure to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thereby improving the effectiveness and efficiency of training; (2) VM-MCTS, an annotation-free test-time decoding method in which a value model (VM), trained on value targets collected via Monte Carlo Tree Search, guides an adapted MCTS with process signals and verification scores; and (3) a unified two-stage pipeline that first reinforces the policy with ReST-GRPO and then applies VM-assisted decoding. Evaluated on APPS, BigCodeBench, and HumanEval, ReST-RL consistently outperforms reinforcement training baselines (e.g., naive GRPO, ReST-DPO) as well as decoding and verification baselines (e.g., PRM-BoN, ORM-MCTS), improving the code reasoning accuracy of LLM policies.
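To make the reward-variance filtering idea concrete, below is a minimal Python sketch of how prompts could be screened before GRPO training: prompts whose sampled completions all receive (nearly) the same reward yield degenerate group-normalized advantages and are dropped. The `sample_completions` and `reward_fn` interfaces and the threshold are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
import statistics
from typing import Callable, Dict, List

def filter_prompts_for_grpo(
    prompts: List[str],
    sample_completions: Callable[[str, int], List[str]],  # hypothetical: policy sampler
    reward_fn: Callable[[str, str], float],                # hypothetical: e.g., unit-test pass rate
    group_size: int = 8,
    min_reward_std: float = 0.1,
) -> List[Dict]:
    """Keep only prompts whose sampled rewards vary enough to drive GRPO.

    If every completion in a group gets (almost) the same reward, the
    group-normalized advantages collapse toward zero, so the prompt is
    dropped to save training compute.
    """
    kept = []
    for prompt in prompts:
        completions = sample_completions(prompt, group_size)
        rewards = [reward_fn(prompt, c) for c in completions]
        if statistics.pstdev(rewards) < min_reward_std:
            continue  # degenerate reward group: no useful gradient signal
        # Also keep the highest-reward completion as "high-value" data that
        # later sampling rounds can build on.
        best = max(zip(rewards, completions), key=lambda rc: rc[0])[1]
        kept.append({
            "prompt": prompt,
            "completions": completions,
            "rewards": rewards,
            "best_completion": best,
        })
    return kept
```

The variance threshold trades data quantity for gradient informativeness and would need to be tuned against the scale of the reward.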
📝 Abstract
With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO often fails due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties in training data acquisition and limited verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves an LLM's code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thus improving the effectiveness and efficiency of training. After the basic reasoning ability of the LLM policy has been improved, we further propose a test-time decoding optimization method called VM-MCTS. Through Monte Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed with an adapted MCTS algorithm to provide precise process signals as well as verification scores, helping the LLM policy achieve high reasoning accuracy. We validate the effectiveness of the proposed RL paradigm through extensive experiments on coding problems. Our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS), on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Code for our project can be found at https://github.com/THUDM/ReST-RL.
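As an illustration of how value targets might be gathered without human annotation, the sketch below estimates the value of each partial solution from the final rewards of sampled continuations. It uses plain Monte Carlo rollouts rather than the paper's full MCTS with selection and backpropagation, and `sample_step` / `rollout_reward` are hypothetical interfaces assumed for the example.

```python
from typing import Callable, List, Tuple

def collect_value_targets(
    prompt: str,
    sample_step: Callable[[str], str],       # hypothetical: policy proposes the next partial-solution step
    rollout_reward: Callable[[str], float],  # hypothetical: reward of a finished solution (e.g., test pass rate)
    num_paths: int = 16,
    rollouts_per_state: int = 4,
    max_steps: int = 8,
) -> List[Tuple[str, float]]:
    """Build (partial_solution, value_estimate) pairs for value-model training.

    The value of a partial solution is estimated as the mean final reward of a
    few sampled continuations, so no human annotation is needed.
    """
    targets = []
    for _ in range(num_paths):
        state = prompt
        for _ in range(max_steps):
            state = state + sample_step(state)           # extend the partial solution by one step
            returns = []
            for _ in range(rollouts_per_state):
                finished = state
                for _ in range(max_steps):               # roll out to a length-capped complete solution
                    finished = finished + sample_step(finished)
                returns.append(rollout_reward(finished))
            targets.append((state, sum(returns) / len(returns)))
    return targets
```

At decoding time, a VM trained on such targets can replace these expensive rollouts, scoring partial solutions directly inside an adapted MCTS to guide the search toward high-value continuations.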