RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing tree-search-based code generation methods suffer from ineffective intermediate-step evaluation and poor error localization, leading to erroneous outputs and high computational overhead. This paper proposes RPM-MCTS, a novel approach that replaces conventional *trained* process reward models with *knowledge retrieval* to enable fine-grained assessment of algorithmic reasoning paths. It further enhances exploration diversity via similarity-based filtering and incorporates sandbox execution feedback for precise error localization and targeted correction. Evaluated on four public benchmarks, RPM-MCTS significantly outperforms prior methods while reducing token consumption by approximately 15%. Additionally, full fine-tuning of the base model on high-quality synthetic code data systematically improves its code generation capability. The core innovations are: (i) a training-free process reward modeling mechanism grounded in knowledge retrieval, and (ii) an execution-driven, iterative error correction framework.
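The retrieval-as-reward idea in the summary above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Jaccard similarity, the top-k averaging, and the toy knowledge base are all assumptions standing in for whatever retriever and reference corpus the authors actually use.

```python
# Minimal sketch (assumption, not the paper's implementation): score an
# intermediate algorithmic step by retrieving similar reference steps from
# a knowledge base, instead of querying a trained process reward model.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two step descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieval_reward(step: str, knowledge_base: list[str], top_k: int = 3) -> float:
    """Reward a step by its mean similarity to the top-k nearest
    reference steps; no reward-model training is involved."""
    sims = sorted((jaccard(step, ref) for ref in knowledge_base), reverse=True)
    top = sims[:top_k]
    return sum(top) / len(top) if top else 0.0

# Toy knowledge base of reference algorithm steps (illustrative only).
kb = [
    "sort the array with a comparison sort",
    "use two pointers from both ends of the sorted array",
    "return the pair whose sum equals the target",
]
print(round(retrieval_reward("sort the input array", kb), 2))  # higher = more plausible step
```

In an MCTS loop this score would replace the value a trained process reward model assigns to a newly expanded node, which is what makes the approach training-free.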

πŸ“ Abstract
Tree search-based methods have made significant progress in enhancing the code generation capabilities of large language models. However, because intermediate algorithmic steps are difficult to evaluate effectively and erroneous steps cannot be located and promptly corrected, these methods often generate incorrect code and incur increased computational costs. To tackle these problems, we propose RPM-MCTS, an effective method that uses Knowledge-Retrieval as a Process Reward Model within Monte Carlo Tree Search to evaluate intermediate algorithmic steps. By relying on knowledge-base retrieval, RPM-MCTS avoids the complex training of process reward models. During the expansion phase, similarity filtering removes redundant nodes, ensuring diversity in reasoning paths. Furthermore, our method uses sandbox execution feedback to locate erroneous algorithmic steps during generation, enabling timely and targeted corrections. Extensive experiments on four public code generation benchmarks demonstrate that RPM-MCTS outperforms current state-of-the-art methods while reducing token consumption by approximately 15%. In addition, fully fine-tuning the base model on data constructed by RPM-MCTS significantly enhances its code capabilities.
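The sandbox-feedback mechanism the abstract describes can be illustrated with a small sketch. `run_in_sandbox`, `locate_failing_step`, and the step-to-line bookkeeping are hypothetical stand-ins for the paper's actual machinery; only the subprocess execution and traceback parsing are standard Python.

```python
# Hypothetical sketch of sandbox-execution feedback for error localization:
# run candidate code in a separate interpreter process and map the failing
# traceback line back to the algorithmic step that produced it.
import os
import re
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: int = 5):
    """Execute candidate code in a separate interpreter process;
    return (succeeded, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    finally:
        os.unlink(path)

def locate_failing_step(stderr: str, step_ranges):
    """Map the deepest traceback frame's line number to the algorithmic
    step that generated it. step_ranges: [(step_name, first, last), ...]."""
    matches = re.findall(r"line (\d+)", stderr)
    if not matches:
        return None
    line = int(matches[-1])  # the deepest frame is reported last
    for name, first, last in step_ranges:
        if first <= line <= last:
            return name
    return None

# Line 2 (the "access" step) raises IndexError, so that step is blamed.
ok, err = run_in_sandbox("x = [1, 2]\nprint(x[5])\n")
print(ok, locate_failing_step(err, [("init", 1, 1), ("access", 2, 2)]))
```

Blaming a specific step rather than the whole program is what allows the targeted, timely corrections the abstract claims, instead of regenerating the entire solution.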
Problem

Research questions and friction points this paper is trying to address.

Evaluating intermediate algorithmic steps in code generation is difficult
Existing methods cannot locate and promptly correct erroneous steps
Tree-search approaches generate incorrect code at high computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-Retrieval Process Reward Model avoids complex training
Similarity filtering removes redundant nodes during expansion
Sandbox execution feedback locates and corrects erroneous steps
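The similarity-filtering bullet can be sketched as a greedy de-duplication pass over candidate child nodes during expansion. The token-overlap measure and the 0.6 threshold are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of expansion-phase similarity filtering: drop
# candidate child steps that are too similar to an already-kept sibling,
# so the expanded children cover diverse reasoning paths.

def token_similarity(a: str, b: str) -> float:
    """Token-overlap (Jaccard) similarity between two step descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def filter_candidates(candidates, threshold=0.6):
    """Greedily keep a candidate only if it is dissimilar to every
    previously kept one."""
    kept = []
    for cand in candidates:
        if all(token_similarity(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept

steps = [
    "binary search over the sorted prefix",
    "binary search over the sorted prefix array",  # near-duplicate: dropped
    "maintain a max-heap of the k largest values",
]
print(filter_candidates(steps))  # the near-duplicate is filtered out
```

Pruning redundant children before simulation is also one plausible source of the reported token savings, since near-identical paths are never rolled out.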
Yuanyuan Lin
Statistics, The Chinese University of Hong Kong
Xiangyu Ouyang
College of Computer Science and Technology, Xi'an Jiaotong University, China
Teng Zhang
School of Computer Science and Technology, Huazhong University of Science and Technology, China
Kaixin Sui
ByteDance Seed, China