From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing methods rely on fixed-budget sampling, leading to inefficient evaluation of reasoning paths in large search spaces and compromising the trade-off between efficiency and solution quality. To address this, we propose Adaptive Monte Carlo Search (AMCS), which elevates mathematical reasoning data generation from static sampling to dynamic search: it introduces an uncertainty-aware sample allocation mechanism for adaptive node evaluation and employs time-varying policies for Monte Carlo path expansion to enhance exploration precision. Leveraging this framework, we construct MathSearch-200Kβ€”a high-quality mathematical reasoning dataset integrating process-based reward modeling with large-scale automated generation. Experiments show that a 7B-parameter model trained on MathSearch-200K achieves 76.2% accuracy on MATH500, surpassing a weakly supervised 72B model, and demonstrates strong out-of-distribution generalization.

πŸ“ Abstract
The quality of process data plays a key role in training a Process Reward Model (PRM), which can enhance the complex mathematical reasoning capability of large language models. Existing methods estimate the quality of reasoning steps with a fixed-budget sampling strategy and must navigate a vast search space during path expansion in automated data generation, resulting in inefficiency and inflexibility. To address these issues, we propose Adaptive Monte Carlo Search (AMCS), a framework that transforms data generation from fixed, static sampling to adaptive, dynamic search at the level of both node value estimation and path expansion. On one hand, AMCS adaptively refines estimation by allocating more samples to uncertain reasoning steps while using fewer samples for those that are easier to estimate. On the other hand, it enhances path expansion through a Monte Carlo algorithm with a temporally adaptive policy that begins with broad exploration and gradually shifts toward exploiting the most promising directions. With AMCS, we construct MathSearch-200K, a large-scale dataset of about 200K process supervision examples for training PRMs. To verify the effectiveness of our method, we conduct extensive experiments on four mathematical reasoning benchmarks. Experimental results show that Qwen2.5-Math-7B-PRM-AMCS achieves up to 76.2% accuracy on MATH500 with GLM-4-9B, outperforming all baseline PRMs. Notably, a 7B model supervised by Qwen2.5-Math-7B-PRM-AMCS surpasses a 72B model with weaker supervision. Moreover, Qwen2.5-Math-7B-PRM-AMCS maintains consistent advantages on out-of-distribution problems, demonstrating strong generalization capability. Our code is available at https://github.com/reml-group/AMCS.
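The uncertainty-aware allocation idea described above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the function names, the `true_value` field, the confidence-interval stopping rule, and the Bernoulli rollout stand-in are all assumptions made for the sketch.

```python
import math
import random

def rollout_success(step):
    # Hypothetical stand-in for running the policy model from this
    # reasoning step and checking the final answer; here modeled as a
    # Bernoulli draw for illustration.
    return random.random() < step["true_value"]

def adaptive_estimate(step, min_samples=4, max_samples=32, eps=0.05):
    """Spend more rollouts on steps whose value estimate is uncertain.

    Sampling stops early once the half-width of a normal-approximation
    confidence interval for the step's success rate falls below `eps`,
    so easy-to-estimate steps consume few samples while ambiguous steps
    receive up to `max_samples` rollouts. All thresholds are illustrative.
    """
    successes, n = 0, 0
    while n < max_samples:
        successes += rollout_success(step)  # bool counts as 0/1
        n += 1
        if n >= min_samples:
            p = successes / n
            half_width = 1.96 * math.sqrt(p * (1 - p) / n)
            if half_width < eps:
                break  # estimate is already tight; save the budget
    return successes / n, n
```

A step whose rollouts all agree stops at the minimum budget, while a step near 50% success exhausts the maximum budget, which is the fixed-vs-adaptive contrast the abstract draws.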
Problem

Research questions and friction points this paper is trying to address.

Fixed-budget sampling wastes rollouts on easy steps while under-sampling uncertain ones
Path expansion must navigate a vast search space, making automated generation inflexible
Training PRMs requires large-scale, high-quality process supervision data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Monte Carlo Search (AMCS) for dynamic process-data generation
Uncertainty-aware sample allocation for node value estimation
Temporally adaptive Monte Carlo policy that shifts from broad exploration to exploitation
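The temporally adaptive policy can be sketched as a UCT-style selection rule whose exploration weight decays with search time. This is a hedged sketch under assumptions: the `c0`/`decay` schedule, the dictionary-based node representation, and the specific decay form are illustrative choices, not the paper's stated algorithm.

```python
import math

def select_child(children, t, c0=1.4, decay=0.01):
    """UCT-style selection with a time-varying exploration coefficient.

    Early in the search (small t) the exploration bonus dominates,
    encouraging broad exploration; as t grows, c_t shrinks and selection
    shifts toward exploiting the child with the best mean value.
    """
    c_t = c0 / (1.0 + decay * t)  # exploration weight decays over time
    total = sum(ch["visits"] for ch in children)

    def score(ch):
        mean = ch["value"] / max(ch["visits"], 1)
        bonus = c_t * math.sqrt(math.log(total + 1) / max(ch["visits"], 1))
        return mean + bonus

    return max(children, key=score)
```

At t = 0 an unvisited child outscores a well-explored one; at large t the same rule prefers the child with the highest mean value, matching the broad-to-focused schedule the bullets describe.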
Jie Ma
MOE KLINNS Lab, Xi’an Jiaotong University
Shihao Qi
School of Computer Science and Technology, Xi’an Jiaotong University
Rui Xing
University of Melbourne
Natural Language Processing Β· Artificial Intelligence Β· Deep Learning
Ziang Yin
ASU PhD Student in Computer Engineering
Bifan Wei
Shaanxi Province Key Laboratory of Big Data Knowledge Engineering
Jun Liu
MOE KLINNS Lab, Xi’an Jiaotong University
Tongliang Liu
Director, Sydney AI Centre, University of Sydney & Mohamed bin Zayed University of AI
Machine Learning Β· Learning with Noisy Labels Β· Trustworthy Machine Learning