🤖 AI Summary
Existing PRM training relies heavily on hand-crafted heuristics, such as fixed-length intervals or placeholder-based segmentation, to partition reasoning steps, which often misses genuine decision points and degrades reward modeling. To address this, the authors propose AdaptiveStep, a label-free step-partitioning method that uses token-level prediction confidence from autoregressive language models to identify meaningful decision boundaries. By placing boundaries where the model is uncertain, each reasoning step carries more decision-making information. Evaluated on mathematical reasoning and code generation tasks, PRMs trained with AdaptiveStep achieve state-of-the-art Best-of-N performance, surpassing greedy search with token-level value-guided decoding. Moreover, AdaptiveStep reduces PRM construction cost by over 30% compared to existing open-source PRMs while improving transferability and generalization.
📝 Abstract
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or fixing the reasoning step at a predetermined length. These approaches overlook the fact that such specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making information at each step, enhancing downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.
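The core idea, splitting a response into steps wherever the model's next-token confidence drops, can be illustrated with a minimal sketch. This is not the authors' implementation: the token list, the per-token confidences, and the fixed `threshold` cutoff are all hypothetical (the paper derives boundaries from the confidence distribution rather than a hard-coded value), but the sketch shows the mechanism.

```python
# Hedged sketch of confidence-based step partitioning (illustrative only).
# Given each generated token and the probability the model assigned to it,
# we start a new reasoning step wherever confidence falls below a threshold,
# treating low confidence as a proxy for a decision point.

def partition_steps(tokens, confidences, threshold=0.8):
    """Split a token sequence into reasoning steps at low-confidence tokens.

    tokens      -- list of generated tokens (strings)
    confidences -- probability the model assigned to each token
    threshold   -- hypothetical confidence cutoff for opening a new step
    """
    steps, current = [], []
    for tok, conf in zip(tokens, confidences):
        if conf < threshold and current:
            steps.append(current)  # close the step before the uncertain token
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps

# Toy arithmetic completion with made-up confidences:
tokens = ["2", "+", "3", "*", "4", "=", "2", "+", "12", "=", "14"]
confs  = [0.99, 0.95, 0.97, 0.96, 0.98, 0.60, 0.99, 0.95, 0.70, 0.97, 0.99]
steps = partition_steps(tokens, confs, threshold=0.8)
print([" ".join(s) for s in steps])
# → ['2 + 3 * 4', '= 2 +', '12 = 14']
```

The two low-confidence tokens ("=" at 0.60 and "12" at 0.70) each open a new step, so boundaries fall at the points where the model had to commit to a choice rather than at arbitrary token counts or placeholder symbols.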