🤖 AI Summary
Existing PRM training relies heavily on hand-crafted heuristics, such as fixed-length intervals or placeholder-based segmentation, to partition reasoning steps, which often misses genuine decision points and degrades reward modeling. To address this, the authors propose AdaptiveStep, a label-free step-partitioning method that uses token-level prediction confidence from autoregressive language models to identify meaningful decision boundaries. By placing boundaries where the model is uncertain, each reasoning step carries more decision-making information. Evaluated on mathematical reasoning and code generation tasks, PRMs trained with AdaptiveStep achieve state-of-the-art Best-of-N performance, surpassing greedy search with token-level value-guided decoding. Moreover, AdaptiveStep reduces PRM construction cost by over 30% compared to existing open-source PRMs while improving transferability and generalization.
📝 Abstract
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or fixing the reasoning step at a predetermined length. These approaches overlook the fact that such specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making information at each step, enhancing downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.
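The core idea, splitting a response into steps wherever the model's next-token confidence drops, can be illustrated with a minimal sketch. This is not the authors' implementation: the token list, the per-token confidences, and the fixed `threshold` cutoff are all hypothetical (the paper derives boundaries from the confidence distribution rather than a hard-coded value), but the sketch shows the mechanism.

```python
# Hedged sketch of confidence-based step partitioning (illustrative only).
# Given each generated token and the probability the model assigned to it,
# we start a new reasoning step wherever confidence falls below a threshold,
# treating low confidence as a proxy for a decision point.

def partition_steps(tokens, confidences, threshold=0.8):
    """Split a token sequence into reasoning steps at low-confidence tokens.

    tokens      -- list of generated tokens (strings)
    confidences -- probability the model assigned to each token
    threshold   -- hypothetical confidence cutoff for opening a new step
    """
    steps, current = [], []
    for tok, conf in zip(tokens, confidences):
        if conf < threshold and current:
            steps.append(current)  # close the step before the uncertain token
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps

# Toy arithmetic completion with made-up confidences:
tokens = ["2", "+", "3", "*", "4", "=", "2", "+", "12", "=", "14"]
confs  = [0.99, 0.95, 0.97, 0.96, 0.98, 0.60, 0.99, 0.95, 0.70, 0.97, 0.99]
steps = partition_steps(tokens, confs, threshold=0.8)
print([" ".join(s) for s in steps])
# → ['2 + 3 * 4', '= 2 +', '12 = 14']
```

The two low-confidence tokens ("=" at 0.60 and "12" at 0.70) each open a new step, so boundaries fall at the points where the model had to commit to a choice rather than at arbitrary token counts or placeholder symbols.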