AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

📅 2025-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing PRM training relies heavily on hand-crafted heuristics, such as fixed token intervals or placeholder-based segmentation, to partition reasoning steps, which often distorts genuine decision points and degrades reward modeling efficacy. To address this, the paper proposes AdaptiveStep, a label-free, adaptive step-segmentation method that uses token-level prediction confidence from autoregressive language models to dynamically identify semantically meaningful decision boundaries. AdaptiveStep ensures each reasoning step captures an authentic inferential leap, substantially increasing the decision information carried per step. Evaluated on mathematical reasoning and code generation tasks, PRMs trained with AdaptiveStep achieve state-of-the-art performance under Best-of-N evaluation, outperforming greedy search with token-level value-guided decoding, while reducing PRM construction cost by over 30% and markedly improving generalization and cross-task transferability.

📝 Abstract
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step to a fixed length. These approaches overlook the fact that specific words rarely mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making information at each step, enhancing downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing a greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.
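The confidence-based step division described in the abstract can be sketched as follows. This is a minimal illustration, assuming per-token prediction probabilities are already available (e.g., extracted from a model's logprobs); the `split_steps` name and the `threshold` value are illustrative assumptions, not the paper's exact formulation.

```python
def split_steps(tokens, probs, threshold=0.7):
    """Split a token sequence into reasoning steps at low-confidence tokens.

    A boundary is placed *before* any token whose predicted probability
    falls below `threshold`: the intuition is that low next-token
    confidence marks a genuine decision point, unlike fixed-length or
    placeholder-based splitting. The threshold is a hypothetical choice.
    """
    steps, current = [], []
    for tok, p in zip(tokens, probs):
        if p < threshold and current:
            steps.append(current)  # close the step before a low-confidence token
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps

# Toy example: a short reasoning trace with per-token model confidence.
tokens = ["First,", " add", " 2", " and", " 3", " to", " get", " 5", "."]
probs  = [0.95, 0.90, 0.92, 0.97, 0.55, 0.90, 0.93, 0.60, 0.99]
steps = split_steps(tokens, probs, threshold=0.7)
# Boundaries fall before " 3" (0.55) and " 5" (0.60), yielding 3 steps.
```

Note that, unlike rule-based segmentation, the boundaries adapt to the model's own uncertainty, so no manual annotation of step delimiters is required.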
Problem

Research questions and friction points this paper is trying to address.

Overcoming rule-based step division in Process Reward Models
Automatically dividing reasoning steps via model confidence
Enhancing reward model performance without manual annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic step division via model confidence prediction
Enhances reward models without manual annotation
Reduces costs while achieving state-of-the-art performance
Authors
Yuliang Liu (Nanjing University)
Junjie Lu (Cystic Fibrosis Foundation)
Zhaoling Chen (Master Student, Nanjing University)
Chaofeng Qu (Nanjing University)
Jason Klein Liu
Chonghan Liu
Zefan Cai (Student, Peking University)
Yunhui Xia
Li Zhao (MSRA)
Jiang Bian (MSRA)
Chuheng Zhang (MSRA)
Wei Shen
Zhouhan Lin (Shanghai Jiaotong University)