Process Reward Modeling with Entropy-Driven Uncertainty

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and heavy reliance on fine-grained human annotations in process supervision for reward modeling. To this end, we propose EDU-PRM, a novel framework for efficient process reward modeling. Our method introduces the first logit-distribution entropy–based self-assessment mechanism, enabling uncertainty-aware dynamic step partitioning without manual labeling or fixed step lengths. Furthermore, by integrating lightweight query generation with Qwen2.5-72B adaptation, EDU-PRM achieves 71.1% accuracy using only 7,500 generated queries—nearly matching the 71.6% accuracy of the full-scale PRM—while reducing query cost by 98%. The framework significantly improves both efficiency and scalability of process reward modeling, offering a practical solution for large-language-model–based reinforcement learning from process feedback.

📝 Abstract
This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism that uses logit distribution entropy to dynamically pinpoint high-uncertainty regions during token generation. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries demonstrate accuracy closely approximating the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), achieving a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach for scalable process reward model training.
Problem

Research questions and friction points this paper is trying to address.

Reduces training costs for process reward models
Dynamically identifies high-uncertainty token generation regions
Eliminates need for manual fine-grained annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-guided dynamic step partitioning mechanism
Self-assessment for precise step-level feedback
Reduces training costs by 98%
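The entropy-guided step partitioning idea above can be sketched in a few lines: compute the Shannon entropy of the model's softmax distribution at each token position, then cut the sequence into steps wherever entropy spikes. This is a minimal illustration, not the paper's implementation; the function names, threshold value, and splitting rule are assumptions for demonstration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of the softmax distribution at each token position.

    `logits` has shape (sequence_length, vocab_size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def partition_steps(logits, threshold=1.0):
    """Split a generated sequence into steps at high-uncertainty positions.

    Returns a list of (start, end) index pairs. A step boundary is placed
    right after any token whose predictive entropy exceeds `threshold`
    (an illustrative rule; the paper's criterion may differ).
    """
    ent = token_entropy(logits)
    steps, start = [], 0
    for i, h in enumerate(ent):
        if h > threshold:
            steps.append((start, i + 1))
            start = i + 1
    if start < len(ent):  # close the trailing step
        steps.append((start, len(ent)))
    return steps

# Toy example: 6 positions over a 4-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))
steps = partition_steps(logits, threshold=1.0)
```

The resulting step spans would then receive step-level reward feedback, so denser partitions appear exactly where the model is least certain, with no fixed step length or manual labels.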
Lang Cao
CS PhD Student at University of Illinois Urbana-Champaign
Machine Learning · Machine Reasoning · AI for Health
Renhong Chen
Huawei Technologies Co., Ltd., China
Yingtian Zou
National University of Singapore
machine learning · computer vision
Chao Peng
Huawei Technologies Co., Ltd., China
Wu Ning
Huawei Technologies Co., Ltd., China
Huacong Xu
Huawei Technologies Co., Ltd., China
Qian Chen
Huawei Technologies Co., Ltd., China
Yuxian Wang
Huawei Technologies Co., Ltd., China
Peishuo Su
Huawei Technologies Co., Ltd., China
Mofan Peng
Huawei Technologies Co., Ltd., China
Zijie Chen
Westlake University
deep learning
Yitong Li
Huawei Technologies Co., Ltd., China