🤖 AI Summary
This work addresses the high computational cost and heavy reliance on fine-grained human annotations in process supervision for reward modeling. To this end, we propose EDU-PRM, a novel framework for efficient process reward modeling. Our method introduces the first self-assessment mechanism based on logit-distribution entropy, enabling uncertainty-aware dynamic step partitioning without manual labeling or fixed step lengths. Furthermore, by combining lightweight query generation with adaptation of Qwen2.5-72B, EDU-PRM achieves 71.1% accuracy with only 7,500 generated queries, nearly matching the 71.6% accuracy of the full-scale PRM, while reducing query cost by 98%. The framework substantially improves both the efficiency and the scalability of process reward modeling, offering a practical path to reinforcement learning from process feedback for large language models.
📝 Abstract
This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism that uses logit-distribution entropy to dynamically pinpoint high-uncertainty regions during token generation. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries demonstrate accuracy closely approximating that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach for scalable process reward model training.
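The core mechanism described above can be illustrated concretely: compute the Shannon entropy of the softmax distribution at each generated token, and treat positions whose entropy exceeds a threshold as candidate step boundaries. The sketch below is a minimal illustration of that idea, not the paper's implementation; the `threshold` hyperparameter and the boundary rule are assumptions for demonstration purposes.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    `logits` has shape (seq_len, vocab_size)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_step_boundaries(logits_seq, threshold=1.0):
    """Return token indices whose entropy exceeds `threshold`.

    These high-uncertainty positions serve as candidate step boundaries
    (the threshold is an assumed hyperparameter, not taken from the paper)."""
    H = token_entropy(logits_seq)
    return [i for i, h in enumerate(H) if h > threshold]

# Toy example: 6 positions over a 4-token vocabulary; random logits
# stand in for a model's per-token outputs.
rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=(6, 4))
print(entropy_step_boundaries(logits, threshold=1.0))
```

Partitioning at entropy peaks rather than at fixed intervals is what lets the model place step boundaries where the generation is genuinely uncertain, which is the property the framework exploits to avoid fixed step lengths and manual labels.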