🤖 AI Summary
This work addresses the high computational cost and heavy reliance on fine-grained human annotations in process supervision for reward modeling. To this end, we propose EDU-PRM, a novel framework for efficient process reward modeling. Our method introduces the first self-assessment mechanism based on logit-distribution entropy, enabling uncertainty-aware dynamic step partitioning without manual labeling or fixed step lengths. Furthermore, by combining lightweight query generation with adaptation of Qwen2.5-72B, EDU-PRM achieves 71.1% accuracy with only 7,500 generated queries, nearly matching the 71.6% accuracy of the full-scale PRM, while reducing query cost by 98%. The framework substantially improves both the efficiency and the scalability of process reward modeling, offering a practical path to reinforcement learning from process feedback for large language models.
📝 Abstract
This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism that uses logit-distribution entropy to dynamically pinpoint high-uncertainty regions during token generation. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries demonstrate accuracy closely approximating that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach for scalable process reward model training.
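The core mechanism described above can be illustrated concretely: compute the Shannon entropy of the softmax distribution at each generated token, and treat positions whose entropy exceeds a threshold as candidate step boundaries. The sketch below is a minimal illustration of that idea, not the paper's implementation; the `threshold` hyperparameter and the boundary rule are assumptions for demonstration purposes.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    `logits` has shape (seq_len, vocab_size)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_step_boundaries(logits_seq, threshold=1.0):
    """Return token indices whose entropy exceeds `threshold`.

    These high-uncertainty positions serve as candidate step boundaries
    (the threshold is an assumed hyperparameter, not taken from the paper)."""
    H = token_entropy(logits_seq)
    return [i for i, h in enumerate(H) if h > threshold]

# Toy example: 6 positions over a 4-token vocabulary; random logits
# stand in for a model's per-token outputs.
rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=(6, 4))
print(entropy_step_boundaries(logits, threshold=1.0))
```

Partitioning at entropy peaks rather than at fixed intervals is what lets the model place step boundaries where the generation is genuinely uncertain, which is the property the framework exploits to avoid fixed step lengths and manual labels.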