Efficient Process Reward Modeling via Contrastive Mutual Information

📅 2026-04-12

🤖 AI Summary
This work addresses the inefficiency of traditional process reward models, which rely on costly human annotations or computationally intensive automatic methods, such as Monte Carlo (MC) estimation, to generate step-level supervision signals. The authors propose Contrastive Pointwise Mutual Information (CPMI), an automatic reward labeling method that uses the internal probabilities of a large language model to measure how much an intermediate reasoning step increases the mutual information with the correct final answer relative to hard-negative alternatives. Notably, the method incurs no additional inference overhead and outperforms existing automatic labeling strategies on mathematical reasoning and process evaluation benchmarks: compared to MC estimation, it reduces dataset construction time by 84% and token generation by 98% while also improving accuracy.

📝 Abstract
Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign a reward score to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources because they require repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model's internal probabilities to infer step-level supervision while significantly reducing the computational burden of annotating a dataset. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. Experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
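
The contrastive signal described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names `pmi_gain` and `cpmi_reward` are hypothetical, the pointwise mutual information is taken as the change in the answer's log-probability when the step is added to the context, and the hard negatives are aggregated by a simple mean. The log-probabilities are assumed to come from the scoring model; no rollouts are generated.

```python
import math
from typing import Sequence


def pmi_gain(logp_with_step: float, logp_without_step: float) -> float:
    """PMI between a step and an answer, in nats.

    Assumed form: log p(answer | context, step) - log p(answer | context).
    """
    return logp_with_step - logp_without_step


def cpmi_reward(
    target_with: float,
    target_without: float,
    negatives_with: Sequence[float],
    negatives_without: Sequence[float],
) -> float:
    """Hypothetical contrastive score for one reasoning step.

    Assumption: the reward is the PMI gain for the correct answer minus
    the mean PMI gain over the hard-negative answers.
    """
    if not negatives_with or len(negatives_with) != len(negatives_without):
        raise ValueError("need matching, non-empty hard-negative log-probs")
    target_gain = pmi_gain(target_with, target_without)
    negative_gains = [
        pmi_gain(w, wo) for w, wo in zip(negatives_with, negatives_without)
    ]
    return target_gain - sum(negative_gains) / len(negative_gains)


# Illustrative numbers only: the step doubles the probability of the
# correct answer (0.3 -> 0.6) and halves that of one hard negative
# (0.2 -> 0.1), so the contrastive score is positive.
helpful = cpmi_reward(
    math.log(0.6), math.log(0.3), [math.log(0.1)], [math.log(0.2)]
)

# The reverse case: the step favors the hard negative over the target.
harmful = cpmi_reward(
    math.log(0.3), math.log(0.6), [math.log(0.2)], [math.log(0.1)]
)
```

Under these assumptions, a step is rewarded only when it raises the likelihood of the correct answer more than it raises the likelihood of plausible wrong answers, which is the sense in which the signal is contrastive.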
Problem

Research questions and friction points this paper is trying to address.

process reward modeling
chain-of-thought
reward labeling
computational efficiency
reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Mutual Information
Process Reward Modeling
Automatic Reward Labeling
Chain-of-Thought Reasoning
Efficient LLM Supervision