FreePRM: Training Process Reward Models Without Ground Truth Process Labels

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing process reward models (PRMs) rely on step-level labels, whether annotated by humans or generated automatically, which are costly and hard to obtain at scale. This paper proposes FreePRM, a weakly supervised PRM training framework that requires no ground-truth step-level labels. FreePRM derives pseudo step-level labels from final-answer correctness and uses a Buffer Probability mechanism to counter the noise those pseudo-labels carry. On the ProcessBench benchmark, FreePRM achieves an average F1 score of 53.0%, which is 24.1 points above a fully supervised PRM trained on Math-Shepherd and 10.9 to 24.6 points above the open-source PRMs it is compared against (Skywork-PRM-7B, EurusPRM, and RLHFlow-PRM-Mistral-8B).

📝 Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of the final outcome, and then employs Buffer Probability to eliminate the impact of the noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.
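The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible rule, namely that every step inherits the correctness of the final answer; the abstract does not spell out the exact rule, and the names Solution and pseudo_step_labels are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    """A step-by-step solution with its final answer (hypothetical container)."""
    steps: List[str]
    final_answer: str


def pseudo_step_labels(solution: Solution, reference_answer: str) -> List[int]:
    """Give every step the outcome label: 1 if the final answer matches, else 0.

    Assumption: all steps share the outcome label. This is an illustrative
    rule, not necessarily the one used by FreePRM.
    """
    outcome = int(solution.final_answer.strip() == reference_answer.strip())
    return [outcome] * len(solution.steps)


correct = Solution(steps=["2 + 3 = 5", "5 * 4 = 20"], final_answer="20")
wrong = Solution(steps=["2 + 3 = 6", "6 * 4 = 24"], final_answer="24")

labels_correct = pseudo_step_labels(correct, "20")
labels_wrong = pseudo_step_labels(wrong, "20")
print(labels_correct, labels_wrong)
```

In the second solution only the first step contains the arithmetic slip, yet both steps receive label 0. This is the kind of label noise that outcome-derived pseudo labels introduce.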

Problem

Research questions and friction points this paper is trying to address.

Training PRMs without ground-truth step-level labels

Reducing reliance on costly manual annotations

Improving PRM performance with weak supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pseudo step-level labels from outcomes

Employs Buffer Probability to reduce noise

Achieves 53.0% average F1 on ProcessBench without ground-truth step-level labels
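The summary above does not define Buffer Probability, so the following is only one plausible reading, offered as a sketch under stated assumptions. It assumes the PRM emits three logits per step (correct, incorrect, buffer) and that the loss credits the buffer mass toward whichever pseudo label the step received, which softens the penalty when a pseudo label is wrong. The function names and the three-way output are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def plain_nll(logits: Sequence[float], pseudo_label: int) -> float:
    """Standard negative log-likelihood of the pseudo label (no buffer credit)."""
    p_correct, p_incorrect, _ = softmax(logits)
    p_label = p_correct if pseudo_label == 1 else p_incorrect
    return -math.log(p_label)


def buffered_nll(logits: Sequence[float], pseudo_label: int) -> float:
    """Assumed buffered loss: the buffer mass counts toward the pseudo label.

    logits are ordered (correct, incorrect, buffer). This ordering and the
    loss form are assumptions made for illustration.
    """
    p_correct, p_incorrect, p_buffer = softmax(logits)
    p_label = p_correct if pseudo_label == 1 else p_incorrect
    return -math.log(p_label + p_buffer)


uniform_logits = [0.0, 0.0, 0.0]
print(plain_nll(uniform_logits, 1), buffered_nll(uniform_logits, 1))
```

Under this reading, a step whose pseudo label is doubtful can place probability in the buffer instead of being pushed hard toward a possibly wrong class. Used alone, such a loss could be minimized by putting all mass in the buffer, so a practical objective would presumably need an additional term to discourage that.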
