Process Reward Models That Think

πŸ“… 2025-04-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses two limitations of process reward models (PRMs): their heavy reliance on labor-intensive step-level human annotations and their limited generalization. The authors propose ThinkPRM, a generative PRM that verifies every step of a solution by producing a long, verbalized chain-of-thought (CoT). Fine-tuned on only 1% of the process labels in PRM800K, it capitalizes on the inherent reasoning abilities of long-CoT models and is applied through best-of-N selection and reward-guided search. Its core contribution is a data-efficient generative PRM paradigm that sidesteps discriminative modeling and large-scale manual annotation. Experiments show state-of-the-art results on ProcessBench, MATH-500, and AIME '24; out-of-domain gains of 8% on GPQA-Diamond and 4.5% on LiveCodeBench over discriminative verifiers trained on the full PRM800K; and a 7.2% advantage over LLM-as-a-Judge on a ProcessBench subset under the same token budget.

πŸ“ Abstract
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.
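The best-of-N selection mentioned in the abstract can be made concrete with a short sketch: sample N candidate solutions, score each with the verifier, and keep the highest-scoring one. The `toy_prm_score` function below is a hypothetical stand-in for ThinkPRM, which in practice derives step judgments from a generated verification CoT; aggregating per-step correctness probabilities by product is one common convention, not necessarily the paper's.

```python
def best_of_n(candidates, score_solution):
    """Return the candidate solution with the highest verifier score."""
    return max(candidates, key=score_solution)

def toy_prm_score(solution):
    """Hypothetical PRM stand-in: product of per-step correctness probabilities.

    A real generative PRM would obtain these probabilities by verifying each
    step of the solution with a chain-of-thought, not from stored values.
    """
    score = 1.0
    for p in solution["step_probs"]:
        score *= p
    return score

# Usage: candidate "b" wins because consistent steps beat one weak step.
candidates = [
    {"id": "a", "step_probs": [0.9, 0.5]},  # product = 0.45
    {"id": "b", "step_probs": [0.8, 0.8]},  # product = 0.64
]
best = best_of_n(candidates, toy_prm_score)
```

Scoring by the product of step probabilities penalizes any single weak step, which is the usual motivation for process-level rather than outcome-level rewards.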
Problem

Research questions and friction points this paper is trying to address.

Build data-efficient process reward models (PRMs) for step-wise verification
Reduce training cost by using fewer process labels
Improve verification accuracy across multiple benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative verbalized step-wise reward models
Long chain-of-thought verification approach
Data-efficient training with minimal supervision
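Because a verbalized step-wise reward model emits its verdicts as text, using it downstream requires parsing per-step judgments out of the verification CoT. The sketch below assumes a hypothetical output format of the form `Step k: ... correct/incorrect`; the paper's actual prompt and label format may differ.

```python
import re

def parse_step_verdicts(cot_text):
    """Extract per-step True/False verdicts from a verification CoT.

    Assumes a hypothetical format where each judged step appears as a line
    like 'Step 3: <reasoning> ... correct' or '... incorrect'.
    """
    verdicts = []
    # 'incorrect' is listed first so the alternation cannot split it.
    pattern = re.compile(r"Step\s+\d+.*?\b(incorrect|correct)\b", re.IGNORECASE)
    for match in pattern.finditer(cot_text):
        verdicts.append(match.group(1).lower() == "correct")
    return verdicts

# Usage: a two-step verification CoT with one flagged error.
cot = (
    "Step 1: the algebra checks out, so this step is correct.\n"
    "Step 2: there is a sign error, so this step is incorrect."
)
verdicts = parse_step_verdicts(cot)
```

A prefix of all-correct verdicts up to the first `False` is what reward-guided search would typically consume, while best-of-N can aggregate the verdicts into a single solution-level score.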