Unsupervised Process Reward Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high cost and limited scalability of traditional process reward models (PRMs), which rely on expert annotations for every reasoning step. The authors propose uPRM, an unsupervised process reward model trained without human step-level annotations or ground-truth verification of final answers. uPRM instead uses a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of the first erroneous step across a batch of reasoning trajectories. On ProcessBench, uPRM improves accuracy in identifying the first erroneous step by up to 15% absolute over LLM-as-a-Judge. As a verifier for test-time scaling, it performs comparably to supervised PRMs and outperforms majority voting by up to 6.9%. As a reward signal in reinforcement learning, it enables more robust policy optimization throughout training than a supervised PRM trained on ground-truth labels.
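To make the first-error localization metric concrete, here is a minimal sketch of how exact-match accuracy on first-error positions could be computed. It is not the authors' evaluation code. The function name, the use of -1 for a trajectory with no erroneous step, and the toy predictions are all assumptions made for illustration.

```python
from typing import Sequence

NO_ERROR = -1  # assumed convention: -1 marks a trajectory with no erroneous step


def first_error_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of trajectories whose predicted first-error index equals the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one trajectory")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy data (hypothetical, not taken from the paper or from ProcessBench).
gold = [2, NO_ERROR, 0, 3]
judge = [1, NO_ERROR, 0, 4]      # stand-in for an LLM-as-a-Judge baseline
verifier = [2, NO_ERROR, 0, 4]   # stand-in for a step-level verifier

judge_acc = first_error_accuracy(judge, gold)
verifier_acc = first_error_accuracy(verifier, gold)
absolute_gain = verifier_acc - judge_acc
```

In this toy case the baseline scores 0.5 and the verifier 0.75, so the absolute gain is 0.25 (25 percentage points). That is the sense in which an "absolute" accuracy improvement is reported, as opposed to a relative one.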

📝 Abstract

Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%; and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.
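The test-time scaling comparison in (ii) can be illustrated with a small sketch that contrasts majority voting with verifier-guided best-of-N selection. This is a generic illustration under stated assumptions, not the paper's procedure: the per-step scores are made-up numbers standing in for a PRM's outputs, and aggregating them with the minimum is one common choice that the abstract does not confirm.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# A candidate is (final_answer, per_step_scores). The scores are hypothetical
# stand-ins for step-level PRM outputs in [0, 1]; each list must be non-empty.
Candidate = Tuple[str, List[float]]


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate step scores with the minimum (an assumed aggregation rule):
    a trajectory is only as trustworthy as its weakest step."""
    return min(step_scores)


def best_of_n(candidates: Sequence[Candidate]) -> str:
    """Return the final answer of the highest-scoring trajectory."""
    answer, _ = max(candidates, key=lambda c: trajectory_score(c[1]))
    return answer


# Toy samples for one problem (hypothetical numbers).
candidates: List[Candidate] = [
    ("41", [0.90, 0.40, 0.80]),
    ("41", [0.80, 0.30, 0.70]),
    ("42", [0.90, 0.85, 0.90]),
]

voted = majority_vote([answer for answer, _ in candidates])
selected = best_of_n(candidates)
```

Here majority voting returns "41" because two of three samples agree, while the verifier-guided choice returns "42" because it is the only trajectory without a low-scoring step. The example shows only how the two selection rules can disagree; it says nothing about which is right on real data.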

Problem

Research questions and friction points this paper is trying to address.

Process Reward Models

unsupervised learning

reasoning supervision

expert annotations

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised Process Reward Models

step-level reasoning supervision

next-token probability scoring