🤖 AI Summary
Existing process reward models (PRMs) for step-wise evaluation of large language model (LLM) mathematical reasoning typically output scores directly, which limits both learning efficiency and evaluation accuracy, a problem compounded by the scarcity of human-annotated data. To address these limitations, this paper proposes Reasoning-Driven Process Reward Modeling (R-PRM), a three-stage framework: (1) generating seed data from limited annotations with the help of stronger LLMs, bootstrapping the model's ability to reason through each step before judging it; (2) preference optimization to further improve performance without requiring additional annotated data; and (3) inference-time scaling that aggregates multiple sampled evaluations to fully exploit the model's reasoning potential. On ProcessBench and PRMBench, R-PRM surpasses strong baselines by 11.9 and 8.5 F1 points, respectively. When used to guide mathematical reasoning, it delivers consistent accuracy gains of over 8.5 points across six challenging datasets, exhibiting more comprehensive evaluation and stronger generalization of reward estimation.
📝 Abstract
Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.
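The inference-time scaling stage can be sketched minimally: sample several reasoning-based evaluations for each step, fuse the sampled scores, and aggregate step scores into a solution-level score. The sketch below is an illustrative assumption, not the paper's exact procedure; in particular, the mean-based fusion and the min-over-steps aggregation (a weak step caps the whole chain) are common PRM conventions, and the function names are hypothetical.

```python
import statistics

def fuse_step_scores(sampled_scores):
    """Fuse multiple sampled judgments for one reasoning step.

    sampled_scores: floats in [0, 1], each from one sampled reasoning
    trajectory of the reward model (hypothetical interface). Fusion by
    simple averaging is an assumption.
    """
    return statistics.mean(sampled_scores)

def score_solution(per_step_samples):
    """Score a full solution.

    per_step_samples: one list of sampled scores per reasoning step.
    Returns (solution_score, per-step fused scores); taking the minimum
    over steps reflects that one wrong step invalidates the chain.
    """
    step_scores = [fuse_step_scores(s) for s in per_step_samples]
    return min(step_scores), step_scores

# Two steps, each judged by three sampled evaluations.
solution_score, step_scores = score_solution(
    [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]
)
```

In best-of-N selection, such a solution-level score would rank candidate reasoning chains, and more samples per step would trade compute for a steadier estimate.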