GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in process reward modeling (PRM) for multi-step reasoning, namely high reward noise, low factual fidelity, and misaligned step-level credit assignment, this paper proposes a tree-guided, fidelity-aware PRM framework. The method integrates Monte Carlo Tree Search (MCTS) to construct structured reasoning paths, employs external tools for verifiable, stepwise factual validation, and introduces a hybrid reward aggregation mechanism together with a rationale-enhanced, generative reward format. Trained on only 40K automatically labeled samples, roughly 10% of the data used by the best-performing auto-labeled PRM, it achieves up to a 26% relative improvement in average performance on ProcessBench and outperforms human-annotated PRMs in reward-guided search. The core contribution is fine-grained, low-noise, high-fidelity step-level reward modeling, with tool-based factual verification integrated directly into the PRM training loop for the first time.
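The "rationale-enhanced, generative" reward format mentioned above means the PRM emits a natural-language judgment rather than a bare scalar. A minimal sketch of parsing such an output, assuming a hypothetical `Rationale: ... Verdict: ...` template (the paper's exact format is not specified on this page):

```python
import json
import re

# Assumed generative PRM output: a rationale followed by a verdict tag.
# This template is illustrative; GroundedPRM's actual format may differ.
EXAMPLE_OUTPUT = """\
Rationale: The step multiplies 3 by 4 and the tool-executed result (12) \
matches the claimed value, so the step is factually grounded.
Verdict: correct"""

def parse_prm_output(text: str) -> dict:
    """Extract the rationale and a binary step label from a generative
    PRM response in the assumed 'Rationale: ... Verdict: ...' form."""
    rationale = re.search(r"Rationale:\s*(.*?)\s*Verdict:", text, re.S)
    verdict = re.search(r"Verdict:\s*(\w+)", text)
    return {
        "rationale": rationale.group(1) if rationale else "",
        "label": 1 if verdict and verdict.group(1).lower() == "correct" else 0,
    }

print(json.dumps(parse_prm_output(EXAMPLE_OUTPUT), indent=2))
```

Keeping the rationale alongside the binary label is what makes the reward signal both interpretable and compatible with instruction-tuned LLMs, per the abstract below.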

📝 Abstract
Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
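As a concrete reading of the reward-guided greedy search described in the abstract, the sketch below scores candidate next steps with a trained PRM and always expands the highest-scoring one. All three callables (`generate_candidates`, `prm_score`, `is_final`) are assumed interfaces standing in for the policy LLM and GroundedPRM, not the paper's actual API.

```python
from typing import Callable

def greedy_prm_search(
    question: str,
    generate_candidates: Callable[[str, list[str]], list[str]],  # policy LLM proposes next steps
    prm_score: Callable[[str, list[str], str], float],           # PRM scores one candidate step
    is_final: Callable[[str], bool],                             # detects a terminal answer step
    max_steps: int = 10,
) -> list[str]:
    """Reward-guided greedy decoding: at each depth, keep only the candidate
    step the process reward model scores highest, then continue from it."""
    trajectory: list[str] = []
    for _ in range(max_steps):
        candidates = generate_candidates(question, trajectory)
        if not candidates:
            break
        best = max(candidates, key=lambda step: prm_score(question, trajectory, step))
        trajectory.append(best)
        if is_final(best):
            break
    return trajectory
```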
Problem

Research questions and friction points this paper is trying to address.

Reducing noisy rewards in process supervision
Improving factual fidelity of step-level reasoning
Enabling fine-grained credit assignment for reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Monte Carlo Tree Search for structured reasoning paths
Validates intermediate steps with external tool execution
Combines tool verification with MCTS feedback in hybrid rewards (see the sketch after this list)
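A minimal sketch of the hybrid reward described in these bullets, assuming a simple weighted fusion of a binary tool-verification signal with the mean rollout value an MCTS node accumulates. The mixing weight `alpha`, the toy arithmetic verifier, and the node fields are illustrative assumptions, not the paper's implementation.

```python
import re
from dataclasses import dataclass, field

@dataclass
class StepNode:
    """One reasoning step in the search tree (illustrative structure)."""
    step_text: str
    visit_count: int = 0
    value_sum: float = 0.0  # rollout outcomes backed up through this node
    children: list = field(default_factory=list)

    @property
    def mcts_value(self) -> float:
        """Mean rollout outcome: the MCTS-derived feedback for this step."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def verify_step(step_text: str) -> bool:
    """Toy external verifier for arithmetic claims like '3 * 4 = 12'.
    Stands in for the paper's tool-based validation; a real tool would be
    richer (code execution, symbolic math, retrieval)."""
    m = re.fullmatch(r"\s*(-?\d+)\s*([+*-])\s*(-?\d+)\s*=\s*(-?\d+)\s*", step_text)
    if not m:
        return False
    a, op, b, c = m.groups()
    result = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
    return result == int(c)

def hybrid_reward(node: StepNode, alpha: float = 0.5) -> float:
    """Fuse execution-grounded correctness with global outcome feedback.
    `alpha` is an assumed mixing weight; the paper's exact rule may differ."""
    tool_signal = 1.0 if verify_step(node.step_text) else 0.0
    return alpha * tool_signal + (1.0 - alpha) * node.mcts_value
```

For example, a node for the step `3 * 4 = 12` with `visit_count=5` and `value_sum=4.0` scores 0.5 * 1.0 + 0.5 * 0.8 = 0.9 under the default weighting.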
Authors

Yao Zhang (LMU Munich)
Yu Wu (University of Cambridge) · machine learning, health sensing, mobile health
Haowei Zhang (Fudan University)
Weiguo Li (Heidelberg University)
Haokun Chen (LMU Munich)
Jingpei Wu (LMU Munich)
Guohao Li (University of Oxford)
Zhen Han (LMU Munich)
Volker Tresp (Ludwig-Maximilians-Universität München, LMU Munich) · Machine Learning, Artificial Intelligence, Computational Cognitive Neuroscience, Knowledge Graphs