SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification

2026-06-03

AI Summary
This work addresses the limitations of existing process reward models (PRMs) in complex scientific reasoning, where models often produce factual inconsistencies, use tools inaccurately, and lack fine-grained verification, all of which lead to hallucinations. To overcome these challenges, the authors propose Sci-PRM, the first tool-aware process reward model for scientific reasoning. It is trained on SCIPRM70K, a newly curated large-scale dataset of Chain-of-Tool trajectories, and supervises tool selection, execution, and result interpretation at each step. This stepwise validation supports Best-of-N selection at test time and provides a dense reward signal in reinforcement learning, mitigating the vanishing-advantage problem. Experiments show that Sci-PRM significantly improves the performance of base models on scientific reasoning tasks and lets reinforcement learning move past existing performance ceilings.
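As a rough illustration of Best-of-N selection with a step-level reward model, the sketch below scores every step of each candidate trajectory and keeps the candidate with the best aggregate score. The scorer `toy_step_scorer` is a hypothetical stand-in for Sci-PRM, and aggregating by the minimum step score is an assumption; the summary does not say how Sci-PRM combines step scores.

```python
from typing import Callable, List, Sequence

Trajectory = List[str]  # one string per reasoning or tool-call step


def aggregate(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one trajectory score.

    Assumption: a trajectory is only as good as its weakest step (min).
    """
    if not step_scores:
        raise ValueError("trajectory has no steps to score")
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Trajectory],
    score_steps: Callable[[Trajectory], List[float]],
) -> Trajectory:
    """Return the candidate whose aggregated step score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda t: aggregate(score_steps(t)))


def toy_step_scorer(trajectory: Trajectory) -> List[float]:
    """Hypothetical scorer: penalise any step that reports a tool error."""
    return [0.1 if "ERROR" in step else 0.9 for step in trajectory]


candidates = [
    ["select tool: mass_calc", "run tool -> ERROR: bad formula", "answer: 18 g/mol"],
    ["select tool: mass_calc", "run tool -> 18.015", "answer: 18.015 g/mol"],
]
chosen = best_of_n(candidates, toy_step_scorer)
```

Here `chosen` is the second trajectory, because the first contains a failed tool call that drags its minimum step score down.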
๐Ÿ“ Abstract
While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains such as biology, chemistry, and physics remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and a lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in a single inference pass. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
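One way to picture advantage disappearance is with group-normalized advantages, where each sampled response is scored relative to the other responses for the same prompt. This is an assumption made for illustration: the abstract does not name the RL algorithm, and the reward values and the 0.5 blending weight below are hypothetical. When every sample in a group receives the same outcome reward, all advantages are zero and the group contributes no learning signal; adding a per-trajectory process score restores the spread.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse outcome reward: all four sampled answers are wrong.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]
adv_outcome = group_advantages(outcome_rewards)  # every entry is 0.0

# Hypothetical dense signal: mean step score per trajectory from a PRM.
process_scores = [0.2, 0.7, 0.4, 0.5]
process_weight = 0.5  # illustrative blending weight, not from the paper
blended_rewards = [
    o + process_weight * p for o, p in zip(outcome_rewards, process_scores)
]
adv_blended = group_advantages(blended_rewards)  # non-zero, sums to ~0
```

With the outcome reward alone, the four wrong answers are indistinguishable. With the blended reward, the trajectory with the best intermediate steps gets the largest advantage even though its final answer is still wrong.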
Problem

Research questions and friction points this paper is trying to address.

Process Reward Models
Scientific Reasoning
Tool Usage
Factual Consistency
Hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model
Scientific Reasoning
Tool Integration
Chain-of-Tool
Reinforcement Learning