AI Summary
This work addresses the limitations of existing process reward models in complex scientific reasoning, which often suffer from factual inconsistencies, inaccurate tool usage, and a lack of fine-grained verification mechanisms, leading to hallucinations. To overcome these challenges, the authors propose Sci-PRM, the first tool-aware process reward model for scientific reasoning. Built upon SCIPRM70K, a newly curated large-scale dataset of tool-invoking trajectories, Sci-PRM enables stepwise supervision over tool selection, execution, and result interpretation. By leveraging Chain-of-Tool trajectories, it facilitates fine-grained validation, supports Best-of-N decoding at test time, and provides dense reward signals in reinforcement learning, effectively mitigating the vanishing advantage problem. Experimental results demonstrate that Sci-PRM significantly enhances the reasoning performance of base models on scientific tasks, surpassing current reinforcement learning bottlenecks.
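The vanishing advantage problem mentioned above can be illustrated with a toy calculation. The sketch below is not the authors' training setup: it assumes a group-relative, mean-centred advantage and a hypothetical reward that adds the mean per-step score to the outcome reward. The function names, the `weight` parameter, and all numbers are made up for illustration.

```python
from statistics import mean

def group_advantages(rewards):
    """Mean-centred advantages within one group of sampled responses (assumed form)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def dense_rewards(outcome_rewards, step_scores, weight=0.5):
    """Add the mean per-step score (hypothetical PRM output) to each outcome reward."""
    return [o + weight * mean(s) for o, s in zip(outcome_rewards, step_scores)]

# All four sampled responses fail, so the outcome-only rewards are identical.
outcome = [0.0, 0.0, 0.0, 0.0]
# Illustrative per-step scores for the same four responses.
step_scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.1],
    [0.9, 0.9, 0.4],
    [0.5, 0.5, 0.5],
]

sparse_adv = group_advantages(outcome)  # every entry is zero: no learning signal
dense_adv = group_advantages(dense_rewards(outcome, step_scores))  # entries differ
```

With outcome-only rewards every advantage in the group is zero, so the group contributes no gradient. Once step-level scores are mixed in, responses with better intermediate steps receive a positive advantage even though none reached the correct final answer.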
Abstract
While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains such as biology, chemistry, and physics remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise use of domain-specific tools, areas where current models often suffer from hallucinations and a lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM that, in a single inference pass, provides fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
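As a rough illustration of Best-of-N selection with a step-level reward model, the sketch below scores each candidate trajectory and keeps the highest-scoring one. It is a minimal sketch under stated assumptions, not the paper's implementation: `score_steps` stands in for a call to a process reward model, the product aggregation is one common choice among several, and the candidate labels and scores are invented.

```python
import math

def aggregate(step_scores):
    """Collapse per-step scores into one trajectory score (product: an assumption)."""
    return math.prod(step_scores)

def best_of_n(candidates, score_steps):
    """Return the candidate whose aggregated step score is highest."""
    return max(candidates, key=lambda c: aggregate(score_steps(c)))

# Hypothetical per-step scores for three sampled trajectories.
toy_scores = {
    "A": [0.9, 0.2, 0.9],  # one weak step, e.g. a wrong tool call
    "B": [0.8, 0.8, 0.7],
    "C": [0.6, 0.6, 0.6],
}

# The lookup stands in for querying a reward model on each trajectory.
best = best_of_n(list(toy_scores), lambda c: toy_scores[c])
```

Multiplying step scores penalizes a trajectory with a single badly scored step, which is why trajectory "A" loses to "B" here despite having two high-scoring steps. Taking the minimum or the mean of the step scores are alternative aggregations with different trade-offs.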