GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

📅 2026-05-01

🤖 AI Summary
Existing process reward models (PRMs) are largely confined to mathematical reasoning and struggle to detect errors in intermediate steps across general-purpose reasoning scenarios, a problem compounded by the absence of cross-domain evaluation benchmarks. To address this gap, this work proposes GR-Ben, the first process-level evaluation benchmark for general reasoning, spanning nine subdomains across scientific and logical reasoning. Built from multi-domain data with error-annotated samples, the benchmark is used to systematically evaluate 22 models, including both PRMs and large language models (LLMs). Experimental results show that current models detect errors substantially less well outside mathematical contexts: PRMs struggle to identify knowledge-related errors, while LLMs underperform in detecting computational mistakes.
📝 Abstract
Process reward models (PRMs) have shown remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs must be able to detect process-level errors in real-world scenarios. However, existing benchmarks focus primarily on mathematical reasoning and therefore fail to comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To close this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRMs' performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker than in mathematical reasoning. (2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs perform worse at detecting computational errors. We hope GR-Ben can foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
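The abstract does not say how error detection is scored, so the following is only a minimal sketch of one common way to frame process-level error detection, not the paper's protocol. It assumes a PRM emits one correctness score per reasoning step, that the first step scoring below a threshold is taken as the predicted error location, and that each sample is labeled with the index of its first erroneous step (or `None` if the solution is error-free). The function names, the 0.5 threshold, and the toy scores are all hypothetical.

```python
from typing import Optional, Sequence


def first_error_step(step_scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`, or None.

    Assumption: higher scores mean the step is more likely correct.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


def detection_accuracy(
    predictions: Sequence[Optional[int]], labels: Sequence[Optional[int]]
) -> float:
    """Fraction of samples whose predicted first-error index matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical per-step PRM scores for three reasoning chains.
scores = [
    [0.9, 0.8, 0.3, 0.7],  # score drops at step 2
    [0.95, 0.9, 0.85],     # no step falls below the threshold
    [0.4, 0.9],            # score is low at step 0
]
# Hypothetical gold labels: index of the first erroneous step, None if error-free.
gold = [2, None, 1]

preds = [first_error_step(chain) for chain in scores]
acc = detection_accuracy(preds, gold)
print(preds, acc)
```

In this toy data the third chain is flagged too early, so two of the three predictions match. An exact-match metric like this rewards locating the error, not merely noticing that one exists; other scoring choices (for example, separate accuracy on erroneous and error-free samples) are equally plausible.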
Problem

Research questions and friction points this paper is trying to address.

process reward models
reasoning benchmark
error detection
general reasoning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Models
General Reasoning Benchmark
Error Detection
Test-Time Scaling
Large Language Models
Zhouhao Sun
Harbin Institute of Technology
NLP
Xuan Zhang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Xiao Ding
Harbin Institute of Technology
Natural Language Processing, Artificial Intelligence
Bibo Cai
Harbin Institute of Technology
NLP
Li Du
BAAI
LLM, NLP, Data Science, Interpretable AI
Kai Xiong
Harbin Institute of Technology
Event-Centric Reasoning, Large Language Models, Event Graph
Xinran Dai
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Fei Zhang
Shanghai Jiao Tong University
Machine Learning, Computer Vision
Weidi Tang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Zhiyuan Kan
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Yang Zhao
Research Professor, Zhejiang University, China
Intelligent Building, Smart Grid, Fault detection and diagnosis, Energy efficiency
Bing Qin
Professor in Harbin Institute of Technology
Natural Language Processing, Information Extraction, Sentiment Analysis
Ting Liu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China