Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing PRM evaluation benchmarks focus solely on step-level correctness and lack the capacity to identify systematic reasoning errors. Method: We introduce the first PRM benchmark covering six systematic reasoning patterns—Transformation, Decomposition, Regather, Deduction, Verification, and Integration—comprising 2,995 manually constructed reasoning paths with controlled defects. Our evaluation framework is structured by reasoning pattern and employs a comparative assessment paradigm integrating PRMs with prompt-based LLM critics. Contribution/Results: Empirical results reveal that mainstream PRMs achieve error detection rates below 60% across most patterns, with particularly poor performance on Regather and Integration (<45%), exposing critical cross-pattern robustness deficits. This benchmark enables the first systematic, interpretable evaluation of PRMs' ability to detect multi-paradigm errors in intermediate reasoning steps, establishing a new diagnostic standard for PRM analysis and improvement.

📝 Abstract
Process Reward Models (PRMs) play a crucial role in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, and may make errors under any of these patterns. PRMs are therefore required to identify errors arising under various reasoning patterns during the reasoning process. However, existing benchmarks mainly evaluate PRMs on stepwise correctness, lacking a systematic evaluation of PRMs across reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns: Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2,995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in evaluating reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Process Reward Models (PRMs) under diverse reasoning patterns
Identifying errors in intermediate reasoning steps across six patterns
Addressing gaps in existing benchmarks for systematic PRM assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Socratic-PRMBench for systematic PRM evaluation
Tests PRMs under six diverse reasoning patterns
Includes 2,995 flawed reasoning paths for analysis
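The evaluation described above reduces to a step-level error-detection task: given reasoning paths with a known injected flaw, check whether a PRM or LLM critic flags the correct step, aggregated per reasoning pattern. The sketch below illustrates this scoring logic; the data schema, `mock_prm_judge`, and field names are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical sketch of per-pattern error-detection scoring over
# Socratic-PRMBench-style data. Schema and judge are assumptions.
from collections import defaultdict

# Each path: reasoning pattern, list of steps, and the index of the
# injected flaw (None for a clean path).
paths = [
    {"pattern": "Decomposition",
     "steps": ["split task", "solve part A", "solve part B (flawed)"],
     "flaw_index": 2},
    {"pattern": "Deduction",
     "steps": ["premise", "valid inference", "conclusion"],
     "flaw_index": None},
]

def mock_prm_judge(step: str) -> bool:
    """Stand-in for a PRM or LLM critic: True if the step looks flawed.
    A real judge would score each step with a trained reward model."""
    return "flawed" in step

def detection_rate_by_pattern(paths, judge):
    hits, totals = defaultdict(int), defaultdict(int)
    for p in paths:
        if p["flaw_index"] is None:
            continue  # only flawed paths count toward detection rate
        totals[p["pattern"]] += 1
        # Credit the judge only if the first step it flags is the true flaw.
        flagged = next((i for i, s in enumerate(p["steps"]) if judge(s)), None)
        if flagged == p["flaw_index"]:
            hits[p["pattern"]] += 1
    return {k: hits[k] / totals[k] for k in totals}

print(detection_rate_by_pattern(paths, mock_prm_judge))
```

Reporting a rate per pattern (rather than one pooled accuracy) is what surfaces the cross-pattern gaps the benchmark is designed to expose.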
Xiang Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Haiyang Yu
Tongyi Lab, Alibaba Group
Xinghua Zhang
Tongyi Lab, Alibaba Group
Large Language Model · Low Resource · Information Extraction
Ziyang Huang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Shizhu He
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Kang Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Jun Zhao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Fei Huang
Tongyi Lab, Alibaba Group
Yongbin Li
Tongyi Lab, Alibaba Group