Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing PRM evaluation benchmarks focus solely on step-level correctness and lack the capacity to identify systematic reasoning errors. Method: We introduce the first PRM benchmark covering six systematic reasoning patterns—Transformation, Decomposition, Regather, Deduction, Verification, and Integration—comprising 2,995 manually constructed reasoning paths with controlled defects. Our evaluation framework is structured by reasoning pattern and employs a comparative assessment paradigm integrating PRMs with prompt-based LLM critics. Contribution/Results: Empirical results reveal that mainstream PRMs achieve error detection rates below 60% across most patterns, with particularly poor performance on Regather and Integration (<45%), exposing critical cross-pattern robustness deficits. This benchmark enables the first systematic, interpretable evaluation of PRMs' ability to detect multi-paradigm errors in intermediate reasoning steps, establishing a new diagnostic standard for PRM analysis and improvement.

📝 Abstract
Process Reward Models (PRMs) play a crucial role in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, and may make errors under any of these patterns. PRMs are therefore required to identify errors arising under various reasoning patterns during the reasoning process. However, existing benchmarks mainly evaluate PRMs on stepwise correctness, lacking a systematic evaluation of PRMs across reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns: Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2,995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in evaluating reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Process Reward Models (PRMs) under diverse reasoning patterns
Identifying errors in intermediate reasoning steps across six patterns
Addressing gaps in existing benchmarks for systematic PRM assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Socratic-PRMBench for systematic PRM evaluation
Tests PRMs under six diverse reasoning patterns
Includes 2,995 flawed reasoning paths for analysis
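The evaluation described above reduces to a step-level error-detection task: given reasoning paths with a known injected flaw, check whether a PRM or LLM critic flags the correct step, aggregated per reasoning pattern. The sketch below illustrates this scoring logic; the data schema, `mock_prm_judge`, and field names are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical sketch of per-pattern error-detection scoring over
# Socratic-PRMBench-style data. Schema and judge are assumptions.
from collections import defaultdict

# Each path: reasoning pattern, list of steps, and the index of the
# injected flaw (None for a clean path).
paths = [
    {"pattern": "Decomposition",
     "steps": ["split task", "solve part A", "solve part B (flawed)"],
     "flaw_index": 2},
    {"pattern": "Deduction",
     "steps": ["premise", "valid inference", "conclusion"],
     "flaw_index": None},
]

def mock_prm_judge(step: str) -> bool:
    """Stand-in for a PRM or LLM critic: True if the step looks flawed.
    A real judge would score each step with a trained reward model."""
    return "flawed" in step

def detection_rate_by_pattern(paths, judge):
    hits, totals = defaultdict(int), defaultdict(int)
    for p in paths:
        if p["flaw_index"] is None:
            continue  # only flawed paths count toward detection rate
        totals[p["pattern"]] += 1
        # Credit the judge only if the first step it flags is the true flaw.
        flagged = next((i for i, s in enumerate(p["steps"]) if judge(s)), None)
        if flagged == p["flaw_index"]:
            hits[p["pattern"]] += 1
    return {k: hits[k] / totals[k] for k in totals}

print(detection_rate_by_pattern(paths, mock_prm_judge))
```

Reporting a rate per pattern (rather than one pooled accuracy) is what surfaces the cross-pattern gaps the benchmark is designed to expose.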
Xiang Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Haiyang Yu
Tongyi Lab, Alibaba Group
Xinghua Zhang
Tongyi Lab, Alibaba Group
Large Language Model · Low Resource · Information Extraction
Ziyang Huang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Shizhu He
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Kang Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Jun Zhao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Fei Huang
Tongyi Lab, Alibaba Group
Yongbin Li
Tongyi Lab, Alibaba Group