MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of fine-grained evaluation benchmarks for process reward models (PRMs) in medical reasoning, a gap that leaves large language models' ability to detect errors in clinical reasoning, and therefore their safety in healthcare use, unmeasured. The authors introduce MedPRMBench, the first process-level PRM benchmark for the medical domain, built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs). The benchmark comprises 6,500 questions, 13,000 reasoning chains, and 113,910 step-level labels drawn from seven medical question-answering sources, plus 6,879 questions for training. It defines 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) and a four-level severity grading system that quantifies clinical impact. The authors' medical PRM baseline achieves an overall PRMScore of 87.1% and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2–6.7 percentage points.

📝 Abstract

Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics, failing to address medical reasoning, which is uniquely characterized by safety criticality, knowledge intensity, and diverse error patterns. Without a reliable medical PRM evaluation framework, we cannot quantify models' error detection capabilities in clinical reasoning, leaving their safety in real-world healthcare applications unverified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) with the first 4-level severity grading system to quantify clinical impact. The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training. Our medical PRM baseline achieves an 87.1% overall PRMScore, substantially surpassing all baselines, and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2–6.7 percentage points. Systematic evaluation spanning proprietary frontier models, open-source reasoning models, and medical-specialized models reveals critical weaknesses in current models' medical reasoning error detection capabilities, providing clear directions for future PRM improvement.
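
The abstract does not say how the PRM is applied as a plug-and-play verifier. One common pattern is best-of-N reranking: sample several reasoning chains, score each step, and keep the answer whose chain scores highest. The Python sketch below illustrates that pattern only. The names `Candidate`, `StepScorer`, `chain_score`, `select_best`, and `toy_scorer` are hypothetical, the min-over-steps aggregation is an assumption, and the toy scorer is a keyword stand-in for a trained PRM, not the authors' model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One candidate answer together with its step-by-step reasoning chain."""
    answer: str
    steps: List[str]


# Hypothetical scorer interface: (question, steps) -> one score in [0, 1] per step.
StepScorer = Callable[[str, Sequence[str]], List[float]]


def chain_score(step_scores: Sequence[float]) -> float:
    """Aggregate step scores with min (an assumption, not taken from the paper):
    a chain is treated as only as sound as its weakest step."""
    return min(step_scores) if step_scores else 0.0


def select_best(question: str,
                candidates: Sequence[Candidate],
                scorer: StepScorer) -> Candidate:
    """Return the candidate whose reasoning chain gets the highest aggregate score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: chain_score(scorer(question, c.steps)))


def toy_scorer(question: str, steps: Sequence[str]) -> List[float]:
    """Stand-in for a trained PRM: gives a low score to any step containing 'ignore'."""
    return [0.1 if "ignore" in step.lower() else 0.9 for step in steps]


if __name__ == "__main__":
    question = "A patient presents with chest pain. What is the next step?"
    candidates = [
        Candidate("A", ["Note the chest pain", "Ignore the ECG finding"]),
        Candidate("B", ["Note the chest pain", "Interpret ST elevation on the ECG"]),
    ]
    best = select_best(question, candidates, toy_scorer)
    print(best.answer)
```

Because the scorer is passed in as a plain callable, the selection logic stays independent of whichever PRM produces the step scores. Other aggregations (mean, product, last-step score) could be swapped in at `chain_score`.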

Problem

Research questions and friction points this paper is trying to address.

Process Reward Models

Medical Reasoning

Benchmark

Error Detection

Clinical Safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model

Medical Reasoning

Fine-grained Benchmark