Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) face a critical bottleneck in clinical decision-making: they cannot precisely locate and correct errors within complex diagnostic reasoning chains. Method: This paper proposes a fine-grained process reward modeling (PRM) framework that performs retrieval-augmented clinical validation at each step of the reasoning chain, grounded in authoritative medical guidelines and peer-reviewed literature. It introduces a “stepwise guideline alignment” evaluation mechanism that enables precise error localization and interpretable, evidence-based correction. Contribution/Results: The method enables lightweight models (e.g., 8B-parameter LLMs) to exceed 80% accuracy on MedQA for the first time. It attains state-of-the-art performance across five medical QA benchmarks and two open-ended diagnostic tasks, improving base models by up to 13.50%. By grounding reasoning steps in clinical evidence and enabling transparent error diagnosis, the framework enhances the reliability and trustworthiness of clinical reasoning in LLMs.

📝 Abstract

Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can assess reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/
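The stepwise verification described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `verify_steps`, `first_error`, `TOY_GUIDELINES`, `toy_retrieve`, and `toy_score` are hypothetical names, the retriever is a keyword match over two made-up sentences, and the scorer is a placeholder for a trained reward model. The 0.5 threshold is likewise an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for a guideline/literature index.
TOY_GUIDELINES = [
    "Chest pain with ST elevation suggests myocardial infarction.",
    "Aspirin is indicated in suspected myocardial infarction.",
]


@dataclass
class StepVerdict:
    index: int            # position of the step in the reasoning chain
    step: str             # the reasoning step text
    evidence: List[str]   # documents retrieved for this step
    score: float          # reward assigned to this step


def content_words(text: str) -> set:
    """Lowercase alphabetic tokens longer than four characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 4}


def toy_retrieve(question: str, step: str) -> List[str]:
    """Toy retriever: return guidelines sharing a content word with the step.
    The question is ignored here; a real retriever would likely use it."""
    words = content_words(step)
    return [doc for doc in TOY_GUIDELINES if words & content_words(doc)]


def toy_score(question: str, previous: List[str], step: str,
              evidence: List[str]) -> float:
    """Placeholder for a learned reward model: 1.0 if any evidence was found."""
    return 1.0 if evidence else 0.0


def verify_steps(question: str, steps: List[str],
                 retrieve: Callable[[str, str], List[str]],
                 score_step: Callable[[str, List[str], str, List[str]], float],
                 ) -> List[StepVerdict]:
    """Retrieve evidence for each step, then score the step against it."""
    verdicts = []
    for i, step in enumerate(steps):
        evidence = retrieve(question, step)
        score = score_step(question, steps[:i], step, evidence)
        verdicts.append(StepVerdict(i, step, evidence, score))
    return verdicts


def first_error(verdicts: List[StepVerdict],
                threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scoring below the threshold, or None."""
    for v in verdicts:
        if v.score < threshold:
            return v.index
    return None
```

Because each step carries its own score and evidence, a low-scoring step can be reported by position instead of only flagging the final answer as wrong, which is the kind of error localization the abstract refers to.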

Problem

Research questions and friction points this paper is trying to address.

Localizing and correcting errors in clinical reasoning steps

Improving medical diagnosis accuracy via stepwise verification

Enhancing small models' performance with evidence-based reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Process reward modeling for medical reasoning

Retrieval-augmented stepwise guideline verification

Plug-and-play integration with policy models
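One common way to pair a process reward model with a separate policy model is best-of-N reranking: the policy samples several reasoning chains and the reward model picks one. The summary above does not spell out how the integration works, so the Python sketch below is an assumption. `select_best`, `aggregate_min`, and `aggregate_mean` are hypothetical names, and the step scores are hard-coded instead of coming from a trained model.

```python
from typing import Callable, Dict, List, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """A chain is only as good as its weakest step."""
    return min(step_scores) if step_scores else 0.0


def aggregate_mean(step_scores: Sequence[float]) -> float:
    """Average step quality; tolerates a single weak step."""
    return sum(step_scores) / len(step_scores) if step_scores else 0.0


def select_best(candidates: List[List[str]],
                score_step: Callable[[str], float],
                aggregate: Callable[[Sequence[float]], float] = aggregate_min,
                ) -> int:
    """Return the index of the candidate chain with the highest aggregate score.
    Ties resolve to the earliest candidate."""
    chain_scores = [aggregate([score_step(s) for s in chain])
                    for chain in candidates]
    return max(range(len(candidates)), key=chain_scores.__getitem__)


# Hypothetical per-step rewards standing in for a reward model's output.
TOY_SCORES: Dict[str, float] = {"a": 0.95, "b": 0.4, "c": 0.6, "d": 0.6}
candidates = [["a", "b"], ["c", "d"]]

best_by_min = select_best(candidates, TOY_SCORES.__getitem__)
best_by_mean = select_best(candidates, TOY_SCORES.__getitem__, aggregate_mean)
```

In this toy data the two aggregators disagree: the minimum rejects the first chain because of its weak second step, while the mean prefers it. Which aggregation suits a given reward model is a design choice that the summary does not settle.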

🔎 Similar Papers

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting