ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification

📅 2025-04-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Medical AI models often neglect the structured reasoning processes inherent in clinical diagnosis, limiting both diagnostic accuracy and interpretability. To address this, we propose a reasoning-enhanced multimodal large language model (MLLM) for chest X-ray interpretation, introducing the first clinical report-driven process supervision paradigm. Our method automatically mines and refines structured reasoning chains from radiology reports, enabling stepwise, verification-based diagnosis aligned with physician cognition. We design a two-stage training framework comprising supervised fine-tuning followed by process-aware reinforcement learning with reward modeling. We release RadRBench-CXR, the first benchmark dedicated to reasoning evaluation on chest X-rays, and RadRScore, a comprehensive metric assessing reasoning fidelity, diagnostic correctness, and clinical alignment. Experiments demonstrate a 16% improvement in reasoning capability and a 3.3% gain in diagnostic accuracy over state-of-the-art medical MLLMs. All data, models, and code are publicly available.


๐Ÿ“ Abstract
Recent advances in reasoning-enhanced large language models (LLMs) and multimodal LLMs (MLLMs) have significantly improved performance in complex tasks, yet medical AI models often overlook the structured reasoning processes inherent in clinical practice. In this work, we present ChestX-Reasoner, a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports, reflecting the step-by-step reasoning followed by radiologists. We construct a large dataset by extracting and refining reasoning chains from routine radiology reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards. We introduce RadRBench-CXR, a comprehensive benchmark featuring 59K visual question answering samples with 301K clinically validated reasoning steps, and propose RadRScore, a metric evaluating reasoning factuality, completeness, and effectiveness. ChestX-Reasoner outperforms existing medical and general-domain MLLMs in both diagnostic accuracy and reasoning ability, achieving 16%, 5.9%, and 18% improvements in reasoning ability compared to the best medical MLLM, the best general MLLM, and its base model, respectively, as well as 3.3%, 24%, and 27% improvements in outcome accuracy. All resources are open-sourced to facilitate further research in medical reasoning MLLMs.
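The abstract names the three dimensions RadRScore evaluates: reasoning factuality, completeness, and effectiveness. As an illustrative sketch only (the paper's actual formula is not reproduced here; the per-dimension definitions in the docstring and the unweighted mean are assumptions), an aggregation over those three sub-scores might look like:

```python
def radrscore(factuality: float, completeness: float, effectiveness: float) -> float:
    """Toy aggregation of the three reasoning dimensions named in the abstract.

    Each input is assumed to be a ratio in [0, 1], e.g. (hypothetical readings):
      factuality    - fraction of generated reasoning steps verified as factually correct
      completeness  - fraction of reference reasoning steps covered by the generation
      effectiveness - fraction of generated steps that contribute to the final diagnosis
    The unweighted mean below is an assumption, not the paper's definition.
    """
    for name, value in (("factuality", factuality),
                        ("completeness", completeness),
                        ("effectiveness", effectiveness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (factuality + completeness + effectiveness) / 3.0
```

For example, a prediction whose steps are 90% factual, cover 60% of the reference steps, and are 75% effective would score (0.9 + 0.6 + 0.75) / 3 = 0.75 under this toy scheme.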
Problem

Research questions and friction points this paper is trying to address.

Enhancing radiology AI with clinical reasoning steps
Improving diagnostic accuracy via structured process supervision
Validating reasoning quality in medical visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages process supervision from clinical reports
Two-stage training with fine-tuning and reinforcement learning
Introduces RadRBench-CXR benchmark and RadRScore metric