MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical multimodal large language models (MLLMs) rely on explicit spatial annotations for region-of-interest (ROI) localization, which limits their applicability to the implicit queries prevalent in clinical practice. To address this, we propose the Unified Medical Reasoning Grounding (UMRG) task and introduce the first framework for implicit queries that decouples reasoning from segmentation. Our method pioneers the integration of reinforcement learning into medical vision-language modeling, featuring a dual reward mechanism that jointly optimizes output-format fidelity and localization accuracy. Leveraging U-MRG-14K, a dataset of 14K high-quality reasoning trajectories, our approach achieves end-to-end ROI localization without pixel-level supervision. It employs a frozen segmentation expert as the localizer and an MLLM as the reasoning engine, markedly improving generalization to unseen clinical queries. Evaluated across multiple metrics, our method establishes new state-of-the-art performance, enhancing both accuracy and interpretability in medical image understanding.

📝 Abstract
Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
Problem

Research questions and friction points this paper is trying to address.

Accurately grounding ROIs in medical imaging for diagnosis
Handling implicit clinical queries without spatial hints
Unifying clinical reasoning with pixel-level segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes MLLM reasoning
Modular framework separates reasoning and segmentation
Frozen segmentation expert converts prompts to masks
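The dual reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>`/`<answer>` tag names, the box-based spatial prompt, and the equal weighting are all assumptions made for the example.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the MLLM output follows the expected template: a reasoning
    trace in <think>...</think> followed by a spatial prompt in
    <answer>...</answer>. (Tag names are illustrative, not from the paper.)"""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), re.DOTALL) else 0.0

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def accuracy_reward(pred_box, gt_box) -> float:
    """Localization reward: IoU between the predicted spatial prompt
    (here a box handed to the frozen segmentation expert) and the
    ground-truth region."""
    return iou(pred_box, gt_box)

def dual_reward(output, pred_box, gt_box, w_fmt=0.5, w_acc=0.5) -> float:
    """Weighted combination of format fidelity and localization accuracy,
    used as the scalar RL signal for the reasoner."""
    return w_fmt * format_reward(output) + w_acc * accuracy_reward(pred_box, gt_box)
```

Because only the reasoner is trained against this scalar signal while the segmentation expert stays frozen, no pixel-level gradient supervision is needed.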
Zhonghao Yan
Beijing University of Posts and Telecommunications
Vision Language Model, Agent, Generative AI, Medical Image Analysis
Muxi Diao
Beijing University of Posts and Telecommunications, Zhongguancun Academy
Yuxuan Yang
Beijing University of Posts and Telecommunications
Jiayuan Xu
Beijing University of Posts and Telecommunications
Kaizhou Zhang
Beijing University of Posts and Telecommunications
Ruoyan Jing
Beijing University of Posts and Telecommunications
Lele Yang
Beijing University of Posts and Telecommunications
Yanxi Liu
Beijing Information Science and Technology University
Kongming Liang
Beijing University of Posts and Telecommunications
Computer Vision, Pattern Recognition, Machine Learning
Zhanyu Ma
Beijing University of Posts and Telecommunications
Pattern Recognition, Machine Learning, Computer Vision, Multimedia Technology, Deep Learning