MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical image segmentation faces challenges including implicit clinical instructions, weak reasoning capabilities, and low mask precision. To address these, this work introduces “reasoning-driven medical image segmentation” — a novel task requiring models to actively reason over complex clinical semantic instructions and generate accurate segmentation masks. We propose MedSeg-R, an end-to-end framework featuring dual collaborative modules: (1) a global semantic understanding module leveraging multimodal large language models (MLLMs) and cross-modal intermediate tokens for clinical intent parsing; and (2) a pixel-level decoding head for precise localization. Additionally, we release MedSeg-QA, the first large-scale, multi-turn dialogue dataset for medical segmentation, combining LLM-assisted auto-annotation with expert physician refinement. Evaluated on multiple benchmarks, MedSeg-R achieves a 3.2% Dice score improvement and simultaneously generates interpretable clinical text analyses, demonstrating both the efficacy of reasoning-driven segmentation and its clinical applicability.

📝 Abstract
Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities needed to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also producing precise, corresponding segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
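
To make the two-module design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a global context understanding module fuses image and question tokens into multi-modal intermediate tokens, and a pixel-level grounding head decodes those tokens into a segmentation mask. All class names, layer choices, and dimensions here are illustrative assumptions, not the authors' released implementation, which couples a full MLLM with a segmentation decoder.

```python
# Hypothetical sketch of a MedSeg-R-style two-stage pipeline.
# Class names and layers are assumptions for illustration only.
import torch
import torch.nn as nn


class GlobalContextModule(nn.Module):
    """Stand-in for the MLLM-based global context understanding module."""

    def __init__(self, hidden_dim: int = 256, num_tokens: int = 8, vocab_size: int = 32000):
        super().__init__()
        # Patch-embed the image and embed the tokenized clinical question.
        self.image_encoder = nn.Conv2d(3, hidden_dim, kernel_size=16, stride=16)
        self.text_encoder = nn.Embedding(vocab_size, hidden_dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Learnable queries that become the multi-modal intermediate tokens.
        self.intermediate_queries = nn.Parameter(torch.randn(num_tokens, hidden_dim))

    def forward(self, image: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        img_tokens = self.image_encoder(image).flatten(2).transpose(1, 2)  # (B, N_img, D)
        txt_tokens = self.text_encoder(question_ids)                       # (B, N_txt, D)
        queries = self.intermediate_queries.expand(image.size(0), -1, -1)  # (B, K, D)
        fused = self.fusion(torch.cat([queries, img_tokens, txt_tokens], dim=1))
        return fused[:, : queries.size(1)]  # keep only the intermediate tokens


class PixelGroundingHead(nn.Module):
    """Stand-in for the pixel-level grounding module that decodes tokens to a mask."""

    def __init__(self, hidden_dim: int = 256, out_size: int = 224):
        super().__init__()
        self.out_size = out_size
        self.to_mask = nn.Linear(hidden_dim, out_size * out_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Pool the intermediate tokens and decode per-pixel mask logits.
        pooled = tokens.mean(dim=1)
        return self.to_mask(pooled).view(-1, 1, self.out_size, self.out_size)


if __name__ == "__main__":
    image = torch.randn(1, 3, 224, 224)              # toy single-slice scan
    question_ids = torch.randint(0, 32000, (1, 16))  # tokenized clinical question
    tokens = GlobalContextModule()(image, question_ids)
    mask_logits = PixelGroundingHead()(tokens)
    print(mask_logits.shape)                         # torch.Size([1, 1, 224, 224])
```

In this sketch the intermediate tokens play the role of the cross-modal interface between the two modules; the real grounding module additionally emits the textual response described in the abstract.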
Problem

Research questions and friction points this paper is trying to address.

Enabling segmentation from complex implicit medical instructions
Improving precision of segmentation masks in medical QA tasks
Integrating reasoning and segmentation in medical image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages MLLMs for clinical question interpretation
Integrates global context and pixel-level grounding modules
Introduces the MedSeg-QA dataset of image-mask pairs with physician-reviewed multi-turn dialogues (a possible record layout is sketched below)
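
Based on the paper's description of MedSeg-QA (over 10,000 image-mask pairs plus multi-turn conversations, LLM-annotated and physician-refined), a single record might plausibly look like the following. The field names and dialogue content are assumptions for illustration, not the dataset's published schema.

```python
# Hypothetical MedSeg-QA record layout -- field names are illustrative
# assumptions inferred from the paper's description, not the released schema.
record = {
    "image": "images/case_0042_ct_axial.png",    # medical image slice
    "mask": "masks/case_0042_lesion.png",        # binary segmentation mask
    "conversation": [                            # multi-turn clinical dialogue
        {"role": "user",
         "text": "The patient reports persistent right upper-quadrant pain. "
                 "Which structure should be examined, and where is it?"},
        {"role": "assistant",
         "text": "The hypodense hepatic lesion is the most likely cause; "
                 "it is segmented in the returned mask.",
         "mask_ref": "masks/case_0042_lesion.png"},
    ],
    "annotation_source": "llm_auto",             # LLM-assisted auto-annotation...
    "reviewed_by_physician": True,               # ...refined through expert review
}
```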
Yu Huang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China.
Zelin Peng
Shanghai Jiao Tong University
Computer Vision · Medical Image Processing
Yichen Zhao
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China.
Piao Yang
Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China.
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China.
Wei Shen
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China.