PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited natural language interaction and spatial reasoning capabilities in medical image segmentation, this paper introduces the first vision-language segmentation framework explicitly designed for positional reasoning. Methodologically, it pioneers the integration of vision-language large models into medical segmentation, proposing a position-aware multimodal reasoning architecture that jointly incorporates prompt learning, cross-modal alignment, and explicit spatial relation modeling. The authors further construct MMRS, the first medical dataset annotated with spatial relationships. Experiments span six imaging modalities (CT, MRI, X-ray, ultrasound, endoscopy, and RGB) and demonstrate significant improvements over state-of-the-art methods in both segmentation accuracy and positional reasoning performance. This work establishes an interpretable and generalizable technical foundation for natural language-driven, interactive clinical diagnosis.

📝 Abstract
Recent advancements in prompt-based medical image segmentation have enabled clinicians to identify tumors using simple inputs such as bounding boxes or text prompts. However, existing methods face challenges when doctors need to interact through natural language or when position reasoning is required: understanding spatial relationships between anatomical structures and pathologies. We present PRS-Med, a framework that integrates vision-language models with segmentation capabilities to generate both accurate segmentation masks and corresponding spatial reasoning outputs. Additionally, we introduce the MMRS dataset (Multimodal Medical in Positional Reasoning Segmentation), which provides diverse, spatially grounded question-answer pairs to address the lack of position reasoning data in medical imaging. PRS-Med demonstrates superior performance across six imaging modalities (CT, MRI, X-ray, ultrasound, endoscopy, RGB), significantly outperforming state-of-the-art methods in both segmentation accuracy and position reasoning. Our approach enables intuitive doctor-system interaction through natural language, facilitating more efficient diagnoses. Our dataset pipeline, model, and codebase will be released to foster further research in spatially aware multimodal reasoning for medical applications.
Problem

Research questions and friction points this paper is trying to address.

Enhancing medical image segmentation with position reasoning
Integrating vision-language models for spatial relationship understanding
Addressing lack of positional reasoning data in medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates vision-language models with segmentation capabilities
Generates both accurate segmentation masks and spatial reasoning outputs
Introduces the MMRS dataset for position reasoning