MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

πŸ“… 2026-01-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing 3D medical vision-language models struggle to simultaneously achieve fine-grained spatial localization and high-level semantic reasoning, and lack a unified architecture for multimodal task coordination. This work proposes the first unified 3D medical multimodal model that integrates image-level language reasoning and pixel-level prompt-driven segmentation within a single framework, supporting radiology report generation, visual question answering, and semantic, referential, and interactive segmentation. Built upon SAM2, the model incorporates a volumetric segmentation module and leverages large-scale 3D CT–text paired data through pretraining and multi-stage joint training, accommodating diverse multimodal prompts including text, points, and bounding boxes. It achieves state-of-the-art performance across multiple 3D tasks, substantially advancing capabilities in visual grounding, interactive segmentation, and cross-modal reasoning.
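
The summary describes a single model entry point that accepts a CT volume together with a text, point, or box prompt and returns both language output and a voxel mask. As a rough illustration only, the PyTorch sketch below shows what such a unified interface could look like; every class, method, and tensor shape here is a hypothetical simplification (the paper's code is not shown on this page), and the rasterized prompt channel is a stand-in for SAM2's learned sparse prompt embeddings.

```python
# Hedged sketch of a unified prompt-driven 3D interface. All names are
# hypothetical; this is not the paper's actual architecture or API.
from dataclasses import dataclass
from typing import Optional
import torch
import torch.nn as nn


@dataclass
class Prompt:
    text: Optional[str] = None             # e.g. "segment the left kidney"
    points: Optional[torch.Tensor] = None  # (N, 3) voxel coords (z, y, x)
    box: Optional[torch.Tensor] = None     # (6,) = (z1, y1, x1, z2, y2, x2)


class MedVLSAM2Sketch(nn.Module):
    """Toy stand-in: a shared 3D encoder feeding a language head
    (report/VQA) and a SAM2-style voxel mask decoder."""

    def __init__(self, dim: int = 32, vocab: int = 128):
        super().__init__()
        # Two input channels: the CT volume plus a rasterized prompt map.
        self.encoder = nn.Conv3d(2, dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv3d(dim, 1, kernel_size=1)
        self.lang_head = nn.Linear(dim, vocab)

    @staticmethod
    def rasterize_prompt(prompt: Prompt, shape) -> torch.Tensor:
        # Point/box prompts become an extra channel here for brevity;
        # SAM2 itself uses learned prompt embeddings instead.
        p = torch.zeros(shape)
        if prompt.points is not None:
            for z, y, x in prompt.points.long():
                p[:, 0, z, y, x] = 1.0
        if prompt.box is not None:
            z1, y1, x1, z2, y2, x2 = prompt.box.long().tolist()
            p[:, 0, z1:z2, y1:y2, x1:x2] = 0.5
        return p

    def forward(self, volume: torch.Tensor, prompt: Prompt):
        # A real model would route prompt.text through a language
        # encoder; that path is omitted in this sketch.
        pmap = self.rasterize_prompt(prompt, volume.shape)
        feats = self.encoder(torch.cat([volume, pmap], dim=1))   # (B, C, D, H, W)
        mask_logits = self.mask_head(feats)                      # voxel-level mask
        text_logits = self.lang_head(feats.mean(dim=(2, 3, 4)))  # toy token scores
        return text_logits, mask_logits


model = MedVLSAM2Sketch()
ct = torch.randn(1, 1, 8, 32, 32)                                # toy CT volume
text_logits, mask_logits = model(ct, Prompt(points=torch.tensor([[4, 16, 16]])))
print(text_logits.shape, mask_logits.shape)                      # (1, 128), (1, 1, 8, 32, 32)
```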

πŸ“ Abstract
Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
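
To make the second training stage concrete, below is a minimal sketch of a joint objective combining a next-token language loss with a soft Dice segmentation loss. The loss choices, the `lam_seg` balancing weight, and all tensor shapes are illustrative assumptions for this page, not the paper's reported configuration.

```python
# Hedged sketch of a joint language + segmentation objective.
# lam_seg and both loss choices are hypothetical.
import torch
import torch.nn.functional as F


def dice_loss(logits: torch.Tensor, target: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss over a binary voxel mask."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    union = probs.sum() + target.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)


def joint_loss(text_logits, text_targets, mask_logits, mask_targets,
               lam_seg: float = 1.0) -> torch.Tensor:
    """Stage-two objective: token cross-entropy + weighted Dice."""
    lm = F.cross_entropy(text_logits, text_targets)   # language understanding
    seg = dice_loss(mask_logits, mask_targets)        # pixel-level perception
    return lm + lam_seg * seg


# Toy usage with random tensors standing in for model outputs.
text_logits = torch.randn(8, 128)                     # (tokens, vocab)
text_targets = torch.randint(0, 128, (8,))
mask_logits = torch.randn(1, 1, 8, 32, 32)
mask_targets = (torch.rand(1, 1, 8, 32, 32) > 0.5).float()
print(joint_loss(text_logits, text_targets, mask_logits, mask_targets))
```
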
Problem

Research questions and friction points this paper is trying to address.

3D medical vision-language model
visual grounding
volumetric spatial reasoning
multimodal reasoning
prompt-driven segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D medical vision-language model
unified multimodal architecture
prompt-driven segmentation
volumetric spatial reasoning
SAM2-based segmentation