VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

πŸ“… 2024-10-16
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
Medical vision-language models face three key challenges: overreliance on a single visual grounding paradigm, limited capability in processing 3D medical imaging, and severe scarcity of annotated data. To address these, the authors propose VividMed, a unified multimodal vision-language model tailored for medical applications. The method introduces a visual grounding architecture jointly supporting semantic segmentation and instance-level localization; a multi-scale encoder with a dedicated 3D convolutional adaptation module to handle both 2D and 3D inputs; and a three-stage training strategy complemented by an automated medical text–image synthesis pipeline leveraging open-source models and datasets. Evaluated on multiple medical visual grounding benchmarks, VividMed achieves state-of-the-art performance. It also significantly improves downstream capabilities in medical visual question answering and radiology report generation. The implementation is publicly available.

πŸ“ Abstract
Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM.
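The abstract notes that the model accommodates both 2D and 3D imaging with a single encoder. One common way to reuse pretrained 2D vision weights for 3D volumes is I3D-style kernel inflation: a 2D convolution kernel is repeated along a new depth axis and rescaled so that a depth-constant volume yields the same response as the original 2D image. This is a hedged sketch of that general technique, not necessarily the adaptation VividMed uses:

```python
import numpy as np

def inflate_kernel_2d_to_3d(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a 2D conv kernel of shape (C_out, C_in, kH, kW) to a 3D
    kernel of shape (C_out, C_in, depth, kH, kW): repeat along the new
    depth axis and divide by `depth`, so a depth-constant input volume
    produces the same activations as the original 2D convolution."""
    if w2d.ndim != 4:
        raise ValueError("expected kernel of shape (C_out, C_in, kH, kW)")
    return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth

# Example: inflate a 2D patch-embedding kernel to accept 3D volumes.
w2d = np.random.rand(8, 3, 4, 4).astype(np.float32)
w3d = inflate_kernel_2d_to_3d(w2d, depth=4)
print(w3d.shape)                           # (8, 3, 4, 4, 4)
print(np.allclose(w3d.sum(axis=2), w2d))   # True: depth-sum recovers w2d
```

The rescaling by `depth` is what preserves the pretrained response statistics when the 2D weights are transferred to volumetric inputs.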
Problem

Research questions and friction points this paper is trying to address.

Most VLMs rely on a single visual grounding method, while complex medical tasks demand more versatile approaches
Most VLMs process only 2D images, yet a large portion of medical imaging is 3D
Annotated medical data is scarce, compounding both obstacles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Versatile visual grounding: generates both semantic segmentation masks and instance-level bounding boxes
Accommodates various imaging modalities, including both 2D and 3D data
Three-stage training procedure plus an automatic data synthesis pipeline built on open datasets and models
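Because the model emits both segmentation masks and bounding boxes for the same finding, the two grounding outputs can be kept consistent by deriving a tight box from a predicted binary mask. A minimal illustration of that idea (my own sketch, not code from the paper):

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Return the tight bounding box (x_min, y_min, x_max, y_max) of a
    2D binary mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True          # a 3x4 lesion-like region
print(mask_to_bbox(mask))      # (3, 2, 6, 4)
```

Deriving boxes from masks in this way guarantees the instance-level localization never disagrees with the pixel-level segmentation.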