Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost of volumetric processing and the difficulty of cross-modal feature alignment in 3D medical image analysis, this paper introduces Med3DVLM, an efficient 3D vision-language model (VLM). Methodologically, the authors propose DCFormer, a 3D encoder built on decomposed convolutions that significantly reduces computational complexity; adopt SigLIP-style pairwise sigmoid contrastive learning to enhance fine-grained vision-language semantic alignment; and design a dual-stream MLP-Mixer cross-modal projector for granular multimodal feature fusion. Evaluated on the M3D benchmark, the model achieves state-of-the-art performance: 61.00% R@1 in image-text retrieval (+41.90 points over the prior M3D model), a 36.42% METEOR score in radiology report generation (+22.04 points), and 79.95% accuracy in closed-ended visual question answering. Collectively, these results demonstrate substantial improvements over existing methods and establish a scalable, semantically aligned multimodal foundation for 3D medical reasoning across diverse downstream tasks.
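The efficiency gain from decomposed 3D convolutions can be illustrated with a quick parameter count. The sketch below assumes a k×k×k kernel is replaced by three 1-D convolutions along depth, height, and width with the intermediate width held at `c_out`; the actual DCFormer configuration may differ.

```python
def conv3d_params(c_in, c_out, k):
    # Full 3D convolution: one k x k x k kernel per (in, out) channel pair.
    return c_in * c_out * k ** 3

def decomposed_conv3d_params(c_in, c_out, k):
    # Decomposed variant: three 1-D convolutions (depth, height, width).
    # Intermediate channel width fixed at c_out here -- an illustrative
    # assumption, not the paper's exact design.
    return c_in * c_out * k + 2 * (c_out * c_out * k)

full = conv3d_params(64, 64, 3)            # 64 * 64 * 27 = 110592
dec = decomposed_conv3d_params(64, 64, 3)  # 3 * (64 * 64 * 3) = 36864
print(full, dec, full / dec)               # 3x fewer parameters for k = 3
```

The saving grows with kernel size, since the full kernel scales as k³ while the decomposed one scales linearly in k.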

📝 Abstract
Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
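The pairwise sigmoid loss mentioned in the abstract treats every image-text pair independently as a binary classification, so it needs no large in-batch negative set the way a softmax contrastive loss does. A minimal NumPy sketch (fixed temperature `t` and bias `b`, which are learnable in the real model, and a mean over all pairs rather than the paper's exact normalization):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """SigLIP-style pairwise sigmoid loss (illustrative sketch).

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: temperature and bias, learnable in the actual model.
    """
    logits = t * img_emb @ txt_emb.T + b       # (N, N) pairwise similarities
    labels = 2.0 * np.eye(len(img_emb)) - 1.0  # +1 for matched pairs, -1 otherwise
    # -log sigmoid(labels * logits), written stably as log1p(exp(-x))
    return np.log1p(np.exp(-labels * logits)).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(4, 8)); y /= np.linalg.norm(y, axis=1, keepdims=True)
print(siglip_loss(x, y))  # scalar loss over all 16 image-text pairs
```

Because each pair contributes its own sigmoid term, the loss decomposes over pairs and is insensitive to batch size, which is what the abstract means by "without relying on large negative batches."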
Problem

Research questions and friction points this paper is trying to address.

Extending 2D vision-language models to 3D medical images efficiently
Aligning 3D spatial features with clinical text accurately
Improving multi-modal representation for medical image-text tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

DCFormer encoder with decomposed 3D convolutions
SigLIP contrastive learning with sigmoid loss
Dual-stream MLP-Mixer for multi-modal fusion
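The dual-stream MLP-Mixer projector listed above can be sketched as two parallel mixer blocks, one per feature level, whose outputs are fused. The block below is a generic MLP-Mixer (token mixing, then channel mixing, each with a residual); the fusion by concatenation and all dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def mixer_block(x, w1, w2, v1, v2):
    """One MLP-Mixer block on a (tokens, channels) array:
    token-mixing MLP, then channel-mixing MLP, each with a residual."""
    x = x + w2 @ gelu(w1 @ x)  # mix information across tokens
    x = x + gelu(x @ v1) @ v2  # mix information across channels
    return x

T, C, H = 16, 32, 64            # tokens, channels, hidden width (illustrative)
low = rng.normal(size=(T, C))   # stand-in for low-level image features
high = rng.normal(size=(T, C))  # stand-in for high-level image features

def make_params():
    return (rng.normal(size=(H, T)) * 0.02, rng.normal(size=(T, H)) * 0.02,
            rng.normal(size=(C, H)) * 0.02, rng.normal(size=(H, C)) * 0.02)

# Dual-stream: each feature level passes through its own mixer;
# the streams are concatenated channel-wise for the language model.
fused = np.concatenate([mixer_block(low, *make_params()),
                        mixer_block(high, *make_params())], axis=-1)
print(fused.shape)  # (16, 64)
```

Keeping the two streams separate until fusion lets low-level spatial detail and high-level semantics each be projected on their own terms before the language model sees them.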