VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine

📅 2025-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical 3D imaging data (e.g., CT scans) and their corresponding radiology reports are scarce in paired form, which severely limits performance on multimodal downstream tasks. To address this, we propose VELVET-Med, a volumetric vision-language pre-training framework designed specifically for 3D medical data. The method pairs a 3D visual encoder with TriBERT, a novel language encoder for multi-level textual semantics, and introduces a hierarchical contrastive learning scheme that aligns vision and language at the voxel, region, and semantic levels. Uni-modal self-supervised objectives are additionally optimized jointly with the cross-modal ones. Trained on only 38,875 scan-report pairs, VELVET-Med achieves state-of-the-art performance across four key tasks: 3D medical segmentation, cross-modal retrieval, visual question answering, and radiology report generation. It markedly improves generalization in low-data regimes and strengthens semantic alignment between imaging and text.
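
To make the multi-level alignment idea concrete, the sketch below applies a symmetric InfoNCE loss at each granularity (global, region, local) and sums the weighted terms. It is a minimal illustration under assumed shapes and defaults, not the authors' implementation; the function names, the three-level split, the loss weights, and the temperature value are all assumptions.

```python
# Hypothetical sketch of multi-level (hierarchical) contrastive alignment.
import torch
import torch.nn.functional as F

def info_nce(vis, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings, each [B, D]."""
    vis = F.normalize(vis, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = vis @ txt.t() / temperature                     # [B, B] similarities
    targets = torch.arange(vis.size(0), device=vis.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(levels, weights=(1.0, 1.0, 1.0)):
    """`levels`: list of (vision_emb, text_emb) pairs, one per granularity,
    e.g. [(scan_emb, report_emb), (region_emb, sentence_emb), (voxel_emb, token_emb)]."""
    return sum(w * info_nce(v, t) for w, (v, t) in zip(weights, levels))

# Toy usage with random features: batch of 4 pairs, 256-dim embeddings per level.
B, D = 4, 256
levels = [(torch.randn(B, D), torch.randn(B, D)) for _ in range(3)]
loss = hierarchical_contrastive_loss(levels)
```

Summing per-level losses with tunable weights lets coarse and fine alignment signals be balanced against each other without changing either encoder.
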

📝 Abstract
Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in the general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating large-scale paired data in the medical field for volumetric modalities such as CT scans remains a challenging and time-intensive process. This difficulty often limits performance on downstream tasks. To address these challenges, we propose a novel vision-language pre-training (VLP) framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Instead of relying on large-scale data collection, our method focuses on the development of effective pre-training objectives and model architectures. The key contributions are: 1) We incorporate uni-modal self-supervised learning into the VLP framework, which is often underexplored in the existing literature. 2) We propose a novel language encoder, termed TriBERT, for learning multi-level textual semantics. 3) We devise hierarchical contrastive learning to capture multi-level vision-language correspondence. Using only 38,875 scan-report pairs, our approach seeks to uncover the rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives, thereby enhancing the generalization ability of the learned encoders. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks, including 3D segmentation, cross-modal retrieval, visual question answering, and report generation.
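
The abstract's first contribution, folding uni-modal self-supervised learning into the VLP objective, can be pictured as a weighted sum of a masked-language-modeling term, a masked-volume-reconstruction term, and a cross-modal contrastive term. The sketch below is an assumed formulation rather than the paper's exact losses; all function names, loss weights, and the masking scheme are hypothetical.

```python
# Hypothetical sketch of a joint pre-training objective combining uni-modal
# self-supervision (text and volume) with a cross-modal contrastive term.
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(pred, target, mask):
    """Mean-squared error over masked positions only (e.g., masked 3D patches).
    pred/target: [B, N, P]; mask: [B, N] with 1 at masked patches."""
    return ((pred - target) ** 2 * mask.unsqueeze(-1)).sum() / mask.sum().clamp(min=1)

def joint_pretraining_loss(mlm_logits, mlm_labels,          # masked language modeling
                           vol_pred, vol_target, vol_mask,  # masked volume modeling
                           vis_emb, txt_emb,                # paired global embeddings
                           w_mlm=1.0, w_mvm=1.0, w_con=1.0, temperature=0.07):
    # Text self-supervision: cross-entropy over masked tokens (-100 = ignore).
    l_mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                            mlm_labels.view(-1), ignore_index=-100)
    # Vision self-supervision: reconstruct masked volumetric patches.
    l_mvm = masked_reconstruction_loss(vol_pred, vol_target, vol_mask)
    # Cross-modal alignment: symmetric InfoNCE on global scan/report embeddings.
    v = F.normalize(vis_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    l_con = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    return w_mlm * l_mlm + w_mvm * l_mvm + w_con * l_con

# Toy usage: batch 2, 10 text tokens, vocab 30522, 8 volume patches of dim 64.
mlm_logits = torch.randn(2, 10, 30522)
mlm_labels = torch.full((2, 10), -100, dtype=torch.long)
mlm_labels[:, 3] = 42                                  # one masked token per report
vol_pred, vol_target = torch.randn(2, 8, 64), torch.randn(2, 8, 64)
vol_mask = torch.zeros(2, 8)
vol_mask[:, :4] = 1                                    # first four patches masked
vis_emb, txt_emb = torch.randn(2, 128), torch.randn(2, 128)
loss = joint_pretraining_loss(mlm_logits, mlm_labels, vol_pred, vol_target,
                              vol_mask, vis_emb, txt_emb)
```
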
Problem

Research questions and friction points this paper is trying to address.

Addresses the scarcity of paired volumetric medical data (e.g., 3D CT scans with radiology reports) for vision-language pre-training
Develops effective pre-training objectives and architectures rather than relying on large-scale data collection
Enhances generalization on medical imaging tasks under low-data regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates uni-modal self-supervised learning into the VLP framework
Introduces the TriBERT language encoder for multi-level textual semantics (see the sketch after this list)
Devises hierarchical contrastive learning to capture multi-level vision-language correspondence
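
Since this summary does not spell out how TriBERT derives its multiple levels of textual semantics, the sketch below illustrates one plausible reading: pooling token embeddings into sentence-level and report-level representations. The function name, the sentence-id bookkeeping, and the mean-pooling choice are assumptions for illustration, not the paper's design.

```python
# Hypothetical illustration of multi-level textual semantics: token-, sentence-,
# and report-level features obtained by mean pooling over token embeddings.
import torch

def multilevel_text_features(token_emb, sentence_ids, attention_mask):
    """token_emb: [B, T, D]; sentence_ids: [B, T] in 0..S-1; attention_mask: [B, T]."""
    B, T, D = token_emb.shape
    mask = attention_mask.unsqueeze(-1).float()                 # [B, T, 1], 0 at padding
    # Sentence level: mean over tokens that share a sentence id.
    S = int(sentence_ids.max().item()) + 1
    one_hot = torch.zeros(B, T, S, device=token_emb.device)
    one_hot.scatter_(2, sentence_ids.unsqueeze(-1), 1.0)        # token -> sentence assignment
    one_hot = one_hot * mask                                    # drop padding tokens
    sent_sum = one_hot.transpose(1, 2) @ token_emb              # [B, S, D]
    sent_cnt = one_hot.sum(dim=1).clamp(min=1).unsqueeze(-1)    # [B, S, 1]
    sentence_emb = sent_sum / sent_cnt
    # Report level: mean over all non-padding tokens.
    report_emb = (token_emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return token_emb, sentence_emb, report_emb

# Toy usage: batch of 2 reports, 6 tokens each, 16-dim embeddings, 2 sentences per report.
tok = torch.randn(2, 6, 16)
sids = torch.tensor([[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]])
am = torch.ones(2, 6, dtype=torch.long)
token_lvl, sent_lvl, report_lvl = multilevel_text_features(tok, sids, am)
```

Representations at these different text granularities are what a hierarchical contrastive objective would pair against voxel-, region-, and scan-level visual features.
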