🤖 AI Summary
To address the data bottleneck in 3D medical vision-language pre-training caused by scarce paired image-report data, this paper introduces the COLIPRI model family, the first to incorporate report generation as an inductive bias into 3D vision-language pre-training. COLIPRI leverages both unpaired 3D medical images and limited paired image-report data to improve data efficiency. Methodologically, it combines 3D convolutional backbones, contrastive learning, cross-modal alignment, and a report-generation objective, evaluated via classification probing and zero-shot protocols. Experiments show that COLIPRI achieves state-of-the-art performance in report generation, classification probing, and zero-shot classification, while remaining competitive in semantic segmentation. The framework improves generalization and cross-task transferability, extending the reach of medical vision-language pre-training.
📝 Abstract
Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs.
In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and combining vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, together with best practices from the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
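The abstract describes a multi-objective setup: a contrastive image-text alignment loss on paired data, plus a report-generation (captioning) loss as an extra inductive bias. The paper does not publish its loss code here, so the following is only a minimal NumPy sketch of how such objectives are commonly combined; the function names, dimensions, and the simple sum of losses are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    paired image/report embeddings; matching pairs sit on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(img))

    def xent(l):
        # numerically stable log-softmax, then pick the diagonal (true pairs)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def report_gen_loss(token_logits, target_tokens):
    """Token-level cross-entropy for a report-generation decoder head."""
    l = token_logits - token_logits.max(axis=1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(target_tokens)), target_tokens].mean()

# Toy batch: 4 paired 3D-image/report embeddings and 10 decoder steps.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 32))
txt_emb = rng.normal(size=(4, 32))
decoder_logits = rng.normal(size=(10, 100))          # vocab size 100
targets = rng.integers(0, 100, size=10)

# Combined pre-training objective (equal weighting is an assumption).
total_loss = info_nce(img_emb, txt_emb) + report_gen_loss(decoder_logits, targets)
print(float(total_loss))
```

In practice the image-only datasets mentioned in the abstract would contribute through a separate vision-only objective on batches without reports, and the relative weighting of the terms would be a tuned hyperparameter.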