Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images

📅 2025-07-29
🤖 AI Summary
The absence of multimodal foundation models for 3D cardiac CT imaging hinders complex cardiovascular diagnosis. Method: We propose the first clinically oriented vision-language foundation model, trained in two stages: (1) unsupervised pretraining via a 3D masked autoencoder (MAE) to learn deep volumetric visual representations; and (2) refined cross-modal contrastive learning integrating standardized radiology reports, pathology attribute vectors, and soft-label matrices to achieve fine-grained clinical semantic alignment. Contribution/Results: The model supports cardiovascular abnormality classification, cross-modal information retrieval, and prospective clinical prediction (e.g., acute coronary syndrome). Evaluated on multi-center internal and external datasets, it achieves state-of-the-art performance across all downstream tasks—particularly excelling in challenging prospective prediction, where it significantly outperforms existing methods. These results demonstrate strong generalizability and tangible clinical utility.

📝 Abstract
Foundation models have demonstrated remarkable potential in the medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning on large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16,641 real clinical CT scans, supplemented by 114k publicly available scans. Meanwhile, we standardize free-text radiology reports into unified templates and construct pathology vectors from diagnostic attributes, from which a soft-label matrix is generated to supervise the contrastive learning process. To comprehensively evaluate the effectiveness of Cardiac-CLIP, we further collect 6,722 real clinical scans from 12 independent institutions and combine them with open-source data to construct the evaluation dataset. Cardiac-CLIP is evaluated across multiple tasks, including cardiovascular abnormality classification, information retrieval, and clinical analysis. Experimental results demonstrate that Cardiac-CLIP achieves state-of-the-art performance across downstream tasks on both internal and external data. In particular, Cardiac-CLIP proves highly effective on complex clinical tasks such as the prospective prediction of acute coronary syndrome, which is notoriously difficult in real-world scenarios.
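The second-stage objective described above, contrastive alignment supervised by a soft-label matrix derived from pathology attribute vectors rather than a hard identity matrix, can be sketched roughly as follows. This is a minimal illustration, not the paper's formulation: the cosine-similarity soft targets, the temperature value, and the single image-to-text direction are all assumptions.

```python
import numpy as np

def soft_label_contrastive_loss(img_emb, txt_emb, pathology_vecs, temperature=0.07):
    """Contrastive alignment where the target for each image-text pair is a
    soft distribution over the batch, built from similarity between pathology
    attribute vectors (illustrative assumption, not the paper's exact loss)."""
    # L2-normalize embeddings, then form a (B, B) logit matrix of similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Soft-label matrix: cosine similarity between diagnostic attribute
    # vectors, turned into a per-row probability distribution.
    p = pathology_vecs / (np.linalg.norm(pathology_vecs, axis=1, keepdims=True) + 1e-8)
    soft = p @ p.T
    soft = np.exp(soft) / np.exp(soft).sum(axis=1, keepdims=True)

    # Cross-entropy between the soft targets and the softmax over logits
    # (numerically stable log-softmax, image-to-text direction only).
    log_q = logits - logits.max(axis=1, keepdims=True)
    log_q = log_q - np.log(np.exp(log_q).sum(axis=1, keepdims=True))
    return float(-(soft * log_q).sum(axis=1).mean())

# Toy usage: batch of 4 embeddings and binary 8-dim pathology vectors.
rng = np.random.default_rng(0)
B, D, A = 4, 16, 8
loss = soft_label_contrastive_loss(rng.normal(size=(B, D)),
                                   rng.normal(size=(B, D)),
                                   rng.integers(0, 2, size=(B, A)).astype(float))
```

The key difference from a standard CLIP loss is that two studies sharing diagnostic attributes contribute positive target mass to each other instead of being treated as pure negatives.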
Problem

Research questions and friction points this paper is trying to address.

Develops Cardiac-CLIP for 3D cardiac CT diagnostics
Aligns visual-textual features via contrastive learning
Evaluates model on multi-task cardiovascular analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D masked autoencoder for self-supervised learning
Contrastive learning aligns visual and textual representations
Standardized radiology reports and pathology vectors
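The 3D MAE pretraining named in the first bullet can be loosely illustrated by the patch-masking step below; the patch size, 75% mask ratio, and flattened-token layout are generic MAE conventions assumed here, not the paper's reported settings.

```python
import numpy as np

def mask_3d_patches(volume, patch=8, mask_ratio=0.75, seed=0):
    """Split a CT volume into non-overlapping 3D patches and randomly mask a
    fraction of them; an MAE encoder would see only the visible patches and a
    lightweight decoder would reconstruct the masked ones."""
    D, H, W = volume.shape
    assert D % patch == 0 and H % patch == 0 and W % patch == 0
    # Rearrange the volume into (num_patches, patch**3) flattened tokens.
    patches = (volume
               .reshape(D // patch, patch, H // patch, patch, W // patch, patch)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(-1, patch ** 3))
    n = patches.shape[0]
    perm = np.random.default_rng(seed).permutation(n)
    n_keep = int(n * (1 - mask_ratio))
    visible_idx = np.sort(perm[:n_keep])   # tokens the encoder sees
    masked_idx = np.sort(perm[n_keep:])    # tokens the decoder reconstructs
    return patches[visible_idx], visible_idx, masked_idx

# Toy usage: a 32^3 volume yields 64 patches of 8^3 voxels; 16 stay visible.
vol = np.zeros((32, 32, 32), dtype=np.float32)
visible, vis_idx, mask_idx = mask_3d_patches(vol)
```

Reconstructing the masked voxels from the visible ones is what forces the encoder to learn the anatomical and contextual features mentioned in the abstract.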
Yutao Hu
School of Computer Science and Engineering, Southeast University, Nanjing, China
Ying Zheng
Department of Bioengineering, University of Washington
Bioengineering · Tissue Engineering · Regenerative Medicine
Shumei Miao
The First Affiliated Hospital of Nanjing Medical University, Nanjing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China
Xiaolei Zhang
Department of Radiology, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
Jiahao Xia
Research Fellow, University of Technology Sydney
Deep Learning
Yaolei Qi
Southeast University & University of Cambridge
Medical Image Analysis · Computer Vision · Deep Learning
Yiyang Zhang
School of Computer Science and Engineering, Southeast University, Nanjing, China
Yuting He
Foundation Medicine Inc.
Precision Medicine · Biomarker and CDx · Cancer Genomics · Machine Learning · Data Mining
Qian Chen
Department of Radiology, Nanjing First Hospital, Nanjing Medical University, Nanjing, China
Jing Ye
Radiology Department, Northern Jiangsu People’s Hospital, Yangzhou, China
Hongyan Qiao
Department of Medical Imaging, the Affiliated Hospital of Jiangnan University, Wuxi, China
Xiuhua Hu
Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
Lei Xu
Department of Radiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
Jiayin Zhang
Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
Hui Liu
Department of Radiology, Guangdong Provincial People’s Hospital, Guangzhou, China
Minwen Zheng
Department of Radiology, Xijing Hospital, Air Force Medical University, Xi’an, China
Yining Wang
Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Daimin Zhang
Department of Cardiology, Sir Run Run Hospital, Nanjing Medical University, Nanjing, China
Ji Zhang
Department of Radiology, Taizhou People’s Hospital, Taizhou, China
Wenqi Shao
Researcher at Shanghai AI Laboratory
Foundation Model Evaluation · LLM Compression · Efficient Adaptation · Multimodal Learning
Yun Liu
The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Longjiang Zhang
Department of Radiology, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
Guanyu Yang
School of Computer Science and Engineering, Southeast University, Nanjing, China