Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images

📅 2025-07-29
🤖 AI Summary
The absence of multimodal foundation models for 3D cardiac CT imaging hinders complex cardiovascular diagnosis. Method: We propose the first clinically oriented vision-language foundation model, trained in two stages: (1) unsupervised pretraining via a 3D masked autoencoder (MAE) to learn deep volumetric visual representations; and (2) refined cross-modal contrastive learning integrating standardized radiology reports, pathology attribute vectors, and soft-label matrices to achieve fine-grained clinical semantic alignment. Contribution/Results: The model supports cardiovascular abnormality classification, cross-modal information retrieval, and prospective clinical prediction (e.g., acute coronary syndrome). Evaluated on multi-center internal and external datasets, it achieves state-of-the-art performance across all downstream tasks—particularly excelling in challenging prospective prediction, where it significantly outperforms existing methods. These results demonstrate strong generalizability and tangible clinical utility.

📝 Abstract
Foundation models have demonstrated remarkable potential in the medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning on large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16,641 real clinical CT scans, supplemented by 114k publicly available scans. Meanwhile, we standardize free-text radiology reports into unified templates and construct pathology vectors from diagnostic attributes, from which a soft-label matrix is generated to supervise the contrastive learning process. To comprehensively evaluate the effectiveness of Cardiac-CLIP, we further collect 6,722 real clinical scans from 12 independent institutions and combine them with open-source data to construct the evaluation dataset. Cardiac-CLIP is evaluated across multiple tasks, including cardiovascular abnormality classification, information retrieval, and clinical analysis. Experimental results demonstrate that Cardiac-CLIP achieves state-of-the-art performance across downstream tasks on both internal and external data. In particular, Cardiac-CLIP proves highly effective on complex clinical tasks such as the prospective prediction of acute coronary syndrome, which is notoriously difficult in real-world scenarios.
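The second-stage objective described above, contrastive alignment supervised by a soft-label matrix derived from pathology attribute vectors rather than a hard identity matrix, can be sketched roughly as follows. This is a minimal illustration, not the paper's formulation: the cosine-similarity soft targets, the temperature value, and the single image-to-text direction are all assumptions.

```python
import numpy as np

def soft_label_contrastive_loss(img_emb, txt_emb, pathology_vecs, temperature=0.07):
    """Contrastive alignment where the target for each image-text pair is a
    soft distribution over the batch, built from similarity between pathology
    attribute vectors (illustrative assumption, not the paper's exact loss)."""
    # L2-normalize embeddings, then form a (B, B) logit matrix of similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Soft-label matrix: cosine similarity between diagnostic attribute
    # vectors, turned into a per-row probability distribution.
    p = pathology_vecs / (np.linalg.norm(pathology_vecs, axis=1, keepdims=True) + 1e-8)
    soft = p @ p.T
    soft = np.exp(soft) / np.exp(soft).sum(axis=1, keepdims=True)

    # Cross-entropy between the soft targets and the softmax over logits
    # (numerically stable log-softmax, image-to-text direction only).
    log_q = logits - logits.max(axis=1, keepdims=True)
    log_q = log_q - np.log(np.exp(log_q).sum(axis=1, keepdims=True))
    return float(-(soft * log_q).sum(axis=1).mean())

# Toy usage: batch of 4 embeddings and binary 8-dim pathology vectors.
rng = np.random.default_rng(0)
B, D, A = 4, 16, 8
loss = soft_label_contrastive_loss(rng.normal(size=(B, D)),
                                   rng.normal(size=(B, D)),
                                   rng.integers(0, 2, size=(B, A)).astype(float))
```

The key difference from a standard CLIP loss is that two studies sharing diagnostic attributes contribute positive target mass to each other instead of being treated as pure negatives.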
Problem

Research questions and friction points this paper is trying to address.

Develops Cardiac-CLIP for 3D cardiac CT diagnostics
Aligns visual-textual features via contrastive learning
Evaluates model on multi-task cardiovascular analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D masked autoencoder for self-supervised learning
Contrastive learning aligns visual and textual representations
Standardized radiology reports and pathology vectors
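The 3D MAE pretraining named in the first bullet can be loosely illustrated by the patch-masking step below; the patch size, 75% mask ratio, and flattened-token layout are generic MAE conventions assumed here, not the paper's reported settings.

```python
import numpy as np

def mask_3d_patches(volume, patch=8, mask_ratio=0.75, seed=0):
    """Split a CT volume into non-overlapping 3D patches and randomly mask a
    fraction of them; an MAE encoder would see only the visible patches and a
    lightweight decoder would reconstruct the masked ones."""
    D, H, W = volume.shape
    assert D % patch == 0 and H % patch == 0 and W % patch == 0
    # Rearrange the volume into (num_patches, patch**3) flattened tokens.
    patches = (volume
               .reshape(D // patch, patch, H // patch, patch, W // patch, patch)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(-1, patch ** 3))
    n = patches.shape[0]
    perm = np.random.default_rng(seed).permutation(n)
    n_keep = int(n * (1 - mask_ratio))
    visible_idx = np.sort(perm[:n_keep])   # tokens the encoder sees
    masked_idx = np.sort(perm[n_keep:])    # tokens the decoder reconstructs
    return patches[visible_idx], visible_idx, masked_idx

# Toy usage: a 32^3 volume yields 64 patches of 8^3 voxels; 16 stay visible.
vol = np.zeros((32, 32, 32), dtype=np.float32)
visible, vis_idx, mask_idx = mask_3d_patches(vol)
```

Reconstructing the masked voxels from the visible ones is what forces the encoder to learn the anatomical and contextual features mentioned in the abstract.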
Yutao Hu
School of Computer Science and Engineering, Southeast University, Nanjing, China
Ying Zheng
Department of Bioengineering, University of Washington
Bioengineering · Tissue Engineering · Regenerative Medicine
Shumei Miao
The First Affiliated Hospital of Nanjing Medical University, Nanjing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China
Xiaolei Zhang
Department of Radiology, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
Jiahao Xia
Research Fellow, University of Technology Sydney
Deep Learning
Yaolei Qi
Southeast University & University of Cambridge
Medical Image Analysis · Computer Vision · Deep Learning
Yiyang Zhang
School of Computer Science and Engineering, Southeast University, Nanjing, China
Yuting He
Foundation Medicine Inc.
Precision Medicine · Biomarker and CDx · Cancer Genomics · Machine Learning · Data Mining
Qian Chen
Department of Radiology, Nanjing First Hospital, Nanjing Medical University, Nanjing, China
Jing Ye
Radiology Department, Northern Jiangsu People’s Hospital, Yangzhou, China
Hongyan Qiao
Department of Medical Imaging, the Affiliated Hospital of Jiangnan University, Wuxi, China
Xiuhua Hu
Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
Lei Xu
Department of Radiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
Jiayin Zhang
Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
Hui Liu
Department of Radiology, Guangdong Provincial People’s Hospital, Guangzhou, China
Minwen Zheng
Department of Radiology, Xijing Hospital, Air Force Medical University, Xi’an, China
Yining Wang
Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Daimin Zhang
Department of Cardiology, Sir Run Run Hospital, Nanjing Medical University, Nanjing, China
Ji Zhang
Department of Radiology, Taizhou People’s Hospital, Taizhou, China
Wenqi Shao
Researcher at Shanghai AI Laboratory
Foundation Model Evaluation · LLM Compression · Efficient Adaptation · Multimodal Learning
Yun Liu
The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Longjiang Zhang
Department of Radiology, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
Guanyu Yang
School of Computer Science and Engineering, Southeast University, Nanjing, China