🤖 AI Summary
Existing 3D medical vision-language pretraining relies on global image embeddings, which limit the capture of fine-grained semantics from sparse but critical lesions and thus hinder accurate diagnostic alignment. To address this, we propose a **3D grounding-aware vision-language pretraining paradigm**: (1) We construct the first organ-level, whole-body CT–radiology report paired dataset; (2) We introduce semantic grounding-guided cross-modal contrastive learning to jointly optimize 3D visual and textual encoders; and (3) We enable zero-shot prompt-based inference. Our method achieves anatomical-structure-guided 3D cross-modal alignment for the first time. On zero-shot anomaly detection, tumor detection, and segmentation tasks, it improves F1 score by 15.1%, AUC by 1.9%, and Dice coefficient by 3.2%, respectively, demonstrating significantly enhanced semantic understanding and precise localization of sparse pathological findings.
📝 Abstract
3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets of CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios: the entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance *grounded* cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that CT-GLIP outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% in F1 score, 1.9% in AUC, and 3.2% in DSC on the zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
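The grounded cross-modal contrastive objective described above can be sketched as a CLIP-style symmetric InfoNCE loss applied at the organ level rather than over whole scans. The NumPy sketch below is illustrative only: the batch shapes, temperature value, and function name are assumptions, not details from the paper, and the actual encoders producing the embeddings are omitted.

```python
import numpy as np

def grounded_info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over organ-level embedding pairs (a sketch).

    img_emb, txt_emb: (N, D) arrays, where row i of each is a matched
    organ-level CT feature / report-text feature. All other rows in the
    batch act as negatives, as in CLIP-style contrastive pretraining.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def xent(m):
        # cross-entropy with the diagonal (matched pairs) as targets
        m = m - m.max(axis=1, keepdims=True)
        log_p = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal dominates and the loss is small; mismatched pairs drive it up, which is what pushes grounded visual features toward their textual descriptions during training.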