CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

📅 2024-04-23
🏛️ arXiv.org
📈 Citations: 11
Influential: 0
🤖 AI Summary
Existing 3D medical vision-language pretraining relies on a single global image embedding, which loses the sparse but critical semantics of lesions and hinders accurate alignment with the corresponding diagnoses. To address this, we propose a **3D grounding-aware vision-language pretraining paradigm**: (1) we construct the first organ-level, whole-body CT–radiology report paired dataset; (2) we introduce grounding-guided cross-modal contrastive learning that jointly optimizes the 3D visual and textual encoders; and (3) we enable zero-shot, prompt-based inference. The method achieves anatomy-guided 3D cross-modal alignment for the first time. On zero-shot abnormality detection, tumor detection, and tumor segmentation, it improves F1 score by 15.1%, AUC by 1.9%, and Dice coefficient by 3.2%, respectively, demonstrating markedly stronger semantic understanding and precise localization of sparse pathological findings.
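To make the dataset contribution concrete, the sketch below shows one plausible record layout for an organ-level CT-report pair. Every field name here is an illustrative assumption, not the paper's actual schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class OrganLevelPair:
    """One organ-level CT-report training pair (hypothetical schema)."""
    patient_id: str      # de-identified patient identifier
    organ_name: str      # e.g. "liver"; the dataset covers 104 organs
    ct_crop: np.ndarray  # 3D CT sub-volume cropped around the organ
    sentence: str        # report sentence describing this organ's findings
```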

📝 Abstract
3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance *grounded* cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that it outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% in F1 score, 1.9% in AUC, and 3.2% in DSC for the zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
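Reading "grounded cross-modal contrastive learning" as a CLIP-style symmetric InfoNCE objective applied at the organ level rather than the whole-scan level, a minimal PyTorch sketch might look as follows. The function name, the temperature value, and the assumption of pre-computed (B, D) embeddings are ours, not the paper's.

```python
import torch
import torch.nn.functional as F


def grounded_contrastive_loss(img_emb: torch.Tensor,
                              txt_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched organ-level (image, text) pairs.

    img_emb: (B, D) embeddings of organ-cropped CT sub-volumes
    txt_emb: (B, D) embeddings of the matching report sentences
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match text to its image
    return 0.5 * (loss_i2t + loss_t2i)
```

The same loss applied to whole-scan embeddings would recover the global-alignment baseline the abstract criticizes; the grounding lies entirely in pairing organ crops with the report sentences that describe them.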
Problem

Research questions and friction points this paper is trying to address.

- Addresses the loss of sparse but critical semantics under global 3D medical vision-language alignment
- Proposes fine-grained, organ-level CT-report pairs for grounded cross-modal contrastive learning
- Enhances zero-shot organ recognition and abnormality detection in full-body CT scans
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Fine-grained, organ-level CT-report pairs for grounded cross-modal contrastive learning
- Alignment of grounded visual features with precise textual descriptions
- Zero-shot organ and abnormality recognition via natural-language prompts (see the sketch below)
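
Zero-shot recognition then reduces to ranking natural-language prompts against a visual embedding. Below is a minimal sketch under the assumption of a `text_encoder` that maps a list of prompts to (P, D) embeddings; the prompt wording and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(ct_emb: torch.Tensor, prompts: list,
                       text_encoder) -> int:
    """Return the index of the prompt most similar to the CT embedding."""
    txt_emb = F.normalize(text_encoder(prompts), dim=-1)  # (P, D), assumed API
    ct_emb = F.normalize(ct_emb, dim=-1)                  # (D,)
    scores = txt_emb @ ct_emb                             # (P,) cosine scores
    return int(scores.argmax())


# Hypothetical usage for organ-level abnormality detection:
# prompts = ["a CT scan of a normal liver",
#            "a CT scan of a liver with abnormality"]
# pred = zero_shot_classify(image_encoder(liver_crop), prompts, text_encoder)
```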