Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

📅 2024-03-26
📈 Citations: 24
Influential: 3
🤖 AI Summary
Current 3D medical imaging AI is hindered by the scarcity of large-scale, paired multimodal datasets, impeding cross-modal alignment and natural-language interaction. To address this, we introduce CT-RATE—the first large-scale, paired 3D chest CT–radiology report dataset (25,692 cases)—and propose CT-CLIP, a contrastive learning framework, and CT-CHAT, a vision-language dialogue model. Our contributions include: (1) the first large-scale, fine-grained alignment between 3D CT volumes and free-text radiology reports; (2) CT-CLIP—a task-agnostic foundation model integrating 3D convolutional networks with Vision Transformers, requiring no downstream fine-tuning; and (3) CT-CHAT—the first open-source conversational model specific to 3D chest CT, trained via report-driven QA generation and joint LLM–vision fine-tuning for end-to-end diagnostic interaction. Experiments demonstrate that annotation-free multi-abnormality detection outperforms fully supervised state-of-the-art methods; cross-modal retrieval enables bidirectional image–text queries; and CT-CHAT, fine-tuned on 2.7 million medical QA pairs, surpasses existing multimodal medical assistants.

📝 Abstract
While computer vision has achieved tremendous success with multimodal encoding and direct textual interaction with images via chat-based large language models, similar advancements in medical imaging AI, particularly in 3D imaging, have been limited due to the scarcity of comprehensive datasets. To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Through various reconstructions, these scans are expanded to 50,188 volumes, totaling over 14.3 million 2D slices. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval. Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation. In case retrieval, it efficiently retrieves relevant cases using either image or textual queries, thereby enhancing knowledge dissemination. By combining CT-CLIP's vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging, but also lays the groundwork for future innovations in medical AI and improved patient care.
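The abstract describes CT-CLIP as a contrastive language-image pretraining framework pairing each CT volume with its radiology report. As background, CLIP-style frameworks typically optimize a symmetric contrastive (InfoNCE) objective over a batch of paired embeddings. The toy NumPy sketch below illustrates that objective only; the function names, batch size, and temperature are illustrative assumptions, not CT-CLIP's released code.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired (scan, report) embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each
    comes from the same case. Illustrative sketch, not CT-CLIP itself.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# perfectly aligned pairs should score a lower loss than mismatched pairs
aligned = clip_contrastive_loss(emb, emb)
mismatched = clip_contrastive_loss(emb, rng.normal(size=(4, 8)))
```

Minimizing this loss pulls each scan embedding toward its own report and away from the other reports in the batch, which is what makes annotation-free abnormality detection (scoring a scan against text prompts) and bidirectional case retrieval possible downstream.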
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of comprehensive 3D medical imaging datasets
Developing multimodal AI for 3D CT scan analysis
Enhancing medical AI without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CT-RATE dataset for 3D medical imaging
Develops CT-CLIP for contrastive language-image pretraining
Creates CT-CHAT vision-language chat model