Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

📅 2025-05-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key challenges in text-to-3D CT generation: high-dimensional modeling complexity, intricate anatomical structures, and the absence of robust 3D vision-language alignment frameworks. We propose the first end-to-end method integrating 3D contrastive vision-language pretraining with latent-space diffusion modeling. Our approach constructs modality-specific cross-modal embedding spaces and employs voxelized VAE compression coupled with 3D latent diffusion—eliminating the need for post-hoc super-resolution while enabling high-fidelity 3D CT volume synthesis. Evaluated on the CT-RATE dataset, it achieves the first direct natural language-to-3D CT mapping. Quantitative and qualitative results demonstrate state-of-the-art performance in fidelity, clinical relevance, and semantic alignment. Moreover, synthetically generated CT volumes significantly improve downstream disease diagnosis model accuracy, empirically validating the method’s clinical utility.

Technology Category

Application Category

📝 Abstract
Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.
Problem

Research questions and friction points this paper is trying to address.

Extending text-to-image generation to 3D CT volumes
Aligning vision-language data in 3D medical imaging
Generating clinically meaningful CT scans from text
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D latent diffusion model for CT generation
Contrastive vision-language pretraining scheme
Volumetric VAE for efficient 3D denoising
Daniele Molino
Daniele Molino
Phd in Artificial Intelligence, Università Campus Bio-medico di Roma
Intelligenza ArtificialeModelli Generativi
Camillo Maria Caruso
Camillo Maria Caruso
PhD student, Università Campus Bio-Medico di Roma
artificial intelligencedeep learningcomputer vision
Filippo Ruffini
Filippo Ruffini
Università Campus Bio-medico di Roma
Artificial IntelligenceMedical image AnalysisComputer VisionDeep Learning
P
P. Soda
Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma, Roma, Europe; Department of Diagnostics and Intervention, Biomedical Engineering and Radiation Physics, Umeå University, Umeå, Sweden
V
V. Guarrasi
Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma, Roma, Europe