🤖 AI Summary
This work introduces the first free-text-to-3D-CT volumetric generation framework for medical imaging, addressing the challenge of synthesizing high-fidelity, anatomically consistent, high-resolution CT volumes directly from natural language descriptions. Methodologically, it proposes a novel medical text-prompt modeling mechanism that eliminates reliance on fixed templates, and it designs a unified 3D diffusion-based generative paradigm that integrates a fine-tuned CLIP text encoder, a 3D latent U-Net denoising network, and an anatomy-aware loss function to achieve precise semantic–voxel alignment. Evaluated on a multi-center CT dataset, the framework achieves state-of-the-art performance, including a 32% reduction in Fréchet Inception Distance (FID) and a 0.18 increase in Structural Similarity Index Measure (SSIM), with marked improvements in structural fidelity, particularly at lesion and organ boundaries. The approach establishes a new paradigm for AI-assisted diagnosis and computational medical research.
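Since the summary names concrete components (a fine-tuned CLIP text encoder, a 3D latent U-Net denoiser, and an anatomy-aware loss), a minimal sketch of how one training step could be wired together may help make the description concrete. All module and variable names below (`LatentUNet3D`, `anatomy_aware_loss`, the `boundary` mask) are hypothetical stand-ins, and the boundary-weighted MSE is one plausible reading of "anatomy-aware loss", not the paper's actual formulation.

```python
# Hedged sketch of a text-conditioned 3D latent-diffusion training step.
# Everything here is a placeholder consistent with the summary, not the
# authors' implementation.
import torch
import torch.nn as nn

class LatentUNet3D(nn.Module):
    """Stand-in for the 3D latent U-Net denoiser (text conditioning via a
    broadcast add; timestep embedding omitted for brevity)."""
    def __init__(self, latent_ch=4, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_ch)
        self.net = nn.Sequential(
            nn.Conv3d(latent_ch, 32, 3, padding=1), nn.SiLU(),
            nn.Conv3d(32, latent_ch, 3, padding=1),
        )

    def forward(self, z_t, t, text_emb):
        # Broadcast the pooled text embedding over the 3D latent grid.
        cond = self.text_proj(text_emb)[:, :, None, None, None]
        return self.net(z_t + cond)

def anatomy_aware_loss(pred_noise, true_noise, boundary_mask, alpha=2.0):
    """Voxel-wise MSE up-weighted at lesion/organ boundaries -- one plausible
    interpretation of the summary's anatomy-aware objective."""
    w = 1.0 + alpha * boundary_mask            # emphasize boundary voxels
    return (w * (pred_noise - true_noise) ** 2).mean()

# --- one hypothetical training step ---
unet = LatentUNet3D()
z0 = torch.randn(2, 4, 16, 16, 16)             # clean CT latents (e.g. from a 3D VAE)
text_emb = torch.randn(2, 512)                 # pooled CLIP text embedding
boundary = (torch.rand_like(z0) > 0.9).float() # toy boundary map in latent space
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(z0)
abar = torch.rand(2, 1, 1, 1, 1)               # placeholder cumulative-alpha values
z_t = abar.sqrt() * z0 + (1 - abar).sqrt() * noise  # standard forward noising
loss = anatomy_aware_loss(unet(z_t, t, text_emb), noise, boundary)
loss.backward()
```

Up-weighting boundary voxels is a common way to bias a reconstruction-style loss toward structural fidelity, which would line up with the reported gains at lesion and organ boundaries.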
📝 Abstract
Generating 3D CT volumes from descriptive free-text inputs presents a transformative opportunity in diagnostics and research. In this paper, we introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions using a diffusion model. Unlike previous methods that rely on fixed-format text input, Text2CT employs a novel prompt formulation that enables generation from diverse, free-text descriptions. The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans, effectively bridging the gap between semantic text inputs and detailed volumetric representations in a unified 3D framework. Our method demonstrates superior performance in preserving anatomical fidelity and capturing the intricate structures described in the input text. Extensive evaluations show that our approach achieves state-of-the-art results, offering promising applications in diagnostics and data augmentation.
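To make the text-to-latent-to-volume pipeline in the abstract concrete, here is a hedged inference sketch under the same assumptions as the training snippet above: a text embedding conditions a reverse-diffusion loop over 3D latents, and a decoder upsamples the final latent into a CT volume. The DDPM-style update and `Decoder3D` are illustrative placeholders, not Text2CT's actual sampler or decoder.

```python
# Hedged sampling sketch: free text -> conditioned latent denoising -> CT volume.
# Reuses the hypothetical `unet` and `text_emb` from the training sketch above.
import torch
import torch.nn as nn

class Decoder3D(nn.Module):
    """Stand-in for the decoder mapping latents to a high-resolution volume."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.up = nn.ConvTranspose3d(latent_ch, 1, kernel_size=4, stride=4)

    def forward(self, z):
        return self.up(z)  # (B, 1, D, H, W) Hounsfield-like volume

@torch.no_grad()
def sample_ct(unet, decoder, text_emb, steps=50, shape=(1, 4, 16, 16, 16)):
    """Toy DDPM-style reverse loop: denoise a random latent conditioned on text."""
    z = torch.randn(shape)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i)
        eps = unet(z, t, text_emb)  # predicted noise, conditioned on the text
        # Posterior mean of the standard DDPM update, then add noise except at t=0.
        z = (z - betas[i] / (1 - abar[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            z = z + betas[i].sqrt() * torch.randn_like(z)
    return decoder(z)

# volume = sample_ct(unet, Decoder3D(), text_emb)  # -> (1, 1, 64, 64, 64) volume
```

Sampling in a compact latent space and decoding to full resolution afterward is what makes high-resolution 3D generation tractable; the specific noise schedule and sampler used by Text2CT are not stated in the abstract.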