🤖 AI Summary
Current text-to-image diffusion models for 3D chest CT generation rely on simplified textual prompts and fail to capture the fine-grained clinical semantics of comprehensive radiology reports, leading to poor text–image alignment and low anatomical fidelity. To address this, we propose a multi-text-encoder fusion framework built on a 3D latent diffusion architecture that synergistically integrates BiomedVLP-CXR-BERT (for imaging findings), MedEmbed (for structured descriptions), and ClinicalBERT (for clinical reasoning). Coupled with classifier-free guidance, our method enhances semantic alignment and anatomical plausibility. Evaluated in the VLM3D Challenge 2025, our approach achieves first place, attaining state-of-the-art FID and CLIP Score. The generated 3D chest CT volumes exhibit high anatomical accuracy and visual quality, marking the first end-to-end synthesis of high-fidelity 3D thoracic CT directly from free-text radiology reports.
📝 Abstract
Text-to-image latent diffusion models have recently advanced medical image synthesis, but applications to 3D CT generation remain limited. Existing approaches rely on simplified prompts, neglecting the rich semantic detail in full radiology reports, which reduces text–image alignment and clinical fidelity. We propose Report2CT, a radiology-report-conditional latent diffusion framework for synthesizing 3D chest CT volumes directly from free-text radiology reports, incorporating both the findings and impression sections using multiple text encoders. Report2CT integrates three pretrained medical text encoders (BiomedVLP-CXR-BERT, MedEmbed, and ClinicalBERT) to capture nuanced clinical context. Radiology reports and voxel spacing information condition a 3D latent diffusion model trained on 20,000 CT volumes from the CT-RATE dataset. Model performance was evaluated using Fréchet Inception Distance (FID) for real–synthetic distributional similarity and CLIP-based metrics for semantic alignment, with additional qualitative and quantitative comparisons against the GenerateCT model. Report2CT generated anatomically consistent CT volumes with excellent visual quality and text–image alignment. Multi-encoder conditioning improved CLIP scores, indicating stronger preservation of fine-grained clinical details from the free-text radiology reports. Classifier-free guidance further enhanced alignment with only a minor trade-off in FID. Report2CT ranked first in the VLM3D Challenge on Text-Conditional CT Generation at MICCAI 2025, achieving state-of-the-art performance across all evaluation metrics. By leveraging complete radiology reports and multi-encoder text conditioning, Report2CT advances 3D CT synthesis, producing clinically faithful and high-quality synthetic data.
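The abstract does not spell out the mechanics of multi-encoder conditioning or classifier-free guidance. The sketch below illustrates the two ideas under assumed interfaces: the function names are hypothetical, and fusion-by-concatenation is one plausible strategy, not necessarily the paper's actual implementation.

```python
import numpy as np

def fuse_text_embeddings(embeddings):
    """Hypothetical fusion: concatenate per-encoder report embeddings.

    Each encoder (e.g. BiomedVLP-CXR-BERT, MedEmbed, ClinicalBERT) maps
    the report to a (batch, dim_i) array; the fused conditioning vector
    is their concatenation along the feature axis.
    """
    return np.concatenate(embeddings, axis=-1)

def cfg_noise_estimate(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance at one denoising step.

    Extrapolates from the unconditional noise prediction toward the
    report-conditional one: eps_hat = eps_u + w * (eps_c - eps_u).
    Larger w tightens text-image alignment, typically at a small FID cost,
    consistent with the trade-off reported in the abstract.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With `guidance_scale = 1` the estimate reduces to the plain conditional prediction; values above 1 amplify the report signal at every sampling step.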