🤖 AI Summary
This study addresses the lack of standardized evaluation, insufficient clinical validation, and predominant reliance on 2D slices in cross-modal medical image synthesis by introducing a unified assessment framework for 3D cross-modality image translation. The framework incorporates standardized preprocessing, data partitioning, 3D inference, and multi-level quantitative and clinical evaluation. Seven generative models (the GANs Pix2Pix, CycleGAN, and SRGAN, plus four latent generative models) were systematically benchmarked across multiple oncological imaging tasks. Results indicate that SRGAN performs best overall, yet all methods struggle to synthesize small lesions. In CT-to-PET translation, lesion shape is reproduced more reliably than absolute uptake values. In a visual Turing test, physicians distinguished real from synthetic images with only 56.7% accuracy, close to chance, indicating that the generated volumes are largely indistinguishable from real acquisitions.
📝 Abstract
Medical image-to-image (I2I) translation enables virtual scanning, i.e., the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups, and lack clinical validation. The primary contribution of this work is a reproducible comparative evaluation of 3D I2I translation methods in oncological imaging that standardizes preprocessing, data splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models: three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching). The comparison covers eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. A Visual Turing test administered to 17 physicians, including 15 radiologists, yielded near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.
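The count of 77 experiments follows from pairing each of the seven models with each of the eleven datasets. The short Python sketch below enumerates that grid. The model names are taken from the abstract; the dataset identifiers (`dataset_01` to `dataset_11`) and the helper `experiment_grid` are hypothetical placeholders, since the abstract gives only the number of datasets and does not describe how the authors organize their runs.

```python
from itertools import product

# Model names as listed in the abstract.
GANS = ["Pix2Pix", "CycleGAN", "SRGAN"]
LATENT_MODELS = [
    "Latent Diffusion Model",
    "Latent Diffusion Model+ControlNet",
    "Brownian Bridge",
    "Flow Matching",
]
MODELS = GANS + LATENT_MODELS

# Hypothetical placeholder identifiers: the abstract states that there are
# eleven datasets but does not name them.
DATASETS = [f"dataset_{i:02d}" for i in range(1, 12)]


def experiment_grid(models, datasets):
    """Return every (model, dataset) pair, one pair per experiment."""
    return list(product(models, datasets))


grid = experiment_grid(MODELS, DATASETS)
print(len(MODELS), "models x", len(DATASETS), "datasets =", len(grid), "experiments")
```

Running every model on every dataset in a full grid like this is what allows the results to be compared under uniform conditions, rather than across papers with differing set-ups.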