VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

📅 2026-03-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes VolDiT, which the authors describe as the first purely transformer-based 3D diffusion model for medical image synthesis. It targets a limitation of conventional convolutional U-Net backbones: their strong locality bias and limited receptive fields can make global context hard to capture. VolDiT tokenizes 3D volumes directly with volumetric patch embeddings and applies global self-attention over the resulting 3D tokens to model long-range dependencies. For structural control, a timestep-gated control adapter maps segmentation masks into learnable control tokens that modulate the transformer layers during denoising, giving token-level conditioning. On high-resolution 3D medical image synthesis tasks, the authors report improved global coherence, generative fidelity, and spatial controllability compared with U-Net-based 3D latent diffusion baselines.
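To make the patch embedding step concrete, the following is a minimal sketch of how a 3D volume might be split into non-overlapping patches and flattened into a token sequence. The function name `patchify_3d`, the toy 4x4x4 volume, and the 2x2x2 patch size are illustrative assumptions, not values from the paper, and the learned projection that would normally follow is omitted.

```python
from itertools import product


def patchify_3d(volume, patch):
    """Split a nested-list volume [D][H][W] into flattened, non-overlapping patches.

    Hypothetical helper for illustration only; a real model would follow
    this with a learned linear projection into the embedding dimension.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    pd, ph, pw = patch
    if depth % pd or height % ph or width % pw:
        raise ValueError("volume dimensions must be divisible by the patch size")

    tokens = []
    # Iterate over the top-left-front corner of every patch in raster order.
    for z0, y0, x0 in product(range(0, depth, pd),
                              range(0, height, ph),
                              range(0, width, pw)):
        token = [volume[z][y][x]
                 for z in range(z0, z0 + pd)
                 for y in range(y0, y0 + ph)
                 for x in range(x0, x0 + pw)]
        tokens.append(token)
    return tokens


# Toy 4x4x4 volume (assumed size) whose voxel value encodes its flat index.
volume = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)] for z in range(4)]
tokens = patchify_3d(volume, patch=(2, 2, 2))

num_tokens = len(tokens)            # one token per patch
token_dim = len(tokens[0])          # voxels per patch before projection
attention_pairs = num_tokens ** 2   # token pairs compared by global self-attention
```

Because global self-attention compares every token with every other token, the number of pairs grows quadratically with the token count, so the choice of patch size directly trades spatial detail against compute.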

πŸ“ Abstract
Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
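The idea of timestep-gated, token-level conditioning can be illustrated with a small sketch. The abstract does not specify the form of the gate or how control tokens enter the transformer layers, so everything below is an assumption chosen for illustration: a scalar sigmoid gate on a normalized timestep, additive modulation of hidden tokens, and the hypothetical names `timestep_gate` and `apply_control` with made-up default parameters. It is not the authors' implementation.

```python
import math


def timestep_gate(t, weight, bias):
    """Scalar gate in (0, 1) computed from a normalized timestep t in [0, 1].

    Assumed form: sigmoid(weight * t + bias). In a trained model, weight
    and bias would be learned parameters.
    """
    return 1.0 / (1.0 + math.exp(-(weight * t + bias)))


def apply_control(hidden_tokens, control_tokens, t, weight=-4.0, bias=2.0):
    """Add gated control tokens to hidden tokens, one token at a time.

    hidden_tokens and control_tokens are lists of equal-length float lists,
    one entry per 3D patch token. The additive update is an assumption.
    """
    gate = timestep_gate(t, weight, bias)
    return [[h + gate * c for h, c in zip(h_tok, c_tok)]
            for h_tok, c_tok in zip(hidden_tokens, control_tokens)]


# Toy example: one token with two features (values are arbitrary).
hidden = [[1.0, 2.0]]
control = [[2.0, -2.0]]   # stands in for tokens derived from a segmentation mask

mid_gate = timestep_gate(0.5, weight=-4.0, bias=2.0)
modulated = apply_control(hidden, control, t=0.5)
```

With these assumed parameters the gate is larger at small `t` and smaller at large `t`, so the influence of the mask-derived tokens varies across the denoising trajectory. Which direction and schedule the actual model learns is not stated in the abstract.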
Problem

Research questions and friction points this paper is trying to address.

3D medical image synthesis
diffusion models
transformer architecture
global context
controllable generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
3D Medical Image Synthesis
Volumetric Patch Embedding
Global Self-Attention
Token-level Conditioning
Marvin Seyfarth
Institute for Artificial Intelligence in Cardiovascular Medicine, Medical Faculty of Heidelberg University, Heidelberg University, Heidelberg, Germany; Department of Cardiology, Angiology, Pneumology, Heidelberg University Hospital, Heidelberg, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Heidelberg, Germany
Salman Ul Hassan Dar
Heidelberg University
Medical Image Analysis, Computer Vision, Deep Learning
Yannik Frisch
PhD Student, TU Darmstadt
Generative Models, Representation Learning, Surgical Data, Medical Imaging
Philipp Wild
University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
Norbert Frey
Department of Cardiology, Angiology, Pneumology, Heidelberg University Hospital, Heidelberg, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Heidelberg, Germany
Florian André
Department of Cardiology, Angiology, Pneumology, Heidelberg University Hospital, Heidelberg, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Heidelberg, Germany
Sandy Engelhardt
Full Professor at Heidelberg University
Cardiac Image Processing, Computer-Assisted Surgery, Deep Learning, Augmented Reality