🤖 AI Summary
Existing 2D diffusion models fail to capture the through-plane (slice-thickness) structure and volumetric distribution of 3D medical images, while full 3D diffusion models are hindered by prohibitive computational cost and data requirements. Method: We propose a hybrid 2D–3D diffusion framework that integrates three orthogonal-plane 2D diffusion models with a lightweight 3D feature-calibration network at each denoising step, introducing cross-plane weighted ensembling and multi-condition joint sampling to enforce 3D consistency. Contribution/Results: This work establishes the first diffusion-step-level 2D–3D collaborative modeling paradigm, eliminating the need for full 3D training and substantially reducing computational and data demands. Experiments demonstrate improved volumetric geometric fidelity in super-resolution and cross-modality translation tasks; downstream tumor segmentation gains +3.2% mDice, validating both the effectiveness and the generalizability of explicit 3D structural modeling.
📝 Abstract
Despite success in volume-to-volume translation for medical images, most existing models struggle to capture the inherent volumetric distribution with 3D representations. The current state-of-the-art approach combines multiple 2D networks through weighted averaging, thereby neglecting 3D spatial structure. Directly training 3D models in medical imaging presents significant challenges due to high computational demands and the need for large-scale datasets. To address these challenges, we introduce Diff-Ensembler, a novel hybrid 2D–3D model for efficient and effective volumetric translation that ensembles perpendicularly trained 2D diffusion models with a 3D network at each diffusion step. Moreover, our model can naturally ensemble diffusion models conditioned on different modalities, allowing flexible and accurate fusion of input conditions. Extensive experiments demonstrate that Diff-Ensembler attains superior accuracy and volumetric realism in 3D medical image super-resolution and modality translation. We further demonstrate the strength of its volumetric realism using tumor segmentation as a downstream task.
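The core idea above, ensembling three perpendicular 2D denoising predictions and then refining them with a small 3D network at every diffusion step, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `plane_denoise` stands in for a trained 2D diffusion U-Net (replaced here by simple in-plane smoothing), `calibrate_3d` stands in for the lightweight 3D network (replaced by an identity map), and all function names and the uniform weights are hypothetical.

```python
import numpy as np

def plane_denoise(x_t, axis):
    """Stand-in for one 2D diffusion model applied slice-by-slice
    perpendicular to `axis` (hypothetical stub: in-plane smoothing
    instead of a trained noise-prediction U-Net)."""
    smoothed = x_t.copy()
    for a in range(3):
        if a != axis:  # only smooth within the 2D slice plane
            smoothed = 0.5 * smoothed + 0.25 * (
                np.roll(smoothed, 1, axis=a) + np.roll(smoothed, -1, axis=a)
            )
    return smoothed

def calibrate_3d(fused):
    """Stand-in for the lightweight 3D calibration network
    (hypothetical stub: identity mapping)."""
    return fused

def diff_ensembler_step(x_t, weights=(1 / 3, 1 / 3, 1 / 3)):
    """One denoising step: cross-plane weighted ensemble of three
    perpendicular 2D predictions, refined by a 3D network."""
    preds = [plane_denoise(x_t, axis) for axis in range(3)]
    fused = sum(w * p for w, p in zip(weights, preds))  # weighted ensemble
    return calibrate_3d(fused)

# Toy noisy volume of shape (depth, height, width)
volume = np.random.default_rng(0).normal(size=(8, 8, 8))
out = diff_ensembler_step(volume)
print(out.shape)  # (8, 8, 8)
```

In the actual method, this fusion happens inside every reverse-diffusion step rather than once at the end, which is what lets the 3D network enforce volumetric consistency without ever training a full 3D diffusion model.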