🤖 AI Summary
Existing medical image fusion methods are constrained by fixed input modality counts, limiting adaptability to diverse clinical modality combinations. To address this, we propose the first end-to-end diffusion-based framework supporting arbitrary numbers of input modalities. Our method couples hierarchical Bayesian modeling with the diffusion process and embeds an Expectation-Maximization (EM) algorithm into the sampling stage for maximum likelihood estimation. Key components include modality-adaptive alignment, variable-length feature fusion, and uncertainty-aware reconstruction. On the Harvard multimodal dataset, our approach achieves state-of-the-art performance across all nine quantitative metrics for both two- and three-modality fusion tasks. Furthermore, cross-domain generalization experiments on infrared–visible, multi-exposure, and multi-focus imaging demonstrate significant improvements over prior methods. This work establishes a new paradigm for flexible, robust, and clinically deployable multimodal image fusion.
📝 Abstract
Different modalities of medical images provide unique physiological and anatomical information about diseases. Multi-modal medical image fusion integrates complementary information from medical images of different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process a varying number of inputs, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It performs two-modal and tri-modal medical image fusion end to end with the same set of weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization (EM) algorithm into the diffusion sampling iterations, FlexiD-Fuse can generate high-quality fused images with cross-modal information from the source images, regardless of the number of inputs. We compared our method with the latest two-modal and tri-modal medical image fusion methods on the Harvard dataset, evaluating them with nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. We also conducted extensive extension experiments on infrared–visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, comparing against the respective SOTA methods. The results of these extension experiments consistently demonstrate the effectiveness and superiority of our method.
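The core idea described above, recasting fusion with an arbitrary number of conditioning images as maximum likelihood estimation solved by EM inside the diffusion sampler, can be illustrated with a minimal sketch. The code below assumes a simple Gaussian observation model in which each source image equals the latent fused image plus modality-specific noise; the function name `em_fusion_step`, the per-modality variance parameters, and the interface are hypothetical stand-ins for one reverse-diffusion step, not the paper's actual implementation.

```python
import numpy as np

def em_fusion_step(prior_mean, prior_var, sources, n_iters=3):
    """One hypothetical EM refinement inside a reverse-diffusion step.

    prior_mean : (H, W) diffusion-prior estimate of the clean fused image
    prior_var  : scalar variance of that prior at the current timestep
    sources    : list of source-modality images of any length, each (H, W)
    """
    # Per-modality noise variances: latent scale parameters of the
    # hierarchical model, initialised uniformly (an assumed choice).
    sigma2 = [1.0] * len(sources)
    x = prior_mean.copy()
    for _ in range(n_iters):
        # E-step: the posterior over the fused image is Gaussian; its mean
        # is a precision-weighted average of the diffusion prior and
        # however many source likelihoods are available.
        precision = 1.0 / prior_var + sum(1.0 / s2 for s2 in sigma2)
        x = (prior_mean / prior_var
             + sum(y / s2 for y, s2 in zip(sources, sigma2))) / precision
        post_var = 1.0 / precision
        # M-step: maximum-likelihood update of each modality's noise
        # variance (mean squared residual plus posterior variance).
        sigma2 = [float(np.mean((y - x) ** 2)) + post_var for y in sources]
    return x

# The same call handles two or three modalities, which is the
# length-agnostic property the paper targets; all images here are
# random placeholders.
denoised = np.random.rand(64, 64)  # stand-in for the denoiser's estimate
ct, mri, pet = (np.random.rand(64, 64) for _ in range(3))
fused_2 = em_fusion_step(denoised, prior_var=0.1, sources=[ct, mri])
fused_3 = em_fusion_step(denoised, prior_var=0.1, sources=[ct, mri, pet])
```

Because the E-step simply sums likelihood contributions over whatever sources are present, no architectural change or retraining is needed as the modality count varies; in the paper this refinement runs inside every diffusion sampling iteration rather than on a single denoised estimate as sketched here.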