🤖 AI Summary
Existing diffusion-based image fusion methods rely on predefined modality guidance, so they neither model the dynamic variation of modality importance nor offer theoretical guarantees. This work first reveals a spatio-temporal imbalance in information gain during denoising and introduces the Diffusion Information Gain (DIG), a concept that quantifies each modality's step-wise information contribution. We propose a dynamic fusion framework that is theoretically grounded to provably reduce the upper bound of the generalization error. Our method enables step-level dynamic quantification and adaptive fusion of modality contributions via variational inference, information-theoretic metrics, dynamic weight scheduling, and multi-stage feature alignment. Evaluated across diverse fusion scenarios, it achieves +2.1 dB PSNR and +0.032 SSIM improvements over state-of-the-art diffusion fusion approaches, while accelerating inference by 37%.
📝 Abstract
Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, diffusion models, which demonstrate unprecedented generative potential, have been explored for image fusion. However, these approaches typically incorporate predefined multimodal guidance into the diffusion process, failing to capture the dynamically changing significance of each modality, and they lack theoretical guarantees. To address these issues, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions across denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce Diffusion Information Gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in both fusion quality and inference efficiency.
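Since neither the summary nor the abstract includes pseudocode, a minimal sketch may help make the mechanism concrete. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes DIG is approximated by the per-region magnitude of the update that each modality's guidance induces at a denoising step, and the helper names (`compute_dig`, `dynamic_fusion_step`, the `denoiser` callable) and the softmax weighting are hypothetical.

```python
import torch
import torch.nn.functional as F


def compute_dig(x_t, x_prev):
    """Hypothetical per-region Diffusion Information Gain (DIG):
    the magnitude of the update one modality's guidance induces
    between consecutive denoising states x_t -> x_{t-1}."""
    return (x_prev - x_t).abs().mean(dim=1, keepdim=True)  # [B, 1, H, W]


def dynamic_fusion_step(x_t, t, denoiser, modalities, tau=1.0):
    """One denoising step with DIG-based dynamic fusion weights.

    denoiser(x_t, t, cond) -> x_{t-1} is a placeholder for any
    modality-conditioned diffusion sampler step.
    modalities: dict mapping modality name -> conditioning image tensor.
    Returns the fused x_{t-1}.
    """
    outputs, gains = [], []
    for cond in modalities.values():
        x_prev = denoiser(x_t, t, cond)        # modality-specific prediction
        outputs.append(x_prev)
        gains.append(compute_dig(x_t, x_prev))

    # Softmax across modalities turns per-region gains into weights
    # that sum to 1 at every spatial location and every step t, so
    # guidance shifts dynamically over both regions and time.
    w = F.softmax(torch.stack(gains) / tau, dim=0)  # [M, B, 1, H, W]
    return (torch.stack(outputs) * w).sum(dim=0)    # fused x_{t-1}
```

In a full sampler this step would be iterated from t = T down to 1; the temperature `tau` controls how sharply the fusion follows the momentarily dominant modality, which is one plausible way to realize the step-level dynamic guidance the abstract describes.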