🤖 AI Summary
In multimodal image fusion, conventional methods are efficient but lack robustness, while text-guided large models are robust but computationally expensive. This paper proposes a lightweight, inference-efficient fusion framework that bridges this gap by eliminating the need for textual input at inference time. The method introduces three key innovations: (1) distilling textual prior knowledge embedded in large language models (LLMs) into a compact image fusion network; (2) a spatial-channel cross-fusion module that enables cross-dimensional prior modeling; and (3) knowledge transfer via a teacher–student architecture. Experimental results show that the approach retains over 90% of the teacher model's performance while reducing both parameter count and inference latency by 90%, and that it significantly outperforms existing state-of-the-art methods across multiple benchmarks. The source code will be made publicly available.
📝 Abstract
Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead in both memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework uses a teacher-student architecture, in which the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module that enhances the model's ability to exploit textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality: the distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing state-of-the-art (SOTA) methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.
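The abstract gives no implementation details, so the following is only a rough, hypothetical sketch of the two core ideas it names: a spatial-channel cross-fusion of feature maps, and a distillation loss that pulls student features and outputs toward the teacher's. All function names, shapes, and the specific attention/loss forms here are illustrative assumptions written in NumPy, not the authors' actual modules.

```python
import numpy as np

def channel_attention(feat):
    # feat: (C, H, W); reweight channels via global average pooling + softmax
    pooled = feat.mean(axis=(1, 2))                     # (C,)
    pooled = pooled - pooled.max()                      # numerical stability
    w = np.exp(pooled) / np.exp(pooled).sum()           # softmax over channels
    return feat * w[:, None, None]

def spatial_attention(feat):
    # per-pixel gate from the channel-averaged map (sigmoid)
    m = feat.mean(axis=0)                               # (H, W)
    w = 1.0 / (1.0 + np.exp(-m))
    return feat * w[None, :, :]

def cross_fusion(feat):
    # toy stand-in for the spatial-channel cross-fusion module:
    # average the channel- and spatial-attended feature paths
    return 0.5 * (channel_attention(feat) + spatial_attention(feat))

def distill_loss(student_feat, teacher_feat, student_out, teacher_out, alpha=0.5):
    # feature-level + output-level distillation, both plain MSE here;
    # the paper's "tailored distillation process" is likely more elaborate
    feat_term = np.mean((student_feat - teacher_feat) ** 2)
    out_term = np.mean((student_out - teacher_out) ** 2)
    return alpha * feat_term + (1.0 - alpha) * out_term
```

In this toy form, the teacher (with text priors) is run once to produce target features and fused outputs, and the compact student is trained on `distill_loss` alone, so no text encoder is needed at inference.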