Distilling Textual Priors from LLM to Efficient Image Fusion

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between poor robustness of conventional methods and high computational overhead of text-guided large models in multimodal image fusion, this paper proposes a lightweight, inference-efficient fusion framework that eliminates the need for textual input during inference. Our method introduces three key innovations: (1) distilling textual prior knowledge embedded in large language models (LLMs) into a compact image fusion network; (2) designing a spatial-channel cross-fusion module to enable cross-dimensional prior modeling; and (3) implementing knowledge transfer via a teacher–student architecture. Experimental results demonstrate that the proposed approach retains over 90% of the teacher model’s performance while reducing both parameter count and inference latency by 90%. It significantly outperforms existing state-of-the-art methods across multiple benchmarks. The source code will be made publicly available.
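No implementation is shown on this page; the following is a minimal PyTorch sketch of the teacher-student distillation objective described above, purely for illustration. Every name here (DistillationLoss, alpha, the feature lists) is a hypothetical stand-in, not taken from the paper, and the actual tailored distillation process may differ.

```python
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """Hypothetical sketch: transfer the textual prior baked into a
    text-guided teacher network to a compact, image-only student."""

    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # assumed weight balancing the two loss terms

    def forward(self, student_out, teacher_out, student_feats, teacher_feats):
        # Output-level distillation: the student mimics the teacher's fused image.
        out_loss = F.l1_loss(student_out, teacher_out.detach())
        # Feature-level distillation: align intermediate representations so the
        # prior knowledge (not just the final output) transfers to the student.
        feat_loss = sum(
            F.mse_loss(s, t.detach())
            for s, t in zip(student_feats, teacher_feats)
        )
        return out_loss + self.alpha * feat_loss
```

At training time one would run both networks on the same source pair, freeze the teacher, and back-propagate this loss only through the student; at inference only the student (with no text input) is kept, which is what yields the reported reduction in parameters and latency.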

📝 Abstract
Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent advances in text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead, both in memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework utilizes a teacher-student architecture, where the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model's ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.
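The spatial-channel cross-fusion module is only named here, not specified; below is a speculative PyTorch sketch of one plausible design, in which two feature streams cross-apply channel-wise and spatial-wise attention gates. The class name, the gating layout, and the reduction ratio are all assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpatialChannelCrossFusion(nn.Module):
    """Speculative sketch: each stream is modulated by the other stream's
    channel statistics and spatial saliency, so prior information can
    propagate across both dimensions before the final fusion."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel gate (squeeze-and-excitation style): global pooling -> MLP.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: a single-channel attention map over H x W.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.out_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Cross-gating: a is reweighted by b's gates and vice versa.
        a = feat_a * self.channel_gate(feat_b) * self.spatial_gate(feat_b)
        b = feat_b * self.channel_gate(feat_a) * self.spatial_gate(feat_a)
        return self.out_conv(torch.cat([a, b], dim=1))
```

Sharing one gate per dimension across both streams keeps the module light, which matters given the paper's emphasis on inference efficiency; a faithful reimplementation should follow the released code once it is public.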
Problem

Research questions and friction points this paper is trying to address.

Efficient image fusion without text guidance
Reducing computational overhead in multi-modality fusion
Distilling large model priors to smaller networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-student architecture for knowledge distillation
Spatial-channel cross-fusion module for enhanced priors
10% parameters with 90% performance retention
👥 Authors

Ran Zhang
Hefei University of Technology, Hefei, China

Liu Liu
Hefei University of Technology, Hefei, China

Xuanhua He
The Hong Kong University of Science and Technology
Research interests: low-level vision, video generation

Ke Cao
University of Science and Technology of China, Hefei, China

Li Zhang
University of Science and Technology of China, Hefei, China

Man Zhou
School of CSE, Huazhong University of Science and Technology (HUST)
Research interests: Embodied AI Security, Authentication, Mobile System Security, Wireless Sensing Security

Jie Zhang
Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China