🤖 AI Summary
In multimodal image fusion, conventional methods are efficient but lack robustness, while text-guided large models are robust but computationally expensive. This paper proposes a lightweight, inference-efficient fusion framework that bridges this gap by eliminating the need for textual input at inference time. The method introduces three key innovations: (1) distilling textual prior knowledge embedded in large language models (LLMs) into a compact image fusion network; (2) a spatial-channel cross-fusion module that enables cross-dimensional prior modeling; and (3) knowledge transfer via a teacher–student architecture. Experimental results show that the approach retains over 90% of the teacher model's performance while reducing both parameter count and inference latency by 90%, and that it significantly outperforms existing state-of-the-art methods across multiple benchmarks. The source code will be made publicly available.
📝 Abstract
Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead in both memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework uses a teacher-student architecture, in which the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module that enhances the model's ability to exploit textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality: the distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing state-of-the-art (SOTA) methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.
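The abstract gives no implementation details, so the following is only a rough, hypothetical sketch of the two core ideas it names: a spatial-channel cross-fusion of feature maps, and a distillation loss that pulls student features and outputs toward the teacher's. All function names, shapes, and the specific attention/loss forms here are illustrative assumptions written in NumPy, not the authors' actual modules.

```python
import numpy as np

def channel_attention(feat):
    # feat: (C, H, W); reweight channels via global average pooling + softmax
    pooled = feat.mean(axis=(1, 2))                     # (C,)
    pooled = pooled - pooled.max()                      # numerical stability
    w = np.exp(pooled) / np.exp(pooled).sum()           # softmax over channels
    return feat * w[:, None, None]

def spatial_attention(feat):
    # per-pixel gate from the channel-averaged map (sigmoid)
    m = feat.mean(axis=0)                               # (H, W)
    w = 1.0 / (1.0 + np.exp(-m))
    return feat * w[None, :, :]

def cross_fusion(feat):
    # toy stand-in for the spatial-channel cross-fusion module:
    # average the channel- and spatial-attended feature paths
    return 0.5 * (channel_attention(feat) + spatial_attention(feat))

def distill_loss(student_feat, teacher_feat, student_out, teacher_out, alpha=0.5):
    # feature-level + output-level distillation, both plain MSE here;
    # the paper's "tailored distillation process" is likely more elaborate
    feat_term = np.mean((student_feat - teacher_feat) ** 2)
    out_term = np.mean((student_out - teacher_out) ** 2)
    return alpha * feat_term + (1.0 - alpha) * out_term
```

In this toy form, the teacher (with text priors) is run once to produce target features and fused outputs, and the compact student is trained on `distill_loss` alone, so no text encoder is needed at inference.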