Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of simultaneously preserving fine details and ensuring downstream-task adaptability in infrared and visible-light image fusion, this paper proposes SAGE, a semantic-guided lightweight fusion framework. Methodologically, SAGE introduces (1) the Semantic Persistent Attention (SPA) module—the first of its kind—to dynamically model cross-modal semantic correlations; and (2) a three-level knowledge distillation mechanism—operating at feature, pixel, and contrastive semantic levels—to transfer semantic priors from the Segment Anything Model (SAM) to a compact student network, thereby eliminating SAM dependency during inference. Extensive experiments demonstrate that SAGE consistently outperforms state-of-the-art methods in both visual quality and downstream performance for detection and segmentation tasks, while achieving real-time inference speed (>30 FPS). This advances task-oriented practicality of fused imagery without compromising fidelity or efficiency.

📝 Abstract
Multi-modality image fusion, particularly infrared and visible image fusion, plays a crucial role in integrating diverse modalities to enhance scene understanding. Early research primarily focused on visual quality, yet challenges remain in preserving fine details, making it difficult to adapt to subsequent tasks. Recent approaches have shifted towards task-specific design, but struggle to achieve "the best of both worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Establish downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via a persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allows the student network to effectively extract knowledge at the feature, pixel, and contrastive semantic levels, thereby removing reliance on the cumbersome SAM model. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency.
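The abstract describes a distillation objective with triplet losses operating at the feature, pixel, and contrastive semantic levels. The paper's exact formulation is not given on this page, so the following is only a minimal NumPy sketch of one plausible reading: feature-level MSE between student and SAM-guided teacher features, pixel-level L1 between fused outputs, and an InfoNCE-style contrastive term on semantic embeddings. All function names and loss weights here are hypothetical, not the authors'.

```python
import numpy as np

def feature_loss(f_student, f_teacher):
    # Feature-level term: MSE between intermediate feature maps.
    return np.mean((f_student - f_teacher) ** 2)

def pixel_loss(y_student, y_teacher):
    # Pixel-level term: L1 distance between fused images.
    return np.mean(np.abs(y_student - y_teacher))

def contrastive_loss(z_student, z_pos, z_negs, tau=0.1):
    # Contrastive semantic term (InfoNCE-style): pull the student embedding
    # toward the teacher's (positive) and away from negative embeddings.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    pos = np.exp(cos(z_student, z_pos) / tau)
    negs = sum(np.exp(cos(z_student, z) / tau) for z in z_negs)
    return -np.log(pos / (pos + negs))

def triplet_distillation_loss(f_s, f_t, y_s, y_t, z_s, z_pos, z_negs,
                              w_feat=1.0, w_pix=1.0, w_con=0.1):
    # Weighted sum of the three terms; the weights are illustrative only.
    return (w_feat * feature_loss(f_s, f_t)
            + w_pix * pixel_loss(y_s, y_t)
            + w_con * contrastive_loss(z_s, z_pos, z_negs))
```

Once such a student is trained, SAM is no longer needed at inference time, which is what makes the lightweight deployment (>30 FPS) claimed in the summary possible.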
Problem

Research questions and friction points this paper is trying to address.

Enhance multi-modality image fusion quality
Improve downstream task adaptability
Reduce reliance on SAM during inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages SAM for semantic knowledge in fusion
Introduces Semantic Persistent Attention Module
Uses bi-level optimization-driven distillation mechanism
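The SPA module is described as maintaining source information via a persistent repository while extracting semantic priors from SAM. Its internals are not specified on this page, so here is only a rough NumPy sketch of one way such a mechanism could look: cross-attention where queries come from the SAM-derived semantic prior and keys/values from the source features augmented with a persistent memory. Every name and the repository update rule are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spa_attention(semantic_prior, source_feat, repository):
    # Queries from the SAM semantic prior; keys/values from source features
    # concatenated with a persistent repository that retains modality
    # information across layers (hypothetical design, not the paper's).
    q = semantic_prior                                       # (n_q, d)
    kv = np.concatenate([source_feat, repository], axis=0)   # (n_kv, d)
    d = q.shape[-1]
    attn = softmax(q @ kv.T / np.sqrt(d))                    # (n_q, n_kv)
    out = attn @ kv                                          # (n_q, d)
    # Refresh the repository with current source features via a simple
    # running average (placeholder for whatever update the paper uses).
    new_repo = 0.9 * repository + 0.1 * source_feat[: repository.shape[0]]
    return out, new_repo
```

The point of the persistent repository in this sketch is that source details survive even when attention concentrates on a few semantically salient regions, matching the stated goal of preserving fine detail alongside downstream adaptability.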
Guanyao Wu
School of Software Technology & DUT-RU International School of ISE, Dalian University of Technology
Haoyu Liu
School of Software Technology & DUT-RU International School of ISE, Dalian University of Technology
Hongming Fu
School of Software Technology & DUT-RU International School of ISE, Dalian University of Technology
Yichuan Peng
Dalian University of Technology
Jinyuan Liu
School of Mechanical Engineering, Dalian University of Technology
Xin Fan
School of Software Technology & DUT-RU International School of ISE, Dalian University of Technology
Risheng Liu
Professor, Dalian University of Technology
computer vision, machine learning, optimization