🤖 AI Summary
This work addresses two problems in multispectral object detection, severe bias in modality-shared features and insufficient modality-specific features, by explicitly decoupling shared and modality-specific information in the frequency domain, which the authors present as a new perspective. Specifically, wavelet decomposition separates infrared and visible images into low-frequency (modality-shared) and high-frequency (modality-specific) components. A dedicated alignment module handles the low-frequency components, a preservation module handles the high-frequency components, and a frequency-aware query mechanism dynamically fuses the two. The proposed method, WD-FQDet, combines cross-modal attention, a multi-scale gradient consistency loss, and hybrid feature enhancement, and achieves state-of-the-art performance on the FLIR, LLVIP, and M3FD datasets.
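The low/high-frequency split described above can be illustrated with a minimal sketch of a single-level 2D wavelet decomposition. The choice of the Haar wavelet, the single decomposition level, and the function name `haar_dwt2` are assumptions made for illustration; the summary does not state which wavelet or how many levels the method uses.

```python
def haar_dwt2(image):
    """Single-level 2D orthonormal Haar transform (illustrative assumption).

    `image` is a list of rows of floats with even height and width.
    Returns four half-resolution sub-bands:
      low         - local averages (the low-frequency component)
      row_detail  - differences between the two rows of each 2x2 block
      col_detail  - differences between the two columns of each 2x2 block
      diag_detail - diagonal differences within each 2x2 block
    The three detail bands together form the high-frequency component.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")

    low, row_detail, col_detail, diag_detail = [], [], [], []
    for r in range(0, rows, 2):
        low_row, rd_row, cd_row, dd_row = [], [], [], []
        for c in range(0, cols, 2):
            # The four pixels of one 2x2 block.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + d + e) / 2.0)
            rd_row.append((a + b - d - e) / 2.0)
            cd_row.append((a - b + d - e) / 2.0)
            dd_row.append((a - b - d + e) / 2.0)
        low.append(low_row)
        row_detail.append(rd_row)
        col_detail.append(cd_row)
        diag_detail.append(dd_row)
    return low, row_detail, col_detail, diag_detail


# Tiny example: one 2x2 block.
low, row_detail, col_detail, diag_detail = haar_dwt2([[1.0, 2.0], [3.0, 4.0]])
```

In this sketch a perfectly flat image yields all-zero detail bands, which is the sense in which edges and texture end up in the high-frequency component while overall intensity structure stays in the low-frequency one.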
📝 Abstract
Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from severe bias in modality-shared features and from insufficient modality-specific features. To address these issues, we propose WD-FQDet, a novel detection framework that explicitly decouples modality-shared and modality-specific information in the infrared and visible modalities from the new perspective of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module aligns modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module preserves modality-specific features through a multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, because the contributions of homogeneous (modality-shared) and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module that dynamically regulates their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.
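To make the idea of a multi-scale gradient consistency loss more concrete, here is a hedged sketch of one generic form of such a loss: the mean absolute difference between the gradient maps of two inputs, averaged over several resolutions. The abstract does not give the actual formulation, so everything here is an assumption for illustration, including the forward-difference gradient, the 2x2 average pooling between scales, the L1 distance, and the hypothetical names `gradient_map`, `downsample2`, and `multiscale_gradient_loss`.

```python
def gradient_map(img):
    """Per-pixel |dx| + |dy| using forward differences (zero at the far border)."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            gx = img[r][c + 1] - img[r][c] if c + 1 < cols else 0.0
            gy = img[r + 1][c] - img[r][c] if r + 1 < rows else 0.0
            row.append(abs(gx) + abs(gy))
        out.append(row)
    return out


def downsample2(img):
    """2x2 average pooling; a trailing odd row or column is dropped."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def multiscale_gradient_loss(a, b, num_scales=3):
    """Mean L1 distance between gradient maps of `a` and `b`, averaged over scales.

    Illustrative assumption, not the paper's definition. Both inputs are
    same-shaped lists of rows of floats.
    """
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("inputs must have the same shape")
    total, used = 0.0, 0
    for _ in range(num_scales):
        if len(a) < 2 or len(a[0]) < 2:
            break  # too small to take gradients or pool further
        ga, gb = gradient_map(a), gradient_map(b)
        count = len(ga) * len(ga[0])
        diff = sum(abs(x - y) for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))
        total += diff / count
        used += 1
        a, b = downsample2(a), downsample2(b)
    return total / used if used else 0.0
```

A loss of this shape is zero when the two inputs have the same edge structure at every scale, even if their absolute intensities differ by a constant, which is why gradient-based terms are a natural fit for preserving high-frequency detail.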