🤖 AI Summary
To address background noise and common-mode interference in feature fusion for multispectral object detection, this paper proposes a fusion framework based on cross-modal contrastive learning and iterative differential optimization. It introduces two modules, the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM), which jointly emulate a feedback-based differential amplification mechanism to dynamically generate differential guidance signals. The framework combines cross-modal contrastive learning, relational graph modeling, and iterative optimization to adaptively enhance salient structural features while suppressing noise. Evaluated on the FLIR, LLVIP, and M$^3$FD benchmarks, the method achieves state-of-the-art performance, with reported improvements in cross-modal alignment and in robustness under complex, cluttered scenes.
📝 Abstract
Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose a feature fusion framework based on a cross-modal feature contrast and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback while suppressing feature noise, leading to significant performance gains. In extensive experiments on the FLIR, LLVIP, and M$^3$FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
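
The differential-amplifier analogy behind the DFFM can be illustrated with a toy sketch. This is not the IRDFusion implementation: the abstract does not specify the modules' internals, so the functions below (`differential`, `common_mode`, `iterative_feedback`) and the parameters `gain` and `steps` are hypothetical, and plain lists stand in for feature maps. The sketch only shows the general principle that feeding the inter-modal difference back into both branches amplifies modality-specific content while leaving the shared (common-mode) component unchanged.

```python
# Toy illustration of feedback differential amplification.
# Assumption: all names and the update rule are hypothetical, not from the paper.

def differential(a, b):
    """Element-wise difference; components shared by a and b cancel."""
    return [x - y for x, y in zip(a, b)]

def common_mode(a, b):
    """Element-wise mean; the component shared by both inputs."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def iterative_feedback(rgb, ir, gain=0.5, steps=3):
    """Feed the differential signal back into both branches `steps` times.

    Each step scales the differential by (1 + 2 * gain) and leaves the
    common-mode component untouched.
    """
    for _ in range(steps):
        d = differential(rgb, ir)
        rgb = [x + gain * di for x, di in zip(rgb, d)]
        ir = [y - gain * di for y, di in zip(ir, d)]
    return rgb, ir

# A shared background of 1.0 everywhere, plus modality-specific responses.
rgb0 = [1.0, 3.0, 1.0, 1.0]  # extra response at index 1 (visible only)
ir0 = [1.0, 1.0, 1.0, 2.0]   # extra response at index 3 (infrared only)

rgb1, ir1 = iterative_feedback(rgb0, ir0, gain=0.5, steps=3)
diff_before = differential(rgb0, ir0)
diff_after = differential(rgb1, ir1)
```

With `gain=0.5`, each step doubles the differential, so three steps scale it by a factor of eight, while `common_mode(rgb1, ir1)` equals `common_mode(rgb0, ir0)`. In a learned model the feedback would presumably act on relation maps with trainable weights rather than on raw values with a fixed gain.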