🤖 AI Summary
Visible-infrared cross-modal object detection is commonly hindered by feature misalignment arising from resolution disparities, spatial shifts, and modality inconsistencies, which makes the two feature streams hard to fuse and lets noise leak into the fused representation. To address this, we propose a misalignment-aware unified detection framework with two key components: (1) wavelet-guided multi-frequency feature decomposition, which separates features into frequency sub-bands via the discrete wavelet transform; and (2) modality-aware adaptive fusion, which uses misalignment-sensitive cross-modal guidance to dynamically rectify misaligned features and suppress spurious responses. Our approach achieves state-of-the-art performance on the DVTOD, DroneVehicle, and M3FD benchmarks, improving detection accuracy and robustness under severe misalignment. By explicitly modeling cross-modal alignment in the frequency domain, the method offers an interpretable and generalizable approach to cross-modal feature learning.
📝 Abstract
Visible-infrared object detection aims to improve detection robustness by exploiting the complementary information in visible and infrared image pairs. However, its performance is often limited by frequent misalignments caused by resolution disparities, spatial displacements, and modality inconsistencies. To address this issue, we propose the Wavelet-guided Misalignment-aware Network (WMNet), a unified framework designed to adaptively handle different cross-modal misalignment patterns. WMNet incorporates wavelet-based multi-frequency analysis and modality-aware fusion mechanisms to improve the alignment and integration of cross-modal features. By jointly exploiting low- and high-frequency information and introducing adaptive guidance across modalities, WMNet alleviates the adverse effects of noise, illumination variation, and spatial misalignment. Furthermore, it enhances the representation of salient target features while suppressing spurious or misleading information, thereby promoting more accurate and robust detection. Extensive evaluations on the DVTOD, DroneVehicle, and M3FD datasets demonstrate that WMNet achieves state-of-the-art performance on misaligned cross-modal object detection tasks, confirming its effectiveness and practical applicability.
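To make the idea of wavelet-based multi-frequency analysis concrete, the following is a minimal sketch of a single-level 2D Haar wavelet decomposition applied to a feature map. It is an illustration only: the abstract does not state which wavelet family, how many decomposition levels, or which implementation WMNet uses, so the Haar basis and the function names `haar_dwt2` and `haar_idwt2` are assumptions introduced here, not the authors' code.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D orthonormal Haar transform over the last two axes.

    Assumption: Haar is used purely for illustration; the abstract does not
    name the wavelet. Height and width must be even.
    Returns one low-frequency band and three high-frequency bands, each at
    half the spatial resolution of the input.
    """
    x = np.asarray(x, dtype=np.float64)
    h, w = x.shape[-2:]
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # The four samples of every non-overlapping 2x2 block.
    a = x[..., 0::2, 0::2]  # top-left
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # local average: coarse structure
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row
    col_diff = (a - b + c - d) / 2.0  # left column minus right column
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return low, (row_diff, col_diff, diag)


def haar_idwt2(low, highs):
    """Inverse of haar_dwt2: rebuilds the full-resolution array exactly."""
    row_diff, col_diff, diag = highs
    out_shape = low.shape[:-2] + (low.shape[-2] * 2, low.shape[-1] * 2)
    out = np.empty(out_shape, dtype=np.float64)
    out[..., 0::2, 0::2] = (low + row_diff + col_diff + diag) / 2.0
    out[..., 0::2, 1::2] = (low + row_diff - col_diff - diag) / 2.0
    out[..., 1::2, 0::2] = (low - row_diff + col_diff - diag) / 2.0
    out[..., 1::2, 1::2] = (low - row_diff - col_diff + diag) / 2.0
    return out


# Hypothetical feature map: 2 channels, 4x4 spatial grid.
features = np.arange(32, dtype=np.float64).reshape(2, 4, 4)
low_band, high_bands = haar_dwt2(features)
reconstructed = haar_idwt2(low_band, high_bands)
```

Because the transform is invertible, splitting features this way discards nothing: the low band carries coarse structure while the three high bands carry edge-like detail, so a fusion stage could treat the two groups differently before recombining them. How WMNet actually processes each band is not described in the abstract.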