🤖 AI Summary
Few-shot multispectral object detection (FSMOD) aims to jointly detect objects from visible and infrared modalities using only a small number of annotated samples. To address this challenge, we propose FSMODNet, a novel framework that, for the first time, integrates deformable attention into FSMOD to enable dynamic cross-modal feature alignment and complementary enhancement. FSMODNet further combines a feature pyramid network with few-shot learning strategies to improve generalization under data scarcity and extreme illumination conditions. Extensive experiments on two public multispectral benchmarks demonstrate that FSMODNet consistently outperforms multiple strong baselines built upon state-of-the-art detectors. Ablation studies confirm the effectiveness of each component, particularly the deformable attention module in bridging modality gaps. Overall, FSMODNet achieves superior detection accuracy and robustness in resource-constrained scenarios, establishing new performance benchmarks for few-shot multispectral detection.
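The cross-modal deformable attention described above can be illustrated with a minimal sketch: each query location in one modality's feature map samples the other modality's feature map at a small set of offset positions (bilinearly interpolated) and fuses them with softmax attention weights. This is an illustrative toy version, not the paper's implementation; the function names (`bilinear_sample`, `deformable_cross_attention`) are hypothetical, and in practice the offsets and weights would be predicted by learned layers over multi-scale FPN features rather than passed in directly.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample an (H, W, C) feature map at a fractional (y, x)."""
    H, W, _ = feat.shape
    y = np.clip(y, 0, H - 1)
    x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

def deformable_cross_attention(query_feat, kv_feat, offsets, weights):
    """Single-head deformable cross-attention sketch.

    query_feat: (H, W, C) features of one modality (e.g. visible).
    kv_feat:    (H, W, C) features of the other modality (e.g. thermal).
    offsets:    (H, W, K, 2) sampling offsets (dy, dx) per query, K points.
    weights:    (H, W, K) unnormalized attention logits per sampling point.
    Returns query_feat enhanced with attention-weighted samples of kv_feat.
    """
    H, W, _ = query_feat.shape
    K = offsets.shape[2]
    out = np.zeros_like(query_feat)
    for i in range(H):
        for j in range(W):
            # Sample the complementary modality at K deformed locations.
            samples = np.stack([
                bilinear_sample(kv_feat, i + offsets[i, j, k, 0],
                                j + offsets[i, j, k, 1])
                for k in range(K)
            ])  # (K, C)
            w = np.exp(weights[i, j])
            w = w / w.sum()  # softmax over the K sampling points
            # Residual fusion: keep the query features, add aligned samples.
            out[i, j] = query_feat[i, j] + w @ samples
    return out
```

With zero offsets and a single sampling point this degenerates to simple per-pixel residual fusion of the two modalities; nonzero learned offsets are what let the module compensate for spatial misalignment between the visible and thermal streams.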
📝 Abstract
Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named "FSMODNet" that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the complementary strengths of visible and thermal imagery through deformable attention, the proposed method demonstrates robust adaptability under complex illumination and environmental conditions. Experimental results on two public datasets show strong detection performance in challenging low-data regimes, outperforming several baselines built from state-of-the-art detectors. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.