🤖 AI Summary
Infrared–visible object detection (IVOD) suffers severe performance degradation when a modality is missing—especially when the dominant modality is absent. Method: This paper proposes Scarf-DETR, the first DETR-based detector framework supporting arbitrary modality combinations, approached from an architecture-compatibility perspective. Its core innovations include: (i) a plug-and-play Scarf Neck module, (ii) a pseudo-modality dropout training strategy, and (iii) a modality-agnostic deformable attention mechanism, enabling unified modeling of single- and dual-modality inputs. Additionally, the paper introduces the first comprehensive IVOD benchmark covering both dominant- and subordinate-modality missing scenarios. Results: Experiments demonstrate that Scarf-DETR significantly outperforms existing methods under incomplete-modality conditions, achieves state-of-the-art accuracy on standard IVOD benchmarks, and exhibits strong robustness, high cross-modal compatibility, and practical deployability.
📝 Abstract
Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advancements, current IVOD models exhibit severe performance declines when confronted with incomplete modality data, particularly when the dominant modality is missing. In this paper, we conduct a thorough investigation of the modality-incomplete IVOD problem from an architecture-compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism that allows the detector to flexibly adapt to any single or double modality during training and inference. When training Scarf-DETR, we design a pseudo-modality dropout strategy to fully exploit multi-modality information, making the detector compatible with and robust to both single- and double-modality working modes. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task, aimed at thoroughly assessing situations where the absent modality is either dominant or secondary. Scarf-DETR not only performs strongly in missing-modality scenarios but also achieves superior performance on standard modality-complete IVOD benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.
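To make the pseudo-modality dropout idea concrete, here is a minimal, hypothetical sketch of what such a training-time augmentation could look like: with some probability, one modality's features are dropped and stand-in (pseudo) features derived from the surviving modality take their place, so the detector is exposed to single- and dual-modality inputs within one training run. The function name, signature, and the choice of substituting the surviving modality's features are illustrative assumptions, not the paper's actual implementation.

```python
import random
import numpy as np

def pseudo_modality_dropout(feat_ir, feat_vis, p_drop=0.3, rng=None):
    """Hypothetical sketch of pseudo-modality dropout.

    With probability p_drop, one modality is dropped; here we substitute
    a copy of the surviving modality's features as the 'pseudo' input
    (an assumption for illustration). Otherwise both modalities pass
    through unchanged.
    """
    rng = rng or random.Random()
    r = rng.random()
    if r < p_drop / 2:
        # Drop infrared: the visible features stand in for both branches.
        return feat_vis.copy(), feat_vis
    elif r < p_drop:
        # Drop visible: the infrared features stand in for both branches.
        return feat_ir, feat_ir.copy()
    # Both modalities kept.
    return feat_ir, feat_vis
```

In a real pipeline this would be applied per batch before the fusion neck, so the same detector weights learn to handle every modality combination it may encounter at inference time.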