🤖 AI Summary
Existing RGB-T tracking benchmarks under-represent multi-modal warranting (MMW) scenarios—such as extreme illumination or thermal truncation—in which either the RGB or the thermal infrared (TIR) modality becomes invalid. To address this gap, we introduce MV-RGBT, the first RGB-T tracking benchmark captured specifically in MMW scenarios, covering 19 distinct scenes and 36 object categories and divided into subsets according to which modality remains valid. Building on this, the work poses the new problem of "when to fuse" modalities and proposes MoETrack, a mixture-of-experts tracker in which each expert produces independent tracking results together with a confidence score, enabling confidence-guided fusion. Experiments show that MoETrack achieves state-of-the-art performance on MV-RGBT, GTOT, and LasHeR, and support the conclusion that fusion is not always beneficial, especially under MMW conditions.
📝 Abstract
RGBT tracking draws increasing attention because of its robustness in multi-modal warranting (MMW) scenarios, such as nighttime and adverse weather conditions, where relying on a single sensing modality fails to ensure stable tracking results. However, existing benchmarks predominantly contain videos collected in common scenarios where both RGB and thermal infrared (TIR) information are of sufficient quality. This weakens their representativeness of severe imaging conditions, leading to tracking failures in MMW scenarios. To bridge this gap, we present MV-RGBT, a new benchmark that considers modality validity, captured specifically in MMW scenarios where either the RGB modality (extreme illumination) or the TIR modality (thermal truncation) is invalid. Accordingly, it is divided into two subsets according to the valid modality, offering a new compositional perspective for evaluation and providing valuable insights for future designs. Moreover, MV-RGBT is the most diverse benchmark of its kind, featuring 36 different object categories captured across 19 distinct scenes. Furthermore, given the severe imaging conditions in MMW scenarios, we pose a new problem in RGBT tracking, named 'when to fuse', to stimulate the development of fusion strategies for such scenarios. To facilitate its discussion, we propose MoETrack, a mixture-of-experts solution in which each expert generates independent tracking results along with a confidence score. Extensive results demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and support the conclusion that fusion is not always beneficial, especially in MMW scenarios. In addition, MoETrack achieves state-of-the-art results on several benchmarks, including MV-RGBT, GTOT, and LasHeR. GitHub: https://github.com/Zhangyong-Tang/MVRGBT.
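The 'when to fuse' idea—weighting or skipping fusion based on per-expert confidence—can be sketched as follows. This is a minimal illustration only: the function name, bounding-box format `(x, y, w, h)`, and the confidence threshold are assumptions for exposition, not the authors' actual MoETrack implementation.

```python
def fuse_boxes(rgb_box, rgb_conf, tir_box, tir_conf, min_conf=0.4):
    """Confidence-guided fusion of two experts' box predictions (x, y, w, h).

    If one expert's confidence falls below `min_conf` (a hypothetical
    threshold), its modality is treated as invalid and fusion is skipped,
    returning the reliable expert's box alone; otherwise the boxes are
    combined by confidence-weighted averaging.
    """
    if rgb_conf < min_conf and tir_conf >= min_conf:
        return tir_box  # RGB modality deemed invalid: do not fuse
    if tir_conf < min_conf and rgb_conf >= min_conf:
        return rgb_box  # TIR modality deemed invalid: do not fuse
    # Both (or neither) reliable: fall back to confidence-weighted fusion
    total = rgb_conf + tir_conf
    w_rgb, w_tir = rgb_conf / total, tir_conf / total
    return tuple(w_rgb * r + w_tir * t for r, t in zip(rgb_box, tir_box))
```

For example, with a low-confidence RGB expert (e.g. at night), `fuse_boxes((0, 0, 10, 10), 0.2, (4, 4, 10, 10), 0.8)` returns the TIR box unchanged, reflecting the paper's observation that fusion is not always beneficial.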