Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the insufficient robustness of existing bimodal (e.g., RGB-D, RGB-TIR) trackers in complex scenes, this paper introduces the novel task of RGB-D-TIR trimodal collaborative tracking. To support this task, we construct RGBDT500, the first large-scale, synchronously annotated trimodal video dataset, comprising 500 sequences. Methodologically, we propose RDTTrack: a tracker built upon a pre-trained RGB foundation model, enhanced with prompt learning for cross-modal adaptive alignment. Crucially, we introduce an orthogonal projection constraint mechanism to explicitly model the geometric complementarity between thermal infrared and depth modalities, enabling efficient trimodal fusion. Extensive experiments demonstrate that RDTTrack significantly outperforms state-of-the-art bimodal methods in both accuracy and robustness across diverse challenging scenarios. This work validates the effectiveness and necessity of trimodal collaborative modeling for visual tracking.

πŸ“ Abstract
Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to their limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, namely visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. Specifically, RDTTrack fuses the thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.
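The abstract does not spell out the fusion operator, but the general idea of an orthogonal projection constraint can be sketched as follows: keep only the component of the depth feature that is orthogonal to the thermal feature, so the two modalities contribute non-redundant cues before being blended with RGB as a prompt. This is a minimal illustrative sketch, not the paper's implementation; the function names, the vector-level formulation, and the convex-blend prompt are all assumptions.

```python
import numpy as np

def orthogonal_fusion(tir_feat: np.ndarray, depth_feat: np.ndarray) -> np.ndarray:
    """Fuse depth into TIR, keeping only the depth component that is
    orthogonal to the TIR feature (hypothetical sketch of an
    orthogonal-projection-constrained fusion)."""
    # Component of depth parallel to TIR (standard vector projection).
    scale = np.dot(depth_feat, tir_feat) / (np.dot(tir_feat, tir_feat) + 1e-8)
    parallel = scale * tir_feat
    # Discard the redundant parallel part; keep the orthogonal residual.
    depth_orth = depth_feat - parallel
    return tir_feat + depth_orth

def build_prompt(rgb_feat: np.ndarray, fused_dt: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Hypothetical prompt construction: a convex blend of RGB features
    and the fused depth-thermal features."""
    return (1.0 - alpha) * rgb_feat + alpha * fused_dt
```

By construction, the residual added on top of the TIR feature is orthogonal to it, which is the complementarity property the summary attributes to the constraint; in the actual tracker this would operate on learned token embeddings rather than raw vectors.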
Problem

Research questions and friction points this paper is trying to address.

Integrates vision, depth, and thermal modalities for robust tracking
Addresses limitations of dual-modal tracking in complex scenarios
Develops dataset and algorithm for tri-modal object tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates RGB, depth, thermal modalities for tracking
Fuses modalities using orthogonal projection constraint
Leverages prompt learning with pretrained RGB model