Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the insufficient robustness of existing bimodal (e.g., RGB-D, RGB-TIR) trackers in complex scenes, this paper introduces the novel task of RGB-D-TIR trimodal collaborative tracking. To support this task, we construct RGBDT500, the first large-scale, synchronously annotated trimodal video dataset, comprising 500 sequences. Methodologically, we propose RDTTrack: a tracker built upon a pre-trained RGB foundation model, enhanced with prompt learning for cross-modal adaptive alignment. Crucially, we introduce an orthogonal projection constraint mechanism to explicitly model the geometric complementarity between thermal infrared and depth modalities, enabling efficient trimodal fusion. Extensive experiments demonstrate that RDTTrack significantly outperforms state-of-the-art bimodal methods in both accuracy and robustness across diverse challenging scenarios. This work validates the effectiveness and necessity of trimodal collaborative modeling for visual tracking.

πŸ“ Abstract
Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to their limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, namely visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. Specifically, RDTTrack fuses the thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.
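The abstract does not spell out the fusion operator, but the general idea of an orthogonal projection constraint can be sketched as follows: keep only the component of the depth feature that is orthogonal to the thermal feature, so the two modalities contribute non-redundant cues before being blended with RGB as a prompt. This is a minimal illustrative sketch, not the paper's implementation; the function names, the vector-level formulation, and the convex-blend prompt are all assumptions.

```python
import numpy as np

def orthogonal_fusion(tir_feat: np.ndarray, depth_feat: np.ndarray) -> np.ndarray:
    """Fuse depth into TIR, keeping only the depth component that is
    orthogonal to the TIR feature (hypothetical sketch of an
    orthogonal-projection-constrained fusion)."""
    # Component of depth parallel to TIR (standard vector projection).
    scale = np.dot(depth_feat, tir_feat) / (np.dot(tir_feat, tir_feat) + 1e-8)
    parallel = scale * tir_feat
    # Discard the redundant parallel part; keep the orthogonal residual.
    depth_orth = depth_feat - parallel
    return tir_feat + depth_orth

def build_prompt(rgb_feat: np.ndarray, fused_dt: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Hypothetical prompt construction: a convex blend of RGB features
    and the fused depth-thermal features."""
    return (1.0 - alpha) * rgb_feat + alpha * fused_dt
```

By construction, the residual added on top of the TIR feature is orthogonal to it, which is the complementarity property the summary attributes to the constraint; in the actual tracker this would operate on learned token embeddings rather than raw vectors.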
Problem

Research questions and friction points this paper is trying to address.

Integrates vision, depth, and thermal modalities for robust tracking
Addresses limitations of dual-modal tracking in complex scenarios
Develops dataset and algorithm for tri-modal object tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates RGB, depth, thermal modalities for tracking
Fuses modalities using orthogonal projection constraint
Leverages prompt learning with pretrained RGB model