Towards General Multimodal Visual Tracking

πŸ“… 2025-03-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing bi-modal visual tracking methods exhibit limited robustness under challenging conditions such as occlusion, low illumination, and fast motion. To address this, we introduce, for the first time, the general multimodal visual tracking task, which integrates four complementary modalities: RGB, thermal infrared, event streams, and natural language. Toward this goal, we establish QuadTrack600, the first large-scale, fully aligned quad-modal benchmark, comprising 600 sequences and 384.7K frame groups annotated with 21 sequence-level challenge attributes for fine-grained analysis. We further propose QuadFusion, a novel framework whose Multiscale Fusion Mamba performs cross-modal interaction at linear complexity across four scanning scales. Extensive experiments demonstrate that QuadFusion achieves state-of-the-art performance on QuadTrack600 as well as on established bi-modal benchmarks (LasHeR, VisEvent, TNL2K), validating the effectiveness of quad-modal synergy in enhancing tracking robustness across complex scenarios.

πŸ“ Abstract
Existing multimodal tracking studies focus on bi-modal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although promising tracking performance is achieved by leveraging complementary cues from different sources, tracking in complex scenes remains challenging due to the inherent limitations of bi-modal input. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities, namely RGB, thermal infrared, event, and language, for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark comprising 600 video sequences totaling 384.7K high-resolution (640Γ—480) frame groups. In each frame group, all four modalities are spatially aligned and meticulously annotated with bounding boxes, and 21 sequence-level challenge attributes are provided for detailed performance analysis. Although quad-modal data provide richer information, two issues make fusing four modalities challenging: the disparity in information quantity among modalities and the computational burden of processing four streams. To handle these issues, we propose a novel approach called QuadFusion, which incorporates an efficient Multiscale Fusion Mamba with four different scanning scales to achieve sufficient interaction among the four modalities while overcoming the exponential computational burden. Extensive experiments on the QuadTrack600 dataset and three bi-modal tracking datasets, including LasHeR, VisEvent, and TNL2K, validate the effectiveness of our QuadFusion.
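
The abstract's core mechanism, a Multiscale Fusion Mamba that scans concatenated modality tokens at four scales, can be illustrated with a minimal sketch. Only the module's name and the idea of four linear-complexity scanning scales come from the paper; the pooling strategy, hyperparameters, and all other design choices below are illustrative assumptions, and the code relies on the `mamba_ssm` package's `Mamba` block rather than the authors' released implementation.

```python
# Minimal sketch of a multi-scale Mamba fusion block over four modality
# token sequences. Scale choices, pooling, and dimensions are assumptions
# for illustration, not the paper's actual QuadFusion design.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes the mamba_ssm package is installed


class MultiscaleFusionMamba(nn.Module):
    def __init__(self, dim: int = 256, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        # One linear-time Mamba scan per scale (four scales, per the paper).
        self.scans = nn.ModuleList(
            Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)
            for _ in scales
        )
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, rgb, tir, event, lang):
        # Each input: (B, N_m, dim) token sequence for one modality.
        # Concatenating along the token axis lets a single scan mix all
        # modalities in linear time instead of quadratic attention.
        x = torch.cat([rgb, tir, event, lang], dim=1)  # (B, N, dim)
        outs = []
        for s, scan in zip(self.scales, self.scans):
            if s == 1:
                y = scan(x)
            else:
                # Coarser scale: average-pool tokens, scan the shorter
                # sequence, then upsample back to one feature per token.
                xs = nn.functional.avg_pool1d(x.transpose(1, 2), s).transpose(1, 2)
                y = scan(xs)
                y = nn.functional.interpolate(
                    y.transpose(1, 2), size=x.size(1), mode="nearest"
                ).transpose(1, 2)
            outs.append(y)
        # Fuse the per-scale features back to the working dimension.
        return self.proj(torch.cat(outs, dim=-1))  # (B, N, dim)
```

The design intuition matches the abstract: a state-space scan costs O(N) in sequence length, so running four scans over the concatenated quad-modal tokens stays linear, whereas pairwise cross-attention across four modalities would grow much faster.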
Problem

Research questions and friction points this paper is trying to address.

Bi-modal trackers (RGB-Thermal, RGB-Event, RGB-Language) remain fragile in complex scenes; each pairing misses cues that the other modalities could provide.
Fusing four modalities is nontrivial: the modalities differ in information quantity, and naive all-pairs interaction is computationally prohibitive.
No existing benchmark offers all four modalities spatially aligned and annotated for comprehensive evaluation; QuadTrack600 is introduced to fill this gap.
Innovation

Methods, ideas, or system contributions that make the work stand out.

QuadFusion fuses RGB, thermal infrared, event, and language cues within a single tracking framework.
A Multiscale Fusion Mamba with four scanning scales enables sufficient cross-modal interaction at linear complexity.
The QuadTrack600 benchmark (600 sequences, 384.7K aligned frame groups, 21 challenge attributes) supports comprehensive evaluation; see the data-layout sketch below.
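
To make the benchmark's structure concrete, here is a hypothetical sketch of one spatially aligned frame group as the abstract describes it: four modalities sharing a single ground-truth box. The directory layout, file names, and annotation format below are invented for illustration; the released QuadTrack600 package defines the actual structure.

```python
# Hypothetical reader for one quad-modal frame group in a
# QuadTrack600-style sequence. All paths and formats are assumptions.
from dataclasses import dataclass
from pathlib import Path
import numpy as np


@dataclass
class FrameGroup:
    rgb: np.ndarray    # (480, 640, 3) visible image
    tir: np.ndarray    # (480, 640) thermal infrared image
    event: np.ndarray  # (N, 4) event stream: x, y, timestamp, polarity
    language: str      # natural-language description of the target
    bbox: tuple        # (x, y, w, h) box, shared by all aligned modalities


def load_frame_group(seq_dir: Path, idx: int) -> FrameGroup:
    import cv2  # assumed available for image decoding
    rgb = cv2.imread(str(seq_dir / "rgb" / f"{idx:06d}.jpg"))
    tir = cv2.imread(str(seq_dir / "tir" / f"{idx:06d}.png"), cv2.IMREAD_GRAYSCALE)
    event = np.load(seq_dir / "event" / f"{idx:06d}.npy")
    language = (seq_dir / "language.txt").read_text().strip()
    # One comma-separated box per frame group; spatial alignment means a
    # single annotation serves RGB, thermal, and event views.
    boxes = np.loadtxt(seq_dir / "groundtruth.txt", delimiter=",", ndmin=2)
    x, y, w, h = boxes[idx]
    return FrameGroup(rgb, tir, event, language, (x, y, w, h))
```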
πŸ‘₯ Authors
Andong Lu, Anhui University (CV, DL)
Mai Wen, School of Artificial Intelligence, Anhui University
Jinhu Wang, School of Computer Science and Technology, Anhui University
Yuanzhi Guo, School of Artificial Intelligence, Anhui University
Chenglong Li, Professor, The University of Florida (Drug Design, Drug Discovery, Molecular Recognition, Molecular Modeling, Protein Structure and Dynamics)
Jin Tang, Anhui University (Computer vision, intelligent video analysis)
Bin Luo, School of Computer Science and Technology, Anhui University