🤖 AI Summary
This work addresses the limitations of existing RGBT tracking methods, which rely solely on initial visual cues and lack linguistic guidance, rendering them vulnerable to target appearance variations, modality heterogeneity, and background clutter. To overcome these challenges, we propose the first language-guided RGBT tracking framework that unifies visual and linguistic modalities. Our approach leverages a multimodal large language model to generate descriptive text, and introduces a multimodal Transformer encoder, an adaptive token fusion (ATF) mechanism, and a context-aware reasoning module. Temporal language reasoning and dynamic knowledge updating are achieved through retrieval-augmented generation (RAG). Extensive experiments on four RGBT benchmarks demonstrate state-of-the-art performance, with significant improvements in tracking accuracy and robustness under challenging conditions such as illumination changes and occlusion.
📝 Abstract
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling, failing to adapt to appearance variations due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) module to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) that maintains a dynamic knowledge base and employs Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
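To make the ATF idea concrete, here is a minimal NumPy sketch of the two operations the abstract names: scoring search tokens against a target template to prune target-irrelevant ones, and exchanging channels between the RGB and thermal streams based on cross-modal correlation. The function name, scoring rule, and ratios are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def adaptive_token_fusion(rgb, tir, template, keep_ratio=0.5, exch_ratio=0.25):
    """Sketch of adaptive token fusion (assumed, not RAGTrack's exact design):
    rgb, tir: (N, C) search-region tokens from each modality;
    template: (C,) target feature vector."""
    # Relevance of each token to the target: cosine similarity with the
    # template, averaged over the two modalities.
    t = template / (np.linalg.norm(template) + 1e-8)
    def scores(x):
        xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        return xn @ t
    rel = 0.5 * (scores(rgb) + scores(tir))

    # Keep only the most target-relevant tokens to reduce search redundancy.
    k = max(2, int(keep_ratio * len(rel)))
    keep = np.argsort(rel)[-k:]
    rgb_k, tir_k = rgb[keep], tir[keep]

    # Exchange the channels where the two modalities agree least, so each
    # stream borrows complementary information from the other.
    ch_corr = np.array([np.corrcoef(rgb_k[:, c], tir_k[:, c])[0, 1]
                        for c in range(rgb.shape[1])])
    m = max(1, int(exch_ratio * rgb.shape[1]))
    swap = np.argsort(ch_corr)[:m]          # least-correlated channels
    rgb_out, tir_out = rgb_k.copy(), tir_k.copy()
    rgb_out[:, swap], tir_out[:, swap] = tir_k[:, swap], rgb_k[:, swap]
    return rgb_out, tir_out, keep
```

In a real tracker the relevance scores and channel-exchange decisions would be learned inside the Transformer encoder; the sketch only shows the token-pruning and channel-swapping structure.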
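The CRM's retrieval step can likewise be illustrated with a toy knowledge base that stores per-frame target descriptions and retrieves the most similar ones for the current frame. The class, its bag-of-words similarity, and the API below are stand-ins for a learned text encoder and the paper's actual CRM; they only show the store-then-retrieve pattern behind RAG-based temporal reasoning.

```python
from collections import Counter
import math

class KnowledgeBase:
    """Toy dynamic knowledge base (illustrative, not RAGTrack's CRM):
    stores per-frame descriptions, retrieves the k most similar."""
    def __init__(self):
        self.entries = []  # list of (frame_idx, text, term-count vector)

    @staticmethod
    def _embed(text):
        # Bag-of-words counts as a stand-in for a text encoder.
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b.get(w, 0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def update(self, frame_idx, text):
        # Dynamic update: append the new frame's description.
        self.entries.append((frame_idx, text, self._embed(text)))

    def retrieve(self, query, k=2):
        # Return the k stored descriptions most similar to the query.
        q = self._embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: self._cosine(q, e[2]),
                        reverse=True)
        return [(f, t) for f, t, _ in ranked[:k]]
```

Retrieved descriptions from earlier frames would then condition the tracker's language branch, which is what allows the target model to adapt as appearance changes over time.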