RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing RGBT tracking methods, which rely solely on initial visual cues and lack linguistic guidance, rendering them vulnerable to target appearance variations, modality heterogeneity, and background clutter. To overcome these challenges, we propose the first language-guided RGBT tracking framework that unifies visual and linguistic modalities. Our approach leverages a multimodal large language model to generate descriptive text, and introduces a multimodal Transformer encoder, an adaptive token fusion (ATF) mechanism, and a context-aware reasoning module. Temporal language reasoning and dynamic knowledge updating are achieved through retrieval-augmented generation (RAG). Extensive experiments on four RGBT benchmarks demonstrate state-of-the-art performance, with significant improvements in tracking accuracy and robustness under challenging conditions such as illumination changes and occlusion.
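The summary mentions an adaptive token fusion (ATF) mechanism that selects target-relevant tokens using cross-modal correlations. The paper's actual design is not reproduced here; the toy sketch below only illustrates the general idea of scoring each search position against a target template in both modalities and keeping the top-k positions. All names, the scoring rule, and the toy features are illustrative assumptions, not the authors' implementation.

```python
def select_tokens(rgb_tokens, tir_tokens, template, k):
    """Keep the k search positions most correlated with the target template,
    scoring each position across both modalities (a loose, toy version of
    target-relevant token selection; real ATF operates on learned features)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Sum the template correlation over the RGB and thermal feature at each position.
    scores = [dot(r, template) + dot(t, template)
              for r, t in zip(rgb_tokens, tir_tokens)]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(keep)  # kept token indices, restored to spatial order

# Toy 2-D features for 4 search positions in each modality.
rgb = [[1.0, 0.0], [0.2, 0.1], [0.9, 0.2], [0.0, 1.0]]
tir = [[0.8, 0.1], [0.1, 0.0], [1.0, 0.0], [0.1, 0.9]]
template = [1.0, 0.0]  # target appearance feature
print(select_tokens(rgb, tir, template, k=2))  # prints [0, 2]
```

Discarding low-scoring positions is what lets such a scheme shrink a redundant search region before fusion; the actual selection criterion in RAGTrack may differ.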

📝 Abstract
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling and, lacking language guidance, fail to adapt to appearance variations. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks, via a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. We then propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. Specifically, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) module to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancy and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) that maintains a dynamic knowledge base and employs Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
Problem

Research questions and friction points this paper is trying to address.

RGBT tracking
language guidance
modality gap
appearance variation
background distraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
RGBT tracking
visual-language modeling
Adaptive Token Fusion
Multi-modal Transformer
Hao Li
College of Command and Control Engineering, Army Engineering University of PLA
Yuhao Wang
Dalian University of Technology
Computer Vision · Multi-modal Fusion · ReID
Wenning Hao
College of Command and Control Engineering, Army Engineering University of PLA
Pingping Zhang
School of Future Technology, Dalian University of Technology
Dong Wang
Dalian University of Technology
Computer Vision · Image Processing · Object Tracking · Visual Tracking
Huchuan Lu
School of Future Technology, Dalian University of Technology; School of Information and Communication Engineering, Dalian University of Technology