RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing RGBT tracking methods, which rely solely on initial visual cues and lack linguistic guidance, rendering them vulnerable to target appearance variations, modality heterogeneity, and background clutter. To overcome these challenges, we propose the first language-guided RGBT tracking framework that unifies visual and linguistic modalities. Our approach leverages a multimodal large language model to generate descriptive text, and introduces a multimodal Transformer encoder, an adaptive token fusion (ATF) mechanism, and a context-aware reasoning module. Temporal language reasoning and dynamic knowledge updating are achieved through retrieval-augmented generation (RAG). Extensive experiments on four RGBT benchmarks demonstrate state-of-the-art performance, with significant improvements in tracking accuracy and robustness under challenging conditions such as illumination changes and occlusion.
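The summary mentions an adaptive token fusion (ATF) mechanism that selects target-relevant tokens using cross-modal correlations. The paper's actual design is not reproduced here; the toy sketch below only illustrates the general idea of scoring each search position against a target template in both modalities and keeping the top-k positions. All names, the scoring rule, and the toy features are illustrative assumptions, not the authors' implementation.

```python
def select_tokens(rgb_tokens, tir_tokens, template, k):
    """Keep the k search positions most correlated with the target template,
    scoring each position across both modalities (a loose, toy version of
    target-relevant token selection; real ATF operates on learned features)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Sum the template correlation over the RGB and thermal feature at each position.
    scores = [dot(r, template) + dot(t, template)
              for r, t in zip(rgb_tokens, tir_tokens)]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(keep)  # kept token indices, restored to spatial order

# Toy 2-D features for 4 search positions in each modality.
rgb = [[1.0, 0.0], [0.2, 0.1], [0.9, 0.2], [0.0, 1.0]]
tir = [[0.8, 0.1], [0.1, 0.0], [1.0, 0.0], [0.1, 0.9]]
template = [1.0, 0.0]  # target appearance feature
print(select_tokens(rgb, tir, template, k=2))  # prints [0, 2]
```

Discarding low-scoring positions is what lets such a scheme shrink a redundant search region before fusion; the actual selection criterion in RAGTrack may differ.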

📝 Abstract
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling and, lacking language guidance, fail to adapt to appearance variations. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks, via a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. We then propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. Specifically, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) module to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancy and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) that maintains a dynamic knowledge base and employs Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
Problem

Research questions and friction points this paper is trying to address.

RGBT tracking
language guidance
modality gap
appearance variation
background distraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
RGBT tracking
visual-language modeling
Adaptive Token Fusion
Multi-modal Transformer
Hao Li
College of Command and Control Engineering, Army Engineering University of PLA
Yuhao Wang
Dalian University of Technology
Computer Vision · Multi-modal Fusion · ReID
Wenning Hao
College of Command and Control Engineering, Army Engineering University of PLA
Pingping Zhang
School of Future Technology, Dalian University of Technology
Dong Wang
Dalian University of Technology
Computer Vision · Image Processing · Object Tracking · Visual Tracking
Huchuan Lu
School of Future Technology, Dalian University of Technology; School of Information and Communication Engineering, Dalian University of Technology