ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking

📅 2025-08-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language tracking methods either fuse a fixed language description with visual features or adjust it with attention, which limits flexibility and localization accuracy. Methods that regenerate text during tracking neither expose their reasoning process nor fully exploit large models. To address these issues, the paper proposes ReasoningTrack, a long-term vision-language tracking framework with Chain-of-Thought (CoT) reasoning, built on the pre-trained vision-language model Qwen2.5-VL. The model generates updated target descriptions, and its reasoning and language generation are optimized with supervised fine-tuning (SFT) and GRPO-based reinforcement learning. The updated descriptions are embedded and fed, together with visual features, into a unified tracking backbone, and a tracking head predicts the target location. The paper also introduces TNLLT, a large-scale long-term vision-language tracking benchmark of 200 video sequences, on which 20 baseline trackers are re-trained and evaluated. Experiments on multiple vision-language tracking benchmarks validate the effectiveness of the reasoning-based language generation strategy.

📝 Abstract

Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse a fixed language description with vision features or simply modify it using attention, but their performance remains limited. Recently, some researchers have explored text generation to adapt to variations in the target during tracking. However, these works fail to provide insight into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both supervised fine-tuning (SFT) and the reinforcement learning method GRPO are used to optimize reasoning and language generation. The updated language descriptions are embedded and fed into a unified tracking backbone network together with vision features. A tracking head then predicts the location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of the proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack
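The data flow described in the abstract (a language model refreshes the target description, and a tracker localizes the target from the description plus the current frame) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: `track_sequence`, `describe`, `locate`, and the fixed `refresh_every` interval are hypothetical names and choices, and the abstract does not say when descriptions are regenerated.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def track_sequence(
    frames: Sequence[object],
    initial_description: str,
    init_box: Box,
    describe: Callable[[object, Box, str], str],
    locate: Callable[[object, str, Box], Box],
    refresh_every: int = 50,
) -> List[Tuple[Box, str]]:
    """Run a description-refreshing tracking loop over a frame sequence.

    describe: hypothetical stand-in for the reasoning model (e.g. a VLM)
              that returns an updated target description.
    locate:   hypothetical stand-in for the tracking backbone plus head
              that returns the target box for the current frame.
    refresh_every: assumed fixed refresh interval (not from the paper).
    """
    description = initial_description
    box = init_box
    results: List[Tuple[Box, str]] = []
    for index, frame in enumerate(frames):
        # Periodically regenerate the description from the latest state.
        if index > 0 and index % refresh_every == 0:
            description = describe(frame, box, description)
        # Localize the target using the current description and frame.
        box = locate(frame, description, box)
        results.append((box, description))
    return results
```

Passing the reasoning model and the tracker in as callables keeps the loop independent of any particular model, so stubs can stand in for both when testing the control flow.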

Problem

Research questions and friction points this paper is trying to address.

Enhance vision-language tracking with reasoning-based natural language generation

Improve model reasoning process and leverage large model advantages

Address limitations in existing vision-language tracking methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pre-trained vision-language model Qwen2.5-VL

Combines supervised fine-tuning (SFT) with GRPO-based reinforcement learning to optimize reasoning and language generation

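Introduces TNLLT, a large-scale long-term vision-language tracking benchmark of 200 video sequences on which 20 baseline trackers are re-trained and evaluated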
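For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization step. The function name, the `eps` term, and the use of the population standard deviation are illustrative choices, and the reward values are placeholders; the summary does not describe the reward design used in the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each response's reward against its own sampling group.

    rewards: one scalar reward per response sampled for the same prompt.
    eps:     small constant (assumed) that avoids division by zero when
             all rewards in the group are identical.
    """
    if not rewards:
        return []
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; some codebases use sample std
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Placeholder rewards for four sampled descriptions of the same target.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.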
