ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language tracking methods suffer from limited flexibility in multimodal feature fusion, suboptimal localization accuracy, insufficient exploitation of large-model capabilities, and a lack of interpretability in their reasoning processes. To address these issues, we propose ReasoningTrack, a long-term vision-language tracking framework that incorporates Chain-of-Thought (CoT) reasoning. It leverages Qwen2.5-VL to generate dynamic target descriptions and optimizes reasoning and language generation via supervised fine-tuning (SFT) and GRPO-based reinforcement learning. A unified tracking backbone takes the embedded language descriptions together with visual features, and a tracking head predicts the target location. We also introduce TNLLT, a large-scale long-term vision-language tracking benchmark of 200 video sequences, on which 20 baseline trackers are re-trained and evaluated. Extensive experiments on multiple vision-language tracking benchmarks validate the effectiveness of the reasoning-based language generation strategy.

📝 Abstract
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse a fixed language description with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking. However, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both supervised fine-tuning (SFT) and reinforcement learning with GRPO are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack
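The data flow described in the abstract (regenerate the language description, feed it with the current frame into the tracker, predict a box) can be sketched roughly as below. This is only an illustrative skeleton, not the authors' implementation: `describe` stands in for the Qwen2.5-VL reasoning and description step, `locate` stands in for the unified backbone plus tracking head, and the fixed `update_every` schedule is an assumption, since the abstract does not say when descriptions are refreshed.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), an assumed box format


def track_with_updated_text(
    frames: Iterable[object],
    init_text: str,
    init_box: Box,
    describe: Callable[[object, Box, str], str],  # hypothetical VLM description step
    locate: Callable[[object, str, Box], Box],    # hypothetical backbone + tracking head
    update_every: int = 10,                       # assumed refresh interval
) -> List[Tuple[Box, str]]:
    """Track a target while periodically refreshing its language description."""
    text, box = init_text, init_box
    results: List[Tuple[Box, str]] = []
    for i, frame in enumerate(frames):
        # Refresh the description on a fixed schedule (never on the first frame).
        if i > 0 and i % update_every == 0:
            text = describe(frame, box, text)
        # Localize the target using the current description and the previous box.
        box = locate(frame, text, box)
        results.append((box, text))
    return results
```

Keeping the two steps as injected callables keeps the loop independent of any particular model, so a stub can replace either one when testing.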
Problem

Research questions and friction points this paper is trying to address.

Enhance vision-language tracking with reasoning-based natural language generation
Expose the model's reasoning process and more fully leverage the advantages of large models
Address the limited performance of methods that fuse a fixed language description with vision features or only modify it using attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pre-trained vision-language model Qwen2.5-VL
Combines SFT with GRPO-based reinforcement learning
Introduces large-scale dataset TNLLT for evaluation
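As background on the GRPO step named above: GRPO scores each sampled output relative to the other outputs sampled for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative normalization follows. The reward values are made up for illustration, the source does not specify the paper's reward design, and the use of the population standard deviation with a small `eps` is an assumption of this sketch.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every sample in the group gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four descriptions sampled for the same frame.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's mean receive a positive advantage and those below it a negative one, so the policy update favors the better descriptions within each group.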
Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei, China
Liye Jin
School of Computer Science and Technology, Anhui University, Hefei, China
Xufeng Lou
School of Computer Science and Technology, Anhui University, Hefei, China
Shiao Wang
Anhui University
Deep Learning
Lan Chen
Communication University of China
Image/Video generation and editing
Bo Jiang
School of Computer Science and Technology, Anhui University, Hefei, China
Zhipeng Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Computer Vision, Object Tracking and Segmentation