ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in complex long-term vision-language tracking, namely (1) the dynamic evolution of target states over time, (2) the persistent misalignment between the initial modality cues and the current target state, and (3) the difficulty of precisely distinguishing target words from context words in textual prompts, this paper proposes ATCTrack, a framework that aligns target-context cues with dynamic target states. Methodologically, ATCTrack introduces: (1) a temporal visual target-context modeling mechanism that supplies the tracker with timely visual cues; (2) precise target-word identification based solely on textual content, together with a context-word calibration module that adaptively weights auxiliary context semantics; and (3) an end-to-end Transformer-based multimodal fusion architecture. Evaluated on mainstream benchmarks, ATCTrack achieves state-of-the-art performance, notably enhancing long-term robustness. The code and models are publicly available.

📝 Abstract
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.
Problem

Research questions and friction points this paper is trying to address.

Aligning visual and textual cues with dynamic target states for robust tracking
Identifying target words in diverse textual expressions for accurate tracking
Modeling target-context features to handle complex long-term video sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal visual target-context modeling that provides timely visual cues
Precise target-word identification based solely on textual content
Adaptive context-word calibration to exploit auxiliary context words
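The context-word calibration idea can be illustrated with a minimal, hypothetical sketch. The function name, the per-word "target-likeness" scores, and the softmax weighting below are illustrative assumptions for intuition only, not the paper's actual implementation:

```python
import numpy as np

def calibrate_context_words(token_embs, target_scores, tau=1.0):
    """Hypothetical sketch: weight each word of the prompt by how
    target-relevant it is, then aggregate a single text cue.

    token_embs:    (N, D) word embeddings from the text prompt
    target_scores: (N,) scores in [0, 1], higher = more target-like
    tau:           softmax temperature controlling weight sharpness
    """
    # Softmax over temperature-scaled scores -> calibration weights
    w = np.exp(target_scores / tau)
    w = w / w.sum()
    # Text cue = weighted sum of word embeddings (context words
    # contribute, but down-weighted relative to target words)
    cue = (w[:, None] * token_embs).sum(axis=0)
    return w, cue

# Toy prompt: "white dog running on grass" (5 tokens, 4-dim embeddings)
rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 4))
scores = np.array([0.9, 0.95, 0.3, 0.1, 0.4])  # "white dog" = target words
weights, text_cue = calibrate_context_words(embs, scores)
print(weights.round(3), text_cue.shape)
```

The design point the sketch conveys: rather than discarding context words or treating all words uniformly, the calibration keeps context words in the cue but scales their influence adaptively.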
Xiaokun Feng
Institute of Automation, Chinese Academy of Sciences
Computer Vision; Deep Learning
Shiyu Hu
Research Fellow, Nanyang Technological University (NTU)
Computer Vision; Data-centric AI; AI for Science
Xuchen Li
School of Artificial Intelligence, UCAS; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA; ZGCA
Dailing Zhang
School of Artificial Intelligence, UCAS; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA
Meiqi Wu
University of Chinese Academy of Sciences
Computer Vision
Jing Zhang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA
Xiaotang Chen
School of Artificial Intelligence, UCAS; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA
Kaiqi Huang
School of Artificial Intelligence, UCAS; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA