🤖 AI Summary
This work addresses the challenge of vision-language object tracking in videos without bounding box annotations, relying solely on natural language descriptions. To this end, the authors propose a self-supervised tracking approach centered on a dynamic token aggregation module. This module differentially weights visual tokens based on attention scores, selectively aggregates target-relevant tokens, and fuses them with language tokens to enhance semantic alignment and temporal consistency. Notably, the method operates without any manual annotations and outperforms existing self-supervised approaches across multiple vision-language tracking benchmarks, demonstrating the feasibility of effective instance-level tracking guided purely by linguistic cues.
📝 Abstract
How can vision-language (VL) tracking be achieved from natural language descriptions of a video sequence \textbf{without relying on any bounding-box ground truth}? In this work, we address this question by tackling \textit{self-supervised VL tracking}, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \textbf{\tracker}, a novel self-supervised VL tracker capable of tracking any object referred to by a language description. Unlike traditional methods that fuse all language and visual tokens equally, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token \textbf{unequally}. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) The fused language tokens then serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This modeling approach enables effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that {\tracker} surpasses state-of-the-art self-supervised methods.
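The three steps of the Dynamic Token Aggregation Module can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the anchor token is chosen, which similarity function is used, or how visual tokens are fused into the language tokens. The sketch assumes scaled dot-product similarity, top-k selection, softmax-weighted merging, and additive fusion, and every function name in it is hypothetical. It uses plain Python lists to stay self-contained; a real tracker would operate on batched tensors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_target_tokens(anchor, template_tokens, k):
    """Step (i): keep the k template tokens most similar to the anchor.

    Assumption: similarity is a scaled dot product with the anchor token.
    Returns the selected indices and their scores.
    """
    scale = math.sqrt(len(anchor))
    scores = [dot(anchor, tok) / scale for tok in template_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return top, [scores[i] for i in top]


def merge_tokens(tokens, scores):
    """Step (ii), part 1: merge selected tokens, weighted by softmax(scores)."""
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def fuse_into_language(language_tokens, merged):
    """Step (ii), part 2: aggregate the merged visual token into each language token.

    Assumption: fusion is a simple element-wise addition.
    """
    return [[l + m for l, m in zip(tok, merged)] for tok in language_tokens]


def extract_search_tokens(fused_language, search_tokens, k):
    """Step (iii): pick the k search-frame tokens best matched by any fused language token."""
    scores = [max(dot(q, s) for q in fused_language) for s in search_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy example with 2-dimensional tokens (values are illustrative only).
template = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
anchor = [1.0, 0.0]
language = [[0.0, 0.0], [0.5, 0.0]]
search = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

selected, selected_scores = select_target_tokens(anchor, template, k=2)
merged = merge_tokens([template[i] for i in selected], selected_scores)
fused = fuse_into_language(language, merged)
candidates = extract_search_tokens(fused, search, k=1)
```

In this sketch the candidate indices returned for the search frame would be carried forward as the temporal prompt for the next frame. How that propagation is actually performed is not described in the abstract, so it is left out here.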