🤖 AI Summary
This work addresses the challenge that existing self-supervised tracking methods struggle to effectively model contextual information from unlabeled videos due to the absence of semantic guidance, leading to unreliable contextual cues. To overcome this limitation, the authors propose a dual-modal contextual association mechanism. During early training stages, fine-grained instance-level semantic prompts are introduced to guide the forward and backward tracking branches in acquiring fundamental tracking knowledge. Subsequently, contextual noise-perturbed features are progressively injected following an easy-to-hard learning strategy, thereby enhancing the model’s robust representational capacity in complex feature spaces. Notably, this mechanism is employed solely during training and imposes no overhead on inference efficiency. Extensive experiments demonstrate that the proposed approach significantly improves both performance and robustness of self-supervised tracking across multiple benchmarks.
📝 Abstract
Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, and existing context association methods based on non-semantic queries adapt poorly to unlabeled tracking scenarios, which makes it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \textbf{\tracker}, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adhering to the easy-to-hard learning principle, our context association mechanism operates in two stages. During early training, instance patch tokens (prompts) are assigned to both the forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb its features, encouraging the tracker to learn robust tracking representations in a more complex feature space. This mechanism thus enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, and because it is applied exclusively during training, inference remains efficient. Extensive experiments demonstrate the superiority of our method.
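The easy-to-hard noise injection described in the abstract can be illustrated with a short sketch. The abstract does not specify the schedule or the noise distribution, so the linear ramp, the Gaussian noise, and every name below (`noise_scale`, `perturb`, `warmup_steps`, `max_scale`) are assumptions made for illustration only, not the authors' implementation.

```python
import random


def noise_scale(step, total_steps, warmup_steps, max_scale):
    """Hypothetical easy-to-hard schedule (an assumption, not from the paper).

    Returns 0.0 during the prompt-guided warm-up stage, then ramps
    linearly up to max_scale by the end of training.
    """
    if step < warmup_steps:
        return 0.0
    ramp = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_scale * min(1.0, ramp)


def perturb(features, scale, rng):
    """Add zero-mean Gaussian noise to a feature vector (training only).

    With scale == 0.0 the features are returned unchanged, which is also
    the behaviour assumed at inference time.
    """
    if scale == 0.0:
        return list(features)
    return [f + rng.gauss(0.0, scale) for f in features]


# Illustrative values: 100 training steps, with the first 30 noise-free.
rng = random.Random(0)
features = [0.5, -1.0, 2.0]
for step in (0, 30, 65, 100):
    scale = noise_scale(step, total_steps=100, warmup_steps=30, max_scale=0.1)
    noisy = perturb(features, scale, rng)
    print(step, scale, noisy)
```

Because the perturbation is gated on the training-time scale, skipping `perturb` at inference leaves the forward pass unchanged, which matches the abstract's claim that the mechanism adds no inference cost.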