🤖 AI Summary
Visual tracking heavily relies on manually annotated bounding boxes, resulting in small-scale and low-diversity training datasets. To address this limitation, we propose a decoupled spatiotemporal consistency self-supervised learning framework that jointly models global spatial localization and local temporal association without requiring any bounding-box annotations, thereby enabling robust target representation learning. Our approach introduces, for the first time, an instance-level contrastive loss and multi-view instance consistency constraints to generate strong supervisory signals. It integrates self-supervised learning, contrastive learning, and spatiotemporal feature decoupling into a unified training paradigm. Extensive experiments demonstrate state-of-the-art performance across nine major benchmarks. Specifically, our method achieves absolute AUC (AO) improvements of 25.3%, 20.4%, and 14.8% over prior unsupervised/self-supervised methods on GOT10K, LaSOT, and TrackingNet, respectively—significantly advancing label-free visual tracking.
📝 Abstract
The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel self-supervised tracking framework named **SSTrack**, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables SSTrack to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses SOTA self-supervised tracking methods, achieving improvements of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, and TrackingNet datasets, respectively. Code: https://github.com/GXNU-ZhongLab/SSTrack.
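The abstract does not give the exact form of the instance contrastive loss, but instance-level correspondence across two views is commonly trained with an InfoNCE-style objective: each instance's embedding under one view should match its own embedding under the other view, with all other instances in the batch acting as negatives. The sketch below is an assumption of that standard formulation (the function name `info_nce`, temperature value, and toy data are illustrative, not from the paper):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style instance contrastive loss between two views.

    z_a, z_b: (N, D) L2-normalized embeddings of the same N instances
    under two views; row i of z_a corresponds to row i of z_b.
    """
    # Cross-view cosine similarity between every pair of instances.
    logits = z_a @ z_b.T / temperature                     # (N, N)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    # Log-softmax over each row: the positive sits on the diagonal,
    # all other instances in the batch serve as negatives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy example: 4 instances, 8-dim embeddings; the second view is a
# slightly perturbed (re-normalized) copy of the first.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
z /= np.linalg.norm(z, axis=1, keepdims=True)
z2 = z + 0.05 * rng.normal(size=z.shape)
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
print(info_nce(z, z2))
```

Minimizing this loss pulls the two views of each instance together while pushing different instances apart, which supplies supervision without any box labels; the actual loss used by SSTrack may differ in detail.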