🤖 AI Summary
To address the challenges of motion modeling and temporal feature inconsistency in video dense self-supervised learning—caused by object deformation, occlusion, and camera motion—this paper proposes a motion-trajectory-guided dense representation learning framework. Methodologically, it introduces optimal transport into a momentum-encoder pipeline, propagating cluster assignments along long-range point trajectories to enforce cross-frame spatiotemporal consistency; it also uses motion trajectories themselves as implicit supervision signals to drive pixel- and patch-level dense learning. Technically, the framework integrates off-the-shelf point trackers, a momentum encoder, optimal transport optimization, and initialization from image-pretrained models. Evaluated on six datasets across four benchmarks, the approach surpasses state-of-the-art methods by 1–6%, demonstrating improved generalization and robustness in dynamic scenes and under heavy occlusion.
📝 Abstract
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle because they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. By integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve the state of the art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
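To make the clustering-and-propagation idea concrete, here is a minimal NumPy sketch of the two steps the abstract describes: computing soft cluster assignments for tracked-point features via a Sinkhorn-Knopp optimal-transport step, then propagating each track's assignment to every frame as a temporally consistent target. All specifics here are illustrative assumptions, not the paper's implementation: the prototype count `K`, the temperature `eps`, the uniform-marginal constraint, and the choice of a single anchor frame are placeholders; the actual MoSiC training code is in the linked repository.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp: turn an (N, K) point-prototype similarity matrix into
    soft assignments whose marginals are (approximately) uniform, so no
    cluster collapses. Returns rows that each sum to 1."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True) * N  # points spread mass evenly
        Q /= Q.sum(axis=0, keepdims=True) * K  # clusters receive equal mass
    return Q / Q.sum(axis=1, keepdims=True)    # per-point soft assignment

rng = np.random.default_rng(0)
N, D, K, T = 32, 16, 4, 5  # tracked points, feature dim, prototypes, frames

# Hypothetical cluster prototypes and anchor-frame features of N tracked
# points (in the paper these would come from the momentum encoder).
prototypes = rng.normal(size=(K, D))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
anchor_feats = rng.normal(size=(N, D))
anchor_feats /= np.linalg.norm(anchor_feats, axis=1, keepdims=True)

# Step 1: optimal-transport soft assignment for the anchor frame.
Q = sinkhorn(anchor_feats @ prototypes.T)

# Step 2: propagate the assignment along each track — every frame's tracked
# points inherit the anchor assignment, giving a cross-frame target that
# enforces temporal feature consistency despite viewpoint changes.
targets = np.repeat(Q[None], T, axis=0)  # shape (T, N, K)
```

In a real training loop the `targets` would supervise the online encoder's predictions at each tracked location in every frame, so the same physical point is pulled toward the same cluster across time.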