🤖 AI Summary
Unsupervised tracking of visually similar objects remains a fundamental challenge in computer vision: when appearance cues are ambiguous, existing self-supervised methods exhibit limited robustness. This paper introduces the first unsupervised motion representation learning framework to exploit the motion-appearance disentanglement that arises naturally in the high-noise regime of pre-trained video diffusion models. Our method extracts motion-aware features from early denoising stages to model object trajectories and integrates self-supervised matching for fine-grained instance discrimination, all without human annotations or architectural modifications. By exploiting this inherent structure of diffusion models, our approach achieves up to a 6.0 percentage point improvement over state-of-the-art self-supervised trackers on standard benchmarks. Moreover, it demonstrates superior generalization and intra-class discriminability on a newly constructed benchmark for tracking highly similar objects.
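The extraction step described above can be pictured with a short sketch: noise a clip to an early (high-noise) timestep, run a single denoising forward pass, and keep an intermediate activation as the motion-aware feature map. This is a minimal illustration under assumptions, not the paper's implementation: the `unet`/`scheduler` interface mirrors diffusers-style APIs, and the hooked block and noise level (`t_frac`) are illustrative choices.

```python
import torch

@torch.no_grad()
def motion_features(unet, scheduler, clip, t_frac=0.9):
    """clip: (B, C, T, H, W) video in the model's input space."""
    # Pick an early (high-noise) timestep: with appearance largely destroyed
    # by noise, the structure surviving in the activations is motion-dominated.
    t_val = int(t_frac * scheduler.config.num_train_timesteps)
    t = torch.full((clip.shape[0],), t_val, dtype=torch.long, device=clip.device)
    noisy = scheduler.add_noise(clip, torch.randn_like(clip), t)

    feats = {}
    # Hook an intermediate block; which block works best is an assumption
    # here and would be chosen empirically in practice.
    handle = unet.mid_block.register_forward_hook(
        lambda mod, inp, out: feats.update(mid=out))
    unet(noisy, t)  # one forward pass (conditioning inputs omitted for brevity)
    handle.remove()
    return feats["mid"]  # (B, C', T, H', W') motion-aware feature map
```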
📝 Abstract
Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Supervised trackers show promise but depend on extensive labeled data, which limits their scalability and generalization; contemporary self-supervised trackers, in turn, struggle when visual cues become ambiguous. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and on our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.
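To make the matching step concrete, self-supervised trackers are commonly evaluated by propagating instance labels between frames through a softmax affinity over the extracted features. The sketch below follows that generic label-propagation protocol; the temperature `tau` and `topk` values are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_src, feat_tgt, labels_src, tau=0.07, topk=10):
    """feat_*: (C, H, W) per-frame features; labels_src: (K, H, W) soft masks."""
    C, H, W = feat_src.shape
    src = F.normalize(feat_src.reshape(C, -1), dim=0)  # (C, HW_src)
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)  # (C, HW_tgt)
    aff = (tgt.t() @ src) / tau                        # (HW_tgt, HW_src)
    # Keep only the top-k source matches per target pixel for sharper,
    # instance-discriminative assignments.
    val, idx = aff.topk(topk, dim=1)
    w = F.softmax(val, dim=1)                          # (HW_tgt, k)
    lab = labels_src.reshape(labels_src.shape[0], -1)  # (K, HW_src)
    out = (lab[:, idx] * w.unsqueeze(0)).sum(-1)       # (K, HW_tgt)
    return out.reshape(-1, H, W)
```

Because the affinity here would be computed on motion-aware rather than appearance features, two identical-looking objects that move differently receive distinct matches, which is the behavior the paper's visualizations point to.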