M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision foundation models, pretrained on static images, struggle to capture the temporal correspondences of dense points in videos. This work proposes Mask-to-Point (M2P), a weakly supervised learning approach that, for the first time, converts video object segmentation masks into point-level weak supervision signals. The method introduces three mask-driven constraints: local structural consistency via Procrustes analysis, mask-label consistency regularization, and explicit boundary-point supervision. Requiring only 3.6K weakly annotated videos for efficient training, M2P significantly outperforms DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark by 12.8% and 14.6%, respectively, and serves as a versatile backbone compatible with diverse point tracking paradigms.

📝 Abstract
Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline fine-tuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, enabling more reliable point-to-point matching. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The MLC loss acts as a regularizer that stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models serve as pre-trained backbones for both test-time-optimized and offline fine-tuned TAP tasks, demonstrating their potential as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.
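The local structure consistency loss rests on Procrustes analysis: points within a small rigid neighborhood should move cohesively, so the residual after the best rigid alignment between a local point set and its matched counterpart measures how consistent the matches are. The sketch below is not the paper's implementation; it is a minimal numpy illustration of that residual using the standard Kabsch/orthogonal-Procrustes solution, with `src` and `tgt` as hypothetical (N, 2) arrays of matched point coordinates in two frames.

```python
import numpy as np

def procrustes_residual(src, tgt):
    """Residual of the best rigid (rotation + translation) alignment
    between two matched point sets of shape (N, 2).

    A low residual means the matches are consistent with one cohesive
    rigid motion of the local structure; a high residual flags
    unreliable point-to-point matches.
    """
    # Remove translation by centering both point sets
    src_c = src - src.mean(axis=0)
    tgt_c = tgt - tgt.mean(axis=0)
    # SVD of the cross-covariance gives the optimal rotation (Kabsch)
    U, _, Vt = np.linalg.svd(src_c.T @ tgt_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    aligned = src_c @ R.T
    # Mean distance between aligned source points and target points
    return np.mean(np.linalg.norm(aligned - tgt_c, axis=1))
```

For a purely rigid motion the residual is (numerically) zero, while non-rigid or mismatched correspondences yield a positive residual that a training objective could penalize.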
Problem

Research questions and friction points this paper is trying to address.

dense point tracking
visual foundation models
temporal correspondence
video understanding
weakly-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-to-Point
Weakly-Supervised Learning
Dense Point Tracking
Vision Foundation Models
Temporal Correspondence