M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision foundation models, pretrained on static images, struggle to capture the temporal correspondences of dense points in videos. This work proposes Mask-to-Point (M2P), a weakly supervised learning approach that, for the first time, converts video object segmentation masks into point-level weak supervision signals. The method introduces three mask-driven constraints: local structural consistency via Procrustes analysis, mask-label consistency regularization, and explicit boundary-point supervision. Requiring only 3.6K weakly annotated videos for efficient training, M2P significantly outperforms DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark by 12.8% and 14.6%, respectively, and serves as a versatile backbone compatible with diverse point tracking paradigms.

📝 Abstract
Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline fine-tuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, enabling more reliable point-to-point matching. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The MLC loss acts as a regularizer that stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models serve as pre-trained backbones for both test-time-optimized and offline fine-tuned TAP tasks, demonstrating their potential as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.
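The local structure consistency loss rests on Procrustes analysis: points within a small rigid neighborhood should move cohesively, so the residual after the best rigid alignment between a local point set and its matched counterpart measures how consistent the matches are. The sketch below is not the paper's implementation; it is a minimal numpy illustration of that residual using the standard Kabsch/orthogonal-Procrustes solution, with `src` and `tgt` as hypothetical (N, 2) arrays of matched point coordinates in two frames.

```python
import numpy as np

def procrustes_residual(src, tgt):
    """Residual of the best rigid (rotation + translation) alignment
    between two matched point sets of shape (N, 2).

    A low residual means the matches are consistent with one cohesive
    rigid motion of the local structure; a high residual flags
    unreliable point-to-point matches.
    """
    # Remove translation by centering both point sets
    src_c = src - src.mean(axis=0)
    tgt_c = tgt - tgt.mean(axis=0)
    # SVD of the cross-covariance gives the optimal rotation (Kabsch)
    U, _, Vt = np.linalg.svd(src_c.T @ tgt_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    aligned = src_c @ R.T
    # Mean distance between aligned source points and target points
    return np.mean(np.linalg.norm(aligned - tgt_c, axis=1))
```

For a purely rigid motion the residual is (numerically) zero, while non-rigid or mismatched correspondences yield a positive residual that a training objective could penalize.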
Problem

Research questions and friction points this paper is trying to address.

dense point tracking
visual foundation models
temporal correspondence
video understanding
weakly-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-to-Point
Weakly-Supervised Learning
Dense Point Tracking
Vision Foundation Models
Temporal Correspondence