🤖 AI Summary
PointSt3R addresses point tracking in scenes that mix static and dynamic content, a setting where conventional trackers depend on optical flow or recurrent temporal modeling. Methodologically, it repurposes a 3D reconstruction model (MASt3R) for point tracking by introducing *3D-grounded correspondence modeling*: it leverages the model's intrinsic 2D–3D correspondence capability, augments it with dynamic correspondence supervision and a visibility prediction head, and trains exclusively on frame pairs. Its key contribution is eliminating reliance on optical flow and recurrent temporal modeling, enabling single-stage, geometrically consistent cross-frame point matching. Experiments show significant improvements over CoTracker3 on the EgoPoints and RGB-S benchmarks, with comparable performance on TAP-Vid-DAVIS. These results indicate that 3D geometric priors substantially improve point tracking robustness and accuracy, establishing a lightweight, geometry-aware paradigm for visual tracking.
📝 Abstract
Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential for 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them to the task of point tracking through 3D-grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on the static points present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We then propose to combine the reconstruction loss with training for dynamic correspondence and a visibility head, fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we train and evaluate only on pairs of frames where one frame contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS: 73.8 $\delta_{avg}$ / 85.8% occlusion acc. for PointSt3R vs. 75.7 / 88.3% for CoTracker2; significantly outperforming CoTracker3 on EgoPoints, 61.3 vs. 54.2, and RGB-S, 87.0 vs. 82.8). We also present results on 3D point tracking, along with several ablations on training datasets and the percentage of dynamic correspondences.
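To make the pair-wise setup concrete, here is a minimal numpy sketch of the core idea: a query point in one frame is matched into a second frame purely by dense-descriptor similarity, with no temporal context, and visibility is decided from the match score. The function name, the descriptor maps, and the threshold-based visibility proxy are all illustrative assumptions; the paper's actual model uses MASt3R's learned features and a trained visibility head, not a fixed threshold.

```python
import numpy as np

def track_point_pairwise(desc_q, desc_t, query_xy, vis_threshold=0.6):
    """Match a query point from frame A into frame B by descriptor similarity.

    desc_q, desc_t: (H, W, D) L2-normalised per-pixel descriptor maps for the
        query and target frames (stand-ins for a model's dense features).
    query_xy: (x, y) pixel location of the query point in frame A.
    vis_threshold: crude stand-in for a learned visibility head -- the point
        is declared visible when the best match score clears this value.
    Returns ((x, y) in frame B, visible flag, match score).
    """
    x, y = query_xy
    q = desc_q[y, x]                        # (D,) descriptor of the query point
    sims = desc_t @ q                       # (H, W) cosine similarity map
    iy, ix = np.unravel_index(np.argmax(sims), sims.shape)
    score = float(sims[iy, ix])
    visible = score >= vis_threshold
    return (int(ix), int(iy)), bool(visible), score
```

Because every prediction depends only on the single frame pair, occlusion in the target frame simply shows up as a low best-match score, which is what the visibility decision keys on in this sketch.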