🤖 AI Summary
To address the low 3D point tracking accuracy in monocular video and the suboptimal joint optimization of depth and pose in modular pipelines, this paper proposes the first end-to-end feed-forward framework that jointly models point tracking, monocular depth estimation, and camera pose estimation. Its key innovation is a geometric decomposition of 3D motion in the world coordinate system into three differentiable components: scene geometry, camera ego-motion, and pixel-wise object motion. This formulation enables fully differentiable training across multiple data sources: synthetic data, posed RGB-D videos, and unlabeled in-the-wild monocular sequences. Experiments demonstrate that the method improves 3D tracking accuracy by 30% over state-of-the-art methods, matches the reconstruction quality of leading dynamic 3D reconstruction approaches, and runs inference 50× faster. Moreover, it exhibits significantly stronger generalization and practical applicability.
📝 Abstract
We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing, feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50× faster.
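The decomposition described in the abstract can be illustrated with a toy pinhole-camera sketch (not the authors' implementation; intrinsics, depths, and poses below are made-up values): a tracked pixel's world-space motion between two frames splits into a camera ego-motion term and a residual pixel-wise object-motion term, with scene geometry (depth) supplying the lift from 2D to 3D.

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift pixel (u, v) with depth d to a 3D point in camera coordinates."""
    u, v = uv
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def world_point(uv, depth, K, R_cw, t_cw):
    """Map a camera-space point to world coordinates via the camera-to-world pose."""
    return R_cw @ unproject(uv, depth, K) + t_cw

# Toy intrinsics and poses (hypothetical numbers, for illustration only).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R0, t0 = np.eye(3), np.zeros(3)                  # pose at frame t
R1, t1 = np.eye(3), np.array([0.1, 0.0, 0.0])    # pose at frame t+1

p_t  = world_point((320.0, 240.0), 2.0, K, R0, t0)  # tracked point at frame t
p_t1 = world_point((300.0, 240.0), 2.1, K, R1, t1)  # same point at frame t+1

# Ego-motion term: where the frame-t point would land with only the camera moving.
camera_only = world_point((320.0, 240.0), 2.0, K, R1, t1)
ego_motion    = camera_only - p_t      # motion explained by camera pose change
object_motion = p_t1 - camera_only     # residual pixel-wise object motion
total_motion  = p_t1 - p_t

# The world-space motion decomposes exactly into the two terms:
assert np.allclose(total_motion, ego_motion + object_motion)
```

The paper's point is that each term (depth, pose, residual motion) is predicted by a differentiable module, so the whole decomposition can be trained end to end; here the terms are simply computed from given values.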