🤖 AI Summary
To address the low 3D point tracking accuracy in monocular video and the suboptimal joint optimization of depth and pose in modular pipelines, this paper proposes the first end-to-end feed-forward framework that jointly models point tracking, monocular depth estimation, and camera pose estimation. Its key innovation is a geometric decomposition of 3D motion in the world coordinate system into three differentiable components: scene geometry, camera ego-motion, and pixel-wise object motion. This formulation enables fully differentiable training across multiple data sources: synthetic data, posed RGB-D videos, and unlabeled in-the-wild monocular sequences. Experiments demonstrate that the method improves 3D tracking accuracy by 30% over state-of-the-art methods, matches the reconstruction quality of leading dynamic 3D reconstruction approaches, and runs inference 50× faster. Moreover, it exhibits significantly stronger generalization and practical applicability.
📝 Abstract
We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing, feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50× faster.
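The decomposition described in the abstract can be illustrated with a toy pinhole-camera sketch (not the authors' implementation; intrinsics, depths, and poses below are made-up values): a tracked pixel's world-space motion between two frames splits into a camera ego-motion term and a residual pixel-wise object-motion term, with scene geometry (depth) supplying the lift from 2D to 3D.

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift pixel (u, v) with depth d to a 3D point in camera coordinates."""
    u, v = uv
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def world_point(uv, depth, K, R_cw, t_cw):
    """Map a camera-space point to world coordinates via the camera-to-world pose."""
    return R_cw @ unproject(uv, depth, K) + t_cw

# Toy intrinsics and poses (hypothetical numbers, for illustration only).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R0, t0 = np.eye(3), np.zeros(3)                  # pose at frame t
R1, t1 = np.eye(3), np.array([0.1, 0.0, 0.0])    # pose at frame t+1

p_t  = world_point((320.0, 240.0), 2.0, K, R0, t0)  # tracked point at frame t
p_t1 = world_point((300.0, 240.0), 2.1, K, R1, t1)  # same point at frame t+1

# Ego-motion term: where the frame-t point would land with only the camera moving.
camera_only = world_point((320.0, 240.0), 2.0, K, R1, t1)
ego_motion    = camera_only - p_t      # motion explained by camera pose change
object_motion = p_t1 - camera_only     # residual pixel-wise object motion
total_motion  = p_t1 - p_t

# The world-space motion decomposes exactly into the two terms:
assert np.allclose(total_motion, ego_motion + object_motion)
```

The paper's point is that each term (depth, pose, residual motion) is predicted by a differentiable module, so the whole decomposition can be trained end to end; here the terms are simply computed from given values.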