TAPIP3D: Tracking Any Point in Persistent 3D Geometry

📅 2025-04-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of long-term 3D point tracking in monocular RGB/RGB-D videos, where conventional methods suffer from camera-motion-induced drift and instability. The authors propose a truly 3D-native tracking paradigm. The core methodology comprises: (1) a camera motion compensation mechanism that lifts 2D features into a drift-free world coordinate system, yielding stable spatiotemporal 3D feature clouds; (2) Local Pair Attention, which models irregular 3D point neighborhoods and replaces standard 2D correlation windows; and (3) joint integration of depth-guided feature enhancement, camera-motion-decoupled representation learning, and iterative 3D motion refinement. The approach achieves significant improvements over state-of-the-art methods across multiple 3D tracking benchmarks. Notably, when accurate depth is available, its 2D projection accuracy surpasses that of leading 2D trackers, demonstrating that camera motion compensation systematically improves both robustness and precision.
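The first component, lifting 2D features into a world coordinate frame, can be sketched as standard pinhole unprojection followed by a camera-to-world transform. This is a minimal illustration of the general idea, not the paper's implementation; the function and parameter names (`lift_to_world`, `K`, `cam_to_world`) are hypothetical.

```python
import numpy as np

def lift_to_world(depth, K, cam_to_world):
    """Unproject a depth map into a shared world frame.

    depth: (H, W) per-pixel depth; K: (3, 3) camera intrinsics;
    cam_to_world: (4, 4) camera-to-world pose for this frame.
    Returns (H, W, 3) world-space points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project pixels to camera coordinates via the pinhole model.
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    # Apply the camera pose so points from every frame land in one
    # world frame, canceling camera motion across time.
    pts_world = pts_cam.reshape(-1, 4) @ cam_to_world.T
    return pts_world[:, :3].reshape(H, W, 3)
```

Attaching per-pixel features to these world-space points yields the stable spatiotemporal feature clouds the summary describes.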

📝 Abstract
We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera motion is effectively canceled. TAPIP3D iteratively refines multi-frame 3D motion estimates within this stabilized representation, enabling robust tracking over extended periods. To manage the inherent irregularities of 3D point distributions, we propose a Local Pair Attention mechanism. This 3D contextualization strategy effectively exploits spatial relationships in 3D, forming informative feature neighborhoods for precise 3D trajectory estimation. Our 3D-centric approach significantly outperforms existing 3D point tracking methods and even enhances 2D tracking accuracy compared to conventional 2D pixel trackers when accurate depth is available. It supports inference in both camera coordinates (i.e., unstabilized) and world coordinates, and our results demonstrate that compensating for camera motion improves tracking performance. Our approach replaces the conventional 2D square correlation neighborhoods used in prior 2D and 3D trackers, leading to more robust and accurate results across various 3D point tracking benchmarks. Project Page: https://tapip3d.github.io
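The abstract's iterative refinement of multi-frame 3D motion estimates can be pictured as a loop that repeatedly predicts residual 3D offsets for the current track estimates. The sketch below only shows the control flow under that assumption; `predict_offsets` stands in for the paper's learned update module and is hypothetical.

```python
import numpy as np

def refine_tracks(init_pts, predict_offsets, num_iters=4):
    """Iteratively refine 3D trajectories.

    init_pts: (T, N, 3) initial 3D positions of N tracked points over
    T frames, expressed in the stabilized world frame.
    predict_offsets: callable mapping current estimates to (T, N, 3)
    residual corrections (a stand-in for the learned update module).
    """
    pts = init_pts.copy()
    for _ in range(num_iters):
        # Each iteration nudges all frames' estimates jointly, so late
        # frames benefit from corrections made to early ones.
        pts = pts + predict_offsets(pts)
    return pts
```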
Problem

Research questions and friction points this paper is trying to address.

Long-term 3D point tracking in monocular videos
Robust 3D trajectory estimation with irregular point distributions
Improving tracking accuracy by compensating camera motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Camera-stabilized spatio-temporal feature clouds
Local Pair Attention mechanism
3D contextualization strategy
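The Local Pair Attention idea listed above, attending over an irregular 3D neighborhood instead of a fixed 2D square window, can be illustrated with a simple k-nearest-neighbor attention step. This is a conceptual sketch, not the paper's architecture; all names are hypothetical.

```python
import numpy as np

def local_pair_attention(query_pt, query_feat, points, feats, k=8):
    """Aggregate features over a query point's k nearest 3D neighbors.

    query_pt: (3,) query position; query_feat: (C,) query feature;
    points: (N, 3) candidate positions; feats: (N, C) their features.
    Returns a (C,) neighborhood-contextualized feature.
    """
    # Neighborhoods are defined by 3D proximity, not a 2D pixel grid,
    # so they adapt to irregular point distributions.
    d2 = np.sum((points - query_pt) ** 2, axis=1)
    idx = np.argsort(d2)[:k]
    neigh_feats = feats[idx]  # (k, C)
    # Scaled dot-product attention weights over the neighborhood.
    logits = neigh_feats @ query_feat / np.sqrt(query_feat.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ neigh_feats
```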