🤖 AI Summary
Existing monocular 3D point tracking methods are not robust to depth ambiguity and occlusion, while state-of-the-art multi-view approaches require impractically large camera setups (>20 views) and computationally expensive per-sequence optimization. To address these limitations, this paper proposes the first data-driven, lightweight multi-view 3D point tracker. The method operates with only four calibrated cameras and jointly learns point-cloud feature fusion and temporal correspondence via a unified Transformer-based update mechanism, enabling long-range cross-view 3D matching. A k-nearest-neighbors correlation module further enhances multi-view feature consistency. Extensive evaluation demonstrates high accuracy and strong generalization: median 3D trajectory errors of 3.1 cm on Panoptic Studio and 2.0 cm on DexYCB, alongside competitive performance on synthetic Kubric data. The framework supports online tracking across 1–8 views and 24–150 frames, significantly reducing hardware requirements and optimization overhead compared to prior work.
📝 Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
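The core geometric step described above — fusing multi-view features into a single point cloud and computing k-nearest-neighbors correlation for each tracked point — can be sketched roughly as follows. This is a simplified NumPy illustration under assumed shapes and a hypothetical `knn_correlation` helper, not the paper's actual implementation:

```python
import numpy as np

def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=4):
    """For each tracked query point, find its k nearest neighbors in the
    fused multi-view point cloud and return dot-product feature
    correlations (a stand-in for the paper's correlation module).

    query_xyz:  (Q, 3) 3D positions of tracked points
    query_feat: (Q, C) features of tracked points
    cloud_xyz:  (N, 3) fused point-cloud positions
    cloud_feat: (N, C) fused point-cloud features
    """
    # Pairwise Euclidean distances: (Q, N)
    dists = np.linalg.norm(query_xyz[:, None, :] - cloud_xyz[None, :, :], axis=-1)
    # Indices of the k closest cloud points per query: (Q, k)
    knn_idx = np.argsort(dists, axis=1)[:, :k]
    # Correlate each query feature with its neighbors' features: (Q, k)
    corr = np.einsum('qc,qkc->qk', query_feat, cloud_feat[knn_idx])
    return knn_idx, corr

# Toy fused point cloud (e.g., back-projected from four calibrated views)
rng = np.random.default_rng(0)
cloud_xyz = rng.normal(size=(100, 3))
cloud_feat = rng.normal(size=(100, 16))
query_xyz = cloud_xyz[:5] + 0.01   # queries slightly offset from known points
query_feat = cloud_feat[:5]

idx, corr = knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=4)
print(idx.shape, corr.shape)  # (5, 4) (5, 4)
```

In the actual tracker, correlation features like these would be fed into the Transformer-based update operator that iteratively refines the 3D trajectories, even through occlusion; the sketch only shows the local-neighborhood matching idea.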