Multi-View 3D Point Tracking

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular 3D point tracking methods suffer from poor robustness to depth ambiguity and occlusion, while state-of-the-art multi-view approaches require excessive camera setups (>20 views) and computationally expensive sequence-level optimization. To address these limitations, this paper proposes the first data-driven, lightweight multi-view 3D point tracker. Our method operates with only four calibrated cameras and jointly learns point cloud feature fusion and temporal correspondence via a unified Transformer-based update mechanism, enabling long-range cross-view 3D matching. We further introduce a k-nearest-neighbor correlation modeling module to enhance multi-view feature consistency. Extensive evaluation demonstrates high accuracy and strong generalization: median 3D errors of 3.1 cm on Panoptic Studio and 2.0 cm on DexYCB, alongside competitive performance on synthetic Kubric data. The framework supports online tracking across 1–8 views and 24–150 frames, significantly reducing hardware requirements and optimization overhead compared to prior work.
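The fusion step described above starts from known camera poses and per-view depth, lifted into a single world-space point cloud. The sketch below shows standard pinhole unprojection and concatenation across four views; all names, image sizes, and intrinsics are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map into world-space 3D points using intrinsics K (3x3)
    and a 4x4 camera-to-world extrinsic. Plain pinhole unprojection; the
    tracker's learned feature fusion is not reproduced here."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                 # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)           # scale rays by metric depth
    pts_hom = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_hom @ cam_to_world.T)[:, :3]        # transform to world coordinates

# Fuse four toy views into one cloud (identity extrinsics, constant depth).
clouds = []
for _ in range(4):
    K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
    depth = np.full((64, 64), 2.0)                  # every pixel 2 m away
    clouds.append(unproject_depth(depth, K, np.eye(4)))
fused = np.concatenate(clouds, axis=0)              # unified multi-view point cloud
```

In the real setting each view would have its own calibrated extrinsics and sensor-based or estimated depth, and per-point image features would be carried along with the 3D coordinates.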

📝 Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
Problem

Research questions and friction points this paper is trying to address.

Tracking 3D points across multiple camera views
Overcoming depth ambiguity and occlusion in monocular tracking
Enabling robust online tracking with practical camera setups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward model predicts 3D correspondences
Fuses multi-view features into unified point cloud
Uses transformer-based update for long-range tracking
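The k-nearest-neighbors correlation mentioned above can be sketched as follows: for each tracked query point, gather its k spatial neighbors in the fused cloud and correlate their features with the query's feature. This is a brute-force toy version with random features; the paper's module presumably operates on learned features inside the transformer update, and the function and variable names here are illustrative.

```python
import numpy as np

def knn_correlation(query_feats, query_xyz, cloud_feats, cloud_xyz, k=8):
    """For each query point, find its k nearest neighbours in the fused
    point cloud and return feature dot products with them (a simple
    stand-in for the paper's kNN correlation module)."""
    d2 = ((query_xyz[:, None, :] - cloud_xyz[None, :, :]) ** 2).sum(-1)  # (Q, N) squared distances
    idx = np.argsort(d2, axis=1)[:, :k]                                  # (Q, k) neighbour indices
    neigh_feats = cloud_feats[idx]                                       # (Q, k, C) gathered features
    corr = np.einsum('qc,qkc->qk', query_feats, neigh_feats)             # (Q, k) similarities
    return corr, idx

rng = np.random.default_rng(0)
cloud_xyz = rng.normal(size=(500, 3))    # toy fused cloud positions
cloud_feats = rng.normal(size=(500, 32)) # toy per-point features
q_xyz, q_feats = cloud_xyz[:10], cloud_feats[:10]  # queries taken from the cloud
corr, idx = knn_correlation(q_feats, q_xyz, cloud_feats, cloud_xyz, k=8)
```

Since each query here coincides with a cloud point, its nearest neighbor is itself; in the tracker, such correlation features would be fed to the transformer-based update to refine correspondences over time.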