🤖 AI Summary
This work addresses the problem of robust trajectory tracking for arbitrary points in multi-view videos of dynamic scenes. We propose a novel method that jointly leverages geometric camera constraints and cross-view spatio-temporal attention: it explicitly models multi-camera projection geometry and introduces a learnable cross-view attention module to aggregate consistent, viewpoint-invariant features. To enable end-to-end training and rigorous generalization evaluation, we construct a large-scale synthetic training dataset and a real-world benchmark with ground-truth trajectories. Our approach achieves state-of-the-art accuracy and robustness on challenging benchmarks, including MVS1K and DynamicScene, outperforming prior methods by significant margins. Notably, it establishes the first strong baseline for multi-view point tracking, introducing a new paradigm for dynamic scene understanding and 3D motion analysis.
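To make the cross-view aggregation concrete, here is a minimal PyTorch sketch of a cross-view attention block. The module name `CrossViewAttention`, the token layout, and all dimensions are illustrative assumptions rather than MV-TAP's actual implementation; the point is only that each tracked point contributes one feature token per view, and attention mixes those tokens into a viewpoint-consistent feature.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Illustrative sketch: aggregate per-view point features across V views.

    Assumption: MV-TAP's real module may differ in tokenization,
    positional encoding, and normalization.
    """
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B*T, V, C) -- one token per view for each tracked point
        # at each timestep; self-attention mixes information across views.
        attended, _ = self.attn(feats, feats, feats)
        return self.norm(feats + attended)  # residual connection + norm

# Usage: 2 point tracks, 8 frames, 4 views, 256 channels (hypothetical sizes)
x = torch.randn(2 * 8, 4, 256)
y = CrossViewAttention()(x)
assert y.shape == x.shape
```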
📝 Abstract
Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
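As a rough illustration of the camera-geometry side, the sketch below projects a 3D point into a camera view with a standard pinhole model. The function name and the toy intrinsics are assumptions for illustration only; the paper's exact geometric formulation is not given here.

```python
import numpy as np

def project_point(X_world, K, R, t):
    """Pinhole projection of a 3D world point into one camera view.

    X_world: (3,) point in world coordinates
    K: (3,3) intrinsics, R: (3,3) rotation, t: (3,) translation
    Returns (2,) pixel coordinates.
    """
    X_cam = R @ X_world + t   # world frame -> camera frame
    x = K @ X_cam             # camera frame -> homogeneous image coords
    return x[:2] / x[2]       # perspective divide

# Toy example (hypothetical values): camera at the origin, 500 px focal length
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
uv = project_point(np.array([0.1, -0.2, 2.0]), K, np.eye(3), np.zeros(3))
print(uv)  # -> approximately [345., 190.]
```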