🤖 AI Summary
Existing vision systems struggle to track arbitrary 2D points once they leave the field of view of a continuous video. This paper introduces TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points throughout a video, even when those points lie far outside the observed narrow field of view (FOV), together with TAPVid360-10k, the first large-scale benchmark for the task, comprising 10k perspective videos with ground-truth directional point tracking. Rather than relying on dynamic 4D ground-truth scene models, the approach resamples 360° videos into narrow-FOV perspective sequences and computes ground-truth directions by tracking points across the full panorama with a 2D pipeline. The proposed baseline adapts CoTracker v3 to predict per-point rotations that update each direction estimate, and it outperforms existing TAP and TAPVid-3D methods on the new benchmark. This work advances visual systems toward allocentric, panoramic scene understanding that persists beyond image boundaries.
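The dataset is built by resampling 360° panoramas into narrow-FOV perspective views. A minimal sketch of the core geometry, assuming an equirectangular panorama and a pinhole camera looking down +z (the function names and axis conventions here are illustrative, not the paper's implementation):

```python
import numpy as np

def perspective_rays(fov_deg, width, height):
    """Unit ray directions for each pixel of a hypothetical pinhole
    camera looking down +z with the given horizontal FOV."""
    f = 0.5 * width / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    xs = np.arange(width) - (width - 1) / 2
    ys = np.arange(height) - (height - 1) / 2
    x, y = np.meshgrid(xs, ys)  # shapes (height, width)
    rays = np.stack([x, y, np.full_like(x, f)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def rays_to_equirect_uv(rays, pano_w, pano_h):
    """Map unit directions to (u, v) pixel coordinates in an
    equirectangular panorama via longitude/latitude."""
    lon = np.arctan2(rays[..., 0], rays[..., 2])   # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1, 1))  # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * (pano_w - 1)
    v = (lat / np.pi + 0.5) * (pano_h - 1)
    return u, v
```

Sampling the panorama at the returned (u, v) grid (e.g. with bilinear interpolation) yields one synthesized narrow-FOV frame; repeating this per frame with a chosen camera orientation yields a perspective video.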
📝 Abstract
Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when they lie far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground-truth scene models for training. Instead, we exploit 360° videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground-truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k, comprising 10k perspective videos with ground-truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid-3D methods.
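The baseline's output can be viewed as a per-point, per-frame rotation applied to the current direction estimate. A minimal sketch of that update, assuming an axis-angle rotation parameterization (the helper names are illustrative; the paper does not specify this exact interface):

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix from a unit axis
    and an angle in radians."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])  # cross-product matrix
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def update_direction(d, R):
    """Rotate the current unit-direction estimate and renormalize."""
    d_new = R @ d
    return d_new / np.linalg.norm(d_new)
```

For example, a point straight ahead (direction `[0, 0, 1]`) rotated 90° about the y-axis ends up at `[1, 0, 0]`; chaining such per-frame rotations lets the direction estimate keep evolving even while the point is outside the visible frame.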