🤖 AI Summary
Existing vision systems struggle to track arbitrary 2D points once they leave the field of view of a continuous video. This paper introduces TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points throughout a video, even when those points lie far outside the observed narrow field of view (FOV), together with TAPVid360-10k, the first large-scale benchmark for the task, comprising 10k perspective videos with ground-truth directional point tracking. Rather than relying on dynamic 4D ground-truth scene models, the approach resamples 360° videos into narrow-FOV perspective sequences and computes ground-truth directions by tracking points across the full panorama with a 2D pipeline. The proposed baseline adapts CoTracker v3 to predict per-point rotations that update each direction estimate, and it outperforms existing TAP and TAPVid-3D methods on the new benchmark. This work advances visual systems toward allocentric, panoramic scene understanding that persists beyond image boundaries.
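The dataset is built by resampling 360° panoramas into narrow-FOV perspective views. A minimal sketch of the core geometry, assuming an equirectangular panorama and a pinhole camera looking down +z (the function names and axis conventions here are illustrative, not the paper's implementation):

```python
import numpy as np

def perspective_rays(fov_deg, width, height):
    """Unit ray directions for each pixel of a hypothetical pinhole
    camera looking down +z with the given horizontal FOV."""
    f = 0.5 * width / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    xs = np.arange(width) - (width - 1) / 2
    ys = np.arange(height) - (height - 1) / 2
    x, y = np.meshgrid(xs, ys)  # shapes (height, width)
    rays = np.stack([x, y, np.full_like(x, f)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def rays_to_equirect_uv(rays, pano_w, pano_h):
    """Map unit directions to (u, v) pixel coordinates in an
    equirectangular panorama via longitude/latitude."""
    lon = np.arctan2(rays[..., 0], rays[..., 2])   # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1, 1))  # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * (pano_w - 1)
    v = (lat / np.pi + 0.5) * (pano_h - 1)
    return u, v
```

Sampling the panorama at the returned (u, v) grid (e.g. with bilinear interpolation) yields one synthesized narrow-FOV frame; repeating this per frame with a chosen camera orientation yields a perspective video.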
📝 Abstract
Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when they lie far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground-truth scene models for training. Instead, we exploit 360° videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground-truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k, comprising 10k perspective videos with ground-truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid-3D methods.
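The baseline's output can be viewed as a per-point, per-frame rotation applied to the current direction estimate. A minimal sketch of that update, assuming an axis-angle rotation parameterization (the helper names are illustrative; the paper does not specify this exact interface):

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix from a unit axis
    and an angle in radians."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])  # cross-product matrix
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def update_direction(d, R):
    """Rotate the current unit-direction estimate and renormalize."""
    d_new = R @ d
    return d_new / np.linalg.norm(d_new)
```

For example, a point straight ahead (direction `[0, 0, 1]`) rotated 90° about the y-axis ends up at `[1, 0, 0]`; chaining such per-frame rotations lets the direction estimate keep evolving even while the point is outside the visible frame.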