TAPNext: Tracking Any Point (TAP) as Next Token Prediction

πŸ“… 2025-04-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing methods for Tracking Any Point (TAP) in videos suffer from limited generalization and real-time performance because they rely on strong inductive biases and hand-crafted heuristics. To address this, we reformulate TAP as a causal masked token sequence prediction task, the first such formulation. We propose a purely online, streaming Transformer architecture that eliminates temporal window constraints and tracking-specific priors, enabling trajectory generation to emerge naturally through end-to-end training. Our core innovation is redefining point tracking as autoregressive frame-by-frame prediction of discrete masked tokens corresponding to point locations, yielding low-latency and highly generalizable trajectory estimation. Evaluated on the TAP benchmark, our method achieves state-of-the-art performance, outperforming both existing online and offline trackers while significantly reducing inference latency.
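The core idea above, treating each point's location as discrete tokens predicted causally frame by frame, can be illustrated with a minimal sketch. This is not the paper's code: the bin count, the coordinate tokenization scheme, and the toy `StreamingTracker` state update are all illustrative assumptions standing in for the actual Transformer.

```python
# Hypothetical sketch of "tracking as next-token prediction" (assumptions:
# a 64-bin-per-axis coordinate vocabulary and a toy recurrent state update;
# the real model is a streaming Transformer decoding masked tokens).
NUM_BINS = 64  # assumed coordinate vocabulary size per axis

def coord_to_tokens(x, y, width, height, num_bins=NUM_BINS):
    """Discretize a continuous (x, y) location into two vocabulary tokens."""
    tx = min(int(x / width * num_bins), num_bins - 1)
    ty = min(int(y / height * num_bins), num_bins - 1)
    return tx, ty

def tokens_to_coord(tx, ty, width, height, num_bins=NUM_BINS):
    """Map token indices back to the center of their coordinate bin."""
    x = (tx + 0.5) / num_bins * width
    y = (ty + 0.5) / num_bins * height
    return x, y

class StreamingTracker:
    """Toy stand-in for the causal sequence model: it carries state across
    frames and emits one (tx, ty) token pair per frame, so tracking is
    literally next-token prediction, one frame at a time, with no
    temporal window."""

    def __init__(self, init_tokens):
        self.state = init_tokens  # last predicted token pair

    def step(self, predicted_motion_tokens):
        # A real model would attend over frame tokens plus query tokens here;
        # this placeholder just applies a decoded per-frame motion to the
        # previous token pair and clamps to the vocabulary range.
        tx, ty = self.state
        dtx, dty = predicted_motion_tokens
        self.state = (max(0, min(NUM_BINS - 1, tx + dtx)),
                      max(0, min(NUM_BINS - 1, ty + dty)))
        return self.state
```

Usage under these assumptions: initialize the tracker from the query point's tokens, then call `step` once per incoming frame, which is what makes the formulation purely online and low-latency.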

πŸ“ Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves a new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
Problem

Research questions and friction points this paper is trying to address.

TAPNext addresses the challenges of tracking any point in video
It removes complex tracking-specific inductive biases and heuristics
It achieves state-of-the-art performance among both online and offline trackers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Casts TAP as sequential masked token decoding
Causal, purely online tracking without tracking-specific inductive biases
Tracking heuristics emerge naturally from end-to-end training