Track-On: Transformer-based Online Point Tracking with Memory

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenging problem of long-term, real-time point tracking in video streams—particularly under severe photometric variations, occlusions, and viewpoint changes. To this end, we propose a causal dual-memory Transformer architecture. Methodologically, we introduce a lightweight online mechanism that jointly leverages spatial memory (for point-specific appearance) and contextual memory (for spatiotemporal coherence), operating strictly causally—i.e., using only past frames without access to future or full-sequence information. Our approach further integrates patch-level classification with iterative refinement to enable efficient temporal modeling. Evaluated on seven standard benchmarks—including TAP-Vid—our method establishes new state-of-the-art performance for online point tracking, matching or surpassing several offline methods while maintaining real-time inference speed (>30 FPS), strong generalization across domains, and robustness to challenging visual degradations.

📝 Abstract
In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: https://kuis-ai.github.io/track_on
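The pipeline described above (per-point spatial memory, causal frame-by-frame processing, and coarse patch classification) can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation: the class name, the mean-pooled memory readout (the paper uses transformer attention over memory), and the omission of the context-memory and refinement stages are all simplifications.

```python
from collections import deque
import numpy as np

class OnlinePointTracker:
    """Hypothetical sketch of causal, memory-based point tracking.

    For each incoming frame, the tracked point is re-localized by
    patch classification: the feature grid patch most similar to the
    point's remembered appearance is taken as the coarse estimate.
    Only past frames are ever touched (strictly causal).
    """

    def __init__(self, memory_size=8):
        # Spatial memory: a bounded queue of the point's recent
        # appearance features (FIFO stands in for learned updates).
        self.memory = deque(maxlen=memory_size)

    def init(self, feat_grid, y, x):
        # Store the query point's feature from the first frame.
        self.memory.append(feat_grid[y, x])

    def track(self, feat_grid):
        # Aggregate memory into one query descriptor (mean here;
        # the paper attends over memory entries instead).
        q = np.mean(self.memory, axis=0)
        q = q / (np.linalg.norm(q) + 1e-8)
        h, w, d = feat_grid.shape
        flat = feat_grid.reshape(-1, d)
        flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
        # Patch classification: argmax cosine similarity gives the
        # coarse patch-level location (refinement omitted here).
        idx = int(np.argmax(flat @ q))
        y, x = divmod(idx, w)
        # Update spatial memory with the matched appearance, so the
        # tracker adapts to gradual appearance change over time.
        self.memory.append(feat_grid[y, x])
        return y, x
```

Because the memory only ever ingests past observations, the same loop runs unchanged on a live stream, which is the property that distinguishes this online setting from offline trackers that attend over the full clip.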
Problem

Research questions and friction points this paper is trying to address.

Object Tracking
Video Analysis
Real-time Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based tracking
Spatial-temporal memory
Real-time object tracking
Görkay Aydemir
Department of Computer Engineering, Koç University, KUIS AI Center
Xiongyi Cai
School of Artificial Intelligence, Shanghai Jiao Tong University
Weidi Xie
Shanghai Jiao Tong University | VGG, University of Oxford
Computer Vision, AI for Healthcare, AI for Science
Fatma Güney
Koç University
computer vision, autonomous driving, depth estimation, optical flow, video prediction