Track-On2: Enhancing Online Point Tracking with Memory

📅 2025-09-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses online long-term point tracking in videos: keeping each point's identity consistent frame by frame under significant appearance changes, motion, and occlusion, using only the current and past frames (causal processing). The authors propose Track-On2, a simple and efficient Transformer-based architecture that extends their earlier Track-On model. Its key contributions are: (1) architectural refinements and more effective use of memory, which maintains temporal coherence and handles drift and occlusion without future frames; (2) a systematic study of how synthetic training setups affect memory behavior, together with improved synthetic training strategies; and (3) a two-stage inference design that combines coarse patch-level classification with fine-grained refinement. Evaluated on five synthetic and real-world benchmarks, Track-On2 surpasses prior online trackers and even strong offline methods that exploit future frames. These results show that a causal, memory-based tracker trained purely on synthetic data can achieve state-of-the-art point tracking on real-world video.

📝 Abstract

In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e. tracking points frame-by-frame, making it suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. Through comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2
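The causal, memory-based processing described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the Track-On2 architecture: the class name `ToyOnlineTracker`, the cosine-similarity matching, and the fixed-size `deque` memory are all hypothetical stand-ins chosen for illustration. It only shows the shape of the idea: each frame is consumed once, in order, and the tracker consults nothing but the query and a bounded memory of past matches.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ToyOnlineTracker:
    """Hypothetical causal tracker: sees only the current frame and its memory."""

    def __init__(self, query_feature, memory_size=4):
        self.query = list(query_feature)
        # Bounded memory: the oldest entry is dropped automatically when full.
        self.memory = deque(maxlen=memory_size)

    def step(self, patch_features):
        """Return the index of the best-matching patch in the current frame."""
        references = [self.query] + list(self.memory)
        # Score each patch by its mean similarity to the query and the memory.
        scores = [
            sum(cosine(ref, patch) for ref in references) / len(references)
            for patch in patch_features
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Store the matched appearance so later frames can follow gradual change.
        self.memory.append(list(patch_features[best]))
        return best


# Three frames with three patch features each; the target appearance drifts.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.0, 1.0], [0.9, 0.3], [-1.0, 0.0]],
    [[-1.0, 0.0], [0.0, 1.0], [0.7, 0.6]],
]
tracker = ToyOnlineTracker(query_feature=[1.0, 0.0], memory_size=2)
track = [tracker.step(frame) for frame in frames]  # frames consumed strictly in order
```

Because `step` never looks ahead, the same loop works on a live stream; the bounded memory keeps per-frame cost constant regardless of video length.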

Problem

Research questions and friction points this paper is trying to address.

Enabling consistent point identification across video frames despite appearance changes and occlusions

Developing an online tracking model for real-time applications without future frame access

Improving temporal coherence through memory mechanisms to handle drift in long sequences

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based model with memory mechanism

Coarse patch classification followed by refinement

Synthetic training strategies for temporal robustness
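The "coarse patch classification followed by refinement" idea listed above can be sketched on a toy score map. The details here are assumptions, since this page does not describe how the paper scores patches or refines locations: the function `coarse_to_fine` is hypothetical, the coarse step scores each patch by its mean response, and the fine step is a softmax-weighted average of positions (a soft-argmax) inside the winning patch.

```python
import math


def coarse_to_fine(score_map, patch):
    """Pick the best patch, then refine to a sub-pixel (row, col) inside it."""
    height, width = len(score_map), len(score_map[0])

    # Coarse step: treat every patch as a class and score it by its mean response.
    best_score, best_origin = -math.inf, (0, 0)
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            cells = [
                score_map[r][c]
                for r in range(top, min(top + patch, height))
                for c in range(left, min(left + patch, width))
            ]
            mean = sum(cells) / len(cells)
            if mean > best_score:
                best_score, best_origin = mean, (top, left)

    # Fine step: softmax-weighted average of positions inside the chosen patch.
    top, left = best_origin
    weights, rows, cols = [], [], []
    for r in range(top, min(top + patch, height)):
        for c in range(left, min(left + patch, width)):
            weights.append(math.exp(score_map[r][c]))
            rows.append(r)
            cols.append(c)
    total = sum(weights)
    row = sum(w * r for w, r in zip(weights, rows)) / total
    col = sum(w * c for w, c in zip(weights, cols)) / total
    return best_origin, (row, col)


# A 4x4 score map with a single strong response at row 2, column 3.
score_map = [[0.0] * 4 for _ in range(4)]
score_map[2][3] = 5.0
origin, point = coarse_to_fine(score_map, patch=2)
```

Splitting the search this way keeps the classification problem small (one class per patch) while the refinement step recovers precision that the patch grid alone would lose.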
