🤖 AI Summary
Online dense point tracking requires persistent localization of all points from the first frame throughout a video sequence, particularly under occlusions and appearance variations. Conventional optical flow methods neglect temporal consistency, leading to drift, while existing sliding-window approaches suffer from inefficient propagation and poor long-range tracking performance. This paper introduces SPOT, a lightweight end-to-end differentiable framework for online inference. SPOT pioneers a streaming memory mechanism that jointly integrates feature-enhanced memory retrieval, sensory short-term motion modeling, and visibility-guided splatting-based propagation. On the CVO benchmark, SPOT achieves state-of-the-art accuracy; on TAP-Vid and RoboTAP, it matches or surpasses offline models despite operating online. SPOT reduces model parameters by 10× and accelerates inference by over 2× compared to prior online methods.
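Since the paper's implementation isn't reproduced here, the following is a minimal, hypothetical PyTorch sketch of the *online streaming-memory pattern* the summary describes: per-frame memory reading to enhance features, a recurrent sensory state for short-term motion, and a memory bank that grows as frames stream in. All module choices (attention for memory reading, a GRU cell for the sensory state), names, and shapes are illustrative assumptions, not SPOT's actual design.

```python
import torch
import torch.nn as nn

class StreamingTracker(nn.Module):
    """Toy online tracker: reads a growing memory bank, updates a
    per-pixel recurrent (sensory) state, and predicts flow + visibility."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)      # stand-in feature encoder
        self.memory_read = nn.MultiheadAttention(feat_dim, 4, batch_first=True)
        self.sensory = nn.GRUCell(feat_dim, feat_dim)            # short-term motion state
        self.head = nn.Linear(feat_dim, 3)                       # (dx, dy, visibility logit)

    def forward(self, frame, memory_bank, sensory_state):
        B, _, H, W = frame.shape
        feat = self.encoder(frame)                               # (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        # Memory reading: enhance current-frame features with stored tokens.
        enhanced, _ = self.memory_read(tokens, memory_bank, memory_bank)
        # Sensory memory: per-pixel GRU carries short-term motion dynamics.
        sensory_state = self.sensory(enhanced.reshape(-1, enhanced.shape[-1]),
                                     sensory_state)
        out = self.head(sensory_state).reshape(B, H * W, 3)
        flow = out[..., :2].transpose(1, 2).reshape(B, 2, H, W)
        visibility = out[..., 2].sigmoid().reshape(B, 1, H, W)
        # Stream the new tokens into memory (a real system would subsample/prune).
        memory_bank = torch.cat([memory_bank, tokens.detach()], dim=1)
        return flow, visibility, memory_bank, sensory_state

# Usage: seed memory (in practice, with first-frame tokens), then step per frame.
tracker = StreamingTracker()
B, H, W = 1, 16, 16
memory = torch.zeros(B, 1, 128)
state = torch.zeros(B * H * W, 128)
for frame in torch.rand(5, B, 3, H, W):                          # fake 5-frame video
    flow, visibility, memory, state = tracker(frame, memory, state)
```

The key property this sketch shares with the described approach is that each frame is processed once, online, with information from the first frame reaching the current one through the memory rather than through a sliding window.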
📝 Abstract
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with **S**treaming memory for dense **PO**int **T**racking and online video processing. The **SPOT** framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT, with 10× fewer parameters, operates at least 2× faster than previous state-of-the-art models while maintaining the best performance on CVO. We will release the models and code at: https://github.com/DQiaole/SPOT.
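The abstract names a visibility-guided splatting module but does not specify it; as a rough, hypothetical illustration of what visibility-guided forward splatting generally means, the sketch below scatters source-frame values to their flow-displaced target locations, weights each contribution by a visibility map, and normalizes. This is a generic nearest-neighbor average splat under assumed tensor shapes, not SPOT's actual module.

```python
import torch

def visibility_splat(values, flow, visibility):
    """Forward-splat `values` along `flow`, weighted by `visibility`.

    values: (B, C, H, W); flow: (B, 2, H, W) in pixels; visibility: (B, 1, H, W).
    Low-visibility (occluded) pixels contribute little, so visible evidence
    dominates the propagated map.
    """
    B, C, H, W = values.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys]).float().unsqueeze(0)        # (1, 2, H, W), (x, y) order
    target = (grid + flow).round().long()                    # displaced integer coords
    tx = target[:, 0].clamp(0, W - 1)                        # (B, H, W)
    ty = target[:, 1].clamp(0, H - 1)
    idx = (ty * W + tx).reshape(B, 1, -1)                    # flat target indices
    w = visibility.reshape(B, 1, -1)
    num = torch.zeros(B, C, H * W).scatter_add_(2, idx.repeat(1, C, 1),
                                                values.reshape(B, C, -1) * w)
    den = torch.zeros(B, 1, H * W).scatter_add_(2, idx, w)
    return (num / den.clamp_min(1e-6)).reshape(B, C, H, W)

# Usage: propagate (hypothetical) first-frame features to the current frame.
feats = torch.rand(1, 8, 16, 16)
flow = torch.randn(1, 2, 16, 16)
vis = torch.rand(1, 1, 16, 16)
warped = visibility_splat(feats, flow, vis)
```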