🤖 AI Summary
Point tracking in videos often exhibits multimodal uncertainty under occlusion and appearance changes, yet existing discriminative models produce only single-point estimates (e.g., mean predictions) and fail to capture the underlying trajectory distribution. To address this, we propose GenPT, the first generative point tracking framework that explicitly models multimodal trajectory distributions via flow matching. Methodologically, GenPT introduces a window-dependent prior to enforce cross-window temporal consistency, incorporates confidence-guided best-first search to improve occlusion recovery, and designs a variance schedule tailored to 2D point coordinates. Evaluated on the PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks, GenPT achieves significant improvements over state-of-the-art methods under occlusion while maintaining competitive accuracy on visible points. Our work establishes a new paradigm for generative visual tracking under uncertainty.
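To make the flow-matching component concrete, the following is a minimal, hypothetical sketch of sampling a 2D point by integrating a velocity field from a prior sample toward the data. The closed-form `velocity` (a rectified-flow-style conditional field pulling toward a fixed `TARGET`) stands in for GenPT's learned network; `TARGET`, the step count, and the Euler integrator are illustrative assumptions, not the paper's actual model.

```python
# Toy flow-matching sampler: the learned velocity field v(x, t) is replaced
# by a closed-form field whose flow transports any start point to TARGET.
TARGET = (3.0, -1.0)  # hypothetical "true" point position (stand-in for data)

def velocity(x, t):
    # For the linear path x_t = (1 - t) * x0 + t * x1, the conditional
    # velocity is (x1 - x_t) / (1 - t), which equals the constant x1 - x0.
    eps = 1e-6  # guard against division by zero at t = 1
    return tuple((TARGET[i] - x[i]) / max(1.0 - t, eps) for i in range(2))

def sample(x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = (x[0] + dt * v[0], x[1] + dt * v[1])
    return x

pt = sample((0.0, 0.0))  # starts at a prior sample, ends near TARGET
```

Because the flow here is exactly linear, Euler integration lands (up to floating-point error) on `TARGET`; with a learned field, the same loop would produce one sample from the modeled trajectory distribution, and repeated calls with different prior draws would expose its multimodality.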
📝 Abstract
Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel at regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how the model's generative capabilities can be leveraged to improve point trajectory estimates by applying a best-first search strategy to generated samples during inference, guided by the model's own confidence in its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model's ability to capture multi-modality. GenPT captures the multi-modality of point trajectories, which translates to state-of-the-art tracking accuracy on occluded points while maintaining competitive accuracy on visible points compared to existing discriminative point trackers.
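The confidence-guided best-first search described above can be sketched as a greedy priority-queue search over sampled per-window continuations. Everything below is a toy stand-in: `sample_window` replaces the generative sampler, `confidence` replaces the model's confidence head, and scoring a trajectory by its minimum per-point confidence is an illustrative choice, not the paper's exact criterion.

```python
import heapq
import random

random.seed(0)

def sample_window(prev_point, num_samples=4):
    """Hypothetical sampler: candidate next-window endpoints near prev_point."""
    return [(prev_point[0] + random.gauss(0, 1.0),
             prev_point[1] + random.gauss(0, 1.0)) for _ in range(num_samples)]

def confidence(point):
    """Hypothetical confidence head: higher for points nearer the origin."""
    return 1.0 / (1.0 + point[0] ** 2 + point[1] ** 2)

def best_first_track(start, num_windows=3, num_samples=4):
    """Greedy best-first search over sampled trajectory continuations.

    Each heap entry is (-score, counter, trajectory), where score is the
    minimum confidence along the partial trajectory; the most confident
    candidate is expanded first, and the first completed one is returned.
    """
    counter = 0  # tie-breaker so heapq never compares trajectory lists
    heap = [(-confidence(start), counter, [start])]
    while heap:
        neg_score, _, traj = heapq.heappop(heap)
        if len(traj) == num_windows + 1:
            return traj  # first completed candidate under best-first order
        for cand in sample_window(traj[-1], num_samples):
            counter += 1
            score = min(-neg_score, confidence(cand))
            heapq.heappush(heap, (-score, counter, traj + [cand]))
    return None

track = best_first_track((0.0, 0.0))  # start point plus one point per window
```

Expanding the highest-confidence candidate first is what lets the search back off from a low-confidence continuation (e.g., during an occlusion) and try an alternative sample instead of committing to a single regressed estimate.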