🤖 AI Summary
Point tracking in videos often exhibits multimodal uncertainty under occlusion and appearance changes, yet existing discriminative models produce only single-point estimates (e.g., mean predictions) and fail to capture the underlying trajectory distribution. To address this, we propose GenPT, the first generative point tracking framework that explicitly models multimodal trajectory distributions via flow matching. Methodologically, GenPT introduces a window-dependent prior to enforce cross-window temporal consistency, incorporates confidence-guided best-first search to improve occlusion recovery, and designs a variance schedule tailored to 2D point coordinates. Evaluated on the PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks, GenPT achieves significant improvements over state-of-the-art methods under occlusion while maintaining competitive accuracy on visible points. Our work establishes a new paradigm for generative visual tracking under uncertainty.
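To make the flow-matching component concrete, the following is a minimal, hypothetical sketch of sampling a 2D point by integrating a velocity field from a prior sample toward the data. The closed-form `velocity` (a rectified-flow-style conditional field pulling toward a fixed `TARGET`) stands in for GenPT's learned network; `TARGET`, the step count, and the Euler integrator are illustrative assumptions, not the paper's actual model.

```python
# Toy flow-matching sampler: the learned velocity field v(x, t) is replaced
# by a closed-form field whose flow transports any start point to TARGET.
TARGET = (3.0, -1.0)  # hypothetical "true" point position (stand-in for data)

def velocity(x, t):
    # For the linear path x_t = (1 - t) * x0 + t * x1, the conditional
    # velocity is (x1 - x_t) / (1 - t), which equals the constant x1 - x0.
    eps = 1e-6  # guard against division by zero at t = 1
    return tuple((TARGET[i] - x[i]) / max(1.0 - t, eps) for i in range(2))

def sample(x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = (x[0] + dt * v[0], x[1] + dt * v[1])
    return x

pt = sample((0.0, 0.0))  # starts at a prior sample, ends near TARGET
```

Because the flow here is exactly linear, Euler integration lands (up to floating-point error) on `TARGET`; with a learned field, the same loop would produce one sample from the modeled trajectory distribution, and repeated calls with different prior draws would expose its multimodality.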
📝 Abstract
Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel at regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how the model's generative capabilities can be leveraged to improve point trajectory estimates by applying a best-first search strategy to generated samples during inference, guided by the model's own confidence in its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model's ability to capture multi-modality. GenPT captures the multi-modality of point trajectories, which translates to state-of-the-art tracking accuracy on occluded points while maintaining competitive accuracy on visible points compared to existing discriminative point trackers.
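The confidence-guided best-first search described above can be sketched as a greedy priority-queue search over sampled per-window continuations. Everything below is a toy stand-in: `sample_window` replaces the generative sampler, `confidence` replaces the model's confidence head, and scoring a trajectory by its minimum per-point confidence is an illustrative choice, not the paper's exact criterion.

```python
import heapq
import random

random.seed(0)

def sample_window(prev_point, num_samples=4):
    """Hypothetical sampler: candidate next-window endpoints near prev_point."""
    return [(prev_point[0] + random.gauss(0, 1.0),
             prev_point[1] + random.gauss(0, 1.0)) for _ in range(num_samples)]

def confidence(point):
    """Hypothetical confidence head: higher for points nearer the origin."""
    return 1.0 / (1.0 + point[0] ** 2 + point[1] ** 2)

def best_first_track(start, num_windows=3, num_samples=4):
    """Greedy best-first search over sampled trajectory continuations.

    Each heap entry is (-score, counter, trajectory), where score is the
    minimum confidence along the partial trajectory; the most confident
    candidate is expanded first, and the first completed one is returned.
    """
    counter = 0  # tie-breaker so heapq never compares trajectory lists
    heap = [(-confidence(start), counter, [start])]
    while heap:
        neg_score, _, traj = heapq.heappop(heap)
        if len(traj) == num_windows + 1:
            return traj  # first completed candidate under best-first order
        for cand in sample_window(traj[-1], num_samples):
            counter += 1
            score = min(-neg_score, confidence(cand))
            heapq.heappush(heap, (-score, counter, traj + [cand]))
    return None

track = best_first_track((0.0, 0.0))  # start point plus one point per window
```

Expanding the highest-confidence candidate first is what lets the search back off from a low-confidence continuation (e.g., during an occlusion) and try an alternative sample instead of committing to a single regressed estimate.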