Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the dual challenges of scarce annotated data and limited model interpretability in zero-shot point tracking. We propose HeFT, the first framework to uncover functional specialization among attention heads in Video Diffusion Transformers (VDiT), showing that low-frequency features predominantly govern spatiotemporal correspondence modeling. Methodologically, HeFT introduces a head- and frequency-aware feature selection mechanism that integrates single-step denoising, soft-argmax localization, and forward-backward consistency verification, requiring no fine-tuning or supervised training. On the TAP-Vid benchmark, HeFT achieves state-of-the-art zero-shot performance, approaching the accuracy of fully supervised methods and substantially outperforming existing unsupervised and zero-shot approaches. Our core contributions are threefold: (1) revealing an intrinsic spatiotemporal division of labor within VDiT; (2) proposing a lightweight, interpretable zero-shot tracking paradigm; and (3) empirically demonstrating that pre-trained video diffusion models encode strong geometric priors.

📝 Abstract
In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of the Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
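The soft-argmax localization mentioned in the abstract turns a correlation map into a sub-pixel point estimate. A minimal sketch (the `temperature` value is an illustrative assumption, not the paper's setting):

```python
import numpy as np

def soft_argmax(corr, temperature=0.05):
    """Differentiable peak localization over a 2-D correlation map.

    corr: (H, W) similarity scores between a query feature and every
    location in the target frame. Returns a sub-pixel (x, y) as the
    softmax-weighted expectation of pixel coordinates.
    """
    h, w = corr.shape
    # Numerically stable softmax over the flattened map.
    probs = np.exp((corr - corr.max()) / temperature)
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((probs * xs).sum()), float((probs * ys).sum())

# A sharp peak at column x=3, row y=2 is recovered almost exactly.
corr = np.zeros((8, 8))
corr[2, 3] = 1.0
x, y = soft_argmax(corr)
```

A low temperature makes the estimate approach the hard argmax while staying differentiable; a higher one averages over neighboring peaks.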
Problem

Research questions and friction points this paper is trying to address.

Annotated point-tracking data is scarce, limiting supervised training
How video diffusion models encode spatiotemporal information is poorly understood
Existing unsupervised and zero-shot trackers lag well behind supervised methods
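The finding that low-frequency components carry the correspondence signal suggests a simple low-pass step over the features. A hypothetical sketch of such filtering via a 2-D FFT (the `keep_ratio` cutoff and the head-selection criterion are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def low_pass_features(feat, keep_ratio=0.25):
    """Keep only the low-frequency components of an (H, W, C) feature map.

    Per-channel 2-D FFT; frequencies outside a centered square of
    half-side keep_ratio * min(H, W) / 2 are zeroed before inverting.
    """
    h, w, _ = feat.shape
    spec = np.fft.fftshift(np.fft.fft2(feat, axes=(0, 1)), axes=(0, 1))
    r = int(keep_ratio * min(h, w) / 2)
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w, 1))
    mask[cy - r:cy + r + 1, cx - r:cx + r + 1] = 1.0  # low-pass window
    out = np.fft.ifft2(np.fft.ifftshift(spec * mask, axes=(0, 1)), axes=(0, 1))
    return out.real
```

A constant (pure-DC) map passes through unchanged, while checkerboard-like high-frequency noise is suppressed, matching the intuition that high frequencies mostly add noise to matching.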
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reveals head-level functional specialization and low-frequency correspondence cues in VDiT
Head- and frequency-aware feature selection requiring no fine-tuning or supervision
Single-step denoising with soft-argmax localization and forward-backward consistency checks
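The forward-backward consistency check used for localization can be sketched as a round-trip match between two frames' feature maps. A minimal illustration assuming L2-normalized features and a hard argmax match (the paper uses soft-argmax; the `tau` threshold is a hypothetical value):

```python
import numpy as np

def track_point(query_feat, target_feats):
    """Match one L2-normalized query feature (C,) against an (H, W, C)
    feature map by cosine similarity; return the argmax location (x, y)."""
    sim = np.einsum('hwc,c->hw', target_feats, query_feat)
    y, x = np.unravel_index(np.argmax(sim), sim.shape)
    return x, y

def forward_backward_check(feats_a, feats_b, x0, y0, tau=2.0):
    """Track (x0, y0) from frame A to frame B and back; accept the match
    only if the round trip lands within tau pixels of the start."""
    x1, y1 = track_point(feats_a[y0, x0], feats_b)      # forward pass
    x2, y2 = track_point(feats_b[y1, x1], feats_a)      # backward pass
    err = np.hypot(x2 - x0, y2 - y0)
    return (x1, y1), err <= tau
```

When the two frames share identical features, the round trip returns to the start and the match is accepted; occlusions or ambiguous matches break the cycle and are rejected.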