🤖 AI Summary
Point tracking requires robust inter-frame correspondence localization, yet existing approaches built on shallow CNN backbones such as ResNet process frames independently and lack explicit temporal modeling, leading to unreliable matching under occlusion and motion blur. This paper introduces DiTracker, the first method to leverage pre-trained video diffusion Transformers (DiTs) for point tracking, capitalizing on their strong implicit temporal modeling capability. DiTracker incorporates a query-key attention matching mechanism, lightweight LoRA-based fine-tuning, and multi-scale cost volume fusion with a ResNet backbone to achieve efficient spatiotemporal alignment. DiTracker achieves state-of-the-art performance on the ITTO benchmark and matches or surpasses prior best methods on TAP-Vid. Notably, it converges with only 1/8 the batch size required by competing approaches, demonstrating superior training efficiency and scalability.
📝 Abstract
Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8× smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on the TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
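To make the query-key matching idea concrete, here is a minimal sketch of how a matching cost map could be computed between a tracked point's feature and per-pixel features of a target frame. This is an illustration of the general technique, not the paper's actual implementation: the function name, feature shapes, and the use of cosine-normalized dot-product similarity with a softmax are all assumptions for this example.

```python
import numpy as np

def qk_matching_cost(query_feat, key_feats, temperature=1.0):
    """Matching distribution between one query-point feature and a frame's key features.

    query_feat: (C,) feature at the tracked point in the query frame
    key_feats:  (H, W, C) per-pixel key features of the target frame
    Returns a (H, W) softmax-normalized matching map (sums to 1).
    """
    # dot-product similarity, analogous to attention logits
    logits = key_feats @ query_feat / temperature  # (H, W)
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# toy example: unit-normalized features, query copied from location (1, 2)
H, W, C = 4, 5, 8
rng = np.random.default_rng(0)
keys = rng.normal(size=(H, W, C))
keys /= np.linalg.norm(keys, axis=-1, keepdims=True)  # cosine similarity
q = keys[1, 2].copy()
cost = qk_matching_cost(q, keys, temperature=0.1)
peak = np.unravel_index(cost.argmax(), cost.shape)    # peak at (1, 2)
```

In a tracker, such per-frame matching maps would be stacked over scales and frames into a cost volume; DiTracker additionally fuses DiT-derived costs with those from a ResNet backbone.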