Emergent Temporal Correspondences from Video Diffusion Transformers

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how video diffusion Transformers (DiTs) internally model inter-frame temporal correspondences. To this end, we propose DiffTrack, the first quantitative framework for analyzing how DiTs establish temporal correspondence, comprising a synthetically generated video dataset with pseudo-ground-truth point-tracking annotations and a systematic decomposition of the spatiotemporal interactions within full 3D attention. Our analysis reveals that query-key similarity in specific layers, but not all, drives temporal matching, and that this matching strengthens markedly over the denoising steps, correlating with improved correspondence accuracy. Leveraging this insight, we introduce a zero-shot point tracking method and a training-free motion-aware guidance strategy for video generation. Experiments demonstrate state-of-the-art zero-shot point tracking and significant gains in the temporal consistency of generated videos. This work establishes tools and insights for interpretability analysis and downstream applications of video DiTs.
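
To make the analysis concrete, here is a minimal sketch (not the paper's released code) of how cross-frame query-key matching could be probed in a single 3D-attention layer. The tensor layout `(frames, tokens_per_frame, dim)`, the scaled dot product, and the argmax readout are illustrative assumptions.

```python
import torch

def cross_frame_matching(q, k, src_frame, tgt_frame):
    """Probe query-key matching between two frames.

    q, k: (frames, tokens_per_frame, dim) queries/keys captured
    from one attention layer at one denoising step.
    """
    q_src = q[src_frame]                            # (N, d) source-frame queries
    k_tgt = k[tgt_frame]                            # (N, d) target-frame keys
    sim = q_src @ k_tgt.T / q_src.shape[-1] ** 0.5  # scaled dot-product similarity
    matches = sim.argmax(dim=-1)                    # best-matching target token per query
    return sim, matches
```

Scoring `matches` against pseudo-ground-truth tracks, repeated per layer and per denoising step, is the kind of sweep that would reveal which layers and timesteps carry temporal correspondence.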

📝 Abstract
Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated videos with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly prominent during the denoising process. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.
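
As an illustration of the zero-shot tracking application, the sketch below propagates a single query point through a video by soft-argmax over query-key similarity from one attention layer. The layer choice, grid resolution, normalized-coordinate convention, and softmax temperature are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def track_point(q, k, point_xy, grid_hw, temperature=0.05):
    """Track one point from frame 0 across all frames.

    q, k: (frames, H*W, dim) per-frame queries/keys from one layer.
    point_xy: (2,) location in frame 0, normalized to [0, 1].
    Returns (frames, 2) locations in grid-cell coordinates.
    """
    H, W = grid_hw
    # Bilinearly sample the query feature at the tracked location.
    feat = q[0].reshape(H, W, -1).permute(2, 0, 1)[None]               # (1, d, H, W)
    xy = point_xy[None, None, None] * 2 - 1                            # [0,1] -> [-1,1]
    q_pt = F.grid_sample(feat, xy, align_corners=True).reshape(1, -1)  # (1, d)

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()      # (H*W, 2)
    tracks = []
    for t in range(k.shape[0]):
        sim = (q_pt @ k[t].T) / q_pt.shape[-1] ** 0.5                  # (1, H*W)
        w = F.softmax(sim / temperature, dim=-1)                       # match distribution
        tracks.append(w @ coords)                                      # soft-argmax location
    return torch.cat(tracks)
```

A hard argmax would give discrete matches; the soft-argmax readout trades them for sub-cell precision and is a common choice in the correspondence literature.
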
Problem

Research questions and friction points this paper is trying to address.

How do video diffusion models internally establish and represent temporal correspondences across frames?
How does each component of DiTs' full 3D attention mechanism (representations, layers, timesteps) contribute to temporal matching, and how can this be quantified?
How can these findings improve zero-shot point tracking and the temporal consistency of generated videos?
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiffTrack, the first quantitative framework for analyzing how DiTs establish temporal correspondences
A prompt-generated video dataset with pseudo-ground-truth tracking annotations and dedicated evaluation metrics
A training-free motion-aware guidance strategy that enhances video generation (a sketch of the guidance update follows this list)
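
The motion-aware guidance can be pictured as a classifier-free-guidance-style extrapolation between two forward passes. Below is a hedged sketch: `perturb_attention` is a hypothetical hook that weakens cross-frame query-key matching, and the model signature and guidance scale are assumptions, not the paper's exact formulation.

```python
import torch
from contextlib import contextmanager

@contextmanager
def perturb_attention(model):
    """Hypothetical hook: weaken cross-frame query-key matching inside
    the attention layers (e.g., by masking cross-frame entries) for the
    enclosed forward pass. Backbone-specific; placeholder only."""
    yield

@torch.no_grad()
def guided_noise_prediction(model, x_t, t, cond, scale=2.0):
    """Guidance-style prediction that amplifies temporal matching.
    `model(x_t, t, cond)` is assumed to return a noise prediction."""
    eps = model(x_t, t, cond)        # prediction with intact attention
    with perturb_attention(model):   # prediction with weakened matching
        eps_weak = model(x_t, t, cond)
    # Extrapolate away from the weak-matching prediction, strengthening
    # whatever the intact attention contributes to temporal consistency.
    return eps_weak + scale * (eps - eps_weak)
```

Because the update only recombines two forward passes of a frozen model, it requires no training, consistent with the paper's claim of a training-free guidance strategy.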