Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Diffusion Transformers (DiTs) use 3D full attention and therefore struggle to disentangle motion from appearance in motion transfer tasks. To address this, the paper proposes DeT, a diffusion-based framework that (i) applies a lightweight temporal kernel to smooth DiT features along the temporal dimension, decoupling foreground motion from background appearance, and (ii) adds explicit supervision along dense trajectories in the latent feature space to enforce motion consistency. The paper also introduces MTBench, a general and challenging motion transfer benchmark, together with a hybrid motion fidelity metric that accounts for both global and local motion similarity. Extensive experiments on MTBench show that DeT achieves the best trade-off between motion fidelity and edit fidelity.

📝 Abstract
The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within a 3D U-Net. In contrast, state-of-the-art video Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. The resulting interaction between the spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve their motion transfer ability. Our approach introduces a simple yet effective temporal kernel that smooths DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. At the same time, the temporal kernel captures temporal variations in DiT features, which are closely related to motion. We further introduce explicit supervision along dense trajectories in the latent feature space to enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer, together with a hybrid motion fidelity metric that considers both global and local motion similarity, providing a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
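The core idea of smoothing DiT features along the temporal dimension can be illustrated with a small sketch. The paper does not specify the kernel's exact parameterization here, so the averaging kernel, edge padding, and the `(T, C)` feature layout below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def temporal_smooth(features, kernel_size=3):
    """Smooth per-frame features along the temporal axis with a 1D kernel.

    features: array of shape (T, C) -- T frames, C feature channels.
    A simple box (average) kernel stands in for the paper's learned/
    lightweight temporal kernel; edge padding keeps the length at T.
    """
    pad = kernel_size // 2
    padded = np.pad(features, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    out = np.empty(features.shape, dtype=float)
    for c in range(features.shape[1]):
        out[:, c] = np.convolve(padded[:, c], kernel, mode="valid")
    return out
```

Intuitively, low-frequency content along time (static background appearance) survives the smoothing, while the residual `features - temporal_smooth(features)` isolates fast temporal variation, which correlates with motion.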
Problem

Research questions and friction points this paper is trying to address.

Improve motion transfer in video Diffusion Transformers (DiT)
Decouple motion and appearance via temporal kernel smoothing
Enhance motion consistency with dense trajectory supervision
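The dense-trajectory supervision above can be sketched as a consistency penalty on latent features sampled along tracked points. The shapes, integer-grid sampling, and MSE form below are simplifying assumptions; the paper's actual loss operates on dense trajectories in the latent feature space and may differ in detail.

```python
import numpy as np

def trajectory_consistency_loss(latents, trajectories):
    """Penalize latent-feature change along point tracks.

    latents:      array of shape (T, H, W, C) -- per-frame latent features.
    trajectories: array of shape (N, T, 2) -- integer (y, x) positions of
                  N tracked points in each of the T frames (assumed layout).
    Returns the mean squared feature difference between consecutive
    positions on each trajectory.
    """
    T = latents.shape[0]
    loss, count = 0.0, 0
    for traj in trajectories:
        for t in range(T - 1):
            y0, x0 = traj[t]
            y1, x1 = traj[t + 1]
            f0 = latents[t, y0, x0]       # feature at the point in frame t
            f1 = latents[t + 1, y1, x1]   # feature at the point in frame t+1
            loss += np.mean((f0 - f1) ** 2)
            count += 1
    return loss / max(count, 1)
```

A perfectly consistent trajectory (identical features along the track) yields zero loss, so minimizing this term pushes the model to keep each tracked point's latent representation stable as it moves.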
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces temporal kernel for motion decoupling
Uses dense trajectory supervision for consistency
Proposes hybrid metric for motion evaluation
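A hybrid metric combining global and local motion similarity could be sketched as a weighted blend of frame-level and per-pixel flow agreement. The cosine-similarity formulation, the `(T, H, W, 2)` optical-flow inputs, and the `alpha` weighting below are illustrative assumptions; the paper's exact metric definition is not given in this card.

```python
import numpy as np

def hybrid_motion_fidelity(flow_src, flow_gen, alpha=0.5):
    """Blend global and local motion similarity between two flow fields.

    flow_src, flow_gen: arrays of shape (T, H, W, 2) -- per-frame optical
    flow of the source and generated videos. Cosine similarities are mapped
    from [-1, 1] to [0, 1]; alpha weights the global vs. local term.
    """
    eps = 1e-8
    # Global term: cosine similarity of frame-averaged motion vectors.
    g_src = flow_src.mean(axis=(1, 2))                      # (T, 2)
    g_gen = flow_gen.mean(axis=(1, 2))
    num = (g_src * g_gen).sum(axis=1)
    den = np.linalg.norm(g_src, axis=1) * np.linalg.norm(g_gen, axis=1) + eps
    global_sim = ((num / den + 1) / 2).mean()
    # Local term: per-pixel cosine similarity, averaged over all pixels.
    num_l = (flow_src * flow_gen).sum(axis=-1)
    den_l = (np.linalg.norm(flow_src, axis=-1)
             * np.linalg.norm(flow_gen, axis=-1) + eps)
    local_sim = ((num_l / den_l + 1) / 2).mean()
    return alpha * global_sim + (1 - alpha) * local_sim
```

The global term rewards matching overall camera/object motion even when fine details differ, while the local term rewards matching per-pixel dynamics, which is the trade-off a hybrid fidelity metric is meant to capture.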