Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video diffusion models struggle with temporally coherent, high-fidelity motion synthesis in highly dynamic scenes because static loss functions fail to capture complex motion dynamics. To address this, the work proposes a motion-aware loss-weighting strategy driven by Latent Temporal Discrepancy (LTD), which leverages inter-frame changes in latent space as a motion prior. By assigning stronger penalties to regions exhibiting high temporal variation, the method stabilizes training and enhances the model’s capacity to reconstruct high-frequency dynamics. This approach overcomes the constraints of conventional static losses and achieves state-of-the-art performance, surpassing strong baselines by 3.31% on VBench and 3.58% on VMBench, significantly improving motion fidelity in generated videos.

📝 Abstract
Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is because noise disrupts temporal coherence and increases the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on a static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
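Below is a minimal PyTorch sketch of the loss-weighting idea the abstract describes: frame-to-frame latent differences are turned into per-region weights that scale the diffusion regression loss. The function name `ltd_weighted_loss`, the (B, T, C, H, W) latent layout, the per-sample normalization, and the `alpha` scaling hyperparameter are illustrative assumptions, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def ltd_weighted_loss(pred, target, latents, alpha=1.0):
    """Sketch of an LTD-style motion-aware weighted diffusion loss.

    pred / target: denoiser output and its regression target, (B, T, C, H, W).
    latents: video latents used only to compute the motion prior.
    alpha: assumed scaling hyperparameter (not specified in the source).
    """
    # Latent Temporal Discrepancy: frame-to-frame change in latent space.
    diff = (latents[:, 1:] - latents[:, :-1]).abs()      # (B, T-1, C, H, W)
    # Pad so every frame gets a weight (first frame reuses the first diff).
    ltd = torch.cat([diff[:, :1], diff], dim=1)          # (B, T, C, H, W)
    # Normalize per sample to [0, 1] so weights are scale-free.
    ltd = ltd / (ltd.amax(dim=(1, 2, 3, 4), keepdim=True) + 1e-8)
    # Higher discrepancy -> larger penalty; stable regions keep weight ~1,
    # i.e., regular optimization for static content.
    weights = 1.0 + alpha * ltd
    return (weights * F.mse_loss(pred, target, reduction="none")).mean()
```

In a training loop this would replace the plain MSE objective, with `latents` taken from the clean video encoding so the weights reflect true motion rather than injected noise.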
Problem

Research questions and friction points this paper is trying to address.

video generation
temporal coherence
dynamic fidelity
diffusion models
motion quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Temporal Discrepancy
motion prior
loss weighting
temporal coherence
video generation
🔎 Similar Papers
No similar papers found.
Meiqi Wu
University of Chinese Academy of Sciences
Computer vision
Bingze Song
AMAP, Alibaba Group, Beijing, China
Ruimin Lin
AMAP, Alibaba Group, Beijing, China
Chen Zhu
Southeast University, Nanjing, China
Xiaokun Feng
Institute of Automation, Chinese Academy of Sciences
Computer vision, deep learning
Jiahong Wu
Alibaba-AMAP
AI, ML, AIGC, MLLM
Xiangxiang Chu
AMAP, Alibaba Group, Beijing, China
Kaiqi Huang
University of Chinese Academy of Sciences, Beijing, China