🤖 AI Summary
Existing video diffusion models struggle with temporally coherent and high-fidelity motion synthesis in highly dynamic scenes due to the limitations of static loss functions, which fail to adequately capture complex motion dynamics. To address this, the work proposes a Latent Temporal Discrepancy (LTD)-driven, motion-aware loss weighting strategy that leverages inter-frame changes in latent space as a motion prior. By assigning stronger penalties to regions exhibiting high temporal variation, the method stabilizes training and enhances the model's capacity to reconstruct high-frequency dynamics. This approach overcomes the constraints of conventional static losses and achieves state-of-the-art performance, surpassing strong baselines by 3.31% on VBench and 3.58% on VMBench, thereby significantly improving motion fidelity in generated videos.
📝 Abstract
Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on a static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
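To make the weighting idea concrete, below is a minimal PyTorch sketch of an LTD-style motion-aware loss: it computes frame-to-frame differences in the latent space, normalizes them into a per-pixel weight map, and uses that map to reweight the standard diffusion MSE loss. The function name `ltd_weighted_loss`, the tensor layout `(B, T, C, H, W)`, the normalization scheme, and the scaling factor `alpha` are all assumptions for illustration; this is not the authors' released implementation.

```python
# Illustrative sketch of LTD-based motion-aware loss weighting.
# Shapes, normalization, and the `alpha` hyperparameter are assumptions.
import torch
import torch.nn.functional as F


def ltd_weighted_loss(pred_noise, target_noise, latents, alpha=1.0):
    """Diffusion MSE loss reweighted by Latent Temporal Discrepancy (LTD).

    pred_noise, target_noise: (B, T, C, H, W) model prediction and target.
    latents: (B, T, C, H, W) clean video latents used to compute LTD.
    alpha: assumed scaling factor controlling how strongly motion is emphasized.
    """
    # LTD: magnitude of frame-to-frame change in latent space.
    diff = latents[:, 1:] - latents[:, :-1]        # (B, T-1, C, H, W)
    ltd = diff.abs().mean(dim=2, keepdim=True)     # (B, T-1, 1, H, W)

    # Pad so every frame gets a weight (first frame reuses the first LTD map).
    ltd = torch.cat([ltd[:, :1], ltd], dim=1)      # (B, T, 1, H, W)

    # Normalize per sample, then map to weights >= 1:
    # dynamic regions get larger penalties, stable regions keep weight ~1.
    ltd = ltd / (ltd.amax(dim=(1, 3, 4), keepdim=True) + 1e-6)
    weights = 1.0 + alpha * ltd

    per_elem = F.mse_loss(pred_noise, target_noise, reduction="none")
    return (weights * per_elem).mean()
```

The key design point reflected here is that the weight map is derived from the latents rather than the pixels, so the extra penalty targets regions with high temporal discrepancy in the representation the diffusion model actually optimizes, while low-motion regions fall back to the ordinary loss.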