🤖 AI Summary
Existing video diffusion models struggle with temporally coherent and high-fidelity motion synthesis in highly dynamic scenes due to the limitations of static loss functions, which fail to adequately capture complex motion dynamics. To address this, the work proposes a Latent Temporal Discrepancy (LTD)-driven, motion-aware loss weighting strategy that leverages inter-frame changes in latent space as a motion prior. By assigning stronger penalties to regions exhibiting high temporal variation, the method stabilizes training and enhances the model's capacity to reconstruct high-frequency dynamics. This approach overcomes the constraints of conventional static losses and achieves state-of-the-art performance, surpassing strong baselines by 3.31% on VBench and 3.58% on VMBench, thereby significantly improving motion fidelity in generated videos.
📝 Abstract
Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on a static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
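To make the weighting idea concrete, below is a minimal PyTorch sketch of an LTD-style motion-aware loss: it computes frame-to-frame differences in the latent space, normalizes them into a per-pixel weight map, and uses that map to reweight the standard diffusion MSE loss. The function name `ltd_weighted_loss`, the tensor layout `(B, T, C, H, W)`, the normalization scheme, and the scaling factor `alpha` are all assumptions for illustration; this is not the authors' released implementation.

```python
# Illustrative sketch of LTD-based motion-aware loss weighting.
# Shapes, normalization, and the `alpha` hyperparameter are assumptions.
import torch
import torch.nn.functional as F


def ltd_weighted_loss(pred_noise, target_noise, latents, alpha=1.0):
    """Diffusion MSE loss reweighted by Latent Temporal Discrepancy (LTD).

    pred_noise, target_noise: (B, T, C, H, W) model prediction and target.
    latents: (B, T, C, H, W) clean video latents used to compute LTD.
    alpha: assumed scaling factor controlling how strongly motion is emphasized.
    """
    # LTD: magnitude of frame-to-frame change in latent space.
    diff = latents[:, 1:] - latents[:, :-1]        # (B, T-1, C, H, W)
    ltd = diff.abs().mean(dim=2, keepdim=True)     # (B, T-1, 1, H, W)

    # Pad so every frame gets a weight (first frame reuses the first LTD map).
    ltd = torch.cat([ltd[:, :1], ltd], dim=1)      # (B, T, 1, H, W)

    # Normalize per sample, then map to weights >= 1:
    # dynamic regions get larger penalties, stable regions keep weight ~1.
    ltd = ltd / (ltd.amax(dim=(1, 3, 4), keepdim=True) + 1e-6)
    weights = 1.0 + alpha * ltd

    per_elem = F.mse_loss(pred_noise, target_noise, reduction="none")
    return (weights * per_elem).mean()
```

The key design point reflected here is that the weight map is derived from the latents rather than the pixels, so the extra penalty targets regions with high temporal discrepancy in the representation the diffusion model actually optimizes, while low-motion regions fall back to the ordinary loss.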