🤖 AI Summary
This work addresses the degradation in motion fidelity commonly observed in image-conditioned video diffusion models during supervised fine-tuning, which manifests as weakened dynamics and temporal incoherence. To mitigate this issue, the authors propose the Smooth Hybrid Fine-tuning (SHIFT) framework, which integrates supervised fine-tuning with advantage-weighted fine-tuning. Central to SHIFT is a pixel-motion reward mechanism grounded in pixel-wise optical flow dynamics, designed to align the motion of generated videos with the motion characteristics of real data. By employing an adversarial advantage-driven hybrid strategy, the framework suppresses reward hacking while accelerating convergence. Experimental results demonstrate that SHIFT substantially alleviates dynamic collapse and achieves notable improvements in both motion realism and long-term temporal consistency.
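The summary describes a hybrid objective mixing a supervised fine-tuning term with an advantage-weighted term. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of that general idea: per-sample losses are reweighted by exponentiated, baseline-subtracted rewards (an AWR-style scheme) and blended with the uniform SFT loss. The function name and the `beta`/`mix` parameters are hypothetical.

```python
import numpy as np

def hybrid_finetune_loss(per_sample_losses, rewards, beta=1.0, mix=0.5):
    """Sketch of a hybrid objective: uniform SFT term plus an
    advantage-weighted term. Hypothetical, not SHIFT's actual loss."""
    losses = np.asarray(per_sample_losses, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    # Advantage = reward minus a batch-mean baseline.
    advantages = rewards - rewards.mean()
    # Exponentiated advantages act as per-sample weights (AWR-style),
    # normalized so the weighted term stays on the SFT loss scale.
    weights = np.exp(beta * advantages)
    weights /= weights.sum()
    sft_loss = losses.mean()                    # plain supervised term
    awr_loss = float(np.sum(weights * losses))  # advantage-weighted term
    return mix * sft_loss + (1.0 - mix) * awr_loss
```

With equal rewards the weights collapse to uniform and the hybrid loss reduces to the SFT mean; samples with above-average reward pull the objective toward their own losses.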
📝 Abstract
Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in the post-training of video diffusion models. To address this, we introduce pixel-motion rewards based on pixel-flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses standard supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse during supervised fine-tuning of modern video diffusion models.
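The abstract's pixel-motion reward is said to capture both instantaneous and long-term motion consistency from pixel-flux dynamics. As an illustration only, here is a toy NumPy stand-in: pixel flux is approximated by mean absolute frame-to-frame change, and a generated clip is rewarded for matching a reference clip's per-step flux (instantaneous) and total flux (long-term). Both function names and the specific penalty terms are assumptions, not the paper's definitions.

```python
import numpy as np

def pixel_flux(video):
    """Mean absolute per-pixel change between consecutive frames.
    video: array of shape (T, H, W) or (T, H, W, C)."""
    video = np.asarray(video, dtype=np.float64)
    diffs = np.abs(np.diff(video, axis=0))                 # (T-1, ...) deltas
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)  # flux per step

def pixel_motion_reward(gen_video, ref_video):
    """Toy motion-matching reward (hypothetical, not SHIFT's reward):
    penalize per-step flux mismatch (instantaneous) and total-flux
    mismatch over the clip (long-term)."""
    fg, fr = pixel_flux(gen_video), pixel_flux(ref_video)
    instantaneous = -np.abs(fg - fr).mean()  # per-step dynamics match
    long_term = -abs(fg.sum() - fr.sum())    # overall motion-budget match
    return instantaneous + long_term
```

A clip that matches the reference's dynamics scores 0 (the maximum), while a static clip compared against a moving reference is penalized on both terms, which is the failure mode (dynamic-degree collapse) the reward is meant to discourage.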