SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the loss of motion fidelity that image-conditioned video diffusion models often show after supervised fine-tuning, which appears as weakened dynamics and degraded long-term temporal coherence. The authors propose Smooth Hybrid Fine-tuning (SHIFT), a framework that combines supervised fine-tuning with advantage-weighted fine-tuning. SHIFT relies on pixel-motion rewards based on pixel flux dynamics, which capture both instantaneous and long-term motion consistency so that generated videos can be aligned with the motion of real data. Its adversarial advantages speed up convergence and mitigate reward hacking. According to the authors, experiments show that SHIFT efficiently resolves the dynamic-degree collapse that arises when modern video diffusion models undergo supervised fine-tuning.
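
To make the idea of a motion-matching reward concrete, here is a minimal sketch. It is not the paper's reward: the summary does not give the actual definition, so the sketch assumes a simple stand-in in which motion is measured as the mean absolute change in pixel intensity between consecutive frames, and the reward is the negative gap between the generated and reference motion magnitudes. The names `motion_magnitude` and `motion_reward` are hypothetical.

```python
def motion_magnitude(video):
    """Mean absolute change in pixel intensity between consecutive frames.

    video: list of frames, each frame a flat list of pixel intensities.
    This frame-difference measure is an assumed proxy for motion, not the
    paper's definition.
    """
    total, count = 0.0, 0
    for prev, curr in zip(video, video[1:]):
        for p, c in zip(prev, curr):
            total += abs(c - p)
            count += 1
    # Fewer than two frames (or empty frames) means no measurable motion.
    return total / count if count else 0.0


def motion_reward(generated, reference):
    """Reward is 0 when motion magnitudes match and negative otherwise."""
    return -abs(motion_magnitude(generated) - motion_magnitude(reference))


# Toy clips with two pixels per frame.
static_clip = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
moving_clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

reward_static = motion_reward(static_clip, moving_clip)
reward_moving = motion_reward(moving_clip, moving_clip)
```

Under this assumed measure, a generated clip that has collapsed to a near-static video scores lower than one whose motion matches the reference, which is the kind of signal a motion-alignment reward needs to provide.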

📝 Abstract
Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in the post-training of video diffusion models. To address this problem, we introduce pixel-motion rewards based on pixel flux dynamics, which capture both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses standard supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves the dynamic-degree collapse that occurs during supervised fine-tuning of modern video diffusion models.
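
The fusion of supervised and advantage-weighted fine-tuning can be illustrated with a small sketch. The abstract does not specify the objective, so everything below is an assumption: the advantage is a plain batch-mean baseline (not the paper's adversarial advantage), the weights follow the common exponential advantage-weighting form, and `lam` and `beta` are hypothetical hyperparameters. Per-sample losses are passed in as plain floats to keep the example self-contained.

```python
import math


def advantage_weights(rewards, beta=1.0):
    """Weights proportional to exp(A_i / beta), normalized to have mean 1.

    A_i = r_i - mean(r) is an assumed batch-baseline advantage.
    """
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # Subtract the max before exponentiating for numerical stability;
    # the constant cancels in the normalization below.
    max_adv = max(advantages)
    exps = [math.exp((a - max_adv) / beta) for a in advantages]
    mean_exp = sum(exps) / len(exps)
    return [e / mean_exp for e in exps]


def hybrid_loss(sft_losses, gen_losses, gen_rewards, lam=0.5, beta=1.0):
    """Blend a supervised loss with an advantage-weighted loss.

    sft_losses:  per-sample losses on real videos
    gen_losses:  per-sample losses on model-generated videos
    gen_rewards: scalar reward for each generated video
    lam:         hypothetical mixing coefficient in [0, 1]
    """
    sft_term = sum(sft_losses) / len(sft_losses)
    weights = advantage_weights(gen_rewards, beta)
    weighted = [w * l for w, l in zip(weights, gen_losses)]
    aw_term = sum(weighted) / len(weighted)
    return (1.0 - lam) * sft_term + lam * aw_term


# With equal rewards every weight is 1, so the result is a plain average
# of the two mean losses.
loss_equal = hybrid_loss([1.0, 3.0], [2.0, 4.0], [0.5, 0.5], lam=0.5)
```

In this sketch, setting `lam=0` recovers ordinary supervised fine-tuning, while larger values shift the update toward samples with above-average reward. How SHIFT actually schedules or smooths this trade-off is not stated in the abstract.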
Problem

Research questions and friction points this paper is trying to address.

motion alignment
video diffusion models
motion fidelity
temporal coherence
fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion alignment
video diffusion models
adversarial hybrid fine-tuning
pixel-motion rewards
temporal coherence