🤖 AI Summary
Existing video diffusion models often suffer from object deformation or spatial drift because their training imposes no explicit 3D structural constraints. This work proposes a self-supervised framework that, for the first time, injects geometric priors into video generation training in the form of preference pairs. Leveraging a foundation model for geometry estimation, the method derives dense 3D-consistency signals as preference labels and applies Direct Preference Optimization (DPO) to steer the diffusion model toward physically plausible, temporally coherent outputs. Notably, the approach requires no manual annotation, and it substantially improves the temporal stability, physical realism, and motion coherence of generated videos, outperforming current state-of-the-art methods across multiple evaluation metrics.
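To make the annotation-free pipeline concrete, here is a minimal sketch of how such preference pairs might be mined: several candidate videos are generated per prompt, each is scored for 3D consistency, and the best and worst become the (preferred, dispreferred) pair. The `consistency_score` callable and the top-vs-bottom pairing heuristic are illustrative assumptions, not the paper's exact procedure; in practice the score would come from running a geometry foundation model over the frames.

```python
from typing import Callable, List, Sequence, Tuple

import torch

def build_preference_pairs(
    videos: Sequence[torch.Tensor],      # each video is (T, C, H, W)
    prompt_ids: Sequence[int],           # which prompt produced each video
    consistency_score: Callable[[torch.Tensor], float],
) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    """Pair the most and least 3D-consistent generations per prompt.

    `consistency_score` stands in for the geometry model's verdict
    (e.g. cross-frame agreement of estimated depth/pose); any scalar
    where higher means more 3D-consistent works here.
    """
    pairs = []
    for pid in set(prompt_ids):
        group = [v for v, p in zip(videos, prompt_ids) if p == pid]
        if len(group) < 2:
            continue  # need at least two samples to express a preference
        group.sort(key=consistency_score)
        pairs.append((group[-1], group[0]))  # (winner, loser)
    return pairs
```

A dummy score such as `lambda v: -float(v.std())` is enough to exercise the plumbing; the useful signal only appears once a real geometry estimator supplies the scores.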
📝 Abstract
While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using only a small number of preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
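For readers unfamiliar with applying DPO to diffusion models, the sketch below shows one common instantiation of the objective, in the style of Diffusion-DPO (Wallace et al., 2023): preference pairs are compared through the model's denoising error relative to a frozen reference copy. This is an assumed formulation for illustration; the paper's exact loss, error reduction, and `beta` value may differ.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(
    eps_theta_w: torch.Tensor,  # trainable model's noise prediction, preferred video
    eps_theta_l: torch.Tensor,  # trainable model's noise prediction, dispreferred video
    eps_ref_w: torch.Tensor,    # frozen reference model's prediction, preferred video
    eps_ref_l: torch.Tensor,    # frozen reference model's prediction, dispreferred video
    noise_w: torch.Tensor,      # true noise added to the preferred video
    noise_l: torch.Tensor,      # true noise added to the dispreferred video
    beta: float = 1000.0,       # DPO temperature; value here is arbitrary, tune in practice
) -> torch.Tensor:
    """DPO objective adapted to the denoising setting (Diffusion-DPO style)."""
    def err(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Per-sample MSE, reduced over frames, channels, and pixels.
        return ((pred - target) ** 2).flatten(1).mean(dim=1)

    # Advantage of the trainable model over the frozen reference on each sample.
    adv_w = err(eps_theta_w, noise_w) - err(eps_ref_w, noise_w)
    adv_l = err(eps_theta_l, noise_l) - err(eps_ref_l, noise_l)

    # Lower error means a better fit, hence the negative sign inside the
    # sigmoid: the model is rewarded for fitting the preferred sample better
    # (relative to the reference) than it fits the dispreferred one.
    return -F.logsigmoid(-beta * (adv_w - adv_l)).mean()
```

Because the reference model is frozen, the loss only moves probability mass toward the geometry-consistent sample of each pair rather than re-fitting the whole data distribution, which is what lets a small number of preference pairs suffice.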