🤖 AI Summary
Video generation models commonly suffer from geometric inconsistency, motion instability, and visual artifacts that limit 3D scene realism. To address this, we propose a preference optimization framework grounded in epipolar geometry constraints, requiring no end-to-end differentiability, that embeds classical multi-view geometric priors into modern video diffusion models (e.g., latent diffusion transformers trained with rectified flow). The method is trained on static scenes yet generalizes effectively to dynamic content. Our key contribution is the first use of pairwise epipolar constraints as stable, interpretable optimization signals, bridging the 3D-consistency gap inherent in purely data-driven approaches. Experiments demonstrate significant improvements in spatial geometric fidelity and camera-trajectory stability while preserving high visual quality, substantially enhancing the 3D authenticity of generated videos.
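The pairwise epipolar signal can be made concrete with a small amount of classical geometry. The sketch below is our illustration, not code from the paper: the helper name and the assumption that keypoints have already been matched (e.g., by an off-the-shelf feature matcher) are ours. It scores how well correspondences between two generated frames obey a single epipolar geometry, using OpenCV's RANSAC fundamental-matrix estimate and the first-order Sampson error:

```python
import cv2
import numpy as np

def epipolar_consistency_score(pts1: np.ndarray, pts2: np.ndarray) -> float:
    """Mean Sampson error for matched keypoints between two frames.

    pts1, pts2: (N, 2) float arrays of corresponding pixel coordinates
    (hypothetical inputs from any feature matcher). Lower is better,
    i.e., more consistent with one rigid epipolar geometry.
    """
    # Robustly estimate the fundamental matrix between the two frames.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return float("inf")  # degenerate geometry: maximally inconsistent

    # Keep RANSAC inliers and lift them to homogeneous coordinates.
    m = inlier_mask.ravel().astype(bool)
    x1 = np.hstack([pts1[m], np.ones((m.sum(), 1))])
    x2 = np.hstack([pts2[m], np.ones((m.sum(), 1))])

    # Sampson (first-order) approximation of the reprojection error
    # for the epipolar constraint x2^T F x1 = 0.
    Fx1 = x1 @ F.T    # epipolar lines in image 2
    Ftx2 = x2 @ F     # epipolar lines in image 1
    algebraic = np.sum(x2 * Fx1, axis=1)
    denom = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return float(np.mean(algebraic ** 2 / (denom + 1e-12)))
```

Because this score is computed by classical estimators rather than by a differentiable network, it can label preferences without any gradient flowing through the geometry, which is what makes the "no end-to-end differentiability" claim possible.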
📝 Abstract
Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation would benefit numerous downstream generation and reconstruction tasks. We explore how epipolar geometry constraints can improve modern video diffusion models, which, despite massive training data, fail to capture the fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled enforcement of multi-view geometry. Our approach enforces these principles efficiently and requires no end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality geometric measurements, while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.
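The abstract does not spell out the exact alignment objective, but "pairwise ... preference-based optimization" suggests a Diffusion-DPO-style loss in which, for each prompt, the video with the lower epipolar error is treated as the preferred sample. The PyTorch sketch below is one plausible instantiation under that assumption; all names are hypothetical and this is not necessarily the paper's objective:

```python
import torch
import torch.nn.functional as F

def geometric_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style preference loss driven by geometric rankings.

    logp_w / logp_l: (batch,) log-likelihood surrogates of the preferred
    ("winner", lower epipolar error) and rejected ("loser") videos under
    the model being fine-tuned; ref_* are the same quantities under a
    frozen reference model. The geometric scorer only ranks the pair,
    so no gradient ever flows through the epipolar computation.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```

In this reading, a score such as `epipolar_consistency_score` above decides which of two sampled videos is the winner, and the diffusion model is then nudged toward the geometrically consistent sample relative to the reference model, matching the abstract's claim that classical constraints supply the optimization signal while learned metrics would supply only noisy targets.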