VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models often exhibit artifacts such as object deformation, spatial drift, and depth inconsistency because they receive no explicit geometric supervision during training. This work proposes a geometry-aware reward mechanism that leverages pretrained geometric foundation models to evaluate multi-view consistency, computing cross-frame reprojection errors at the point level rather than the pixel level. A geometry-informed sampling strategy additionally prioritizes semantically rich regions while filtering out low-texture areas, so the reward is evaluated only where correspondences are reliable. The reward can align the generation process through supervised fine-tuning (SFT), reinforcement learning, or, for off-the-shelf models where retraining is impractical, test-time scaling. Experimental results demonstrate substantial improvements in geometric consistency over current methods across multiple metrics.
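The point-level (rather than pixel-intensity) reprojection error described above can be sketched roughly as follows. This is a minimal NumPy illustration, assuming per-frame depth maps, shared intrinsics K, and a known relative pose T_ab; the function names and the exact error formulation are illustrative, not taken from the paper:

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map to camera-space 3D points, shape (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T       # normalized camera rays
    return rays * depth[..., None]        # scale rays by depth

def pointwise_reprojection_error(depth_a, depth_b, K, T_ab):
    """Mean 3D distance between frame-a points mapped into frame b and
    the points frame b itself observes at the reprojected pixels.
    The comparison happens on 3D points, not pixel intensities."""
    H, W = depth_a.shape
    pts_a = unproject(depth_a, K).reshape(-1, 3)
    # rigidly move frame-a points into frame-b camera coordinates
    pts_in_b = pts_a @ T_ab[:3, :3].T + T_ab[:3, 3]
    # project onto frame b's pixel grid
    proj = pts_in_b @ K.T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts_in_b[:, 2] > 0)
    pts_b = unproject(depth_b, K)
    err = np.linalg.norm(pts_in_b[valid] - pts_b[v[valid], u[valid]], axis=-1)
    return err.mean() if err.size else np.inf

# A reward would be a decreasing function of this error, e.g. -error,
# so geometrically consistent frame pairs score higher.
```

In practice the depth maps and poses would come from a pretrained geometric foundation model rather than ground truth; the sketch only shows why a 3D point-distance metric is insensitive to the photometric noise that pixel-space comparisons pick up.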

📝 Abstract
Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach computes the error in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or reinforcement learning, and inference-time optimization of a causal video model (e.g., a streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical way to enhance open-source video models without the extensive computational resources retraining requires.
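The geometry-aware sampling strategy in the abstract can be approximated with a simple texture filter. The sketch below uses gradient magnitude as a crude stand-in for "textured, reliably matchable" regions; the paper's actual strategy also accounts for semantics, and the threshold and function names here are assumptions:

```python
import numpy as np

def texture_mask(gray, grad_thresh=0.05):
    """Keep pixels whose local gradient magnitude exceeds a threshold,
    a rough proxy for textured regions with reliable correspondences."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gx, gy) > grad_thresh

def sample_points(gray, n=256, grad_thresh=0.05, rng=None):
    """Sample up to n pixel coordinates (row, col) from textured regions,
    skipping flat areas where reprojection errors would be unreliable."""
    rng = np.random.default_rng(rng)
    ys, xs = np.nonzero(texture_mask(gray, grad_thresh))
    if ys.size == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(ys.size, size=min(n, ys.size), replace=False)
    return np.stack([ys[idx], xs[idx]], axis=1)
```

Restricting the reward computation to such points is what makes the metric robust: a flat wall or sky region contributes no usable correspondence signal, only noise.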
Problem

Research questions and friction points this paper is trying to address.

video diffusion models
geometric supervision
multi-view consistency
spatial drift
depth violations
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-aware reward
cross-frame reprojection
pointwise error metric
geometric consistency
test-time scaling
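Using the reward as a path verifier for test-time scaling amounts, in its simplest form, to best-of-N selection over candidate continuations. The sketch below assumes a hypothetical `generate` callable standing in for a causal video generator and a `reward` callable standing in for the geometry verifier; neither name comes from the paper:

```python
import numpy as np

def best_of_n(generate, reward, state, n=4, rng=None):
    """Test-time scaling via best-of-N path selection: draw n candidate
    next segments from a causal generator, score each with the geometry
    reward acting as a verifier, and keep the highest-scoring path."""
    rng = np.random.default_rng(rng)
    candidates = [generate(state, rng) for _ in range(n)]
    scores = [reward(state, c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

The appeal of this pathway is that the generator's weights are never touched: spending more inference compute (larger n) buys more geometric consistency, which is why the abstract frames it as a practical option for off-the-shelf open-source models.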
Tengjiao Yin
VCIP & TMCC & DISSec, College of Computer Science, Nankai University
Jinglei Shi
Nankai University
deep learning, 3D vision, light field, video processing, compression
Heng Guo
Beijing University of Posts and Telecommunications
computer vision
Xi Wang
LIX, École Polytechnique, IP Paris