VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

📅 2026-03-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of maintaining geometric consistency in large-scale video diffusion models, including in dynamic scenes. The authors propose VGGRPO (Visual Geometry GRPO), a post-training framework that applies geometric guidance in the latent space without altering the pre-trained model architecture. The method introduces a Latent Geometry Model (LGM) that decodes scene geometry directly from video diffusion latents and, because it is built on a geometry model with 4D reconstruction capability, extends to dynamic scenes. It combines this with Group Relative Policy Optimization (GRPO) using two rewards: one for camera motion smoothness and one for geometric reprojection consistency. Experiments on both static and dynamic benchmarks show improved camera stability, geometric consistency, and overall video quality, while removing the repeated VAE decoding that RGB-space rewards require.

📝 Abstract

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
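To make the reward and optimization step in the abstract more concrete, here is a minimal Python sketch, not the authors' implementation. It assumes camera positions have already been decoded per frame (in the paper this would come from the LGM), uses a squared second-difference penalty as a stand-in for the camera motion smoothness reward, and combines the two rewards with a weighted sum. The function names, the weights `w_smooth` and `w_reproj`, and the exact reward formulas are all hypothetical; only the group-relative normalization (subtracting the group mean and dividing by the group standard deviation) follows the standard GRPO recipe.

```python
from statistics import mean, pstdev


def smoothness_reward(positions):
    """Hypothetical smoothness reward: negative mean squared second
    difference of per-frame camera positions (higher is smoother)."""
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for p0, p1, p2 in zip(positions, positions[1:], positions[2:]):
        # Discrete acceleration of the camera at the middle frame.
        total += sum((a - 2 * b + c) ** 2 for a, b, c in zip(p0, p1, p2))
    return -total / (len(positions) - 2)


def combined_reward(smooth, reproj, w_smooth=1.0, w_reproj=1.0):
    """Assumed weighted sum of the two rewards; the source does not
    state how they are combined."""
    return w_smooth * smooth + w_reproj * reproj


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sample's reward by the
    mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of three sampled videos for one prompt (toy values).
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
jittery = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
mild = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

# Reprojection rewards are placeholders set to zero in this toy example.
group_rewards = [
    combined_reward(smoothness_reward(traj), reproj=0.0)
    for traj in (straight, jittery, mild)
]
advantages = group_relative_advantages(group_rewards)
```

In this toy group the constant-velocity trajectory receives the largest advantage and the jittery one the smallest, which is the signal a policy-gradient update would then use to favor smoother samples.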

Problem

Research questions and friction points this paper is trying to address.

geometric consistency

video generation

dynamic scenes

world-consistent

latent space

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent geometry model

4D reconstruction

group relative policy optimization