GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video world models achieve high visual fidelity but lack geometric grounding, resulting in poor spatial consistency and unstable long-horizon trajectories in embodied navigation. To address this, we propose Reinforcement Learning with World Grounding (RLWG), a post-training framework that fine-tunes pretrained world models against verifiable geometric and perceptual rewards: pose cycle-consistency, depth reprojection, and temporal coherence. These self-supervised rewards align generated trajectories with scene structure without human annotations. The framework is instantiated as GrndCtrl, which optimizes the rewards via Group Relative Policy Optimization (GRPO). Evaluated on outdoor navigation tasks, GrndCtrl significantly outperforms supervised fine-tuning, generating geometrically consistent and temporally stable rollouts over extended horizons. The approach establishes a reliable spatial prior for embodied agents, bridging the gap between photorealistic prediction and geometric plausibility in world modeling.
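The three reward signals named above are all computable from the model's own rollouts, which is what makes them verifiable without human labels. A minimal sketch of how two of them (pose cycle-consistency and temporal coherence) could be scored in PyTorch is below; the function names, the pose-estimator interface, and the mixing weights are illustrative assumptions, not the paper's implementation:

```python
import torch

def pose_cycle_reward(T_fwd: torch.Tensor, T_bwd: torch.Tensor) -> torch.Tensor:
    """Pose cycle-consistency: forward and backward pose estimates between
    consecutive generated frames should compose to the identity transform.
    T_fwd, T_bwd: (B, 4, 4) homogeneous poses from an off-the-shelf estimator."""
    eye = torch.eye(4, device=T_fwd.device).expand_as(T_fwd)
    err = (torch.bmm(T_bwd, T_fwd) - eye).flatten(1).norm(dim=1)
    return torch.exp(-err)  # (B,) reward in (0, 1]

def temporal_coherence_reward(frames: torch.Tensor) -> torch.Tensor:
    """Temporal coherence: penalize abrupt frame-to-frame appearance changes.
    frames: (B, T, C, H, W) generated rollout."""
    diffs = (frames[:, 1:] - frames[:, :-1]).abs().mean(dim=(2, 3, 4))  # (B, T-1)
    return torch.exp(-diffs.mean(dim=1))

def grounding_reward(frames, T_fwd, T_bwd, w_pose=0.5, w_temp=0.5):
    """Weighted combination; the weights here are illustrative, not from the paper."""
    return w_pose * pose_cycle_reward(T_fwd, T_bwd) + w_temp * temporal_coherence_reward(frames)
```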

📝 Abstract
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning with verifiable rewards (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
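Of the three rewards, depth reprojection is the most involved. A hedged sketch of one plausible formulation follows, assuming known camera intrinsics K and a relative pose T between consecutive frames, both of which would come from off-the-shelf estimators; nothing here is taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def depth_reprojection_reward(depth_t: torch.Tensor, depth_t1: torch.Tensor,
                              T: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Back-project frame t's pixels with depth_t, move them by the relative
    pose T (4x4), project with intrinsics K (3x3), and compare the resulting
    depths against depth_t1 sampled at the projected pixel locations.
    depth_t, depth_t1: (H, W) depth maps for consecutive generated frames."""
    H, W = depth_t.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (H, W, 3) homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                         # per-pixel camera rays
    pts = torch.cat([rays * depth_t.unsqueeze(-1),             # 3-D points in frame t,
                     torch.ones(H, W, 1)], dim=-1)             # in homogeneous coords
    pts_t1 = (pts @ T.T)[..., :3]                              # points expressed in frame t+1
    proj = pts_t1 @ K.T
    z = proj[..., 2].clamp(min=1e-6)                           # projected depths
    grid = torch.stack([2 * (proj[..., 0] / z) / (W - 1) - 1,  # pixel coords in [-1, 1]
                        2 * (proj[..., 1] / z) / (H - 1) - 1], dim=-1)
    sampled = F.grid_sample(depth_t1[None, None], grid[None],
                            align_corners=True, padding_mode="border")[0, 0]
    rel_err = (sampled - z).abs() / z                          # relative depth error
    return torch.exp(-rel_err.mean())                          # scalar reward in (0, 1]
```

A geometrically consistent rollout keeps this relative error small across the whole trajectory, which is what gives the reward its long-horizon bite.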
Problem

Research questions and friction points this paper is trying to address.

Video world models lack geometric grounding despite high visual fidelity
Generated rollouts lose spatial coherence and stability over long horizons
Aligning pretrained world models with scene structure without human annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised reward alignment for geometric grounding
Multiple verifiable rewards for spatial and temporal consistency
Group Relative Policy Optimization for stable navigation trajectories (see the sketch after this list)
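GRPO, named in the last bullet, needs no learned critic: it samples a group of rollouts from the same starting state and standardizes their verifiable rewards within the group. A minimal sketch under that reading; the tensor shapes and clipping constant are generic PPO/GRPO conventions, not details from this paper:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within each group of G
    rollouts from the same starting state, so no value network is needed.
    rewards: (num_groups, G) verifiable rewards, one per rollout."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate applied with group-relative advantages.
    logp_new / logp_old: (num_groups, G) rollout log-probabilities under the
    current and behavior policies."""
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```

In this setting each "rollout" would be a generated trajectory and the rewards would be the geometric and perceptual scores sketched earlier, so rollouts are ranked only against siblings from the same starting state.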