AI Summary
This work addresses the challenge that existing video-based planning methods often violate temporal consistency and physical constraints, yielding infeasible action sequences. The authors propose an approach that integrates an action-conditioned world model with latent-space trajectory optimization. Through a video-guided implicit collocation scheme, the method maps zero-shot video-generated plans into dynamically feasible state-action trajectories in latent space, unifying semantic objectives with physical plausibility. This is presented as the first framework to combine zero-shot video planning with world-model-based trajectory optimization, enabling the recovery of long-horizon, executable action plans from videos that may exhibit motion blur or physically implausible dynamics. Experiments on navigation and manipulation tasks demonstrate its ability to generate coherent, physically consistent behaviors directly from visual inputs.
Abstract
Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans across navigation and manipulation simulation tasks, even from zero-shot image-to-video generations and motion-blurred videos that violate physical constraints.
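To make the optimization formulation concrete, the sketch below shows video-guided latent collocation in a deliberately toy setting: it is not the paper's implementation, but an illustration under stated assumptions. We stand in a linear world model `f(z, a) = A @ z + B @ a` for the learned dynamics, use random latents `v` as the video-plan guidance, and jointly optimize latent states `z` and actions `a` by gradient descent on a dynamics-residual term plus a video-alignment term, with the initial and goal latents pinned. All names (`A`, `B`, `lam_dyn`, `lam_vid`) are assumptions for illustration.

```python
# Hedged sketch of video-guided latent collocation (illustrative only;
# the real method uses a learned action-conditioned world model, not a
# linear one, and latents from a video generator rather than noise).
import numpy as np

rng = np.random.default_rng(0)
T, dz, da = 10, 4, 2                      # horizon, latent dim, action dim
A = np.eye(dz) * 0.95                     # toy stand-in for world-model dynamics
B = rng.standard_normal((dz, da)) * 0.1
v = rng.standard_normal((T + 1, dz))      # stand-in video-plan latents (guidance)
z0, zg = v[0].copy(), v[-1].copy()        # initial and goal latents

def f(z_t, a_t):
    """Toy latent dynamics z_{t+1} = A z_t + B a_t."""
    return A @ z_t + B @ a_t

# Decision variables: the full latent trajectory and the action sequence.
z = v.copy()
a = np.zeros((T, da))
lam_dyn, lam_vid, lr = 10.0, 1.0, 0.01

# Dynamics violation of the raw video plan (actions zero), for comparison.
init_err = max(np.linalg.norm(v[t + 1] - f(v[t], np.zeros(da))) for t in range(T))

for _ in range(1000):
    gz = np.zeros_like(z)
    ga = np.zeros_like(a)
    for t in range(T):
        r = z[t + 1] - f(z[t], a[t])      # dynamics residual at step t
        gz[t + 1] += 2 * lam_dyn * r
        gz[t] += -2 * lam_dyn * (A.T @ r)
        ga[t] += -2 * lam_dyn * (B.T @ r)
    gz += 2 * lam_vid * (z - v)           # stay semantically close to the video plan
    z[1:-1] -= lr * gz[1:-1]              # interior states only ...
    a -= lr * ga
    z[0], z[-1] = z0, zg                  # ... endpoints stay pinned to start/goal

dyn_err = max(np.linalg.norm(z[t + 1] - f(z[t], a[t])) for t in range(T))
print(f"video-plan dynamics violation: {init_err:.3f}")
print(f"grounded-plan dynamics violation: {dyn_err:.3f}")
```

The optimized trajectory trades a small deviation from the video guidance for a large reduction in dynamics violation, which is the essence of projecting the plan onto the manifold of feasible latent trajectories; in the paper this trade-off is governed by the learned world model rather than a hand-set penalty weight.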