AI Summary
This work addresses the challenge that existing video-based planning methods often violate temporal consistency and physical constraints, yielding infeasible action sequences. The authors propose an approach that integrates an action-conditioned world model with latent-space trajectory optimization. Through a video-guided implicit collocation scheme, the method maps zero-shot video-generated plans into dynamically feasible state-action trajectories in latent space, unifying semantic objectives with physical plausibility. This is presented as the first framework to combine zero-shot video planning with world-model-based trajectory optimization, enabling the recovery of long-horizon, executable action plans from videos that may exhibit motion blur or physically implausible dynamics. Experiments on navigation and manipulation tasks demonstrate its ability to generate coherent, physically consistent behaviors directly from visual inputs.
Abstract
Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans across navigation and manipulation simulation tasks, even from zero-shot image-to-video generations and motion-blurred videos that violate physical constraints.
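To make the optimization formulation concrete, the sketch below shows video-guided latent collocation in a deliberately toy setting: it is not the paper's implementation, but an illustration under stated assumptions. We stand in a linear world model `f(z, a) = A @ z + B @ a` for the learned dynamics, use random latents `v` as the video-plan guidance, and jointly optimize latent states `z` and actions `a` by gradient descent on a dynamics-residual term plus a video-alignment term, with the initial and goal latents pinned. All names (`A`, `B`, `lam_dyn`, `lam_vid`) are assumptions for illustration.

```python
# Hedged sketch of video-guided latent collocation (illustrative only;
# the real method uses a learned action-conditioned world model, not a
# linear one, and latents from a video generator rather than noise).
import numpy as np

rng = np.random.default_rng(0)
T, dz, da = 10, 4, 2                      # horizon, latent dim, action dim
A = np.eye(dz) * 0.95                     # toy stand-in for world-model dynamics
B = rng.standard_normal((dz, da)) * 0.1
v = rng.standard_normal((T + 1, dz))      # stand-in video-plan latents (guidance)
z0, zg = v[0].copy(), v[-1].copy()        # initial and goal latents

def f(z_t, a_t):
    """Toy latent dynamics z_{t+1} = A z_t + B a_t."""
    return A @ z_t + B @ a_t

# Decision variables: the full latent trajectory and the action sequence.
z = v.copy()
a = np.zeros((T, da))
lam_dyn, lam_vid, lr = 10.0, 1.0, 0.01

# Dynamics violation of the raw video plan (actions zero), for comparison.
init_err = max(np.linalg.norm(v[t + 1] - f(v[t], np.zeros(da))) for t in range(T))

for _ in range(1000):
    gz = np.zeros_like(z)
    ga = np.zeros_like(a)
    for t in range(T):
        r = z[t + 1] - f(z[t], a[t])      # dynamics residual at step t
        gz[t + 1] += 2 * lam_dyn * r
        gz[t] += -2 * lam_dyn * (A.T @ r)
        ga[t] += -2 * lam_dyn * (B.T @ r)
    gz += 2 * lam_vid * (z - v)           # stay semantically close to the video plan
    z[1:-1] -= lr * gz[1:-1]              # interior states only ...
    a -= lr * ga
    z[0], z[-1] = z0, zg                  # ... endpoints stay pinned to start/goal

dyn_err = max(np.linalg.norm(z[t + 1] - f(z[t], a[t])) for t in range(T))
print(f"video-plan dynamics violation: {init_err:.3f}")
print(f"grounded-plan dynamics violation: {dyn_err:.3f}")
```

The optimized trajectory trades a small deviation from the video guidance for a large reduction in dynamics violation, which is the essence of projecting the plan onto the manifold of feasible latent trajectories; in the paper this trade-off is governed by the learned world model rather than a hand-set penalty weight.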