From Single Images to Motion Policies via Video-Generation Environment Representations

📅 2025-05-25
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses collision-free motion policy generation from a single RGB image. The proposed Video-Generation Environment Representation (VGER) framework bypasses error-prone monocular depth estimation by reconstructing a dense point cloud through multi-view video generation: a large-scale video model synthesizes a moving-camera video conditioned on the input image, a pre-trained 3D foundation model lifts the resulting frames to a dense point cloud, and a multi-scale noise scheme trains an implicit neural representation of the scene structure, on top of which a geometry-compliant motion generation model is built. Given a single RGB image, VGER produces smooth, continuous, collision-free trajectories consistent with scene geometry across diverse indoor and outdoor environments. The key contribution is the first integration of video generation into environment geometry modeling for motion planning, eliminating explicit depth prediction and yielding improved trajectory quality and cross-scene generalization.
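
To make the pipeline in this summary concrete, here is a minimal structural sketch in Python. Every name below is a hypothetical placeholder for the paper's components (video diffusion model, 3D foundation model, implicit-representation training, policy construction), not an actual API.

```python
"""Structural sketch of the VGER pipeline described above (illustrative only).

Every callable here is a hypothetical placeholder; the paper's actual
components would be plugged in at each stage.
"""
from typing import Callable, List

import numpy as np


def vger_pipeline(
    image: np.ndarray,  # single RGB input, shape (H, W, 3)
    generate_video: Callable[[np.ndarray], List[np.ndarray]],  # image -> frames
    reconstruct: Callable[[List[np.ndarray]], np.ndarray],     # frames -> (N, 3) points
    fit_implicit: Callable[[np.ndarray], Callable],            # points -> distance field
    build_policy: Callable[[Callable], Callable],              # field -> motion policy
) -> Callable:
    frames = generate_video(image)  # moving-camera video from one image
    points = reconstruct(frames)    # multi-view frames -> dense point cloud
    field = fit_implicit(points)    # implicit environment representation
    return build_policy(field)      # geometry-aware motion policy
```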

📝 Abstract
Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
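
The multi-scale noise training described in the abstract can be read as fitting an implicit distance field to the reconstructed point cloud using perturbations at several noise scales. Below is a hedged PyTorch sketch under that reading: an unsigned-distance MLP supervised by nearest-neighbor distances from noisy samples back to the cloud. The architecture, loss, and noise scales are illustrative assumptions, not the paper's specification.

```python
# Hedged sketch: multi-scale noise training of an implicit distance field.
# Architecture, loss, and noise scales are assumptions, not the paper's.
import torch
import torch.nn as nn


class DistanceField(nn.Module):
    """MLP mapping a 3D point to an unsigned distance estimate."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def train_step(field, optimizer, points, scales=(0.01, 0.05, 0.2)):
    """Perturb point-cloud samples at several noise scales and regress the
    distance from each noisy sample back to the cloud."""
    loss = torch.tensor(0.0)
    for sigma in scales:
        idx = torch.randint(0, points.shape[0], (4096,))
        noisy = points[idx] + sigma * torch.randn_like(points[idx])
        # Nearest-neighbor distance as a cheap supervision target; the
        # paper's actual loss may be more sophisticated.
        target = torch.cdist(noisy, points).min(dim=1).values
        loss = loss + ((field(noisy) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

Under this reading, the zero-level set of the trained field approximates the scene surface, and its gradients supply clearance directions for downstream motion generation.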
Problem

Research questions and friction points this paper is trying to address.

Generating collision-free motion policies from single RGB images
Overcoming frustum-shaped errors in monocular depth estimation
Leveraging video generation for accurate 3D environment representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video-generation models for environment representation
Converts video frames into 3D point clouds
Trains an implicit environment representation with multi-scale noise and builds a geometry-compliant motion policy on top of it (see the sketch after this list)
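
One way a motion model can comply with the geometry of such a representation is a potential-field style policy: attract toward a goal while repelling from regions where the learned distance field reports low clearance. The sketch below illustrates that construction, assuming the `DistanceField` from the earlier training sketch; it is a generic baseline, not necessarily the paper's actual policy parameterization.

```python
# Hedged sketch: a gradient policy over a learned distance field.
# Generic potential-field construction; not the paper's exact policy.
import torch


def policy_step(x, goal, field, repel_weight=1.0, margin=0.2, step=0.05):
    """Compute the next waypoint from position x (shape (3,))."""
    x = x.clone().requires_grad_(True)
    d = field(x.unsqueeze(0)).squeeze()    # estimated clearance at x
    (grad_d,) = torch.autograd.grad(d, x)  # direction of increasing clearance
    attract = goal - x.detach()
    attract = attract / (attract.norm() + 1e-8)
    # Repel only inside the safety margin, scaled by penetration depth.
    repel = repel_weight * max(0.0, float(margin - d)) * grad_d
    velocity = attract + repel
    return x.detach() + step * velocity / (velocity.norm() + 1e-8)


# Example rollout with a hypothetical trained field:
# x = torch.zeros(3)
# for _ in range(200):
#     x = policy_step(x, goal=torch.tensor([1.0, 0.5, 0.3]), field=field)
```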
👥 Authors
Weiming Zhi
Robotics Institute, Carnegie Mellon University, USA
Ziyong Ma
Robotics Institute, Carnegie Mellon University, USA
Tianyi Zhang
Robotics Institute, Carnegie Mellon University, USA
Matthew Johnson-Roberson
Professor of Robotics, Carnegie Mellon University
Robotics · Field Robotics · Autonomous Vehicles · Marine Robotics