Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

📅 2025-12-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video diffusion models predominantly employ forward prediction for visual planning, lacking explicit target modeling—leading to spatial drift and target misalignment. This paper proposes a target-image-guided two-stage diffusion framework: (1) a semantics-aligned target image synthesis stage leveraging instruction-driven and region-aware cross-modal attention; and (2) a physically plausible, target-consistent planning video generation stage conditioned on the first and last frames, incorporating latent-space trajectory interpolation and the FL2V video diffusion model. To our knowledge, this is the first approach to explicitly enforce target-image constraints in diffusion-based visual planning. It significantly improves target alignment, spatial consistency, and object fidelity on object manipulation and image editing benchmarks. Generated videos are directly executable by robots, empirically validating end-to-end embodied visual planning.

Technology Category

Application Category

📝 Abstract
Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.
Problem

Research questions and friction points this paper is trying to address.

Enables manipulation tasks by imagining scene evolution toward goals.
Addresses spatial drift and goal misalignment in visual planning.
Generates physically plausible video trajectories connecting start and goal.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-Imagery Model synthesizes coherent goal images from instructions
Env-Goal Video Model interpolates between start and goal via FL2V diffusion
Two-stage diffusion framework ensures physical plausibility and goal alignment