🤗 AI Summary
Robot manipulation video generation suffers from data scarcity and from the 3D spatial ambiguity inherent in 2D trajectory representations. To address these challenges, we propose the first diffusion-based framework integrating 3D occupancy-aware modeling and trajectory optimization. First, we construct a scene-level 3D occupancy map to ensure geometrically consistent scene understanding. Second, we optimize physically feasible end-effector trajectories in 3D space, replacing ambiguous 2D paths with explicit, collision-aware 3D motion priors. Third, we design a trajectory-conditioned latent diffusion model that synthesizes coherent, obstacle-avoiding manipulation videos in third-person view, end-to-end. Our approach eliminates reliance on error-prone 2D trajectory supervision and explicitly grounds video generation in 3D dynamics. Experiments demonstrate significant improvements over state-of-the-art methods in visual fidelity and action plausibility. Notably, our method autonomously generates realistic pick-and-place videos with minimal human annotation, substantially reducing dependence on costly labeled data.
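The collision-aware trajectory step above can be illustrated with a minimal sketch: given a voxelized occupancy map, a graph search such as A* yields a shortest end-effector path that never enters an occupied cell. This is a simplified stand-in for the paper's trajectory optimizer, not the authors' implementation; the grid encoding and 6-connected moves are assumptions.

```python
import heapq

def plan_trajectory(occ, start, goal):
    """Shortest collision-free path on a 3D occupancy grid via A*.

    occ: set of occupied (x, y, z) cells; start/goal: free cells.
    Returns a list of cells from start to goal, or None if no path exists.
    Illustrative sketch, not the paper's optimizer.
    """
    def h(c):  # Manhattan-distance heuristic (admissible on a unit grid)
        return sum(abs(a - b) for a, b in zip(c, goal))

    # 6-connected neighborhood keeps step costs uniform
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    frontier = [(h(start), start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, cur = heapq.heappop(frontier)
        if cur == goal:
            path = []
            while cur is not None:  # walk parents back to start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for m in moves:
            nxt = tuple(c + d for c, d in zip(cur, m))
            if nxt in occ:  # skip occupied cells: collision avoidance
                continue
            if cost[cur] + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = cost[cur] + 1
                came_from[nxt] = cur
                heapq.heappush(frontier, (cost[nxt] + h(nxt), nxt))
    return None
```

In the full system, the resulting waypoint sequence would be smoothed and projected into the camera view before conditioning the video model; here it simply demonstrates how occupancy grounding turns path planning into a well-posed 3D problem.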
Abstract
Data scarcity remains a major challenge in robotic manipulation. Although diffusion models offer a promising route to generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently suffer from 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction. Our method combines 3D trajectory planning over a 3D occupancy map reconstructed from a third-person perspective with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory that minimizes path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned, plausible 3D trajectories, significantly reducing the need for human intervention. Experimental results demonstrate superior visual quality compared to existing methods.
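To make the trajectory-conditioning idea concrete, a hedged sketch follows of how a planned 3D trajectory might be turned into per-frame conditioning maps for a latent diffusion model: each waypoint is projected through a pinhole camera and rasterized at latent resolution, producing a tensor that could be concatenated with the initial-image latent along the channel axis. The function name, intrinsics, and resolutions are illustrative assumptions, not the authors' API.

```python
import numpy as np

def trajectory_to_condition(traj, K, latent_hw=(32, 32), image_hw=(256, 256)):
    """Rasterize a 3D end-effector trajectory into per-frame conditioning maps.

    traj: (T, 3) array of camera-frame 3D waypoints, one per video frame.
    K: 3x3 pinhole camera intrinsics.
    Returns a (T, 1, H, W) float tensor with a one-hot mark at each
    projected waypoint, resized to latent resolution. Illustrative sketch
    only; not the paper's latent-editing implementation.
    """
    T = traj.shape[0]
    H, W = latent_hw
    sy, sx = H / image_hw[0], W / image_hw[1]  # image-to-latent scale
    cond = np.zeros((T, 1, H, W), dtype=np.float32)
    for t, p in enumerate(traj):
        uvw = K @ p                        # project to homogeneous pixels
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
        x, y = int(u * sx), int(v * sy)    # rescale to latent grid
        if 0 <= x < W and 0 <= y < H:      # drop points outside the view
            cond[t, 0, y, x] = 1.0
    return cond
```

A denser encoding (e.g. Gaussian blobs or depth-weighted marks) would give the diffusion model a smoother conditioning signal; the one-hot version above is kept minimal for clarity.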