🤗 AI Summary
Robot manipulation video generation suffers from data scarcity and from the 3D spatial ambiguity inherent in 2D trajectory representations. To address these challenges, we propose the first diffusion-based framework integrating 3D occupancy-aware modeling and trajectory optimization. First, we construct a scene-level 3D occupancy map to ensure geometrically consistent scene understanding. Second, we optimize physically feasible end-effector trajectories in 3D space, replacing ambiguous 2D paths with explicit, collision-aware 3D motion priors. Third, we design a trajectory-conditioned latent diffusion model that synthesizes coherent, obstacle-avoiding manipulation videos in third-person view, end-to-end. Our approach eliminates reliance on error-prone 2D trajectory supervision and explicitly grounds video generation in 3D dynamics. Experiments demonstrate significant improvements over state-of-the-art methods in visual fidelity and action plausibility. Notably, our method autonomously generates realistic pick-and-place videos with minimal human annotation, substantially reducing dependence on costly labeled data.
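The collision-aware trajectory step above can be illustrated with a minimal sketch: given a voxelized occupancy map, a graph search such as A* yields a shortest end-effector path that never enters an occupied cell. This is a simplified stand-in for the paper's trajectory optimizer, not the authors' implementation; the grid encoding and 6-connected moves are assumptions.

```python
import heapq

def plan_trajectory(occ, start, goal):
    """Shortest collision-free path on a 3D occupancy grid via A*.

    occ: set of occupied (x, y, z) cells; start/goal: free cells.
    Returns a list of cells from start to goal, or None if no path exists.
    Illustrative sketch, not the paper's optimizer.
    """
    def h(c):  # Manhattan-distance heuristic (admissible on a unit grid)
        return sum(abs(a - b) for a, b in zip(c, goal))

    # 6-connected neighborhood keeps step costs uniform
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    frontier = [(h(start), start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, cur = heapq.heappop(frontier)
        if cur == goal:
            path = []
            while cur is not None:  # walk parents back to start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for m in moves:
            nxt = tuple(c + d for c, d in zip(cur, m))
            if nxt in occ:  # skip occupied cells: collision avoidance
                continue
            if cost[cur] + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = cost[cur] + 1
                came_from[nxt] = cur
                heapq.heappush(frontier, (cost[nxt] + h(nxt), nxt))
    return None
```

In the full system, the resulting waypoint sequence would be smoothed and projected into the camera view before conditioning the video model; here it simply demonstrates how occupancy grounding turns path planning into a well-posed 3D problem.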
Abstract
Data scarcity remains a major challenge in robotic manipulation. Although diffusion models offer a promising route to generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently suffer from 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction. Our method combines 3D trajectory planning over a 3D occupancy map reconstructed from a third-person perspective with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory that minimizes path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned, plausible 3D trajectories, significantly reducing the need for human intervention. Experimental results demonstrate superior visual quality compared to existing methods.
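To make the trajectory-conditioning idea concrete, a hedged sketch follows of how a planned 3D trajectory might be turned into per-frame conditioning maps for a latent diffusion model: each waypoint is projected through a pinhole camera and rasterized at latent resolution, producing a tensor that could be concatenated with the initial-image latent along the channel axis. The function name, intrinsics, and resolutions are illustrative assumptions, not the authors' API.

```python
import numpy as np

def trajectory_to_condition(traj, K, latent_hw=(32, 32), image_hw=(256, 256)):
    """Rasterize a 3D end-effector trajectory into per-frame conditioning maps.

    traj: (T, 3) array of camera-frame 3D waypoints, one per video frame.
    K: 3x3 pinhole camera intrinsics.
    Returns a (T, 1, H, W) float tensor with a one-hot mark at each
    projected waypoint, resized to latent resolution. Illustrative sketch
    only; not the paper's latent-editing implementation.
    """
    T = traj.shape[0]
    H, W = latent_hw
    sy, sx = H / image_hw[0], W / image_hw[1]  # image-to-latent scale
    cond = np.zeros((T, 1, H, W), dtype=np.float32)
    for t, p in enumerate(traj):
        uvw = K @ p                        # project to homogeneous pixels
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
        x, y = int(u * sx), int(v * sy)    # rescale to latent grid
        if 0 <= x < W and 0 <= y < H:      # drop points outside the view
            cond[t, 0, y, x] = 1.0
    return cond
```

A denser encoding (e.g. Gaussian blobs or depth-weighted marks) would give the diffusion model a smoother conditioning signal; the one-hot version above is kept minimal for clarity.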