ManipDreamer3D : Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory

๐Ÿ“… 2025-08-29
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 6
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Robot manipulation video generation suffers from data scarcity and 3D spatial ambiguity arising from 2D trajectory representations. To address these challenges, we propose the first diffusion-based framework integrating 3D occupancy-aware modeling and trajectory optimization. First, we construct a scene-level 3D occupancy map to ensure geometrically consistent scene understanding. Second, we optimize physically feasible end-effector trajectories in 3D spaceโ€”replacing ambiguous 2D paths with explicit, collision-aware 3D motion priors. Third, we design a trajectory-conditioned latent diffusion model that synthesizes coherent, obstacle-avoiding manipulation videos in third-person view, end-to-end. Our approach eliminates reliance on error-prone 2D trajectory supervision and explicitly grounds video generation in 3D dynamics. Experiments demonstrate significant improvements over state-of-the-art methods in visual fidelity and action plausibility. Notably, our method autonomously generates realistic pick-and-place videos with minimal human annotation, substantially reducing dependence on costly labeled data.

Technology Category

Application Category

๐Ÿ“ Abstract
Data scarcity continues to be a major challenge in the field of robotic manipulation. Although diffusion models provide a promising solution for generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently face issues with 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from the input image and the text instruction. Our method combines 3D trajectory planning with a reconstructed 3D occupancy map created from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory, minimizing path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned plausible 3D trajectories, significantly reducing human intervention requirements. Experimental results demonstrate superior visual quality compared to existing methods.
Problem

Research questions and friction points this paper is trying to address.

Addressing robotic manipulation data scarcity through video synthesis
Overcoming 3D spatial ambiguity in robotic manipulation trajectories
Generating plausible robotic pick-and-place videos with autonomous 3D planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D occupancy map reconstruction from input images
Plans collision-free 3D end-effector trajectories automatically
Employs trajectory-to-video diffusion model for generation
๐Ÿ”Ž Similar Papers
No similar papers found.
Y
Ying Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiaobao Wei
Xiaobao Wei
Institute of Software, Chinese Academy of Sciences
3D Vision
Xiaowei Chi
Xiaowei Chi
The Hong Kong University of Science and Technology
Multimodal GenerationRoboticsComputer Vision
Yuming Li
Yuming Li
Peking University
Z
Zhongyu Zhao
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
H
Hao Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ningning Ma
Ningning Ma
Autonomous Driving Development, NIO
M
Ming Lu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shanghang Zhang
Shanghang Zhang
Peking University
Embodied AIFoundation Models