3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot manipulation generalization is hindered by the absence of a unified, large-scale action representation and cross-embodiment training data. Method: We propose an embodiment-agnostic action representation based on 3D optical flow and introduce the first 3D flow world model for manipulation tasks, conditioned on language instructions to generate physically plausible 3D flow trajectories. Our approach integrates video diffusion modeling, flow-guided rendering, GPT-4o semantic validation, and flow-constrained optimization to establish a closed-loop planning framework. Contribution/Results: Without hardware-specific fine-tuning, our method enables zero-shot transfer across diverse robot embodiments. It achieves a 32.7% absolute improvement in action success rate over baseline methods on complex manipulation tasks, significantly enhancing generalization across environments and robotic morphologies.
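The summary above describes a closed loop: propose 3D flow, render the implied final state, validate it semantically, then optimize robot actions. Below is a minimal Python sketch of that control flow only; all component names (`world_model`, `renderer`, `vlm`, `action_optimizer`) are hypothetical placeholders for illustration, not the authors' released API.

```python
def plan_with_flow_world_model(obs, instruction, world_model, renderer, vlm,
                               action_optimizer, max_retries=3):
    """Closed-loop planning sketch: propose 3D flow, validate it semantically,
    then solve for a chunk of robot actions that realize it."""
    for _ in range(max_retries):
        # 1. Video-diffusion world model: predict 3D optical flow of the
        #    interacting object, conditioned on the language instruction.
        flow_3d = world_model.predict_flow(obs, instruction)

        # 2. Flow-guided rendering of the predicted final object state.
        rendered_final = renderer.render_final_state(obs, flow_3d)

        # 3. Semantic validation with a VLM (GPT-4o in the paper): does the
        #    rendered final state match the task description?
        if vlm.is_consistent(rendered_final, instruction):
            # 4. Flow-constrained optimization over a chunk of robot actions.
            return action_optimizer.solve(obs, flow_3d)

    return None  # no validated flow; the caller may re-observe and retry
```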

📝 Abstract
Manipulation has long been a challenging task for robots, whereas humans can effortlessly perform complex interactions with objects, such as hanging a cup on a mug rack. A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot actions in different action spaces within simple scenes, which hinders robots from learning a unified and robust action representation that transfers across embodiments and diverse scenes. Observing how humans understand a manipulation task, we find that understanding how objects should move in 3D space is a critical cue for guiding actions. This cue is embodiment-agnostic and applies to both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. This model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through a moving-object auto-detection pipeline. A video diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow-guided rendering mechanism, which renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description. This equips the robot with closed-loop planning ability. Finally, we treat the predicted 3D optical flow as constraints in an optimization policy that determines a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
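The last step treats the predicted 3D flow as constraints on the robot's actions. As one minimal, hedged instance (a simplification for illustration, not the paper's exact objective): if the manipulated object is rigidly grasped, the flow from current to predicted final object points pins down a rigid transform that the end-effector should reproduce, and the Kabsch/Procrustes algorithm gives that transform in closed form.

```python
import numpy as np

def rigid_transform_from_flow(points, flow):
    """Closed-form (Kabsch) fit of the rotation R and translation t that best
    map current object points to their flow-displaced targets:
        minimize  sum_i || R p_i + t - (p_i + f_i) ||^2
    points: (N, 3) current object points; flow: (N, 3) predicted displacements.
    For a rigidly grasped object, (R, t) is also the motion the end-effector
    must perform -- a simplifying assumption made only for this sketch."""
    targets = points + flow
    p_mean, q_mean = points.mean(axis=0), targets.mean(axis=0)
    H = (points - p_mean).T @ (targets - q_mean)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

In the paper, the optimization produces a full chunk of robot actions under additional constraints; this closed-form fit only illustrates how per-point 3D flow constrains a target object (and hence gripper) pose.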
Problem

Research questions and friction points this paper is trying to address.

Lack of a large, uniform dataset for teaching robot manipulation skills
Difficulty in learning a unified action representation across diverse robots
Need for embodiment-agnostic understanding of 3D object movement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning a 3D flow world model from human and robot manipulation data
Generating 3D optical flow trajectories with a video diffusion model
Flow-guided rendering mechanism for closed-loop planning (see the geometry sketch after this list)
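The flow-guided rendering step rests on simple camera geometry: back-project the observed object pixels to 3D using the depth map and intrinsics, displace them by the predicted 3D flow, and re-project to obtain the implied final view. The sketch below shows only that geometric core under assumptions of my own (a single fixed pinhole camera, per-pixel flow for the masked object); the function name and argument layout are hypothetical, and the paper's actual renderer, which produces the image GPT-4o inspects, is more involved.

```python
import numpy as np

def warp_points_by_flow(depth, intrinsics, flow_3d, mask):
    """Back-project masked object pixels to 3D, displace them by the predicted
    3D flow, and return the displaced points plus their 2D re-projections.
    depth: (H, W) depth map; intrinsics: (fx, fy, cx, cy);
    flow_3d: (N, 3) displacement per masked pixel; mask: (H, W) bool object mask."""
    fx, fy, cx, cy = intrinsics
    v, u = np.nonzero(mask)                     # pixel coordinates of the object
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)           # (N, 3) points in camera frame

    pts_final = pts + flow_3d                   # apply predicted 3D displacement

    # Re-project displaced points to the image plane for rendering the final state.
    u2 = pts_final[:, 0] * fx / pts_final[:, 2] + cx
    v2 = pts_final[:, 1] * fy / pts_final[:, 2] + cy
    return pts_final, np.stack([u2, v2], axis=1)
```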
Hongyan Zhi
South China University of Technology
Peihao Chen
Researcher at Robotics X Lab, Tencent
Embodied AI · Multi-Modal Video Understanding
Siyuan Zhou
Hong Kong University of Science and Technology
Yubo Dong
South China University of Technology
Quanxi Wu
South China University of Technology
Lei Han
Tencent Robotics X
Mingkui Tan
South China University of Technology
Machine Learning · Large-scale Optimization