Diffusion-Based Imaginative Coordination for Bimanual Manipulation

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dual-arm manipulation remains hindered by high-dimensional action spaces and challenging inter-arm coordination, limiting robotic deployment in industrial and domestic settings. This paper proposes a diffusion-based “imaginative coordination” framework that decouples video perception from action generation: it predicts multi-frame latent states and employs unidirectional attention to enable action synthesis without real-time video rendering—significantly improving inference efficiency. Crucially, the framework jointly optimizes video representation learning and action sequence modeling. Experiments on the ALOHA benchmark, RoboTwin simulation, and physical hardware demonstrate substantial improvements over the ACT baseline, achieving absolute gains of +24.9%, +11.1%, and +32.5% in task success rate, respectively. These results validate the method’s effectiveness, generalization capability across simulation-to-reality transfer, and practical deployability.

📝 Abstract
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a **24.9%** increase on ALOHA, an **11.1%** increase on RoboTwin, and a **32.5%** increase in real-world experiments. Our models and code are publicly available at https://github.com/return-sleep/Diffusion_based_imaginative_Coordination.
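The unidirectional attention described in the abstract — video tokens may attend to action tokens, but action tokens never attend to video tokens — can be illustrated with a simple attention mask. This is a hedged sketch of the idea, not the paper's implementation; the function name and token layout are assumptions for illustration:

```python
import numpy as np

def unidirectional_mask(n_action: int, n_video: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) over a joint sequence
    laid out as [action tokens | video tokens].

    Action tokens attend only to other action tokens, so action
    prediction is independent of video prediction; video tokens attend
    to everything, so video prediction is conditioned on actions.
    """
    n = n_action + n_video
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_action, :n_action] = True   # action -> action only
    mask[n_action:, :] = True           # video -> action and video
    return mask

mask = unidirectional_mask(n_action=2, n_video=3)
# Because the action block (top-left) is self-contained, the video rows
# can be dropped entirely at inference time, which is what lets the
# framework skip video generation and gain efficiency.
```

The key property is that the top-right block (action rows, video columns) is all False, so removing the video tokens does not change the attention pattern seen by the action tokens.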
Problem

Research questions and friction points this paper is trying to address.

High-dimensional action space in bimanual robotic manipulation
Intricate coordination requirements for bimanual tasks
Underexplored potential of video prediction for coordination enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion-based video and action prediction
Multi-frame latent prediction for future states
Unidirectional attention mechanism for efficiency
Authors
Huilin Xu — Fudan University
Jian Ding — King Abdullah University of Science and Technology
Jiakun Xu — PhD student @ ETHz (computer vision, visual geometry, vectorization)
Ruixiang Wang — The Chinese University of Hong Kong, Shenzhen
Jun Chen — King Abdullah University of Science and Technology
Jinjie Mai — KAUST (3D vision)
Yanwei Fu — Fudan University (computer vision, machine learning, multimedia)
Bernard Ghanem — Professor, King Abdullah University of Science and Technology (computer vision, machine learning)
Feng Xu — Fudan University
Mohamed Elhoseiny — King Abdullah University of Science and Technology