🤖 AI Summary
To address the difficulty of modeling multi-body spatiotemporal dynamics and coordinating two arms in multi-task bimanual robotic manipulation, this paper proposes ManiGaussian++, an extension of ManiGaussian built around a hierarchical Gaussian world model. The method first generates task-oriented Gaussian Splatting from intermediate visual features to differentiate the stabilizing arm from the acting arm. It then applies a leader-follower architecture: the leader predicts the Gaussian Splatting deformation caused by the stabilizing arm, and the follower predicts the physical consequences of the acting arm's movement. The visual representation is trained through future scene prediction, so that it captures multi-body dynamics. On 10 simulated tasks, ManiGaussian++ outperforms the current state-of-the-art bimanual manipulation techniques by 20.2%. On 9 challenging real-world tasks, it achieves a 60% average success rate.
📝 Abstract
Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks make it harder to understand the multi-body spatiotemporal dynamics. The existing method ManiGaussian pioneered encoding the spatiotemporal dynamics into the visual representation via a Gaussian world model for single-arm settings, but it ignores the interaction of multiple embodiments and therefore suffers a significant performance drop on dual-arm systems. In this paper, we propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. Specifically, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate the acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with a leader-follower architecture, where the multi-body spatiotemporal dynamics are mined for the intermediate visual representation via future scene prediction. The leader predicts the Gaussian Splatting deformation caused by motions of the stabilizing arm, through which the follower generates the physical consequences resulting from the movement of the acting arm. As a result, our method outperforms the current state-of-the-art bimanual manipulation techniques by 20.2% on 10 simulated tasks, and achieves a 60% average success rate on 9 challenging real-world tasks. Our code is available at https://github.com/April-Yz/ManiGaussian_Bimanual.
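The leader-follower prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 7-dimensional arm actions, and the single linear layer standing in for each learned deformation network are all assumptions made here for illustration. The sketch only shows the ordering of the two stages, with the follower operating on the scene already deformed by the leader.

```python
import numpy as np


def make_linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned deformation network (one linear layer)."""
    return rng.normal(scale=0.01, size=(in_dim, out_dim))


def deform(centers, action, weights):
    """Shift every Gaussian center by an offset predicted from (center, action)."""
    n = centers.shape[0]
    # Broadcast the arm action to each Gaussian and append it to the center.
    feats = np.concatenate([centers, np.tile(action, (n, 1))], axis=1)
    return centers + feats @ weights


def predict_future(centers, stabilizing_action, acting_action, w_leader, w_follower):
    """Two-stage prediction: leader first, follower conditioned on the leader's output."""
    # Leader: deformation caused by the stabilizing arm.
    after_leader = deform(centers, stabilizing_action, w_leader)
    # Follower: consequences of the acting arm, applied to the leader's prediction.
    return deform(after_leader, acting_action, w_follower)


rng = np.random.default_rng(0)
centers = rng.normal(size=(128, 3))       # assumed: 128 Gaussian centers in 3D
stabilizing_action = rng.normal(size=7)   # assumed: 7-DoF action per arm
acting_action = rng.normal(size=7)
w_leader = make_linear(3 + 7, 3, rng)
w_follower = make_linear(3 + 7, 3, rng)

future_centers = predict_future(
    centers, stabilizing_action, acting_action, w_leader, w_follower
)
```

In a trained model the predicted Gaussians would be rendered and compared against the observed future scene; that supervision step is omitted here.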