🤖 AI Summary
Existing diffusion-based policy learning methods neglect physical safety constraints inherent in bimanual manipulation, leading to hazardous interactions. This paper proposes a test-time trajectory optimization framework that synergistically integrates diffusion models with vision-language models (VLMs) for semantic-driven safe planning. Specifically, we design a variable safety cost function tailored to bimanual collaboration modes; leverage VLMs to dynamically identify critical points and their relational structure, thereby adaptively generating full-task safety constraints; and employ guided diffusion denoising to optimize trajectories satisfying all constraints. Evaluated on eight simulated bimanual tasks, our method achieves a 13.7% higher task success rate and an 18.8% reduction in unsafe interactions compared to state-of-the-art approaches. On four real-world bimanual robotic tasks, it improves success rate by 32.5%, demonstrating both safety and practical efficacy.
📝 Abstract
Bimanual manipulation has been widely applied in household services and manufacturing, enabling the completion of complex tasks that require coordination. Recent diffusion-based policy learning approaches have achieved promising performance in modeling action distributions for bimanual manipulation. However, they ignore the physical safety constraints of bimanual manipulation, which leads to dangerous behaviors that damage robots and objects. To this end, we propose SafeBimanual, a test-time trajectory optimization framework for any pre-trained diffusion-based bimanual manipulation policy, which imposes safety constraints on bimanual actions to avoid dangerous robot behaviors and improve the success rate. Specifically, we design diverse cost functions for safety constraints in different dual-arm cooperation patterns, including avoiding tearing objects and collisions between arms and objects; these cost functions optimize the manipulator trajectories through guided sampling of the diffusion denoising process. Moreover, we employ a vision-language model (VLM) to schedule the cost functions by specifying keypoints and their pairwise relationships, so that the optimal safety constraint is dynamically generated throughout the entire bimanual manipulation process. SafeBimanual demonstrates superiority on 8 simulated tasks in RoboTwin, with a 13.7% increase in success rate and an 18.8% reduction in unsafe interactions over state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world tasks further verify its practical value, improving the success rate by 32.5%.
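The cost-guided denoising described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the hinge-style clearance cost, the 10 cm safety threshold, and the `denoise_fn` placeholder (standing in for one step of a pre-trained diffusion policy's denoiser) are all assumptions made for the example.

```python
import numpy as np

def safety_cost_grad(traj, min_dist=0.10):
    """Hinge penalty on left/right end-effector clearance, with analytic gradient.

    traj: array of shape (T, 2, 3) -- positions of the two end-effectors over T steps.
    Cost is sum of max(min_dist - d, 0)^2, penalizing steps where arms get too close.
    """
    diff = traj[:, 0] - traj[:, 1]           # (T, 3) left-minus-right displacement
    d = np.linalg.norm(diff, axis=-1)        # (T,) inter-arm distance per step
    viol = np.maximum(min_dist - d, 0.0)     # violation magnitude (0 when safe)
    cost = float(np.sum(viol ** 2))
    grad = np.zeros_like(traj)
    mask = viol > 0
    unit = diff[mask] / (d[mask, None] + 1e-9)
    grad[mask, 0] = -2.0 * viol[mask, None] * unit   # moving left arm along +unit increases d
    grad[mask, 1] = 2.0 * viol[mask, None] * unit
    return cost, grad

def guided_denoise_step(x_t, denoise_fn, guidance_scale=0.5):
    """One guided denoising step: take the policy's proposal, then
    nudge it down the safety-cost gradient (classifier-guidance style)."""
    x_prev = denoise_fn(x_t)                 # unguided update from the diffusion policy
    _, g = safety_cost_grad(x_prev)
    return x_prev - guidance_scale * g
```

A quick usage check: for a trajectory where the arms sit 5 cm apart (inside the 10 cm threshold), one guided step with an identity denoiser pushes them apart and lowers the safety cost.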