🤖 AI Summary
This work addresses the safety risks of dual-arm manipulation under vision-language instructions, where unmodeled self-collisions, between the arms themselves or with grasped objects, undermine safe deployment. The authors propose a risk-aware, end-to-end Vision-Language-Action (VLA) framework that integrates, for the first time, a short-horizon self-collision risk estimator with a VLA model. By fusing proprioceptive and visual embeddings with the planned actions, the system predicts high-risk commands and halts their execution in real time. It also incorporates a risk-guided state-recovery and policy-optimization mechanism. The method is pretrained in simulation using model-based collision labels and fine-tuned on a real PiPER dual-arm robot. Across five dual-arm tasks, it significantly reduces self-collision rates and achieves higher task success than existing approaches such as RDT and APEX.
📝 Abstract
Vision-Language-Action (VLA) models enable instruction-following manipulation, yet dual-arm deployment remains unsafe due to under-modeled self-collisions between arms and grasped objects. We introduce CoFreeVLA, which augments an end-to-end VLA with a short-horizon self-collision risk estimator that predicts collision likelihood from proprioception, visual embeddings, and planned actions. The estimator gates risky commands, recovers to safe states via risk-guided adjustments, and shapes policy refinement for safer rollouts. It is pre-trained with model-based collision labels and post-trained on real-robot rollouts for calibration. On five bimanual tasks with the dual-arm PiPER robot, CoFreeVLA reduces self-collisions and improves success rates versus RDT and APEX.
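The gate-then-recover loop described in the abstract can be sketched minimally as follows. This is an illustrative assumption, not the paper's architecture: the tiny MLP, input dimensions, class names (`RiskEstimator`, `gated_step`), and the 0.5 threshold are all hypothetical placeholders for the learned risk head and risk-guided recovery the authors describe.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RiskEstimator:
    """Hypothetical stand-in for the short-horizon risk head: fuses the
    proprioceptive state, a visual embedding, and the planned action chunk,
    and outputs a self-collision probability in [0, 1]."""

    def __init__(self, in_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialized weights; in the paper this head is trained
        # on model-based collision labels and calibrated on real rollouts.
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def risk(self, proprio, visual_emb, planned_actions):
        # Concatenate all inputs into one feature vector, then a 2-layer MLP.
        x = np.concatenate([proprio, visual_emb, np.ravel(planned_actions)])
        h = np.tanh(x @ self.w1 + self.b1)
        return float(sigmoid(h @ self.w2 + self.b2))


def gated_step(action, risk, threshold=0.5, recovery_action=None):
    """Execute the VLA's action only when predicted risk is below the
    threshold; otherwise substitute a risk-guided recovery action."""
    return action if risk < threshold else recovery_action
```

A usage sketch: at each control step the VLA proposes an action chunk, the estimator scores it, and `gated_step` either forwards the chunk to the robot or swaps in a recovery motion toward a safe state.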