AI Summary
Existing research on handover failure primarily focuses on object slip or external disturbances, and lacks benchmark datasets and evaluation protocols for human-initiated, unavoidable failures (e.g., refusal to accept the object, failure to release it). This work introduces the first multimodal dataset specifically designed for such human-led, unavoidable handover failures, along with two baseline methods: video-based classification and joint temporal action segmentation. These enable real-time failure detection and causal attribution on robotic platforms. We formulate a novel joint temporal segmentation task that unifies human actions, robot actions, and handover outcomes. Our approach employs 3D CNNs for video modeling, combined with force-torque signal processing, gripper pose fusion, and multimodal temporal alignment. Experiments demonstrate that video is the most informative modality; incorporating force-torque and gripper-pose data improves failure detection accuracy by 12.3% and action segmentation mean average precision (mAP) by 9.7%.
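The summary mentions fusing video, force-torque, and gripper-pose signals for failure classification. The paper's exact architecture is not given here, so the following is only a minimal sketch of one common fusion strategy (late fusion by feature concatenation followed by a linear classifier); all feature dimensions, the label set, and the classifier weights are illustrative assumptions, not the authors' design.

```python
import numpy as np

def fuse_features(video_feat, ft_feat, pose_feat):
    """Late fusion: concatenate per-clip feature vectors from each modality.

    video_feat : embedding from a 3D CNN video backbone (512-d, assumed)
    ft_feat    : summary statistics of the force-torque signal (12-d, assumed)
    pose_feat  : gripper pose, e.g. position + quaternion (7-d, assumed)
    """
    return np.concatenate([video_feat, ft_feat, pose_feat])

def classify(fused, weights, bias):
    """Linear classifier over fused features; returns class probabilities."""
    logits = fused @ weights + bias
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy example with random features and untrained weights (shapes only).
rng = np.random.default_rng(0)
video_feat = rng.standard_normal(512)
ft_feat = rng.standard_normal(12)
pose_feat = rng.standard_normal(7)
fused = fuse_features(video_feat, ft_feat, pose_feat)  # 531-d vector

n_classes = 4  # hypothetical outcome classes, e.g. success / ignore / no-release / drop
W = rng.standard_normal((fused.size, n_classes)) * 0.01
b = np.zeros(n_classes)
probs = classify(fused, W, b)
```

In practice the per-modality encoders and the fusion head would be trained jointly; this sketch only shows how the three feature streams could be combined into a single prediction.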
Abstract
An object handover between a robot and a human is a coordinated action that is prone to failure for reasons such as miscommunication, incorrect actions, and unexpected object properties. Existing works on handover failure detection and prevention focus on failures due to object slip or external disturbances. However, there is a lack of datasets and evaluation methods that consider unpreventable failures caused by the human participant. To address this deficit, we present the multimodal Handover Failure Detection dataset, which consists of failures induced by the human participant, such as ignoring the robot or not releasing the object. We also present two baseline methods for handover failure detection: (i) a video classification method using 3D CNNs and (ii) a temporal action segmentation approach that jointly classifies the human action, the robot action, and the overall outcome of the handover. The results show that video is an important modality, but that force-torque data and gripper position help improve failure detection and action segmentation accuracy.
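The second baseline jointly segments the human action stream, the robot action stream, and the overall outcome over a shared timeline. As a minimal sketch of what the output of such a task looks like, the snippet below collapses per-frame labels into temporal segments and derives an outcome from the final human action; the label names and the outcome rule are hypothetical and stand in for whatever the trained segmentation model predicts.

```python
def segment(frame_labels):
    """Collapse a per-frame label sequence into (label, start, end) segments,
    with end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments

def joint_segmentation(human_frames, robot_frames):
    """Segment the human and robot streams over a shared timeline, then
    derive an overall outcome from the last human action (illustrative rule:
    a final 'ignore' or 'hold' indicates a human-induced failure)."""
    human_segs = segment(human_frames)
    robot_segs = segment(robot_frames)
    outcome = "failure" if human_segs[-1][0] in {"ignore", "hold"} else "success"
    return human_segs, robot_segs, outcome

# Example timeline: the human reaches, grasps, and retreats with the object
# while the robot approaches and then releases.
human = ["reach"] * 3 + ["grasp"] * 2 + ["retreat"] * 2
robot = ["approach"] * 4 + ["release"] * 3
human_segs, robot_segs, outcome = joint_segmentation(human, robot)
```

A real model would produce per-frame class scores from the multimodal input and could use a learned, rather than rule-based, mapping from the two action streams to the handover outcome.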