Enhancing Video-Based Robot Failure Detection Using Task Knowledge

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the limited robustness of video-based failure detection in robotic task execution. We propose a detection method that jointly models spatiotemporal action dynamics and task-relevant object semantics. Our approach explicitly captures the spatiotemporal relationships between robot action sequences and salient objects within the field of view; leverages readily available task context, such as goal descriptions and step instructions, to guide detection; and introduces a variable-frame-rate data augmentation strategy that improves generalization without additional computational cost, with a further gain from test-time augmentation. Evaluated on the ARMBench dataset, our method achieves an F1 score of 81.4%, a 3.5-point improvement over the strongest baseline, surpassing prior state-of-the-art methods. These results demonstrate the effectiveness and practicality of combining spatiotemporal task knowledge with lightweight data augmentation for robotic failure detection.

📝 Abstract
Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.
Problem

Research questions and friction points this paper is trying to address.

Detecting robot execution failures robustly in real-world scenarios
Improving video-based failure detection using spatio-temporal task knowledge
Enhancing performance through task-relevant object and action information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-based failure detection using spatio-temporal knowledge
Incorporates robot actions and task-relevant objects
Variable frame rate data augmentation method
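The paper does not spell out the augmentation procedure in this summary, but the abstract's description ("applying variable frame rates to different parts of the video") suggests segment-wise frame subsampling. A minimal sketch of that idea, assuming a hypothetical helper `variable_frame_rate_augment` that splits a clip into temporal segments and subsamples each at an independently chosen rate:

```python
import random

def variable_frame_rate_augment(num_frames, num_segments=3, rates=(1, 2, 3), seed=None):
    """Return a sorted list of frame indices to keep, where each temporal
    segment of the clip is subsampled at its own randomly chosen rate.

    Hypothetical illustration of segment-wise variable-frame-rate
    augmentation; not the paper's exact implementation.
    """
    rng = random.Random(seed)
    # Split the frame range into roughly equal temporal segments.
    bounds = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    kept = []
    for start, end in zip(bounds, bounds[1:]):
        rate = rng.choice(rates)  # keep every `rate`-th frame in this segment
        kept.extend(range(start, end, rate))
    return kept
```

Because the augmentation only selects which existing frames to feed the model, it adds no extra computation per training step, matching the abstract's "without additional computational expense" claim.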