TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions

📅 2026-03-10

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

This work addresses the challenge of detecting complex temporal errors in high-level robotic tasks, which existing video anomaly detection methods—primarily focused on low-level motion anomalies—struggle to capture. The study formulates task-level temporal errors as a video anomaly detection problem and introduces an end-to-end framework that integrates vision-language models with weakly supervised learning to enable frame-level error detection using only task videos and textual prompts. To circumvent the scarcity of real-world error examples, the authors construct a multi-robot simulation dataset and demonstrate zero-shot transfer from simulation to real-world evaluation. Experimental results show that the proposed method effectively identifies diverse temporal errors and significantly outperforms off-the-shelf vision-language models, whose performance is limited by a lack of explicit temporal reasoning capabilities.

Technology Category

Application Category

📝 Abstract

As robotic systems execute increasingly difficult task sequences, so does the number of ways in which they can fail. Video Anomaly Detection (VAD) frameworks typically focus on singular, low-level kinematic or action failures, struggling to identify more complex temporal or spatial task violations, because they do not necessarily manifest as low-level execution errors. To address this problem, the main contribution of this paper is a new VAD-inspired architecture, TIMID, which is able to detect robot time-dependent mistakes when executing high-level tasks. Our architecture receives as inputs a video and prompts of the task and the potential mistake, and returns a frame-level prediction in the video of whether the mistake is present or not. By adopting a VAD formulation, the model can be trained with weak supervision, requiring only a single label per video. Additionally, to alleviate the problem of data scarcity of incorrect executions, we introduce a multi-robot simulation dataset with controlled temporal errors and real executions for zero-shot sim-to-real evaluation. Our experiments demonstrate that out-of-the-box VLMs lack the explicit temporal reasoning required for this task, whereas our framework successfully detects different types of temporal errors. Project: https://ropertunizar.github.io/TIMID/

Problem

Research questions and friction points this paper is trying to address.

Video Anomaly Detection

Temporal Errors

Robot Task Execution

Time-Dependent Mistakes

High-Level Tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-Dependent Mistake Detection

Video Anomaly Detection

Weakly Supervised Learning