Foundational World Models Accurately Detect Bimanual Manipulator Failures

📅 2026-03-07

📈 Citations: 1

✨ Influential: 0

career value

182K/year

🤖 AI Summary

This work addresses the challenge of runtime anomaly detection in dual-arm manipulation robots, where the high-dimensional state space impedes explicit modeling of failure modes. To overcome this, the authors propose a novel approach that integrates compressed latent representations from a pretrained vision foundation model (Cosmos Tokenizer) with a history-aware probabilistic world model. By leveraging prediction uncertainty to generate inconsistency scores, the method enables efficient fault monitoring during operation. Requiring only approximately 1/20 of the trainable parameters compared to existing learning-based approaches, the proposed framework achieves a 3.8% higher fault detection rate on a newly curated multi-view annotated failure dataset and demonstrates superior performance in real-world scenarios, significantly outperforming conventional statistical baselines.

Technology Category

Application Category

📝 Abstract

Deploying visuomotor robots at scale is challenging due to the potential for anomalous failures to degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces comprised of high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history informed, world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one twentieth of the trainable parameters as the next-best learning-based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real-world environments where reliability is non-negotiable.

Problem

Research questions and friction points this paper is trying to address.

bimanual manipulators

anomaly detection

failure detection

visuomotor robots

out-of-distribution detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model

conformal prediction

foundation model