🤖 AI Summary
This work proposes a novel Vision-Language-Action (VLA) framework endowed with proactive self-correction capabilities, addressing a limitation of existing robot failure detection methods: they typically rely on post-hoc interventions and cannot prevent errors during execution. The framework predicts potential failures at critical subtask transition points and triggers a backtracking mechanism, augmented by Minimum Bayes Risk (MBR) decoding to raise the success rate of retries. It is the first approach that enables VLA models to actively anticipate failures and backtrack to prior subtasks during execution, and it further introduces an MBR-based zero-shot test-time scaling strategy. Experimental results demonstrate that the proposed framework significantly improves task success rates for both well-trained and under-trained VLA models, confirming its effectiveness and robustness.
📝 Abstract
Current work on robot failure detection and correction typically operates in a post-hoc manner, analyzing errors and applying corrections only after failures occur. This work introduces CycleVLA, a system that equips Vision-Language-Action models (VLAs) with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution. CycleVLA achieves this by integrating a progress-aware VLA that flags critical subtask transition points where failures most frequently occur, a VLM-based failure predictor and planner that triggers subtask backtracking upon a predicted failure, and a test-time scaling strategy based on Minimum Bayes Risk (MBR) decoding to improve retry success after backtracking. Extensive experiments show that CycleVLA improves performance for both well-trained and under-trained VLAs, and that MBR serves as an effective zero-shot test-time scaling strategy for VLAs. Project Page: https://dannymcy.github.io/cyclevla/
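As background for the MBR-based test-time scaling mentioned above, the core idea of Minimum Bayes Risk decoding is to sample several candidate outputs and select the one with the highest expected similarity to the other samples (the consensus candidate), rather than the single highest-probability sample. The sketch below is illustrative only: the function names, the use of action trajectories as arrays, and the negative-L2 similarity are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

def mbr_select(candidates, similarity):
    """Minimum Bayes Risk selection: return the candidate whose average
    similarity to all other sampled candidates is highest, i.e. the
    lowest-risk "consensus" sample under a uniform weighting."""
    n = len(candidates)
    scores = []
    for i in range(n):
        # Expected similarity of candidate i against the other samples.
        score = sum(similarity(candidates[i], candidates[j])
                    for j in range(n) if j != i) / (n - 1)
        scores.append(score)
    return candidates[int(np.argmax(scores))]

def neg_l2(a, b):
    # Illustrative similarity for action trajectories: negative
    # Euclidean distance (closer trajectories are more similar).
    return -float(np.linalg.norm(a - b))

# Toy usage: three sampled "trajectories"; two agree, one is an outlier,
# so MBR picks one of the two mutually consistent samples.
cands = [np.zeros(3), np.full(3, 0.1), np.full(3, 10.0)]
best = mbr_select(cands, neg_l2)
```

In a VLA setting the candidates would be action chunks sampled from the policy at a retry point, so MBR needs no extra training or reward model, which is why it can act as a zero-shot test-time scaling strategy.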