FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction

📅 2025-09-04

🤖 AI Summary

Traditional perception-planning pipelines lack flexibility in open-world robotic manipulation, while existing end-to-end vision-language-action (VLA) models lack mechanisms to predict or recover from execution failures. To address this, we propose FPC-VLA, a dual-model framework that pairs a primary VLA model with a lightweight supervisor trained without manual annotation. The supervisor is triggered only at keyframes, where it predicts risk and generates corrective strategies; a similarity-guided action fusion mechanism further refines the output actions using past predictions. The framework supports both zero-shot transfer and fine-tuned deployment. On the SIMPLER and LIBERO simulation benchmarks, across the WidowX, Google Robot, and Franka embodiments, our method outperforms state-of-the-art baselines, and real-world deployments on diverse, long-horizon tasks indicate strong generalization and practical applicability.

📝 Abstract

Robotic manipulation is a fundamental component of automation. However, traditional perception-planning pipelines often fall short in open-ended tasks due to limited flexibility, while a single end-to-end Vision-Language-Action (VLA) model offers promising capabilities but lacks crucial mechanisms for anticipating and recovering from failure. To address these challenges, we propose FPC-VLA, a dual-model framework that integrates a VLA model with a supervisor for failure prediction and correction. The supervisor evaluates action viability through vision-language queries and generates corrective strategies when risks arise; it is trained efficiently without manual labeling. A similarity-guided fusion module further refines actions by leveraging past predictions. Evaluation results on multiple simulation platforms (SIMPLER and LIBERO) and robot embodiments (WidowX, Google Robot, Franka) show that FPC-VLA outperforms state-of-the-art models in both zero-shot and fine-tuned settings. Because the supervisor is activated only at keyframes, our approach significantly increases task success rates with minimal impact on execution time. Successful real-world deployments on diverse, long-horizon tasks confirm FPC-VLA's strong generalization and practical utility for building more reliable autonomous systems.
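The keyframe-gated supervision described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the action layout, the keyframe rule (a change in the gripper command), and the names `is_keyframe`, `Verdict`, `run_episode`, `toy_policy`, and `toy_supervisor` are all hypothetical, chosen only to show why querying the supervisor at a few keyframes adds little overhead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical action layout: [dx, dy, dz, gripper], with gripper in [0, 1].
Action = List[float]


def is_keyframe(prev: Action, curr: Action, threshold: float = 0.5) -> bool:
    """Assumed rule: a keyframe is a step where the gripper command flips."""
    return (prev[-1] > threshold) != (curr[-1] > threshold)


@dataclass
class Verdict:
    viable: bool        # supervisor's judgement of the proposed action
    correction: Action  # replacement action, used only when viable is False


def run_episode(
    policy: Callable[[int], Action],
    supervisor: Callable[[int, Action], Verdict],
    observations: Iterable[int],
) -> Tuple[List[Action], int]:
    """Run the policy, consulting the supervisor only at keyframes."""
    executed: List[Action] = []
    supervisor_calls = 0
    prev: Optional[Action] = None
    for obs in observations:
        action = policy(obs)
        # Compare the proposal against the last executed action.
        if prev is not None and is_keyframe(prev, action):
            supervisor_calls += 1
            verdict = supervisor(obs, action)
            if not verdict.viable:
                action = verdict.correction
        executed.append(action)
        prev = action
    return executed, supervisor_calls


def toy_policy(obs: int) -> Action:
    # Stand-in for the VLA model: keep the gripper open, then close from step 2.
    return [0.0, 0.0, 0.0, 1.0 if obs < 2 else 0.0]


def toy_supervisor(obs: int, action: Action) -> Verdict:
    # Stand-in for the supervisor: reject the first grasp attempt at step 2
    # and propose lowering the arm slightly with the gripper still open.
    if obs == 2:
        return Verdict(False, [0.0, 0.0, -0.01, 1.0])
    return Verdict(True, action)


executed, calls = run_episode(toy_policy, toy_supervisor, range(5))
```

In this toy run the supervisor is queried at only two of the five steps: once when the first grasp is proposed and rejected, and once when the grasp is retried after the correction.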

Problem

Research questions and friction points this paper is trying to address.

Robotic manipulation failures in open-ended tasks

No mechanism in single end-to-end VLA models for failure prediction and correction

Raising task success rates with minimal impact on execution time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-model framework with supervisor for failure handling

Vision-language queries evaluate action viability automatically

Similarity-guided fusion refines actions using past predictions
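One way to picture similarity-guided fusion is the sketch below. The source states only that the module refines actions by leveraging past predictions, so the specific scheme here is an assumption: the current prediction is averaged with earlier predictions for the same step, each weighted by its cosine similarity to the current one, and dissimilar predictions are discarded. The names `cosine_similarity`, `fuse_actions`, and `min_similarity` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_actions(
    current: Sequence[float],
    past_predictions: Sequence[Sequence[float]],
    min_similarity: float = 0.0,
) -> List[float]:
    """Assumed fusion rule: similarity-weighted average of predictions.

    The current prediction always has weight 1.0. A past prediction
    contributes with weight equal to its cosine similarity to the current
    one, and is ignored when that similarity is not above min_similarity.
    """
    weights = [1.0]
    candidates = [list(current)]
    for past in past_predictions:
        sim = cosine_similarity(current, past)
        if sim > min_similarity:
            weights.append(sim)
            candidates.append(list(past))
    total = sum(weights)
    return [
        sum(w * c[i] for w, c in zip(weights, candidates)) / total
        for i in range(len(current))
    ]


# A past prediction pointing the same way is blended in; an opposing one is dropped.
fused_agree = fuse_actions([1.0, 0.0], [[2.0, 0.0]])
fused_conflict = fuse_actions([1.0, 0.0], [[-1.0, 0.0]])
```

Weighting by similarity means past predictions that agree with the current one smooth the output, while conflicting predictions cannot pull the action in the opposite direction.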
