FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction

📅 2025-09-04
🤖 AI Summary
Traditional perception-planning pipelines lack flexibility in open-world robotic manipulation, while existing end-to-end vision-language-action (VLA) models have no mechanism for predicting or recovering from execution failures. To address this, the paper proposes a dual-model framework. A supervisor module, trained without manual annotation, is paired with a primary VLA model and invoked only at keyframes to predict risk and generate corrective strategies. A similarity-guided action fusion mechanism then refines the output actions using past predictions. The framework is evaluated in both zero-shot and fine-tuned settings. On the SIMPLER and LIBERO simulation benchmarks and across the WidowX, Google Robot, and Franka embodiments, it outperforms state-of-the-art baselines, and real-world deployments on diverse long-horizon tasks indicate strong generalization and practical applicability.

📝 Abstract
Robotic manipulation is a fundamental component of automation. However, traditional perception-planning pipelines often fall short in open-ended tasks due to limited flexibility, while a single end-to-end Vision-Language-Action (VLA) model offers promising capabilities but lacks crucial mechanisms for anticipating and recovering from failure. To address these challenges, we propose FPC-VLA, a dual-model framework that integrates a VLA model with a supervisor for failure prediction and correction. The supervisor evaluates action viability through vision-language queries and generates corrective strategies when risks arise; it is trained efficiently without manual labeling. A similarity-guided fusion module further refines actions by leveraging past predictions. Evaluation results on multiple simulation platforms (SIMPLER and LIBERO) and robot embodiments (WidowX, Google Robot, Franka) show that FPC-VLA outperforms state-of-the-art models in both zero-shot and fine-tuned settings. By activating the supervisor only at keyframes, our approach significantly increases task success rates with minimal impact on execution time. Successful real-world deployments on diverse, long-horizon tasks confirm FPC-VLA's strong generalization and practical utility for building more reliable autonomous systems.
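
The keyframe-gated supervision described in the abstract can be pictured with a minimal control-loop sketch. This is an illustration under stated assumptions, not the authors' implementation: `run_episode`, `Verdict`, and the `policy`, `supervisor`, and `is_keyframe` callables are hypothetical names, the keyframe criterion is left to the caller, and the supervisor's corrective strategy is simplified to a replacement action.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple

Action = List[float]


@dataclass
class Verdict:
    """Hypothetical supervisor output: is the action viable, and if not, what to do instead."""
    viable: bool
    correction: Optional[Action] = None


def run_episode(
    observations: Sequence[Any],
    policy: Callable[[Any], Action],
    supervisor: Callable[[Any, Action], Verdict],
    is_keyframe: Callable[[int, Any, Action], bool],
) -> Tuple[List[Action], int]:
    """Run the policy on every step, but query the supervisor only at keyframes.

    Returns the executed actions and the number of supervisor calls, so the
    extra cost of supervision can be inspected.
    """
    executed: List[Action] = []
    supervisor_calls = 0
    for step, obs in enumerate(observations):
        action = policy(obs)  # primary VLA prediction (stand-in callable)
        if is_keyframe(step, obs, action):
            supervisor_calls += 1
            verdict = supervisor(obs, action)
            # Assumption: a non-viable verdict carries a replacement action.
            if not verdict.viable and verdict.correction is not None:
                action = verdict.correction
        executed.append(action)
    return executed, supervisor_calls
```

Because the supervisor runs only when `is_keyframe` returns true, its cost scales with the number of keyframes rather than the number of control steps, which is consistent with the abstract's claim of minimal impact on execution time.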
Problem

Research questions and friction points this paper is trying to address.

Addresses robotic manipulation failures in open-ended tasks
Integrates supervisor for failure prediction and correction
Improves task success rates with minimal time impact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-model framework with supervisor for failure handling
Vision-language queries evaluate action viability automatically
Similarity-guided fusion refines actions using past predictions
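
One plausible reading of similarity-guided fusion is sketched below. This is an assumption, not the paper's formulation: the function names, the use of cosine similarity, the softmax weighting, and the `temperature` parameter are all illustrative choices. The idea shown is that earlier predictions made for the current timestep are blended with the newest prediction, with predictions that agree with it receiving more weight.

```python
import math
from typing import List, Sequence

Action = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_actions(
    current: Sequence[float],
    past_predictions: Sequence[Sequence[float]],
    temperature: float = 0.1,  # hypothetical knob: smaller favors agreeing predictions more
) -> Action:
    """Blend the current action with past predictions for the same timestep.

    Each candidate is weighted by exp(similarity / temperature), where the
    similarity is measured against the current prediction, and the weights
    are normalized to sum to one.
    """
    candidates = [list(current)] + [list(p) for p in past_predictions]
    weights = [
        math.exp(cosine_similarity(current, c) / temperature) for c in candidates
    ]
    total = sum(weights)
    return [
        sum(w * c[i] for w, c in zip(weights, candidates)) / total
        for i in range(len(current))
    ]
```

With no past predictions the function returns the current action unchanged, and a past prediction pointing in the opposite direction contributes almost nothing at the default temperature.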
Yifan Yang
The Institute of Robotics and Automatic Information System, Tianjin Key Laboratory of Intelligent Robotics, and TBI center, Nankai University, Tianjin 300350, China
Zhixiang Duan
Xiaomi EV, Beijing, China
Tianshi Xie
Xiaomi EV, Beijing, China
Fuyu Cao
Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110819, China
Pinxi Shen
Xiaomi EV, Beijing, China
Peili Song
The Institute of Robotics and Automatic Information System, Tianjin Key Laboratory of Intelligent Robotics, and TBI center, Nankai University, Tianjin 300350, China
Piaopiao Jin
Xiaomi EV, Beijing, China
Guokang Sun
Xiaomi EV, Beijing, China
Shaoqing Xu
University of Macau, BUAA, Xiaomi EV
3D Computer Vision, 3D Generation, Vision and Language Model, End2End, World Model
Yangwei You
Xiaomi Robotics Lab
Legged Locomotion, Robotics
Jingtai Liu
The Institute of Robotics and Automatic Information System, Tianjin Key Laboratory of Intelligent Robotics, and TBI center, Nankai University, Tianjin 300350, China