🤖 AI Summary
This work addresses a key limitation of existing vision-language-action (VLA) models: they provide no calibrated measure of confidence in their action predictions, so they struggle to anticipate failures in real-world environments. The authors introduce ReconVLA, which applies conformal prediction to the action token outputs of pretrained VLA policies and to the robot state space, yielding calibrated action-level uncertainty estimates and detection of outlier or unsafe states before failures occur, without retraining or modifying the underlying model. Evaluated in simulation and on real robots across diverse manipulation tasks, the method consistently improves failure anticipation, reduces catastrophic errors, and produces uncertainty estimates that correlate with execution quality and task success.
📝 Abstract
Vision-language-action (VLA) models have emerged as generalist robotic controllers capable of mapping visual observations and natural language instructions to continuous action sequences. However, VLAs provide no calibrated measure of confidence in their action predictions, which limits their reliability in real-world settings where uncertainty and failures must be anticipated. To address this problem, we introduce ReconVLA, a reliable conformal model that produces uncertainty-guided and failure-aware control signals. Concretely, our approach applies conformal prediction directly to the action token outputs of pretrained VLA policies, yielding calibrated uncertainty estimates that correlate with execution quality and task success. Furthermore, we extend conformal prediction to the robot state space to detect outliers or unsafe states before failures occur, providing a simple yet effective failure detection mechanism that complements the action-level uncertainty. We evaluate ReconVLA in both simulation and real robot experiments across diverse manipulation tasks. Our results show that conformalized action predictions consistently improve failure anticipation, reduce catastrophic errors, and provide a calibrated measure of confidence without retraining or modifying the underlying VLA.
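As an illustration of the general technique named in the abstract, the sketch below shows standard split conformal prediction applied to two settings: a distribution over discrete action tokens, and a robot state vector. It is not the authors' implementation. The abstract does not specify the nonconformity scores, so the two used here (one minus the token probability, and Euclidean distance from a reference state) are assumptions, and every function name is hypothetical.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small to support the requested level.
    """
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    rank = int(np.ceil((n + 1) * (1 - alpha)))  # 1-indexed rank
    if rank > n:
        return np.inf
    return scores[rank - 1]


def token_prediction_set(probs, q_hat):
    """Indices of action tokens whose score (1 - probability) is within q_hat.

    ASSUMPTION: 1 - p is a common nonconformity score for classifiers;
    the source does not state which score is used.
    """
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero(1.0 - probs <= q_hat)


def state_score(state, reference_state):
    """ASSUMED state-space score: Euclidean distance from a reference state."""
    diff = np.asarray(state, dtype=float) - np.asarray(reference_state, dtype=float)
    return float(np.linalg.norm(diff))


def is_anomalous(state, reference_state, q_hat):
    """Flag a state whose score exceeds the calibrated threshold."""
    return state_score(state, reference_state) > q_hat


if __name__ == "__main__":
    # Calibration scores would come from held-out rollouts; these are made up.
    calibration_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    q_hat = conformal_quantile(calibration_scores, alpha=0.5)
    print("threshold:", q_hat)
    print("token set:", token_prediction_set([0.7, 0.2, 0.1], q_hat))
    print("anomalous:", is_anomalous([3.0, 4.0], [0.0, 0.0], q_hat))
```

Under this sketch, a larger token prediction set signals lower confidence in the predicted action, and the state check can run before each action is executed. Both thresholds depend only on held-out calibration data, which is why the wrapped policy needs no retraining.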