VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of efficient, generalizable confidence estimation for vision-language-action (VLA) models, a gap that hinders risk-sensitive decision-making. The authors propose VLAConf, a one-class discriminative confidence framework that takes internal representations from a frozen pretrained VLA model and predicts step-level anomaly scores with a lightweight confidence head in a single forward pass. Step-conditioned modeling encodes rollout-phase information along the manipulation trajectory, so the confidence signal can be produced without repeated sampling and then calibrated post hoc. Because the method does not depend on action-token probabilities, it applies to continuous action spaces and is intended to transfer across VLA architectures. On the LIBERO benchmark it improves the quality of the confidence signal and outperforms existing baselines by a large margin in inference efficiency; real-robot experiments further validate the approach.
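To make "well-calibrated" concrete, the sketch below computes expected calibration error (ECE), a standard way to compare predicted confidence against observed success rates. It is only an illustration: the summary does not say which calibration metric the paper reports, and the function name, the equal-width binning, and the toy inputs are all assumptions.

```python
def expected_calibration_error(confidences, outcomes, num_bins=10):
    """Weighted average gap between mean confidence and success rate per bin.

    confidences: predicted success probabilities in [0, 1].
    outcomes: 1 for task success, 0 for failure (same length).
    Equal-width bins are an assumption made for this sketch.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, outcome in zip(confidences, outcomes):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(num_bins - 1, int(conf * num_bins))
        bins[idx].append((conf, outcome))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - success_rate)
    return ece


# Toy data: the same predicted confidence with two different success rates.
matched = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A lower value means predicted confidence tracks the empirical success rate more closely; in the toy data, the second case is overconfident and therefore scores worse than the first.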

📝 Abstract

Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still face challenges in computational efficiency and cross-architecture generalizability: they usually require repeated sampling, which makes inference inefficient, and they are restricted to VLA models with discrete action outputs, which makes them difficult to apply to continuous action spaces. To address these issues, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration and outperforms existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.
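The idea of a step-conditioned, one-class anomaly score can be sketched as follows. This is not the paper's confidence head, whose architecture and training objective are not described here: it swaps in a much simpler per-phase diagonal Gaussian fitted only on features from successful rollouts. The feature vectors stand in for frozen VLA hidden states, and `phase_of`, `fit_one_class`, `anomaly_score`, the four-phase split, and the synthetic data are all hypothetical.

```python
import random
from collections import defaultdict


def phase_of(step, horizon, num_phases=4):
    """Map a rollout step to a coarse phase index (assumed conditioning scheme)."""
    return min(num_phases - 1, step * num_phases // horizon)


def fit_one_class(success_features, horizon, num_phases=4):
    """Fit a per-phase mean and variance from successful rollouts only.

    success_features: list of (step, feature_vector) pairs.
    """
    buckets = defaultdict(list)
    for step, feat in success_features:
        buckets[phase_of(step, horizon, num_phases)].append(feat)

    stats = {}
    for phase, feats in buckets.items():
        dim = len(feats[0])
        mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        # Small constant keeps the variance strictly positive.
        var = [
            sum((f[d] - mean[d]) ** 2 for f in feats) / len(feats) + 1e-6
            for d in range(dim)
        ]
        stats[phase] = (mean, var)
    return stats


def anomaly_score(stats, step, feat, horizon, num_phases=4):
    """Variance-normalized squared distance to the phase mean (higher = more anomalous)."""
    mean, var = stats[phase_of(step, horizon, num_phases)]
    return sum((x - m) ** 2 / v for x, m, v in zip(feat, mean, var)) / len(feat)


# Synthetic stand-in for hidden states that drift along a successful rollout.
random.seed(0)
HORIZON, DIM = 40, 3
train = [
    (t, [t / HORIZON + random.gauss(0.0, 0.1) for _ in range(DIM)])
    for _ in range(20)
    for t in range(HORIZON)
]
stats = fit_one_class(train, HORIZON)

nominal = anomaly_score(stats, 15, [15 / HORIZON] * DIM, HORIZON)
shifted = anomaly_score(stats, 15, [15 / HORIZON + 3.0] * DIM, HORIZON)
```

Each score needs only one feature vector per step, which mirrors the single-forward-pass property described above. Conditioning on the phase matters because a feature that is typical late in a rollout may be anomalous early on. Turning such scores into success probabilities would be a separate post-hoc calibration step.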

Problem

Research questions and friction points this paper is trying to address.

confidence estimation

Vision-Language-Action models

computational efficiency

continuous action spaces

cross-architecture generalizability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models

confidence estimation

one-class discrimination