VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models

📅 2026-05-28
🤖 AI Summary
This work addresses the lack of efficient and generalizable confidence estimation mechanisms in existing vision-language-action (VLA) models, which hinders risk-sensitive decision-making. The authors propose VLAConf, a discriminative one-class confidence framework that leverages internal representations from a frozen pretrained VLA model to directly predict step-level anomaly scores via a lightweight confidence head in a single forward pass. By incorporating step-conditional modeling to encode phase-specific information from action trajectories, VLAConf generates well-calibrated confidence estimates without requiring repeated sampling. The method exhibits strong cross-architecture generality, high computational efficiency, and compatibility with continuous action spaces. It significantly outperforms existing approaches on the LIBERO benchmark and demonstrates practical efficacy in real-world robotic experiments.
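One common way to check whether confidence estimates are well calibrated is the expected calibration error (ECE), which compares predicted confidence with the observed success rate inside confidence bins. The sketch below is illustrative only: the summary does not state which calibration metric the authors report, and the function name, the equal-width binning, and the default of 10 bins are assumptions made here.

```python
import numpy as np

def expected_calibration_error(conf, success, n_bins=10):
    """Equal-width-bin ECE (hypothetical helper, not taken from the paper).

    conf:    predicted success probabilities in [0, 1]
    success: binary task outcomes (1 = success, 0 = failure)
    """
    conf = np.asarray(conf, dtype=float)
    success = np.asarray(success, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins, so a confidence of exactly 1.0 lands in the last bin.
    idx = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # Weight each bin's |confidence - accuracy| gap by its share of samples.
            ece += mask.mean() * abs(conf[mask].mean() - success[mask].mean())
    return ece
```

A lower value means predicted confidence tracks the empirical success rate more closely; a perfectly calibrated predictor scores 0.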
📝 Abstract
Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still encounter challenges in computational efficiency and cross-architecture generalizability. These methods usually require repeated sampling, leading to inference inefficiency, and are restricted to VLA models with discrete action outputs, making them difficult to apply to continuous action spaces. To address this issue, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration, outperforming existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.
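The idea of scoring each step from frozen features in a single pass, conditioned on rollout phase, can be illustrated with a deliberately simple stand-in. The sketch below is not the authors' confidence head: it swaps in a Gaussian one-class model (Mahalanobis distance) fitted on features from nominal rollouts, and the sinusoidal step encoding, the class name `OneClassHead`, and the use of random arrays in place of real VLA features are all assumptions made here.

```python
import numpy as np

def step_encoding(step, horizon, dim=4):
    """Sinusoidal encoding of normalized rollout progress (assumed design)."""
    phase = step / max(horizon - 1, 1)
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(np.pi * freqs * phase),
                           np.cos(np.pi * freqs * phase)])

class OneClassHead:
    """Gaussian one-class scorer over [frozen feature, step encoding].

    A stand-in for a learned lightweight head; higher score = more anomalous.
    """

    def _join(self, feats, steps, horizon):
        enc = np.stack([step_encoding(s, horizon) for s in steps])
        return np.hstack([np.asarray(feats, dtype=float), enc])

    def fit(self, feats, steps, horizon):
        # Fit only on nominal data: this is what makes the model one-class.
        x = self._join(feats, steps, horizon)
        self.horizon = horizon
        self.mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-3 * np.eye(x.shape[1])  # ridge for stability
        self.prec = np.linalg.inv(cov)
        return self

    def score(self, feats, steps):
        # Squared Mahalanobis distance per step; no repeated sampling needed.
        d = self._join(feats, steps, self.horizon) - self.mean
        return np.einsum("ij,jk,ik->i", d, self.prec, d)

# Usage with placeholder features (random arrays standing in for VLA hidden states).
rng = np.random.default_rng(0)
horizon = 50
steps = rng.integers(0, horizon, size=200)
feats = rng.normal(size=(200, 8))
head = OneClassHead().fit(feats, steps, horizon)
scores = head.score(feats + 5.0, steps)  # shifted features should score as anomalous
```

Raw anomaly scores like these are not probabilities; mapping them to a success confidence would still require a separate post-hoc calibration step fitted on held-out rollouts with known outcomes.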
Problem

Research questions and friction points this paper is trying to address.

confidence estimation
Vision-Language-Action models
computational efficiency
continuous action spaces
cross-architecture generalizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models
confidence estimation
one-class discrimination
continuous action space
inference efficiency
Dehao Huang
Southern University of Science and Technology
Robot Grasping and Manipulation
Aoxiang Gu
Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Chengjie Zhang
Master, Southern University of Science and Technology
Robotic Manipulation, Human-Robot Interaction, Signal Processing
Bolin Zou
Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Wenlong Dong
Southern University of Science and Technology
Robotics, Perception
Zilang Cen
Zhongguancun Academy, Beijing, China; National Cybersecurity Academy, Wuhan University, Wuhan, China
Yue Wang
Zhongguancun Academy, Beijing, China
Hong Zhang
Chair Professor, SUSTech; Professor Emeritus, University of Alberta
robotics, computer vision, image processing