🤖 AI Summary
This work addresses the challenge that existing vision-language-action (VLA) models often fail to reliably capture transient high-risk signals in continuous control tasks, because such signals are typically diluted by mean-aggregated uncertainty estimates. To overcome this limitation, the authors propose a unified uncertainty quantification framework that preserves transient risk through sliding-window max-pooling. The approach further incorporates motion-aware stability weighting and degree-of-freedom (DoF) adaptive calibration to emphasize critical failure moments. Evaluated on the LIBERO benchmark, the method significantly improves failure prediction accuracy, yielding more reliable early-warning signals that effectively support human-robot collaborative intervention.
📝 Abstract
Vision-Language-Action (VLA) models enable general-purpose robotic policies by mapping visual observations and language instructions to low-level actions, but they often lack reliable introspection. A common practice is to compute a token-level uncertainty signal and take its mean over a rollout. However, mean aggregation can dilute short-lived but safety-critical uncertainty spikes in continuous control. In particular, successful rollouts may contain localized high-entropy segments due to benign noise or non-critical micro-adjustments, while failure rollouts can appear low-entropy for most timesteps and only exhibit brief spikes near the onset of failure. We propose a unified uncertainty quantification approach for predicting rollout success versus failure that (1) uses max-based sliding-window pooling to preserve transient risk signals, (2) applies motion-aware stability weighting to emphasize high-frequency action oscillations associated with unstable behaviors, and (3) performs degree-of-freedom (DoF) adaptive calibration via Bayesian Optimization to prioritize kinematically critical axes. Experiments on the LIBERO benchmark show that our method substantially improves failure prediction accuracy and yields more reliable signals for failure detection, which can support downstream human-in-the-loop interventions.
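To make the intuition concrete, the scoring pipeline described above might be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `rollout_risk_score`, the second-difference oscillation proxy, the spike-amplification factor `alpha`, the window length, and the uniform default `dof_weights` are all illustrative assumptions (the paper tunes per-DoF weights via Bayesian Optimization rather than fixing them).

```python
import numpy as np

def rollout_risk_score(entropies, actions, window=5, alpha=0.5, dof_weights=None):
    """Score a rollout's failure risk from per-timestep token entropies.

    Illustrative sketch of the three components described in the abstract:
    motion-aware stability weighting, max-based sliding-window pooling,
    and per-DoF weighting (fixed here; calibrated in the paper).
    """
    entropies = np.asarray(entropies, dtype=float)   # shape (T,)
    actions = np.asarray(actions, dtype=float)       # shape (T, D)
    T, D = actions.shape
    if dof_weights is None:
        # Hypothetical uniform default; the paper selects these via
        # Bayesian Optimization to prioritize kinematically critical axes.
        dof_weights = np.ones(D) / D

    # Motion-aware stability weighting: the absolute second difference of
    # the action sequence serves as a proxy for high-frequency oscillation,
    # up-weighting entropy at timesteps with unstable motion.
    accel = np.abs(np.diff(actions, n=2, axis=0)) @ dof_weights  # (T-2,)
    accel = np.pad(accel, (2, 0))                                # align to timesteps
    weights = 1.0 + alpha * accel / (accel.max() + 1e-8)
    weighted = entropies * weights

    # Max-based sliding-window pooling: keep the peak inside each window so
    # that short-lived spikes survive aggregation instead of being averaged out.
    pooled = np.array([weighted[t:t + window].max()
                       for t in range(T - window + 1)])
    return float(pooled.mean())
```

Under this sketch, a rollout that is calm except for one brief entropy spike scores markedly higher than a uniformly calm one, whereas plain mean aggregation would nearly erase the difference.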