🤖 AI Summary
This work addresses execution failures and reasoning errors that low-quality actions cause in vision-language-action (VLA) systems and world models during deployment. The authors propose Pre-VLA, a unified runtime verification framework that assesses the validity of candidate action chunks before physical execution or world-model rollout. It uses a multimodal backbone with modality-aware pooling and a lightweight dual-branch head that predicts safety confidence and critic-derived advantage scores. To handle class imbalance and unstable boundary decisions, it is trained with a multi-task objective that combines focal classification, advantage regression, and soft-threshold calibration. At deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers resampling under a limited computation budget. On the LIBERO benchmark, the approach raises the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, verifies each action chunk in 183.9 ms on average, reduces task execution steps, and mitigates error accumulation in world-model rollouts.
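To make the focal classification term concrete, here is a minimal sketch of the standard binary focal loss, which down-weights easy examples so that a rare class is not swamped by the common one. It is not taken from the paper: the function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions, and the authors' exact formulation and hyperparameters are not given in this summary.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss for a single example (illustrative sketch).

    p     -- predicted probability of the positive class, in [0, 1]
    y     -- ground-truth label, 0 or 1
    gamma -- focusing parameter; gamma = 0 recovers weighted cross-entropy
    alpha -- weight on the positive class (assumed value, not from the paper)
    """
    # Clamp the probability so that log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the expression reduces to class-weighted cross-entropy. Larger values of `gamma` shift the loss toward hard, misclassified examples.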
📝 Abstract
While large vision-language-action (VLA) models and generative world models (WMs) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
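The filter-and-resample step under a fixed budget can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the authors' scheduler: the abstract does not specify the two modes, the threshold, or the fallback rule, so `select_chunk`, `sample`, `verify`, `conf_threshold`, and `max_samples` are all hypothetical names, and falling back to the highest-scoring rejected chunk is an assumed design choice.

```python
from typing import Callable, Optional, Tuple, TypeVar

Chunk = TypeVar("Chunk")


def select_chunk(
    sample: Callable[[], Chunk],
    verify: Callable[[Chunk], Tuple[float, float]],
    conf_threshold: float = 0.5,
    max_samples: int = 4,
) -> Tuple[Chunk, bool, int]:
    """Draw candidate action chunks until one passes verification.

    sample  -- hypothetical policy call that returns one candidate chunk
    verify  -- hypothetical verifier returning (safety_confidence, advantage)
    Returns (chunk, accepted, number_of_samples_drawn).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")

    best_chunk: Optional[Chunk] = None
    best_key: Optional[Tuple[float, float]] = None

    for attempt in range(1, max_samples + 1):
        chunk = sample()
        confidence, advantage = verify(chunk)

        # Accept the first chunk whose confidence clears the threshold.
        if confidence >= conf_threshold:
            return chunk, True, attempt

        # Otherwise remember the best rejected chunk as a fallback
        # (ranked by confidence, then advantage; an assumed tie-break).
        key = (confidence, advantage)
        if best_key is None or key > best_key:
            best_chunk, best_key = chunk, key

    # Budget exhausted: return the best rejected candidate, flagged as such.
    return best_chunk, False, max_samples
```

A caller would run this gate before passing a chunk to the robot or to the world model, and could treat `accepted == False` as a signal to stop or replan rather than execute the fallback.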