🤖 AI Summary
This work addresses the high computational overhead of existing Vision-Language-Action (VLA) models in embodied control and the limited adaptability of conventional action chunking methods, which rely on open-loop execution and are prone to error accumulation. The authors propose SV-VLA, a framework that introduces speculative verification into VLA-based control for the first time. SV-VLA employs a heavyweight VLA macro-planner to generate action chunks along with contextual information, while a lightweight closed-loop verifier continuously checks execution against up-to-date observations and triggers replanning only when necessary. By integrating efficient open-loop planning with robust closed-loop validation, the approach achieves both computational efficiency and reliable execution in dynamic environments. Experimental results demonstrate that SV-VLA significantly enhances system robustness while preserving the efficiency inherent to action chunking.
📝 Abstract
Vision-Language-Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, this performance comes at a high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open-loop execution. Although effective for reducing computation, open-loop execution is sensitive to environmental changes and prone to error accumulation due to the lack of closed-loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV-VLA), a framework that combines efficient open-loop long-horizon planning with lightweight closed-loop online verification. Specifically, SV-VLA uses a heavy VLA as a low-frequency macro-planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed-loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments. Code is available at https://github.com/edsad122/SV-VLA.
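The plan-then-verify loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the code from the SV-VLA repository: the names `Plan`, `run_sv_loop`, `planner`, and `verifier_ref` are hypothetical, and the L2 distance with a fixed `threshold` is an assumed stand-in for whatever comparison the actual verifier uses. The sketch also assumes that the first action of a freshly planned chunk is always executed, so the loop makes progress even when the verifier disagrees.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Action = Sequence[float]


@dataclass
class Plan:
    """Output of the heavy planner (hypothetical container)."""
    actions: List[Action]  # the action chunk
    context: Any           # planning context handed to the verifier


def l2_distance(a: Action, b: Action) -> float:
    """Euclidean distance between two actions (assumed deviation metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def run_sv_loop(
    get_obs: Callable[[], Any],
    step_env: Callable[[Action], None],
    planner: Callable[[Any], Plan],
    verifier_ref: Callable[[Any, Any], Action],
    threshold: float,
    max_steps: int,
) -> Tuple[int, int]:
    """Run chunked control with per-step verification.

    Returns (environment steps executed, number of heavy planner calls).
    """
    steps = 0
    planner_calls = 0
    obs = get_obs()
    while steps < max_steps:
        # Low-frequency call to the heavy macro-planner.
        plan = planner(obs)
        planner_calls += 1
        for i, planned in enumerate(plan.actions):
            if steps >= max_steps:
                break
            if i > 0:
                # Lightweight closed-loop check against the latest observation.
                reference = verifier_ref(obs, plan.context)
                if l2_distance(planned, reference) > threshold:
                    break  # discard the rest of the chunk and replan
            step_env(planned)
            steps += 1
            obs = get_obs()
    return steps, planner_calls
```

When the verifier agrees with every planned action, the planner is called once per chunk; when it always disagrees, the loop degrades to one planner call per step, which corresponds to fully closed-loop control.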