Dynamic Execution Commitment of Vision-Language-Action Models

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models typically execute a fixed number of actions from each predicted chunk. Because a fixed horizon ignores how prediction reliability varies with the current state, performance is brittle in dynamic or out-of-distribution scenarios. This work proposes A3 (Adaptive Action Acceptance), which formulates execution commitment as a self-speculative prefix verification problem. A3 uses group sampling to compute consensus scores over action trajectories, then determines a reliable action prefix length by enforcing two properties: consensus-ordered conditional invariance and prefix-closed sequential consistency. The approach selects execution lengths adaptively, without manual horizon tuning, and is reported to achieve a better trade-off between task success rate and inference throughput across diverse VLA models and benchmarks.

📝 Abstract

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
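The pipeline described in the abstract (group sampling, consensus scoring, draft selection, consensus-ordered verification, longest-prefix acceptance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the consensus metric, the draft-selection rule, or the re-decoding interface. The sketch assumes that consensus is the negative mean pairwise Euclidean distance across samples, that the draft is the medoid sample, and that a hypothetical callable `redecode(draft, trusted_steps, t)` stands in for the model re-decoding step `t` conditioned on trusted draft actions. The names `tol` and `num_anchors` are likewise illustrative parameters.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]   # one low-level action, e.g. joint deltas
Chunk = List[Action]       # one sampled action chunk of length H


def _dist(a: Action, b: Action) -> float:
    """Euclidean distance between two actions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def consensus_scores(samples: List[Chunk]) -> List[float]:
    """Per-step consensus across the sampled group (assumed metric):
    negative mean pairwise distance, so higher means more agreement."""
    n, horizon = len(samples), len(samples[0])
    if n < 2:
        raise ValueError("need at least two samples to measure consensus")
    pairs = n * (n - 1) / 2
    return [
        -sum(_dist(samples[i][t], samples[j][t])
             for i in range(n) for j in range(i + 1, n)) / pairs
        for t in range(horizon)
    ]


def select_draft(samples: List[Chunk]) -> Chunk:
    """Pick the medoid: the sample closest to all others (assumed rule)."""
    def cost(i: int) -> float:
        return sum(_dist(a, b)
                   for j, other in enumerate(samples) if j != i
                   for a, b in zip(samples[i], other))
    return samples[min(range(len(samples)), key=cost)]


def accepted_prefix_length(
    draft: Chunk,
    scores: List[float],
    redecode: Callable[[Chunk, List[int], int], Action],  # hypothetical model hook
    tol: float = 0.05,       # assumed agreement tolerance
    num_anchors: int = 1,    # assumed: top-consensus steps trusted as-is
) -> int:
    """Verify steps from high to low consensus, then keep the longest
    contiguous run of verified steps starting at step 0."""
    order = sorted(range(len(draft)), key=lambda t: scores[t], reverse=True)
    trusted = set(order[:num_anchors])
    for t in order[num_anchors:]:
        # Conditional invariance check: does the draft action survive
        # re-decoding conditioned on the higher-consensus trusted steps?
        candidate = redecode(draft, sorted(trusted), t)
        if _dist(candidate, draft[t]) <= tol:
            trusted.add(t)
    # Prefix closure: a verified step after an unverified one is not executed.
    length = 0
    while length < len(draft) and length in trusted:
        length += 1
    return length


# Toy usage: three sampled chunks that agree on the first two steps only.
toy_samples = [
    [[0.0], [0.1], [0.2], [0.3]],
    [[0.0], [0.1], [0.5], [0.9]],
    [[0.0], [0.1], [0.1], [-0.2]],
]


def toy_redecode(draft: Chunk, trusted_steps: List[int], t: int) -> Action:
    """Stand-in for the model: reproduces early steps, drifts on later ones."""
    return list(draft[t]) if t < 2 else [x + 1.0 for x in draft[t]]


toy_draft = select_draft(toy_samples)
toy_scores = consensus_scores(toy_samples)
toy_horizon = accepted_prefix_length(toy_draft, toy_scores, toy_redecode)
print(toy_horizon)
```

In this toy setup the accepted horizon is 2: the first two steps are verified, and the later steps are rejected because the stand-in re-decoder disagrees with the draft. A real controller would execute only that prefix and then query the model again. Whether a minimum of one action should always be executed is a design choice the abstract does not address.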

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

action chunking

execution commitment

predictive reliability

dynamic environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Action Acceptance

Vision-Language-Action Models

Self-Speculative Verification