🤖 AI Summary
This work addresses the challenge of achieving both high-fidelity action generation and robust out-of-distribution generalization in vision–language–action tasks. We propose the Action-Draft-and-Verify framework, which integrates the semantic priors of autoregressive models with the precise control capabilities of diffusion models. Specifically, a diffusion-based action expert generates multiple candidate action chunks in parallel, and a vision–language model performs a single forward pass to rerank the candidates and select the optimal action sequence. Evaluated under identical backbone architectures, training data, and action-chunk lengths, our method improves task success rates by 4.3 percentage points in simulation and by a substantial 19.7 points in the real world, demonstrating efficient and robust action decision-making.
📝 Abstract
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, whereas auto-regressive generation tends to be slower and less accurate at low-level control. Yet the auto-regressive paradigm still provides complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, at the cost of only a single-pass VLM reranking step.
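The verify step can be sketched in a few lines. This is a minimal, illustrative mock, not the paper's implementation: it assumes the VLM exposes per-token log-probabilities for each tokenized candidate chunk (here faked by a `score_fn`), computes a perplexity-style score per candidate, and returns the lowest-perplexity chunk. All names are hypothetical.

```python
import math

def perplexity(logprobs):
    # Perplexity-style score: exp of the negative mean per-token log-probability.
    # Lower perplexity means the VLM finds the action chunk more plausible.
    return math.exp(-sum(logprobs) / len(logprobs))

def verify(candidates, score_fn):
    # score_fn stands in for the VLM: it maps a candidate action chunk to its
    # per-token log-probs. In ADV all candidates would be scored in one batched
    # forward pass; here we just score them one by one for clarity.
    scored = [(perplexity(chunk_logprobs), chunk)
              for chunk in candidates
              for chunk_logprobs in [score_fn(chunk)]]
    return min(scored, key=lambda pair: pair[0])[1]

# Toy example: three drafted chunks with hand-picked log-probs.
toy_scores = {"chunk_a": [-1.0, -1.0], "chunk_b": [-0.2, -0.3], "chunk_c": [-2.0]}
best = verify(list(toy_scores), toy_scores.__getitem__)
```

In this toy run, `chunk_b` has the highest mean log-probability and hence the lowest perplexity, so it is selected. The single-pass property in ADV comes from batching all candidates into one VLM forward call rather than looping.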