Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving both high-fidelity action generation and robust out-of-distribution generalization in vision-language-action tasks. We propose the Action-Draft-and-Verify (ADV) framework, which combines the semantic priors of autoregressive models with the precise control of diffusion models. Specifically, a diffusion-based action expert generates multiple candidate action chunks in parallel, and a vision-language model reranks them in a single forward pass and selects one action chunk. With backbone architecture, training data, and action-chunk length held identical to a diffusion-based baseline, the method improves task success rate by 4.3 percentage points in simulation and by 19.7 percentage points in the real world, and the only added cost is the single reranking pass.

📝 Abstract
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, at the cost of a single VLM forward pass for reranking.
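The draft-and-verify selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the VLM scores candidates with a "perplexity-style metric", so the sketch assumes the standard perplexity definition (the exponential of the negative mean token log-probability) and assumes each candidate action chunk has already been turned into a sequence of per-token log-probabilities by the VLM. The function names `perplexity` and `select_action_chunk` and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability.

    Assumption: the paper's "perplexity-style metric" may differ in detail.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_action_chunk(
    logprobs_per_candidate: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the lowest-perplexity candidate and all scores.

    In a real system the log-probabilities for all candidates would come
    from one batched VLM forward pass; here they are plain Python lists.
    """
    scores = [perplexity(lp) for lp in logprobs_per_candidate]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy log-probabilities for three drafted candidates (made-up values).
toy_logprobs = [
    [-1.2, -0.9, -1.5],  # candidate 0
    [-0.3, -0.4, -0.2],  # candidate 1: most likely under the scorer
    [-2.0, -1.0, -0.6],  # candidate 2
]
best_index, scores = select_action_chunk(toy_logprobs)
print(best_index, [round(s, 3) for s in scores])
```

Lower perplexity means the scorer finds the candidate more plausible, so the sketch picks the minimum. How the actual method tokenizes continuous action chunks and batches them into one forward pass is not stated in this listing.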
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
diffusion models
autoregressive generation
out-of-distribution generalization
action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-Draft-and-Verify
Vision-Language-Action
diffusion action expert
self-verifying framework
perplexity-style scoring
Chen Zhao
School of Information, Renmin University of China, Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Zhuoran Wang
Professor, University of Electronic Science and Technology of China
photonics, electronics
Haoyang Li
School of Information, Renmin University of China, Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Shifeng Bao
Beijing University of Posts and Telecommunications
Guanlin Li
School of Information, Renmin University of China, Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Youhe Feng
School of Information, Renmin University of China
Yang Li
Renmin University of China
Jie Tang
UW Madison
Computed Tomography
Jing Zhang
Renmin University of China
large model alignment, model compression & inference optimization, data intelligence