Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the "intention-action gap" in vision-language-action (VLA) models, which often generate actions misaligned with high-level instructions. To mitigate this, the authors propose a test-time verification framework that enhances semantic alignment: a vision-language model rephrases the instruction, action generation is jointly scaled to produce diverse candidates, and a novel contrastive verifier, CoVer, selects the optimal action. Key contributions include scaling laws for test-time verification, a "boot-time compute" strategy, and a hierarchical verification inference pipeline. On the SIMPLER benchmark, the approach improves in-distribution and out-of-distribution performance by 22% and 13%, respectively, and achieves a 45% gain in real-robot experiments. On PolaRiS, it boosts task progress and success rate by 14% and 9%, substantially outperforming methods that rely solely on scaling pretraining data.

📝 Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a vision-language model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
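The verification pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `rephrase`, `policy`, and `verifier` are hypothetical stand-ins for the VLM, the VLA policy, and the CoVer model, and the two scaling knobs `k` (rephrasings) and `n` (action samples per rephrasing) correspond to the jointly scaled dimensions in the paper.

```python
def rephrase(instruction, k):
    # Stand-in for the VLM: in the paper, a vision-language model produces
    # k semantically equivalent rephrasings, precomputable at boot time.
    return [f"{instruction} (rephrasing {i})" for i in range(k)]


def select_action(instruction, policy, verifier, k=4, n=8):
    """Jointly scale rephrased instructions (k) and sampled action
    candidates (n), then return the action whose (instruction, action)
    pair the verifier scores highest."""
    best_score, best_action = float("-inf"), None
    for rephrased in rephrase(instruction, k):
        # Stand-in for the VLA policy: draw n candidate action chunks.
        candidates = [policy(rephrased) for _ in range(n)]
        for action in candidates:
            # Stand-in for the contrastive verifier's alignment score.
            score = verifier(rephrased, action)
            if score > best_score:
                best_score, best_action = score, action
    return best_action, best_score
```

With toy callables in place of the real models, the selection logic can be exercised directly; the key property is that the returned action maximizes the verifier's score over all k×n candidates.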
Problem

Research questions and friction points this paper is trying to address.

vision-language-action alignment
intention-action gap
embodied instruction following
test-time verification
natural language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time verification
vision-language-action alignment
scaling law
contrastive verifier
boot-time compute