VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models lack explicit multi-step reasoning, neglect affordance and geometric constraints, and show limited post-training improvement in reasoning quality. To address these limitations, we propose VLA-R1, which applies Reinforcement Learning from Verifiable Rewards (RLVR) to jointly optimize reasoning and action execution. We introduce VLA-CoT-13K, a high-quality chain-of-thought-annotated VLA dataset, and combine Group Relative Policy Optimization (GRPO) with verifiable rewards for region alignment and trajectory consistency to systematically strengthen physical-constraint alignment and multi-step reasoning. Extensive experiments across in-domain and cross-domain settings, simulation environments, and real-world robotic platforms demonstrate that our approach significantly improves reasoning robustness and manipulation accuracy, consistently outperforming state-of-the-art methods.
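As background on how GRPO turns scalar rewards into a learning signal, here is a minimal sketch of the group-relative advantage computation; the function name, group size, and epsilon are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each sampled response is scored
    against the mean/std of its own group, so no learned value
    baseline (critic) is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for a group of 4 responses sampled for the same
# instruction, each scored by a verifiable reward function.
group_rewards = np.array([0.9, 0.4, 0.7, 0.2])
print(grpo_advantages(group_rewards))  # above-average samples get positive advantage
```

Normalizing within the group is what lets simple verifiable checks, rather than a learned reward model, drive the policy update.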

📝 Abstract
Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
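To make the reward design concrete, the sketch below shows one plausible way to combine the three verifiable reward terms the abstract names (region alignment, trajectory consistency, output formatting). All function names, the `<think>`/`<answer>` tag convention, the exponential distance mapping, and the weights are assumptions for illustration, not the paper's exact formulation.

```python
import re
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes:
    a standard region-alignment score in [0, 1]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def trajectory_reward(pred, ref):
    """Trajectory consistency: mean pointwise distance between predicted
    and reference waypoints, mapped into (0, 1] via exp(-d)."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    return float(np.exp(-np.linalg.norm(pred - ref, axis=-1).mean()))

def format_reward(text):
    """1.0 if the output wraps reasoning and answer in the expected tags
    (tag names assumed here)."""
    ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", text, re.S)
    return 1.0 if ok else 0.0

def verifiable_reward(pred_box, gt_box, pred_traj, gt_traj, text,
                      w=(0.4, 0.4, 0.2)):
    """Weighted sum of the three terms; weights are illustrative."""
    return (w[0] * iou(pred_box, gt_box)
            + w[1] * trajectory_reward(pred_traj, gt_traj)
            + w[2] * format_reward(text))
```

Because every term is computed from annotations or string checks, the reward is verifiable: it needs no learned reward model and cannot drift the way a learned critic can.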
Problem

Research questions and friction points this paper is trying to address.

VLA models lack explicit step-by-step reasoning processes
Post-training pipelines rarely reinforce reasoning quality effectively
Models ignore affordance constraints and geometric relations in actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Reinforcement Learning from Verifiable Rewards
Uses Group Relative Policy Optimization for reasoning
Develops VLA-CoT-13K, a chain-of-thought dataset with affordance annotations (a hypothetical record is sketched below)
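For intuition about what chain-of-thought supervision aligned with affordance and trajectory annotations might look like, here is a hypothetical VLA-CoT-13K-style record; the schema and field names are invented for illustration, since the summary does not specify them.

```python
# Hypothetical record; the actual VLA-CoT-13K schema is not given here.
example_record = {
    "instruction": "Pick up the mug and place it on the coaster.",
    "affordance_box": [312, 148, 402, 240],  # graspable region (pixels)
    "trajectory": [[357, 194], [360, 150], [520, 130], [540, 210]],
    "chain_of_thought": (
        "The mug's handle is the graspable part; the coaster is to the "
        "right, so move over the mug, grasp the handle, lift, translate "
        "right, and lower onto the coaster."
    ),
}
```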
Authors
Angen Ye (GigaAI, CASIA)
Zeyu Zhang (GigaAI)
Boyuan Wang (Institute of Automation, Chinese Academy of Sciences)
Xiaofeng Wang (GigaAI, Tsinghua University)
Dapeng Zhang (CASIA)
Zheng Zhu (GigaAI)