VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language-action (VLA) models lack explicit multi-step reasoning, neglect affordance and geometric constraints, and rarely reinforce reasoning quality during post-training. To address these limitations, we propose VLA-R1, a reasoning-enhanced VLA that combines Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to jointly optimize reasoning and action execution. Verifiable rewards for region alignment, trajectory consistency, and output formatting strengthen physical-constraint alignment and multi-step reasoning. We also introduce VLA-CoT-13K, a high-quality dataset whose chain-of-thought supervision is aligned with affordance and trajectory annotations. Evaluations in in-domain and out-of-domain settings, in simulation, and on real-robot platforms show that VLA-R1 achieves better generalization and real-world performance than prior VLA methods.

📝 Abstract

Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
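
The abstract names three verifiable reward terms (region alignment, trajectory consistency, output formatting) without giving their formulas. The following is a minimal sketch of what two such rule-based rewards could look like, not the authors' implementation. It assumes that region alignment is scored with plain box IoU and that the output follows a hypothetical `<think>...</think><answer>...</answer>` template; both choices, and the names `box_iou` and `format_reward`, are illustrative assumptions. The trajectory term is omitted because its metric is not stated here.

```python
import re


def box_iou(pred, target):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).

    Assumption: IoU stands in for the region-alignment reward.
    """
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


# Hypothetical output template: reasoning first, then the final answer.
_TEMPLATE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)


def format_reward(text):
    """Return 1.0 if the whole output matches the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(text.strip()) else 0.0
```

Both rewards are computed by fixed rules against ground-truth annotations, so they can be checked without a learned reward model, which is the sense in which they are "verifiable".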

Problem

Research questions and friction points this paper is trying to address.

VLA models lack explicit step-by-step reasoning processes

Post-training pipelines rarely reinforce reasoning quality effectively

Models ignore affordance constraints and geometric relations in actions

Innovation

Methods, ideas, or system contributions that make the work stand out.

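Integrates Reinforcement Learning from Verifiable Rewards (RLVR) into VLA post-training

Uses Group Relative Policy Optimization (GRPO) to optimize both reasoning and execution

Develops VLA-CoT-13K, a chain-of-thought dataset aligned with affordance and trajectory annotations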
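
As a rough illustration of the GRPO idea listed above: several responses are sampled for the same prompt, each is scored by the reward function, and each response's advantage is its reward normalized against the group's mean and standard deviation, so no separate value network is needed. The sketch below shows only that normalization step. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, calling `group_relative_advantages([1.0, 0.0, 1.0, 0.0])` gives positive advantages to the two higher-reward responses and negative advantages to the other two. When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.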

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling