🤖 AI Summary
Vision-Language-Action (VLA) models face two key challenges: the scarcity of human demonstration trajectories and poor cross-task generalization. To address these, this paper introduces SimpleVLA-RL, an end-to-end reinforcement learning framework tailored for VLA models that reduces reliance on large-scale expert demonstrations. Built on veRL, the framework adds VLA-specific trajectory sampling, multi-environment parallel rendering, optimized loss computation, and exploration-enhancing strategies to substantially improve long-horizon action planning. Notably, SimpleVLA-RL elicits novel behaviors during RL training, such as a "pushcut" pattern never seen in the demonstration data. Applied to OpenVLA-OFT, SimpleVLA-RL achieves state-of-the-art performance on LIBERO, surpasses the π₀ baseline on RoboTwin 1.0 & 2.0, and significantly outperforms supervised fine-tuning in real-world deployment.
📝 Abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon, step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $π_0$ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers behavioral patterns beyond those seen in prior training. Github: https://github.com/PRIME-RL/SimpleVLA-RL