🤖 AI Summary
Vision-Language-Action (VLA) models face two key challenges: the scarcity of human demonstration trajectories and poor cross-task generalization. To address these, this paper introduces SimpleVLA-RL, an end-to-end reinforcement learning framework tailored for VLA models that reduces reliance on large-scale expert demonstrations. Built on veRL, the framework adds VLA-specific trajectory sampling, multi-environment parallel rendering, optimized loss computation, and exploration-enhancing strategies to substantially improve long-horizon action planning. Notably, SimpleVLA-RL elicits novel behaviors during RL training, such as a "pushcut" pattern never seen in the demonstration data. Applied to OpenVLA-OFT, SimpleVLA-RL achieves state-of-the-art performance on LIBERO, surpasses the π₀ baseline on RoboTwin 1.0 & 2.0, and significantly outperforms supervised fine-tuning in real-world deployment.
📝 Abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon, step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $π_0$ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers behavioral patterns beyond those seen in prior training. Github: https://github.com/PRIME-RL/SimpleVLA-RL