SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

📅 2025-09-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-Language-Action (VLA) models face two key challenges: the scarcity of human demonstration trajectories and poor cross-task generalization. To address these, this paper introduces SimpleVLA-RL, an end-to-end reinforcement learning framework tailored for VLA models that reduces reliance on large-scale expert demonstrations. Built on the veRL framework, SimpleVLA-RL combines VLA-specific trajectory sampling, multi-environment parallel rendering, optimized loss computation, and dedicated exploration-enhancing strategies to substantially improve long-horizon action planning. Notably, RL training elicits a novel behavior, "pushcut", that never appeared in the earlier training data. Applied to OpenVLA-OFT, SimpleVLA-RL achieves state-of-the-art performance on LIBERO, surpasses the π₀ baseline on RoboTwin 1.0/2.0, and delivers markedly stronger real-world deployment performance than supervised fine-tuning.
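The summary describes outcome-driven RL without dense supervision. As a rough illustration of how a binary task-success reward could drive policy updates, here is a minimal group-normalized advantage computation in the style of GRPO-based RLVR pipelines; the function name and the exact reward/advantage formulation are assumptions, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(successes, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical GRPO-style advantage: normalize binary task-success
    rewards within a group of rollouts sampled for the same task."""
    rewards = np.asarray(successes, dtype=np.float64)  # 1.0 if the rollout solved the task, else 0.0
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 8 rollouts of one instruction, 3 of which succeeded:
adv = group_relative_advantages([1, 0, 0, 1, 0, 1, 0, 0])
```

Under this scheme, successful rollouts receive positive advantage and are reinforced while failed ones are suppressed, so no per-step reward engineering is required.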

📝 Abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: can RL similarly improve the long-horizon step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms π₀ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers action patterns never seen in the preceding training process. GitHub: https://github.com/PRIME-RL/SimpleVLA-RL
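To make the rollout side of the abstract concrete, below is a minimal sketch of sampling trajectories from several environment instances. Everything here is a hypothetical stand-in: `make_env`, `policy`, the `reset()`/`step()` interface, and treating episode termination as task success; the actual framework renders many environments in parallel rather than stepping them serially.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Trajectory:
    observations: List = field(default_factory=list)
    actions: List = field(default_factory=list)
    success: bool = False                    # binary outcome reward for RL

def collect_rollouts(policy: Callable, make_env: Callable,
                     n_envs: int, horizon: int) -> List[Trajectory]:
    """Sample one trajectory per environment instance (serial sketch)."""
    trajectories = []
    for seed in range(n_envs):
        env = make_env(seed)                 # hypothetical factory with reset()/step()
        obs = env.reset()
        traj = Trajectory()
        for _ in range(horizon):
            action = policy(obs)             # VLA forward pass: observation + instruction -> action
            traj.observations.append(obs)
            traj.actions.append(action)
            obs, done = env.step(action)
            if done:                         # assumption: termination signals task success
                traj.success = True
                break
        trajectories.append(traj)
    return trajectories
```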
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of human-operated robotic trajectories for VLA training
Improving generalization of VLA models under distribution shifts
Enhancing long-horizon action planning through reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient RL framework for VLA models
VLA-specific trajectory sampling and parallelization
Exploration-enhancing strategies for improved generalization (a minimal sketch follows this list)
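The page does not detail the exploration-enhancing strategies, so the sketch below shows two plausible ingredients under stated assumptions: temperature-scaled sampling over discretized action tokens during rollouts, and a dynamic-sampling-style filter that drops rollout groups whose trajectories all succeed or all fail (such groups yield zero group-relative advantage and hence no gradient). Both are common RLVR techniques, not confirmed details of the paper.

```python
import numpy as np

def sample_action_token(logits: np.ndarray, temperature: float = 1.6) -> int:
    """Temperature-scaled sampling over action-token logits; a higher
    temperature (assumed value) widens exploration during RL rollouts,
    while evaluation would typically decode greedily."""
    z = logits / temperature
    p = np.exp(z - z.max())                  # numerically stable softmax
    p /= p.sum()
    return int(np.random.default_rng().choice(len(p), p=p))

def is_informative_group(successes) -> bool:
    """Dynamic-sampling-style filter (assumption): keep a rollout group
    only if it mixes successes and failures, so the group-relative
    advantage is non-zero and contributes a learning signal."""
    return 0 < sum(successes) < len(successes)
```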
👥 Authors

Haozhan Li
Tsinghua University
LLM RL, VLA RL

Yuxin Zuo
Tsinghua University

Jiale Yu
University of Science and Technology of China

Yuhao Zhang
Shanghai Jiao Tong University

Zhaohui Yang
Shanghai Jiao Tong University

Kaiyan Zhang
Tsinghua University
Foundation Model, Collective Intelligence, Scientific Intelligence

Xuekai Zhu
Shanghai Jiao Tong University
Synthetic Data, Reasoning, Language Model

Yuchen Zhang
Peking University

Tianxing Chen
The University of Hong Kong

Ganqu Cui
Shanghai AI Lab
LLM Alignment, Reinforcement Learning

Dehui Wang
Shanghai Jiao Tong University

Dingxiang Luo
Shanghai Jiao Tong University

Yuchen Fan
Shanghai AI Laboratory & Shanghai Jiao Tong University
NLP, Large Language Models, Evaluation

Youbang Sun
Assistant Researcher, Tsinghua University; Northeastern University; Texas A&M University
Distributed Optimization, Multi-Agent RL, Riemannian Optimization, Federated Learning

Jia Zeng
Shanghai AI Lab

Jiangmiao Pang
Shanghai AI Lab

Shanghang Zhang
Peking University
Embodied AI, Foundation Models

Yu Wang
Tsinghua University

Yao Mu
Shanghai AI Lab, Shanghai Jiao Tong University

Bowen Zhou
Tsinghua University, Shanghai AI Lab

Ning Ding
Tsinghua University, Shanghai AI Lab