What Can RL Bring to VLA Generalization? An Empirical Study

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models suffer from error accumulation and limited generalization under distribution shift when trained only via supervised fine-tuning (SFT). Method: This work presents the first systematic investigation into how reinforcement learning (RL) enhances VLA generalization. We introduce a multi-dimensional generalization benchmark spanning visual, semantic, and execution dimensions, and propose a lightweight, efficient PPO-based fine-tuning framework, comparing it against SFT as well as LLM-derived alignment algorithms such as DPO and GRPO. Contribution/Results: Experiments demonstrate that PPO significantly improves semantic understanding and execution robustness while matching SFT's visual robustness; its advantage stems from a trial-and-error, goal-driven optimization paradigm. Our results empirically validate RL, and PPO in particular, as effective and superior to SFT for enhancing VLA generalization in real-world embodied settings.
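To make the benchmark's structure concrete, below is a minimal Python sketch of how per-dimension success rates could be aggregated across visual, semantic, and execution perturbation suites. The suite names, the `run_episode` helper, and the policy interface are hypothetical illustrations, not the benchmark's actual API.

```python
# Illustrative sketch only: organizing evaluation tasks by the three
# generalization dimensions the benchmark spans. All task names and the
# rollout interface are hypothetical placeholders.
from collections import defaultdict

GENERALIZATION_SUITES = {
    "visual": ["unseen_table_texture", "changed_lighting", "added_distractors"],
    "semantic": ["unseen_object_noun", "paraphrased_instruction"],
    "execution": ["shifted_object_pose", "perturbed_initial_state"],
}

def run_episode(policy, task):
    """Placeholder rollout: the policy is called on the task and returns True on success."""
    return policy(task)

def evaluate(policy, n_episodes=50):
    """Return the mean success rate per generalization dimension."""
    per_dim = defaultdict(list)
    for dimension, tasks in GENERALIZATION_SUITES.items():
        for task in tasks:
            wins = sum(run_episode(policy, task) for _ in range(n_episodes))
            per_dim[dimension].append(wins / n_episodes)
    return {dim: sum(rates) / len(rates) for dim, rates in per_dim.items()}

# Usage with a dummy policy that "succeeds" on every task.
print(evaluate(lambda task: True))
```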

📝 Abstract
Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io
Problem

Research questions and friction points this paper is trying to address.

Does RL fine-tuning improve VLA generalization beyond supervised fine-tuning (SFT)?
How do RL- and SFT-trained VLAs compare under a systematic, multi-dimensional evaluation?
Can PPO enhance semantic understanding and execution robustness in VLAs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL fine-tuning enhances VLA generalization
PPO outperforms DPO and GRPO
Efficient PPO training recipe developed (see the sketch below)
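As a rough illustration of the trial-and-error, goal-driven optimization paradigm described above, here is a minimal PPO clipped-surrogate update written against a toy action head. The `VLAActionHead` module, tensor shapes, and hyperparameters are assumptions for the sketch and do not reproduce the paper's actual training recipe.

```python
# Minimal sketch of a PPO-style fine-tuning step, assuming a small trainable
# action head on top of a VLA backbone. All module and tensor names are
# hypothetical placeholders, not the paper's implementation.
import torch
import torch.nn as nn

class VLAActionHead(nn.Module):
    """Toy stand-in for the trainable action head of a VLA policy."""
    def __init__(self, feat_dim=512, n_action_bins=256):
        super().__init__()
        self.policy = nn.Linear(feat_dim, n_action_bins)  # action logits
        self.value = nn.Linear(feat_dim, 1)                # value estimate

    def forward(self, obs_features):
        return self.policy(obs_features), self.value(obs_features).squeeze(-1)

def ppo_step(head, optimizer, obs_features, actions, old_log_probs,
             advantages, returns, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """One clipped-surrogate PPO update on a batch of rollout data."""
    logits, values = head(obs_features)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    # Probability ratio between the current policy and the rollout-time policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    value_loss = nn.functional.mse_loss(values, returns)
    entropy = dist.entropy().mean()

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on random tensors standing in for rollout data.
head = VLAActionHead()
opt = torch.optim.Adam(head.parameters(), lr=3e-4)
feats = torch.randn(32, 512)
acts = torch.randint(0, 256, (32,))
with torch.no_grad():
    old_logits, _ = head(feats)
    old_lp = torch.distributions.Categorical(logits=old_logits).log_prob(acts)
adv, ret = torch.randn(32), torch.randn(32)
ppo_step(head, opt, feats, acts, old_lp, adv, ret)
```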
Authors
Jijia Liu (Tsinghua University)
Feng Gao (Tsinghua University)
Bingwen Wei (Tsinghua University)
Xinlei Chen (Tsinghua University)
Qingmin Liao (Tsinghua University)
Yi Wu (Tsinghua University)
Chaoyang Yu (Tsinghua University)
Yu Wang (Tsinghua University)