🤖 AI Summary
Problem: Vision-Language-Action (VLA) models trained via supervised fine-tuning (SFT) suffer from compounding errors and limited generalization under distribution shift.
Method: This work presents the first systematic investigation into how reinforcement learning (RL) enhances VLA generalization. We introduce a multi-dimensional generalization benchmark spanning visual, semantic, and execution dimensions, and propose a lightweight, efficient PPO-based fine-tuning framework, comparing it against SFT as well as LLM-derived algorithms including DPO and GRPO.
Contribution/Results: Experiments demonstrate that PPO significantly improves semantic understanding and execution robustness while matching SFT's visual robustness. Its advantage stems from a trial-and-error, goal-driven optimization paradigm. Our results empirically validate RL, and PPO in particular, as an effective approach for enhancing VLA generalization in real-world embodied settings, outperforming both SFT and LLM-derived RL methods.
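The "trial-and-error, goal-driven optimization" behind PPO's advantage can be sketched with its standard clipped surrogate objective. This is a generic illustration of PPO's core loss, not the paper's actual VLA fine-tuning code; all names, shapes, and the clipping constant are assumptions.

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize).

    logp_new / logp_old: per-action log-probs under the current and
    behavior (rollout) policies; advantages: advantage estimates.
    Illustrative sketch only -- not the paper's implementation.
    """
    # Importance ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] bounds how far a single
    # update can move the policy from the one that collected the data.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (elementwise min) bound, negated for gradient descent.
    return -np.mean(np.minimum(unclipped, clipped))
```

With identical old and new policies the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts outside the clip range, the objective stops rewarding further movement, which is what keeps PPO's goal-driven updates stable during fine-tuning.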
📝 Abstract
Large Vision-Language-Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io