🤖 AI Summary
Vision-language-action (VLA) models suffer from degraded robustness and generalization due to heterogeneous training data quality. Method: This paper proposes ReinboT, an end-to-end reinforcement learning framework that integrates a dense reward prediction module into the VLA architecture, enabling explicit modeling of the data quality distribution and optimization of long-horizon returns. ReinboT combines offline reinforcement learning, multimodal vision-language encoding, and end-to-end action generation. Contribution/Results: Evaluated on the CALVIN benchmark with mixed-quality data, ReinboT achieves state-of-the-art performance, significantly improving few-shot adaptability and out-of-distribution generalization. Its efficacy is further validated on real-world robotic manipulation tasks, demonstrating practical applicability and enhanced reliability under data quality uncertainty.
📝 Abstract
Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. Offline Reinforcement Learning (RL), by contrast, excels at learning robust policies from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. This dense return prediction capability enables the robot to generate more robust actions oriented toward maximizing future returns. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.
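The core idea of conditioning a policy on "future benefit" can be illustrated with a minimal sketch: compute discounted returns-to-go from per-step rewards, which then serve as dense prediction targets distinguishing high- from low-quality trajectories. This is an illustrative assumption; ReinboT's actual reward design, discounting, and prediction head may differ.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Discounted cumulative return from each timestep onward.

    R_t = r_t + gamma * R_{t+1}. Dense targets like these let an
    auxiliary head predict future return alongside actions
    (hypothetical formulation, not the paper's exact one).
    """
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# A successful trajectory (terminal reward 1.0) accrues a higher
# return at every step than a failed one (all zeros), so a return
# predictor can separate good from mediocre data in a mixed dataset.
good = returns_to_go(np.array([0.0, 0.0, 1.0]), gamma=0.9)
bad = returns_to_go(np.array([0.0, 0.0, 0.0]), gamma=0.9)
# good → [0.81, 0.9, 1.0], bad → [0.0, 0.0, 0.0]
```

In a mixed-quality dataset, trajectories with higher returns-to-go implicitly mark higher-quality demonstrations, which is why conditioning on (or predicting) returns lets the model favor better behavior at inference time.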