🤖 AI Summary
Vision-language-action (VLA) models suffer from degraded robustness and generalization due to heterogeneous training data quality. Method: This paper proposes ReinboT, an end-to-end reinforcement learning framework that integrates a dense reward prediction module into the VLA architecture, enabling explicit modeling of the data quality distribution and optimization of long-horizon returns. ReinboT combines offline reinforcement learning, multimodal vision-language encoding, and end-to-end action generation. Contribution/Results: Evaluated on the CALVIN benchmark with mixed-quality data, ReinboT achieves state-of-the-art performance, significantly improving few-shot adaptability and out-of-distribution generalization. Its efficacy is further validated on real-world robotic manipulation tasks, demonstrating practical applicability and enhanced reliability under data quality uncertainty.
📝 Abstract
Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. Offline Reinforcement Learning (RL), by contrast, excels at learning robust policies from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. This dense return prediction capability enables the robot to generate more robust actions oriented toward maximizing future returns. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.
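The core idea of conditioning a policy on "future benefit" can be illustrated with a minimal sketch: compute discounted returns-to-go from per-step rewards, which then serve as dense prediction targets distinguishing high- from low-quality trajectories. This is an illustrative assumption; ReinboT's actual reward design, discounting, and prediction head may differ.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Discounted cumulative return from each timestep onward.

    R_t = r_t + gamma * R_{t+1}. Dense targets like these let an
    auxiliary head predict future return alongside actions
    (hypothetical formulation, not the paper's exact one).
    """
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# A successful trajectory (terminal reward 1.0) accrues a higher
# return at every step than a failed one (all zeros), so a return
# predictor can separate good from mediocre data in a mixed dataset.
good = returns_to_go(np.array([0.0, 0.0, 1.0]), gamma=0.9)
bad = returns_to_go(np.array([0.0, 0.0, 0.0]), gamma=0.9)
# good → [0.81, 0.9, 1.0], bad → [0.0, 0.0, 0.0]
```

In a mixed-quality dataset, trajectories with higher returns-to-go implicitly mark higher-quality demonstrations, which is why conditioning on (or predicting) returns lets the model favor better behavior at inference time.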