ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language-action (VLA) models suffer from degraded robustness and generalization due to heterogeneous training data quality. Method: This paper proposes ReinboT, an end-to-end VLA model trained with reinforcement learning that integrates a novel dense reward prediction module into the VLA architecture, enabling explicit modeling of the data quality distribution and optimization of long-horizon returns. ReinboT combines offline reinforcement learning, multimodal vision-language encoding, and end-to-end action generation. Contribution/Results: Evaluated on the CALVIN benchmark with mixed-quality data, ReinboT achieves state-of-the-art performance and significantly improves few-shot adaptability and out-of-distribution generalization. Its efficacy is further validated on real-world robotic manipulation tasks, demonstrating practical applicability and enhanced reliability under data quality uncertainty.

📝 Abstract
Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. On the other hand, offline Reinforcement Learning (RL) excels at learning robust policy models from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. The dense return prediction capability enables the robot to generate more robust decision-making actions, oriented towards maximizing future benefits. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.
Problem

Research questions and friction points this paper is trying to address.

Improving robot visual-language manipulation with reinforcement learning
Enhancing decision-making robustness in mixed-quality data scenarios
Achieving superior few-shot learning and generalization in real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates RL for maximizing cumulative rewards
Predicts dense returns for robust actions
Combines VLA with offline reinforcement learning
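
The dense-return idea listed above can be illustrated with a minimal sketch (my own example, not the paper's implementation): computing discounted return-to-go labels from per-step rewards, the kind of target a VLA policy could be trained to predict alongside actions so that action generation is conditioned on maximizing future benefit. The reward values and `gamma` here are hypothetical.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Discounted return-to-go for every timestep of one trajectory.

    rtg[t] = rewards[t] + gamma * rewards[t+1] + gamma^2 * rewards[t+2] + ...
    """
    rtg = np.zeros(len(rewards))
    running = 0.0
    # Sweep backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Hypothetical dense per-step rewards for one manipulation trajectory
# (e.g. sub-goal progress signals); a return-prediction head would be
# supervised with these rtg values.
rewards = [0.0, 0.1, 0.0, 0.5, 1.0]
print(returns_to_go(rewards, gamma=0.9))
```

On mixed-quality data, low-quality trajectories yield low return-to-go labels, which is one way a model can learn the data quality distribution rather than imitating all demonstrations equally.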
Authors
Hongyin Zhang (Zhejiang University, Hangzhou, China)
Zifeng Zhuang (Westlake University)
Han Zhao (Westlake University, Hangzhou, China)
Pengxiang Ding (Zhejiang University)
Hongchao Lu (Westlake University, Hangzhou, China)
Donglin Wang (Westlake University, Hangzhou, China)