RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited long-horizon reasoning capability of embodied agents in complex real-world environments, this paper proposes a two-stage fine-tuning framework. First, a vision-language model (Qwen2.5-VL-3B) undergoes supervised fine-tuning on expert demonstration trajectories. Second, a physics-aware rule-based reward function guides reinforcement learning, enhancing multi-step planning consistency and grounding in physical commonsense. The approach jointly optimizes action feasibility and long-term task success, significantly improving visual-spatial reasoning. On the EmbodiedBench benchmark, the method achieves a 21.33% absolute gain over GPT-4o-mini and outperforms prior work trained on the larger Qwen2.5-VL-7B by 20.33%. These results validate the effectiveness of combining supervised learning with physics-guided reinforcement learning for embodied task planning.
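The page does not reproduce the paper's reward formula, but a physics-aware rule-based reward of the kind described above might combine a per-step feasibility term with a long-horizon consistency term. The Python sketch below is illustrative only: the `is_feasible` helper, the prefix-matching consistency term, and the 0.5/0.5 weights are assumptions, not the paper's published design.

```python
# Hypothetical rule-based reward sketch: the helper predicate, the
# prefix-matching consistency term, and the 0.5/0.5 weights are all
# illustrative assumptions, not the paper's published formula.
from typing import Callable, List

def rule_based_reward(
    plan: List[str],                     # model's predicted action sequence
    expert_plan: List[str],              # expert demonstration sequence
    is_feasible: Callable[[str], bool],  # physics/commonsense check per action
) -> float:
    # Term 1: action feasibility. Penalize physically invalid steps,
    # e.g. "pick up mug" while the gripper is already holding an object.
    feasible = sum(is_feasible(a) for a in plan) / max(len(plan), 1)

    # Term 2: long-horizon consistency. Reward the longest plan prefix
    # that matches the expert trajectory, so an early wrong step (which
    # derails everything after it) costs more than a late one.
    prefix = 0
    for predicted, reference in zip(plan, expert_plan):
        if predicted != reference:
            break
        prefix += 1
    horizon = prefix / max(len(expert_plan), 1)

    return 0.5 * feasible + 0.5 * horizon  # arbitrary illustrative weights
```

Whatever the exact terms, the design point is that the reward scores the whole multi-step plan rather than individual actions, which is what lets RL push the policy toward long-horizon consistency.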

📝 Abstract
Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Despite the success of large language models and vision-language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted commonsense and reasoning capabilities. Since aligning general-purpose vision-language models to robotic planning tasks via supervised fine-tuning alone suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training first instills foundational knowledge from expert sequences, and reinforcement learning then addresses the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action-sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that jointly considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
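To make the two-stage recipe concrete, the following minimal PyTorch sketch runs stage 1 as supervised imitation of expert sequences and stage 2 as a REINFORCE-style update driven by a sequence-level reward. Everything here is a stand-in under stated assumptions: the paper trains a vision-language model (Qwen2.5-VL-3B) with its own physics-aware reward and policy-optimization setup, not this toy token policy with an exact-match reward.

```python
# Toy sketch of the two-stage recipe (illustrative assumptions throughout):
# the tiny token policy stands in for Qwen2.5-VL-3B, random sequences stand
# in for expert trajectories, and exact-match-with-expert stands in for the
# physics-aware rule-based reward.
import torch
import torch.nn as nn

vocab, hidden = 32, 16
policy = nn.Sequential(nn.Embedding(vocab, hidden), nn.Linear(hidden, vocab))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
expert = torch.randint(0, vocab, (64, 8))  # toy expert action sequences

# Stage 1: supervised fine-tuning, next-action cross-entropy on expert data.
for _ in range(100):
    logits = policy(expert[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), expert[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: REINFORCE-style RL, weighting sampled plans by a sequence-level
# rule-based reward instead of imitating expert tokens directly.
for _ in range(100):
    logits = policy(expert[:, :-1])
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()
    reward = (sampled == expert[:, 1:]).float().mean(dim=1, keepdim=True)
    loss = -(dist.log_prob(sampled) * reward).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```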
Problem

Research questions and friction points this paper is trying to address.

Enhancing robot planning for complex human instructions
Addressing poor generalization in vision language models
Improving visual-spatial understanding through reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage fine-tuning with supervised and reinforcement learning
Rule-based reward function for physical understanding
Training on Qwen2.5-VL-3B outperforms larger models
👥 Authors

Jinrui Liu
Institute of Automation, CASIA; School of Artificial Intelligence, UCAS, China

Bingyan Nie
Institute of Automation, CASIA; School of Artificial Intelligence, UCAS, China

Boyu Li
Institute of Automation, CASIA; School of Artificial Intelligence, UCAS, China

Yaran Chen
Institute of Automation, CASIA; School of Artificial Intelligence, UCAS, China

Yuze Wang
Beihang University
3D Vision · Computer Graphics · Neural Rendering · In-the-wild Reconstruction

Shunsen He
Huawei Cloud Technology Co., Ltd., China

Haoran Li
Institute of Automation, CASIA; School of Artificial Intelligence, UCAS, China