🤖 AI Summary
To address the limited long-horizon reasoning capability of embodied agents in complex real-world environments, this paper proposes a two-stage fine-tuning framework. First, a vision-language model (Qwen2.5-VL-3B) undergoes supervised fine-tuning on expert demonstration trajectories. Second, a physics-aware rule-based reward function is introduced to guide reinforcement learning, enhancing multi-step planning consistency and grounding in physical commonsense. The approach jointly optimizes action feasibility and long-term task success, significantly improving visual-spatial reasoning. On the EmbodiedBench benchmark, the method achieves a 21.33% absolute gain over GPT-4o-mini and outperforms prior work built on the larger Qwen2.5-VL-7B by 20.33%. These results validate the effectiveness of synergistically combining supervised learning with physics-guided reinforcement learning for embodied task planning.
📝 Abstract
Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Despite the success of large language models and vision-language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Because aligning general-purpose vision-language models to robotic planning tasks via supervised fine-tuning alone suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training first instills foundational knowledge from expert trajectories, followed by reinforcement learning (RL) to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action-sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
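The abstract describes a rule-based reward that jointly scores physical feasibility of individual actions and long-horizon consistency of the whole plan. As a rough illustration of that idea (not the paper's actual formulation), the sketch below scores a predicted action sequence by two hypothetical rules: the fraction of steps passing an environment feasibility check, and the normalized longest matching prefix against an expert plan as a proxy for multi-step progress. All function names, weights, and the prefix-matching rule are illustrative assumptions.

```python
def rule_based_reward(predicted, expert, feasible, w_feas=0.5, w_progress=0.5):
    """Illustrative rule-based reward for an action plan.

    predicted: list of predicted action strings
    expert:    list of ground-truth (expert) actions
    feasible:  callable returning True if an action satisfies the
               environment's physical/action constraints
    Weights are hypothetical; the paper's reward design differs.
    """
    if not predicted:
        return 0.0
    # Rule 1: fraction of predicted actions that pass the feasibility check,
    # penalizing physically implausible steps.
    feas = sum(1.0 for a in predicted if feasible(a)) / len(predicted)
    # Rule 2: longest common prefix with the expert trajectory, normalized,
    # rewarding consistent multi-step progress rather than isolated matches.
    match = 0
    for p, e in zip(predicted, expert):
        if p != e:
            break
        match += 1
    progress = match / max(len(expert), 1)
    return w_feas * feas + w_progress * progress


# Hypothetical usage: "drop" fails feasibility and breaks the prefix at step 3.
ok = lambda a: a in {"pick", "move", "place"}
r = rule_based_reward(["pick", "move", "drop"],
                      ["pick", "move", "place"], ok)
```

A reward of this shape lets RL trade off per-step physical validity against long-horizon agreement with expert behavior, which is the balance the abstract attributes to the proposed reward design.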