🤖 AI Summary
To address the limited long-horizon reasoning capability of embodied agents in complex real-world environments, this paper proposes a two-stage fine-tuning framework. First, a vision-language model (Qwen2.5-VL-3B) undergoes supervised fine-tuning on expert demonstration trajectories. Second, a physics-aware rule-based reward function is introduced to guide reinforcement learning, enhancing multi-step planning consistency and grounding in physical commonsense. The approach jointly optimizes action feasibility and long-term task success, significantly improving visual-spatial reasoning. On the EmbodiedBench benchmark, the method achieves a 21.33% absolute gain over GPT-4o-mini and outperforms prior work built on the larger Qwen2.5-VL-7B by 20.33%. These results validate the effectiveness of synergistically combining supervised learning with physics-guided reinforcement learning for embodied task planning.
📝 Abstract
Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Despite the success of large language models and vision-language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Because aligning general-purpose vision-language models to robotic planning tasks via supervised fine-tuning alone suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training first instills foundational knowledge from expert trajectories, followed by reinforcement learning (RL) to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action-sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
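The abstract describes a rule-based reward that jointly scores physical feasibility of individual actions and long-horizon consistency of the whole plan. As a rough illustration of that idea (not the paper's actual formulation), the sketch below scores a predicted action sequence by two hypothetical rules: the fraction of steps passing an environment feasibility check, and the normalized longest matching prefix against an expert plan as a proxy for multi-step progress. All function names, weights, and the prefix-matching rule are illustrative assumptions.

```python
def rule_based_reward(predicted, expert, feasible, w_feas=0.5, w_progress=0.5):
    """Illustrative rule-based reward for an action plan.

    predicted: list of predicted action strings
    expert:    list of ground-truth (expert) actions
    feasible:  callable returning True if an action satisfies the
               environment's physical/action constraints
    Weights are hypothetical; the paper's reward design differs.
    """
    if not predicted:
        return 0.0
    # Rule 1: fraction of predicted actions that pass the feasibility check,
    # penalizing physically implausible steps.
    feas = sum(1.0 for a in predicted if feasible(a)) / len(predicted)
    # Rule 2: longest common prefix with the expert trajectory, normalized,
    # rewarding consistent multi-step progress rather than isolated matches.
    match = 0
    for p, e in zip(predicted, expert):
        if p != e:
            break
        match += 1
    progress = match / max(len(expert), 1)
    return w_feas * feas + w_progress * progress


# Hypothetical usage: "drop" fails feasibility and breaks the prefix at step 3.
ok = lambda a: a in {"pick", "move", "place"}
r = rule_based_reward(["pick", "move", "drop"],
                      ["pick", "move", "place"], ok)
```

A reward of this shape lets RL trade off per-step physical validity against long-horizon agreement with expert behavior, which is the balance the abstract attributes to the proposed reward design.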