🤖 AI Summary
This work addresses the challenge of enabling embodied robots to accurately resolve spatial references and perform dynamic interactions in 3D physical environments. Methodologically, it introduces RoboRefer, a 3D perception-aware vision-language model supporting multi-step spatial reasoning. Specifically: (1) a dedicated, disentangled depth encoder explicitly separates geometric from semantic features; (2) RefSpatial, a large-scale training dataset covering 31 spatial relations and supporting up to five-step reasoning, is constructed along with its evaluation suite, RefSpatial-Bench; (3) metric-sensitive spatial process reward functions are proposed and integrated into a pipeline of supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Experiments demonstrate that the RFT variant outperforms Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench; the SFT variant achieves an average spatial-understanding success rate of 89.6%; and the model drives real-world robotic platforms, including a UR5 arm and a G1 humanoid, through long-horizon, dynamic manipulation tasks in cluttered physical scenes.
📝 Abstract
Spatial referring is a fundamental capability for embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still cannot accurately understand complex 3D scenes or dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
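To make the "metric-sensitive process reward" concrete, here is a minimal sketch of how such a reward could be shaped for spatial referring: each intermediate reasoning step predicts a point, and the reward decays continuously with the Euclidean distance to the step's ground-truth location, so the signal is sensitive to metric error rather than binary hit/miss. This is an illustrative assumption, not the paper's actual implementation; the function names (`point_reward`, `process_reward`) and the tolerance parameter `tol` are hypothetical.

```python
import math

def point_reward(pred, gt, tol=0.05):
    """Continuous reward in (0, 1] that decays exponentially with the
    Euclidean distance between a predicted point and its ground truth.
    `tol` (hypothetical) controls how quickly the reward falls off."""
    dist = math.dist(pred, gt)
    return math.exp(-dist / tol)

def process_reward(pred_steps, gt_steps, tol=0.05):
    """Average the metric-sensitive reward over all intermediate steps,
    so every step of a multi-step reasoning chain is graded, not just
    the final referred location."""
    rewards = [point_reward(p, g, tol) for p, g in zip(pred_steps, gt_steps)]
    return sum(rewards) / len(rewards)
```

In an RFT loop, a reward of this shape would give partial credit to near-miss predictions, which a binary correctness check cannot do.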