🤖 AI Summary
This work addresses the challenges of multi-step metric-grounded reasoning and dynamic spatial measurement in embodied robotic interaction. Methodologically, we propose the first vision-language model (VLM) framework supporting 3D spatial tracing, featuring a joint 3D spatial encoder–regression decoder architecture, metric-sensitive process-reward reinforcement fine-tuning (RFT), and the large-scale TraceSpatial dataset (30M QA pairs) with its companion benchmark, TraceSpatial-Bench. Our contributions are threefold: (1) the first unified multimodal VLM enabling precise 3D spatial referring, geometric measurement, and long-horizon reasoning; (2) an interpretable optimization mechanism that supervises intermediate perceptual cues; and (3) state-of-the-art performance, with a 79.1% average success rate on spatial understanding, measuring, and referring in cluttered real-world scenes, and TraceSpatial-Bench accuracy surpassing Gemini-2.5-Pro by 36%. The framework has been successfully deployed on UR5 robotic arms and G1 humanoid robots.
📝 Abstract
Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step metric-grounded reasoning that compounds complex spatial referring with real-world metric measurement. Existing methods struggle with this compositional task. To this end, we propose RoboTracer, the first 3D-aware VLM to achieve both 3D spatial referring and measuring, via a universal spatial encoder and a regression-supervised decoder that enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to generate spatial traces accurately. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and covering complex reasoning processes of up to 9 steps. We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer integrates with various control policies to execute long-horizon, dynamic tasks on diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
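To make the idea of a metric-sensitive process reward concrete, here is a minimal sketch of what such a reward could look like: each intermediate perceptual cue in the reasoning trace (e.g., a predicted distance in meters) is scored against ground truth, and the per-step scores are averaged. The function name `metric_process_reward`, the tolerance `tol`, and the exponential shaping are all hypothetical illustrations for this sketch, not the paper's actual reward design.

```python
import math

def metric_process_reward(pred_steps, gt_steps, tol=0.05):
    """Score a chain of intermediate metric predictions (hypothetical sketch).

    pred_steps / gt_steps: per-step metric values, e.g. distances in meters.
    tol: error scale in the same units; reward decays to ~0.37 at err == tol.
    """
    rewards = []
    for pred, gt in zip(pred_steps, gt_steps):
        err = abs(pred - gt)
        # Exponential shaping: 1.0 at zero error, smoothly decaying with
        # metric error, so the reward stays sensitive to small deviations.
        rewards.append(math.exp(-err / tol))
    # Average over steps so longer reasoning traces are not penalized.
    return sum(rewards) / len(rewards) if rewards else 0.0
```

A process reward of this shape supervises every intermediate step rather than only the final answer, which is what allows the RFT stage to keep the intermediate perceptual cues interpretable.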