🤖 AI Summary
This work addresses the challenges of multi-step metric-grounded reasoning and dynamic spatial measurement in embodied robotic interaction. Methodologically, we propose the first vision-language model (VLM) framework supporting 3D spatial tracing, featuring a joint 3D spatial encoder–regression decoder architecture, metric-sensitive process-reward reinforcement fine-tuning (RFT), and the large-scale TraceSpatial dataset (30M QA pairs) with its companion benchmark, TraceSpatial-Bench. Our contributions are threefold: (1) the first unified multimodal VLM enabling precise 3D spatial referring, geometric measurement, and long-horizon reasoning; (2) an interpretable optimization mechanism that supervises intermediate perceptual cues; and (3) state-of-the-art performance, with a 79.1% average success rate on spatial understanding, measuring, and referring in cluttered real-world scenes, and TraceSpatial-Bench accuracy surpassing Gemini-2.5-Pro by 36%. The framework has been successfully deployed on UR5 robotic arms and G1 humanoid robots.
📝 Abstract
Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step metric-grounded reasoning that compounds complex spatial referring with real-world metric measurement. Existing methods struggle with this compositional task. To this end, we propose RoboTracer, the first 3D-aware VLM to achieve both 3D spatial referring and measuring, via a universal spatial encoder and a regression-supervised decoder that enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to generate spatial traces accurately. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and covering complex reasoning processes of up to 9 steps. We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer integrates with various control policies to execute long-horizon, dynamic tasks on diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
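To make the idea of a metric-sensitive process reward concrete, here is a minimal sketch of what such a reward could look like: each intermediate perceptual cue in the reasoning trace (e.g., a predicted distance in meters) is scored against ground truth, and the per-step scores are averaged. The function name `metric_process_reward`, the tolerance `tol`, and the exponential shaping are all hypothetical illustrations for this sketch, not the paper's actual reward design.

```python
import math

def metric_process_reward(pred_steps, gt_steps, tol=0.05):
    """Score a chain of intermediate metric predictions (hypothetical sketch).

    pred_steps / gt_steps: per-step metric values, e.g. distances in meters.
    tol: error scale in the same units; reward decays to ~0.37 at err == tol.
    """
    rewards = []
    for pred, gt in zip(pred_steps, gt_steps):
        err = abs(pred - gt)
        # Exponential shaping: 1.0 at zero error, smoothly decaying with
        # metric error, so the reward stays sensitive to small deviations.
        rewards.append(math.exp(-err / tol))
    # Average over steps so longer reasoning traces are not penalized.
    return sum(rewards) / len(rewards) if rewards else 0.0
```

A process reward of this shape supervises every intermediate step rather than only the final answer, which is what allows the RFT stage to keep the intermediate perceptual cues interpretable.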